Visual and Inertial Odometry and SLAM

Metric 6 degree of freedom state estimation is possible with a single camera and an inertial measurement unit. Using only a monocular vision sensor, the trajectory of the camera can be recovered, up to a scale factor, using visual odometry. We investigate algorithms for visual and visual-inertial odometry, as well as methods to improve the performance of existing VO and VIO pipelines.


Sight Guide: Vision Assistance for Blind People - Cybathlon 2024

Our team, Sight Guide, competed in the Vision Assistance Race at the CYBATHLON 2024 - the "cyber Olympics" designed to push the boundaries of assistive technology. In this race, our system guided a blind participant through everyday tasks such as walking along a sidewalk, sorting colors, ordering from a touchscreen, purchasing a box of tea, and navigating a forest—all powered by computer vision assistance! We used two RGB cameras and a depth camera for localization and 3D mapping and semantic understanding, and a belt for haptic feedback. Our system received the Jury Award for the most innovative and user-friendly solution. Check out our project page. This project is the result of the joint collaboration among the University of Zurich (UZH), ETH Zurich (ETH), and the Zurich University of Applied Sciences (ZHAW).

End-to-End Learned Event- and Image-based Visual Odometry

Visual Odometry (VO) is crucial for autonomous robotic navigation, especially in GPS-denied environments like planetary terrains. To improve robustness, recent model-based VO systems have begun combining standard and event-based cameras. While event cameras excel in low-light and high-speed motion, standard cameras provide dense and easier-to-track features. However, the field of image- and event-based VO still predominantly relies on model-based methods and is yet to fully integrate recent image-only advancements leveraging end-to-end learning-based architectures. Seamlessly integrating the two modalities remains challenging due to their different nature, one asynchronous, the other not, limiting the potential for a more effective image- and event-based VO. We introduce RAMP-VO, the first end-to-end learned image- and event-based VO system. It leverages novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders capable of fusing asynchronous events with image data, providing 8x faster inference and 33% more accurate predictions than existing solutions. Despite being trained only in simulation, RAMP-VO outperforms previous methods on the newly introduced Apollo and Malapert datasets, and on existing benchmarks, where it improves image- and event-based methods by 58.8% and 30.6%, paving the way for robust and asynchronous VO in space.


References

IROS24_Pellerito

Roberto Pellerito, Marco Cannici, Daniel Gehrig, Joris Belhadj, Olivier Dubois-Matra, Massimo Casasco, Davide Scaramuzza

Deep Visual Odometry with Events and Frames

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.

PDF Code and Data Video


Structure-Invariant Range-Visual-Inertial Odometry

The Mars Science Helicopter (MSH) mission aims to deploy the next generation of unmanned helicopters on Mars, targeting landing sites in highly irregular terrain such as Valles Marineris, the largest canyons in the Solar system with elevation variances of up to 8000 meters. Unlike its predecessor, the Mars 2020 mission, which relied on a state estimation system assuming planar terrain, MSH requires a novel approach due to the complex topography of the landing site. This paper introduces a novel range-visual-inertial odometry system tailored for the unique challenges of the MSH mission. Our system extends the state-of-the-art xVIO framework by fusing consistent range information with visual and inertial measurements, preventing metric scale drift in the absence of visual-inertial excitation (mono camera and constant velocity descent), and enabling landing on any terrain structure, without requiring any planar terrain assumption. Through extensive testing in image-based simulations using actual terrain structure and textures collected in Mars orbit, we demonstrate that our range-VIO approach estimates terrain-relative velocity meeting the stringent mission requirements, and outperforming existing methods.


References

IROS24_Alberico

Ivan Alberico, Jeff Delaune, Giovanni Cioffi, Davide Scaramuzza

Structure-Invariant Range-Visual-Inertial Odometry

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.

PDF Video


Reinforcement Learning Meets Visual Odometry

Visual Odometry (VO) is essential to downstream mobile robotics and augmented/virtual reality tasks. Despite recent advances, existing VO methods still rely on heuristic design choices that require several weeks of hyperparameter tuning by human experts, hindering generalizability and robustness. We address these challenges by reframing VO as a sequential decision-making task and applying Reinforcement Learning (RL) to adapt the VO process dynamically. Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. Our method minimizes reliance on heuristic choices using a reward function based on pose error, runtime, and other metrics to guide the system. Our RL framework treats the VO system and the image sequence as an environment, with the agent receiving observations from keypoints, map statistics, and prior poses. Experimental results using classical VO methods and public benchmarks demonstrate improvements in accuracy and robustness, validating the generalizability of our RL-enhanced VO approach to different scenarios. We believe this paradigm shift advances VO technology by eliminating the need for time-intensive parameter tuning of heuristics.


References

ECCV24_Messikommer

Nico Messikommer*, Giovanni Cioffi*, Mathias Gehrig, Davide Scaramuzza

Reinforcement Learning Meets Visual Odometry

European Conference on Computer Vision (ECCV), 2024.

PDF Video Code


HDVIO: Improving Localization and Disturbance Estimation with Hybrid Dynamics VIO

Visual-inertial odometry (VIO) is the most common approach for estimating the state of autonomous micro aerial vehicles using only onboard sensors. Existing methods improve VIO performance by including a dynamics model in the estimation pipeline. However, such methods degrade in the presence of low-fidelity vehicle models and continuous external disturbances, such as wind. Our proposed method, HDVIO, overcomes these limitations by using a hybrid dynamics model that combines a point-mass vehicle model with a learning-based component that captures complex aerodynamic effects. HDVIO estimates the external force and the full robot state by leveraging the discrepancy between the actual motion and the predicted motion of the hybrid dynamics model. Our hybrid dynamics model uses a history of thrust and IMU measurements to predict the vehicle dynamics. To demonstrate the performance of our method, we present results on both public and novel drone dynamics datasets and show real-world experiments of a quadrotor flying in strong winds up to 25 km/h. The results show that our approach improves the motion and external force estimation compared to the state-of-the-art by up to 33% and 40%, respectively. Furthermore, differently from existing methods, we show that it is possible to predict the vehicle dynamics accurately while having no explicit knowledge of its full state.


References

HDVIO

G. Cioffi*, L. Bauersfeld*, D. Scaramuzza

HDVIO: Improving Localization and Disturbance Estimation with Hybrid Dynamics VIO

Robotics: Science and Systems (RSS), 2023.

PDF YouTube


E-NeRF: Neural Radiance Fields from a Moving Event Camera

E-NeRF: Neural Radiance Fields from a Moving Event Camera

Estimating neural radiance fields (NeRFs) from "ideal" images has been extensively studied in the computer vision community. Most approaches assume optimal illumination and slow camera motion. These assumptions are often violated in robotic applications, where images may contain motion blur, and the scene may not have suitable illumination. This can cause significant problems for downstream tasks such as navigation, inspection, or visualization of the scene. To alleviate these problems, we present E-NeRF, the first method which estimates a volumetric scene representation in the form of a NeRF from a fast-moving event camera. Our method can recover NeRFs during very fast motion and in high-dynamic-range conditions where frame-based approaches fail. We show that rendering high-quality frames is possible by only providing an event stream as input. Furthermore, by combining events and frames, we can estimate NeRFs of higher quality than state-of-the-art approaches under severe motion blur. We also show that combining events and frames can overcome failure cases of NeRF estimation in scenarios where only a few input views are available without requiring additional regularization.


References

E-NeRF: Neural Radiance Fields from a Moving Event Camera

S. Klenk, L. Koestler, D. Scaramuzza, D. Cremers

E-NeRF: Neural Radiance Fields from a Moving Event Camera

IEEE Robotics and Automation Letters (RA-L), 2023.

PDF Code


SLAM for Visually Impaired People: A Survey

SLAM for Visually Impaired People: A Survey

In recent decades, several assistive technologies for visually impaired and blind (VIB) people have been developed to improve their ability to navigate independently and safely. At the same time, simultaneous localization and mapping (SLAM) techniques have become sufficiently robust and efficient to be adopted in the development of assistive technologies. In this paper, we first report the results of an anonymous survey conducted with VIB people to understand their experience and needs; we focus on digital assistive technologies that help them with indoor and outdoor navigation. Then, we present a literature review of assistive technologies based on SLAM. We discuss proposed approaches and indicate their pros and cons. We conclude by presenting future opportunities and challenges in this domain.


References

SLAM for Visually Impaired People: A Survey

Marziyeh Bamdad, Davide Scaramuzza, Alireza Darvishy

SLAM for Visually Impaired People: A Survey

arXiv, 2022

PDF


Data-Efficient Collaborative Decentralized Thermal-Inertial Odometry

Data-Efficient Collaborative Decentralized Thermal-Inertial Odometry

We propose a system solution to achieve data-efficient, decentralized state estimation for a team of flying robots using thermal images and inertial measurements. Each robot can fly independently, and exchange data when possible to refine its state estimate. Our system front-end applies an online photometric calibration to refine the thermal images so as to enhance feature tracking and place recognition. Our system back-end uses a covariance intersection fusion strategy to neglect the cross-correlation between agents so as to lower memory usage and computational cost. The communication pipeline uses Vector of Locally Aggregated Descriptors (VLAD) to construct a request-response policy that requires low bandwidth usage. We test our collaborative method on both synthetic and real-world data. Our results show that the proposed method improves by up to 46% trajectory estimation with respect to an individual-agent approach, while reducing up to 89% the communication exchange. Datasets and code are released to the public, extending the already-public JPL xVIO library.


References

Data-Efficient Collaborative Decentralized Thermal-Inertial Odometry

V. Polizzi, R. Hewitt, J. Hidalgo-Carrió, J. Delaune and D. Scaramuzza

Data-Efficient Collaborative Decentralized Thermal-Inertial Odometry

IEEE Robotics and Automation Letters (RA-L), 2022

PDF Poster Video Code


Exploring Event Camera-based Odometry for Planetary Robots

Exploring Event Camera-based Odometry for Planetary Robots

Due to their resilience to motion blur and high robustness in low-light and high dynamic range conditions, event cameras are poised to become enabling sensors for vision-based exploration on future Mars helicopter missions. However, existing event-based visual-inertial odometry (VIO) algorithms either suffer from high tracking errors or are brittle, since they cannot cope with significant depth uncertainties caused by an unforeseen loss of tracking or other effects. In this work, we introduce EKLT-VIO, which addresses both limitations by combining a state-of-the-art event-based frontend with a filter-based backend. This makes it both accurate and robust to uncertainties, outperforming event- and frame-based VIO algorithms on challenging benchmarks by 32%. In addition, we demonstrate accurate performance in hover-like conditions (outperforming existing event-based methods) as well as high robustness in newly collected Mars-like and high-dynamic-range sequences, where existing frame-based methods fail. In doing so, we show that event-based VIO is the way forward for vision-based exploration on Mars.


References

Exploring Event Camera-based Odometry for Planetary Robots

F. Mahlknecht, D. Gehrig, J. Nash, F. M. Rockenbauer, B. Morrell, J. Delaune and D. Scaramuzza

Exploring Event Camera-based Odometry for Planetary Robots

Robotics and Automation Letters (RAL), 2022

PDF Code & Datasets Video


Hilti-Oxford Dataset: A Millimetre-Accurate Benchmark for Simultaneous Localization and Mapping

Hilti-Oxford Dataset: A Millimetre-Accurate Benchmark for Simultaneous Localization and Mapping

Simultaneous Localization and Mapping (SLAM) is being deployed in real-world applications, however many state- of-the-art solutions still struggle in many common scenarios. A key necessity in progressing SLAM research is the availability of high-quality datasets and fair and transparent benchmarking. To this end, we have created the Hilti-Oxford Dataset, to push state- of-the-art SLAM systems to their limits. The dataset has a variety of challenges ranging from sparse and regular construction sites to a 17th century neoclassical building with fine details and curved surfaces. To encourage multi-modal SLAM approaches, we designed a data collection platform featuring a lidar, five cameras, and an IMU (Inertial Measurement Unit). With the goal of benchmarking SLAM algorithms for tasks where accuracy and robustness are paramount, we implemented a novel ground truth collection method that enables our dataset to accurately measure SLAM pose errors with millimeter accuracy. To further ensure accuracy, the extrinsics of our platform were verified with a micrometer-accurate scanner, and temporal calibration was managed online using hardware time synchronization. The multi-modality and diversity of our dataset attracted a large field of academic and industrial researchers to enter the second edition of the Hilti SLAM challenge, which concluded in June 2022. The results of the challenge show that while the top three teams could achieve accuracy of 2 cm or better for some sequences, the performance dropped off in more difficult sequences.


References

Hilti-Oxford Dataset: A Millimeter-Accurate Benchmark for Simultaneous Localization and Mapping

L. Zhang, M. Helmberger, L. Fu, D. Wisth, M. Camurri, D. Scaramuzza, M. Fallon

Hilti-Oxford Dataset: A Millimeter-Accurate Benchmark for Simultaneous Localization and Mapping

IEEE Robotics and Automation Letters (RA-L), 2023.

PDF YouTube Dataset


The Hilti SLAM Challenge Dataset


In this work, we propose a new dataset, the Hilti SLAM Challenge Dataset. The sensor platform used to collect this dataset contains a number of visual, lidar and inertial sensors which have all been rigorously calibrated. All data is temporally aligned to support precise multi-sensor fusion. Each dataset includes accurate ground truth to allow direct testing of SLAM results. Raw data as well as intrinsic and extrinsic sensor calibration data from twelve datasets in various environments is provided. Each environment represents common scenarios found in building construction sites in various stages of completion.


References

Arxiv21_HILTI

M. Helmberger, K. Morin, B. Berner, N. Kumar, G. Cioffi, D. Scaramuzza

The Hilti SLAM Challenge Dataset

Robotics and Automation Letters (RAL), 2022

PDF Dataset Video Talk


Ultimate SLAM? Combining Events, Images, and IMU for Robust Visual SLAM in HDR and High Speed Scenarios

In this paper, we present the first state estimation pipeline that leverages the complementary advantages of a standard camera with an event camera by fusing in a tightly-coupled manner events, standard frames, and inertial measurements. We show on the Event Camera Dataset that our hybrid pipeline leads to an accuracy improvement of 130% over event-only pipelines, and 85% over standard-frames only visual-inertial systems, while still being computationally tractable.

Furthermore, we use our pipeline to demonstrate - to the best of our knowledge - the first autonomous quadrotor flight using an event camera for state estimation, unlocking flight scenarios that were not reachable with traditional visual inertial odometry, such as low-light environments and high dynamic range scenes.


References

RAL18_VidalRebecq

A. Rosinol Vidal, H.Rebecq, T. Horstschaefer, D. Scaramuzza

Ultimate SLAM? Combining Events, Images, and IMU for Robust Visual SLAM in HDR and High Speed Scenarios

IEEE Robotics and Automation Letters (RA-L), 2018.

PDF YouTube ICRA18 Video Pitch Poster Results (raw trajectories) Project Webpage Source Code


Event-aided Direct Sparse Odometry

Event-aided Direct Sparse Odometry

We introduce EDS, a direct monocular visual odometry using events and frames. Our algorithm leverages the event generation model to track the camera motion in the blind time between frames. The method formulates a direct probabilistic approach of observed brightness increments. Per-pixel brightness increments are predicted using a sparse number of selected 3D points and are compared to the events via the brightness increment error to estimate camera motion. The method recovers a semi-dense 3D map using photometric bundle adjustment. EDS is the first method to perform 6-DOF VO using events and frames with a direct approach. By design it overcomes the problem of changing appearance in indirect methods. We also show that, for a target error performance, EDS can work at lower frame rates than state-of-the-art frame-based VO solutions. This opens the door to low-power motion-tracking applications where frames are sparingly triggered "on demand'' and our method tracks the motion in between. We release code and datasets to the public.


References

EDS

J. Hidalgo-Carrió, G.Gallego, D. Scaramuzza

Event-aided Direct Sparse Odometry

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Oral Presentation.

PDF YouTube Code Poster Dataset CVPR Video


Continuous-Time vs. Discrete-Time Vision-based SLAM: A Comparative Study

Robotic practitioners generally approach the vision-based SLAM problem through discrete-time formulations. This has the advantage of a consolidated theory and very good understanding of success and failure cases. However, discrete-time SLAM needs tailored algorithms and simplifying assumptions when high-rate and/or asynchronous measurements, coming from different sensors, are present in the estimation process. Conversely, continuous-time SLAM, often overlooked by practitioners, does not suffer from these limitations. Indeed, it allows integrating new sensor data asynchronously without adding a new optimization variable for each new measurement. In this way, the integration of asynchronous or continuous high-rate streams of sensor data does not require tailored and highly-engineered algorithms, enabling the fusion of multiple sensor modalities in an intuitive fashion. On the down side, continuous time introduces a prior that could worsen the trajectory estimates in some unfavorable situations. In this work, we aim at systematically comparing the advantages and limitations of the two formulations in vision-based SLAM. To do so, we perform an extensive experimental analysis, varying robot type, speed of motion, and sensor modalities. Our experimental analysis suggests that, independently of the trajectory type, continuous-time SLAM is superior to its discrete counterpart whenever the sensors are not time-synchronized. In the context of this work, we developed, and open source, a modular and efficient software architecture containing state-of-the-art algorithms to solve the SLAM problem in discrete and continuous time.


References

RAL2021_Cioffi

G. Cioffi, T. Cieslewski, D. Scaramuzza

Continuous-Time vs. Discrete-Time Vision-based SLAM: A Comparative Study

Robotics and Automation Letters (RAL), 2022

PDF Code YouTube


Augmenting Visual Place Recognition with Structural Cues


In this work, we propose to augment image-based place recognition with structural cues. Specifically, these structural cues are obtained using structure-from-motion, such that no additional sensors are needed for place recognition. This is achieved by augmenting the 2D convolutional neural network (CNN) typically used for image-based place recognition with a 3D CNN that takes as input a voxel grid derived from the structure-from-motion point cloud. We evaluate different methods for fusing the 2D and 3D features and obtain best performance with global average pooling and simple concatenation. The resulting descriptor exhibits superior recognition performance compared to descriptors extracted from only one of the input modalities, including state-of-the-art image-based descriptors. Especially at low descriptor dimensionalities, we outperform state-of-the-art descriptors by up to 90%.


References

Augmenting Visual Place Recognition with Structural Cues

A. Oertel, T. Cieslewski, D. Scaramuzza

Augmenting Visual Place Recognition with Structural Cues

IEEE Robotics and Automation Letters (RA-L), 2020.

PDF YouTube


Fisher Information Field: an Efficient and Differentiable Map for Perception-aware Planning

Considering visual localization accuracy at the planning time gives preference to robot motion that can be better localized and is of benefit to vision-based navigation. To integrate the knowledge about localization accuracy in planning, a common approach is to compute the Fisher information of the pose estimation process from a set of sparse landmarks. However, this approach scales linearly with the number of landmarks and introduces redundant computation. To overcome these drawbacks, we propose the first dedicated map for evaluating the Fisher information of 6 degree-of-freedom visual localization for perception-aware planning. We separate and precompute the rotational invariant component from the Fisher information and store it in a voxel grid, namely the Fisher information field. The Fisher information for arbitrary poses can then be computed from the field in constant time. Experimental results show that the proposed Fisher information field can be applied to different planning algorithms and is at least 10 times faster than using the point cloud. Moreover, the proposed map is differentiable, resulting in better performance in trajectory optimization algorithms.

References

arXiv20_Zhang_FIF

Z. Zhang, D. Scaramuzza

Fisher Information Field: an Efficient and Differentiable Map for Perception-aware Planning

arXiv preprint, 2020.

PDF Video Code


ICRA19_Zhang

Z. Zhang, D. Scaramuzza

Beyond Point Clouds: Fisher Information Field for Active Visual Localization

IEEE International Conference on Robotics and Automation, 2019.

PDF Video Code


Tightly-coupled Fusion of Global Positional Measurements in Optimization-based Visual-Inertial Odometry


Motivated by the goal of achieving robust, drift-free pose estimation in long-term autonomous navigation, in this work we propose a methodology to fuse global positional information with visual and inertial measurements in a tightly-coupled nonlinear-optimization based estimator. Differently from previous works, which are loosely-coupled, the use of a tightly-coupled approach allows exploiting the correlations amongst all the measurements. A sliding window of the most recent system states is estimated by minimizing a cost function that includes visual re-projection errors, relative inertial errors, and global positional residuals. We use IMU preintegration to formulate the inertial residuals and leverage the outcome of such algorithm to efficiently compute the global position residuals. The experimental results show that the proposed method achieves accurate and globally consistent estimates, with negligible increase of the optimization computational cost. Our method consistently outperforms the loosely-coupled fusion approach. The mean position error is reduced up to 50% with respect to the loosely-coupled approach in outdoor Unmanned Aerial Vehicle (UAV) flights, where the global position information is given by noisy GPS measurements. To the best of our knowledge, this is the first work where global positional measurements are tightly fused in an optimization-based visual-inertial odometry algorithm, leveraging the IMU preintegration method to define the global positional factors.


References

IROS20_Cioffi

Giovanni Cioffi, Davide Scaramuzza

Tightly-coupled Fusion of Global Positional Measurements in Optimization-based Visual-Inertial Odometry

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, 2020.

PDF YouTube Code


Reference Pose Generation for Visual Localization via Learned Features and View Synthesis


Visual Localization is one of the key enabling technologies for autonomous driving and augmented reality. High quality datasets with accurate 6 Degree-of-Freedom (DoF) reference poses are the foundation for benchmarking and improving existing methods. Traditionally, reference poses have been obtained via Structure-from-Motion (SfM). However, SfM itself relies on local features which are prone to fail when images were taken under different conditions, e.g., day/night changes. At the same time, manually annotating feature correspondences is not scalable and potentially inaccurate. In this work, we propose a semi-automated approach to generate reference poses based on feature matching between renderings of a 3D model and real images via learned features. Given an initial pose estimate, our approach iteratively refines the pose based on feature matches against a rendering of the model from the current pose estimate. We significantly improve the nighttime reference poses of the popular Aachen Day-Night dataset, showing that state-of-the-art visual localization methods perform better (up to 47%) than predicted by the original reference poses. We extend the dataset with new nighttime test images, provide uncertainty estimates for our new reference poses, and introduce a new evaluation criterion. We will make our reference poses and our framework publicly available upon publication.


References

IJCV20_Zhang

Zichao Zhang, Torsten Sattler, Davide Scaramuzza

Reference Pose Generation for Long-term Visual Localization via Learned Features
and View Synthesis

International Journal of Computer Vision (IJCV), 2020.

PDF Online Visual Localization Benchmark


GPU-Accelerated Frontend for High-Speed VIO


The recent introduction of powerful embedded graphics processing units (GPUs) has allowed for unforeseen improvements in real-time computer vision applications. It has enabled algorithms to run onboard, well above the standard video rates, yielding not only higher information processing capability, but also reduced latency. This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms in the field of visual-inertial odometry (VIO). While most steps of a VIO pipeline work on visual features, they rely on image data for detection and tracking, of which both steps are well suited for parallelization. Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency. Our work first revisits the problem of non-maxima suppression for feature detection specifically on GPUs, and proposes a solution that selects local response maxima, imposes spatial feature distribution, and extracts features simultaneously. Our second contribution introduces an enhanced FAST feature detector that applies the aforementioned non-maxima suppression method. Finally, we compare our method to other state-of-the-art CPU and GPU implementations, where we always outperform all of them in feature tracking and detection, resulting in over 1000fps throughput on an embedded Jetson TX2 platform. Additionally, we demonstrate our work integrated in a VIO pipeline achieving a metric state estimation at ~200fps.


References

Arxiv20_Nagy

Balazs Nagy, Philipp Foehn, D. Scaramuzza

Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, 2020.

PDF Code YouTube


Voxel Map for Visual SLAM


In modern visual SLAM systems, it is a standard practice to retrieve potential candidate map points from overlapping keyframes for further feature matching or direct tracking. In this work, we argue that keyframes are not the optimal choice for this task, due to several inherent limitations, such as weak geometric reasoning and poor scalability. We propose a voxel-map representation to efficiently retrieve map points for visual SLAM. In particular, we organize the map points in a regular voxel grid. Visible points from a camera pose are queried by sampling the camera frustum in a raycasting manner, which can be done in constant time using an efficient voxel hashing method. Compared with keyframes, the retrieved points using our method are geometrically guaranteed to fall in the camera field-of-view, and occluded points can be identified and removed to a certain extend. This method also naturally scales up to large scenes and complicated multi-camera configurations. Experimental results show that our voxel map representation is as efficient as a keyframe map with 5 keyframes and provides significantly higher localization accuracy (average 46% improvement in RMSE) on the EuRoC dataset. The proposed voxel-map representation is a general approach to a fundamental functionality in visual SLAM and widely applicable

References

ICRA20_Muglikar

M. Muglikar, Z. Zhang, D. Scaramuzza

Voxel Map for Visual SLAM

IEEE International Conference on Robotics and Automation, 2020.

PDF ICRA2020 Pitch Video


Redesigning SLAM for Arbitrary Multi-Camera Systems

Adding more cameras to SLAM systems improves robustness and accuracy but complicates the design of the visual front-end significantly. Thus, most systems in the literature are tailored for specific camera configurations. In this work, we aim at an adaptive SLAM system that works for arbitrary multi-camera setups. To this end, we revisit several common building blocks in visual SLAM. In particular, we propose an adaptive initialization scheme, a sensor-agnostic, information-theoretic keyframe selection algorithm, and a scalable voxel-based map. These techniques make little assumption about the actual camera setups and prefer theoretically grounded methods over heuristics. We adapt a state-of-the-art visual-inertial odometry with these modifications, and experimental results show that the modified pipeline can adapt to a wide range of camera setups (e.g., 2 to 6 cameras in one experiment) without the need of sensor-specific modifications or tuning.

References

ICRA20_Kuo

J. Kuo, M. Muglikar, Z. Zhang, D. Scaramuzza

Redesigning SLAM for Arbitrary Multi-Camera Systems

IEEE International Conference on Robotics and Automation, 2020.

PDF Video ICRA2020 Pitch Video


Smart Interest Points


Detecting interest points is a key component of vision-based estimation algorithms, such as visual odometry or visual SLAM. In the context of distributed visual SLAM, we have encountered the need to minimize the amount of data that is sent between robots, which, for relative pose estimation, translates into the need to find a minimum set of interest points that is sufficiently reliably detected between viewpoints to ensure relative pose estimation. We have decided to solve this problem at a fundamental level, that is, at the point detector, using machine learning.

In SIPS, we introduce the succinctness metric, which allows to quantify performance of interest point detectors with respect to this goal. At the same time, we propose an unsupervised training method for CNN interest point detectors which requires no labels - only uncalibrated image sequences. The proposed method is able to establish relative poses with a minimum of extracted interest points. However, descriptors still need to be extracted and transmitted to establish these poses.

This problem is addressed in IMIPs, where we propose the first feature matching pipeline that works by implicit matching, without the need of descriptors. In IMIPs, the detector CNN has multiple output channels, and each channel generates a single interest point. Between viewpoints, interest points obtained from the same channel are considered implicitly matched. This allows matching points with as little as 3 bytes per point - the point coordinates in an up to 4096 x 4096 image.


References

IMIPs

T. Cieslewski, M. Bloesch, D. Scaramuzza

Matching Features without Descriptors:
Implicitly Matched Interest Points

British Machine Vision Conference (BMVC), Cardiff, 2019.

PDF Poster Code and Data


SIPs: Succinct Interest Points from Unsupervised Inlierness Probability Learning

T. Cieslewski, K. G. Derpanis, D. Scaramuzza

SIPs: Succinct Interest Points from Unsupervised Inlierness Probability Learning

IEEE International Conference on 3D Vision (3DV), 2019.

PDF Poster YouTube Code and Data


Visual-Inertial Odometry of Aerial Robotics

encyclopedia_vio
Visual-Inertial odometry (VIO) is the process of estimating the state (pose and velocity) of an agent (e.g., an aerial robot) by using only the input of one or more cameras plus one or more Inertial Measurement Units (IMUs) attached to it. VIO is the only viable alternative to GPS and lidar-based odometry to achieve accurate state estimation. Since both cameras and IMUs are very cheap, these sensor types are ubiquitous in all today's aerial robots.

References

encyclopedia19_scaramuzza

D. Scaramuzza, Z. Zhang

Visual-Inertial Odometry of Aerial Robots

Encyclopedia of Robotics, Springer, 2019

PDF


Probabilistic, Continuous-Time Trajectory Evaluation for SLAM

Trajectory Evaluation
Despite the existence of different error metrics for trajectory evaluation in SLAM, their theoretical justifications and connections are rarely studied, and few methods handle temporal association properly. In this work, we propose to formulate the trajectory evaluation problem in a probabilistic, continuous-time framework. By modeling the groundtruth as random variables, the concepts of absolute and relative error are generalized to be likelihood. Moreover, the groundtruth is represented as a piecewise Gaussian Process in continuous-time. Within this framework, we are able to establish theoretical connections between relative and absolute error metrics and handle temporal association in a principled manner.

References

WICRA19_Zhang

Z. Zhang, D. Scaramuzza

Rethinking Trajectory Evaluation for SLAM: a Probabilistic, Continuous-Time Approach

ICRA19 Workshop on Dataset Generation and Benchmarking of SLAM Algorithms for Robotics and VR/AR

Best Paper Award!

PDF


Visual Inertial Model-based Odometry and Force Estimation

In recent years, many approaches to Visual Inertial Odometry (VIO) have become available. However, they neither exploit the robot's dynamics and known actuation inputs, nor differentiate between desired motion due to actuation and unwanted perturbation due to external force. For many robotic applications, it is often essential to sense the external force acting on the system due to, for example, interactions, contacts, and disturbances. Adding a motion constraint to an estimator leads to a discrepancy between the model-predicted motion and the actual motion. Our approach exploits this discrepancy and resolves it by simultaneously estimating the motion and the external force. We propose a relative motion constraint combining the robot's dynamics and the external force in a preintegrated residual, resulting in a tightly-coupled, sliding-window estimator exploiting all correlations among all variables. We implement our Visual Inertial Model-based Odometry (VIMO) system into a state-of-the-art VIO approach and evaluate it against the original pipeline without motion constraints on both simulated and real-world data. The results show that our approach increases the accuracy of the estimator up to 29\% compared to the original VIO, and provides external force estimates at no extra computational cost. To the best of our knowledge, this is the first approach exploiting model dynamics by jointly estimating motion and external force.

References

ICRA19_Zhang

B. Nisar, P. Foehn, D. Falanga, D. Scaramuzza

VIMO: Simultaneous Visual Inertial Model-based Odometry and Force Estimation

Robotics: Science and Systems (RSS), Freiburg, 2019

PDF Video Code


Fisher Information Field for Active Visual Localization

For mobile robots to localize robustly, actively considering the perception requirement at the planning stage is essential. In this paper, we propose a novel representation for active visual localization. By formulating the Fisher information and sensor visibility carefully, we are able to summarize the localization information into a discrete grid, namely the Fisher information field. The information for arbitrary poses can then be computed from the field in constant time, without the need of costly iterating all the 3D landmarks. Experimental results on simulated and real-world data show the great potential of our method in efficient active localization and perception- aware planning. To benefit related research, we release our implementation of the information field to the public.

References

ICRA19_Zhang

Z. Zhang, D. Scaramuzza

Beyond Point Clouds: Fisher Information Field for Active Visual Localization

IEEE International Conference on Robotics and Automation, 2019.

PDF Video Code


A Tutorial on Quantitative Trajectory Evaluation for Visual(-Inertial) Odometry

Trajectory Evaluation
In this tutorial, we provide principled methods to quantitatively evaluate the quality of an estimated trajectory from visual(-inertial) odometry (VO/VIO), which is the foundation of benchmarking the accuracy of different algorithms. First, we show how to determine the transformation type to use in trajectory alignment based on the specific sensing modality (i.e., monocular, stereo and visual-inertial). Second, we describe commonly used error metrics (i.e., the absolute trajectory error and the relative error) and their strengths and weaknesses. To make the methodology presented for VO/VIO applicable to other setups, we also generalize our formulation to any given sensing modality. To facilitate the reproducibility of related research, we publicly release our implementation of the methods described in this tutorial.

References

Trajectory Evaluation

Z. Zhang, D. Scaramuzza

A Tutorial on Quantitative Trajectory Evaluation for Visual(-Inertial) Odometry

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018.

PDF PPT VO/VIO Evaluation Toolbox


On the Comparison of Gauge Freedom Handling in Optimization-based Visual-Inertial State Estimation

Gauge Comparison
It is well known that visual-inertial state estimation is possible up to a four degrees-of-freedom (DoF) transformation (rotation around gravity and translation), and the extra DoFs ("gauge freedom") have to be handled properly. While different approaches for handling the gauge freedom have been used in practice, no previous study has been carried out to systematically analyze their differences. In this paper, we present the first comparative analysis of different methods for handling the gauge freedom in optimization-based visual-inertial state estimation. We experimentally compare three commonly used approaches: fixing the unobservable states to some given values, setting a prior on such states, or letting the states evolve freely during optimization. Specifically, we show that (i) the accuracy and computational time of the three methods are similar, with the free gauge approach being slightly faster; (ii) the covariance estimation from the free gauge approach appears dramatically different, but is actually tightly related to the other approaches. Our findings are validated both in simulation and on real-world datasets and can be useful for designing optimization-based visual-inertial state estimation algorithms.

References

Gauge Comparison

Z. Zhang, G, Gallego, D. Scaramuzza

On the Comparison of Gauge Freedom Handling in Optimization-based Visual-Inertial State Estimation

IEEE Robotics and Automation Letters (RA-L), 2018.

PDF PPT Code


Visual-Inertial Odometry Benchmarking

Flying robots require a combination of accuracy and low latency in their state estimation in order to achieve stable and robust flight. However, due to the power and payload constraints of aerial platforms, state estimation algorithms must provide these qualities under the computational constraints of embedded hardware. Cameras and inertial measurement units (IMUs) satisfy these power and payload constraints, so visual-inertial odometry (VIO) algorithms are popular choices for state estimation in these scenarios, in addition to their ability to operate without external localization from motion capture or global positioning systems. It is not clear from existing results in the literature, however, which VIO algorithms perform well under the accuracy, latency, and computational constraints of a flying robot with onboard state estimation. This paper evaluates an array of publicly-available VIO pipelines (MSCKF, OKVIS, ROVIO, VINS-Mono, SVO+MSF, and SVO+GTSAM) on different hardware configurations, including several single-board computer systems that are typically found on flying robots. The evaluation considers the pose estimation accuracy, per-frame processing time, and CPU and memory load while processing the EuRoC datasets, which contain six degree of freedom (6DoF) trajectories typical of flying robots. We present our complete results as a benchmark for the research community.

References

A Benchmark Comparison of Monocular VIO Algorithms for Flying Robots

J. Delmerico, D. Scaramuzza

A Benchmark Comparison of Monocular Visual-Inertial Odometry Algorithms for Flying Robots

IEEE International Conference on Robotics and Automation (ICRA), 2018.

PDF Video PPT


Active Exposure Control for Robust Visual Odometry in High Dynamic Range (HDR) Environments

In this paper, we propose an active exposure control method to improve the robustness of visual odometry in HDR (high dynamic range) environments. Our method evaluates the proper exposure time by maximizing a robust gradient-based image quality metric. The optimization is achieved by exploiting the photometric response function of the camera. Our exposure control method is evaluated in different real world environments and outperforms both the built-in auto-exposure function of the camera and a fixed exposure time. To validate the benefit of our approach, we test different state-of-the-art visual odometry pipelines (namely, ORB-SLAM2, DSO, and SVO 2.0) and demonstrate significant improved performance using our exposure control method in very challenging HDR environments. We release the code open-source.

References

ICRA17_Zhang

Z. Zhang, C. Forster, D. Scaramuzza

Active Exposure Control for Robust Visual Odometry in HDR Environments

IEEE International Conference on Robotics and Automation (ICRA), 2017.

PDF PPT YouTube Code


IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation

Recent results in monocular visual-inertial navigation (VIN) have shown that optimization-based approaches outperform filtering methods in terms of accuracy due to their capability to relinearize past states. However, the improvement comes at the cost of increased computational complexity. In this paper, we address this issue by preintegrating inertial measurements between selected keyframes. The preintegration allows us to accurately summarize hundreds of inertial measurements into a single relative motion constraint. Our first contribution is a preintegration theory that properly addresses the manifold structure of the rotation group and carefully deals with uncertainty propagation. The measurements are integrated in a local frame, which eliminates the need to repeat the integration when the linearization point changes while leaving the opportunity for belated bias corrections. The second contribution is to show that the preintegrated IMU model can be seamlessly integrated in a visual-inertial pipeline under the unifying framework of factor graphs. This enables the use of a structureless model for visual measurements, further accelerating the computation. The third contribution is an extensive evaluation of our monocular VIN pipeline: experimental results confirm that our system is very fast and demonstrates superior accuracy with respect to competitive state-of-the-art filtering and optimization algorithms, including off-the-shelf systems such as Google Tango.

References

RSS15_Forster

C. Forster, L. Carlone, F. Dellaert, D. Scaramuzza

On-Manifold Preintegration for Real-Time Visual-Inertial Odometry

IEEE Transactions on Robotics, in press, 2016.

PDF YouTube


RSS2015_Forster

C. Forster, L. Carlone, F. Dellaert, D. Scaramuzza

IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation

Robotics: Science and Systems (RSS), Rome, 2015.

Best Paper Award Finalist! Oral Presentation: Acceptance Rate 4%

PDF Supplementary material YouTube


SVO: Fast Semi-Direct Monocular Visual Odometry


We propose a semi-direct monocular visual odometry algorithm that is precise, robust, and faster than current state-of-the-art methods. The semi-direct approach eliminates the need of costly feature extraction and robust matching techniques for motion estimation. Our algorithm operates directly on pixel intensities, which results in subpixel precision at high frame-rates. A probabilistic mapping method that explicitly models outlier measurements is used to estimate 3D points, which results in fewer outliers and more reliable points. Precise and high frame-rate motion estimation brings increased robustness in scenes of little, repetitive, and high-frequency texture. The algorithm is applied to micro-aerial-vehicle stateestimation in GPS-denied environments and runs at 55 frames per second on the onboard embedded computer and at more than 300 frames per second on a consumer laptop.


This video shows results from a modification of the SVO algorithm that generalizes to a set of rigidly attached (not necessarily overlapping) cameras. Simultaneously, we run a CPU implementation of the REMODE algorithm on the front, left, and right camera. Everything runs in real-time on a laptop computer. Parking garage dataset courtesy of NVIDIA.

References

TRO17_Forster-SVO

Christian Forster, Zichao Zhang, Michael Gassner, Manuel Werlberger, Davide Scaramuzza

SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems

IEEE Transactions on Robotics, Vol. 33, Issue 2, pages 249-265, Apr. 2017.

Includes comparison against ORB-SLAM, LSD-SLAM, and DSO and comparison among Dense, Semi-dense, and Sparse Direct Image Alignment.

PDF YouTube Binaries Download


ICRA2014_Forster

C. Forster, M. Pizzoli, D. Scaramuzza

SVO: Fast Semi-Direct Monocular Visual Odometry

IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, 2014.

PDF YouTube Software SVO 2.0 Binaries Download


ICRA2014_Pizzoli

M. Pizzoli, C. Forster, D. Scaramuzza

REMODE: Probabilistic, Monocular Dense Reconstruction in Real Time

IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, 2014.

PDF YouTube


1-point RANSAC

Given a car equipped with an omnidirectional camera, the motion of the vehicle can be purely recovered from salient features tracked over time. We propose the 1-Point RANSAC algorithm which runs at 800 Hz on a normal laptop. To our knowledge, this is the most efficient visual odometry algorithm.



This video shows the estimation of the vehicle motion from image features. The video demonstrate the approach described in our paper which uses 1-point RANSAC algorithm to remove the outliers. Except for the features extraction process, the outlier removal and the motion estimation steps take less than 1 ms on a normal laptop computer.

References

D. Scaramuzza and F. Fraundorfer. Visual Odometry: Part I - The First 30 Years and Fundamentals. IEEE Robotics and Automation Magazine, Volume 18, issue 4, 2011. [ PDF ]
F. Fraundorfer and D. Scaramuzza. Visual odometry: Part II - Matching, robustness, optimization, and applications. IEEE Robotics and Automation Magazine, Volume 19, issue 2, 2012. [ PDF ]
D. Scaramuzza. 1-Point-RANSAC Structure from Motion for Vehicle-Mounted Cameras by Exploiting Non-holonomic Constraints. International Journal of Computer Vision, Volume 95, Issue 1, 2011. [ PDF ]
D. Scaramuzza. Performance Evaluation of 1-Point-RANSAC Visual Odometry. Journal of Field Robotics, Volume 28, issue 5, 2011. PDF ]
D. Scaramuzza, A. Censi, K. Daniilidis. Exploiting Motion Priors in Visual Odometry for Vehicle-Mounted Cameras with Non-holonomic Constraints. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), San Francisco, September, 2011. [ PDF ]
L. Kneip, D. Scaramuzza, R. Siegwart. A Novel Parameterization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, 2011. [ PDF ] [C/C++ code]
L. Kneip, A. Martinelli, S. Weiss, D. Scaramuzza, R. Siegwart. A Closed-Form Solution for Absolute Scale Velocity Determination Combining Inertial Measurements and a Single Feature Correspondence. IEEE International Conference on Robotics and Automation (ICRA 2011), Shanghai, 2011. [ PDF ]
D. Scaramuzza, F. Fraundorfer, and M. Pollefeys. Closing the Loop in Appearance-Guided Omnidirectional Visual Odometry by Using Vocabulary Trees. Robotics and Autonomous System Journal (Elsevier), Volume 58, issue 6, June, 2010. [ PDF ]
L. Kneip, D. Scaramuzza, R. Siegwart. On the Initialization of Statistical Optimum Filters with Application to Motion Estimation. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2010), Taipei, October, 2010. [ PDF ]
F. Fraundorfer, D. Scaramuzza, M. Pollefeys. A Constricted Bundle Adjustment Parameterization for Relative Scale Estimation in Visual Odometry. IEEE International Conference on Robotics and Automation (ICRA 2010), Anchorage, Alaska, May, 2010. [ PDF ]
D. Scaramuzza, L. Spinello, R. Triebel, R., Siegwart. Key Technologies for Intelligent and Safer Cars from Motion Estimation to Predictive Motion Planning. IEEE International Conference on Industrial Electronics, Bari, Italy, July, 2010. [ PDF ]
D. Sabatta, D. Scaramuzza, R. Siegwart. Improved Appearance-Based Matching in Similar and Dynamic Environments Using a Vocabulary Tree. IEEE International Conference on Robotics and Automation (ICRA 2010), Anchorage, Alaska, May, 2010. [ PDF ]
D. Scaramuzza, F. Fraundorfer, M. Pollefeys, R. Siegwart. Absolute Scale in Structure from Motion from a Single Vehicle Mounted Camera by Exploiting Nonholonomic Constraints. IEEE International Conference on Computer Vision (ICCV 2009), Kyoto, September-October, 2009. [ PDF ]
D. Scaramuzza, F. Fraundorfer, R. Siegwart. Real-Time Monocular Visual Odometry for On-Road Vehicles with 1-Point RANSAC. IEEE International Conference on Robotics and Automation (ICRA 2009), Kobe, Japan, May, 2009. [ PDF ]
D. Scaramuzza, R. Siegwart. Appearance-Guided Monocular Omnidirectional Visual Odometry for Outdoor Ground Vehicles. IEEE Transactions on Robotics, Volume 24, issue 5, October 2008. [ PDF ]

Robot Localization Using Soft Object Detection

Most of the work done in localization, mapping, and navigation for both ground and aerial vehicles has been done by means of point landmarks or occupancy grids, using vision or laser range finders. However, to make these robots one day able to cooperate with humans in complex scenarios, we need to build semantic maps of the environment. In this work we address map-based localization using "soft" object detection. Soft object detection differs from "hard" object detection in that we do not extract an "affirmative/negative" response about the presence of the object but rather we compute, for each pixel in the current frame, the probability that the object under consideration is there. This gives raise to many false positive (see the multiple peaks in the object "heat-map") that are disambiguated during motion by the particle filter.

References

R. Anati, D. Scaramuzza, K. Derpanis, K. Daniilidis.

Robot Localization Using Soft Object Detection

IEEE International Conference on Robotics and Automation (ICRA), St. Paul, 2012.

PDF