Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction


Recurrent Asynchronous Multimodal Networks


Description

Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous "events". They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. Both sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods. This is because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames, where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
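
To make the idea concrete, here is a minimal conceptual sketch of a RAM-style update in PyTorch. It is not the released implementation: the module names, layer sizes and the GRU-style fusion are illustrative assumptions, chosen only to show how a shared hidden state can be updated whenever either modality arrives and decoded into a depth prediction at an arbitrary query time.

```python
# Conceptual sketch only -- not the released RAM network implementation.
# It illustrates the core idea: a shared hidden state that is updated whenever
# *either* sensor (events or frames) produces data, and that can be decoded
# into a depth prediction at any time. All layer sizes are made up.
import torch
import torch.nn as nn

class RAMCellSketch(nn.Module):
    def __init__(self, event_channels=5, image_channels=1, hidden_dim=64):
        super().__init__()
        # One encoder per modality maps asynchronous inputs to a common feature space.
        self.event_encoder = nn.Conv2d(event_channels, hidden_dim, 3, padding=1)
        self.image_encoder = nn.Conv2d(image_channels, hidden_dim, 3, padding=1)
        # A convolutional GRU-style gate fuses each new observation into the state.
        self.update_gate = nn.Conv2d(2 * hidden_dim, hidden_dim, 3, padding=1)
        self.candidate = nn.Conv2d(2 * hidden_dim, hidden_dim, 3, padding=1)
        # The decoder turns the hidden state into a dense depth map on demand.
        self.decoder = nn.Conv2d(hidden_dim, 1, 3, padding=1)

    def update(self, state, x, modality):
        # Called asynchronously, whenever a new event tensor or intensity frame arrives.
        feat = self.event_encoder(x) if modality == "events" else self.image_encoder(x)
        if state is None:
            state = torch.zeros_like(feat)
        z = torch.sigmoid(self.update_gate(torch.cat([state, feat], dim=1)))
        h_tilde = torch.tanh(self.candidate(torch.cat([state, feat], dim=1)))
        return (1 - z) * state + z * h_tilde

    def query(self, state):
        # The hidden state can be decoded at any time, independent of sensor rates.
        return self.decoder(state)

# Usage: interleave low-rate frames and high-rate event tensors, query anytime.
ram = RAMCellSketch()
state = None
state = ram.update(state, torch.rand(1, 1, 256, 512), modality="frame")   # intensity frame
state = ram.update(state, torch.rand(1, 5, 256, 512), modality="events")  # event tensor
depth = ram.query(state)                                                  # prediction at query time
```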

Citing

If you use this work in your research, please cite the following paper:

Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

D. Gehrig*, M. Rüegg*, M. Gehrig, J. Hidalgo-Carrió, D. Scaramuzza

IEEE Robotics and Automation Letters (RA-L), 2021.





Code for Depth Prediction

The depth prediction code and a pretrained model are available here.

Event-based Dataset for Multimodal Learning - EventScape

The dataset used to pretrain the model in our paper Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction is available below. Crucially, compared to the DENSE dataset, EventScape also contains moving pedestrians, enabling the development of pedestrian-aware perception algorithms. The simulated sensor has a resolution of 512 x 256 pixels and a focal length of 256 pixels.

EventScape is split into training, validation and test folders, which contain 536, 103 and 119 sequences respectively. Each sequence contains between 100 and 150 samples, each with semantic labels, depth labels, raw events, frames and vehicle navigation data. The vehicle navigation data contain the following information: position, orientation, angular velocity, linear velocity, steering angle, brake state and throttle.

Below you can find sample sequences from each dataset split. To download the whole dataset, use the following links: Training set (71 GB), Validation set (12 GB), Test set (14 GB). Note that, while the validation and test sets were both recorded in Town 05, they cover different locations with no overlap.
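
For illustration, the sketch below shows how a single EventScape sample might be assembled and how the camera intrinsics follow from the stated sensor model. The directory names, the .npy file format and the centered principal point are assumptions made for this example; the released dataset and loader define the actual layout.

```python
# Hypothetical loading sketch -- the directory layout and file names inside an
# EventScape sequence are assumptions here; consult the released dataset for
# the actual structure. Shown only to illustrate what one sample contains.
from pathlib import Path
import numpy as np

def load_sample(sequence_dir: Path, index: int):
    """Assemble one multimodal sample (frame, events, depth, semantics, navigation)."""
    frame     = np.load(sequence_dir / "frames"     / f"{index:06d}.npy")  # intensity image
    events    = np.load(sequence_dir / "events"     / f"{index:06d}.npy")  # (N, 4): x, y, t, polarity
    depth     = np.load(sequence_dir / "depth"      / f"{index:06d}.npy")  # depth map
    semantics = np.load(sequence_dir / "semantic"   / f"{index:06d}.npy")  # per-pixel class ids
    nav       = np.load(sequence_dir / "navigation" / f"{index:06d}.npy")  # position, orientation, ...
    return dict(frame=frame, events=events, depth=depth, semantics=semantics, nav=nav)

# Camera intrinsics implied by the stated sensor model: 512 x 256 pixels and a
# focal length of 256 pixels; the principal point is assumed at the image center.
K = np.array([[256.0,   0.0, 256.0],
              [  0.0, 256.0, 128.0],
              [  0.0,   0.0,   1.0]])
```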


Training Sequences

Download the whole training set (71 GB), or individual sample sequences below:

train_00_town01.zip (121 MB)
train_01_town02.zip (172 MB)
train_02_town03.zip (153 MB)

Validation Sequences

Download the whole validation set (12 GB), or individual sample sequences below:

valid_08_town05.zip (183 MB)
valid_09_town05.zip (123 MB)

Test Sequences

Download the whole test set (14 GB), or individual sample sequences below:

test_00_town05.zip (183 MB)
test_01_town05.zip (98 MB)