## The development of the DVS and DAVIS sensors

**Tobi Delbruck** 

Inst. of Neuroinformatics, University of Zurich and ETH Zurich



**Sponsors:** Swiss National Science Foundation **NCCR Robotics** project, EU projects **CAVIAR**, **SEEBETTER** and **VISUALISE**, **Samsung**, **DARPA**, University of Zurich and ETH Zurich

sensors.ini.uzh.ch inilabs.com

#### The Human Eye as a digital camera



- 100M photoreceptors
- 1M output fibers carrying max 100 Hz spike rates
- 180 dB (10<sup>9</sup>) operating range
- >20 different "eyes"
- Many GOPs computing
- 3 mW power consumption

Output is sparse, asynchronous stream of digital spike events

# Conventional cameras (**Static vision sensors**) output a stroboscopic sequence of frames



# Good

- Compatible with 50+ years of machine vision
- Allows small pixels (1 µm for consumer, 3-5 µm for machine vision)

# Bad

- Redundant output
- Temporal aliasing
- Limited dynamic range (60 dB)
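The dynamic-range figures on these slides (180 dB for the eye, 60 dB for a conventional camera) can be converted to intensity ratios. The sketch below assumes the 20·log10 convention, which is the one consistent with the slide's own equivalence of 180 dB and 10<sup>9</sup>; the function names are illustrative only.

```python
import math


def db_to_ratio(db: float) -> float:
    """Convert decibels to an intensity ratio, assuming dB = 20 * log10(ratio)."""
    return 10.0 ** (db / 20.0)


def ratio_to_db(ratio: float) -> float:
    """Inverse of db_to_ratio under the same 20 * log10 assumption."""
    return 20.0 * math.log10(ratio)


eye_ratio = db_to_ratio(180.0)     # human eye operating range from the slide
camera_ratio = db_to_ratio(60.0)   # conventional camera dynamic range from the slide
print(f"eye: {eye_ratio:.0e}:1, camera: {camera_ratio:.0e}:1")
```

Under this convention the eye's operating range is about six orders of magnitude wider than that of a conventional frame-based camera.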

#### Fundamental "latency vs. power" trade-off



#### DVS (Dynamic Vision Sensor) Pixel



#### DVS pixel has wide dynamic range



Figure: Edmund 0.1 density chart imaged at 780 lux and 5.8 lux; illumination ratio = 135:1

**ISSCC 2007** 

#### **DAVIS** (Dynamic and Active Pixel Vision Sensor) Pixel







#### DVS/DAVIS +IMU demo





Brandli, Berner, Delbruck et al., Symp. VLSI 2013, JSSC 2014, ISCAS 2015

| TEMPDIFF128 DVS sensor specifications | Value | Note |
|---|---|---|
| Functionality | Asynchronous temporal contrast | |
| Pixel size µm (lambda) | 40x40 (200… | |
| Fill factor (%) | 8.1% (PD area 1… | |
| Fabrication process | 4M 2P 0.3… | |
| Pixel complexity | 26 transistors (… analog), 3… | |
| Array size | 128x128 | |
| Die size mm² | 6x6.3 | |
| Interface | 15-bit word-parallel AER | |
| Power consumption | 24 mW @ 3.3 V: 1.5 mA core, 0.3 mA logic, 5.5 mA biases | Only at die level; USB DVS cameras burn ~500 mW |
| Dynamic range | 120 dB; 2 lux to >100 klux scene illumination with f/2 lens | |
| Photodiode dark current at room temperature | 4 fA (~10 nA/cm²), Nwell photodiode | |
| Response latency | 15 µs @ 1 klux chip illumination | Real world: 100 µs-10 ms |
| Frames/sec or bandwidth | ~1M events/sec | |
| FPN, matching | 2.1% contrast | Threshold matching; important for modeling DVS, e.g. Bayes |

## Experiment: Apply slow triangle wave LED stimulus to entire array, measure number of events that pixels generate



## Event threshold matching measurement



Lichtsteiner et al., IEEE JSSC 2008

Integrated bias generator and circuit design enables operation over extended temperature range



Nozaki & Delbruck IEEE TED, 2017 in review

## DVS pixel size trend



Delbruck, IEEE ESSCIRC 2016

#### Event camera silicon retina developments

ATIS/CCAM

**DVS/DAVIS**

#### Commercial entities

- Inilabs (Zurich) – R&D prototypes
- Insightness (Zurich) – Drones and Augmented Reality
- Samsung (S Korea) – Consumer electronics
- Pixium Vision (Paris) – Retinal implants
- Inivation (Zurich) – Industrial applications, Automotive
- Chronocam (Paris) – Automotive
- Hillhouse (Singapore) – Automotive DVS

CeleX



#### Founded 2009 Run as not-for-profit

#### **Neuromorphic sensor R&D prototypes**

Open source software, user guides, app notes, sample data

Shipped devices based on EU funded silicon to >200 organizations



# Tracking objects from DVS events using spatio-temporal coherence



- 1. For each event, find nearest cluster
  - If event within a cluster, move cluster
  - If event not within cluster, seed new cluster
- 2. Periodically prune starved clusters, merge clusters, etc. (lifetime management)
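The two steps above can be sketched in a few lines. This is a minimal illustration, not the tracker used in the talk: the class and parameter names (`ClusterTracker`, `radius`, `mix`, `max_age_us`) and their values are assumptions, and cluster merging is left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    x: float
    y: float
    last_t_us: int       # timestamp of the most recent supporting event
    n_events: int = 1


class ClusterTracker:
    """Hypothetical event-driven cluster tracker (all parameters are assumed values)."""

    def __init__(self, radius=8.0, mix=0.1, max_age_us=50_000):
        self.radius = radius          # an event within this distance supports a cluster
        self.mix = mix                # fraction by which one event moves the cluster
        self.max_age_us = max_age_us  # clusters unsupported for longer are "starved"
        self.clusters = []

    def on_event(self, x, y, t_us):
        # Step 1: find the nearest cluster to this event.
        nearest = min(self.clusters,
                      key=lambda c: math.hypot(c.x - x, c.y - y),
                      default=None)
        if nearest is not None and math.hypot(nearest.x - x, nearest.y - y) <= self.radius:
            # Event falls within the cluster: move the cluster toward the event.
            nearest.x += self.mix * (x - nearest.x)
            nearest.y += self.mix * (y - nearest.y)
            nearest.last_t_us = t_us
            nearest.n_events += 1
        else:
            # Event is outside every cluster: seed a new one.
            self.clusters.append(Cluster(x, y, t_us))

    def prune(self, now_us):
        # Step 2: drop clusters that have not received events recently.
        self.clusters = [c for c in self.clusters
                         if now_us - c.last_t_us <= self.max_age_us]


tracker = ClusterTracker()
for i in range(20):
    tracker.on_event(10 + i % 3, 10, t_us=i * 100)          # synthetic object A
    tracker.on_event(100, 100 + i % 3, t_us=i * 100 + 50)   # synthetic object B
n_before = len(tracker.clusters)
tracker.on_event(11, 10, t_us=55_000)   # only object A keeps producing events
tracker.prune(now_us=60_000)
n_after = len(tracker.clusters)
```

Each cluster stores only a handful of numbers, and no frame buffer is involved, which is the property the slide's "no frame memory" point relies on.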

#### Advantages

- 1. Low computational cost (e.g. <5% CPU)
- 2. No frame memory (~100 bytes/object).
- 3. No frame correspondence problem

Litzenberger 2007



Using DVS allows 2 ms reaction time at 4% processor load with USB bus connections

# RoboGoalie www.ini.uzh.ch 2007

<3% laptop CPU load <3ms control latency



#### This talk has 4 parts



#### Block-Matching Optical Flow for Dynamic Vision Sensor: Algorithm and FPGA Implementation





Liu, Min, and Tobi Delbruck. 2017. "Block-Matching Optical Flow for Dynamic Vision Sensor: Algorithm and FPGA Implementation." In *2017 IEEE Symposium on Circuits and Systems (ISCAS 2017)*, in press. Baltimore, MD, USA.

#### Why do we need rapid and low power optic flow?

Human first person view drone racing

Child runs in front of car



https://youtu.be/nLUmW6OfEy0

https://youtu.be/UsoxrsrsdgA

Optical flow could be key part of enabling solutions to these problems

#### Example of State of the Art Deep Learning Optic Flow

#### "FlowNet 2.0"

- 1. Computes dense OF with amazing accuracy
- 2. Uses a complex stack of multiple CNNs that process image pairs
- 3. Requires labeled training data that is difficult to obtain and the CNN is difficult to train
- 4. Requires powerful PC plus GPU with
  ~400W power consumption to run
  HD video at 8 frames/second



Ilg, Eddy, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and **Thomas Brox**. 2016. "**FlowNet 2.0**: Evolution of Optical Flow Estimation with Deep Networks." arXiv:1612.01925 [Cs], December.

http://arxiv.org/abs/1612.01925 https://www.youtube.com/watch?v=JSzUdVBmQP4

#### Prior work for DVS optical flow

- Delbruck, 2007, jAER project
- Benosman et al., Neural Networks 2012
- Benosman et al., IEEE TNNLS 2013
- Orchard et al., BioCAS 2013
- Barranco et al., Proc IEEE, 2014
- Brosch et al., Frontiers 2015
- Mueggler et al., ICRA 2015
- Conradt, ROBIO 2015
- Rueckauer and Delbruck, Frontiers 2016
- Bardow et al., CVPR 2016

Existing methods are serial algorithms that robustly solve linear or nonlinear constraints. They require several µs/event on a fast PC or PC + GPU.



Our work is inspired by MPEG motion estimation hardware. We seek a **semi-dense OF** architecture for **event-based vision sensors** that is **easily & cheaply** implemented in digital logic.



#### Advantages

- 1. Decouples the sample rate from incoming event rate
- 2. Incoming events drive (optional!) flow computation
- 3. Parallel block matching hardware quickly computes block distances to find best match
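The block-matching idea can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the FPGA design from the paper: events are assumed to have been accumulated into two binary bitmaps ("slices") covering consecutive time intervals, the distance metric is assumed to be a sum of absolute differences, and the names `block_distance`, `best_match`, and the 0.1 s slice duration are hypothetical. The loops here run serially, whereas hardware would evaluate the candidate offsets in parallel.

```python
def block_distance(older, newer, cx, cy, dx, dy, half):
    """Sum of absolute differences between the block centred on (cx, cy) in the
    older slice and the block centred on (cx + dx, cy + dy) in the newer slice."""
    total = 0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            total += abs(older[cy + j][cx + i] - newer[cy + dy + j][cx + dx + i])
    return total


def best_match(older, newer, cx, cy, half=4, search=1):
    """Exhaustively test every offset within +/-search pixels.
    Returns (dx, dy, distance) for the offset with the smallest block distance."""
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            d = block_distance(older, newer, cx, cy, dx, dy, half)
            if best is None or d < best[2]:
                best = (dx, dy, d)
    return best


# Synthetic test pattern: a few event pixels that move one pixel to the right
# between the older and the newer slice.
SIZE = 32
points = [(10, 10), (11, 10), (12, 11), (10, 12), (13, 13), (11, 14)]
older = [[0] * SIZE for _ in range(SIZE)]
newer = [[0] * SIZE for _ in range(SIZE)]
for px, py in points:
    older[py][px] = 1
    newer[py][px + 1] = 1

dx, dy, dist = best_match(older, newer, cx=12, cy=12)
slice_duration_s = 0.1                  # assumed slice duration
flow_px_per_s = dx / slice_duration_s   # horizontal flow in pixels per second
```

Because the slices are rotated on a fixed schedule rather than per event, the cost of one match does not depend on the incoming event rate, and a match can be skipped entirely when no flow estimate is needed.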

#### Development and Testing Architecture



#### State machine and bitmap memory access



#### **OF** Results

- 1. Using 9x9 block size and search distance of 1 pixel
- 2. "Translating boxes" data from Rueckauer & Delbruck, 2016
- 3. Ground truth is 10 pps (pixels per second) to the right
- 4. Slice duration d = 100 ms
- 5. Average measured pps is correct
- 6. Many vectors are not "normal flow"
- 7. Solved by improved search



### Latest algorithm improvements (not in paper)

- 1. Using multiscale bitmaps cheaply matches longer distances and lower spatial frequencies.
- 2. Using multiple bits/pixel avoids bitmap saturation, e.g. 3 bits/pixel holds 8 events.
- 3. Using diamond search improves search efficiency by >20X for search distances of 12 pixels.
- 4. Adapting slice duration under feedback control achieves a target average match distance, increasing speed range and usability.
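The slice-duration feedback in the last point can be illustrated with a simple proportional controller. This is a sketch under assumptions, not the controller used in the talk: the update rule, the `gain`, the duration limits, and the simulated constant speed are all hypothetical. It relies only on the observation that the displacement per slice grows roughly linearly with slice duration.

```python
def adapt_slice_duration(d_us, avg_match_dist, target_dist,
                         gain=0.5, d_min_us=1_000, d_max_us=1_000_000):
    """Return a new slice duration (microseconds) that nudges the average match
    distance toward target_dist. All parameter values are assumptions."""
    if avg_match_dist <= 0:
        # No measurable motion: lengthen the slice so slow motion becomes visible.
        new_d = d_us * (1.0 + gain)
    else:
        # Displacement per slice scales roughly with duration, so scale the
        # duration by the target/measured ratio, damped by the gain.
        ratio = target_dist / avg_match_dist
        new_d = d_us * (1.0 + gain * (ratio - 1.0))
    return int(round(min(d_max_us, max(d_min_us, new_d))))


# Simulated scene moving at a constant, hypothetical speed.
speed_px_per_s = 200.0
d_us = 100_000   # start with 100 ms slices
for _ in range(30):
    measured = speed_px_per_s * d_us * 1e-6   # pixels moved during one slice
    d_us = adapt_slice_duration(d_us, measured, target_dist=3.0)
final_dist = speed_px_per_s * d_us * 1e-6
```

With a fast scene the controller shortens the slices until matches fall within the search range; with a slow scene it lengthens them, which is how a fixed search distance can cover a wide range of speeds.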



#### 2nd generation OF algorithm

#### This talk has 4 parts

- Dynamic Vision Sensor Silicon Retinas
- Simple object tracking by algorithmic processing of events
- DVS Optical Flow // architecture
- "Data-driven" deep inference with CNNs



#### Driving Convolutional Neural Networks with DVS

- Use DVS to drive conventional CNN with constant event count DVS frames.
- This way, the frame rate is proportional to the rate of change of the scene.
- The DVS local gain control and sparse output make good input features for the CNN
- The CNNs can be trained using conventional DNN toolchains, e.g. Caffe.
- The sparse DVS frames benefit CNN accelerators that take advantage of sparsity (e.g. our NullHop).
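Constant-event-count frames, as described in the first two points, can be sketched as a generator that emits a 2D event histogram every time a fixed number of events has arrived. This is an illustrative sketch only: the event tuple layout `(x, y, t_us, polarity)`, the function name, and the choice to ignore polarity are assumptions, not the preprocessing used in the cited demos.

```python
def constant_count_frames(events, width, height, events_per_frame):
    """Yield (t_start_us, t_end_us, frame), where each frame is a height x width
    histogram built from exactly events_per_frame events.
    events: iterable of (x, y, t_us, polarity); polarity is ignored in this sketch."""
    frame = [[0] * width for _ in range(height)]
    count = 0
    t_start = None
    for x, y, t_us, _polarity in events:
        if t_start is None:
            t_start = t_us
        frame[y][x] += 1
        count += 1
        if count == events_per_frame:
            yield t_start, t_us, frame
            frame = [[0] * width for _ in range(height)]
            count = 0
            t_start = None


# Synthetic stream: 100 slow events (1 ms apart) followed by 100 fast events (10 us apart).
events = [(i % 8, (i // 8) % 8, i * 1000, 1) for i in range(100)]
t0 = events[-1][2]
events += [(i % 8, (i // 8) % 8, t0 + (i + 1) * 10, 1) for i in range(100)]

frames = list(constant_count_frames(events, width=8, height=8, events_per_frame=50))
durations_us = [t_end - t_start for t_start, t_end, _ in frames]
```

The frames from the fast part of the stream span much shorter time intervals than those from the slow part, so the effective frame rate follows the scene activity instead of a fixed clock.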



Aimar, et al. 2016. "Nullhop: Flexibly Efficient FPGA CNN Accelerator Driven by DAVIS Neuromorphic Vision Sensor." *NIPS 2016 Live Demonstration*.

Lungu, et al. 2017. "Live Demonstration: Convolutional Neural Network Driven by Dynamic Vision Sensor Playing RoShamBo." In 2017 IEEE Symposium on Circuits and Systems (ISCAS 2017).

Moeys, D. P., et al. 2016. "Steering a Predator Robot Using a Mixed Frame/Event-Driven Convolutional Neural Network." In 2016 IEEE Second International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP).

What are these people doing?

They are playing RoShamBo, aka Rock-Scissors-Paper, aka Janken

## RoShamBo CNN architecture

Conventional 5-layer LeNet with ReLU/MaxPool and 1 FC layer before output.



I.-A. Lungu, F. Corradi, and T. Delbruck, "Live Demonstration: Convolutional Neural Network Driven by Dynamic Vision Sensor Playing RoShamBo," in 2017 IEEE Symposium on Circuits and Systems (ISCAS 2017), Baltimore, MD, USA, 2017.

#### RoShamBo training images




Classes shown: SCISSORS, PAPER

## Start RoShamBo Demo

A. Aimar, E. Calabrese, H. Mostafa, A. Rios-Navarro, R. Tapiador, I.-A. Lungu, A. Jimenez-Fernandez, F. Corradi, S.-C. Liu, A. Linares-Barranco, and T. Delbruck, "Nullhop: Flexibly efficient FPGA CNN accelerator driven by DAVIS neuromorphic vision sensor," in *NIPS 2016*, Barcelona, 2016.

## Conclusions

- 1. The DVS was developed by following a neuromorphic approach of emulating key properties of biological retinas
- 2. Wide dynamic range and sparse, quick output make these sensors useful in real-time, uncontrolled conditions
- 3. Applications: vision prosthetics, surveillance, robotics and consumer electronics
- 4. Precise event timing could improve learning and inference
- 5. Main challenges are to reduce pixel size and to develop effective algorithms. Only industry can do the first, but academia has plenty of room to play for the second
- 6. Event sensors can nicely drive deep inference. There is a lot of room for improvement of deep inference power efficiency at the system level









