3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation

Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, Matthias Nießner
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020

We present 3D-MPA, a method for instance segmentation on 3D point clouds. Given an input point cloud, we propose an object-centric approach where each point votes for its object center. We sample object proposals from the predicted object centers. Then we learn proposal features from grouped point features that voted for the same object center. A graph convolutional network introduces inter-proposal relations, providing higher-level feature learning in addition to the lower-level point features. Each proposal comprises a semantic label, a set of associated points over which we define a foreground-background mask, an objectness score and aggregation features. Previous works usually perform non-maximum-suppression (NMS) over proposals to obtain the final object detections or semantic instances. However, NMS can discard potentially correct predictions. Instead, our approach keeps all proposals and groups them together based on the learned aggregation features. We show that grouping proposals improves over NMS and outperforms previous state-of-the-art methods on the tasks of 3D object detection and semantic instance segmentation on the ScanNetV2 benchmark and the S3DIS dataset.

» Show BibTeX

title = {{3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation}},
author = {Engelmann, Francis and Bokeloh, Martin and Fathi, Alireza and Leibe, Bastian and Nie{\ss}ner, Matthias},
booktitle = {{IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}},
year = {2020}

DualConvMesh-Net: Joint Geodesic and Euclidean Convolutions on 3D Meshes

Jonas Schult*, Francis Engelmann*, Theodora Kontogianni, Bastian Leibe
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020 (Oral)

We propose DualConvMesh-Nets (DMC-Net) a family of deep hierarchical convolutional networks over 3D geometric data that combines two types of convolutions. The first type, geodesic convolutions, defines the kernel weights over mesh surfaces or graphs. That is, the convolutional kernel weights are mapped to the local surface of a given mesh. The second type, Euclidean convolutions, is independent of any underlying mesh structure. The convolutional kernel is applied on a neighborhood obtained from a local affinity representation based on the Euclidean distance between 3D points. Intuitively, geodesic convolutions can easily separate objects that are spatially close but have disconnected surfaces, while Euclidean convolutions can represent interactions between nearby objects better, as they are oblivious to object surfaces. To realize a multi-resolution architecture, we borrow well-established mesh simplification methods from the geometry processing domain and adapt them to define mesh-preserving pooling and unpooling operations. We experimentally show that combining both types of convolutions in our architecture leads to significant performance gains for 3D semantic segmentation, and we report competitive results on three scene segmentation benchmarks.

» Show BibTeX

author = {Jonas Schult* and
Francis Engelmann* and
Theodora Kontogianni and
Bastian Leibe},
title = {{DualConvMesh-Net: Joint Geodesic and Euclidean Convolutions on 3D Meshes}},
booktitle = {{IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}},
year = {2020}

Siam R-CNN: Visual Tracking by Re-Detection

Paul Voigtlaender, Jonathon Luiten, Philip Torr, Bastian Leibe

We present Siam R-CNN, a Siamese re-detection architecture which unleashes the full power of two-stage object detection approaches for visual object tracking. We combine this with a novel tracklet-based dynamic programming algorithm, which takes advantage of re-detections of both the first-frame template and previous-frame predictions, to model the full history of both the object to be tracked and potential distractor objects. This enables our approach to make better tracking decisions, as well as to re-detect tracked objects after long occlusion. Finally, we propose a novel hard example mining strategy to improve Siam RCNN’s robustness to similar looking objects. The proposed tracker achieves the current best performance on ten tracking benchmarks, with especially strong results for long-term tracking.

» Show BibTeX

title={Siam R-CNN: Visual Tracking by Re-Detection},
author={Paul Voigtlaender and Jonathon Luiten and Philip H. S. Torr and Bastian Leibe},

Dilated Point Convolutions: On the Receptive Field Size of Point Convolutions on 3D Point Clouds

Francis Engelmann, Theodora Kontogianni, Bastian Leibe
International Conference on Robotics and Automation (ICRA) 2020

In this work, we propose Dilated Point Convolutions (DPC). In a thorough ablation study, we show that the receptive field size is directly related to the performance of 3D point cloud processing tasks, including semantic segmentation and object classification. Point convolutions are widely used to efficiently process 3D data representations such as point clouds or graphs. However, we observe that the receptive field size of recent point convolutional networks is inherently limited. Our dilated point convolutions alleviate this issue, they significantly increase the receptive field size of point convolutions. Importantly, our dilation mechanism can easily be integrated into most existing point convolutional networks. To evaluate the resulting network architectures, we visualize the receptive field and report competitive scores on popular point cloud benchmarks.

» Show BibTeX

author = {Engelmann, Francis and Kontogianni, Theodora and Leibe, Bastian},
title = {{Dilated Point Convolutions: On the Receptive Field Size of Point Convolutions on 3D Point Clouds}},
booktitle = {{International Conference on Robotics and Automation (ICRA)}},
year = {2020}

Track to Reconstruct and Reconstruct to Track

Jonathon Luiten, Tobias Fischer, Bastian Leibe
RA-L 2020 / ICRA 2020

Object tracking and 3D reconstruction are often performed together, with tracking used as input for reconstruction. However, the obtained reconstructions also provide useful information for improving tracking. We propose a novel method that closes this loop, first tracking to reconstruct, and then reconstructing to track. Our approach, MOTSFusion (Multi-Object Tracking, Segmentation and dynamic object Fusion), exploits the 3D motion extracted from dynamic object reconstructions to track objects through long periods of complete occlusion and to recover missing detections. Our approach first builds up short tracklets using 2D optical flow, and then fuses these into dynamic 3D object reconstructions. The precise 3D object motion of these reconstructions is used to merge tracklets through occlusion into long-term tracks, and to locate objects when detections are missing. On KITTI, our reconstruction-based tracking reduces the number of ID switches of the initial tracklets by more than 50%, and outperforms all previous approaches for both bounding box and segmentation tracking.

» Show BibTeX

title={Track to Reconstruct and Reconstruct to Track},
author={Luiten, Jonathon and Fischer, Tobias and Leibe, Bastian},
journal={IEEE Robotics and Automation Letters},

UnOVOST: Unsupervised Offline Video Object Segmentation and Tracking

Jonathon Luiten, Idil Esen Zulfikar, Bastian Leibe
WACV 2020

We address Unsupervised Video Object Segmentation (UVOS), the task of automatically generating accurate pixel masks for salient objects in a video sequence and of tracking these objects consistently through time, without any input about which objects should be tracked. Towards solving this task, we present UnOVOST (Unsupervised Offline Video Object Segmentation and Tracking) as a simple and generic algorithm which is able to track and segment a large variety of objects. This algorithm builds up tracks in a number stages, first grouping segments into short tracklets that are spatio-temporally consistent, before merging these tracklets into long-term consistent object tracks based on their visual similarity. In order to achieve this we introduce a novel tracklet-based Forest Path Cutting data association algorithm which builds up a decision forest of track hypotheses before cutting this forest into paths that form long-term consistent object tracks. When evaluating our approach on the DAVIS 2017 Unsupervised dataset we obtain state-of-the-art performance with a mean J &F score of 67.9% on the val, 58% on the test-dev and 56.4% on the test-challenge benchmarks, obtaining first place in the DAVIS 2019 Unsupervised Video Object Segmentation Challenge. UnOVOST even performs competitively with many semi-supervised video object segmentation algorithms even though it is not given any input as to which objects should be tracked and segmented.

» Show BibTeX

title={UnOVOST: Unsupervised Offline Video Object Segmentation and Tracking},
author={Luiten, Jonathon and Zulfikar, Idil Esen and Leibe, Bastian},
booktitle={Proceedings of the IEEE Winter Conference on Applications in Computer Vision},

Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation

István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe
IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020, to appear

Heatmap representations have formed the basis of 2D human pose estimation systems for many years, but their generalizations for 3D pose have only recently been considered. This includes 2.5D volumetric heatmaps, whose X and Y axes correspond to image space and the Z axis to metric depth around the subject. To obtain metric-scale predictions, these methods must include a separate, explicit post-processing step to resolve scale ambiguity. Further, they cannot encode body joint positions outside of the image boundaries, leading to incomplete pose estimates in case of image truncation. We address these limitations by proposing metric-scale truncation-robust (MeTRo) volumetric heatmaps, whose dimensions are defined in metric 3D space near the subject, instead of being aligned with image space. We train a fully-convolutional network to estimate such heatmaps from monocular RGB in an end-to-end manner. This reinterpretation of the heatmap dimensions allows us to estimate complete metric-scale poses without test-time knowledge of the focal length or person distance and without relying on anthropometric heuristics in post-processing. Furthermore, as the image space is decoupled from the heatmap space, the network can learn to reason about joints beyond the image boundary. Using ResNet-50 without any additional learned layers, we obtain state-of-the-art results on the Human3.6M and MPI-INF-3DHP benchmarks. As our method is simple and fast, it can become a useful component for real-time top-down multi-person pose estimation systems. We make our code publicly available to facilitate further research.

» Show BibTeX

title={Metric-Scale Truncation-Robust Heatmaps for 3{D} Human Pose Estimation},
author={S\'ar\'andi, Istv\'an and Linder, Timm and Arras, Kai O. and Leibe, Bastian},
booktitle={Automatic Face and Gesture Recognition, 2020 IEEE Int. Conf. on},
note={in press}

Fast and Robust QEF Minimization using Probabilistic Quadrics

Philip Trettner, Leif Kobbelt
Computer Graphics Forum (Proc. EUROGRAPHICS 2020)

Error quadrics are a fundamental and powerful building block in many geometry processing algorithms. However, finding the minimizer of a given quadric is in many cases not robust and requires a singular value decomposition or some ad-hoc regularization. While classical error quadrics measure the squared deviation from a set of ground truth planes or polygons, we treat the input data as genuinely uncertain information and embed error quadrics in a probabilistic setting ("probabilistic quadrics") where the optimal point minimizes the expected squared error. We derive closed form solutions for the popular plane and triangle quadrics subject to (spatially varying, anisotropic) Gaussian noise. Probabilistic quadrics can be minimized robustly by solving a simple linear system - 50x faster than SVD. We show that probabilistic quadrics have superior properties in tasks like decimation and isosurface extraction since they favor more uniform triangulations and are more tolerant to noise while still maintaining feature sensitivity. A broad spectrum of applications can directly benefit from our new quadrics as a drop-in replacement which we demonstrate with mesh smoothing via filtered quadrics and non-linear subdivision surfaces.

High-Fidelity Point-Based Rendering of Large-Scale 3D Scan Datasets

Patric Schmitz, Timothy Blut, Christian Mattes, Leif Kobbelt
IEEE Computer Graphics and Applications

Digitalization of 3D objects and scenes using modern depth sensors and high-resolution RGB cameras enables the preservation of human cultural artifacts at an unprecedented level of detail. Interactive visualization of these large datasets, however, is challenging without degradation in visual fidelity. A common solution is to fit the dataset into available video memory by downsampling and compression. The achievable reproduction accuracy is thereby limited for interactive scenarios, such as immersive exploration in Virtual Reality (VR). This degradation in visual realism ultimately hinders the effective communication of human cultural knowledge. This article presents a method to render 3D scan datasets with minimal loss of visual fidelity. A point-based rendering approach visualizes scan data as a dense splat cloud. For improved surface approximation of thin and sparsely sampled objects, we propose oriented 3D ellipsoids as rendering primitives. To render massive texture datasets, we present a virtual texturing system that dynamically loads required image data. It is paired with a single-pass page prediction method that minimizes visible texturing artifacts. Our system renders a challenging dataset in the order of 70 million points and a texture size of 1.2 terabytes consistently at 90 frames per second in stereoscopic VR.

Towards a Graphical User Interface for Exploring and Fine-Tuning Crowd Simulations

Andrea Bönsch, Marcel Jonda, Jonathan Ehret, Torsten Wolfgang Kuhlen
To be presented at: IEEE Virtual Humans and Crowds for Immersive Environments (VHCIE), 2020

Simulating a realistic navigation of virtual pedestrians through virtual environments is a recurring subject of investigations. The various mathematical approaches used to compute the pedestrians’ paths result, i.a., in different computation-times and varying path characteristics. Customizable parameters, e.g., maximal walking speed or minimal interpersonal distance, add another level of complexity. Thus, choosing the best-fitting approach for a given environment and use-case is non-trivial, especially for novice users.

To facilitate the informed choice of a specific algorithm with a certain parameter set, crowd simulation frameworks such as Menge provide an extendable collection of approaches with a unified interface for usage. However, they often miss an elaborated visualization with high informative value accompanied by visual analysis methods to explore the complete simulation data in more detail – which is yet required for an informed choice. Benchmarking suites such as SteerBench are a helpful approach as they objectively analyze crowd simulations, however they are too tailored to specific behavior details. To this end, we propose a preliminary design of an advanced graphical user interface providing a 2D and 3D visualization of the crowd simulation data as well as features for time navigation and an overall data exploration.

Joint Dual-Tasking in VR: Outlining the Behavioral Design of Interactive Human Companions Who Walk and Talk with a User

Andrea Bönsch, Torsten Wolfgang Kuhlen
To be presented at: IEEE Virtual Humans and Crowds for Immersive Environments (VHCIE), 2020

To resemble realistic and lively places, virtual environments are increasingly often enriched by virtual populations consisting of computer-controlled, human-like virtual agents. While the applications often provide limited user-agent interaction based on, e.g., collision avoidance or mutual gaze, complex user-agent dynamics such as joint locomotion combined with a secondary task, e.g., conversing, are rarely considered yet. These dual-tasking situations, however, are beneficial for various use-cases: guided tours and social simulations will become more realistic and engaging if a user is able to traverse a scene as a member of a social group, while platforms to study crowd and walking behavior will become more powerful and informative. To this end, this presentation deals with different areas of interaction dynamics, which need to be combined for modeling dual-tasking with virtual agents. Areas covered are kinematic parameters for the navigation behavior, group shapes in static and mobile situations as well as verbal and non-verbal behavior for conversations.

Single-Shot Panoptic Segmentation

Mark Weber, Jonathon Luiten, Bastian Leibe

We present a novel end-to-end single-shot method that segments countable object instances (things) as well as background regions (stuff) into a non-overlapping panoptic segmentation at almost video frame rate. Current state-of-the-art methods are far from reaching video frame rate and mostly rely on merging instance segmentation with semantic background segmentation. Our approach relaxes this requirement by using an object detector but is still able to resolve inter- and intra-class overlaps to achieve a non-overlapping segmentation. On top of a shared encoder-decoder backbone, we utilize multiple branches for semantic segmentation, object detection, and instance center prediction. Finally, our panoptic head combines all outputs into a panoptic segmentation and can even handle conflicting predictions between branches as well as certain false predictions. Our network achieves 32.6% PQ on MS-COCO at 21.8 FPS, opening up panoptic segmentation to a broader field of applications.

» Show BibTeX

title={Single-Shot Panoptic Segmentation},
author={Weber, Mark and Luiten, Jonathon and Leibe, Bastian},
journal={arXiv preprint arXiv:1911.00764},

Previous Year (2019)
Disclaimer Home Visual Computing institute RWTH Aachen University