Publications
Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats

Deep learning-based 3D human pose estimation performs best when trained on large amounts of labeled data, making combined learning from many datasets an important research direction. One obstacle to this endeavor is the different skeleton formats provided by different datasets, i.e., they do not label the same set of anatomical landmarks. There is little prior research on how to best supervise one model with such discrepant labels. We show that simply using separate output heads for different skeletons results in inconsistent depth estimates and insufficient information sharing across skeletons. As a remedy, we propose a novel affine-combining autoencoder (ACAE) method to perform dimensionality reduction on the number of landmarks. The discovered latent 3D points capture the redundancy among skeletons, enabling enhanced information sharing when used for consistency regularization. Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model, which outperforms prior work on a range of benchmarks, including the challenging 3D Poses in the Wild (3DPW) dataset. Our code and models are available for research purposes.
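
As a rough illustration of the core mechanism (a minimal PyTorch sketch under our own assumptions, not the authors' released code): the encoder and decoder are plain weight matrices whose rows sum to one, so every latent point is an affine combination of the input landmarks. A softmax is used here for simplicity, which restricts the weights to the convex special case.

import torch
import torch.nn as nn

class AffineCombiningAutoencoder(nn.Module):
    # Sketch: reduce J landmarks to L latent 3D points via affine combinations.
    def __init__(self, n_joints: int, n_latent: int):
        super().__init__()
        self.enc_weights = nn.Parameter(0.01 * torch.randn(n_latent, n_joints))
        self.dec_weights = nn.Parameter(0.01 * torch.randn(n_joints, n_latent))

    def encode(self, joints):   # joints: (batch, n_joints, 3)
        # Softmax makes each row sum to 1, i.e., an affine (here: convex) combination.
        return torch.softmax(self.enc_weights, dim=-1) @ joints

    def decode(self, latents):  # latents: (batch, n_latent, 3)
        return torch.softmax(self.dec_weights, dim=-1) @ latents

Consistency regularization then penalizes the difference between each skeleton head's prediction and its reconstruction through the shared latent points.
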
@inproceedings{Sarandi23WACV,
author = {S\'ar\'andi, Istv\'an and Hermans, Alexander and Leibe, Bastian},
title = {Learning {3D} Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2023},
}
Mask3D for 3D Semantic Instance Segmentation

Modern 3D semantic instance segmentation approaches predominantly rely on specialized voting mechanisms followed by carefully designed geometric clustering techniques. Building on the successes of recent Transformer-based methods for object detection and image segmentation, we propose the first Transformer-based approach for 3D semantic instance segmentation. We show that we can leverage generic Transformer building blocks to directly predict instance masks from 3D point clouds. In our model, called Mask3D, each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. Mask3D has several advantages over current state-of-the-art approaches: it neither relies on (1) voting schemes, which require hand-selected geometric properties (such as centers), nor on (2) geometric grouping mechanisms requiring manually tuned hyperparameters (e.g., radii), and it (3) enables a loss that directly optimizes instance masks. Mask3D sets a new state-of-the-art on ScanNet test (+6.2 mAP), S3DIS 6-fold (+10.1 mAP), STPLS3D (+11.2 mAP) and ScanNet200 test (+12.4 mAP).
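
To make the query mechanism concrete, here is a minimal sketch (our own simplification, not the Mask3D code) of how refined instance queries yield all masks in parallel, namely as a dot product between queries and per-point features:

import torch

def predict_masks(point_feats, queries):
    # point_feats: (N, C) per-point features; queries: (Q, C) instance queries,
    # already refined by Transformer decoder layers attending to multi-scale features.
    logits = queries @ point_feats.T   # (Q, N): one mask logit map per query
    return torch.sigmoid(logits)       # soft instance masks, computed in parallel
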
@inproceedings{Schult23ICRA,
title = {{Mask3D for 3D Semantic Instance Segmentation}},
author = {Schult, Jonas and Engelmann, Francis and Hermans, Alexander and Litany, Or and Tang, Siyu and Leibe, Bastian},
booktitle = {{International Conference on Robotics and Automation (ICRA)}},
year = {2023}
}
Neural Implicit Shape Editing Using Boundary Sensitivity

Neural fields are receiving increased attention as a geometric representation due to their ability to compactly store detailed and smooth shapes and easily undergo topological changes. Compared to classic geometry representations, however, neural representations do not allow the user to exert intuitive control over the shape. Motivated by this, we leverage boundary sensitivity to express how perturbations in parameters move the shape boundary. This allows us to interpret the effect of each learnable parameter and study achievable deformations. With this, we perform geometric editing: finding a parameter update that best approximates a globally prescribed deformation. Prescribing the deformation only locally allows the rest of the shape to change according to some prior, such as semantics or deformation rigidity. Our method is agnostic to the model and its training and updates the NN in-place. Furthermore, we show how boundary sensitivity helps to optimize and constrain objectives (such as surface area and volume), which are difficult to compute without first converting to another representation, such as a mesh.
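
The boundary-sensitivity relation the abstract builds on can be stated for a shape given as the zero set \(\{x : f(x;\theta)=0\}\) of a network \(f\); the following first-order derivation is standard and consistent with the abstract, not quoted from the paper:

\[
f(x+\delta x;\,\theta+\delta\theta) \approx f + \nabla_x f \cdot \delta x + \frac{\partial f}{\partial \theta}\,\delta\theta = 0
\;\;\Longrightarrow\;\;
\delta x \cdot n = -\frac{1}{\lVert \nabla_x f \rVert}\,\frac{\partial f}{\partial \theta}\,\delta\theta,
\qquad n = \frac{\nabla_x f}{\lVert \nabla_x f \rVert}.
\]

Geometric editing then amounts to a least-squares problem for the parameter update \(\delta\theta\) whose induced normal displacement best matches the prescribed deformation.
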
@misc{berzins2023neural,
title={Neural Implicit Shape Editing using Boundary Sensitivity},
author={Arturs Berzins and Moritz Ibing and Leif Kobbelt},
year={2023},
eprint={2304.12951},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Surface Maps via Adaptive Triangulations

We present a new method to compute continuous and bijective maps (surface homeomorphisms) between two or more genus-0 triangle meshes. In contrast to previous approaches, we decouple the resolution at which a map is represented from the resolution of the input meshes. We discretize maps via common triangulations that approximate the input meshes while remaining in bijective correspondence to them. Both the geometry and the connectivity of these triangulations are optimized with respect to a single objective function that simultaneously controls mapping distortion, triangulation quality, and approximation error. A discrete-continuous optimization algorithm performs both energy-based remeshing as well as global second-order optimization of vertex positions, parametrized via the sphere. With this, we combine the disciplines of compatible remeshing and surface map optimization in a unified formulation and make a contribution in both fields. While existing compatible remeshing algorithms often operate on a fixed pre-computed surface map, we can now globally update this correspondence during remeshing. On the other hand, bijective surface-to-surface map optimization previously required computing costly overlay meshes that are inherently tied to the input mesh resolution. We achieve significant complexity reduction by instead assessing distortion between the approximating triangulations. This new map representation is inherently more robust than previous overlay-based approaches, is less intricate to implement, and naturally supports mapping between more than two surfaces. Moreover, it enables adaptive multi-resolution schemes that, e.g., first align corresponding surface regions at coarse resolutions before refining the map where needed. We demonstrate significant speedups and increased flexibility over state-of-the-art mapping algorithms at similar map quality, and also provide a reference implementation of the method.
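
The single objective mentioned above can be pictured as a weighted sum; the following form is our own schematic, and the paper's exact terms and weights may differ:

\[
E(T) \;=\; E_{\text{distortion}}(T) \;+\; \lambda_1\, E_{\text{quality}}(T) \;+\; \lambda_2\, E_{\text{approx}}(T),
\]

where \(T\) is the common triangulation, the first term measures mapping distortion between the approximating triangulations, the second penalizes degenerate triangles, and the third the deviation from the input meshes.
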
@article{schmidt2023surface,
title={Surface Maps via Adaptive Triangulations},
author={Schmidt, Patrick and Pieper, D\"orte and Kobbelt, Leif},
year={2023},
journal={Computer Graphics Forum},
volume={42},
number={2},
}
Effect of Head-Mounted Displays on Students’ Acquisition of Surgical Suturing Techniques Compared to an E-Learning and Tutor-Led Course: A Randomized Controlled Trial

Background: Although surgical suturing is one of the most important basic skills, many medical school graduates do not acquire sufficient knowledge of it due to its lack of integration into the curriculum or a shortage of tutors. E-learning approaches attempt to address this issue but still rely on the involvement of tutors. Furthermore, the learning experience and visual-spatial ability appear to play a critical role in surgical skill acquisition. Virtual reality head-mounted displays (HMDs) could address this, but the benefits of immersive and stereoscopic learning of surgical suturing techniques are still unclear.
Material and Methods: In this multi-arm randomized controlled trial, 150 novices participated. Three teaching modalities were compared: a self-directed e-learning course (monoscopic), a self-directed HMD-based course (stereoscopic, immersive), and a tutor-led course with feedback. Suturing performance was recorded by video camera both before and after course participation (>26 hours of video material) and assessed in a blinded fashion using the OSATS Global Rating Score (GRS). Furthermore, the optical flow of the videos was determined using an algorithm. The number of sutures performed was counted, visual-spatial ability was measured with the mental rotation test (MRT), and courses were assessed with questionnaires.
Results: Students' self-assessment in the HMD-based course was comparable to that of the tutor-led course and significantly better than in the e-learning course (P=0.003). Course suitability was rated best for the tutor-led course (x=4.8), followed by the HMD-based (x=3.6) and e-learning (x=2.5) courses. The median GRS between courses was comparable (P=0.15) at 12.4 (95% CI 10.0–12.7) for the e-learning course, 14.1 (95% CI 13.0–15.0) for the HMD-based course, and 12.7 (95% CI 10.3–14.2) for the tutor-led course. However, the GRS was significantly correlated with the number of sutures performed during the training session (P=0.002), but not with visual-spatial ability (P=0.626). Optical flow (R2=0.15, P<0.001) and the number of sutures performed (R2=0.73, P<0.001) can be used as additional measures to the GRS.
Conclusion: The use of HMDs with stereoscopic and immersive video provides advantages in the learning experience and should be preferred over a traditional web application for e-learning. Contrary to expectations, feedback is not necessary for novices to achieve a sufficient level in suturing; only the number of surgical sutures performed during training is a good determinant of competence improvement. Nevertheless, feedback still enhances the learning experience. Therefore, automated assessment as an alternative feedback approach could further improve self-directed learning modalities. As a next step, the data from this study could be used to develop such automated AI-based assessments.
@Article{Peters2023,
author = {Philipp Peters and Martin Lemos and Andrea Bönsch and Mark Ooms and Max Ulbrich and Ashkan Rashad and Felix Krause and Myriam Lipprandt and Torsten Wolfgang Kuhlen and Rainer Röhrig and Frank Hölzle and Behrus Puladi},
journal = {International Journal of Surgery},
title = {Effect of head-mounted displays on students' acquisition of surgical suturing techniques compared to an e-learning and tutor-led course: A randomized controlled trial},
year = {2023},
month = {may},
volume = {Publish Ahead of Print},
doi = {10.1097/js9.0000000000000464},
publisher = {Ovid Technologies (Wolters Kluwer Health)},
}
DasherVR: Evaluating a Predictive Text Entry System in Immersive Virtual Reality

Inputting text fluently in virtual reality remains an active research topic, since many previously presented solutions have drawbacks in either speed, error rate, privacy or accessibility. To address these drawbacks, in this paper we adapted the predictive text entry system "Dasher" into an immersive virtual environment. Our evaluation with 20 participants shows that Dasher offers a good user experience, with input speeds similar to other virtual text input techniques in the literature while maintaining low error rates. In combination with positive user feedback, we therefore believe that DasherVR is a promising basis for further research on accessible text input in immersive virtual reality.
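
For readers unfamiliar with Dasher, its core principle (closely related to arithmetic coding) is to subdivide the current selection interval into boxes sized by a language model's next-character probabilities; the user steers into a box to commit a character. A minimal sketch under our own assumptions, not the DasherVR implementation:

def partition_interval(low, high, probs):
    # Divide [low, high) into child boxes proportional to character probabilities.
    boxes, start = {}, low
    for char, p in sorted(probs.items()):
        end = start + (high - low) * p
        boxes[char] = (start, end)
        start = end
    return boxes

# Steering into the 'e' box commits 'e'; the scheme then recurses inside that box.
boxes = partition_interval(0.0, 1.0, {'e': 0.5, 't': 0.3, 'a': 0.2})
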
@inproceedings{pape2023,
title = {{{DasherVR}}: {{Evaluating}} a {{Predictive Text Entry System}} in {{Immersive Virtual Reality}}},
booktitle = {Towards an {{Inclusive}} and {{Accessible Metaverse}} at {{CHI}}'23},
author = {Pape, Sebastian and Ackermann, Jan Jakub and Weissker, Tim and Kuhlen, Torsten W.},
year = {2023}
}
A Case Study on Providing Immersive Visualization for Neuronal Network Data Using COTS Soft- and Hardware

Commercial off-the-shelf (COTS) VR hardware and modern game engines create the impression that bringing even complex data into VR has become easy. In this work, we investigate to what extent game engines can support the development of immersive visualization software with a case study. We discuss how the engine can support the development and where it falls short, e.g., failing to provide acceptable rendering performance for medium and large-sized data sets without using more sophisticated features.
@INPROCEEDINGS{10108843,
author={Krüger, Marcel and Li, Qin and Kuhlen, Torsten W. and Gerrits, Tim},
booktitle={2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)},
title={A Case Study on Providing Immersive Visualization for Neuronal Network Data Using COTS Soft- and Hardware},
year={2023},
pages={201-205},
doi={10.1109/VRW58643.2023.00050}}
Enhanced Auditoriums for Attending Talks in Social Virtual Reality

Replicating traditional auditorium layouts for attending talks in social virtual reality often results in poor visibility of the presentation and a reduced feeling of being there together with others. Motivated by the use case of academic conferences, we therefore propose to display miniature representations of the stage close to the viewers for enhanced presentation visibility as well as group table arrangements for enhanced social co-watching. We conducted an initial user study with 12 participants in groups of three to evaluate the influence of these ideas on audience experience. Our results confirm the hypothesized positive effects of both enhancements and show that their combination was particularly appreciated by audience members. Our results therefore strongly encourage us to rethink conventional auditorium layouts in social virtual reality.
Towards Discovering Meaningful Historical Relationships in Virtual Reality

Traditional digital tools for exploring historical data mostly rely on conventional 2D visualizations, which often cannot reveal all relevant interrelationships between historical fragments. We are working on a novel interactive exploration tool for historical data in virtual reality, which arranges fragments in a 3D environment based on their temporal, spatial and categorical proximity to a reference fragment. In this poster, we report on an initial expert review of our approach, giving us valuable insights into the use cases and requirements that inform our further developments.
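
The poster does not spell out the arrangement formula; a hypothetical sketch of combining the three proximities into one score (all names and weights below are our own assumptions) could look as follows:

from dataclasses import dataclass
import math

@dataclass
class Fragment:
    year: float
    x: float          # spatial coordinates, e.g., from a map projection
    y: float
    category: str

def proximity(ref: Fragment, frag: Fragment, w_time=1.0, w_space=1.0, w_cat=1.0):
    # Weighted combination of temporal, spatial, and categorical distance
    # to the reference fragment; smaller means placed closer in the 3D scene.
    d_time = abs(ref.year - frag.year)
    d_space = math.hypot(ref.x - frag.x, ref.y - frag.y)
    d_cat = 0.0 if ref.category == frag.category else 1.0
    return w_time * d_time + w_space * d_space + w_cat * d_cat
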
@INPROCEEDINGS{Derksen2023,
author={Derksen, Melanie and Weissker, Tim and Kuhlen, Torsten and Botsch, Mario},
booktitle={2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)},
title={Towards Discovering Meaningful Historical Relationships in Virtual Reality},
year={2023},
pages={697-698},
doi={10.1109/VRW58643.2023.00191}}
Poster: Insite Pipeline - A Pipeline Enabling In-Transit Processing for Arbor, NEST and TVB

Simulation of neuronal networks has steadily advanced and now allows for larger and more complex models. However, scaling simulations to such sizes comes with issues and challenges. Especially the amount of data produced, as well as the runtime of the simulation, can be limiting. Often, storing all data on disk is impossible, and users might have to wait for a long time until they can process the data. A standard solution in simulation science is to use in-transit approaches. In-transit implementations allow users to access data while the simulation is still running and do parallel processing outside the simulation. This allows for early insights into the results, early stopping of simulations that are not promising, or even steering of the simulations. Existing in-transit solutions, however, are often complex to integrate into the workflow, as they rely on integration into simulators and often use data formats that are complex to handle. This is especially constraining in the context of the multi-disciplinary research conducted in the Human Brain Project (HBP), as such an important feature should be accessible to all users.
To remedy this, we developed Insite, a pipeline that allows easy in-transit access to simulation data of multiscale simulations conducted with TVB, NEST, and Arbor.
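
In practice, in-transit access of this kind boils down to polling the running simulation for data produced so far; the sketch below illustrates the pattern with the requests library, where the endpoint path and JSON layout are hypothetical stand-ins rather than Insite's documented API:

import requests
import time

BASE = "http://localhost:52056"   # hypothetical address of the Insite access node

def poll_spikes(last_time=0.0):
    while True:
        # Ask only for data newer than what we have already processed.
        r = requests.get(f"{BASE}/spikes", params={"fromTime": last_time})
        r.raise_for_status()
        spikes = r.json()
        if spikes:   # process partial results while the simulation is still running
            last_time = max(s["time"] for s in spikes)
            print(f"received {len(spikes)} spikes up to t={last_time}")
        time.sleep(1.0)
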
@misc{kruger_marcel_2023_7849225,
author = {Krüger, Marcel and Gerrits, Tim and Kuhlen, Torsten and Weyers, Benjamin},
title = {{Insite Pipeline - A Pipeline Enabling In-Transit Processing for Arbor, NEST and TVB}},
month = mar,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.7849225},
url = {https://doi.org/10.5281/zenodo.7849225}
}
Gaining the High Ground: Teleportation to Mid-Air Targets in Immersive Virtual Environments

Most prior teleportation techniques in virtual reality are bound to target positions in the vicinity of selectable scene objects. In this paper, we present three adaptations of the classic teleportation metaphor that enable the user to travel to mid-air targets as well. Inspired by related work on the combination of teleports with virtual rotations, our three techniques differ in the extent to which elevation changes are integrated into the conventional target selection process. Elevation can be specified either simultaneously, as a connected second step, or separately from horizontal movements. A user study with 30 participants indicated a trade-off between the simultaneous method leading to the highest accuracy and the two-step method inducing the lowest task load as well as receiving the highest usability ratings. The separate method was least suitable on its own but could serve as a complement to one of the other approaches. Based on these findings and previous research, we define initial design guidelines for mid-air navigation techniques.
Localized Latent Updates for Fine-Tuning Vision-Language Models

Although massive pre-trained vision-language models like CLIP show impressive generalization capabilities for many tasks, it often remains necessary to fine-tune them for improved performance on specific datasets. When doing so, it is desirable that updating the model is fast and that the model does not lose its capabilities on data outside of the dataset, as is often the case with classical fine-tuning approaches. In this work we suggest a lightweight adapter that only updates the model's predictions close to seen datapoints. We demonstrate the effectiveness and speed of this relatively simple approach in the context of few-shot learning, where our results on classes both seen and unseen during training are comparable with or improve on the state of the art.
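
A minimal sketch of one way such a localized update could work (our own construction in the spirit of the abstract, not the paper's adapter): corrections to CLIP's zero-shot logits are gated by a kernel over the distance to stored training embeddings, so predictions far from seen data remain unchanged.

import torch

def adapted_logits(z, clip_logits, keys, values, sigma=0.1):
    # z: (B, D) query embeddings; keys: (N, D) seen training embeddings
    # (both L2-normalized); values: (N, K) learned per-datapoint logit
    # corrections; clip_logits: (B, K) zero-shot predictions.
    sim = z @ keys.T                      # (B, N) cosine similarities
    w = torch.exp((sim - 1.0) / sigma)    # ~1 near seen data, ~0 far away
    return clip_logits + w @ values       # update is localized around seen points
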
@inproceedings{ibing_localized,
author = {Moritz Ibing and Isaak Lim and Leif Kobbelt},
title = {Localized Latent Updates for Fine-Tuning Vision-Language Models},
booktitle = {{IEEE/CVF} Conference on Computer Vision and Pattern Recognition Workshops, {CVPR} Workshops 2023},
publisher = {{IEEE}},
year = {2023},
}
DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer

Most state-of-the-art instance segmentation methods rely on large amounts of pixel-precise ground-truth annotations for training, which are expensive to create. Interactive segmentation networks help generate such annotations based on an image and the corresponding user interactions such as clicks. Existing methods for this task can only process a single instance at a time, and each user interaction requires a full forward pass through the entire deep network. We introduce a more efficient approach, called DynaMITe, in which we represent user interactions as spatio-temporal queries to a Transformer decoder with the potential to segment multiple object instances in a single iteration. Our architecture also alleviates any need to re-compute image features during refinement, and requires fewer interactions for segmenting multiple instances in a single image when compared to other methods. DynaMITe achieves state-of-the-art results on multiple existing interactive segmentation benchmarks, as well as on the new multi-instance benchmark that we propose in this paper.
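
A minimal sketch of the interactive loop (module shapes and interfaces are our own stand-ins for the paper's architecture): backbone features are computed once and cached, each click becomes a query token, and only a lightweight decoder re-runs per refinement round.

import torch
import torch.nn as nn

class ClickDecoder(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, feats):
        # queries: (1, Q, C) click queries; feats: (1, HW, C) cached image features.
        out, _ = self.attn(queries, feats, feats)   # cross-attention only;
        return queries + out                        # the backbone is never re-run

def masks_from_queries(queries, feats):
    # (1, Q, C) x (1, C, HW) -> (1, Q, HW): one soft mask per clicked object.
    return torch.sigmoid(queries @ feats.transpose(1, 2))
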
@article{RanaMahadevan23arxiv,
title={DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer},
author={Rana, Amit and Mahadevan, Sabarinath and Hermans, Alexander and Leibe, Bastian},
journal={arXiv preprint arXiv:2304.06668},
year={2023}
}
Point2Vec for Self-Supervised Representation Learning on Point Clouds

Recently, the self-supervised learning framework data2vec has shown inspiring performance for various modalities using a masked student-teacher approach. However, it remains open whether such a framework generalizes to the unique challenges of 3D point clouds. To answer this question, we extend data2vec to the point cloud domain and report encouraging results on several downstream tasks. In an in-depth analysis, we discover that the leakage of positional information reveals the overall object shape to the student even under heavy masking and thus prevents data2vec from learning strong representations for point clouds. We address this 3D-specific shortcoming by proposing point2vec, which unleashes the full potential of data2vec-like pre-training on point clouds. Our experiments show that point2vec outperforms other self-supervised methods on shape classification and few-shot learning on ModelNet40 and ScanObjectNN, while achieving competitive results on part segmentation on ShapeNetParts. These results suggest that the learned representations are strong and transferable, highlighting point2vec as a promising direction for self-supervised learning of point cloud representations.
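
Two ingredients from the abstract can be made concrete in a short sketch (interfaces are hypothetical, not the released point2vec code): the EMA teacher of data2vec-style training, and the key fix of withholding the positional encodings of masked patches from the student.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    # The teacher tracks an exponential moving average of the student's weights.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def point2vec_loss(student, teacher, decoder, tokens, pos, mask):
    # tokens, pos: (B, T, C) point-patch embeddings and positional encodings;
    # mask: (T,) bool, True = masked (shared across the batch for simplicity).
    with torch.no_grad():
        targets = teacher(tokens + pos)                  # teacher sees all patches
    visible = student(tokens[:, ~mask] + pos[:, ~mask])  # no masked positions leak in
    preds = decoder(visible, pos, mask)                  # (B, T, C); decoder re-inserts positions
    return F.smooth_l1_loss(preds[:, mask], targets[:, mask])
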
@article{abouzeid2023point2vec,
title={Point2Vec for Self-Supervised Representation Learning on Point Clouds},
author={Abou Zeid, Karim and Schult, Jonas and Hermans, Alexander and Leibe, Bastian},
journal={arXiv preprint arXiv:2303.16570},
year={2023},
}
Modeling the Droplet Impact on the Substrate with Surface Preparation in Thermal Spraying with SPH

The properties of thermally sprayed coatings depend heavily on their microstructure. The microstructure is determined by the dynamics of the impact of the droplets on the substrate surface and the subsequent overlapping of the previously solidified and deformed droplets. Substrate preparation prior to spraying ensures strong adhesion of the coating. This includes roughening and preheating of the substrate surface. In the present study, the smoothed particle hydrodynamics (SPH) method is used to model the Al2O3 impact on a preheated substrate and a roughened substrate surface. A semi-implicit enthalpy–porosity method is applied to simulate the solidification process in the mushy zone. In addition, an implicit correction for SPH simulations is used to improve the performance and stability of the simulation. To investigate the dynamics of heat transfer in the contact between the surface and the droplet, the discretization of the substrate is also taken into account. The results show that the studied substrate surface conditions affect the splat morphology and the solidification process. Subsequently, the simulation of multiple droplets for coating formation is also performed and analyzed.
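
The enthalpy-porosity treatment mentioned above damps the velocity in the mushy zone with a Carman-Kozeny-type momentum sink; a common textbook form (the paper's exact coefficients and semi-implicit discretization may differ) is

\[
\mathbf{S} \;=\; -\,C\,\frac{(1-f_l)^2}{f_l^3 + \epsilon}\,\mathbf{v},
\]

where \(f_l \in [0,1]\) is the liquid fraction, \(C\) a mushy-zone constant, and \(\epsilon\) a small constant avoiding division by zero; adding \(\mathbf{S}\) to the momentum equation forces the velocity toward zero as the material solidifies.
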
@Article{BHJ+23,
author = {Kirsten Bobzin and Hendrik Heinemann and Kevin Jasutyn and Stefan Rhys Jeske and Jan Bender and Sergej Warkentin and Oleg Mokrov and Rahul Sharma and Uwe Reisgen},
journal = {Journal of Thermal Spray Technology},
title = {Modeling the Droplet Impact on the Substrate with Surface Preparation in Thermal Spraying with SPH},
year = {2023},
month = {jan},
doi = {10.1007/s11666-023-01534-0},
}