Gen Li

I am a PhD student in the direct doctorate program at the Computer Vision and Learning Group (VLG) at ETH Zürich, supervised by Prof. Marc Pollefeys and Prof. Siyu Tang. Prior to this, I obtained my Bachelor's degree in Computer Science from Tsinghua University in 2020. I have had the pleasure of collaborating with Prof. Stelian Coros and Prof. Otmar Hilliges (ETH Zürich), and Prof. Jeannette Bohg (Stanford).

My research focuses on embodied human motion synthesis, egocentric vision, and multimodal foundation models. My doctoral research is supported by the Microsoft Spatial AI Zurich Lab PhD scholarship.

I am actively seeking research internship opportunities for 2025/2026. If you have openings that align with my skills and interests, please feel free to reach out!

Email / Google Scholar / LinkedIn / GitHub


Publications


EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang
Under submission

EgoM2P is a large-scale egocentric multimodal and multitask model, pretrained on eight extensive egocentric datasets totaling 400 billion multimodal tokens. It incorporates four modalities: RGB and depth video, gaze dynamics, and camera trajectories. EgoM2P matches or outperforms state-of-the-art specialist models on challenging tasks including monocular egocentric depth estimation, camera tracking, gaze estimation, and conditional egocentric video synthesis, while being an order of magnitude faster.

DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control
Kaifeng Zhao, Gen Li, Siyu Tang
ICLR 2025, Spotlight (Top 3.26%)

DartControl achieves high-quality and efficient (> 300 frames per second) motion generation conditioned on online streams of text prompts. Furthermore, by integrating latent-space optimization and reinforcement-learning-based control, DartControl enables various motion generation applications with spatial constraints and goals, including motion in-betweening, waypoint goal reaching, and human-scene interaction generation.

EgoGen: An Egocentric Synthetic Data Generator
Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang
CVPR 2024, Oral Presentation (Top 0.78%)

EgoGen is a new synthetic data generator that produces accurate and rich ground-truth training data for egocentric perception tasks. It closes the loop between embodied perception and action by synthesizing realistic embodied human movements in 3D scenes and simulating camera rigs of head-mounted devices to render egocentric views with various sensors.

TRTM: Template-based Reconstruction and Target-oriented Manipulation of Crumpled Cloths
Wenbo Wang, Gen Li, Miguel Zamora, Stelian Coros
ICRA 2024

Precise reconstruction and manipulation of crumpled cloths is challenging due to the high dimensionality of deformable objects and the limited observations of self-occluded regions. We use graph neural networks to reconstruct crumpled cloths from top-view depth images and, together with our proposed sim-real registration protocols, enable efficient dual-arm and single-arm target-oriented manipulation.

Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors
Xi Wang*, Gen Li*, Yen-Ling Kuo, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges
3DV 2022 (* denotes equal contribution)

We propose an action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3) and applies it to 3D reasoning about human-object interactions.

Learning Topological Motion Primitives for Knot Planning
Mengyuan Yan, Gen Li, Yilin Zhu, and Jeannette Bohg
IROS 2020

We propose a hierarchical approach where the top layer generates a topological plan, and the bottom layer converts it into robot motion using learned motion primitives. These primitives adapt to rope geometry using observed configurations and are trained via human demonstrations and reinforcement learning. To generalize from simple to complex knots, the neural network leverages shared motion strategies across topological actions.

Academic Service

Teaching


Template adapted from Siwei Zhang's website.