I am a PhD student in the direct doctorate program at the Computer Vision and Learning Group (VLG) at ETH Zürich, supervised by Prof. Marc Pollefeys and Prof. Siyu Tang. Prior to this, I obtained my Bachelor's degree in Computer Science from Tsinghua University in 2020. I have had the pleasure of collaborating with Prof. Stelian Coros and Prof. Otmar Hilliges (ETH Zürich), and Prof. Jeannette Bohg (Stanford).
My research focuses on embodied human motion synthesis, egocentric vision, and multimodal foundation models. My doctoral research is supported by the Microsoft Spatial AI Zurich Lab PhD scholarship.
I am actively seeking research internship opportunities for 2025/2026. If you have openings that align with my skills and interests, please feel free to reach out!
Email / Google Scholar / LinkedIn / GitHub
@article{li2025egom2p,
title={EgoM2P: Egocentric Multimodal Multitask Pretraining},
author={Li, Gen and Chen, Yutong and Wu, Yiqian and Zhao, Kaifeng and Pollefeys, Marc and Tang, Siyu},
journal={arXiv preprint arXiv:2506.07886},
year={2025}
}
EgoM2P: A large-scale egocentric multimodal and multitask model, pretrained on eight extensive egocentric datasets totaling 400 billion multimodal tokens. It incorporates four modalities—RGB and depth video, gaze dynamics, and camera trajectories. EgoM2P matches or outperforms SOTA specialist models in challenging tasks including monocular egocentric depth estimation, camera tracking, gaze estimation, and conditional egocentric video synthesis, while being an order of magnitude faster.
@inproceedings{Zhao:DartControl:2025,
title = {{DartControl}: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control},
author = {Zhao, Kaifeng and Li, Gen and Tang, Siyu},
booktitle = {The Thirteenth International Conference on Learning Representations (ICLR)},
year = {2025}
}
DartControl achieves high-quality and efficient (> 300 frames per second) motion generation conditioned on online streams of text prompts. Furthermore, by integrating latent space optimization and reinforcement learning-based controls, DartControl enables various motion generation applications with spatial constraints and goals, including motion in-between, waypoint goal reaching, and human-scene interaction generation.
@InProceedings{Li_2024_CVPR,
author = {Li, Gen and Zhao, Kaifeng and Zhang, Siwei and Lyu, Xiaozhong and Dusmanu, Mihai and Zhang, Yan and Pollefeys, Marc and Tang, Siyu},
title = {EgoGen: An Egocentric Synthetic Data Generator},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {14497-14509}
}
EgoGen is a new synthetic data generator that produces accurate and rich ground-truth training data for egocentric perception tasks. It closes the loop for embodied perception and action by synthesizing realistic embodied human movements in 3D scenes and simulating camera rigs for head-mounted devices, rendering egocentric views with various sensors.
@inproceedings{wang2023trtm,
title={TRTM: Template-based Reconstruction and Target-oriented Manipulation of Crumpled Cloths},
author={Wang, Wenbo and Li, Gen and Zamora, Miguel and Coros, Stelian},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2024}
}
Precise reconstruction and manipulation of crumpled cloths are challenging due to the high dimensionality of deformable objects and the limited observability of self-occluded regions. We use graph neural networks to reconstruct crumpled cloths from their top-view depth images; combined with our proposed sim-real registration protocols, this enables efficient dual-arm and single-arm target-oriented manipulation.
@inproceedings{wang2022reconstruction,
title={Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors},
author={Wang, Xi and Li, Gen and Kuo, Yen-Ling and Kocabas, Muhammed and Aksan, Emre and Hilliges, Otmar},
booktitle={International Conference on 3D Vision (3DV)},
year={2022}
}
We propose action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3) and applies it to 3D reasoning about human-object interactions.
@inproceedings{yan2020TMP,
title={Learning Topological Motion Primitives for Knot Planning},
author={Yan, Mengyuan and Li, Gen and Zhu, Yilin and Bohg, Jeannette},
booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year={2020}
}
We propose a hierarchical approach where the top layer generates a topological plan, and the bottom layer converts it into robot motion using learned motion primitives. These primitives adapt to rope geometry using observed configurations and are trained via human demonstrations and reinforcement learning. To generalize from simple to complex knots, the neural network leverages shared motion strategies across topological actions.
Template adapted from Siwei Zhang's website.