Zhiyuan Gao

PhD Student

Department of Computer Science

Viterbi School of Engineering, USC

Email: gaozhiyu [AT] usc.edu

Research Topics

  • Differentiable Simulation
  • Real-to-Sim
  • High-Fidelity Digital Twins and Simulation
  • Robotics, Computer Graphics, 3D Computer Vision

I am a PhD student in Computer Science at the University of Southern California, advised by Prof. Jernej Barbic. I completed my Master's degree in Computer Science at USC, advised by Prof. Yue Wang and Prof. Jernej Barbic. I have also interned at the Vision & Graphics Lab, where I was fortunate to work with Prof. Yajie Zhao.

News

  • [2026-01] One paper is accepted to ICLR 2026!
  • [2025-10] One paper is accepted to NeurIPS 2025!
  • [2025-07] One paper is accepted to ICCV 2025!
  • [2024-12] One paper is accepted to 3DV 2025 as an Oral Presentation!
  • [2024-09] One paper is accepted to WACV 2025!

Publications

D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping

Haozhe Lou, Mingtong Zhang, Haoran Geng, Hanyang Zhou, Sicheng He, Zhiyuan Gao, Siheng Zhao, Jiageng Mao, Pieter Abbeel, Jitendra Malik, Daniel Seita, Yue Wang
ICLR 2026

A differentiable real-to-sim-to-real engine that uses Gaussian Splat representations for mass identification and for learning force-aware dexterous grasping policies.

Abstract

Simulation provides a cost-effective and flexible platform for data generation and policy learning in robotic systems. However, bridging the gap between simulation and real-world dynamics remains a significant challenge, especially in physical parameter identification. In this work, we introduce a real-to-sim-to-real engine that leverages Gaussian Splat representations to build a differentiable engine, enabling object mass identification from real-world visual observations and robot control signals while simultaneously enabling grasping policy learning. By optimizing the mass of the manipulated object, our method automatically builds high-fidelity, physically accurate digital twins. Additionally, we propose a novel approach to train force-aware grasping policies from limited data by transferring feasible human demonstrations into simulated robot demonstrations. Through comprehensive experiments, we demonstrate that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values. These optimized mass values facilitate force-aware policy learning, achieving superior performance in object grasping and effectively reducing the sim-to-real gap.

Seeing the Wind from a Falling Leaf

Zhiyuan Gao*, Jiageng Mao*, Hong-Xing Yu, Haozhe Lou, Emily Yue-ting Jia, Jernej Barbic, Jiajun Wu, Yue Wang (*Equal contribution)
NeurIPS 2025

Differentiable pipeline that recovers wind forces from leaf videos and enables physics-consistent video editing and wind manipulation.

Abstract

A longstanding goal in computer vision is to model motions from videos, while the representations behind motions, i.e. the invisible physical interactions that cause objects to deform and move, remain largely unexplored. In this work, we present an end-to-end differentiable inverse graphics framework, which jointly models object geometry, physical properties, and interactions directly from videos. By backpropagating through physics simulations, we can recover force representations from object movements. We validate our approach on both synthetic and real-world scenarios, demonstrating the ability to estimate plausible force fields—such as wind patterns affecting a falling leaf. Our method shows promise for physics-based video generation and editing, bridging computer vision with physics by understanding the physical processes underlying visual data.

Learning an Implicit Physical Model for Image-based Fluid Simulation

Emily Yue-ting Jia, Jiageng Mao, Zhiyuan Gao, Yajie Zhao, Yue Wang
ICCV 2025

Physics-informed model that turns one fluid image into a 4D, multi-view animation using learned implicit motion fields and Navier–Stokes priors.

Abstract

Humans possess an exceptional ability to imagine 4D scenes, encompassing both motion and 3D geometry, from a single still image. This ability is rooted in our accumulated observations of similar scenes and an intuitive understanding of physics. In this paper, we aim to replicate this capacity in neural networks, specifically focusing on natural fluid imagery. Existing methods for this task typically employ simplistic 2D motion estimators to animate the image, leading to motion predictions that often defy physical principles, resulting in unrealistic animations. Our approach introduces a novel method for generating 4D scenes with physics-consistent animation from a single image. We propose the use of a physics-informed neural network that predicts motion for each surface point, guided by a loss term derived from fundamental physical principles, including the Navier-Stokes equations. To capture appearance, we predict feature-based 3D Gaussians from the input image and its estimated depth, which are then animated using the predicted motions and rendered from any desired camera perspective. Experimental results highlight the effectiveness of our method in producing physically plausible animations, showcasing significant performance improvements over existing methods.

Geometry-aware Feature Matching for Large-Scale Structure from Motion

Gonglin Chen, Jinsen Wu, Haiwei Chen, Wenbin Teng, Zhiyuan Gao, Andrew Feng, Rongjun Qin, Yajie Zhao
3DV 2025 (Oral Presentation)

A novel geometry-aware feature matching approach that combines detector-based and detector-free methods to improve correspondence accuracy and density for large-scale Structure from Motion.

Abstract

Establishing consistent and dense correspondences across multiple images is crucial for Structure from Motion (SfM) systems. Significant view changes, such as air-to-ground with very sparse view overlap, pose an even greater challenge to the correspondence solvers. We present a novel optimization-based approach that significantly enhances existing feature matching methods by introducing geometry cues in addition to color cues. This helps fill gaps when there is less overlap in large-scale scenarios. Our method formulates geometric verification as an optimization problem, guiding feature matching within detector-free methods and using sparse correspondences from detector-based methods as anchor points. By enforcing geometric constraints via the Sampson Distance, our approach ensures that the denser correspondences from detector-free methods are geometrically consistent and more accurate. This hybrid strategy significantly improves correspondence density and accuracy, mitigates multi-view inconsistencies, and leads to notable advancements in camera pose accuracy and point cloud density. It outperforms state-of-the-art feature matching methods on benchmark datasets and enables feature matching in challenging extreme large-scale settings.

Volume Rendering of Human Hand Anatomy

Jingtao Huang, Bohan Wang, Zhiyuan Gao, Mianlun Zheng, George Matcuk, Jernej Barbic
arXiv

We propose novel transfer functions for volumetric rendering of hand MRI data that improve visualization of complex hand anatomy while maintaining fine control over tissue appearance.

Abstract

We study the design of transfer functions for volumetric rendering of magnetic resonance imaging (MRI) datasets of human hands. Human hands are anatomically complex, containing various organs within a limited space, which presents challenges for volumetric rendering. We focus on hand musculoskeletal organs because they are volumetrically the largest inside the hand, and most important for the hand's main function, namely manipulation of objects. While volumetric rendering is a mature field, the choice of the transfer function for the different organs is arguably just as important as the choice of the specific volume rendering algorithm; we demonstrate that it significantly influences the clarity and interpretability of the resulting images. We assume that the hand MRI scans have already been segmented into the different organs (bones, muscles, tendons, ligaments, subcutaneous fat, etc.). Our method uses the hand MRI volume data, and the geometry of its inner organs and their known segmentation, to produce high-quality volume rendering images of the hand, and permits fine control over the appearance of each tissue. We contribute two families of transfer functions to emphasize different hand tissues of interest, while preserving the visual context of the hand. We also discuss and reduce artifacts present in standard volume ray-casting of human hands. We evaluate our volumetric rendering on five challenging hand motion sequences. Our experimental results demonstrate that our method improves hand anatomy visualization, compared to standard surface and volume rendering techniques.

Skyeyes: Ground Roaming using Aerial View Images

Zhiyuan Gao*, Wenbin Teng*, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, Yajie Zhao (*Equal contribution)
WACV 2025

Skyeyes is a framework that can generate photorealistic sequences of ground view images using only aerial view inputs, thereby creating a ground roaming experience.

Abstract

Integrating aerial imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground view images using only aerial view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view consistent generation model, which ensures coherence between generated images. This method allows for the creation of geometrically consistent ground view images, even with large view gaps. The images maintain improved spatial-temporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground view imagery. Therefore, we build a large, synthetic, and geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset display superior results compared to other leading synthesis approaches.

Understanding street-level urban vibrancy via spatial-temporal Wi-Fi data analytics: Case LivingLine Shanghai

Yan Zhang, Chengliang Li, Jiajie Li, Zhiyuan Gao, Tianyu Su, Can Wang, Hexin Zhang, Teng Ma, Yang Liu, Weiting Xiong, Ronan Doorley, Luis Alonso, Yongqi Lou, Kent Larson
Environment and Planning B: Urban Analytics and City Science

Using Wi-Fi data analytics and machine learning to quantify urban vibrancy at the street level, demonstrated through a case study of urban interventions in Shanghai while preserving user privacy.

Abstract

Urban vibrancy is a topic of great concern in urban design and planning. However, the definition and measurement of urban vibrancy have not been consistently and clearly established. With the development of technologies such as big data and machine learning, urban planners have adopted new methods that enable better quantitative evaluation of urban performance. This research attempts to quantify the impact on urban vibrancy of the urban interventions introduced by the LivingLine project, a residential neighborhood renovation in Siping Street, Shanghai. We use Wi-Fi probes to process collected mobile phone data and segment people into different categories according to commuting-pattern analysis. We use a pre-trained random forest model to determine the specific location of each person. Subsequently, we analyze people's behavior patterns through stay-point detection and trajectory analysis. Applying multi-linear regression in statistical models, we find that urban interventions (well-curated, defined lab events deployed in the street) and people's behavior are positively correlated, which helps us demonstrate the impact of urban interventions on street dynamics. The research proposes a novel, evidence-based, low-cost methodology for studying granular behavior patterns at the street level without compromising users' data privacy.

Refined self-attention mechanism based real-time structural response prediction method under seismic action

Shiqiao Meng, Ying Zhou, Zhiyuan Gao
Engineering Applications of Artificial Intelligence

SeisFormer, a self-attention-based model for real-time structural response prediction of large-scale structures under seismic action, achieving high accuracy and efficiency even with limited training data.

Abstract

Accurate prediction of structural response under earthquake excitation is of great significance for structural damage and performance evaluation. To improve the efficiency of structural time-history response prediction, this paper proposes a novel SeisFormer model based on the self-attention mechanism and deep learning technology. Through autoregressive prediction, SeisFormer achieves real-time prediction of the response time histories of a large number of nodes in a structure under seismic action and effectively addresses the problem of data scarcity. Four case studies verify the accuracy and efficiency of the proposed methodology, including validation on datasets obtained from elastoplastic seismic analysis of a single-story structure, a three-story structure, and an eleven-story structure, as well as measured data from a shaking table test model. In addition, this paper further studies the prediction accuracy of SeisFormer through ablation and comparative experiments. The experimental results show that SeisFormer accurately predicts the acceleration, velocity, and displacement time histories of numerous nodes in the structure. Its prediction accuracy outperforms the LSTM model, and its prediction speed is 193–109,824 times faster than the finite element method. Furthermore, with data augmentation through autoregressive prediction, the SeisFormer model achieves efficient and accurate predictions even when training data is exceptionally scarce, enabling engineering applications.

Real-time automatic crack detection method based on drone

Shiqiao Meng, Zhiyuan Gao, Ying Zhou, Bin He, Abderrahim Djerrad
Computer-Aided Civil and Infrastructure Engineering

A real-time drone-based crack detection method that combines lightweight and high-precision algorithms with autonomous flight control for efficient building damage assessment.

Abstract

Real-time automated drone-based crack detection can be used for efficient building damage assessment. This paper proposes an automated real-time crack detection method based on a drone. Using a lightweight classification algorithm, a lightweight segmentation algorithm, a high-precision segmentation algorithm, and a crack width measurement algorithm, cracks are classified, roughly segmented, and finely segmented, and the maximum width is extracted. A crack-information-assisted automatic drone flight control algorithm guides the drone toward the crack position for automatic detection. The effectiveness of the crack detection algorithm and the flight control algorithm was tested on two different datasets, a two-story building, and a 16-m-high shaking table test building. The results show that crack detection can be completed in real time during flight. The proposed method significantly improves the MIoU of crack edge detection and the accuracy of maximum crack width measurement under the non-ideal shooting conditions of actual inspections by automatically approaching the vicinity of the crack.

A three-stage deep-learning-based method for crack detection of high-resolution steel box girder image

Shiqiao Meng, Zhiyuan Gao, Ying Zhou, Bin He, Qingzhao Kong
Smart Structures and Systems (Excellent Award@IPC-SHM'2020)

A novel three-stage deep learning method for crack detection in high-resolution steel box girder images, combining patch-based CBAM ResNet-50 for localization, Attention U-Net for edge detection, and morphological operations for refinement.

Abstract

Crack detection plays an important role in the maintenance and protection of the steel box girders of bridges. However, since cracks occupy only an extremely small region of the high-resolution images captured under actual conditions, existing methods cannot handle this kind of image effectively. To solve this problem, this paper proposes a novel three-stage method based on deep learning technology and morphological operations. The training and test sets used in this paper consist of 360 images (4928 × 3264 pixels) of steel box girders. The first stage of the proposed model converts high-resolution images into sub-images using a patch-based method and locates crack regions with a CBAM ResNet-50 model; the Recall reaches 0.95 on the test set. The second stage uses the Attention U-Net model to extract the accurate geometric edges of cracks based on the results of the first stage; the IoU of the segmentation model in this stage attains 0.48. In the third stage, we remove wrongly predicted isolated points from the results through a dilation operation and an outlier elimination algorithm; the IoU on the test set rises to 0.70 after this stage. Ablation experiments are conducted to optimize the parameters and further improve the accuracy of the proposed method. The results show that (1) the best patch size for sub-images is 1024 × 1024; (2) the CBAM ResNet-50 and the Attention U-Net achieve the best results in the first and second stages, respectively; and (3) pre-training the models of the first two stages improves the IoU by 2.9%. In general, our method is of great significance for crack detection.

Fun Facts

  • My Erdős number is 3: Zhiyuan Gao → Jitendra Malik → Fan Chung Graham → Paul Erdős.
  • I built phdstat, a tool for visualizing grad school application data from TheGradCafe with near real-time updates. It has helped 2.7k+ applicants track their applications. If you're interested in contributing to this project, feel free to contact me.
  • I've contributed to several games:
    Flappy Balloon
    An alt-controller party game guiding a balloon through obstacles with custom inputs.
    I was one of the producers.
    PL-23
    An AI-driven detective/puzzle game in the spirit of story-rich investigation titles.
    I am the lead producer and a former backend engineer.

Contact

300 N Beaudry Ave, Los Angeles, CA 90012, United States