Co-Speech Gesture Video Generation
with Implicit Motion-Audio Entanglement

CVPR 2025

Videos for Comparisons

On this page, we provide videos to compare our method against S2G and MYA.

Our method produces high-quality videos without blurry hands or finger distortion and maintains a consistent background. In contrast, S2G and MYA exhibit inconsistent backgrounds and suffer from blurry hands and distorted fingers. Additionally, MYA often memorizes appearance features during training. As a result, its generated videos replicate the memorized appearance instead of following the reference image, so the output does not match the reference.

[11 comparison videos. Columns, left to right: Ref Img, S2G, MYA, Ours]
