Realistic Lip-Syncing with NeRF and Deform Frameworks: Synthesizing High-Quality Videos of Speakers
Edmund
We propose a video lip-sync algorithm based on the NeRF and Deform frameworks that can synthesize high-quality, realistic videos of people speaking with different lip shapes and pronunciations. The algorithm consists of two stages: 1) predicting the target speaker's facial shape and texture from a single image, and 2) synthesizing the desired lip movements by deforming and blending the predicted face and mapping it onto the source video frame.
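To make the two-stage structure concrete, here is a minimal, hypothetical sketch of the overall flow. The function names and shapes are illustrative placeholders only, not the actual API behind twinsync.xyz.

```python
# Hypothetical two-stage pipeline sketch; all names and shapes are
# illustrative placeholders, not the released implementation.
import numpy as np

def predict_face(target_image: np.ndarray):
    """Stage 1 (placeholder): predict facial shape and texture from a
    single image with a NeRF-style model; see the NeRF sketch below."""
    shape = np.zeros((5023, 3))        # assumed vertex count, illustrative only
    texture = np.zeros((256, 256, 3))  # assumed texture resolution
    return shape, texture

def deform_and_composite(shape, texture, frame: np.ndarray, lip_params):
    """Stage 2 (placeholder): deform/blend the predicted face for the
    desired lip shape and map it back onto the source frame."""
    return frame  # a real implementation would render the deformed face here

def lip_sync(source_frames, target_image, per_frame_lip_params):
    shape, texture = predict_face(target_image)
    return [deform_and_composite(shape, texture, f, p)
            for f, p in zip(source_frames, per_frame_lip_params)]
```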
To predict facial shape and texture, we use the NeRF framework, a neural rendering technique that models a scene as a continuous 5D function (3D position plus 2D viewing direction). By training the network on images with corresponding depth maps, we can predict facial shape and texture from a single image of the target speaker.
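The sketch below shows the core NeRF idea referenced here: an MLP that maps the 5D input to color and volume density. It is a generic NeRF-style network for illustration (positional encoding and volume rendering omitted), not the trained face model.

```python
# Generic NeRF-style field: 5D input (x, y, z, theta, phi) -> (RGB, density).
# Illustrative only; the actual face model and training setup are not shown.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # RGB (3) + volume density (1)
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        # xyz: (N, 3) sample positions; view_dir: (N, 2) viewing angles
        out = self.mlp(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative density
        return rgb, sigma

# Query the field at a few sample points along camera rays:
model = TinyNeRF()
rgb, sigma = model(torch.rand(8, 3), torch.rand(8, 2))
```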
To synthesize the desired lip movements, we use the Deform framework, a mesh-based deformation technique that deforms and blends meshes for smooth, realistic animation. By mapping the predicted face onto the source video frame, we can synthesize the desired lip movements while preserving the original facial expressions and head poses.
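One common way to realize this kind of mesh deformation and blending is blendshape-style interpolation, where lip movement is a weighted sum of per-vertex offsets over the predicted mesh. The snippet below is a minimal sketch under that assumption; the exact formulation used by the Deform framework may differ.

```python
# Blendshape-style deformation sketch (assumed formulation, illustrative only):
# the final lip shape is the neutral mesh plus a weighted blend of offsets.
import numpy as np

def blend_lip_shapes(neutral_vertices: np.ndarray,
                     lip_deltas: np.ndarray,
                     weights: np.ndarray) -> np.ndarray:
    """neutral_vertices: (V, 3) predicted face mesh.
    lip_deltas: (K, V, 3) per-vertex offsets for K lip/viseme shapes.
    weights: (K,) blending weights, e.g. driven by audio or phonemes."""
    return neutral_vertices + np.tensordot(weights, lip_deltas, axes=1)

# Example: blend two hypothetical visemes ("ah", "oo") at 70% / 30%.
V = 5023                                   # illustrative vertex count
neutral = np.zeros((V, 3))
deltas = np.random.randn(2, V, 3) * 0.01   # stand-in viseme offsets
mouth = blend_lip_shapes(neutral, deltas, np.array([0.7, 0.3]))
```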
The algorithm achieves state-of-the-art results on several benchmark datasets, demonstrating its effectiveness in synthesizing high-quality, realistic videos of people speaking with different lip shapes and pronunciations. We believe our method has potential applications in fields such as movie dubbing, virtual reality, and remote meetings.
https://twinsync.xyz