ObamaNet: Photo-realistic lip-sync from text
Abstract
We present ObamaNet, the first architecture that leverages fully trainable neural modules to take any text as input and generate both the corresponding speech and a synchronized, photo-realistic lip-sync video. The architecture harnesses three main modules - a text-to-speech network based on Char2Wav, a time-delayed LSTM (Long Short-Term Memory) that generates mouth keypoints synced to the generated audio, and a network based on Pix2Pix that generates video frames conditioned on those mouth keypoints.
1 Introduction
Recent years have seen notable progress in both image generation and speech synthesis; however, these two modalities are rarely modelled together. This is precisely where ObamaNet steps in, combining these recent models to produce artificial videos of a person reading any given text. Given a set of close-shot videos of an individual speaking, together with the accompanying transcripts, the model yields a system that synthesizes speech from any text and modifies the mouth area of a pre-existing video accordingly, producing natural and realistic results.
Although the method is demonstrated on videos of Barack Obama, it is versatile and can be used to generate videos of anyone, provided the relevant data is available.
2 Related Work
ObamaNet builds upon recent strides in the generation of photo-realistic videos and facial animations. Although conceptually close to the works of Suwajanakorn et al. (2017) and Karras et al. (2017), it deviates in two key respects - it replaces the traditional computer vision pipeline with trainable neural networks, and it incorporates a text-to-speech synthesizer to form a complete text-to-video system.
3 Model Description
3.1 Text-to-speech system
The Char2Wav architecture (Sotelo et al., 2017) is employed to generate speech from the input text. The speech synthesis system is trained on audio extracted from the videos, aligned with the corresponding transcripts.
3.2 Keypoint generation
Keypoint generation predicts the mouth shape from the input audio. To make the generated mouth keypoints applicable to any target video, they must first be normalized. This module follows the network design of Suwajanakorn et al. (2017), employing a time-delayed LSTM to predict a representation of the mouth shape given the audio features.
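The time-delay trick can be illustrated with a small data-preparation sketch (the function name and the 20-frame delay are illustrative assumptions; the paper publishes no code). By shifting the target sequence relative to the audio, the LSTM at each step has already consumed a short window of future audio before it must commit to a mouth shape:

```python
import numpy as np

def time_delay_pairs(audio_feats, mouth_coeffs, delay=20):
    """Align audio features with delayed mouth-shape targets.

    audio_feats:  (T, A) array of per-frame audio features.
    mouth_coeffs: (T, M) array of per-frame mouth-shape targets.
    Returns sequences (X, Y) such that the LSTM, when it emits the
    prediction for Y[i], has read audio up to frame i + delay - in
    other words, it sees `delay` frames of future audio (look-ahead)
    before committing to a mouth shape.
    """
    assert len(audio_feats) == len(mouth_coeffs)
    assert 0 < delay < len(audio_feats)
    return audio_feats[delay:], mouth_coeffs[:-delay]
```

Training then proceeds as with any sequence-to-sequence regression: the LSTM reads X step by step and is penalized for the distance between its output and the corresponding row of Y.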
3.3 Video generation
Inspired by the recent surge in image-to-image translation methods, particularly Pix2Pix (Isola et al., 2016), the ObamaNet architecture adopts this approach to generate the video frames.
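A rough sketch of the conditioning step (the exact input encoding is not specified in this summary, so the simple dot-rasterization below is an assumption): the mouth keypoints are rendered onto an image, and that rendering is what a Pix2Pix-style generator translates into a photo-realistic frame.

```python
import numpy as np

def rasterize_keypoints(mouth_kpts, size=256):
    """Draw mouth keypoints as white pixels on a black canvas.

    mouth_kpts: iterable of (x, y) pairs in pixel coordinates.
    The returned image serves as the conditioning input for an
    image-to-image translation network in the style of Pix2Pix;
    points falling outside the canvas are skipped.
    """
    canvas = np.zeros((size, size), dtype=np.uint8)
    for x, y in mouth_kpts:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < size and 0 <= yi < size:
            canvas[yi, xi] = 255  # rows index y, columns index x
    return canvas
```

In practice one would draw connected mouth contours rather than isolated dots, but the principle is the same: the generator learns a mapping from a sparse keypoint rendering to a full texture of the mouth region.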
4 Supplementary Material
4.1 Dataset
The ObamaNet approach was demonstrated on videos of former President Barack Obama, following Suwajanakorn et al. (2017). We worked with 17 hours of footage drawn from 300 of Barack Obama's weekly presidential addresses. These videos provide an ideal dataset given the relatively controlled environment and the fixed, central positioning of the subject.
4.2 Data Processing
The ObamaNet model requires two components: a representation of the audio as input and a representation of the mouth shape as output. Deriving these from raw video footage involves several steps: extracting the audio, annotating and cropping the frames, normalizing the facial keypoints, and applying PCA.
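The keypoint normalization and PCA steps can be sketched as follows (the dimensions - 20 mouth points, 8 PCA components - are illustrative assumptions, as are the function names):

```python
import numpy as np

def normalize_mouth(kpts):
    """Remove per-frame translation and scale from mouth keypoints.

    kpts: (T, K, 2) array of K mouth points over T frames.
    Each frame is centered on its centroid and divided by its RMS
    spread, making shapes comparable across head positions.
    """
    centered = kpts - kpts.mean(axis=1, keepdims=True)
    scale = np.sqrt((centered ** 2).mean(axis=(1, 2), keepdims=True))
    return centered / scale

def fit_pca(shapes, n_components=8):
    """PCA over flattened (T, K*2) mouth shapes via SVD.

    Returns low-dimensional coefficients (the regression targets for
    the keypoint-generation LSTM), the component basis, and the mean.
    """
    flat = shapes.reshape(len(shapes), -1)
    mean = flat.mean(axis=0)
    _, _, Vt = np.linalg.svd(flat - mean, full_matrices=False)
    basis = Vt[:n_components]           # (n_components, K*2)
    coeffs = (flat - mean) @ basis.T    # (T, n_components)
    return coeffs, basis, mean
```

A predicted coefficient vector is mapped back to keypoints with `coeffs @ basis + mean`, after which the points are denormalized into the coordinate frame of the target video.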
5 Conclusion
ObamaNet contributes a significant step toward the generation of photo-realistic lip-sync videos and offers a promising direction for future research in this field.