Microsoft researchers say that almost anyone’s facial movements can be synced with speech and audio clips, using artificial intelligence to produce talking-head animations. This is hardly surprising, since “deepfake” videos now turn up in almost every area of our lives. The method is used to make entertaining videos, but it is also sometimes used to distort politicians’ statements.
Evolving technology suggests that deepfakes are here to stay. Although Microsoft, one of the world’s most important technology companies, doesn’t call it a “deepfake,” it appears to be stepping into this field with a new development. Moreover, Microsoft is not the only one in this field.
Last June, Samsung researchers detailed an end-to-end model that can animate a person’s eyebrows, mouth, eyelashes and cheeks. Just a few weeks later, Udacity introduced a system that automatically produces lesson videos from audio narration. Two years ago, Carnegie Mellon researchers published a paper describing an approach that allows facial movements to be transferred from one person to another.
Building on these and other studies, the Microsoft Research team has put forward a technique that they claim improves the quality of audio-driven talking-head animations. Previous talking-head generation approaches required clean, relatively noise-free audio in a neutral tone. With the new research, the researchers say, methods that separate audio sequences into factors such as phonetic content and background noise can generalize to noisy and emotionally rich data samples.
Human speech is full of variation. Different people can say the same word in different contexts, with differences in timing, tone and so on. In addition to phonetic content, speech carries plenty of information about the speaker’s emotional state, identity (gender, age, ethnicity) and personality. Microsoft describes its new research as the first approach to improve performance from an audio representation learning perspective.
Underlying the proposed technique is a variational autoencoder (VAE) that learns latent representations. The VAE factors input audio sequences into different representations that encode content, emotion and other factors of variation. Based on the input audio, a sequence of content representations is sampled from the learned distribution and fed, along with input face images, to a video generator that animates the face. The result is an animation in which the face matches the sound.
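To make the data flow described above more concrete, here is a minimal sketch in Python. It is not Microsoft’s model. The names `Gaussian`, `toy_encoder`, `content_sequence` and `toy_generator`, the latent sizes and the file name `face.png` are all assumptions made for illustration, and the "encoder" is a hand-written stand-in for a trained network. The sketch only shows the general idea of keeping separate distributions for content and emotion, sampling one content representation per audio frame, and passing those samples to a generator together with a face image.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Gaussian:
    """Diagonal Gaussian over one latent factor (mean and log-variance)."""
    mean: list
    log_var: list

    def sample(self, rng):
        # Standard VAE sampling step: z = mean + sigma * eps, eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                for m, lv in zip(self.mean, self.log_var)]


def toy_encoder(frame, content_dim=2, emotion_dim=2):
    """Hypothetical stand-in for a trained encoder (an assumption).

    It maps simple statistics of one audio frame to two separate
    distributions, one per factor. A real model would learn this mapping.
    """
    avg = sum(frame) / len(frame)
    energy = sum(x * x for x in frame) / len(frame)
    content = Gaussian([avg] * content_dim, [-2.0] * content_dim)
    emotion = Gaussian([energy] * emotion_dim, [-2.0] * emotion_dim)
    return content, emotion


def content_sequence(frames, rng):
    """Sample one content latent per audio frame; other factors are left out."""
    return [toy_encoder(frame)[0].sample(rng) for frame in frames]


def toy_generator(face, content_seq):
    """Placeholder video generator: pairs the still face with each latent."""
    return [{"face": face, "content": z} for z in content_seq]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
audio = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5]]  # made-up frames
video = toy_generator("face.png", content_sequence(audio, rng))
print(len(video), len(video[0]["content"]))
```

In this sketch only the content samples reach the generator, which is one way to picture why separating out noise and emotion could help with audio that is not clean or neutral.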
The team says their approach performs on par with other methods on clean, neutral speech across all metrics. Moreover, they say it performs consistently across the entire emotional spectrum and is compatible with all current approaches to talking-head generation.