Cool or creepy? Microsoft's VASA-1 is a new AI model that turns photos into 'talking faces'

(Image credit: Microsoft VASA-1 AI)

A new AI research paper from Microsoft promises a future where you can upload a photo, a sample of your voice and create a live, animated talking head of your own face.

VASA-1 takes in a single portrait photo and an audio file and converts it into a hyper realistic talking face video complete with lip sync, realistic facial features and head movement.

The model is currently only a research preview and not available for anyone outside of the Microsoft Research team to try, but the demo videos look impressive.

Similar lip sync and head movement technology is already available from Runway and Nvidia but this seems to be of a much higher quality and realism, reducing mouth artifacts. This approach to audio-driven animation is also similar to a recent VLOGGER AI model from Google Research.

How does VASA-1 work?

Microsoft says this is a new framework for the creation of lifelike talking faces and specifically for the purpose of animating virtual characters. All of the people in the examples were synthetic, made using DALL-E but if it can animate a realistic AI image, it can animate a real photo.

In the demo we see people talking as if they were being filmed, with slightly jerky but otherwise natural-looking movement. The lip sync is very impressive, with natural movement and no artefacts around the top and bottom of the mouth seen in other tools.

One of the most impressive things about VASA-1 seems to be the fact it doesn't require a face-forward portrait style image to make it work.

There are examples with shots facing a range of directions. The model also seems to have a high degree of control, capable of taking eye gaze direction, head distance and even emotion as an input to steer the generation.

What is the point of VASA-1?

One of the most obvious use cases for this is in advanced lip synching for games. Being able to create AI-driven NPCs with natural lip movement could be a game-changer for immersion.

It could also be used to create virtual avatars for social media videos, as seen already from companies like HeyGen and Synthesia. One other area is in AI-based movie making. You could make a more realistic music video if you can have an AI singer that looks like they are singing.

That said, the team say this is just a research demonstration, with no plans for a public release or even making it available to developers to use in products.

How well does VASA-1 work?

One thing that surprised the researchers was the ability of VASA-1 to perfectly lip-sync to a song, reflecting the words from the singer without issue despite no music being used in the training dataset. It also handled different image styles including the Mona Lisa.

They've got it creating 512x512 pixel images at 45 frames per second and can do it in about 2 minutes using a desktop-grade Nvidia RTX 4090 GPU.

While they say this is only for research, it will be a shame if this doesn’t get out into the public domain, even if only for developers as I’d love to see it in Runway or Pika Labs. Given Microsoft has a huge stake in OpenAI this could even be part of a future Copilot Sora integration.

More from Tom's Guide

Back to Laptops

Apple

Asus

Lenovo

AMD Ryzen

AMD Ryzen 7

Intel Core i7

Intel Pentium

8GB RAM

16GB RAM

32GB RAM

128GB

256GB

512GB

1TB

13.3-inch

14-inch

15-inch

Black

Blue

Grey

Silver

New

Refurbished

Showing 10 of 70 deals

Filters☰

(256GB SSD)

Asus Zenbook S 13 OLED

(1TB Silver)

Asus ROG Zephyrus G14 2023

Lenovo Chromebook Duet 3

(128GB 8GB RAM)

$419

$299

View

Asus ROG Zephyrus G14 2023

(14-inch 512GB)

$1,429.99

$1,358.99

View

Apple MacBook Pro 14-inch M3 (2023)

(14-inch 512GB)

Our Review

☆☆☆☆☆

(15-inch 256GB)

Asus Zenbook S 13 OLED

(13.3-inch 1TB)

$1,399.99

View

Load more deals

Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover.
When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?