A new hybrid approach to photorealistic neural avatars

A video conference screenshot showing the Microsoft Teams avatars, now in public preview.
(Image credit: Microsoft)

Neural avatars have emerged as a new technology for interactive remote presence. The technology is set to influence applications across video conferencing, mixed-reality frameworks, gaming, and the metaverse.

However, at present the technology can feel limited. On one side are cartoon representations of virtual speakers, such as Mesh avatars for Microsoft Teams; on the other, experimental prototypes of photorealistic neural rendering, like NVIDIA Maxine video compression and Meta’s pixel codec avatars.

In both categories, restrictions in rendering a speaker’s intricate expressions, gestures and body movements severely limit the value of remote presence and visual communication. It’s widely accepted that the majority of human communication is non-verbal. We evolved to be very astute observers of human faces and body movements. We use subtle expressions, hand gestures, and body language to convey trust and meaning, and to interpret human emotion and experience. All these details matter. So, when poor rendering doesn’t reflect reality, many users may simply prefer not to use any video at all.

The challenge, then, is to identify a new way to introduce advanced AI tools to facilitate genuine, engaging remote presence applications.

(Image credit: iSize)

After almost two years of research, deep-tech video delivery company iSize observed that all 2D and 3D photorealistic neural rendering frameworks for human speakers that allow for substantial compression - in other words, 2x-plus compression versus conventional video codecs - cannot guarantee faithful representation of features like the eyes and mouth, or of certain expressions and gestures.

Due to the enormous diversity in facial features and expressions, the firm estimated it would take 1000x more complexity in neural rendering to get such intricate details right. And even then, there’s no guarantee the result will not have uncanny valley artifacts when deployed at scale for millions of users.

Generative AI and photorealistic avatars: How does it work? 

At GTC 2023, the company presented its latest live photorealistic avatar, driven by generative AI, which operates at a substantially lower bitrate than standard video codecs. The bitrate reduction leads to 5x lower video latency and transceiver power, uninterrupted remote presence under poor wireless signal conditions, and significantly better Quality of Experience. It could also mean increased user engagement time for gaming and metaverse applications.

The approach still uses a custom pre-trained neural warp engine, but one that is simpler and faster than other similar approaches. At the encoder side, this neural engine generates a baseline representation of the target video frame of the speaker (starting from a reference frame), with warp information that can be very compact, i.e., around 200 bits per frame. Naturally, uncanny valley issues can creep in around the mouth and eye regions, as seen in Fig. 1.

Fig. 1. Neural warping that generates a baseline for the target video frame, with uncanny valley artifacts in the eyes and mouth. (Image credit: iSize)
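To give a feel for how small the warp information is, here is a minimal Python sketch of packing a handful of quantized keypoint offsets into a fixed-width bitstream. The keypoint count, bit depths and packing scheme are illustrative assumptions for the sketch, not iSize’s actual bitstream format.

```python
# Minimal sketch: how a per-frame warp description can stay near 200 bits.
# The keypoint count and bit depths below are illustrative assumptions,
# not iSize's actual format.
import numpy as np

NUM_KEYPOINTS = 10    # assumed number of driving keypoints
BITS_PER_COORD = 10   # assumed quantization: 10 bits per x/y offset

def quantize_offsets(offsets: np.ndarray) -> list[int]:
    """Map keypoint offsets in [-1, 1] to unsigned fixed-width integers."""
    levels = 2 ** BITS_PER_COORD - 1
    clipped = np.clip((offsets + 1.0) / 2.0, 0.0, 1.0)
    return [int(round(v * levels)) for v in clipped.flatten()]

def pack_bits(values: list[int], bits: int) -> bytes:
    """Concatenate fixed-width integers, MSB first, into a byte string."""
    acc, n_bits, out = 0, 0, bytearray()
    for v in values:
        acc = (acc << bits) | v
        n_bits += bits
        while n_bits >= 8:
            n_bits -= 8
            out.append((acc >> n_bits) & 0xFF)
    if n_bits:
        out.append((acc << (8 - n_bits)) & 0xFF)
    return bytes(out)

# 10 keypoints x 2 coordinates x 10 bits = 200 bits = 25 bytes per frame.
offsets = np.random.uniform(-1.0, 1.0, size=(NUM_KEYPOINTS, 2))
payload = pack_bits(quantize_offsets(offsets), BITS_PER_COORD)
print(len(payload) * 8, "bits/frame")   # -> 200
```

At 30 frames per second, 200 bits per frame works out to roughly 6 kbit/s for the warp information alone, orders of magnitude below a typical conferencing video stream; the ROI or full-frame fallback payloads described below add to this.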

However, what happens next is that a quality scoring neural proxy assesses the expected warp quality before the warp is actually applied. It then makes one of the following decisions:

Either it determines that the warp will fail badly, in which case the whole frame is sent to a ‘sandwiched’ encoding engine such as an HEVC or AV1 encoder.

Or it extracts only selected region-of-interest components, such as the eyes and mouth, and sends just those components to the sandwiched encoder.

A region-of-interest (ROI) blending engine then seamlessly blends the ROI components with the warp output, and the stitched result is displayed.
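A rough sketch of this per-frame decision flow is shown below. The function names (quality proxy, warp engine, codec wrapper), the thresholds, and the feathered alpha mask used for blending are illustrative placeholders, not iSize’s implementation.

```python
# Sketch of the per-frame decision flow: score first, then warp, ROI-encode,
# or fall back to the full-frame codec. All engines are placeholders.
from dataclasses import dataclass
import numpy as np

FAIL_THRESHOLD = 0.4   # assumed score below which the warp is expected to fail badly
ROI_THRESHOLD = 0.8    # assumed score below which only eyes/mouth need the fallback codec

@dataclass
class EncodedFrame:
    mode: str                    # "warp", "warp+roi", or "full"
    warp_bits: bytes | None      # compact warp parameters (~200 bits)
    codec_payload: bytes | None  # HEVC/AV1 payload for the full frame or ROI crops
    roi_boxes: list | None       # ROI rectangles (x, y, w, h) for eyes/mouth

def encode_frame(reference, target, warp_engine, proxy, codec):
    """Decide, before warping, how to code one frame of the speaker."""
    warp_bits = warp_engine.predict_params(reference, target)
    score = proxy.score(reference, target, warp_bits)   # a priori quality estimate

    if score < FAIL_THRESHOLD:
        # Expected major warp failure: hand the whole frame to the sandwiched codec.
        return EncodedFrame("full", None, codec.encode(target), None)

    if score < ROI_THRESHOLD:
        # Warp is usable except around the eyes/mouth: send only those crops.
        boxes = proxy.roi_boxes(target)                 # e.g. eye and mouth rectangles
        crops = [target[y:y+h, x:x+w] for (x, y, w, h) in boxes]
        return EncodedFrame("warp+roi", warp_bits, codec.encode_rois(crops), boxes)

    return EncodedFrame("warp", warp_bits, None, None)

def feather_mask(h, w, margin=0.15):
    """Separable ramp that fades ROI edges toward the warped background."""
    ramp = lambda n: np.minimum(1.0, np.minimum(np.arange(n), n - 1 - np.arange(n))
                                / max(1, int(n * margin)))
    return np.outer(ramp(h), ramp(w))

def decode_frame(frame: EncodedFrame, reference, warp_engine, codec):
    """Reconstruct the frame, blending codec-decoded ROIs into the warped output."""
    if frame.mode == "full":
        return codec.decode(frame.codec_payload)

    out = warp_engine.apply(reference, frame.warp_bits)      # float image assumed
    if frame.mode == "warp+roi":
        for crop, (x, y, w, h) in zip(codec.decode_rois(frame.codec_payload),
                                      frame.roi_boxes):
            alpha = feather_mask(h, w)[..., None]
            out[y:y+h, x:x+w] = alpha * crop + (1 - alpha) * out[y:y+h, x:x+w]
    return out
```

The design choice worth noting is that the proxy’s score is computed from the reference and the compact warp parameters before any warping or blending takes place, so the fallback path never has to undo work already done.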

Fig. 2. Full framework of BitGen2D proposed by iSize. (Image credit: iSize)

The output avoids all uncanny valley problems. This framework is termed BitGen2D, as it bridges the gap between a generative avatar engine and a conventional video encoder. It’s a hybrid approach that offloads all the difficult parts to the external encoder, while leaving the bulk of the frame to be warped by the neural engine. 

The crucial components of this framework are the quality scoring proxy and the ROI blending. By deliberate design, BitGen2D does not score quality after the blending, because scoring a posteriori would increase the delay.

That is, if a failure were only detected after the warp and blend, the system would have to discard the result and send the frame to the video encoder anyway. Getting the quality scoring to work well in conjunction with the warp and blending is therefore key. Furthermore, deepfakes cannot be created, as the framework operates directly on elements of the speaker's face.
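A toy comparison of the two orderings makes the latency argument concrete. The millisecond figures here are made-up placeholders, included purely to show where the wasted work would sit in an a posteriori design.

```python
# Illustrative comparison of a posteriori vs a priori quality scoring.
# All timings are made-up placeholders, not measured figures.
WARP_MS, BLEND_MS, SCORE_MS, ENCODE_MS = 4.0, 1.0, 0.5, 8.0

def a_posteriori_worst_case():
    # Warp and blend first, then score the finished frame; on a detected
    # failure the frame must still go through the fallback encoder afterwards.
    return WARP_MS + BLEND_MS + SCORE_MS + ENCODE_MS

def a_priori_worst_case():
    # Score the expected warp quality first; on a predicted failure the warp
    # and blend are skipped and the frame goes straight to the encoder.
    return SCORE_MS + ENCODE_MS

print(a_posteriori_worst_case(), a_priori_worst_case())  # e.g. 13.5 vs 8.5 ms
```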

BitGen2D provides high reliability, accurately rendering the various events that can happen during a video call, like hand movements and foreign objects entering the frame. It’s also user-agnostic, since it doesn’t require any fine-tuning on a specific identity; this makes BitGen2D plug-and-play for any user. It can also withstand alterations in appearance, such as a change of clothes, accessories, or haircut, or a switch of speaker during a live call. The method is actively being extended to 3D data using NeRFs and octrees.

During the GTC Developer conference keynote address, NVIDIA CEO Jensen Huang said, “Generative AI’s incredible potential is inspiring virtually every industry to reimagine its business strategies and the technology required to achieve them.” 

AI solutions, then, are here to stay and are already transforming how we live, work and play in both physical and virtual spaces. 

Yiannis Andreopoulos, CTO, iSize.
