
# How Nvidia’s Maxine uses AI to improve video calls

One of the things that caught my eye at Nvidia’s flagship event, the GPU Technology Conference (GTC), was Maxine, a platform that leverages artificial intelligence to improve the quality and experience of video-conferencing applications in real-time.

Maxine uses deep learning for resolution improvement, background noise reduction, video compression, face alignment, and real-time translation and transcription.

In this post, which marks the first installment of our “deconstructing artificial intelligence” series, we will take a look at how some of these features work and how they tie in with AI research done at Nvidia. We’ll also explore the open issues and the possible business model for Nvidia’s AI-powered video-conferencing platform.

Super-resolution with neural networks

The first feature shown in the Maxine presentation is “super resolution,” which, according to Nvidia, “can convert lower resolutions to higher resolution videos in real time.” Super resolution enables video-conference callers to send low-resolution video streams and have them upscaled at the server. This reduces the bandwidth requirements of video-conferencing applications and can make their performance more consistent in areas with poor network connectivity.


The big challenge of upscaling visual data is filling in the missing information. You have a limited array of pixels that represent an image, and you want to expand it to a larger canvas that contains many more pixels. How do you decide what color values those new pixels get?


Older upscaling techniques use interpolation methods (bicubic, Lanczos, etc.) to fill the space between pixels. These techniques are general-purpose and can produce mixed results across different types of images and backgrounds.
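To make the contrast concrete, here is a minimal sketch of classical, non-AI upscaling using OpenCV. The file names and 2x scale factor are arbitrary placeholders; the point is that bicubic and Lanczos resampling apply fixed interpolation weights regardless of what the image contains.

```python
# Minimal sketch: classical (non-AI) upscaling of a single frame with OpenCV.
# File names and the 2x scale factor are placeholders.
import cv2

frame = cv2.imread("lowres_frame.png")  # e.g. a low-resolution video-call frame
h, w = frame.shape[:2]

# Bicubic and Lanczos interpolation fill in new pixels from a fixed-weight
# neighborhood; the weights do not adapt to the content of the image.
upscaled_bicubic = cv2.resize(frame, (w * 2, h * 2), interpolation=cv2.INTER_CUBIC)
upscaled_lanczos = cv2.resize(frame, (w * 2, h * 2), interpolation=cv2.INTER_LANCZOS4)

cv2.imwrite("upscaled_bicubic.png", upscaled_bicubic)
cv2.imwrite("upscaled_lanczos.png", upscaled_lanczos)
```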

One of the benefits of machine learning algorithms is that they can be tuned to perform very specific tasks. For instance, a deep neural network can be trained on scaled-down video frames grabbed from video-conference streams and their corresponding hi-res original images. With enough examples, the neural network will tune its parameters to the general features found in video-conference visual data (mostly faces) and will be able to provide a better low- to hi-res conversion than general-purpose upscaling algorithms. In general, the narrower the domain, the better the chances of the neural network converging on highly accurate performance.
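Nvidia has not published Maxine’s training code, so the following is only a toy PyTorch sketch of the general recipe described above: take original frames as ground truth, downscale them to create the inputs, and train the network to reconstruct the originals. The model architecture, loss, and data handling here are illustrative assumptions, not Maxine’s implementation.

```python
# Illustrative PyTorch sketch of super-resolution training on paired frames.
# NOT Nvidia's implementation; the model and data pipeline are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySRNet(nn.Module):
    """Toy network that maps a low-res frame to a 2x upscaled frame."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * 4, 3, padding=1),  # 3 channels x (2x2) upscale factor
            nn.PixelShuffle(2),                  # rearrange channels into 2x resolution
        )

    def forward(self, x):
        return self.body(x)

model = TinySRNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(hi_res_batch):
    """hi_res_batch: original video-call frames, shape (N, 3, H, W)."""
    # Create the low-res input by downscaling the ground-truth frame.
    lo_res = F.interpolate(hi_res_batch, scale_factor=0.5, mode="bicubic",
                           align_corners=False)
    predicted = model(lo_res)
    loss = F.l1_loss(predicted, hi_res_batch)  # pixel-wise reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the training data would consist almost entirely of talking heads in front of webcams, the network can learn face-specific detail that a generic interpolation filter has no way of knowing about.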

There’s already a solid body of research on using artificial neural networks for upscaling visual data, including a 2017 Nvidia paper that discusses general super resolution with deep neural networks. With video-conferencing being a very specialized domain, a well-trained neural network is bound to perform even better than it would on more general tasks. Aside from video conferencing, there are applications for this technology in other areas, such as the film industry, which uses deep learning to remaster old footage at higher quality.

Video compression with neural networks

One of the more interesting parts of the Maxine presentation was the AI video compression feature. The video posted on Nvidia’s YouTube channel shows that using neural networks to compress video streams reduces the data per frame from ~97 KB to ~0.12 KB, a figure that is a bit exaggerated, as users have pointed out on Reddit. Nvidia’s website states developers can reduce bandwidth use down to “one-tenth of the bandwidth needed for the H.264 video compression standard,” which is a much more reasonable, and still impressive, figure.
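A quick back-of-the-envelope calculation shows why the demo figures raised eyebrows. Assuming a 30 fps call (the frame rate is my assumption, not something Nvidia states), the per-frame numbers translate to roughly a 800x reduction, far beyond the 10x figure quoted on Nvidia’s site:

```python
# Back-of-the-envelope check of the per-frame figures from the demo,
# assuming a 30 fps call (the frame rate is an assumption).
fps = 30
h264_kb_per_frame = 97.0
maxine_kb_per_frame = 0.12

h264_kbps = h264_kb_per_frame * 8 * fps      # ~23,280 kbps (~23 Mbps)
maxine_kbps = maxine_kb_per_frame * 8 * fps  # ~29 kbps
print(f"H.264: {h264_kbps:.0f} kbps, Maxine demo: {maxine_kbps:.0f} kbps, "
      f"ratio ~{h264_kbps / maxine_kbps:.0f}x")
```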

How does Nvidia’s AI achieve such impressive compression rates? A blog post on Nvidia’s website provides more detail on how the technology works. A neural network extracts and encodes the locations of key facial features of the user for each frame, which is much more efficient than compressing pixel and color data. The encoded data is then passed on to a generative adversarial network along with a reference video frame captured at the beginning of the session. The GAN is trained to reconstruct the new image by projecting the facial features onto the reference frame.
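Nvidia has not released this pipeline, but the data flow described in the blog post can be sketched schematically. In the sketch below, every class and function name is a hypothetical placeholder (a keypoint extractor on the sender side, a GAN generator on the receiver side); it only illustrates that a full frame is sent once, and afterwards only small keypoint vectors cross the wire.

```python
# Schematic sketch of the described keypoint-plus-GAN compression flow.
# All names here are hypothetical placeholders, not Nvidia's code.

class Sender:
    def __init__(self, keypoint_extractor):
        self.extract_keypoints = keypoint_extractor  # facial-keypoint network

    def start_call(self, first_frame):
        # The reference frame is transmitted once, in full, at call start.
        return first_frame

    def encode(self, frame):
        # For every subsequent frame, only a small vector of facial keypoint
        # positions is transmitted instead of compressed pixel data.
        return self.extract_keypoints(frame)

class Receiver:
    def __init__(self, generator, reference_frame):
        self.generator = generator        # GAN generator network
        self.reference = reference_frame  # full frame received at call start

    def decode(self, keypoints):
        # The generator "re-animates" the reference frame so the face matches
        # the received keypoint positions, reconstructing the current frame.
        return self.generator(self.reference, keypoints)
```

The key design choice is that the heavy lifting happens on the receiving end: the GAN regenerates photorealistic frames locally, so the link itself only has to carry a handful of coordinates per frame.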

