
DALL-E 2 shows the power of generative deep learning, but raises disputes over AI practices

This article is part of our coverage of the latest in AI research.

Artificial intelligence research lab OpenAI made headlines again, this time with DALL-E 2, a machine learning model that can generate stunning images from text descriptions. DALL-E 2 builds on the success of its predecessor DALL-E and improves the quality and resolution of the output images thanks to advanced deep learning techniques.

The announcement of DALL-E 2 was accompanied by a social media campaign by OpenAI’s engineers and its CEO, Sam Altman, who shared striking images created by the generative machine learning model on Twitter.

DALL-E 2 shows how far the AI research community has come toward harnessing the power of deep learning and addressing some of its limits. It also offers a glimpse of how generative deep learning models might finally unlock new creative applications for everyone to use. At the same time, it reminds us of the obstacles that remain in AI research and of the disputes that still need to be settled.

The beauty of DALL-E 2

Like other milestone OpenAI announcements, DALL-E 2 comes with a detailed paper and an interactive blog post that shows how the machine learning model works. There’s also a video that provides an overview of what the technology is capable of doing and what its limitations are.

DALL-E 2 is a “generative model,” a special branch of machine learning that creates complex output instead of performing prediction or classification tasks on input data. You provide DALL-E 2 with a text description, and it generates an image that fits the description.
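To make the distinction concrete, here is a minimal, purely illustrative sketch in Python. The function names and return values are stand-ins, not DALL-E 2’s actual interface: a discriminative model maps an image to scores over known classes, while a generative model maps a text description to a brand-new image.

```python
import numpy as np

# Illustrative stand-ins only; not DALL-E 2's real API.

def classify(image: np.ndarray) -> dict:
    # A discriminative model: input image -> probability per known class.
    return {"astronaut": 0.91, "horse": 0.09}

def generate(prompt: str) -> np.ndarray:
    # A generative model: text description -> a new image that fits it.
    # Here a random RGB array is returned as a placeholder.
    return np.random.randint(0, 256, size=(1024, 1024, 3), dtype=np.uint8)

image = generate("An astronaut riding a horse in photorealistic style")
print(image.shape)      # (1024, 1024, 3)
print(classify(image))  # {'astronaut': 0.91, 'horse': 0.09}
```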

Generative models are a hot area of research that received much attention with the introduction of generative adversarial networks (GAN) in 2014. The field has seen tremendous improvements in recent years, and generative models have been used for a vast variety of tasks, including creating artificial faces, deepfakes, synthesized voices, and more.

However, what sets DALL-E 2 apart from other generative models is its capability to maintain semantic consistency in the images it creates.

For example, the following images (from the DALL-E 2 blog post) were generated from descriptions that begin with “An astronaut riding a horse.” One description ends with “as a pencil drawing” and the other with “in photorealistic style.”

[Image: DALL-E 2 outputs for “An astronaut riding a horse,” as a pencil drawing and in photorealistic style]

In both cases, the model consistently draws the astronaut seated on the horse’s back with hands held out in front. This kind of consistency shows up in most of the examples OpenAI has shared.

The following examples (also from OpenAI’s website) show another feature of DALL-E 2: generating variations of an input image. Instead of providing DALL-E 2 with a text description, you give it an image, and it tries to generate other versions of the same scene. In doing so, DALL-E 2 maintains the relations between the elements in the image, including the girl, the laptop, the headphones, the cat, the city lights in the background, and the night sky with moon and clouds.

[Image: DALL-E 2 variations of an input image]

Other examples suggest that DALL-E 2 seems to understand depth and dimensionality, a great challenge for algorithms that process 2D images.

Even if the examples on OpenAI’s website were cherry-picked, they are impressive. And the examples shared on Twitter show that DALL-E 2 seems to have found a way to represent and reproduce the relationships between the elements that appear in an image, even when it is “dreaming up” something for the first time.

In fact, to prove how good DALL-E 2 is, Altman took to Twitter and asked users to suggest prompts to feed to the generative model. The results (see the thread below) are fascinating.

The science behind DALL-E 2

DALL-E 2 takes advantage of CLIP and diffusion models, two advanced deep learning techniques created in the past few years. But at its heart, it shares the same concept as all other deep neural networks: representation learning.

Consider an image classification model. The neural network transforms the pixel values of an image into a set of numbers that represent its features. This vector is sometimes also called the “embedding” of the input. Those features are then mapped to the output layer, which contains a probability score for each class of image that the model is supposed to detect. During training, the neural network tries to learn the feature representations that best discriminate between the classes.
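As a rough sketch of that pipeline (not OpenAI’s code, and with arbitrary layer sizes), the following PyTorch model turns a batch of images into embedding vectors and then maps those embeddings to per-class probability scores:

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, num_classes=10, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(          # pixels -> feature vector
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.head = nn.Linear(embed_dim, num_classes)  # features -> class scores

    def forward(self, x):
        embedding = self.backbone(x)            # the learned representation
        logits = self.head(embedding)           # one score per class
        return logits, embedding

model = TinyClassifier()
images = torch.randn(4, 3, 64, 64)              # a dummy batch of images
logits, embedding = model(images)
probs = logits.softmax(dim=-1)                  # probability score per class
print(embedding.shape, probs.shape)             # [4, 128] and [4, 10]
```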

Ideally, the machine learning model should learn latent features that remain consistent across different lighting conditions, angles, and background environments. But in practice, deep learning models often learn the wrong representations. For example, a neural network might conclude that green pixels are a feature of the “sheep” class because all the images of sheep it has seen during training contain a lot of grass. Another model trained on pictures of bats taken at night might consider darkness a feature of all bat pictures and misclassify pictures of bats taken during the day. Other models might become sensitive to objects being centered in the image and placed in front of a certain type of background.

Learning the wrong representations is part of why neural networks are brittle, sensitive to changes in the environment, and poor at generalizing beyond their training data. It is also why a neural network trained for one application needs to be fine-tuned for other applications: the features of its final layers are usually very task-specific and don’t carry over to other tasks.
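In practice, that fine-tuning often means keeping a pretrained backbone as a general-purpose feature extractor and swapping out the task-specific head. A hedged sketch with torchvision (the ResNet-18 backbone and the five-class target task are arbitrary choices for illustration):

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pretrained on ImageNet and freeze its features.
model = models.resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False

# The original head predicts 1,000 ImageNet classes; replace it with a new
# head for a hypothetical 5-class task and train only this layer (or
# unfreeze more of the network later if needed).
model.fc = nn.Linear(model.fc.in_features, 5)
```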

In theory, you could create a huge training dataset that contains all kinds of variations of data that the neural network should be able to handle. But creating and labeling such a dataset would require immense human effort and is practically impossible.

This is the problem that Contrastive Language-Image Pre-training (CLIP) solves. CLIP trains two neural networks in parallel on images and their captions. One of the networks learns the visual representations in the image and the other learns the representations of the corresponding text. During training, the two networks try to adjust their parameters so that similar images and descriptions produce similar embeddings.
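The snippet below is a simplified sketch of that contrastive objective, not OpenAI’s implementation. Random tensors stand in for image and caption features, two linear layers stand in for the encoders, and the loss rewards matching image/caption pairs (the diagonal of the similarity matrix) in both directions; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 256
image_encoder = torch.nn.Linear(2048, dim)  # stand-in visual encoder
text_encoder = torch.nn.Linear(512, dim)    # stand-in text encoder

image_feats = torch.randn(batch, 2048)      # dummy image features
text_feats = torch.randn(batch, 512)        # dummy caption features

# Project both modalities into the same embedding space and normalize.
img_emb = F.normalize(image_encoder(image_feats), dim=-1)
txt_emb = F.normalize(text_encoder(text_feats), dim=-1)

# Cosine similarity of every image embedding with every caption embedding.
logits = img_emb @ txt_emb.t() / 0.07

# Matching pairs sit on the diagonal; push their similarity up and the
# mismatched pairs' similarity down, in both directions.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
loss.backward()
print(loss.item())
```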
