Unraveling the Secrets of AI Image Generators

Improve your prompts by understanding the inner workings of AI image generators

In this newsletter, read about:

  • 🕵️‍♀️ A Glimpse Into AI Generators' Inner Workings

  • 🗞 News and Top Reads

  • 📌 AI Art Tutorial: Stable WarpFusion, AI Video Maker

  • 🎨 Featured Artist: Roope Rainisto

  • 🖼 AI-Assisted Artwork of the Week

🕵️‍♀️ A Glimpse Into AI Generators' Inner Workings

Midjourney works unpredictably. It can generate beautiful photorealistic portraits from basic prompts and stunning images from random strings of numbers and symbols. But at other times, it refuses to follow simple instructions, won't give us a full-body image, or doesn't place objects the way we request.

I think that understanding how text-to-image AI generators are trained and how they work under the hood can help us a lot in writing prompts that actually work and in getting the most out of these remarkable tools. So let's take a peek inside!

Diffusion Models

Diffusion models are the generative AI models that power all the recently introduced text-to-image generators, including Midjourney, Stable Diffusion, DALL-E, Adobe Firefly, and others.

The idea originally comes from statistical physics and was first applied to image generation in 2015 by a research team from Stanford University and UC Berkeley. But it was only in 2020 that researchers introduced a few groundbreaking changes to the original architecture, leading to a huge jump in the quality of the generated images. One year later, in May 2021, the OpenAI team demonstrated that diffusion models outperform Generative Adversarial Networks (GANs), the state-of-the-art approach to image generation at that point and the AI behind the popular This Person Doesn't Exist website. And then the image generation boom started!

But how do we actually make these models work? First, we take the training images and gradually add random noise to them until nothing but noise remains. Then, we train a model to reverse the process and denoise images until we get clear pictures similar to the training data.
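Here is a minimal, purely illustrative sketch of that training idea in PyTorch. The tiny convolutional "denoiser", the noise schedule values, and the random "training images" are all stand-ins I made up for illustration; production systems use a large U-Net conditioned on the timestep and train on billions of real images.

```python
import torch
import torch.nn as nn

T = 1000                                           # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)              # noise schedule
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

def add_noise(x0, t, noise):
    """Forward process: mix clean images x0 with Gaussian noise at step t."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    b = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * noise

# Toy denoiser; real systems use a large U-Net also conditioned on the step t.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 3, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.rand(8, 3, 64, 64)                      # a batch of "training images"
t = torch.randint(0, T, (8,))                      # a random noise level per image
noise = torch.randn_like(x0)
x_noisy = add_noise(x0, t, noise)

loss = ((model(x_noisy) - noise) ** 2).mean()      # learn to predict the added noise
loss.backward()
optimizer.step()
```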

We can actually observe this denoising process while waiting for Midjourney to generate our image grid.
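And here is an equally simplified sketch of the sampling loop that produces that preview. It reuses T, betas, alphas_cumprod, and the toy model from the sketch above and follows the standard DDPM update: start from pure noise and remove a little of the predicted noise at each step.

```python
import torch

# Reuses T, betas, alphas_cumprod, and the toy `model` from the sketch above.
@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                         # start from pure noise
    for t in reversed(range(T)):
        predicted_noise = model(x)                 # estimate the noise still in x
        alpha, alpha_bar = 1 - betas[t], alphas_cumprod[t]
        # Standard DDPM update: subtract a scaled version of the predicted
        # noise, then re-inject a smaller amount of fresh noise (except at t=0).
        x = (x - (1 - alpha) / (1 - alpha_bar).sqrt() * predicted_noise) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                       # the intermediate x's are the "preview"

image = sample(model)
```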

For more details on how diffusion models work, check out this article. Now let’s see how we can guide this process with our text prompts.

Adding Context

Basically, diffusion models do not need text prompts to generate an image. If text input is not incorporated into the model architecture, a diffusion model will simply generate random images resembling something from its training dataset.

But we want control, so AI developers have enabled us to add "context" to the image generation process. I'll explain this with a toy example.

Let's say we are building a very simple text-to-image generator. We have three categories of images in our dataset: (1) portraits, (2) landscapes, and (3) abstract paintings, all labeled accordingly. Now, each time we request a new image to be generated (denoised), we specify the category; the image generator then focuses only on that category and generates a corresponding image.
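A toy sketch of this category-based conditioning, in the same PyTorch setting as above: the category label is embedded and fed to the denoiser alongside the noisy image, so the predicted noise (and therefore the final image) depends on the requested class. Real systems inject conditioning in more elaborate ways, so treat this only as an illustration of the idea.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3   # 0: portrait, 1: landscape, 2: abstract painting

class ConditionalDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.class_emb = nn.Embedding(NUM_CLASSES, 8)
        self.net = nn.Sequential(nn.Conv2d(3 + 8, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, x_noisy, label):
        # Broadcast the label embedding over the spatial dimensions and
        # concatenate it with the noisy image as extra input channels.
        emb = self.class_emb(label).view(-1, 8, 1, 1).expand(-1, -1, *x_noisy.shape[2:])
        return self.net(torch.cat([x_noisy, emb], dim=1))

model = ConditionalDenoiser()
x_noisy = torch.randn(4, 3, 64, 64)
labels = torch.tensor([1, 1, 0, 2])        # two landscapes, a portrait, an abstract painting
predicted_noise = model(x_noisy, labels)   # denoising is now steered by the category
```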

The state-of-the-art image generators give us much more flexibility by combining diffusion models with large language models. Our text prompts are translated into a set of tokens that the model has been trained to associate with certain visual data. That's how it knows what you want when you request "a photo of a funny puppy playing in the park."
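Midjourney's stack is not public, but the open-source Stable Diffusion pipeline works on the same principle and lets you see the prompt guiding generation end to end. Here is a hedged sketch using the diffusers library; the checkpoint name is one commonly used example and may require accepting a license or substituting another Stable Diffusion checkpoint.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly available checkpoint (any Stable Diffusion checkpoint works).
# Under the hood the prompt is tokenized, encoded by a CLIP text model, and the
# resulting embeddings guide every denoising step.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of a funny puppy playing in the park").images[0]
image.save("puppy.png")
```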

This is a remarkable step forward from having just a limited set of categories to choose from, but it comes with its own limitations:

  • The number of tokens that can be considered for image generation is limited. If your prompt is too long, the model will choose what to focus on. If you don't want to leave it to the model's discretion, write shorter prompts (see the token-counting sketch after this list).

  • Image generation is guided by a set of tokens with no specific order. That's why, when you request a photo of a woman in red shoes, you can easily get a red dress and lots of red in the background instead. The model gets the "red" token but doesn't know where exactly to apply it. If your request corresponds to something that is often found in the training data, you are more likely to get lucky.

  • Combining very different concepts in one image is challenging. AI image generators can generate totally new objects not encountered in the dataset, such as avocado chairs, but it's much easier to get a beautiful image of something the AI generates out of the box, e.g., beautiful women, handsome men, adorable puppies, etc.

  • Certain words have multiple meanings, and we don't know which one will be picked up by the algorithm. For example, I once made the mistake of using the phrase "busy street" in a prompt and was very surprised not to get the results I expected. I overlooked that "busy" is more commonly used in a different sense; in my case, "crowded street" would probably have been a better choice.
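As mentioned in the first bullet above, here is a quick way to check how many tokens a prompt actually becomes, using the CLIP tokenizer that Stable Diffusion relies on. CLIP-style text encoders keep only the first 77 tokens; Midjourney's exact tokenizer and limit are not public, so treat this as a rough guide rather than a hard rule. The long prompt below is just an example.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompt = ("a photo of a beautiful woman in red shoes walking down a crowded street "
          "at sunset, cinematic lighting, ultra detailed, award winning photography")
token_ids = tokenizer(prompt)["input_ids"]

print(f"{len(token_ids)} tokens (including the start/end markers)")
if len(token_ids) > tokenizer.model_max_length:    # 77 for CLIP text encoders
    print("This prompt would be truncated; trailing terms will simply be ignored.")
```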

Let's now take a deeper look at the training data used to build the AI algorithms behind Midjourney and other image generators. This should further help us improve the wording of our text prompts.

Training Data

To train an AI generator that matches Midjourney's image quality, you need billions of images. There is very limited information on how Midjourney built its training dataset, but I think the technical approach was very similar to how LAION-5B, the open-source dataset used by Stable Diffusion, was created.

The first step is to scrape the Internet for image-text pairs. Yes, for each image, you need a caption or ALT text. Then, you filter your dataset. Very often, the caption doesn't actually describe what's in the image, so the developers start by creating text and image embeddings and comparing them. If the comparison shows that an image and its text convey very different information, the pair is removed from the dataset. For reference, in the case of LAION-5B, this filter removed about 90% of images from the initial 50B+ candidates.
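Here is a simplified sketch of that caption-versus-image filter, using the openly available CLIP model: embed both sides, compare them with cosine similarity, and drop the pair if the score is too low. The file name and the exact threshold below are illustrative; LAION reports using a CLIP-similarity cutoff in roughly this range.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("candidate.jpg")      # hypothetical scraped image
caption = "a photo of a funny puppy playing in the park"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Keep the pair only if the image and the caption point in roughly the same direction.
similarity = torch.cosine_similarity(text_emb, image_emb).item()
keep = similarity > 0.28                 # illustrative cutoff, roughly LAION's range
print(f"similarity={similarity:.2f}, keep={keep}")
```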

Then, developers can apply other filters, for example, removing watermarked images, NSFW pictures, copyrighted images, etc.

But for us, as users, the most important question is – what is actually in the "text" part of the training dataset? This has a significant impact on how we should write our prompts.

You can get a glimpse of a typical training dataset by searching the LAION-5B database here. For example, here are the results I got for "a photo of a beautiful woman". You can also see the typical text associated with the images.

And here’s what you’ll see when searching for “Canon EOS 4000D”.

Do you still think that it’s worth adding a camera name to your prompt?

Obviously, people rarely get pictures of cameras when using a camera name in a prompt. First, the "tradition" is to include camera names at the end of the prompt, where they have less impact; second, today's models are usually smart enough to recognize which parts of the prompt are most important to focus on.

But you can very easily get a camera too. Here is what I got on my very first attempt with the following prompt in Midjourney:

Canon EOS 4000D, a photo of a beautiful woman

With a somewhat better understanding of the training data, we can now choose terms that are more likely to be found in the caption of the kind of image we're looking for – e.g., subjects, clothing, style, and type of image, but not the relative position of subjects, a camera name, or a lens F-number.

However, it's not only about the dataset. We can see how different image generators perform dramatically differently while using similar diffusion models and training datasets. This is the effect of additional improvements introduced by individual developer teams.

Advanced Image Generation

Here are some improvements that you can observe in Midjourney and some other AI image generators:

  • Aesthetics. Midjourney's artistic style makes images more interesting and attention-grabbing. As you can see from the screenshot above, if you just search the database for a photo of a beautiful woman, the results are mostly not that impressive.

  • Photorealism. The developers can also adjust the model parameters to generate more photorealistic results by default. You may have noticed that in the latest versions you usually get photo-like results even without explicitly asking for them. That is very different from how the earlier versions worked.

  • Diversity. Midjourney and other image generators often succeed in generating diverse results even without an explicit request. I can see that this doesn't work perfectly well, and in some cases the results are inexcusably homogeneous, but considering that our online data is unfortunately very biased, and that AI algorithms by their nature tend to generate the most statistically probable results, introducing at least some diversity is already a step forward.

  • User control. Developers also enable users to have additional control by, for example, setting certain parameters at their own discretion. In Midjourney, you may choose to stylize your image less or more with the --s parameter, or you may decide how weird or varied your results will be with the --weird and --chaos parameters, respectively (an open-source analogue is sketched after this list).

  • Model fine-tuning with user feedback. The Midjourney team encourages users to evaluate output images through ranking pairs or rating their own image generations. This feedback is incorporated into the AI model to further improve the AI generator's performance.
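Midjourney's --s, --weird, and --chaos internals are not public, but open-source pipelines expose analogous knobs. As a rough analogue, here is how the diffusers pipeline from the earlier sketch lets you trade prompt adherence against creative freedom and control run-to-run variation; the parameter values below are just examples.

```python
import torch

# Reuses the `pipe` object from the Stable Diffusion sketch above.
generator = torch.Generator("cuda").manual_seed(42)   # fixed seed = reproducible image

image = pipe(
    "a photo of a funny puppy playing in the park",
    guidance_scale=7.5,        # higher = follow the prompt more literally, lower = freer
    num_inference_steps=30,    # more denoising steps are slower but often cleaner
    generator=generator,       # change the seed to get a different variation
).images[0]
```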

Conclusion

When we use AI image generators without understanding their internal workings, we often encounter unexpected results and see the tool as a stubborn child that just doesn't want to follow our instructions. Our chances of success are higher if we try to learn the language that AI image generators understand.

Nobody is 100% safe from unexpected results because even researchers who build AI image generators don’t fully understand how they work. But learning some basics gives us a better grasp of what is possible and impossible to achieve with AI text-to-image generators, how much control we have, and how to improve our text prompt to achieve the desired outcome.

🗞 News and Top Reads

  • Midjourney rolled out a new “Panning” feature.

    • Underneath your upscales you can now see arrow buttons, and if you click an arrow it will extend your image in that direction.

    • Currently, there are a few limitations. Among other things, you cannot pan both horizontally and vertically on the same image, you cannot control how far each panning operation extends the image, and variations are not supported on panned images.

  • Playground AI announced the Mixed Image Editing feature.

    • The newly introduced feature empowers users to blend multiple images together, expanding on the platform's text-to-image capabilities.

    • Playground's upgraded collaborative Canvas editor enables users to overlay a multitude of edits, offering an unparalleled degree of control and finesse.

  • Google updated its privacy policy, explicitly stating that the company retains the right to use almost any content you post online for the purpose of building its AI tools.

    • The new Google policy says: “For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”

📌 AI Art Tutorial: Stable WarpFusion, AI Video Maker

In this video tutorial, Matt Wolfe explores Stable WarpFusion, a new AI video maker. He explains how to get access to the tool and walks through the process of creating beautiful animations and videos with Stable WarpFusion.

🎨 Featured Artist: Roope Rainisto

Roope Rainisto is a Finland-based designer, artist, and creator. He has 25 years of experience in creating and leading user experience work for a wide range of devices and services. Currently, he is exploring the world and possibilities of AI-powered creation. Roope is the artist behind the Life In West America and Reworld NFT collections. Even amidst the current NFT downmarket, his artwork continues to be traded for thousands of dollars.

🖼 AI-Assisted Artwork of the Week

What if coffee machines had butterfly wings? From a collaborative series between @julian_ai_art & @iva_ai_popsurreal.

Share Kiki and Mozart

If you like this newsletter and know somebody who might also enjoy it, feel free to share it. Let's have more people learn about AI art!

If you have been forwarded this email and you like it, please subscribe below. And welcome to the world of AI art!
