I’m writing this article to help artists and users understand what the current crop of AI image generators do and how they work. I won’t dive into technical details. This explanation is “right enough” for a 5,000-foot discussion of how to work with the technology and address high-level concerns, such as whether training content is “stolen”.
I’ll be using the picture attached to this post for discussion. It was generated by xAI’s Grok using the following prompt:
Mona is a 6 year old white Boxer dog.
Johnny is a 2 year old Jack Russell Corgi mix. Johnny is black and white with brown highlights.
Eeyore is the popular children's book character.
Paul Bunyan is the popular children's book character.
Make a picture where these characters play a high stakes game of go fish to raise money for renovating the dog park.
I used a similar cue — “Make a story” — with a private IBM Granite 3 Instruct LLM running on my laptop to create a delightful story that the picture will accompany.
Delight and Realism
The first thing to notice about the picture is that it truly is a delightful depiction of a scene I had in mind. The Boxer looks passably like Mona, or at least her stunt double. The little dog looks amazingly like Johnny. More on that in a future post. Paul Bunyan and Eeyore are out of focus behind the main scene. Johnny and Mona seem to be playing a hybrid of Go and some card game. Literalist critics would derisively call that a “hallucination”, but to me, that’s delightful and funny. You are allowed to smile when you use these tools!
The second thing to notice is that it’s got a well-executed photorealistic feel to it. Obviously, it’s not real, but with costumes, stage-dog training, and fortunate timing, you could imagine capturing a shot like that. A Photoshop hack like me or even a professional would have a hard time putting that scene together seamlessly.
Visualization
The first conclusion to draw about the technology is that it is good at bringing ideas to life. If you can express the idea in a few sentences, it can probably render something close, even close enough. That makes it a powerful tool for both visualization and prototyping.
While it might feel wrong to use an AI image generator for a final, professional product, and might cast doubt on the workmanship that went into that product, it is quite valuable in the iterative process of drafting and refining. This applies to all kinds of activities that generative AI can do, from pictures to stories to analysis to code and beyond. Generative AI is excellent at providing samples and templates that can serve as placeholders for production work or even be adapted into finished products.
How It’s Made
The picture itself is created using a model and an algorithm. The model is a large collection of weights — billions of statistical measurements derived from billions of images in a training set. The algorithm is called diffusion. It starts with random pixels — think of the noise pattern on an old tube television, like the picture on the left, courtesy of Grok.


The algorithm iteratively changes the pixels so that the picture looks more and more like what was requested. The picture on the right is that same old tube television tuned into another picture of Mona and Johnny playing Go Fish. That mashup was assembled in Photoshop.
“Looks more and more like…” — that is the key! Again, at a high level, at each step between all noise and all picture, the algorithm “asks” the model whether the picture looks more like what the prompt describes. E.g., is there a white Boxer dog? Is there a Jack Russell Corgi mix, black and white with brown highlights? Are they playing Go Fish? The things that can answer these yes/no questions are called “classifiers”. They are part of the model.
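To make that loop concrete, here is a toy sketch of classifier-guided refinement in Python. It is emphatically not Grok’s implementation (that is proprietary, and real systems use deep neural networks operating on learned representations). The “image” here is just a small grid of numbers, and the stand-in classifier simply scores distance to a hypothetical target pattern. The shape of the process is the point: start with static, then repeatedly nudge the pixels in whatever direction makes the classifier say “yes” more strongly.

```python
# A toy sketch of classifier-guided refinement, not Grok's actual
# implementation. The "image" is a small grid of numbers, and the
# stand-in classifier just measures distance to a hypothetical target
# pattern; real systems use deep neural networks instead.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical stand-in for "what the prompt describes".
target = rng.uniform(0.0, 1.0, size=(8, 8))

def classifier_score(image):
    """Higher score means 'looks more like what was requested'."""
    return -np.mean((image - target) ** 2)

# Start with pure static, like an old tube television between channels.
image = rng.uniform(0.0, 1.0, size=(8, 8))

# Iteratively change the pixels so the score keeps rising.
step_size = 0.5
for step in range(500):
    gradient = -2.0 * (image - target) / image.size  # direction of "more yes"
    image = image + step_size * gradient

print(f"final score: {classifier_score(image):.6f} (0.0 is a perfect match)")
```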
When the model is trained on millions or billions of images, each image has a description, often in English, associated with it. It could also have a more structured description, with pre-assigned category names, etc. The original work of classifying images is, or was, done by humans, and it is often very low-paying work done by low-skilled workers. That may or may not give you pause, or paws if you’re a dog. The situation is more nuanced than any critical documentary shows.
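For a sense of what that training data looks like, here is a hypothetical sketch of a single training record. The field names and values are invented for illustration; real schemas vary by model builder, but the essential pairing of an image with human-supplied descriptions and category labels is the same.

```python
# A hypothetical training record, invented for illustration. Real
# schemas vary by model builder, but each image is paired with
# human-supplied descriptions and/or pre-assigned category names.
training_record = {
    "image_file": "photo_000123.jpg",
    "caption": "a white Boxer dog napping on a porch in the sun",
    "categories": ["dog", "boxer", "white", "outdoors"],
}
# Classifiers are trained against these pairings: images tagged "boxer"
# become "yes" examples for a Boxer classifier, everything else "no".
```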

To get a sense of how the classifiers work, consider the image at left. That’s not a Boxer. It’s a white Siberian Husky. As the picture changes from noise to a finished image, a Boxer classifier would keep it from evolving this way. But a Siberian Husky classifier would move the picture to a result like this one.
Notice what Grok did with “go fish” for this image: someone has cut the heads off all the fish and cleaned them! There is a lot of delight in the ambiguities of our English language. Image and language models are more than happy to surface them!
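The Boxer-versus-Husky steering above is easy to demonstrate in a toy model. In the sketch below (same caveats as before: stand-in scorers, not real neural networks), two copies of the same starting static are each driven by a different classifier, and they land on different results.

```python
# A toy illustration of how different classifiers steer the same
# starting static to different results. The "ideal" patterns and the
# steering rule are stand-ins, not real neural networks.
import numpy as np

rng = np.random.default_rng(seed=7)
boxer_ideal = rng.uniform(size=16)  # hypothetical "perfect Boxer"
husky_ideal = rng.uniform(size=16)  # hypothetical "perfect Husky"

def steer(image, ideal, steps=100, rate=0.1):
    """Nudge the picture toward whatever this classifier says 'yes' to."""
    for _ in range(steps):
        image = image + rate * (ideal - image)
    return image

static = rng.uniform(size=16)                 # one shared noise pattern
as_boxer = steer(static.copy(), boxer_ideal)  # Boxer classifier in charge
as_husky = steer(static.copy(), husky_ideal)  # Husky classifier in charge

print("Boxer run landed near the Boxer ideal:",
      np.abs(as_boxer - boxer_ideal).mean() < 0.01)
print("Husky run landed near the Husky ideal:",
      np.abs(as_husky - husky_ideal).mean() < 0.01)
```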
Elements are Random, not Copied or Directly Computed
The most important thing to know about the generated images is that the elements — such as Mona, Johnny, Eeyore, the fish, etc. — are randomly generated to pass the classifier tests, not copied from any of the original training images. If a classifier’s “yes” set covers a large, diverse subset of the training images, there will be plenty of unique variation in the generated elements. Each element is better thought of as a freehand interpretation of all the “yes” images for its classifiers than as a “composite” or “average”. They are most definitely not copies in production-scale models.
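One more toy sketch drives home the randomness. Each run starts from its own random static (the seed), and the refinement stops while that starting randomness still shapes the result, so two runs of the same prompt yield two different, one-of-a-kind pictures. As before, everything here is a stand-in for illustration; note that there is no table of training images anywhere to copy from.

```python
# A toy sketch of why the same prompt yields a different picture on
# every run: each run begins from its own random static, and that
# starting point still shapes the final result. Stand-in code only;
# there is no table of training images to copy from anywhere in here.
import numpy as np

def generate(prompt_target, seed, steps=20, rate=0.1):
    rng = np.random.default_rng(seed)
    image = rng.uniform(size=prompt_target.shape)       # unique starting static
    for _ in range(steps):
        image = image + rate * (prompt_target - image)  # refine toward prompt
    return image  # close enough to pass, but shaped by its own random start

prompt_target = np.random.default_rng(seed=0).uniform(size=(4, 4))
run_a = generate(prompt_target, seed=1)
run_b = generate(prompt_target, seed=2)

print("both runs satisfy the prompt, yet differ:",
      not np.allclose(run_a, run_b))
```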
Creative people and copyright holders are genuinely concerned about unlicensed use of their images for training these models. The argument of the model makers is that US copyright law enumerates the rights a creator has, and doing math on an image to help build a classifier is not one of those rights. Model builders point out that the images themselves do not become part of the model and are unlikely to be generated by the model.
Creative people also argue that these generators compete with their services. While proponents have no legal need to make a case against replacement, they point out something you can fairly conclude from the example in this post. Nobody — not even I — am going to pay for a visualization of Johnny, Mona, Eeyore, and Paul Bunyan playing Go Fish. It is not an activity worth the time of a creative person.
I side (vociferously) with the model builders, as you might surmise from my description of how the process works. Copyright is intended to be a limited right, and we need to be vigilant to keep it limited. You’re welcome to pick whichever side resonates with you. The technical legal issues are working their way through the courts as you read this.
Now… Just because the model itself and the process of making it don’t violate copyright doesn’t mean you can’t violate copyright by using the model. If you publish a generated image that contains a protected character, such as Sonic the Hedgehog, you may be (read: “are likely”) violating the copyright of Paramount Pictures or the original video game developer. And that’s a good reason to limit generated images to your own personal, non-commercial use or to prototypes that won’t be distributed to the public. You’ll notice that I use characters that belong to me (my dogs) or characters that are now in the public domain, like Eeyore and Paul Bunyan. Standard “I am not a lawyer” disclaimer applies to this whole section.
Let’s Discuss!
I hope this very high-level explanation has helped you understand how image generators like Grok work and why they can produce so much pure delight. The details are very interesting too, but you don’t need them to have a sense of what these systems do to create delightful images.
I’ve posted a link to this post on X, and invite you to go there and discuss this article with me if you are so moved!
-Brad