Image generated by Stable Diffusion with a prompt of “Octopus, on sand dune in the desert, blue sky and clouds, Canon EOS” and upscaled with GFPGAN.

Playing with Stable Diffusion

Sep 6, 2022

I’ve been intrigued over the years watching various machine learning techniques steadily progress. Unsurprisingly, some of the advancements that are most apparent are around artificial image generation.

For a long time, Google and others would release videos and blog posts showing what they were working on, and we mere mortals would watch in awe from a distance. Then DALL-E came along with a public beta, giving the general public an accessible tool, at least for anyone willing to pay the reasonable fees. Since it can take several tries to get the image you are looking for, these systems are definitely not free to use.

The next iteration in machine-generated imagery is here now, arriving as Stable Diffusion from Stability AI and CompVis (introduced on the Hugging Face blog: https://huggingface.co/blog/stable_diffusion). This is a pre-built, freely available model where all the hard and expensive work of training has already been done. Anybody with a modern, gaming-ready graphics card who is comfortable running a few commands in the terminal can get it up and running.

As tends to happen with useful open software, a community has already cropped up around this tool. The community has made Stable Diffusion easier to use, and has adapted it to multiple hardware platforms. Support was recently added for the Apple M1 chip and its integrated GPU, so I had to give it a try.

My verdict after 2 days of using it: I can’t remember the last time I have had this much fun!

If you aren’t familiar with these kinds of tools: you type a string of text as a prompt and get back a matching image. Want a picture of a bird? Type “bird” and hit enter. Want a picture of a bird on a fence? Type “bird on a fence”. It’s that simple.

Despite being commonly called “Artificial Intelligence”, this system isn’t particularly intelligent and doesn’t interpret images the way we do. You get strange, funny, and full-on “wtf?” stuff coming out, but through trial and error you can learn to massage the results to get something you are looking for or discover something new.

There are some people stretching the limits of what this tool can do, leveraging it as part of a larger workflow to build truly incredible works of art. Rather than spending my time on those techniques, in this post I am going to focus on raw output: what Stable Diffusion can produce all by itself, without any upscaling or compositing or anything else to assist with a final product.

For reference, I am using the lstein/stable-diffusion repo as my basis for running the project. I won’t go into detail on my setup since various instructional posts keep cropping up and improving the process, and the evolution in what is available will certainly keep going. I am using the dream.py script to generate these images, mostly using its default parameters – so this post is all about the prompt itself.
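For a rough sense of the workflow, a dream.py session looks something like the transcript below. The flags are from memory and may differ between versions of the repo, so treat this as a sketch and check the repo’s README for exact syntax:

```
# one-time setup, roughly following the repo's instructions
$ git clone https://github.com/lstein/stable-diffusion
$ cd stable-diffusion

# launch the interactive prompt, then type prompts at it
$ python scripts/dream.py
...
dream> "mountain, Ansel Adams" -n 20 -S 42
```

In the version I used, -n controls how many images to generate from the prompt and -S pins the random seed so a result can be reproduced.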

I ran 20 iterations of each prompt to produce the images here, and cherry-picked the best or most representative of them. 20 isn’t a large number – running several hundred would give a great chance of a truly spectacular result, but it’s informative to see some of the weaknesses of the model with only 20 tests to pick from as well.

Without further ado – let’s get started!

Let’s start with something simple – our prompt is simply “mountain”.

Incredible! One word as a prompt, and we get these images. Neither of them is going to win any awards in a photo gallery anywhere, but without a lot of time staring at them I would not know that these were machine-generated.

The image generator thrives on specificity, so let’s see how we can leverage that. There are a lot of ways to make your prompt more specific, and one of the most effective and straightforward is to name an artist or photographer. What better place to start than Ansel Adams?

Here’s what comes back with the prompt “mountain, Ansel Adams”:

It’s definitely reminiscent of the famous photographer! It’s a cheap imitation of the distinctive composition and the characteristic contrast – but it’s an imitation nonetheless. In an alternate universe, perhaps Half Dome, Zion National Park, or the Tetons look like this.

Let’s try something else while we are at it. Instead of photography, let’s go for a painting.

Here are the results of the prompt “mountain, colorful oil painting, thick brush strokes”:

Everything in our prompt was delivered handily. We can see thick brush strokes, and amazing color. The style is fun, with a variety of scenery – slopes, trees, water, and varying levels of abstraction inside the “thick brush strokes” instruction.

OK, let’s keep the mountain theme and add dogs. Who doesn’t love dogs in the outdoors? Let’s see what we can produce with “golden retriever, running up a mountain”:

If you showed these to somebody and asked what they see, the answer would almost certainly be “golden retrievers running up a mountain”. But if you look closely, they don’t seem quite right. The range of realism in these images was quite broad, with some of those not posted looking more like a mop than an animal of any kind. Here there are a few wisps of hair that aren’t quite right, the leg positions aren’t quite where they should be for a running dog, and so on.

Next let’s shift away from mountains, but stick with the golden retriever theme. One of the benefits of generating art rather than striving for photorealism is that there is definite latitude for imperfection. To demonstrate, let’s go with a prompt of “golden retriever, side profile portrait, watercolor”:

Some interesting results here. These definitely look like watercolors, they fit the intended description, and the model even saw fit to make it clear that one of these was a watercolor by putting the paper in clips!

Let’s keep going with the theme, and see if we can stray from the ordinary just a little bit. How will the AI choose to accessorize a dog? Let’s find out with “golden retriever wearing a bow, portrait, oil painting”:

It’s time to try something man-made; let’s try generating some medieval architecture. Here’s a simple prompt of “castle”.

The model actually struggled with this prompt more than any other. Odd things would happen like poor framing of the image subject, castles with an entire wall open, asymmetrical roofs, etc. I had to generate 60 images before I could cherry-pick enough to show a sampling here.

Applying some artistic influence helped before, so let’s try it again. This time we are going to try to influence the generation of these castles by using the name of a painter, to go for their specific style.

Here’s “castle Thomas Kinkade”:

I doubt Thomas Kinkade ever painted a castle, but if he did I bet it would look just like this!

Let’s keep the same approach, with a different artist. To go for something closer to a fantasy style, let’s try James Christensen with “castle James Christensen”:

Quaint, fantastical, volumetric, it all fits!

For our next experiments we are going to fully abandon nature. There’s a type of painting that I’ve always enjoyed: the classic “city street in the rain”. Let’s see how the model handles the prompt “city street in the rain, painting”:

I am liking the results, but I want to take the “painting of a rainy day in the streets” theme in a very specific direction. I want something that evokes an alleyway like you might find in Chinatown in a major city at night, or maybe in a cyberpunk movie set somewhere in Asia. Tall buildings with neon signs reaching skyward, etc.

If we want to go for something directly specified, Stable Diffusion supports img2img mode, where a base image is used as a starting point for the image generation. Everything we have done up to this point has been txt2img, naturally enough.

Generation using img2img doesn’t infer much from the base image beyond the general layout. If you have a base image of a herd of elephants on the African savannah and use a prompt involving zebras, the most likely outcome is that the elephants will be fully replaced by zebras. Keeping that in mind, we are going to find a street that roughly follows the layout we are hoping for – but we don’t need to worry about rain, or how the lights are arranged, or anything else specific like that in the base image.
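The mechanism behind img2img can be sketched in a few lines: instead of starting the denoising loop from pure noise, the base image is partially noised according to a strength setting, and the model then denoises from there, guided by the prompt. Here is a toy NumPy illustration of that noising step – not the real pipeline (which works on latents with a proper noise schedule), just the core idea; the function name and the simple linear blend are my own invention for illustration:

```python
import numpy as np

def noise_for_img2img(base_image: np.ndarray, strength: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Toy version of how img2img seeds generation: blend the base
    image with Gaussian noise. strength=0 keeps the base image intact;
    strength=1 is pure noise (effectively the same as plain txt2img)."""
    noise = rng.standard_normal(base_image.shape)
    return (1.0 - strength) * base_image + strength * noise

rng = np.random.default_rng(42)
base = np.ones((64, 64, 3))            # stand-in for a real photo
start = noise_for_img2img(base, 0.6, rng)
```

The strength setting is why the base image only constrains the general layout: at typical values, enough noise is mixed in that fine detail is destroyed and gets reinvented by the prompt, while the coarse arrangement of shapes survives.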

Our starter image, from an actual camera

I found a Creative Commons image (shown above), and used the prompt “rainy busy city streets at night with neon lights, oil painting” to see what would be generated. It’s definitely much more constrained and follows the general layout of the base image reasonably well.

Speaking of zebras and elephants… let’s see if we can generate something new with this tool. There aren’t very many photographs out there of elephants with zebra stripes, so if we can get the tool to produce them, we gain some insight into how it synthesizes these images, since we know the AI hasn’t been trained on pictures of zebra elephants.

After a small amount of experimentation, these images are the result of the prompt “elephant zebra, National Geographic”.

Prompts like “zebra stripes on an elephant” or “elephant with zebra stripes” produced approximately the same results as these. The system wasn’t really able to take an elephant and paint stripes on it – instead it picked and chose visual elements from a zebra and an elephant and merged them together. You can see that a little bit here with the hair, but giant zebra ears on an elephant were common, as was a full mane running down most of the back.

To close out this post (and the extent of my experience with Stable Diffusion over the last 2 days), we are going to try to generate some human faces. The curious part about generating faces is that there is a large body of work to train a model on, with the millions of headshots floating around online. On the other hand, our brains are exceptionally well-tuned to how faces should look, so we notice anything short of perfection.

There was quite a bit of experimentation to get to what you see here. My prompt is “head photo of a woman looking at the camera, black and white, Canon EOS 5D”. Mentioning a professional studio camera can help ensure that you get the kinds of pictures that tend to be taken with that camera (many of the photos the model was trained on carry that kind of information in their captions or metadata).

I also use the dream.py functionality to pass each image off to an additional GFP-GAN model, which helps “clean up” the faces.

Female images end up looking better than male images, possibly because there are more images available for training?

Are these images perfect? No, but they are pretty close. Most of them seem to have made it to the far side of the uncanny valley: real enough at a glance, if not quite all the way there. Stable Diffusion especially seems to struggle with hands, for example.

The skin is too smooth, there are no scars of any type, and they end up looking completely airbrushed. There are certainly post-production techniques that can be used to further improve these, but they aren’t bad as a starting point.

Interestingly, the type of face generated doesn’t vary much until you ask for it. At least for black and white photographs, dark, straight hair was the default. This is easy to adjust, though: a few of the images above have the phrase “blonde hair” in the prompt, and a few have “curly hair”.

It’s exciting to see the advancements in this tech and exciting to think about where it is going next – but it’s a blast to use right now.

If this is interesting to you, I hope you get a chance to use the model yourself!