Although I started using Stable Diffusion quite early, I only scratched the surface: setting up the local environment, drawing some images, checking out the API, and integrating it into the GeekAI Creative Assistant. Then... nothing more. I rarely use AI drawing in my daily work, and when I do, I usually rely on MidJourney.
Last week, to meet work requirements, I tried using ComfyUI to build an AI outfit-changing workflow and set up a local environment. However, I found ComfyUI more complex than the Stable Diffusion WebUI I had used before. The latter is straightforward: just input prompts, select a model, adjust parameters, and generate images. In ComfyUI, besides selecting a model, you also need to load CLIP models, a VAE, and other components I had previously only heard of by name.
Skipping so many foundational principles led to numerous operational issues, so I decided to spend some time relearning the basics of Stable Diffusion. My takeaway is that foundational knowledge is far more important than techniques. Without understanding the principles, learning techniques is laborious, and even if you master some AI drawing skills through extensive practice, those skills might not carry over when you switch to another tool. For example, if two AI drawing tools use different CLIP models, their prompt techniques might differ. If you grasp the underlying principles and basics, your skills transfer much faster.
This article compiles some notes I took while learning the basics of Stable Diffusion.
Neural Networks
First, let's understand the most fundamental structural unit of AI models: neural networks.
Don't be intimidated by technical jargon; you don't need to understand the technical details—just focus on the concept.
The image below represents the simplest computer neural network.
It is divided into three layers from left to right.
The first layer is the input layer, representing the input data; each dot in the second and third layers represents a neuron.
The second layer is called the "hidden layer."
The third layer is the "output layer."
Data is input, processed by neurons in the hidden layer, and then signals are passed to the output layer, where neurons further process them to make a final judgment.
The image below shows the inference process of a neural network for image recognition.
An image is input, its feature parameters are extracted, and these parameters are fed into the neural network. After processing by a series of hidden layer neurons, the output neurons make a judgment.
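To make this concrete, here is a minimal PyTorch sketch of such a network for digit recognition. The layer sizes are illustrative assumptions (a flattened 28x28 image in, 10 digit classes out), not taken from the figure:

```python
import torch
import torch.nn as nn

# A minimal three-layer network: input -> one hidden layer -> output.
model = nn.Sequential(
    nn.Linear(784, 128),  # input layer -> hidden layer (128 neurons)
    nn.ReLU(),            # each hidden neuron applies a non-linearity
    nn.Linear(128, 10),   # hidden layer -> output layer (10 neurons)
)

x = torch.rand(1, 784)          # a fake flattened 28x28 image
logits = model(x)               # signals flow left to right through the layers
prediction = logits.argmax(-1)  # the output neuron with the strongest signal wins
print(prediction)
```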
Deep Neural Networks
A "deep neural network" is, as the name suggests, a neural network with many layers—specifically, multiple "hidden layers" instead of just one.
If this is unclear, refer to the image below: the left side shows a simple neural network, and the right side shows a deep learning neural network.
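For contrast with the sketch above, a "deep" version simply stacks several hidden layers between input and output; the depths and widths here are arbitrary:

```python
import torch.nn as nn

# Same idea, but with multiple hidden layers instead of just one.
deep_model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # hidden layer 1
    nn.Linear(256, 128), nn.ReLU(),   # hidden layer 2
    nn.Linear(128, 64),  nn.ReLU(),   # hidden layer 3
    nn.Linear(64, 10),                # output layer
)
```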
AI Large Models
An AI large model is essentially a deep neural network. However, this network has an enormous number of parameters and a great many stacked layers of neurons, making it highly complex.
Two Characteristics of Large Models
The first characteristic is... size. Large models have an enormous number of parameters—for example, GPT-3.5 has 175 billion parameters.
Along with the parameter count, the hidden layers are also numerous. Because of the sheer number of parameters and hidden layers, the intermediate reasoning process is essentially a black box: you can't figure out why the model outputs a particular result. This is somewhat similar to how the human brain works. For instance, when you see an image, you instantly recognize it as inappropriate, but you can't explain why; your brain's neural network just makes that judgment. This is quite fascinating.
The second characteristic of large models is unstable output. We now know that the core function of neural networks is prediction. Since large models are essentially neural networks, their sole capability is also prediction. Take large language models as an example: their output is akin to a word-chain game. Your prompt is fed into the neural network, which predicts the next word and returns it. Here's an example:
The inference process roughly works like this:
- The prompt "hello" is fed (possibly after some preprocessing) into the neural network, which predicts the next word (token): "Hello!"
- The prompt and the first result are input into the model, which predicts the second word as "How."
- "Hello! How" is then input, and the model predicts the next word as "can."
- This repeats until a complete response is generated.
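A toy sketch of this word-chain loop is shown below. The `predict_next_token` function is a hard-coded stand-in for a real neural network; it exists only to show the shape of the loop:

```python
# Toy autoregressive generation: predict one token, append it, feed it back in.
def predict_next_token(tokens: list[str]) -> str:
    # A canned lookup that pretends to be a trained model.
    canned = {"hello": "Hello!", "Hello!": "How", "How": "can", "can": "I", "I": "help?"}
    return canned.get(tokens[-1], "<end>")

tokens = ["hello"]                               # the user's prompt
while True:
    next_token = predict_next_token(tokens)      # predict one token at a time
    if next_token == "<end>":
        break
    tokens.append(next_token)                    # feed the growing sequence back in
print(" ".join(tokens[1:]))                      # "Hello! How can I help?"
```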
However, because large models have so many hidden layers, it's impossible to ensure that every hidden layer produces the same result in every inference. Thus, predictions aren't always consistent. For example, if you pass by a breakfast stall daily and greet the owner, he might usually reply with "Good morning," but occasionally say, "Have you eaten?"
If this is still unclear, consider a simple neural network for handwritten digit recognition with dozens of hidden layers. The image below illustrates this:
Simplifying the logic: if a signal enters the input layer and exits from output port 0, it's digit 0; if it exits from port 1, it's digit 1. Now, suppose an image signal takes a wrong turn at the 10th hidden layer: instead of exiting from port 1 (the digit 1), it exits from port 7 (the digit 7).
AI Drawing Large Models
Let's quickly review some common technical terms:
- Diffusion Model: Generates images directly in pixel space, which requires significant computational resources and is slow. Diffusion-based products include Stable Diffusion, DALL-E, and MidJourney. Diffusion models aren't limited to images; most audio and video generation models also use this technology.
- Latent Diffusion Model: An improved version of the Diffusion Model, introducing an intermediate layer called the "Latent Space." Images are compressed and reduced in dimensionality before computation (e.g., from 512x512 to 64x64), significantly reducing computational load.
- Stable Diffusion: An image-generation model built on the Latent Diffusion architecture. Note that "Stable" refers to the company Stability AI, not to the model's stability.
- Flux: Developed by Black Forest Labs (a team founded by former Stability AI researchers), it's based on a hybrid of multimodal and parallel diffusion Transformer blocks. It's currently the most powerful open-source image-generation model, outperforming even top commercial models like MidJourney v6.0 in its official tests.
- Stable Diffusion WebUI: A web application based on Stable Diffusion, simplifying installation and configuration with an easy-to-use interface and plugin support. It's currently the most popular Stable Diffusion application.
- ComfyUI: Another powerful web application based on Stable Diffusion. Its standout feature is node-based workflows, allowing precise customization and reproducibility for complex drawing tasks. ComfyUI also optimizes image generation speed and supports models like Flux alongside Stable Diffusion.
Stable Diffusion Principles
The image below shows a simplified workflow of Stable Diffusion's image generation:
- The user inputs a prompt (e.g., "a dog running in the park").
- The prompt is converted into a vector by a text encoder (CLIP model) and fed into the Diffusion model.
- Image features are generated in the latent space.
- An image decoder (VAE) converts latent space features back into an image.
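Libraries such as Hugging Face's diffusers wrap this whole workflow into a single call. Below is a minimal sketch, assuming an SD 1.5 checkpoint and a CUDA GPU; the model ID and parameter values are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

# The pipeline bundles the CLIP text encoder, the U-Net diffusion model,
# and the VAE decoder described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a dog running in the park",
    num_inference_steps=25,   # denoising steps
    guidance_scale=7.5,       # CFG scale, covered later in this article
).images[0]
image.save("dog.png")
```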
Diffusion Model Principles
The image below illustrates the training workflow of a Diffusion model:
Given an image, random noise is added repeatedly until only noise remains. Then, a neural network (U-Net) is trained to reverse the noise step-by-step to restore the original image.
Why add noise instead of removing pixels? Removing pixels would lose image information.
Why add noise gradually instead of all at once? This helps the model learn image features more effectively. Typically, noise isn't added uniformly—it's added sparingly at first to preserve features, then more heavily later to save time.
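As a sketch, the noise-adding (forward) step used in DDPM-style training looks like the following; the schedule values are common defaults, not anything specific to this article:

```python
import torch

# Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
# The betas define how much noise each step adds; early steps add little, later ones more.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # common linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative "how much signal is left"

def add_noise(x0: torch.Tensor, t: int):
    noise = torch.randn_like(x0)
    x_t = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise
    return x_t, noise   # the U-Net is trained to predict `noise` given (x_t, t)

x0 = torch.randn(1, 4, 64, 64)   # a latent-space "image" (see the VAE section below)
x_t, target_noise = add_noise(x0, t=500)
```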
The restoration process works like this:
- A random image is selected from training data, and noise is added (e.g., 50 steps' worth).
- The noisy image and step count (50) are input into the U-Net network to predict the noise.
- The predicted noise is subtracted from the noisy image to restore the original.
Since noise prediction isn't perfect, the initial result is a blurry outline. This blurry image is treated as the original, and the process repeats with reduced noise (e.g., 49 steps) until a clear image emerges. This explains why Stable Diffusion and MidJourney images start blurry and gradually sharpen.
Notably, the "iteration steps" parameter in SD-WebUI is exactly this denoising step count. More steps mean finer details but longer processing. For Stable Diffusion, 20-30 steps strike a good balance between quality and resource usage. Flux's Schnell model achieves this in just 4 steps, which is impressive.
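Putting the pieces together, the restoration is just a loop. This is a conceptual sketch in which `unet` is a dummy stand-in for a trained noise-prediction network, and the "subtract a fraction" update is a simplification of what real samplers (DDIM, Euler, etc.) do:

```python
import torch

# `unet` is a dummy stand-in; a real model would return its guess of the noise in x at step t.
def unet(x: torch.Tensor, t: int) -> torch.Tensor:
    return torch.zeros_like(x)

def denoise(steps: int = 25) -> torch.Tensor:
    x = torch.randn(1, 4, 64, 64)         # start from pure random latent noise
    for t in reversed(range(steps)):
        predicted_noise = unet(x, t)      # predict the noise at this step
        x = x - predicted_noise / steps   # remove a fraction of it; real samplers
                                          # use a proper schedule here
    return x                              # ideally a clean latent image after the loop

latents = denoise(steps=25)
```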
Text-to-Image in Stable Diffusion
The process is similar to image restoration:
- A random noise image is generated based on the "seed" parameter.
- The "prompt" is input into the CLIP model and converted into a text feature vector.
- The text features, noise image, and step count are fed into the U-Net network for noise prediction.
- The noise is subtracted to restore the image.
However, the result may only vaguely resemble the prompt. To enhance accuracy, "Classifier Free Guidance" (CFG) amplifies the prompt's influence. Here's how it works:
- The U-Net predicts two noises: one with the prompt and one without.
- The difference between them is the "guided noise," which is amplified (typically by 7.5x).
- The noise image subtracts this amplified guided noise.
If negative prompts are provided, their noise is also subtracted to steer results away from undesired outputs. This process repeats until the final image is generated.
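In code, CFG at each denoising step boils down to one line. This sketch assumes the two U-Net predictions are already available:

```python
# Classifier Free Guidance, applied at every denoising step (sketch):
#   noise_uncond: the U-Net's prediction with an empty (or negative) prompt
#   noise_cond:   the U-Net's prediction with the user's prompt
def apply_cfg(noise_uncond, noise_cond, guidance_scale: float = 7.5):
    # Push the unconditional prediction toward the prompt by the amplified
    # "guided noise" (the difference between the two predictions).
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

A guidance scale of 1 effectively disables the guidance, while very high values force the image to follow the prompt at the cost of naturalness.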
CLIP Model
CLIP (Contrastive Language-Image Pre-training) converts user prompts into feature vectors for the latent space. Since CLIP only supports English, AI drawing tools typically require English prompts.
CLIP has two encoders:
- A text encoder for text-to-image.
- An image encoder for image-to-image.
Stable Diffusion initially used OpenAI's CLIP model, trained on 400 million image-text pairs via contrastive learning. While OpenAI's CLIP is open-source, its training data isn't. Stable Diffusion 2 switched to OpenCLIP, a fully open-source alternative.
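A sketch of turning a prompt into CLIP text features with the transformers library; the model ID below is the CLIP ViT-L/14 encoder that SD 1.x uses:

```python
from transformers import CLIPTokenizer, CLIPTextModel

# SD 1.x conditions the U-Net on features from OpenAI's CLIP ViT-L/14 text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a dog running in the park",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
features = text_encoder(tokens.input_ids).last_hidden_state
print(features.shape)  # torch.Size([1, 77, 768]): one 768-dim vector per token slot
```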
VAE Model
VAE (Variational Autoencoder) consists of an encoder and a decoder. The encoder compresses an image into latent space features, where computations occur before the decoder reconstructs the image.
Stable Diffusion uses VAE's encoder to compress images into latent space for training. The diffusion model's inputs and outputs are latent space features, not raw pixels. Random noise for generation is also created in latent space.
VAE reduces training and inference time by compressing images (e.g., 512x512 → 64x64). However, this sacrifices some image precision and detail.
Some models (e.g., SD1.5, SDXL) include VAE, while others (e.g., Flux) require a separate download.
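A sketch of the round trip through the VAE using diffusers' AutoencoderKL; the model ID, latent shape, and 0.18215 scaling factor follow SD 1.x conventions and should be treated as assumptions:

```python
import torch
from diffusers import AutoencoderKL

# Load the VAE that ships inside the SD 1.5 repository.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.rand(1, 3, 512, 512) * 2 - 1            # a fake RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # -> (1, 4, 64, 64), 8x smaller per side
    latents = latents * 0.18215                       # SD 1.x latent scaling factor
    decoded = vae.decode(latents / 0.18215).sample    # -> back to (1, 3, 512, 512)
```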
Stable Diffusion Structure Recap
- Training images are compressed into latent space via VAE.
- Noise is added based on a scheduler, and the noisy image is input into U-Net along with text features from CLIP.
- U-Net predicts noise, and the process repeats until a noise-free image emerges.
- VAE's decoder converts the result into a pixel image.
For text-to-image generation, only the reverse half of this process (steps 2-4) is needed, starting from random noise in the latent space rather than an encoded training image.
Fine-Tuning Models
Here are common fine-tuning methods for Stable Diffusion. Most of them adjust the U-Net network, either directly or through additional networks; Textual Inversion instead works on the text-embedding side.
Dreambooth
Directly adjusts U-Net parameters using new images and keywords. The entire modified model is saved, resulting in large files. The upside is standalone usage.
LoRA
Adds small low-rank weight layers alongside the U-Net's existing layers and trains only those to adjust outputs. Only the added layers are saved, keeping files small (tens to hundreds of MB). LoRA models depend on a base model and are ideal for style consistency (e.g., generating a character's portrait series).
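Conceptually, a LoRA layer keeps the original weights frozen and trains only two small low-rank matrices on the side. Here is a minimal sketch; the rank, scale, and initialization are illustrative, and real implementations live in libraries such as peft and diffusers:

```python
import torch
import torch.nn as nn

# Conceptual LoRA layer: the base weight is frozen; only the two small low-rank
# matrices A and B are trained, so only they need to be saved and shared.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)   # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the low-rank correction applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(320, 320))
out = layer(torch.randn(2, 320))
```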
Textual Inversion
Also called embedding models, these work on the text-encoding side: they train a new embedding for the CLIP text encoder so that generated images carry specific attributes. For example, an embedding model for "Yagami" ensures generated images feature that character's face regardless of other prompts.
Embedding models are often used in negative prompts to avoid low-quality outputs (e.g., "low resolution, blurry, distorted features"). They're the smallest fine-tuning models (tens of KB).
ControlNet Model
ControlNet trains an additional neural network to adjust U-Net, enabling precise control over generated content by inputting extra information.
- Upload a reference image; ControlNet extracts specific "pose" information.
- This information is input into ControlNet's latent space network.
- ControlNet copies the original model, applies control conditions, and combines results for the final output.
Download ControlNet models here: https://huggingface.co/lllyasviel/sd_control_collection/tree/main.
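A usage sketch with diffusers follows; the model IDs are public checkpoints, while "pose.png" is a placeholder for an already-extracted pose map:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# The ControlNet checkpoint is loaded alongside the base SD model, not instead of it.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose = load_image("pose.png")  # placeholder: an already-extracted OpenPose skeleton map
image = pipe("a dancer on stage", image=pose, num_inference_steps=25).images[0]
image.save("dancer.png")
```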
Commonly used models include:
Contour-Based
Controls images via outlines and color blocks. Popular models include Canny (hard edges) and Lineart (sketches), often for portraits.
Depth-Based
Controls images via foreground/background depth. Typical models are Depth and NormalMap, useful for architectural styles.
Object-Based
Controls images via skeletal and facial features. The famous OpenPose model excels at controlling character poses and actions.
Repainting
Directly repaints reference images without visible feature extraction. Common models include Inpaint (local repainting) and Tile (tiling).
Tile can also enable lossless upscaling by dividing a large image into smaller tiles, repainting each, and merging them—saving computational resources.
For detailed ControlNet explanations, refer to: 14 ControlNet Official Models and Usage.