Fun With Diffusion Models!

Derry Xu

CS 180, Fall 2024

Part 1: The Power of Diffusion Models

During this part of the project, we experimented with pre-trained diffusion models, specifically the DeepFloyd IF diffusion model and its denoisers.

Trying out DeepFloyd

To begin the project, we generate three images with DeepFloyd's two-stage pipeline at different numbers of inference steps, to get an idea of the quality of output we might expect from DeepFloyd and how the number of inference steps affects that quality. For reproducibility, we use random seed 42 throughout the first part of the project.

snow
"An oil painting of a snowy mountain village" - 20 inference steps
man
"A man wearing a hat" - 20 inference steps
rocket1
"A rocket ship" - 10 inference steps
rocket2
"A rocket ship" - 20 inference steps
rocket3
"A rocket ship" - 40 inference steps

In general, more detailed prompts seemed to give better results. In my opinion, the oil painting looks best and the rocket ship looks worst, which makes sense given how much more specific the painting prompt is (it names a medium and style) compared to the bare rocket prompt. Increasing the number of inference steps didn't necessarily make the rocket look better, but it did add a lot of complexity to the image: more details, a background, and so on.

Sampling Loops

After trying out the out-of-the-box DeepFloyd model, we specifically looked at the U-net denoising model that DeepFloyd depends on.

Forward Process

To begin, I implemented the forward process of diffusion, in which we gradually add noise to a clean image. The amount of noise is dictated by noise coefficients provided by DeepFloyd:

$$ x_t = \sqrt{\bar \alpha_t} x_0 + \sqrt{1 - \bar \alpha_t} \epsilon \quad \text{where } \epsilon \sim N(0, I) $$ Where $x_t$ is the noised image at timestep $t$, $x_0$ is the clean image, $\bar \alpha_t$ is the noise coefficient for timestep $t$, and $\epsilon$ is random noise.
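
As a rough sketch, this forward step is a few lines of PyTorch (the function name and the alphas_cumprod tensor holding the $\bar \alpha_t$ schedule are my own placeholders, not DeepFloyd's API):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t (sketch; alphas_cumprod holds the
    cumulative noise coefficients alpha_bar for every timestep)."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
    x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps
    return x_t, eps
```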

Below are some examples of the noising process for a test image over T = 1000 iterations.

clean
Original test image
250
Image after $t = 250$ noising steps
500
Image after $t = 500$ noising steps
750
Image after $t = 750$ noising steps

Classical Denoising

Having generated our noisy images, we can start playing around with different ways of denoising them. Earlier in the class we learned about using a Gaussian low pass filter to denoise images. Before using any of the fancier denoisers, we can give the Gaussian blur a try:
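
Concretely, this baseline is nothing more than a spatial low-pass filter; a minimal sketch (the kernel size and sigma below are arbitrary guesses, not tuned values):

```python
import torchvision.transforms.functional as TF

def blur_denoise(x_noisy, kernel_size=5, sigma=2.0):
    """'Denoise' by Gaussian low-pass filtering (classical baseline)."""
    return TF.gaussian_blur(x_noisy, kernel_size=kernel_size, sigma=sigma)
```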

clean
Original test image
250
Image after $t = 250$ noising steps
gauss
Noisy image passed through Gaussian filter
clean
Original test image
500
Image after $t = 500$ noising steps
gauss
Noisy image passed through Gaussian filter
clean
Original test image
750
Image after $t = 750$ noising steps
gauss
Noisy image passed through Gaussian filter

Evidently, the performance is not very strong: the blur only suppresses noise by smearing away the image's detail along with it.

One-Step Denoising

Now, we can try DeepFloyd's pretrained U-Net denoiser directly: given the noisy image $x_t$ and the timestep $t$, the U-Net predicts the noise $\epsilon$, and we can plug that estimate into the forward-process equation and solve for $x_0$ in a single step.
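
A sketch of this one-step recovery, assuming a hypothetical unet call that returns the predicted noise (the real DeepFloyd call looks different):

```python
import torch

def one_step_denoise(unet, x_t, t, alphas_cumprod):
    """Estimate the clean image in a single step by predicting the noise and
    inverting the forward-process equation (sketch)."""
    alpha_bar = alphas_cumprod[t]
    with torch.no_grad():
        eps_hat = unet(x_t, t)  # stand-in for the pretrained denoiser call
    # x_t = sqrt(a_bar) x0 + sqrt(1 - a_bar) eps  =>  solve for x0
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha_bar)
    return x0_hat
```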

clean
Original test image
250
Image after $t = 250$ noising steps
onestep
Noisy image passed through U-net
clean
Original test image
500
Image after $t = 500$ noising steps
onestep
Noisy image passed through U-net
clean
Original test image
750
Image after $t = 750$ noising steps
onestep
Noisy image passed through U-net

Evidently the performance is significantly better, but at higher noise levels, the original image tends to be somewhat lost.

Iterative Denoising

In practice, diffusion models denoise iteratively rather than in one shot. To speed things up and save on computation, we use a strided iterative denoising process, where instead of denoising through all $T = 1000$ timesteps, we can skip a few steps by defining a new set of "strided" timesteps that correspond to certain intermediate noisy images.

Then, to update from the strided timestep $t$ to the next (slightly cleaner) strided timestep $t'$, we use the following formula: $$ x_{t'} = \frac{\sqrt{\bar \alpha_{t'}} \beta_t}{1 - \bar \alpha_t} x_0 + \frac{\sqrt{\alpha_t} (1 - \bar \alpha_{t'})}{1 - \bar \alpha_t} x_t + v_\sigma $$ Where $\alpha_t = \frac{\bar \alpha_t}{\bar \alpha_{t'}}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current one-step estimate of the clean image, and $v_\sigma$ is some random noise. The rest of the variables are the same as in the forward-process equation above.
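
One strided update step might look like the following sketch (variable names are my own; x0_hat is the one-step clean estimate from the previous section):

```python
import torch

def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod):
    """One strided denoising update from timestep t to the cleaner timestep t'
    (sketch; x0_hat is the current one-step estimate of the clean image)."""
    a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = a_bar_t / a_bar_tp
    beta = 1 - alpha
    x_tp = (torch.sqrt(a_bar_tp) * beta / (1 - a_bar_t)) * x0_hat \
         + (torch.sqrt(alpha) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t
    return x_tp  # the extra variance term v_sigma is omitted in this sketch
```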

Implementing the iterative process gives us the following results:

t690
Noisy image at $t = 690$
t540
Noisy image at $t = 540$
t390
Noisy image at $t = 390$
t240
Noisy image at $t = 240$
t90
Noisy image at $t = 90$

Comparing each of the denoising methods:

clean
Original image
gauss
Gaussian blurred image
onestep
One step denoised image
iterative
Iterative denoising using stride = $30$

It seems clear that iterative denoising (starting from the noisy image at $t = 990$) performed the best.

Diffusion Model Sampling

Our iterative denoiser can also work as an image generator: we simply start from pure random noise and run the full denoising loop, conditioned on the prompt "A high quality image".
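
As a sketch, generation is just the same strided loop started from pure Gaussian noise (the helpers are the hypothetical sketches from the previous sections, and the 64x64 size matches DeepFloyd's stage-1 resolution):

```python
import torch

# strided_timesteps is assumed to be ordered from noisiest (t = 990) to cleanest.
x = torch.randn(1, 3, 64, 64)
for i in range(len(strided_timesteps) - 1):
    t, t_next = strided_timesteps[i], strided_timesteps[i + 1]
    x0_hat = one_step_denoise(unet, x, t, alphas_cumprod)  # current clean estimate
    x = iterative_denoise_step(x, x0_hat, t, t_next, alphas_cumprod)
```

Here are five sample images from this process: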

1
Sample 1
2
Sample 2
3
Sample 3
4
Sample 4
5
Sample 5

Some of these images look decent, like sample 5, but generally they aren't great in quality.

Classifier-Free Guidance (CFG)

To improve the images, we can use a technique called Classifier-Free Guidance, where we generate both a conditional and an unconditional noise estimate and combine them into a final noise estimate using the formula: $$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$ Setting $\gamma$ greater than 1 pushes the noise estimate beyond the unconditional one, in the direction of the difference between the conditional and unconditional estimates. The concept feels similar to the caricatures idea from a previous project.
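
A sketch of the CFG noise estimate (the unet call signature and the default $\gamma$ are assumptions on my part):

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: blend the unconditional and conditional
    noise estimates, pushing past the unconditional one when gamma > 1."""
    eps_u = unet(x_t, t, uncond_emb)  # conditioned on the empty/null prompt
    eps_c = unet(x_t, t, cond_emb)    # conditioned on the actual prompt
    return eps_u + gamma * (eps_c - eps_u)
```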

Using Classifier-Free Guidance with our iterative denoiser, we can produce some better quality photos:

1
Sample 1
2
Sample 2
3
Sample 3
4
Sample 4
5
Sample 5

Image-to-image Translation

Now that we have our CFG iterative denoiser, we can do some pretty neat things with it, such as editing a photo. The idea is to add some noise to the original photo using the forward process, then run the iterative denoiser on the noisy result to get a new photo. The more noise we add (equivalently, the lower the start index into our strided timestep list), the less the final photo resembles the original.
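
A sketch of the editing procedure (essentially the SDEdit idea), reusing the hypothetical forward helper from earlier and a hypothetical iterative_denoise_cfg loop:

```python
def edit_image(x_orig, i_start, strided_timesteps, alphas_cumprod, prompt_emb):
    """Noise the original photo to the timestep at position i_start, then run
    the CFG iterative denoiser from that point onward (sketch)."""
    t_start = strided_timesteps[i_start]
    x_t, _ = forward(x_orig, t_start, alphas_cumprod)  # partially noise the photo
    return iterative_denoise_cfg(x_t, i_start, prompt_emb)
```

Here are a few examples: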

1
Start index = 1 (Most noise added)
1
Start index = 3
1
Start index = 5
1
Start index = 7
1
Start index = 10
1
Start index = 20 (Least noise added)
1
Original image
1
Start index = 1 (Most noise added)
1
Start index = 3
1
Start index = 5
1
Start index = 7
1
Start index = 10
1
Start index = 20 (Least noise added)
1
Original image
1
Start index = 1 (Most noise added)
1
Start index = 3
1
Start index = 5
1
Start index = 7
1
Start index = 10
1
Start index = 20 (Least noise added)
1
Original image

Drawing and Cartoon Edits

The above process is more interesting when applied to non-realistic photos, such as drawings or cartoons pulled from the web. In both cases, in theory the edit could make the original image more realistic.

1
Drawing
1
Edits (most noise on left)
1
Drawing
1
Edits (most noise on left)
1
Snoopy
1
Edits (most noise on left)

Inpainting

We can also mask out a certain section of an image and have the diffusion model generate new content only in that section: at each denoising step, we coerce every pixel outside the mask back to the original image (noised to the appropriate level), so only the masked region is actually replaced.
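
The per-step correction is simple; a sketch (mask is 1 where new content should be generated, and forward is the hypothetical noising helper from earlier):

```python
def inpaint_step(x_denoised, x_orig, mask, t_next, alphas_cumprod):
    """After each denoising update, keep the generated pixels only inside the
    mask and re-noise the original image everywhere else (sketch)."""
    x_orig_noised, _ = forward(x_orig, t_next, alphas_cumprod)
    return mask * x_denoised + (1 - mask) * x_orig_noised
```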

1
Original image
1
Mask
1
Area to replace
1
Inpainted image
1
Original image
1
Reduced to appropriate size, mask, and area to replace
1
Inpainted image
1
Original image
1
Reduced to appropriate size, mask, and area to replace
1
Inpainted image
1
Original image
1
Reduced to appropriate size, mask, and area to replace
1
Inpainted image

Text-Conditional Image-to-image Translation

We can also condition our diffusion model on different natural-language prompts (up to this point we had been conditioning on "A high quality image"). Below are some examples, each run at several noise levels; the more noise we add, the further the result drifts from the original image and toward the text prompt.

1
Campanile conditioned on "a rocket ship"
1
Derry conditioned on "a photo of a hipster barista"
1
Drawing conditioned on "an oil painting of people around a campfire"

Visual Anagrams

One of the cooler things we can do is create visual illusions: images that appear to be one thing right side up, but turn into something completely different when flipped upside down.

To have our diffusion model create these images, at each timestep we predict the noise of the image conditioned on one text prompt, then flip the image and predict the noise conditioned on the other text prompt, flipping that estimate back afterwards. The two estimates are averaged to get the final noise estimate used for denoising: $$ \begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1)\\ \epsilon_2 &= \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))\\ \epsilon &= (\epsilon_1 + \epsilon_2)/2 \end{aligned} $$
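
A sketch of this combined estimate (the unet call signature is an assumption; torch.flip along the height dimension implements the upside-down flip):

```python
import torch

def anagram_noise(unet, x_t, t, p1_emb, p2_emb):
    """Average the noise predicted for the upright image under prompt 1 with
    the flipped-back noise predicted for the flipped image under prompt 2."""
    eps1 = unet(x_t, t, p1_emb)
    eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, p2_emb), dims=[-2])
    return (eps1 + eps2) / 2
```

Below are some examples of visual anagrams we can create: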

1
Right side up, it's a group of people around a campfire
1
Upside down, it's an old man
1
Right side up, it's a coastline
1
Upside down, it's a dog with a flower crown
1
Right side up, it's a snowy winter town
1
Upside down, it's a man in a hat

Hybrid Images

In a similar vein to the previous part, we can create hybrid images that look like one thing from close up and another from far away (or when you blur your eyes).

To achieve this, we generate two noise estimates, each conditioned on one of the prompts. We high-pass the estimate for the prompt we want to see up close, low-pass the estimate for the prompt we want to see from far away, and add the two together to form the final noise estimate.
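
A sketch of the combined estimate, using a Gaussian blur as the low-pass filter (the unet call signature and the filter parameters are assumptions):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, far_emb, near_emb, kernel_size=33, sigma=2.0):
    """Low-pass the 'far away' prompt's noise estimate, high-pass the
    'close up' prompt's estimate, and sum them."""
    eps_far = unet(x_t, t, far_emb)
    eps_near = unet(x_t, t, near_emb)
    low = TF.gaussian_blur(eps_far, kernel_size, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size, sigma=sigma)
    return low + high
```

Here are some examples: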

1
Close up, it's a waterfall, far away, it's a skull
1
Close up, it's a dog, far away, it's a pencil
1
Close up, it's a coast, far away, it's an old man

Part 2: Diffusion Models from Scratch!

In the previous section, we used a pre-trained DeepFloyd model. In this section, we build and train our own U-Net on the MNIST dataset of handwritten digits.

Training a Single-Step Denoising U-Net

We begin by building a simple U-Net architecture as outlined in the project spec. The model takes in a 1x28x28 image, lifts it to a hidden channel dimension (for this first part we use $D = 128$), and passes it through a series of downsampling, upsampling, and convolutional blocks (plus skip connections implemented with tensor concatenation), learning kernels that let it denoise the image back into a clean digit.

To train this U-Net, we need noisy images, which we can generate by adding Gaussian noise: $z = x + \sigma \epsilon$, where $\epsilon \sim N(0, I)$ and $\sigma$ controls the strength of the noise. Below are a few different noise levels visualized:

1

We choose to work with $\sigma = 0.5$ and train the U-Net on MNIST with hidden dimension $D = 128$ and an Adam optimizer with learning rate $10^{-4}$. Here is the training loss plot over 5 epochs:

1
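
Each training step is just an L2 regression of the clean digit from its noised version; a minimal sketch (unet, optimizer, and dataloader are assumed to already be set up):

```python
import torch
import torch.nn.functional as F

sigma = 0.5
for x, _ in dataloader:                  # batches of clean MNIST digits
    z = x + sigma * torch.randn_like(x)  # z = x + sigma * eps
    loss = F.mse_loss(unet(z), x)        # the denoiser regresses the clean image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```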

And below we can see the results of applying the U-Net after 1 epoch and after 5 epochs of training.

1
Results after 1 epoch. Top row: original images. Middle row: noised with $\sigma = 0.5$. Bottom row: U-Net output.
1
Results after 5 epochs. Top row: original images. Middle row: noised with $\sigma = 0.5$. Bottom row: U-Net output.

The U-Net was trained only on images noised with $\sigma = 0.5$, so we can test it on out-of-distribution examples at various other $\sigma$ levels:

1
$\sigma = $ 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0, from left to right

Training a Diffusion Model

Now that we have a basic U-Net, we can use it to create a diffusion model.

Adding Time-Conditioning to UNet

From the earlier parts of the project, we know that one-shot denoising is less effective than iterative denoising. That means we need to introduce a dependency on the denoising timestep $t$ into the U-Net.

We do so by introducing two FCBlocks that take $t$ as input and whose outputs are added to intermediate blocks during the upsampling stage.

We also need a set of noise coefficients $\bar \alpha_t$, derived following the procedure in the project spec. Finally, our loss is the MSE between the predicted noise and the actual noise added in the forward process. Implementing these changes and training the U-Net, we get the following training loss curve:

1
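
For concreteness, one training iteration for the time-conditioned U-Net might look roughly like this sketch (the number of timesteps T and the t/T normalization passed to the network are assumptions based on my reading of the spec):

```python
import torch
import torch.nn.functional as F

def train_step(unet, x0, alphas_cumprod, optimizer, T=300):
    """Sample a random timestep per image, noise the batch with the forward
    process, and regress the injected noise (sketch)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps  # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)            # predict the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```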

We also define a sampling procedure similar to the one from the earlier parts, except that the added noise term $v_\sigma$ is scaled by $\sqrt{\beta_t}$. Using this sampler, we can produce the following digits after 5 and 20 epochs of training.

1
Generated digits after 5 epochs
1
Generated digits after 20 epochs

Adding Class-Conditioning to UNet

Clearly, the digits from the previous section weren't great. To improve the performance of our U-Net, we can also condition on class, i.e. the kind of digit we are generating. Since there are exactly 10 possible digits, we can pass a length-10 one-hot encoded vector to our U-Net, essentially allowing it to learn to denoise differently for each digit.

To do so, we add two more FCBlocks to our architecture, which take the one-hot encoded vector and produce tensors that are multiplied into existing blocks during the upsampling stage, conditioning the output on the digit class we pass in.

For both training and sampling, we also use conditioning dropout: 10% of the time we pass in a zero vector instead of the one-hot encoded vector, effectively removing the class conditioning for that particular pass.
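
A sketch of how the conditioning vector with dropout might be built (function and parameter names are my own):

```python
import torch
import torch.nn.functional as F

def make_class_vector(labels, p_uncond=0.1):
    """One-hot encode the digit labels, then zero out the whole vector about
    10% of the time so the network also learns an unconditional mode."""
    onehot = F.one_hot(labels, num_classes=10).float()
    drop = (torch.rand(labels.shape[0], 1, device=labels.device) < p_uncond).float()
    return onehot * (1.0 - drop)
```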

These modifications result in the following training loss plot:

1

And now after 5 and 20 epochs, we get significantly better digits!

1
Generated digits after 5 epochs
1
Generated digits after 20 epochs