
Computer Vision

Understanding Stable Diffusion

Overview

Stable Diffusion has become one of the most popular models for image generation and a go-to choice for developers. It is a latent diffusion model that generates images from text prompts, and it can also take an image together with text as input to generate new images.

Capabilities of Stable Diffusion

Stable Diffusion is a text-to-image model: given a text prompt, it produces an image.

Stable Diffusion Text to Image
Figure 1: Basic Workflow of Stable Diffusion

Stable Diffusion belongs to a class of deep learning models called diffusion models: generative models that produce new data similar to the data they were trained on. They are so named because their mechanics mirror the physical process of diffusion. Two processes are involved:

  1. Forward diffusion
  2. Reverse diffusion

Forward Diffusion

Forward diffusion is the process of adding noise to an image in small steps until it gradually becomes unrecognizable. It is similar to dropping ink on tissue paper: the ink eventually spreads out until the original drop is no longer distinguishable.

Forward Diffusion Process
Figure 2: Stable diffusion Forward diffusion process taken from here
Forward Diffusion Process
Figure 3: Drop of ink from the nib of the pen spreading on the tissue paper (AI Generated from LLama 3.2)
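To make the forward process concrete, here is a minimal PyTorch sketch of jumping directly to an arbitrary noising step. It assumes the standard DDPM-style closed form for the forward process; the schedule values and tensor shapes are purely illustrative.

```python
import torch

T = 1000                                        # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)           # a simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative product \bar{alpha}_t

def add_noise(x0: torch.Tensor, t: int):
    """Noise x0 directly to step t: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = torch.randn_like(x0)                  # Gaussian noise
    abar = alphas_cumprod[t]
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return xt, eps

# Example: noise a dummy 3x64x64 "image" in [-1, 1] to step 500 (heavily noised).
x0 = torch.rand(3, 64, 64) * 2 - 1
xt, eps = add_noise(x0, 500)
```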

Reverse Diffusion

Reverse diffusion is the opposite of forward diffusion: rather than adding noise, it gradually removes noise from an image until a clean image is recovered.

Reverse Diffusion Process
Figure 4: Stable diffusion Reverse diffusion process taken from here

Training process of Stable Diffusion

Adding noise is a simple process and does not require explicit training. But how do we get the original image back from a noisy one? We need to remove the noise from the image. To put it mathematically:

Reverse Diffusion Process Equation
Figure 5: Stable diffusion Reverse diffusion High Level Equation

So what we need is to predict the amount of noise that must be removed to recover the original, almost noiseless image. This is done by a noise predictor, which in Stable Diffusion is a U-Net model.
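The noise predictor is trained with a simple regression objective. The formulation below is the standard DDPM-style loss, written out here as an assumption based on the common formulation since Figure 5 only shows the idea at a high level: a network \(\epsilon_\theta\) is trained to predict the noise \(\epsilon\) that was added to a clean image \(x_0\) at step \(t\),

\[ \mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\big], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \]

where \(\bar{\alpha}_t\) comes from the noise schedule used in the forward process.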

U-Net Model

The U-Net is a widely used deep learning model for image segmentation. It was originally designed to address the challenge of limited training data in healthcare: the network can be trained on a smaller dataset while maintaining speed and accuracy.

The U-Net model consists of 2 paths:

  1. Contracting Path
  2. Expansive Path

The contracting path consists of encoders that capture the relevant information and encode it. The expansive path contains decoders that decode this information and, via skip connections, also use the features from the contracting path to generate a segmentation map.

U-net Model
Figure 6: U-net model taken from here

U-net Model Encoder
Figure 7: U-net model Encoder Architecture

U-net Model Decoder
Figure 8: U-net model Decoder Architecture
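The sketch below is a deliberately tiny, illustrative U-Net-style network in PyTorch showing the contracting path, the expansive path, and a skip connection. It is a toy under stated assumptions, not the Stable Diffusion U-Net, which is far larger and also takes the timestep and text conditioning as inputs.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch: int = 3):
        super().__init__()
        # Contracting path: capture and encode features, then downsample.
        self.enc = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Expansive path: upsample, then decode using the skip features as well.
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, ch, 1))

    def forward(self, x):
        s = self.enc(x)                 # features kept for the skip connection
        h = self.down(s)                # bottleneck
        h = self.up(h)                  # back to the original resolution
        h = torch.cat([h, s], dim=1)    # skip connection: concatenate encoder features
        return self.dec(h)

out = TinyUNet()(torch.randn(1, 3, 64, 64))   # output has the same spatial size as the input
```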

Cost of running the model

Diffusion models like Google's Imagen and OpenAI's DALL-E operate in pixel space. They use some tricks to make the model faster, but it is still not enough. Stable Diffusion, by contrast, is a latent diffusion model: instead of operating in the high-dimensional image space, it first compresses the image into a latent space. The latent space is 48 times smaller, so the model crunches far fewer numbers, which is why it is much faster. The compression is done with a Variational Autoencoder (VAE).

To summarise: rather than running the diffusion process in image space, we work in the latent space for faster generation, and a VAE moves images in and out of that space. The U-Net is still used as the noise predictor, but it now operates on latents.

Variational Autoencoders

Like the U-Net, a VAE has an encoder and a decoder. The encoder compresses an image into a latent vector, noise is added to (and later removed from) that latent, and the decoder converts the denoised latent back into an image.

VAE overview
Figure 9: VAE Working
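Here is a sketch of moving between pixel space and the latent space with a pretrained VAE, assuming the Hugging Face diffusers library and the sd-vae-ft-mse checkpoint; the 0.18215 scaling factor is the one used by Stable Diffusion v1. Note the size ratio: a 3x512x512 image becomes a 4x64x64 latent, i.e. 48 times fewer numbers.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)   # stand-in for a real image scaled to [-1, 1]
with torch.no_grad():
    # Encode to the latent space and apply Stable Diffusion's scaling factor.
    latents = vae.encode(image).latent_dist.sample() * 0.18215   # shape 1 x 4 x 64 x 64
    # Decode back to pixel space (undoing the scaling first).
    decoded = vae.decode(latents / 0.18215).sample               # shape 1 x 3 x 512 x 512
```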

Does using latent space cause loss in information?

It might seem that we lose a lot of information by working in the latent space, but that is not the case. Natural images may look random, but they are actually quite regular: for example, a face of any species has a mouth, ears and a nose. This regularity is better explained by the Manifold Hypothesis.

Reverse Diffusion in Latent Space

Here’s how latent reverse diffusion in Stable Diffusion works.

  1. A random latent space matrix is generated.
  2. The noise predictor estimates the noise in the latent matrix.
  3. The estimated noise is subtracted from the latent matrix.
  4. Steps 2 and 3 are repeated for a set number of sampling steps.
  5. The VAE decoder converts the latent matrix into the final image.

The noise predictor here is still the U-Net.
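The loop below sketches these steps using building blocks from the Hugging Face diffusers library. The model id, scheduler choice, and placeholder text embedding are assumptions made for illustration; a real pipeline would also apply classifier-free guidance.

```python
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

scheduler.set_timesteps(50)                      # number of sampling steps
latents = torch.randn(1, 4, 64, 64)              # step 1: random latent matrix
text_emb = torch.randn(1, 77, 768)               # placeholder for a real text embedding

with torch.no_grad():
    for t in scheduler.timesteps:
        # Step 2: the U-Net estimates the noise in the current latent.
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        # Step 3: the scheduler removes the estimated noise for this step.
        latents = scheduler.step(noise_pred, t, latents).prev_sample

# Step 5: the VAE decoder would now convert `latents` into the final image.
```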

So far we have only looked at the unconditional image generation process. In the following sections we will see how to condition on text, i.e. given a text prompt, the model should generate a matching image.

Text Conditioning

To be able to generate images from text prompts we need to perform the preprocessing steps shown in Figure 10. In the figure, the tokenizer and embedder are implemented by a Contrastive Language-Image Pre-training (CLIP) model. It should be noted that, since we are now dealing with a text input, cross-attention layers are used in the U-Net to establish the relationship between the words of the prompt and the image features. Attention layers have become the new feature extraction layers; they are replacing RNNs and CNNs because they process sequences in parallel and carry fewer inductive biases from the structure of the network.

There are other forms of conditioning as well.

Text Conditioning overview
Figure 10: Text Conditioning steps
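Here is a sketch of the tokenizer + embedder stage, assuming the Hugging Face transformers library and the CLIP checkpoint commonly used by Stable Diffusion v1 ("openai/clip-vit-large-patch14").

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids).last_hidden_state   # shape 1 x 77 x 768

# `text_emb` is what the U-Net's cross-attention layers attend to during denoising.
```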

Summary

To summarize how Stable Diffusion creates images, here are the steps:

  1. For text-to-image generation, a random tensor is generated in the latent space (for image-to-image, the input image is first encoded into the latent space by the VAE encoder).
  2. The U-Net then predicts the noise that has been added to this latent vector.
  3. Given the predicted noise, we remove it from the latent vector.
  4. Steps 2 and 3 are repeated for a certain number of sampling steps.
  5. Finally, the VAE decoder converts the latent back to pixel space. This is the image you get after running Stable Diffusion.
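For completeness, here is how these pieces are typically driven end to end with the diffusers StableDiffusionPipeline, which wires together the CLIP text encoder, U-Net noise predictor, scheduler, and VAE decoder described above. The model id and prompt are just examples, and a GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dusk",
             num_inference_steps=50).images[0]
image.save("lighthouse.png")
```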

References

  1. Andrew, “How does Stable Diffusion work?,” Stable Diffusion Art, Jun. 10, 2024. https://stable-diffusion-art.com/how-stable-diffusion-work/
  2. GeeksforGeeks, “UNET Architecture explained,” GeeksforGeeks, Jun. 08, 2023. https://www.geeksforgeeks.org/u-net-architecture-explained/
  3. O. Ronneberger, P. Fischer, and T. Brox, “U-NET: Convolutional Networks for Biomedical Image Segmentation,” arXiv.org, May 18, 2015. https://arxiv.org/abs/1505.04597
  4. Wikipedia contributors, “Manifold hypothesis,” Wikipedia, Aug. 01, 2024. https://en.wikipedia.org/wiki/Manifold_hypothesis

Understanding Vision Transformers

Overview

I was taking part in the ISIC 2024 challenge when I got stuck: the ResNet50 model I was training started overfitting. My score at that point was 0.142, and to be at the top I had to beat a score of 0.188. While scouring the internet for a new model I came across Vision Transformers. I was honestly surprised that the transformer architecture could be applied to images, and it led me to an interesting paper called "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

About the Paper

We know that the Transformer architecture has become the norm for Natural Language Processing (NLP) tasks. In computer vision, however, attention has typically been used in conjunction with convolutional networks. This paper demonstrates that convolutional networks are not required: a pure transformer architecture applied to a sequence of image patches can perform image classification tasks very well, provided it is pre-trained on large amounts of data.

Basic Theory of Vision Transformers

The Vision Transformer (ViT) architecture is inspired by the success of the Transformer in NLP. The first step in a Vision Transformer is to split an image into patches. We then generate embeddings for these patches together with their positions. Let us consider the dimensional transformation taking place here.

Our original image \(X\) has dimensions \(H \times W \times C\), where \(H\) is the height, \(W\) is the width, and \(C\) is the number of channels. Since we are dealing with RGB images, \(C = 3\).

After extracting the patches, we get a tensor of shape \(N \times P \times P \times C\), where \(P\) is the patch size and \(N\) is the number of patches in the image:

\(N = \frac{H \times W}{P^2}\)

Now we flatten the patches and project them through a dense layer to a constant latent vector size \(D\). We then add positional embeddings to the patch embeddings to retain position information. The positional embeddings are 1D rather than 2D, since no significant performance gain was observed from 2D embeddings.
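The snippet below sketches this patchify-and-embed step in PyTorch, using the common trick of a convolution whose kernel size and stride both equal \(P\). The image size, patch size and embedding dimension are illustrative, and the ViT class token is omitted for brevity.

```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 768                                      # patch size, channels, latent size D
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)    # flatten + linear projection in one op
pos_embed = nn.Parameter(torch.zeros(1, (224 // P) ** 2, D))  # learned 1D positional embeddings

x = torch.randn(1, C, 224, 224)               # an H x W x C image (batched)
patches = patch_embed(x)                      # 1 x D x 14 x 14
tokens = patches.flatten(2).transpose(1, 2)   # 1 x N x D, with N = (224*224)/(16*16) = 196
tokens = tokens + pos_embed                   # add positional information
```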

This output is then forwarded through a stack of transformer encoder blocks.

The transformer encoder block is composed of alternating multi-head self-attention and MLP blocks. Layer normalization is applied before every block, i.e. before each attention or MLP block, and a residual connection is added after every block.
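Here is a minimal pre-norm encoder block in PyTorch matching that description: LayerNorm before each sub-block, residual connection after. The dimensions follow the ViT-Base configuration, which is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)                                    # LayerNorm before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]    # self-attention + residual
        x = x + self.mlp(self.norm2(x))                      # LayerNorm, MLP + residual
        return x

out = EncoderBlock()(torch.randn(1, 196, 768))               # N tokens of size D
```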

It is to be noted that Vision Transformers have much less inductive bias than CNNs. Inductive biases are assumptions we make about a dataset; for example, we might assume that the marks of students in a given subject follow a Gaussian distribution. CNN architectures inherently carry such biases because of the way they are structured: they are built to capture local relationships between pixels, and as CNNs get deeper, these local feature extractors combine to extract global features. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. A hybrid version of ViT also exists, in which a CNN first extracts feature maps that are then forwarded to the transformer encoder block.

Is it better than CNNs?

In the paper, ViT is applied only to classification tasks, not to segmentation or detection, yet it matches or outperforms CNNs while benefiting from the parallelism of multi-head self-attention. However, ViT only outperforms CNNs when pre-trained on large amounts of data, and it requires more training epochs.