Artificial Neural Networks

All about attention

Overview

Attention layers are now often used in place of RNNs, and even CNNs, because they can process a whole sequence in parallel. In this post we will see how attention layers work.

Working of Attention layers

There are three inputs to an attention layer:

  1. Query: Represents the "question" or "search term" for determining which parts of the input sequence are relevant.
  2. Key: Represents the "descriptor" of each input element, used to decide how relevant each input element is to the query.
  3. Value: Contains the actual information or representation of each input element that will be passed along after attention is computed.

Given a Query and a Key, we calculate their similarity. The more similar a key is to the query, the more its value contributes to the attention output.

\[ \text{Score}(Q, K) = Q \cdot K^\top \]

The above equation results in a matrix describing how much importance each query gives to each key. In the equation, Q is the query and K is the key.

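The next step is scaling. When the key vectors have many dimensions, the dot products can grow large in magnitude, and large scores push the softmax applied in the next step into regions where its gradients are very small. Dividing by the square root of the key dimensionality keeps the scores in a reasonable range, so the equation now takes the following shape: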

\[ \text{Scaled Score}(Q, K) = \frac{Q \cdot K^\top}{\sqrt{d_k}} \]

Where \(d_k\) is the dimensionality of the Key vectors.
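The effect of the scaling factor can be checked numerically. The sketch below is only an illustration under stated assumptions: queries and keys are drawn as random vectors with unit-variance components, and `d_k = 512` is an arbitrary choice. It compares the spread of the raw and the scaled dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512        # assumed key dimensionality for this illustration
n_pairs = 2000   # number of random query/key pairs to sample

# Queries and keys with independent, unit-variance components.
q = rng.standard_normal((n_pairs, d_k))
k = rng.standard_normal((n_pairs, d_k))

raw_scores = np.sum(q * k, axis=1)         # one dot product per pair
scaled_scores = raw_scores / np.sqrt(d_k)  # divide by sqrt(d_k)

raw_std = raw_scores.std()
scaled_std = scaled_scores.std()
print(f"std of raw scores:    {raw_std:.2f}")     # roughly sqrt(d_k)
print(f"std of scaled scores: {scaled_std:.2f}")  # roughly 1
```

Under these assumptions the raw scores spread out in proportion to the square root of the dimensionality, while the scaled scores stay near unit spread regardless of `d_k`.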

The scores are passed through a softmax function to convert them into probabilities (attention weights). These probabilities determine the contribution of each input element. The equation now takes the following form:

\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \]

Overall the equation would look something like this:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \]
QKV
Figure 1: Query Key Value

Scaled Dot Product Attention
Figure 2: Flow of calculating Attention in Scaled Dot Product Attention

Query Key mapping
Figure 3: Example mapping similar query-key value pairs
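The full attention equation can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names and the toy matrices are made up for the example, and batching, masking, and multiple heads are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                   # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # each row becomes a probability distribution
    return weights @ V, weights         # weighted sum of the values

# Toy example: 2 queries, 3 key/value pairs, all of dimension 2.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(3))
print(output.round(3))
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors rather than a copy of a single value.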

Let's try to understand this with an analogy. Imagine you visit a library and ask for a book. You say "I want a book about science fiction"; this is analogous to the Query. The library compares your request with the description of each book (the Key) and hands you the books whose descriptions best match the science fiction genre (the Values).

Queries, Keys, and Values are computed as linear transformations of the input embeddings (or outputs of the previous layer):

\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]

where \(X\) is the input, and \(W_Q\), \(W_K\), \(W_V\) are learned weight matrices.
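The projections can be sketched as follows. The sizes (`seq_len`, `d_model`, `d_k`) are toy values chosen for the example, and the weight matrices are random stand-ins for what a trained model would have learned.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 3   # toy sizes chosen for illustration

X = rng.standard_normal((seq_len, d_model))   # one row per input token

# In a trained model these matrices are learned; here they are random stand-ins.
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# Q, K, and V are all computed from the same input X.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaled dot-product attention on the projected inputs.
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V

print(Q.shape, K.shape, V.shape, output.shape)
```

Because all three matrices come from the same `X`, this arrangement is usually called self-attention: every token attends to every token of the same sequence.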

Summary

  1. Attention is a layer that lets a model focus on what's important.
  2. Queries, Keys, and Values are used for information retrieval inside the attention layer.

LLaVA

Overview

LLaVA (Large Language and Vision Assistant) was first introduced in the paper "Visual Instruction Tuning".

What is Visual Instruction Tuning?

Visual instruction tuning is a method used to fine-tune a large language model, enabling it to interpret and respond to instructions derived from visual inputs.

One example is to ask a machine learning model to describe an image.

LLaVA

As already established, LLaVA is a multimodal model. It was trained on a small dataset; despite this, it can analyse images and answer questions about them.

Architecture

LLaVA has the following components:

  1. Language model
  2. Vision encoder
  3. Projection

The language model comes from the Llama family of autoregressive LLMs released by Meta AI; the original LLaVA uses Vicuna, a chat model fine-tuned from Llama.

The vision encoder is the CLIP visual encoder ViT-L/14. The encoder extracts visual features, which are connected to the language embeddings through a projection matrix. The projection component translates visual features into language embedding tokens, thereby bridging the gap between text and images.

Training

Training happens in two stages:

  1. Pre-training for feature alignment: In this initial stage, LLaVA aligns visual and language features to ensure compatibility. Only the projection is trained; the vision encoder and the language model stay frozen.
  2. End-to-end fine-tuning: The second stage fine-tunes the projection and the language model together. At this stage the vision encoder's weights remain fixed.

LLaVA-1.5

In LLaVA-1.5 there are two significant changes:

  1. An MLP vision-language connector
  2. Training on academic-task-oriented data

The linear projection layer is replaced with a two-layer MLP.
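The difference between the two connectors can be sketched in NumPy. This is an illustration only: the sizes are toy values rather than the real model dimensions, the weights are random stand-ins for learned parameters, and the GELU activation between the two layers is an assumption, since the post does not name one.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_vision, d_lm = 6, 16, 32   # toy sizes, not the real model dimensions

def gelu(x):
    # tanh approximation of the GELU activation (assumed for this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_projection(features, W):
    # LLaVA-style connector: a single projection matrix
    return features @ W

def mlp_projection(features, W1, b1, W2, b2):
    # LLaVA-1.5-style connector: two linear layers with a non-linearity in between
    hidden = gelu(features @ W1 + b1)
    return hidden @ W2 + b2

# Stand-in for the patch features produced by the vision encoder.
visual_features = rng.standard_normal((num_patches, d_vision))

W = rng.standard_normal((d_vision, d_lm)) * 0.02
W1 = rng.standard_normal((d_vision, d_lm)) * 0.02
b1 = np.zeros(d_lm)
W2 = rng.standard_normal((d_lm, d_lm)) * 0.02
b2 = np.zeros(d_lm)

tokens_linear = linear_projection(visual_features, W)
tokens_mlp = mlp_projection(visual_features, W1, b1, W2, b2)
print(tokens_linear.shape, tokens_mlp.shape)
```

In both cases every image patch ends up as one vector of the language model's embedding width, so the patches can be fed to the LLM alongside the text tokens.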

LLaVA 1.6 (LLaVA-NeXT)

In addition to the Vicuna-1.5 (7B and 13B) LLM backbone used by LLaVA 1.5, LLaVA 1.6 considers more LLMs, including Mistral-7B and Nous-Hermes-2-Yi-34B. These LLMs have attractive properties: flexible commercial use terms, strong bilingual support, and a larger language model capacity. This allows LLaVA to support a broader spectrum of users and more scenarios in the community. The LLaVA recipe works well with various LLMs and scales up smoothly with the LLM up to 34B.

Here are the performance improvements LLaVA-NeXT has over LLaVA-1.5:

  • Increased input image resolution, with 4x more pixels, which allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, and 1344x336 resolution.
  • Better visual reasoning and zero-shot OCR capability with multimodal document and chart data.
  • An improved visual instruction tuning data mixture, with a higher diversity of task instructions and optimization for responses that solicit favorable user feedback.
  • Better visual conversation for more scenarios, covering different applications.
  • Better world knowledge and logical reasoning.
  • Efficient deployment and inference with SGLang.

Other variants of LLaVA:

  1. LLaVA-Med
  2. LLaVA-Interactive

Reference

  1. A. Acharya, “LLAVA, LLAVA-1.5, and LLAVA-NeXT(1.6) explained,” Nov. 04, 2024. https://encord.com/blog/llava-large-language-vision-assistant/
  2. Wikipedia contributors, “Llama (language model),” Wikipedia, Jan. 01, 2025. https://en.wikipedia.org/wiki/Llama_(language_model)

Introduction to Hugging Face

Overview

Hugging Face is a leading platform in natural language processing (NLP) and machine learning (ML), providing tools, libraries, and models for developers and researchers. It is widely known for its open-source libraries and community contributions, facilitating the use of pre-trained models and accelerating ML workflows.

Applications of Hugging Face:

  • Sentiment Analysis
  • Text Summarization
  • Machine Translation
  • Chatbots and Virtual Assistants
  • Image Captioning (via VLMs)
  • Healthcare, legal, and financial domain-specific NLP solutions

Why Hugging Face Matters:

Hugging Face democratizes access to advanced AI tools, fostering innovation and collaboration. With its open-source ethos, it has become a go-to resource for researchers and developers alike, empowering them to tackle complex challenges in AI and ML effectively.

Hugging Face can be used with both TensorFlow and PyTorch.

Hugging Face AutoClasses

Hugging Face AutoClasses are an abstraction that simplifies the use of pre-trained models for various tasks, such as text classification, translation, and summarization. They automatically select the appropriate architecture and configuration for a given pre-trained model from the Hugging Face Model Hub.

Commonly Used AutoClasses:

1. AutoModel
  • For loading generic pre-trained models.
  • Use case: Extracting hidden states or embeddings.
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
2. AutoModelForSequenceClassification
  • For text classification tasks.
  • Use case: Sentiment analysis, spam detection, etc.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
3. AutoTokenizer
  • Automatically loads the appropriate tokenizer for the specified model.
  • Handles tokenization, encoding, and decoding.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
4. AutoModelForQuestionAnswering
  • For question-answering tasks.
  • Use case: Extracting answers from context.
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
5. AutoModelForSeq2SeqLM
  • For sequence-to-sequence tasks like translation or summarization.
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
6. AutoModelForTokenClassification
  • For tasks like Named Entity Recognition (NER) or Part-of-Speech (POS) tagging.
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
7. AutoModelForCausalLM
  • For language modeling tasks that generate text.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")
8. AutoProcessor (for Multimodal Models)
  • Loads processors for tasks involving images, text, or both.
  • Example: Vision-Language Models (e.g., CLIP).
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

Use Cases in Projects:

  • VLMs: Use AutoProcessor and AutoModel for image-text embedding or image captioning tasks.
  • Healthcare: Use AutoModelForSequenceClassification for text classification tasks like predicting medical conditions based on clinical notes.

Why use Transformers?

Traditionally, we processed text with RNNs, but as the sequence length increases they run into the problem of vanishing gradients. Additionally, they are slow, because they process tokens one at a time. Transformers address both of these concerns.

Stable Diffusion Understanding

Overview

Stable Diffusion has become very popular for image generation and is a go-to model for developers. It is a latent diffusion model that generates images from text. You can also give it an image together with the text to generate new images.

Capabilities of Stable Diffusion

Stable Diffusion is a text-to-image model. Given a text prompt, it produces an image.

Stable Diffusion Text to Image
Figure 1: Basic Workflow of Stable Diffusion

Stable Diffusion belongs to a class of deep learning models called diffusion models. These are generative models: they are capable of generating new data that is similar to the training data. They are so named because their mechanics resemble diffusion as seen in physics. There are two types of diffusion here:

  1. Forward Diffusion
  2. Reverse Diffusion

Forward Diffusion

Forward diffusion is the process that adds noise to an image in steps, so that it gradually becomes unrecognizable. It is similar to dropping ink on tissue paper: the ink eventually spreads out.

Forward Diffusion Process
Figure 2: Stable diffusion Forward diffusion process taken from here
Forward Diffusion Process
Figure 3: Drop of ink from the nib of the pen spreading on the tissue paper (AI Generated from LLama 3.2)
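Forward diffusion can be sketched as a loop that mixes a little Gaussian noise into the image at every step. This is a minimal illustration in the style of the usual DDPM formulation, not the exact procedure used by Stable Diffusion: the linear `betas` schedule, the number of steps, and the gradient "image" are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)   # assumed linear noise schedule

# Stand-in "image": a smooth horizontal gradient with values in [-1, 1].
x0 = np.tile(np.linspace(-1.0, 1.0, 32), (32, 1))

def forward_step(x_prev, beta, rng):
    # Shrink the current image slightly and add a small amount of Gaussian noise.
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

x = x0.copy()
snapshots = {}
for t, beta in enumerate(betas, start=1):
    x = forward_step(x, beta, rng)
    if t in (1, 250, 500, 1000):
        snapshots[t] = x.copy()

# Track how much of the original image is still visible at each snapshot.
for t, x_t in snapshots.items():
    corr = np.corrcoef(x0.ravel(), x_t.ravel())[0, 1]
    print(f"step {t:4d}: correlation with original = {corr:.3f}")
```

The correlation with the original starts close to 1 and falls towards 0, which is the numerical counterpart of the image gradually becoming unrecognizable.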

Reverse Diffusion

Reverse diffusion is the opposite of Forward Diffusion. So rather than adding noise, it removes noise gradually from an image.

Reverse Diffusion Process
Figure 4: Stable diffusion Reverse diffusion process taken from here

Training process of Stable Diffusion

Adding noise is a simple process and does not require explicit training. But how do we get the original image back from a noisy image? We need to remove the noise from the image. To put it mathematically:

Reverse Diffusion Process Equation
Figure 5: Stable diffusion Reverse diffusion High Level Equation

So what we need to do is predict the amount of noise that must be removed to recover the original, almost noiseless image. We use a noise predictor, which for Stable Diffusion is a U-Net model.
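A training example for the noise predictor can be sketched as follows. This assumes the standard DDPM-style formulation, in which a noisy image is a weighted mix of the clean image and Gaussian noise; the symbol `alpha_bar_t` (the fraction of signal kept at step t) and all function names are illustrative and do not come from the post. The true noise is used in place of a trained U-Net to show what a perfect predictor would achieve.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_example(x0, alpha_bar_t, rng):
    # Mix the clean image with Gaussian noise; the noise itself is the training target.
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise
    return x_t, noise

def noise_prediction_loss(predicted_noise, true_noise):
    # Mean squared error between predicted and actual noise.
    return float(np.mean((predicted_noise - true_noise) ** 2))

def remove_noise(x_t, predicted_noise, alpha_bar_t):
    # Invert the mixing step using the predicted noise.
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_bar_t)

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # stand-in for a clean image
alpha_bar_t = 0.5                          # assumed noise level for this example

x_t, true_noise = make_training_example(x0, alpha_bar_t, rng)

# A perfect predictor would output exactly the noise that was added.
perfect_loss = noise_prediction_loss(true_noise, true_noise)
recovered = remove_noise(x_t, true_noise, alpha_bar_t)

# A predictor that always outputs zeros does much worse.
zero_loss = noise_prediction_loss(np.zeros_like(true_noise), true_noise)
print(perfect_loss, zero_loss)
```

Training pushes the network's prediction towards the noise that was actually added, so that removing the predicted noise brings the noisy image back towards the original.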

U-Net Model

U-Net is a widely used deep learning model for image segmentation. Its primary purpose was to address the challenge of limited data in healthcare. The network allows you to train on a smaller dataset while maintaining the speed and accuracy of the model.

The U-Net model consists of 2 paths:

  1. Contracting Path
  2. Expansive Path

The contracting path consists of encoders that capture the relevant information and encode it. The expansive path contains decoders that decode the encoded information and also use information from the contracting path, via skip connections, to generate a segmentation map.

U-net Model
Figure 6: U-net model taken from here

U-net Model Encoder
Figure 7: U-net model Encoder Architecture

U-net Model Decoder
Figure 8: U-net model Decoder Architecture
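The data flow through the two paths can be sketched with plain array operations. This toy version is only meant to show the shapes and the skip connections: the learned convolutions of a real U-Net are left out, and average pooling and nearest-neighbour upsampling are stand-ins chosen for the example.

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling on a (channels, height, width) array: halves height and width.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    # Nearest-neighbour upsampling: doubles height and width.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def toy_unet(x):
    # Contracting path: keep a copy of each resolution for the skip connections.
    skip1 = x                         # (C, H, W)
    skip2 = downsample(skip1)         # (C, H/2, W/2)
    bottleneck = downsample(skip2)    # (C, H/4, W/4)

    # Expansive path: upsample, then concatenate the matching skip along channels.
    up2 = np.concatenate([upsample(bottleneck), skip2], axis=0)   # (2C, H/2, W/2)
    up1 = np.concatenate([upsample(up2), skip1], axis=0)          # (3C, H, W)
    return up1

image = np.arange(64, dtype=float).reshape(1, 8, 8)
output = toy_unet(image)
print(image.shape, "->", output.shape)
```

The output has the same height and width as the input, and the concatenated skip channels carry the fine detail that was lost during downsampling.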

Cost of running the model

Diffusion models like Google's Imagen and OpenAI's DALL-E operate in pixel space. They use some tricks to make the model faster, but it is still not enough. Stable Diffusion, by contrast, is a latent diffusion model. Instead of operating in the high-dimensional image space, it first compresses the image into a latent space. The latent space is 48 times smaller, so the model has far fewer numbers to crunch, which is why it is a lot faster. The compression is done with a Variational Autoencoder (VAE).
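The factor of 48 can be reproduced with a short calculation. The shapes below are an assumption, since the post does not state them: a 512x512 RGB image and a 4x64x64 latent, as used by Stable Diffusion v1.

```python
image_shape = (3, 512, 512)   # RGB image: channels x height x width
latent_shape = (4, 64, 64)    # assumed latent shape (Stable Diffusion v1)

def num_values(shape):
    # Total number of numbers stored in a tensor of the given shape.
    total = 1
    for size in shape:
        total *= size
    return total

image_values = num_values(image_shape)
latent_values = num_values(latent_shape)
ratio = image_values / latent_values

print(image_values, latent_values, ratio)   # prints: 786432 16384 48.0
```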

To summarise: instead of running the U-Net in image space, Stable Diffusion works in the latent space for faster generation, and it uses a VAE to move between the two. The U-Net is still used as the noise predictor.

Variational Autoencoders

Like U-Net, a VAE has an encoder and a decoder. The encoder compresses the image into a latent vector, the noise is added to this latent vector, and the result is later decoded to generate the image.

VAE overview
Figure 9: VAE Working

Does using latent space cause loss in information?

It might seem that by using the latent space we are losing a lot of information; however, that's not the case. Natural images are not random: they are highly regular. For example, a face of any species has a mouth, ears, and a nose. This is better explained by the Manifold Hypothesis.

Reverse Diffusion in Latent Space

Here’s how latent reverse diffusion in Stable Diffusion works.

  1. A random latent space matrix is generated.
  2. The noise predictor estimates the noise of the latent matrix.
  3. The estimated noise is then subtracted from the latent matrix.
  4. Steps 2 and 3 are repeated for a set number of sampling steps.
  5. The decoder of VAE converts the latent matrix to the final image.

The noise predictor here is still the U-Net.
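The five steps above can be sketched as a sampling loop. Everything here is a toy under stated assumptions: `oracle_noise_predictor` stands in for the trained U-Net and simply knows the clean latent it should reach, `toy_decoder` stands in for the VAE decoder, the linear `betas` schedule is assumed, and the update rule is a DDIM-style deterministic step, which is one common sampler and not necessarily the one the post has in mind.

```python
import numpy as np

rng = np.random.default_rng(0)
num_train_steps = 1000
betas = np.linspace(1e-4, 0.02, num_train_steps)   # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)               # fraction of signal kept at each step

# Hypothetical clean latent that the oracle predictor knows about.
target_latent = rng.standard_normal((4, 8, 8))

def oracle_noise_predictor(latent, t):
    # Stand-in for the U-Net: returns the exact noise separating latent from the target.
    a = alpha_bars[t]
    return (latent - np.sqrt(a) * target_latent) / np.sqrt(1.0 - a)

def toy_decoder(latent):
    # Stand-in for the VAE decoder: keep 3 channels and upsample by 8 in each direction.
    return latent[:3].repeat(8, axis=1).repeat(8, axis=2)

def sample(num_sampling_steps=20):
    timesteps = np.linspace(num_train_steps - 1, 0, num_sampling_steps).astype(int)
    latent = rng.standard_normal(target_latent.shape)        # step 1: random latent
    for i, t in enumerate(timesteps):                         # step 4: repeat
        eps = oracle_noise_predictor(latent, t)               # step 2: estimate the noise
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        # step 3: remove the estimated noise, then move to the next, lower noise level
        x0_pred = (latent - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        latent = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return latent, toy_decoder(latent)                        # step 5: decode to pixels

final_latent, image = sample()
print(final_latent.shape, image.shape)
```

With a perfect predictor the loop lands exactly on the clean latent; a real U-Net only approximates the noise, which is why several sampling steps are used.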

So far we have only seen the image generation process on its own, which is called the unconditioned process. In the following sections we will see how we can condition it on text, i.e. given a text prompt, the model should generate a matching image.

Text Conditioning

To generate images from text prompts, we need to perform the preprocessing steps in Figure 10. In the figure, the Tokenizer and Embedder are implemented by a Contrastive Language-Image Pretraining (CLIP) model. Note that because we are dealing with a text input, the noise predictor uses cross-attention layers, which let it relate the image latent to the different words in the prompt. Attention layers are increasingly used as feature-extraction layers in place of RNNs and CNNs, as they are faster at processing and carry fewer inductive biases from the structure of the neural network.

There are other forms of conditioning as well.

VAE overview
Figure 10: Text Conditioning steps
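Cross-attention differs from self-attention in where its inputs come from: the queries are computed from one sequence and the keys and values from another. The sketch below assumes the queries come from the image latent and the keys and values from the text embeddings; all sizes are toy values and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
num_latent_positions, d_latent = 16, 8   # e.g. a flattened 4x4 latent feature map (toy size)
num_tokens, d_text = 5, 12               # toy prompt length and text-embedding width
d_k = 8

latent_features = rng.standard_normal((num_latent_positions, d_latent))
text_embeddings = rng.standard_normal((num_tokens, d_text))   # stand-in for the embedder output

# Random stand-ins for the learned projection matrices.
W_Q = rng.standard_normal((d_latent, d_k))
W_K = rng.standard_normal((d_text, d_k))
W_V = rng.standard_normal((d_text, d_k))

Q = latent_features @ W_Q    # queries come from the image latent
K = text_embeddings @ W_K    # keys come from the text
V = text_embeddings @ W_V    # values come from the text

scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
conditioned = weights @ V

print(weights.shape, conditioned.shape)
```

Each latent position gets its own distribution over the prompt tokens, which is how the text can steer different regions of the image differently.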

Summary

To summarize, here are the steps Stable Diffusion follows to create an image:

  1. Given a text prompt, we generate a random tensor in the latent space. When an input image is also given, the VAE encoder first converts it into a latent, and noise is added to it.
  2. The U-Net then predicts the noise present in this latent.
  3. The predicted noise is removed from the latent.
  4. Steps 2 and 3 are repeated for a certain number of sampling steps.
  5. Finally, the decoder of the VAE converts the latent image back to pixel space. This is the image you get after running Stable Diffusion.

References

  1. Andrew, “How does Stable Diffusion work?,” Stable Diffusion Art, Jun. 10, 2024. https://stable-diffusion-art.com/how-stable-diffusion-work/
  2. GeeksforGeeks, “UNET Architecture explained,” GeeksforGeeks, Jun. 08, 2023. https://www.geeksforgeeks.org/u-net-architecture-explained/
  3. O. Ronneberger, P. Fischer, and T. Brox, “U-NET: Convolutional Networks for Biomedical Image Segmentation,” arXiv.org, May 18, 2015. https://arxiv.org/abs/1505.04597
  4. Wikipedia contributors, “Manifold hypothesis,” Wikipedia, Aug. 01, 2024. https://en.wikipedia.org/wiki/Manifold_hypothesis