
LLMs

All about attention

Overview

Attention layers are now used in place of RNNs and even CNNs because they process sequences in parallel, which speeds up processing. In this blog we will see how attention layers are implemented.

Working of Attention layers

There are three inputs to attention layers:

  1. Query: Represents the "question" or "search term" for determining which parts of the input sequence are relevant.
  2. Key: Represents the "descriptor" of each input element, used to decide how relevant each input element is to the query.
  3. Values: Contains the actual information or representation of each input element that will be passed along after attention is computed.

Given a query and a key we calculate their similarity; keys that are most similar to the query contribute their values most strongly to the attention output.

\[ \text{Score}(Q, K) = Q \cdot K^\top \]

The above equation produces a matrix describing how much importance each query gives to each key. In the equation, Q is the query matrix and K is the key matrix.

The next step is scaling. When the key dimension is large, the dot products grow large in magnitude, which pushes the softmax into regions with very small gradients, so we scale the scores down. The equation now takes the following shape:

\[ \text{Scaled Score}(Q, K) = \frac{Q \cdot K^\top}{\sqrt{d_k}} \]

Where \(d_k\) is the dimensionality of the Key vectors.

The scores are passed through a softmax function to convert them into probabilities (attention weights). These probabilities determine the contribution of each input element. The equation now takes the following form:

\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \]

Overall the equation would look something like this:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \]
Figure 1: Query, Key, and Value

Figure 2: Flow of calculating attention in scaled dot-product attention
Figure 3: Example mapping of similar query-key pairs

Let's try to understand this with an analogy. Imagine you visit a library and ask for a book: "I want a book about science fiction." This request is the query. The librarian compares it against the description of each book in the library (the keys) and hands you the books whose descriptions best match the request (the values).

Queries, Keys, and Values are computed as linear transformations of the input embeddings (or outputs of the previous layer):

\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]

where \(X\) is the input, and \(W_Q\), \(W_K\), \(W_V\) are learned weight matrices.
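
To make the full pipeline concrete, here is a minimal single-head scaled dot-product attention sketch in PyTorch; the sequence length, dimensions, and random weights are illustrative rather than taken from any particular model:

import torch
import torch.nn.functional as F

d_model, d_k = 8, 8
X = torch.randn(5, d_model)            # 5 input tokens, each with d_model features

# Learned projection matrices (randomly initialized here purely for illustration)
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # Q = XW_Q, K = XW_K, V = XW_V
scores = Q @ K.T / d_k ** 0.5          # scaled dot-product scores
weights = F.softmax(scores, dim=-1)    # attention weights; each row sums to 1
output = weights @ V                   # weighted combination of the values
print(output.shape)                    # torch.Size([5, 8])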

Summary

  1. Attention is a layer that lets a model focus on the parts of the input that matter most.
  2. Queries, keys, and values are used for information retrieval inside the attention layer.

LLaVA

Overview

LLaVA (Large Language and Vision Assistant) was first introduced in the paper "Visual Instruction Tuning".

What is Visual Instruction Tuning?

Visual instruction tuning is a method used to fine-tune a large language model, enabling it to interpret and respond to instructions derived from visual inputs.

One example is to ask a machine learning model to describe an image.

LLaVA

As already established, LLaVA is a multimodal model. Although it was trained on a relatively small dataset, it can perform image analysis and respond to questions about images.

Architecture

LLaVA has the following components:

  1. Language model
  2. Vision encoder
  3. Projection

We use Llama as the language model, which is a family of autoregressive LLMs released by Meta AI.

The vision encoder is the CLIP visual encoder ViT-L/14. The encoder extracts visual features and connects them to language embeddings through a projection matrix. The projection component translates visual features into language embedding tokens, thereby bridging the gap between text and images.
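
To make the projection idea concrete, here is a rough sketch of mapping vision-encoder features into the language embedding space; the dimensions are illustrative assumptions and this is not the reference LLaVA code:

import torch
import torch.nn as nn

vision_dim, text_dim = 1024, 4096                  # assumed: CLIP ViT-L/14 feature size -> LLM embedding size
projection = nn.Linear(vision_dim, text_dim)       # the original LLaVA connector is a single linear layer

image_features = torch.randn(1, 576, vision_dim)   # e.g. a grid of patch features from the vision encoder
image_tokens = projection(image_features)          # now shaped like language embedding tokens
print(image_tokens.shape)                          # torch.Size([1, 576, 4096])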

Training

Two stages of training:

  1. Pre-training for Feature Alignment: LLaVA aligns visual and language features to ensure compatibility in this initial stage.
  2. Fine-tune end-to-end: The second training stage focuses on fine-tuning the entire model. At this stage the vision encoder's weights remain frozen while the projection layer and the language model are updated.

LLaVA-1.5

In LLaVA-1.5 there are two significant changes:

  1. An MLP vision-language connector
  2. Training on academic task-oriented data

The linear projection layer is replaced with a two-layer MLP.
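
Reusing the illustrative dimensions from the sketch above, the connector change might look like this (again a sketch, not the reference implementation):

import torch.nn as nn

vision_dim, text_dim = 1024, 4096
# LLaVA-1.5 swaps the single linear projection for a two-layer MLP with a non-linearity in between
mlp_connector = nn.Sequential(
    nn.Linear(vision_dim, text_dim),
    nn.GELU(),
    nn.Linear(text_dim, text_dim),
)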

LLaVA 1.6 (LLaVA-NeXT)

In addition to LLaVA-1.5, which uses the Vicuna-1.5 (7B and 13B) LLM backbone, LLaVA-1.6 considers more LLMs, including Mistral-7B and Nous-Hermes-2-Yi-34B. These LLMs have attractive properties: flexible commercial-use terms, strong bilingual support, and larger language model capacity. This allows LLaVA to support a broader spectrum of users and scenarios in the community. The LLaVA recipe works well with various LLMs and scales smoothly up to 34B.

Here are the performance improvements LLaVA-NeXT has over LLaVA-1.5:

  • Increased input image resolution to 4x more pixels, allowing the model to grasp more visual detail. It supports three aspect ratios, up to 672x672, 336x1344, and 1344x336 resolution.
  • Better visual reasoning and zero-shot OCR capability with multimodal document and chart data.
  • Improved visual instruction tuning data mixture with a higher diversity of task instructions, optimized for responses that solicit favorable user feedback.
  • Better visual conversation for more scenarios, covering different applications.
  • Better world knowledge and logical reasoning.
  • Efficient deployment and inference with SGLang.

Other variants of LLaVA:

  1. LLaVA-Med
  2. LLaVA-Interactive

Reference

  1. A. Acharya, “LLAVA, LLAVA-1.5, and LLAVA-NeXT(1.6) explained,” Nov. 04, 2024. https://encord.com/blog/llava-large-language-vision-assistant/
  2. Wikipedia contributors, “Llama (language model),” Wikipedia, Jan. 01, 2025. https://en.wikipedia.org/wiki/Llama_(language_model)

Introduction to Hugging Face

Overview

Hugging Face is a leading platform in natural language processing (NLP) and machine learning (ML), providing tools, libraries, and models for developers and researchers. It is widely known for its open-source libraries and community contributions, facilitating the use of pre-trained models and accelerating ML workflows.

Applications of Hugging Face:

  • Sentiment Analysis
  • Text Summarization
  • Machine Translation
  • Chatbots and Virtual Assistants
  • Image Captioning (via VLMs)
  • Healthcare, legal, and financial domain-specific NLP solutions

Why Hugging Face Matters:

Hugging Face democratizes access to advanced AI tools, fostering innovation and collaboration. With its open-source ethos, it has become a go-to resource for researchers and developers alike, empowering them to tackle complex challenges in AI and ML effectively.

Hugging Face can be used with both TensorFlow and PyTorch.

Hugging Face AutoClasses

Hugging Face AutoClasses are an abstraction that simplifies the use of pre-trained models for various tasks, such as text classification, translation, and summarization. They automatically select the appropriate architecture and configuration for a given pre-trained model from the Hugging Face Model Hub.

Commonly Used AutoClasses:

1. AutoModel
  • For loading generic pre-trained models.
  • Use case: Extracting hidden states or embeddings.
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
2. AutoModelForSequenceClassification
  • For text classification tasks.
  • Use case: Sentiment analysis, spam detection, etc.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
3. AutoTokenizer
  • Automatically loads the appropriate tokenizer for the specified model.
  • Handles tokenization, encoding, and decoding.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
4. AutoModelForQuestionAnswering
  • For question-answering tasks.
  • Use case: Extracting answers from context.
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
5. AutoModelForSeq2SeqLM
  • For sequence-to-sequence tasks like translation or summarization.
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
6. AutoModelForTokenClassification
  • For tasks like Named Entity Recognition (NER) or Part-of-Speech (POS) tagging.
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
7. AutoModelForCausalLM
  • For language modeling tasks that generate text.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")
8. AutoProcessor (for Multimodal Models)
  • Loads processors for tasks involving images, text, or both.
  • Example: Vision-Language Models (e.g., CLIP).
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
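
Putting a tokenizer and a model together, here is a minimal end-to-end sentiment-analysis example; the checkpoint name is an illustrative choice from the Model Hub:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"   # assumed sentiment checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("Hugging Face makes NLP workflows much easier!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                              # raw class scores

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])                       # e.g. "POSITIVE"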

Use Cases in Projects:

  • VLMs: Use AutoProcessor and AutoModel for image-text embedding or image captioning tasks.
  • Healthcare: Use AutoModelForSequenceClassification for text classification tasks like predicting medical conditions based on clinical notes.

Why use Transformers?

Traditionally, text was processed with RNNs, but as the window size increases they suffer from vanishing gradients. Additionally, they process tokens sequentially, which is slow. Transformers address both of these concerns.

Computational Graphs

These are directed graphs that map out the dependencies between mathematical computations. For example, let us consider the following set of equations:

  1. Y=(a-b)*(a+b)
  2. Let d=(a-b) and e=(a+b)

Our dependency graph will look as follows:

Figure: Computational graph for Y = (a - b) * (a + b)

The lower nodes are evaluated first, then the higher nodes.
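
As a quick illustration of how such a graph is recorded and evaluated, here is a minimal PyTorch version of the same equations (the input values are arbitrary):

import torch

a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)

d = a - b          # lower node
e = a + b          # lower node
Y = d * e          # higher node, depends on d and e

Y.backward()       # walk the recorded graph backwards to get dY/da and dY/db
print(Y.item(), a.grad.item(), b.grad.item())   # 21.0, 10.0 (= 2a), -4.0 (= -2b)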

Let us consider how this works when performing chain differentiation when it comes to neural networks.

To review chain differentiation, consider the following equations:

  1. y = \(u^4\)
  2. u = 3x + 2

Performing chain rule differentiation with respect to x we get the following:

We first perform partial differentiation of u with respect to x

\[\frac{\partial u}{\partial x} = 3 \]

Then perform partial differentiation of y with respect to u

\[\frac{\partial y}{\partial u} = 4u^3\]

Combining the two derivatives with the chain rule:

\[ \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x} = 4u^3 \cdot 3 = 12u^3 \]

If x = 3.0, then u = 3(3.0) + 2 = 11 and

\(\frac{\partial y}{\partial x} = 12 \cdot 11^3 = 15972\)

Representing the above steps in a computational graph we get the following:

Figure: Chained computational graph for the chain-rule example

How do we implement this? Luckily, this has already been implemented for us in TensorFlow and PyTorch.

There are two implementations of computational graphs:

  1. Static computational graphs - the graph is constructed once before the execution of the model.
  2. Dynamic computational graphs - the graph is constructed on the fly during execution.

TensorFlow Computational Graph Implementation

import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)            # x is a tf.constant, so it must be watched explicitly
    u = 3*x + 2
    y = u ** 4
g = tape.gradient(y, x)      # dy/dx = 12*u**3, evaluated at u = 11
g
<tf.Tensor: shape=(), dtype=float32, numpy=15972.0>





PyTorch Computational Graph Implementation

import torch

x = torch.tensor(3.0, requires_grad=True)   # track gradients with respect to x
u = 3*x + 2
y = u**4
y.backward()                                # backpropagate through the dynamically built graph
x.grad
tensor(15972.)


Why go for RAG?

Overview

In this project we choose a foundation model (e.g., GPT or BERT) and create an API that makes it easy to interact with the LLM.

Foundational model

We want a foundation model that can operate in the medical context. Some of the models considered here are:

  • Medical Llama-8b - optimized to address health-related inquiries and trained on a comprehensive medical chatbot dataset (Apache License 2.0); the underlying foundation model is Meta-Llama-3-8b.
  • Llama3-OpenBioLLM-8B - fine-tuned on a corpus of high-quality biomedical data, with 8 billion parameters; incorporates the DPO dataset.

Approaches

To create a chatbot we have two approaches:

  • Fine-tune an existing foundation model on a medical dataset
  • Create a retrieval-augmented generation (RAG) framework, which retrieves facts from an external knowledge base

Comparisons

Fine-tuning an existing foundation model on a medical dataset

  • Incorporates the additional knowledge into the model itself
  • Offers precise, succinct output attuned to brevity
  • High initial cost
  • Minimal input size, since no retrieved context needs to be included in the prompt

Retrieval Augmented Generation

  • Augments the prompt with external data
  • Provides additional context during question answering
  • Possible collisions among similar snippets during the retrieval process
  • Larger input size due to the inclusion of context information; output tends to be more verbose and harder to steer

Experiment Conclusion

In the referenced experiments, GPT learned about 47% of new knowledge with fine-tuning; with RAG this number goes up to 72% and 74%.

Preferred approach

What do we want?

  • Fast Deployment option

Choice of Approach

RAG allows us to create embeddings easily and offers a fast deployment option.
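
As an illustration of the retrieval step, here is a minimal sketch assuming a sentence-transformers embedding model and a tiny in-memory document store; both are illustrative choices rather than the project's final stack:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model

documents = [
    "Aspirin is commonly used to reduce fever and relieve mild pain.",
    "Metformin is a first-line medication for type 2 diabetes.",
]
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query, k=1):
    # Embed the query and rank documents by cosine similarity (vectors are normalized)
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("What is metformin used for?"))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: What is metformin used for?"
# `prompt` would then be passed to the chosen foundation model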

Architecture

Figure: Architecture of the RAG-based chatbot

References

  • https://arxiv.org/pdf/2401.08406

LangChain

Overview

LangChain consists of three components:

  • Components: LLM wrappers, prompt templates, and indexes for information retrieval
  • Chains: assemble components to solve a specific task (see the sketch below)
  • Agents: allow LLMs to interact with their environment
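
Here is a minimal chain sketch, assuming the langchain-openai package, an OPENAI_API_KEY in the environment, and a recent LangChain version with the pipe (LCEL) syntax; exact imports vary between versions:

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI          # assumed LLM wrapper; any supported wrapper works

prompt = PromptTemplate.from_template("Explain {topic} in one sentence.")
llm = ChatOpenAI(model="gpt-3.5-turbo")          # illustrative model choice

chain = prompt | llm                             # chain = prompt template piped into the LLM wrapper
print(chain.invoke({"topic": "retrieval-augmented generation"}).content)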

Installation

  • Use PyCharm as your preferred IDE since it makes things easier and more user-friendly
  • Create a new project in PyCharm, which looks as follows:

Figure: PyCharm Create Project window
