Transformers

All about attention

Overview

Attention layers are now used in place of RNNs and even CNNs to speed up processing, since they can operate on an entire sequence in parallel. In this blog post we will see how attention layers are implemented.

How Attention Layers Work

There are three inputs to attention layers:

  1. Query: Represents the "question" or "search term" used to determine which parts of the input sequence are relevant.
  2. Key: Represents the "descriptor" of each input element, used to decide how relevant each input element is to the query.
  3. Value: Contains the actual information or representation of each input element that is passed along after attention is computed.

Given a Query and a Key, we calculate their similarity. The keys most similar to the query receive the largest weights, so their values contribute the most to the attention output.

\[ \text{Score}(Q, K) = Q \cdot K^\top \]

The above equation results in a matrix describing how much importance each query gives to each key. In the equation, Q is the query matrix and K is the key matrix.

The next step is scaling. The dot products grow with the dimensionality of the vectors, and large values push the softmax into regions with very small gradients, so we scale the scores down. The equation now takes the following shape:

\[ \text{Scaled Score}(Q, K) = \frac{Q \cdot K^\top}{\sqrt{d_k}} \]

Where \(d_k\) is the dimensionality of the Key vectors.

The scores are passed through a softmax function to convert them into probabilities (attention weights). These probabilities determine the contribution of each input element. The equation now takes the following form:

\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \]

Overall the equation would look something like this:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \]
Figure 1: Query, Key, Value

Figure 2: Flow of calculating attention in Scaled Dot-Product Attention

Figure 3: Example mapping of similar query-key pairs
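
Putting the pieces together, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name and the toy tensor shapes are just illustrative choices, not part of the original formulation:

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    # Similarity score between every query and every key
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    # Convert scores into attention weights (probabilities over keys)
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return weights @ V


# Toy example: a sequence of 4 tokens with 8-dimensional Q, K, V
Q = torch.randn(4, 8)
K = torch.randn(4, 8)
V = torch.randn(4, 8)
output = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # torch.Size([4, 8])
```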

Let's try to understand this with an analogy. Consider the example where you visit a library and ask for a book. You say, "I want a book about science fiction"; this is analogous to the Query. The library uses the description of each book (the Key) to find the books most similar to your request, and hands you the list of books that fit the science-fiction genre (the Value).

Queries, Keys, and Values are computed as linear transformations of the input embeddings (or outputs of the previous layer):

\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]

where \(X\) is the input, and \(W_Q\), \(W_K\), \(W_V\) are learned weight matrices.
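
In code, these projections are just learned linear layers applied to the same input. Here is a minimal sketch in PyTorch, with the dimensions chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

d_model = 64  # embedding dimension of the input X (illustrative)
d_k = 32      # dimension of the queries, keys, and values (illustrative)

# Learned weight matrices W_Q, W_K, W_V as bias-free linear layers
W_Q = nn.Linear(d_model, d_k, bias=False)
W_K = nn.Linear(d_model, d_k, bias=False)
W_V = nn.Linear(d_model, d_k, bias=False)

X = torch.randn(10, d_model)  # 10 input tokens
Q, K, V = W_Q(X), W_K(X), W_V(X)
print(Q.shape, K.shape, V.shape)  # each is torch.Size([10, 32])
```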

Summary

  1. Attention is a layer that lets a model focus on what's important.
  2. Queries, Keys, and Values are used for information retrieval inside the attention layer.

Understanding Vision Transformers

Overview

I was taking part in the ISIC 2024 challenge when I got stuck: the ResNet50 model I was training started overfitting. My score at this point was 0.142. To be at the top I had to beat a score of 0.188. While scouring the internet for any new model, I came across Vision Transformers. I was honestly surprised that the transformer architecture could be applied to images. That led me to an interesting paper called "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

About the Paper

We know that the Transformer architecture has become the norm for Natural Language Processing (NLP) tasks. Unlike in NLP, in computer vision attention has typically been used in conjunction with convolutional networks. However, this paper demonstrates that convolutional networks are not needed: a pure transformer architecture applied to a sequence of image patches can perform image classification really well, provided that it is pre-trained on large amounts of data.

Basic Theory of Vision Transformers

The Vision Transformer architecture was inspired by the success of the Transformer in NLP. The first step in creating a Vision Transformer is to split an image into patches. We then generate the positions of these patches and produce embeddings for them. Let us consider the dimensional transformation that is taking place here.

Our original image X has the dimensions HxWxC, where H is the height, W is the width of the image, and C is the number of channels. Since we are dealing with RGB images, C will be 3.

After extracting the patches, we get a tensor of shape NxPxPxC, where P is the patch size and N is the number of patches in an image.

To calculate it, \(N = \frac{H \times W}{P \times P}\).
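
For example, a 224x224 RGB image split into 16x16 patches gives \(N = \frac{224 \times 224}{16 \times 16} = 196\) patches.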

Now we flatten the aforementioned patches and project them via a dense layer to a dimension D, which is known as the constant latent vector size D. Then we add positional embeddings to the patch embeddings to retain some of the position information. The positional embeddings are 1D rather than 2D, since no performance gain was observed from 2D embeddings.
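
Below is a rough sketch of this patch-embedding step in PyTorch. The image size, patch size, and D are illustrative defaults, and the class token used in the paper is omitted for brevity:

```python
import torch
import torch.nn as nn

H = W = 224             # image size (illustrative)
C = 3                   # RGB channels
P = 16                  # patch size (illustrative)
D = 768                 # constant latent vector size D (illustrative)
N = (H * W) // (P * P)  # number of patches = 196

# Split the image into P x P patches and flatten each to a vector of length P*P*C
x = torch.randn(1, C, H, W)                    # (batch, C, H, W)
patches = x.unfold(2, P, P).unfold(3, P, P)    # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, N, P * P * C)

# Linear projection to dimension D plus learned 1D positional embeddings
proj = nn.Linear(P * P * C, D)
pos_embed = nn.Parameter(torch.zeros(1, N, D))
tokens = proj(patches) + pos_embed             # (1, N, D)
print(tokens.shape)  # torch.Size([1, 196, 768])
```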

This output is then forwarded through several transformer encoder blocks.

The transformer encoder block is composed of alternating layers of multi-headed self-attention and MLP blocks. Layer Norm is applied before every block, i.e. before each attention or MLP block, and a residual connection is added after every block.
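
Here is a minimal sketch of one such pre-norm encoder block in PyTorch; the hidden sizes are illustrative ViT-Base-like values, not something prescribed by this post:

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block:
    LayerNorm -> multi-head self-attention -> residual,
    then LayerNorm -> MLP -> residual."""

    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):
        # Layer Norm before attention, residual connection after
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Layer Norm before the MLP, residual connection after
        x = x + self.mlp(self.norm2(x))
        return x


block = EncoderBlock()
tokens = torch.randn(1, 196, 768)   # patch embeddings from the previous step
print(block(tokens).shape)          # torch.Size([1, 196, 768])
```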

It should be noted that Vision Transformers have much less inductive bias than CNNs. Inductive biases are assumptions we make about a data set; for example, we might assume that the marks of students in a given subject follow a Gaussian distribution. CNN architectures inherently have some biases due to the way they are structured: CNNs are built to capture the local relationships between the pixels of an image, and as CNNs get deeper, these local feature extractors help to extract global features. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. A hybrid version of ViT also exists, where a CNN is first applied to extract feature maps that are then forwarded to the Transformer encoder block.

Is it better than CNNs?

In the paper, ViT is only applied to classification tasks, not segmentation or detection, yet it still matches or outperforms CNNs and benefits from the parallelism of multi-head self-attention. However, ViT only performs better when pre-trained on large amounts of data, and it requires more training epochs.