# Deep Dive Into GPT-2

In this chapter, we take a deep dive into the architecture of one of the first truly _Large_ Language Models - **GPT-2**.
GPT-2 is an LLM that was released by OpenAI in 2019, which sparked widespread public discussion about the potential benefits and dangers of LLMs.

The reason we chose GPT-2 is simple.
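The GPT-2 family ranges from 124 million to 1.5 billion parameters, and the smallest checkpoint (`gpt2`, "just" 124 million parameters), which we use throughout this chapter, is small enough that you will be able to load it into the memory of your local machine without having to provision a GPU instance on some cloud provider.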

As usual, we need to import a few things:

In [None]:
import torch
import torch.nn as nn

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

We will also disable the `huggingface_hub` progress bars so as not to clutter the book (you should probably keep them when following along, though):

In [None]:
import os

os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

Note that we use `torch==2.4.0` and `transformers==4.44.0`.
If you want to follow along, we recommend installing the same versions. With other versions, you might need to change some of the code:

In [None]:
print(torch.__version__)

In [None]:
print(transformers.__version__)

The code in this chapter is written in such a way that it closely mimics the `transformers` codebase.
Generally, we highly encourage you to read the `transformers` codebase - it is well-written and easy to understand.

The most relevant file for the purposes of this chapter is [`models/gpt2/modeling_gpt2.py`](https://github.com/huggingface/transformers/blob/v4.44-release/src/transformers/models/gpt2/modeling_gpt2.py) (especially the `GPT2LMHeadModel`, `GPT2Block`, `GPT2SdpaAttention` and `GPT2MLP` classes).

## Loading the Model and Performing Inference

First, let's load the `gpt2` tokenizer and the `gpt2` model:

In [None]:
%%capture

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

In [None]:
print(type(tokenizer))

In [None]:
print(type(model))

Next, let's perform some inference using an example text:

In [None]:
text = "This is an example sentence"

In [None]:
encoded_input = tokenizer(text, return_tensors="pt")

In [None]:
print(encoded_input)

In [None]:
output = model(**encoded_input)

In [None]:
print(type(output))

The `output` of the model is a `CausalLMOutputWithCrossAttentions` object that has (among other things) a `logits` attribute.
This corresponds to the list of probabilities we discussed in Chapter 1, except that the model returns logits, i.e. unnormalized scores that only become probabilities after applying a softmax (working with logits is numerically more stable):

In [None]:
print(output.logits.shape)

In [None]:
print(output.logits)

Note that `output.logits` is a tensor with three dimensions.

The first dimension represents the batch size. 
Since we only have a single text, the batch size is simply `1`.

The second dimension is the number of tokens in the sequence.
We have five tokens, and therefore, the second dimension has the size `5`.

Finally, the third dimension is the size of the vocabulary (as we output one logit for every token in the vocabulary).
Since the vocabulary size is `50257`, the third dimension has the size `50257`.

The logits at a given position describe the token that follows that position.
Since we want to predict the token that comes after the whole text, we are interested in the logits at the _last_ position, so let's extract those:

In [None]:
last_logits = output.logits[0, -1, :]

In [None]:
print(last_logits.shape)

In [None]:
print(last_logits)

Let's now get the actual probabilities:

In [None]:
probas = torch.softmax(last_logits, dim=0)
print(probas)
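To make the relationship between logits and probabilities more concrete, here is a minimal sketch on a made-up logits vector (the values in `toy_logits` are purely illustrative and do not come from GPT-2). It computes the softmax by hand, compares it with `torch.softmax`, and checks that the most likely token is the same before and after the conversion:

```python
import torch

# Hypothetical logits for a toy vocabulary of four tokens
toy_logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

# Softmax by hand: exponentiate, then normalize so that the values sum to one
manual_probas = torch.exp(toy_logits) / torch.exp(toy_logits).sum()
toy_probas = torch.softmax(toy_logits, dim=0)

print(toy_probas)
print(toy_probas.sum())

# The ranking of the tokens is the same for logits and probabilities
print(torch.argmax(toy_logits) == torch.argmax(toy_probas))
```

Since softmax preserves the ranking, selecting the most probable token could also be done directly on the logits.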

Finally, we choose the next token based on this probability distribution.
There are many different ways to accomplish this; the simplest is to select the token with the highest probability:

In [None]:
next_token_id = torch.argmax(probas, dim=-1)
print(next_token_id)

In [None]:
next_token = tokenizer.decode(next_token_id)
print(next_token)

Let's also have a look at the `10` most probable tokens:

In [None]:
top_k = 10
top_k_probs, top_k_ids = torch.topk(probas, top_k)

top_k_tokens = [(token_id, tokenizer.decode(token_id)) for token_id in top_k_ids]

for (token_id, token), prob in zip(top_k_tokens, top_k_probs):
    print(f"Token: {token} (ID = {token_id}), Probability: {round(prob.item(), 2)}")

We can see that all of these tokens are reasonable candidates for the next token in the sentence `"This is an example sentence"`.

In fact, instead of simply selecting the most probable next token at every step (which often leads to repetitive and boring texts), we could _sample_ from the probability distribution.
Here, we would choose a random token weighted by the probabilities of the tokens (i.e. tokens with higher probabilities are more likely to be sampled):

In [None]:
# Set a random seed for reproducibility
torch.manual_seed(42)

sampled_token_id = torch.multinomial(probas, num_samples=1)[0]

In [None]:
print(sampled_token_id)

In [None]:
print(tokenizer.decode(sampled_token_id))
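The difference between greedy selection and sampling can be illustrated on a toy distribution. The sketch below assumes a hypothetical three-token distribution `toy_probas` (not GPT-2 output) and draws many samples with `torch.multinomial`. The empirical frequencies should roughly match the probabilities, whereas `torch.argmax` always returns the same token:

```python
import torch

torch.manual_seed(0)

# Hypothetical next-token distribution over three tokens
toy_probas = torch.tensor([0.6, 0.3, 0.1])

# Greedy selection is deterministic and always picks the same token
greedy_id = torch.argmax(toy_probas)

# Sampling draws token IDs in proportion to their probabilities
num_samples = 10_000
samples = torch.multinomial(toy_probas, num_samples=num_samples, replacement=True)
frequencies = torch.bincount(samples, minlength=3) / num_samples

print(greedy_id)
print(frequencies)
```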

Now that we have seen how to automatically compute the probabilities of the next token, let's redo the calculations manually - layer by layer.
Before we do that, we will inspect the architecture of the model first to see what layers are actually present.

## The Architecture

The `model` we have loaded is actually a `torch.nn.Module`, which is the PyTorch base class for all neural network modules:

In [None]:
print(isinstance(model, nn.Module))

There are many ways to inspect the architecture of a `torch.nn.Module`. One straightforward method is to simply `print` the model:

In [None]:
print(model)

The model consists of two parts: a `transformer` and an `lm_head`.
Two important points should be noted here:

First, the tokenizer is not part of the `model` as it is already represented by the `tokenizer` variable.
Second, the `transformer` object includes both the embedding block and the transformer (in terms of the terminology introduced in Chapter 1).

Looking closely, we see that the model has two embedding layers at the beginning - `wte` and `wpe`.
The `wte` layer is the embedding layer for the tokens:

In [None]:
print(model.transformer.wte)

The `wpe` layer is the positional embedding layer:

In [None]:
print(model.transformer.wpe)

These layers are followed by a dropout layer:

In [None]:
print(model.transformer.drop)

Next, we have a module list consisting of 12 "GPT blocks":

In [None]:
print(type(model.transformer.h))

In [None]:
print(len(model.transformer.h))

Each block is a so-called `GPT2Block` object:

In [None]:
print(type(model.transformer.h[0]))

When we look inside a `GPT2Block`, we will encounter many familiar components:

In [None]:
print(model.transformer.h[0])

The module list with the GPT blocks is followed by one last layer normalization:

In [None]:
print(model.transformer.ln_f)

The final component of the entire model is a linear layer which is responsible for computing the logits of the probabilities of the next token:

In [None]:
print(model.lm_head)

Generally speaking, whenever you want to understand how a particular LLM works, printing its architecture is extremely instructive as it provides a basic overview of the components it has.
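Besides `print`, every `torch.nn.Module` offers a few methods for programmatic inspection, such as `named_children`, `named_parameters` and `parameters`. Here is a small sketch on a hypothetical toy model (`toy_model` is made up for illustration and is much smaller than GPT-2):

```python
import torch.nn as nn

# A hypothetical toy model, used only to illustrate the inspection helpers
toy_model = nn.Sequential(
    nn.Embedding(10, 4),  # 10 * 4 = 40 parameters
    nn.LayerNorm(4),      # weight + bias = 8 parameters
    nn.Linear(4, 10),     # 4 * 10 weights + 10 biases = 50 parameters
)

# Direct submodules together with their names
for name, child in toy_model.named_children():
    print(name, type(child).__name__)

# Names and shapes of all parameter tensors
for name, param in toy_model.named_parameters():
    print(name, tuple(param.shape))

# Total number of parameters
num_params = sum(p.numel() for p in toy_model.parameters())
print(num_params)
```

The same calls should work on the `model` object from this chapter, for example to count its parameters.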

Now let's see how the tensors actually flow through the model.

We will start with the embeddings.

## Embeddings

Let's retrieve the token IDs of our example text:

In [None]:
token_ids = encoded_input["input_ids"]

In [None]:
token_ids

We also obtain the attention mask for later:

In [None]:
attention_mask = encoded_input["attention_mask"]

In [None]:
print(attention_mask)

Next, we need to generate the position IDs.

For each token ID, we require a corresponding position ID.
The position ID sequence is constructed by simply starting at `0` and then counting up to the sequence length minus one (i.e. `token_ids.shape[1] - 1`, which is `4` in our example).

In [None]:
position_ids = torch.tensor([[0, 1, 2, 3, 4]], dtype=torch.long)
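For sequences of arbitrary length, the position IDs do not have to be written out by hand. A minimal sketch using `torch.arange` could look like this (`toy_token_ids` contains made-up IDs and only serves as a stand-in for real tokenizer output):

```python
import torch

# Hypothetical token IDs with batch size 1 and sequence length 5
toy_token_ids = torch.tensor([[11, 22, 33, 44, 55]])

seq_len = toy_token_ids.shape[1]

# arange yields 0, 1, ..., seq_len - 1; unsqueeze adds the batch dimension
toy_position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)

print(toy_position_ids)
print(toy_position_ids.shape)
```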

Let's now calculate the token embeddings.
This is just a matter of applying the token embedding layer (i.e. `wte`) to the tensor containing the token IDs:

In [None]:
token_embeds = model.transformer.wte(token_ids)

We can do a quick dimensionality check.
The `token_ids` tensor has a dimension of `1x5` (batch size `1` and sequence length of `5`).
For each token, we compute an embedding of dimension `768`, and therefore, the output tensor should have a dimension of `1x5x768`.
This is indeed the case:

In [None]:
print(token_embeds.shape)
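It may help to remember that an embedding layer is essentially a lookup table: applying it to a tensor of IDs returns the corresponding rows of its weight matrix. The following sketch illustrates this with a hypothetical, tiny embedding layer `toy_wte` (the sizes are made up and much smaller than those of GPT-2):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: 10 tokens in the vocabulary, embedding dimension 4
toy_wte = nn.Embedding(num_embeddings=10, embedding_dim=4)
toy_token_ids = torch.tensor([[3, 1, 4, 1, 5]])

toy_embeds = toy_wte(toy_token_ids)

# Applying the layer is the same as indexing rows of the weight matrix
looked_up = toy_wte.weight[toy_token_ids]

print(toy_embeds.shape)
print(torch.equal(toy_embeds, looked_up))

# The same token ID always receives the same embedding vector
print(torch.equal(toy_embeds[0, 1], toy_embeds[0, 3]))
```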

Let's also calculate the positional embeddings.
This calculation is very similar to the calculation of the token embeddings, except that we now apply the _positional_ embedding layer (i.e. `wpe`) to the tensor containing the position IDs:

In [None]:
position_embeds = model.transformer.wpe(position_ids)

Again, we perform a quick dimensionality check.
The `position_ids` sequence has a dimension of `1x5` (with a batch size `1` and `5` position IDs).
For each position, we compute an embedding of dimension `768`, so the resulting output tensor should have a dimension of `1x5x768`:

In [None]:
print(position_embeds.shape)

To get the final embeddings we simply _add_ the token embeddings and the positional embeddings.

These final embeddings will be passed to the module list as the input to the first GPT block.
Since the `transformers` codebase refers to the intermediate tensors that flow through the blocks as "hidden states", we will adopt that terminology:

In [None]:
hidden_states = token_embeds + position_embeds
print(hidden_states.shape)

Here is a graphic representation of the tensor flow so far:

![Embeddings](images/embeddings.png)

## The First GPT2 Block

Let's now have a look at the module list, specifically its first "GPT block".

First, we will assign more meaningful names to both the module list and the GPT block instead of using `h` and `h[0]`:

In [None]:
layer_blocks = model.transformer.h
layer_block = layer_blocks[0]

As a reminder, here is what the layer block looks like:

In [None]:
print(layer_block)

Basically, a `GPT2Block` has two components: an attention part and an MLP part. 

The attention part consists of a `LayerNorm` layer, followed by a `GPT2SdpaAttention` block which contains the attention mechanism.

The MLP part consists of a `LayerNorm` layer, followed by a `GPT2MLP` block which contains a simple MLP with two linear layers separated by a non-linear activation function.

We will now look at both parts in detail.

Let's rename the tensor and also save it in another variable since we will need it later for the residual connection:

In [None]:
attention_input = hidden_states
attention_residual = attention_input

### The Attention Part

First, we perform layer normalization on our tensor using the `LayerNorm` layer.

Remember that this operation doesn't change the dimension of the tensor:

In [None]:
normalized_attention_input = layer_block.ln_1(attention_input)
print(normalized_attention_input.shape)
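As a refresher, layer normalization can be reproduced by hand. The sketch below uses a hypothetical `toy_ln` layer with a small hidden size and random input. Every token vector is normalized to zero mean and unit variance, and the result is then scaled and shifted by the learned `weight` and `bias`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8  # hypothetical; GPT-2 uses 768
toy_ln = nn.LayerNorm(hidden_size)
x = torch.randn(1, 5, hidden_size)

# Normalize every token vector to zero mean and unit variance ...
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
normalized = (x - mean) / torch.sqrt(var + toy_ln.eps)

# ... then apply the learned elementwise scale and shift
manual_output = normalized * toy_ln.weight + toy_ln.bias

print(manual_output.shape)
print(torch.allclose(manual_output, toy_ln(x), atol=1e-5))
```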

Next, we want to get the query, key and value tensors.

This is what the `c_attn` layer is for:

In [None]:
query_key_value = layer_block.attn.c_attn(normalized_attention_input)
print(query_key_value.shape)

Again, the first dimension of the tensor is the batch size (which is `1`), while the second dimension represents the number of tokens (which is `5`).

The third dimension is more complicated.
In the `transformers` codebase, the calculation of the queries, keys and values is combined in a single operation.
Therefore, the output tensor contains the queries, keys and values all in one object.
This is why the third dimension is `2304 = 768 * 3` (since we store queries _and_ keys _and_ values and each of these has `768` items).

Since we want to work with these tensors separately, we need to split them out using the `split` function.
The tensor has three dimensions and the queries, keys and values are laid out along the last one (dimension `2`, counting from zero), so we need to `split` along `dim=2`.

The order of the items in the tensor is query first, key second and value third:

In [None]:
queries, keys, values = query_key_value.split(768, dim=2)

In [None]:
print(queries.shape, keys.shape, values.shape)
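To see why a single fused operation is equivalent to three separate projections, consider the following sketch. It assumes a made-up weight matrix `fused_weight` and bias `fused_bias` with a small hidden size and does not use the actual GPT-2 weights. Splitting the fused output gives the same keys as projecting with only the middle third of the columns:

```python
import torch

torch.manual_seed(0)

hidden_size = 4  # hypothetical; GPT-2 uses 768
x = torch.randn(1, 5, hidden_size)

# One fused weight matrix and bias that produce queries, keys and values at once
fused_weight = torch.randn(hidden_size, 3 * hidden_size)
fused_bias = torch.randn(3 * hidden_size)

fused_output = x @ fused_weight + fused_bias
q, k, v = fused_output.split(hidden_size, dim=2)

# The same keys, computed with only the middle third of the columns
key_columns = slice(hidden_size, 2 * hidden_size)
k_separate = x @ fused_weight[:, key_columns] + fused_bias[key_columns]

print(q.shape, k.shape, v.shape)
print(torch.allclose(k, k_separate, atol=1e-5))
```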

Next, we need to split the attention heads.

To accomplish this, we will simply reuse the `_split_heads` helper function from the `transformers` codebase:

In [None]:
def _split_heads(tensor, num_heads, attn_head_size):
    """
    Splits hidden_size dim into attn_head_size and num_heads.
    This function is taken directly from the transformers codebase.
    """
    new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
    tensor = tensor.view(new_shape)
    return tensor.permute(0, 2, 1, 3)

The GPT-2 model has 12 attention heads and a head dimension of 64:

In [None]:
num_heads = 12
head_dim = 64
head_queries = _split_heads(queries, num_heads, head_dim)
head_keys = _split_heads(keys, num_heads, head_dim)
head_values = _split_heads(values, num_heads, head_dim)
print(head_queries.shape, head_keys.shape, head_values.shape)
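What does splitting the heads actually do to the values? The sketch below repeats the `view` and `permute` steps on a small tensor with hypothetical sizes (2 heads with 3 dimensions each) and shows that every head simply receives a contiguous slice of the features of each token:

```python
import torch

# Hypothetical sizes: 2 heads with 3 dimensions each (GPT-2 uses 12 and 64)
toy_num_heads, toy_head_dim = 2, 3
x = torch.arange(1 * 5 * 6, dtype=torch.float32).view(1, 5, 6)

# Same steps as in _split_heads: view, then move the head axis forward
heads = x.view(1, 5, toy_num_heads, toy_head_dim).permute(0, 2, 1, 3)

print(heads.shape)

# Head 0 holds the first 3 features of every token, head 1 the last 3
print(torch.equal(heads[:, 0], x[:, :, :3]))
print(torch.equal(heads[:, 1], x[:, :, 3:]))
```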

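Next, we compute the attention output.
The `scaled_dot_product_attention` function computes the attention scores, applies the causal mask and the softmax, and then uses the resulting weights to combine the values: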

In [None]:
sdpa_output = torch.nn.functional.scaled_dot_product_attention(
    head_queries,
    head_keys,
    head_values,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=True,
)

In [None]:
print(sdpa_output.shape)
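To see what happens inside `scaled_dot_product_attention`, we can reproduce the computation by hand. The following is a minimal sketch on random tensors with hypothetical sizes. It follows the mathematical definition of causal attention and is not meant to mirror the optimized implementation in PyTorch:

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: batch 1, 2 heads, 5 tokens, head dimension 4
q = torch.randn(1, 2, 5, 4)
k = torch.randn(1, 2, 5, 4)
v = torch.randn(1, 2, 5, 4)

# Scaled dot products between every query and every key
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))

# Causal mask: a token may only attend to itself and to earlier tokens
seq_len = q.size(-2)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

# Softmax over the keys gives the attention weights, which then mix the values
weights = torch.softmax(scores, dim=-1)
manual_output = weights @ v

reference = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(manual_output, reference, atol=1e-5))
```

Note that the masked positions receive a weight of exactly zero, so the first token can only attend to itself.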

Now, it's time to merge the attention heads back together:

In [None]:
sdpa_output_transposed = sdpa_output.transpose(1, 2).contiguous()
sdpa_output_view = sdpa_output_transposed.view(1, 5, 768)

Let's double check the dimension:

In [None]:
print(sdpa_output_view.shape)
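The call to `contiguous` may look superfluous, but `transpose` only changes how the tensor is indexed, not how it is laid out in memory, and `view` requires a compatible memory layout. Here is a small sketch with hypothetical sizes, where `heads_output` is a random stand-in for a freshly computed attention output:

```python
import torch

torch.manual_seed(0)

# Stand-in for an attention output of shape (batch, heads, seq, head_dim)
heads_output = torch.randn(1, 2, 5, 3)

# Swap the head and sequence axes; the memory layout stays the same
transposed = heads_output.transpose(1, 2)
print(transposed.is_contiguous())

# contiguous() copies the data into the new layout so that view can flatten the heads
merged = transposed.contiguous().view(1, 5, 6)
print(merged.shape)

# For every token, the features of head 0 come first, followed by those of head 1
print(torch.equal(merged[:, :, :3], heads_output[:, 0]))
print(torch.equal(merged[:, :, 3:], heads_output[:, 1]))
```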

Now, we pass the tensor through the projection layer `c_proj`:

In [None]:
projection_output = layer_block.attn.c_proj(sdpa_output_view)

Again, we verify the dimension:

In [None]:
print(projection_output.shape)

Finally, we add our saved hidden state to the output:

In [None]:
attention_output = projection_output + attention_residual

This is called a **residual connection**.

Generally, we speak of residual connections if there is some function of the form $y = f(x) + x$ (in this case `x` is the input hidden state).
Residual connections essentially help with propagating the "signal" across layers (both in the forward and the backward pass).

Especially for the backward pass, such connections can help address the vanishing gradient problem (discussed in the chapter on computational graphs).
We can see this by comparing the derivative of a hypothetical loss function (with respect to the input $x$) with a residual connection and without a residual connection.

Let's compute $\frac{\partial L}{\partial x}$:

$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} = \frac{\partial L}{\partial y} (1 + \frac{\partial}{\partial x} f(x)) = \frac{\partial L}{\partial y} + \frac{\partial L}{\partial y} \frac{\partial}{\partial x} f(x)$

Without the residual connection, the derivative would simply be $\frac{\partial L}{\partial y} \frac{\partial}{\partial x} f(x)$.
With the residual connection, however, we directly _add_ the term $\frac{\partial L}{\partial y}$ on top of that.
This means that even if the gradient of $f$ is vanishingly small, the gradient $\frac{\partial L}{\partial x}$ need not vanish, so there can still be a meaningful update when we perform gradient descent.
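We can check this numerically with autograd. The sketch below assumes a hypothetical function `f` whose derivative is tiny and, for simplicity, uses $L = y$ so that $\frac{\partial L}{\partial y} = 1$:

```python
import torch


def f(x):
    # A hypothetical layer whose own gradient is tiny (f'(x) = 1e-6)
    return 1e-6 * x


# Without residual connection: y = f(x)
x_plain = torch.tensor(2.0, requires_grad=True)
y_plain = f(x_plain)
y_plain.backward()

# With residual connection: y = f(x) + x
x_res = torch.tensor(2.0, requires_grad=True)
y_res = f(x_res) + x_res
y_res.backward()

print(x_plain.grad)  # equals f'(x)
print(x_res.grad)    # equals 1 + f'(x)
```

Without the residual connection, the gradient is as small as the derivative of `f`. With the residual connection, it stays close to one.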

Let's print the shape of `attention_output`.
Note that the attention part of the GPT block did not change the _shape_ of the tensor, only its _values_:

In [None]:
print(attention_output.shape)

In [None]:
print(attention_input)

In [None]:
print(attention_output)

Here is a visualization of the process:

![Attention](images/attention.png)

### The MLP Part

The second part of the GPT block is the MLP part.

Again, we first save the current hidden state tensor:

In [None]:
mlp_input = attention_output
mlp_residual = mlp_input

And - again - we first pass the tensor through a layer normalization block:

In [None]:
normalized_mlp_input = layer_block.ln_2(mlp_input)

In [None]:
print(normalized_mlp_input.shape)

Next, we pass the hidden states through the MLP block.

The MLP block consists of a linear layer, followed by a non-linear activation function, and then another linear layer:

In [None]:
print(layer_block.mlp)

Here is what the tensor flow looks like:

In [None]:
c_fc_output = layer_block.mlp.c_fc(normalized_mlp_input)
act_output = layer_block.mlp.act(c_fc_output)
c_proj_output = layer_block.mlp.c_proj(act_output)

In [None]:
print(c_proj_output.shape)
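Inside the MLP, the first linear layer expands the hidden dimension by a factor of four (from `768` to `3072` in GPT-2) and the second one projects it back. The sketch below rebuilds this pattern with hypothetical sizes. It uses `nn.Linear` as a stand-in for the layers in the `transformers` codebase and spells out the tanh approximation of GELU, which, to our knowledge, corresponds to the `gelu_new` activation GPT-2 is configured with:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size = 8               # hypothetical; GPT-2 uses 768
inner_size = 4 * hidden_size  # GPT-2 expands to 4 * 768 = 3072

toy_c_fc = nn.Linear(hidden_size, inner_size)
toy_c_proj = nn.Linear(inner_size, hidden_size)


def gelu_tanh(x):
    # Tanh approximation of the GELU activation function
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


x = torch.randn(1, 5, hidden_size)
expanded = toy_c_fc(x)            # (1, 5, 8) -> (1, 5, 32)
activated = gelu_tanh(expanded)   # shape unchanged
projected = toy_c_proj(activated) # (1, 5, 32) -> (1, 5, 8)

print(expanded.shape, projected.shape)
print(torch.allclose(activated, F.gelu(expanded, approximate="tanh"), atol=1e-5))
```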

Finally, we have another residual connection:

In [None]:
mlp_output = mlp_residual + c_proj_output

In [None]:
print(mlp_output.shape)

In [None]:
mlp_output

Here is a visualization of the MLP part:

![MLP](images/mlp.png)

## The Other GPT Blocks

Now, we simply pass the hidden states through one block after the other, where the output of each block is the input to the next block.

Note that we already passed the tensor through the first block, so we will only consider the other 11 blocks:

In [None]:
hidden_states = mlp_output

for block in model.transformer.h[1:]:
    hidden_states = block(hidden_states)[0]

Since every block only changes the values of the tensor, but not its shape, the shape of the final tensor is _unchanged_ as well:

In [None]:
print(hidden_states.shape)
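To summarize the structure of a block, here is a condensed sketch of a pre-LayerNorm block that chains the steps we performed by hand. Note that `ToyBlock` is a hypothetical, simplified class with made-up sizes and attribute names. It is not the `GPT2Block` implementation (for example, it omits dropout, attention masks and caching):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBlock(nn.Module):
    """Hypothetical, simplified pre-LayerNorm block (not the transformers code)."""

    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.ln_1 = nn.LayerNorm(hidden_size)
        self.c_attn = nn.Linear(hidden_size, 3 * hidden_size)
        self.attn_proj = nn.Linear(hidden_size, hidden_size)
        self.ln_2 = nn.LayerNorm(hidden_size)
        self.c_fc = nn.Linear(hidden_size, 4 * hidden_size)
        self.mlp_proj = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        batch, seq_len, hidden_size = x.shape
        head_dim = hidden_size // self.num_heads

        # Attention part: LayerNorm -> attention -> projection -> residual
        q, k, v = self.c_attn(self.ln_1(x)).split(hidden_size, dim=2)
        q, k, v = (
            t.view(batch, seq_len, self.num_heads, head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).contiguous().view(batch, seq_len, hidden_size)
        x = x + self.attn_proj(attn)

        # MLP part: LayerNorm -> linear -> activation -> linear -> residual
        x = x + self.mlp_proj(F.gelu(self.c_fc(self.ln_2(x)), approximate="tanh"))
        return x


torch.manual_seed(0)
toy_blocks = nn.ModuleList([ToyBlock(hidden_size=8, num_heads=2) for _ in range(3)])

# The output of each block is the input to the next block
toy_hidden_states = torch.randn(1, 5, 8)
for toy_block in toy_blocks:
    toy_hidden_states = toy_block(toy_hidden_states)

print(toy_hidden_states.shape)
```

Because every block maps a tensor of shape `(batch, sequence, hidden)` to a tensor of the same shape, any number of blocks can be stacked.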

Finally, we pass the final result through one last layer normalization:

In [None]:
hidden_states = model.transformer.ln_f(hidden_states)
print(hidden_states.shape)

## Calculating the Logits

At last, we use the `lm_head` layer to calculate the logits:

In [None]:
logits = model.lm_head(hidden_states)
print(logits.shape)

Let's verify that our calculations are correct by checking if the `logits` tensor, which we computed manually, is the same as `output.logits`:

In [None]:
print((output.logits == logits).all())
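Exact equality is a rather strict test for floating-point tensors, since mathematically equivalent computations can differ in their last bits. If you follow along with different library versions or hardware, the check above might therefore report `False` even though the calculation is correct. A more forgiving alternative is a comparison within a small tolerance via `torch.allclose`, sketched here on a classic toy example rather than on the GPT-2 logits:

```python
import torch

a = torch.tensor(0.1, dtype=torch.float64) + torch.tensor(0.2, dtype=torch.float64)
b = torch.tensor(0.3, dtype=torch.float64)

# Exact comparison: sensitive to rounding in the last bits
print(a == b)

# Comparison within a small tolerance
print(torch.allclose(a, b))
```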
