Tokenization#

The Idea#

The core of a Large Language Model expects some kind of sequence, so first we need to be able to convert a text into a sequence of some kind.

This is the job of the tokenizer. The tokenizer splits the input text into a sequence of individual units - called tokens - for further processing by the LLM. A token is simply any string - tokens can be characters, subwords, words and (theoretically) even larger units of text.

There are many different strategies that can be used to split a text into tokens. These strategies mainly differ in the vocabulary they allow for. The vocabulary is the list of all possible tokens that can be produced by a tokenizer.

One very simple strategy would be to do character tokenization, i.e. to split the text into its individual characters:

def tokenize_characters(text):
    return list(text)

tokenize_characters("Example")
['E', 'x', 'a', 'm', 'p', 'l', 'e']

The problem with this approach is that the model would have to learn everything from the ground up - including how to form words. Additionally, this will result in very long sequence lengths (since every character is a separate unit), which will make it harder for our model to “pay attention” to the relevant parts.
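To make the sequence-length problem a bit more concrete, here is a small illustration with a sentence of our own (the example is hypothetical and not part of any real tokenizer):

sentence = "The quick brown fox jumps over the lazy dog"
print(len(tokenize_characters(sentence)))
43

A sequence of 43 units for a sentence of just nine words is quite a lot.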

A better way is to use word tokenization - just split the text into words. Here is how the simplest word tokenizer might look:

def tokenize_words(text):
    return text.split()

tokenize_words("This is a sentence")
['This', 'is', 'a', 'sentence']

There are two objections to this approach.

First, we would need to manually handle a lot of special cases such as punctuation and words with apostrophes (should "don't" be one token or two?).

Second, and more importantly, this approach implicitly treats all words as equally important and leads to an extremely large vocabulary (the vocabulary size is just the number of possible tokens). In later chapters, we will see that the size of the LLM grows with the vocabulary size, so we want to keep it reasonable. This becomes especially important once we move beyond the English language and consider - well - the rest of the world.
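To see how quickly a word-level vocabulary grows, here is a small hand-made illustration (the tiny corpus is ours, invented for this example): inflections, punctuation and other languages all produce brand-new tokens, even though the underlying words are closely related.

corpus = [
    "The cat sat.",
    "The cats sat!",
    "Die Katze saß.",  # the same sentence in German
]

vocabulary = set()
for line in corpus:
    vocabulary.update(tokenize_words(line))

print(sorted(vocabulary))
['Die', 'Katze', 'The', 'cat', 'cats', 'sat!', 'sat.', 'saß.']

Eight distinct tokens for what is essentially a single sentence - and that is before we even get to rare words, typos and the rest of the world's languages.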

So far, we have considered two opposing approaches and identified flaws in both. Character tokenization produces units that are too small, so we end up with very long sequences. Word tokenization produces units that are too large, so we end up with an enormous vocabulary.

Therefore, most modern tokenizers have settled somewhere in between character and word tokenization and do something called subword tokenization.

Subword Tokenization#

Let’s use the tiktoken library to show an example of subword tokenization. First, we need to load a tokenizer.

We will use the gpt2 tokenizer since we will use the gpt2 model throughout this book:

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

We can use the encode method to tokenize a string:

encoded = tokenizer.encode("This is a sentence", allowed_special={"<|endoftext|>"})

Let’s inspect the tokenization result:

print(encoded)
[1212, 318, 257, 6827]

Huh? What are these strange numbers?

This is where we have to admit that we omitted an important technical detail so far. A tokenizer doesn't actually return a list of strings. Instead, it returns a list of integers, where every integer is an ID representing a certain token from the tokenizer vocabulary.

We can view the tokens themselves like this:

for encoded_id in encoded[:10]:
    print(repr(tokenizer.decode([encoded_id])))
'This'
' is'
' a'
' sentence'

We can also decode multiple tokens at the same time:

print(tokenizer.decode(encoded))
This is a sentence

Interestingly, the tokenizer can even handle garbage input:

garbage = "asdasdaf"
print(tokenizer.decode(tokenizer.encode(garbage)))
asdasdaf

If we look at the individual tokens, we get a glimpse of how subword tokenization operates:

for encoded_id in tokenizer.encode(garbage):
    print(repr(tokenizer.decode([encoded_id])))
'as'
'd'
'as'
'd'
'af'

It seems like the gpt2 tokenizer breaks words not found in its predefined vocabulary into smaller subwords and backs off to individual characters if necessary. We will cover the specifics of this in a second. Before we do that, let’s look at tokenizers in the transformers library, since that’s the library we will use most of the time.

Tokenizers in the transformers Library#

Using the transformers library, we can instantiate a tokenizer like this:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

print(type(tokenizer))
<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>

Alternatively, you can also use the AutoTokenizer class and let the transformers library figure out what kind of tokenizer should be instantiated:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(type(tokenizer))
<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>

Let’s have a look at the vocabulary size of the tokenizer:

print(tokenizer.vocab_size)
50257

The gpt2 tokenizer has more than 50,000 tokens, which is quite a large number, but certainly not as large as the number of all possible words that can be formed in English and other languages.

We can map a text to its token IDs using the encode method of the tokenizer object:

text = "This is a sentence"

token_ids = tokenizer.encode(text)

print(token_ids)
[1212, 318, 257, 6827]

Similarly, we can map the token IDs back to the original text using the decode method of the tokenizer object:

original_text = tokenizer.decode(token_ids)

print(original_text)
This is a sentence

If you want to get the tokens from the text as strings, you can use the tokenize method:

tokens = tokenizer.tokenize(text)

print(tokens)
['This', 'Ġis', 'Ġa', 'Ġsentence']

Note that most tokenizers do some shenanigans when representing tokens internally. For example, the gpt2 tokenizer represents a whitespace at the beginning of a word as Ġ. This is not important to us except as a technical detail.
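As a small aside, if you ever want to turn such a token list back into readable text, the tokenizer provides a convert_tokens_to_string method:

print(tokenizer.convert_tokens_to_string(tokens))
This is a sentence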

You can also convert IDs to tokens using the convert_ids_to_tokens method:

print(tokenizer.convert_ids_to_tokens(token_ids))
['This', 'Ġis', 'Ġa', 'Ġsentence']

Finally, you can call the tokenizer object directly to get output that will be usable by the rest of the LLM:

encoded_input = tokenizer(text, return_tensors="pt")

print(encoded_input)
{'input_ids': tensor([[1212,  318,  257, 6827]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

This will not only return the token IDs (called input_ids in this dictionary) but also an attention_mask. We will cover the purpose of this in later chapters.

The Byte-Pair Encoding Algorithm#

How exactly are subword tokenizers created?

We could theoretically manually define all the respective subwords, but this is pretty tedious and becomes quite hard to do for tokenizers with 50,000 possible tokens. Therefore, subword tokenizers are usually trained.

Let’s consider the following text:

text = """
Tokenizers are essential tools in natural language processing.
They break down text into smaller units called tokens.
Tokenizers help in transforming raw text into a format that machine learning models can understand.
There are different types of tokenizers, including word tokenizers, subword tokenizers, and character tokenizers.
"""

First, we convert the text into a byte sequence:

tokens = text.encode("utf-8")
print(tokens)
b'\nTokenizers are essential tools in natural language processing.\nThey break down text into smaller units called tokens.\nTokenizers help in transforming raw text into a format that machine learning models can understand.\nThere are different types of tokenizers, including word tokenizers, subword tokenizers, and character tokenizers.\n'

Let’s inspect the first ten bytes:

token_ids = list(map(int, tokens))
print(token_ids[:10])
[10, 84, 111, 107, 101, 110, 105, 122, 101, 114]

We will now use the token_ids list to train our own (very limited) tokenizer. The most popular method to train tokenizers is called Byte-Pair Encoding (BPE), which builds a vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words.

Basically, we begin with an initial list of token IDs. At every step, we identify the pair of token IDs that is most common in our sequence and "merge" that pair into a new token. For example, if the pair (126, 84) is the most common pair, we would generate a new token with an ID that doesn't exist yet, replace all occurrences of (126, 84) with this new token, and continue the process.

To get started, let’s write a helper function that identifies the most common token pair in our sequence:

def get_token_pair_counts(token_ids):
    counts = {}
    for pair in zip(token_ids, token_ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def get_most_common_token_pair(token_ids):
    token_pair_counts = get_token_pair_counts(token_ids)
    return max(token_pair_counts, key=token_pair_counts.get)

most_common_token_pair = get_most_common_token_pair(token_ids)
print(most_common_token_pair)
(101, 114)

This pair corresponds to the characters e and r, which makes sense if we look at the text and observe the frequencies of the character pairs:

print(chr(101), chr(114))
e r
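Out of curiosity, we can also check how often this pair occurs in our text (we won't rely on the exact number later):

print(get_token_pair_counts(token_ids)[most_common_token_pair])
11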

To merge these characters into a new token, we simply iterate over the sequence and replace each character pair with a new token ID:

def merge(token_ids, pair, new_token_id):
    new_token_ids = []
    i = 0
    while i < len(token_ids):
        if i < len(token_ids) - 1 and token_ids[i] == pair[0] and token_ids[i+1] == pair[1]:
            new_token_ids.append(new_token_id)
            i += 2
        else:
            new_token_ids.append(token_ids[i])
            i += 1
    return new_token_ids

Here is how we could use the merge function:

print(merge([1, 4, 5, 2, 3, 4, 5, 3, 4, 5], (4, 5), 6))
[1, 6, 2, 3, 6, 3, 6]

Now, we simply repeat this merge step over and over, always merging the currently most common pair. Note that these pairs can contain tokens that were themselves the result of an earlier merge, so BPE builds up subwords recursively.

One important question to consider is how many steps we want to perform. This is basically a hyperparameter - the more tokens we have, the larger our vocabulary size but the smaller our sequence lengths will be.

num_merges = 50

old_token_ids = list(token_ids)  # keep a copy of the original byte-level IDs

merges = {}
for i in range(num_merges):
    most_common_token_pair = get_most_common_token_pair(token_ids)
    new_token_id = 256 + i
    print(f"Merge {most_common_token_pair} into a new token {new_token_id}")
    token_ids = merge(token_ids, most_common_token_pair, new_token_id)
    merges[most_common_token_pair] = new_token_id
Merge (101, 114) into a new token 256
Merge (32, 116) into a new token 257
Merge (105, 110) into a new token 258
Merge (101, 110) into a new token 259
Merge (111, 107) into a new token 260
Merge (260, 259) into a new token 261
Merge (256, 115) into a new token 262
Merge (261, 105) into a new token 263
Merge (263, 122) into a new token 264
Merge (264, 262) into a new token 265
Merge (101, 32) into a new token 266
Merge (32, 258) into a new token 267
Merge (97, 110) into a new token 268
Merge (10, 84) into a new token 269
Merge (97, 114) into a new token 270
Merge (97, 108) into a new token 271
Merge (258, 103) into a new token 272
Merge (111, 114) into a new token 273
Merge (257, 265) into a new token 274
Merge (101, 115) into a new token 275
Merge (97, 116) into a new token 276
Merge (46, 269) into a new token 277
Merge (32, 99) into a new token 278
Merge (272, 32) into a new token 279
Merge (274, 44) into a new token 280
Merge (265, 32) into a new token 281
Merge (270, 266) into a new token 282
Merge (275, 115) into a new token 283
Merge (259, 116) into a new token 284
Merge (108, 115) into a new token 285
Merge (277, 104) into a new token 286
Merge (257, 101) into a new token 287
Merge (287, 120) into a new token 288
Merge (288, 116) into a new token 289
Merge (289, 267) into a new token 290
Merge (290, 116) into a new token 291
Merge (291, 111) into a new token 292
Merge (292, 32) into a new token 293
Merge (271, 108) into a new token 294
Merge (32, 117) into a new token 295
Merge (295, 110) into a new token 296
Merge (102, 273) into a new token 297
Merge (297, 109) into a new token 298
Merge (97, 99) into a new token 299
Merge (268, 100) into a new token 300
Merge (119, 273) into a new token 301
Merge (301, 100) into a new token 302
Merge (302, 280) into a new token 303
Merge (303, 32) into a new token 304
Merge (269, 281) into a new token 305
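Already at this point we can see the effect of the merges: the merged sequence is noticeably shorter than the original byte-level sequence we saved in old_token_ids (the exact numbers depend on the text above, so the output is not shown here):

print(len(old_token_ids), len(token_ids))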

All that’s left is to write the decode and encode functions.

Writing the decode function is simple - we just need to translate every individual ID to the bytes it represents and decode the result as UTF-8:

vocab = { idx: bytes([idx]) for idx in range(256) }
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    tokens = b"".join(vocab[idx] for idx in ids)
    return tokens.decode("utf-8", errors="replace")

print(decode([104, 101, 108, 108, 111, 32, 301, 108, 100, 33]))
hello world!
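The errors="replace" argument matters because not every sequence of IDs decodes to valid UTF-8. For example, byte 128 on its own is a UTF-8 continuation byte, so it gets replaced with the replacement character:

print(decode([128]))
�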

The encode function is a bit harder and works iteratively: at every step, we look for the pair that was merged earliest during training (the one with the lowest new token ID), apply that merge, and repeat until no more merges apply:

def encode(text):
    token_ids = list(text.encode("utf-8"))
    while len(token_ids) >= 2:
        counts = get_token_pair_counts(token_ids)
        # pick the pair that was merged earliest during training
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        idx = merges[pair]
        token_ids = merge(token_ids, pair, idx)
    return token_ids

print(encode("hello world!"))
[104, 101, 108, 108, 111, 32, 301, 108, 100, 33]
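As a quick sanity check (using the merges we learned above), encoding a text and then decoding the result should give us back the original text:

print(decode(encode(text)) == text)
True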

Final Notes#

Note that the tokenizer is conceptually distinct from the rest of the LLM. It often uses its own training set, which may differ from that of the LLM, to train its vocabulary using the BPE algorithm.

The tokenizer’s role is to translate between text and numbers. The LLM only processes numbers and never interacts directly with the text. In theory, once the text is translated into numerical form, the original text could be discarded.