BPE tokenization
Overview
Byte-Pair Encoding (BPE) tokenization is a common technique for converting text into an array of trainable input tokens before feeding it to a large language model (LLM). It is the tokenizer used by some of the most influential models, such as GPT-4 and RoBERTa. Andrej Karpathy has an excellent video about this process; I will summarize and expand on his ideas in this post. The idea of BPE is rather simple and is showcased on its Wikipedia page: suppose we have the data aaabdaaabac to encode. We replace the most frequent adjacent pair of bytes (aa) with a new vocabulary entry (Z), so the data becomes ZabdZabac. We can repeat this process again and again until we reach a certain vocabulary size, reducing the original data into chunks represented by the new vocabulary along the way.
Detailed Process
First of all, BPE (Byte Pair Encoding) requires an array of byte representations of the data. This can be obtained using the UTF-8 encoding scheme, a character-level encoding that transforms each character into 1 to 4 bytes, with the most common characters mapped to a single byte and less common ones to up to 4 bytes. Just as with aaabdaaabac above, the transformed byte array is used as the basis of the BPE process, in which the most frequent adjacent bytes are grouped into new vocabulary entries. The Python code below (by Karpathy) showcases this process.
First, a text string is encoded into byte tokens using UTF-8; notice how the number of encoded tokens is larger than the number of original characters. This is because some characters, such as 🅤, are encoded into more than 1 byte. In order to find out which pairs are most frequent, we also need a function that collects pair statistics over the data:
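A minimal sketch in the spirit of Karpathy's walkthrough is shown below. The sample text is a placeholder of my own (any paragraph will do), so the exact counts differ from the numbers quoted next.

```python
# Placeholder text; the original post uses a longer excerpt, so counts will differ.
text = "BPE turns text like this 🅤 into bytes before merging them."
tokens = text.encode("utf-8")     # raw bytes
tokens = list(map(int, tokens))   # list of integers in 0..255
print(len(text), len(tokens))     # 🅤 alone takes 4 bytes, so the byte list is longer than the character count

def get_stats(ids):
    """Count how often each adjacent pair of ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):  # iterate over consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
# list pairs in descending order of frequency
print(sorted(((v, k) for k, v in stats.items()), reverse=True))
```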
The above code registers each adjacent byte pair's frequency in the data and lists the pairs in descending order. Here, the byte pair (101, 32) appears most frequently (20 times) in the text data; it is the byte representation of e followed by a space.
Now, with the statistics of pairs, we can merge the most frequent pair into a new vocabulary entry:
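A sketch of the merge step, again following Karpathy's code; the printed lengths depend on the text used, so the numbers quoted below come from the original post's longer sample.

```python
def merge(ids, pair, idx):
    """In the list of ids, replace every consecutive occurrence of pair with the new token idx."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)   # found the pair: emit the new token and skip both elements
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

top_pair = max(stats, key=stats.get)    # most frequent adjacent pair, e.g. (101, 32)
tokens2 = merge(tokens, top_pair, 256)  # 256 is the first id not used by raw bytes
print(len(tokens), "->", len(tokens2))  # e.g. 616 -> 596 on the original post's sample text
```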
Here, we merged the most frequent pair, (101, 32), into a new vocabulary entry, 256. The id 256 is chosen because UTF-8 bytes range from 0 to 255. Notice how the length diminished from 616 to 596, as the 20 occurrences of the pair have been merged. Such merges can of course be done repeatedly on a larger textual corpus:
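A sketch of the repeated-merge training loop, following Karpathy's code; here tokens would be the byte encoding of a larger corpus, so the quoted lengths are from the original post.

```python
vocab_size = 276                # target vocabulary size
num_merges = vocab_size - 256   # 20 merges on top of the 256 raw bytes

ids = list(tokens)              # copy so we don't destroy the original byte list
merges = {}                     # (int, int) -> int, records each learned merge
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most frequent pair at this step
    idx = 256 + i                     # id of the new vocabulary entry
    print(f"merging {pair} into a new token {idx}")
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print("length:", len(tokens), "->", len(ids))  # e.g. 24291 -> 19241 in the original post
```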
Here, we have repeatedly merged adjacent byte pairs and increased our vocabulary size from 256 to 276, reducing the length of the byte-level data from 24291 to 19241 in the process. Such a reduction helps fit data into a transformer's context window and yields better empirical results than more granular byte arrays. In practice, the vocabulary size can reach ~50k (GPT-2 tokenizer) or even ~100k (GPT-4 tokenizer).
With the merges done, we are now able to encode the original byte data using the new vocabulary. The code below illustrates the encoding process:
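A sketch of the encoder, following Karpathy's code and reusing get_stats, merge, and the merges dictionary from above.

```python
def encode(text):
    """Encode a string into token ids using the learned merges."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        stats = get_stats(tokens)
        # pick the pair that was learned earliest during training
        pair = min(stats, key=lambda p: merges.get(p, float('inf')))
        if pair not in merges:
            break  # no pair in the data can be merged any further
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)
    return tokens
```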
There are several things to notice here. First, while len(tokens) >= 2 ensures that we have at least 2 tokens left, as otherwise there is nothing to merge. Second, pair = min(stats, key=lambda p: merges.get(p, float('inf'))) picks, among the pairs present in our data (stats), the one that was learned earliest in the merges dictionary (trained separately on a corpus of text), i.e., the pair with the lowest merge index. If a pair does not exist in merges, we assign infinity to it so it will not be picked as long as some other pair does exist in merges. If none of the pairs in our encoded data are in merges, all pairs have infinity as their value, min returns the first such pair, and we exit the loop. Because merges are applied in the same order in which they were learned, any hierarchy within the data pairs preserves its structure. For example, if ab is merged into Z, and Zc is merged into Z', then abc will first be encoded into Zc and then into Z', just as in merges.
In this example, hello world! is encoded to [104, 101, 108, 108, 111, 32, 119, 266, 108, 100, 33] of length 11, while the raw UTF-8 encoding of hello world! is [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33] of length 12. This is because our merges dictionary has combined 111 (o) and 114 (r) into 266 (or). In non-toy scenarios, the final encoding is even shorter: using the GPT-2 tokenizer from the tiktoken library, hello world! is encoded to [31393 (hello), 995 ( world), 0 (!)].
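For reference, the tiktoken comparison can be reproduced with a couple of lines (assuming the tiktoken library is installed):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("hello world!"))            # three GPT-2 tokens: 'hello', ' world', '!'
print(list("hello world!".encode("utf-8")))  # twelve raw UTF-8 bytes
```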
Now, we are ready to tackle the decoding process in the following code:
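A sketch of the decoder, following Karpathy's code; vocab is rebuilt here from the merges dictionary learned above.

```python
# byte representation of every token id: raw bytes first, then merged pairs in insertion order
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    """Decode a list of token ids back into a string."""
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode("utf-8", errors='replace')
    return text
```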
Here, the ids input is the encoded data (after merges). The line tokens = b"".join(vocab[idx] for idx in ids) gives us the concatenated byte representations of ids, where the byte representation of each new vocabulary entry created by merges is simply the concatenation of the bytes of its original pair. For example, the byte representation of or is simply the byte representation of o concatenated with that of r. It should be noted that the code above assumes the iterator of merges follows insertion order because, as with encoding, there is a hierarchy within the merges. Suppose ord appears somewhere in merges with the key pair (or, d); we need the byte representation of or to be established in vocab before we can process ord. Insertion order ensures this is the case. Another detail of interest is the errors='replace' error handling. The merging process can produce token sequences whose bytes do not form valid UTF-8; errors='replace' substitutes the replacement character � for such bytes, making decoding more robust.
We can test our BPE with the following line and it should pass the assertion:
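Assuming text is the same string that was encoded earlier, the round trip should be lossless:

```python
assert decode(encode(text)) == text
```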
GPT-2 Tokenizer
In practice, GPT-2 and GPT-4 tokenizers are just the above technique applied on a larger scale. However, there are two important additions:
- Use of a regular expression to first compartmentalize the input text before performing encoding. The following code from the GPT-2 encoder showcases this:
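The pattern below is taken from OpenAI's gpt-2 encoder.py; the sample string and surrounding code are my own illustration.

```python
import regex as re  # the `regex` module is needed for \p{L} and \p{N}

# splitting pattern used by the GPT-2 encoder
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(pat.findall("Hello've world123 how's are you!!!?"))
# ['Hello', "'ve", ' world', '123', ' how', "'s", ' are', ' you', '!!!?']
```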
Here, the regular expression pattern r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""" is used to split a given text string into chunks. Common contractions such as 's and 't are treated as individual chunks. Unicode letter runs with an optional leading space ( ?\p{L}+), Unicode number runs with an optional leading space ( ?\p{N}+), and runs of non-letter, non-number characters such as punctuation ( ?[^\s\p{L}\p{N}]+) are also split out, along with other patterns. Once compartmentalized, each chunk can be UTF-8 encoded using token.encode('utf-8').
The reason this is done is that, without such compartmentalization, the BPE tokenizer might "group many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity"1. With compartmentalization, a string like dog! is first split into dog and ! before BPE encoding, thus avoiding this issue.
- Use of an additional layer, byte_encoder, on top of UTF-8 before performing encoding. The following code showcases this:
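The mapping function below is reproduced from OpenAI's gpt-2 encoder.py; the usage lines at the end are my own illustration.

```python
def bytes_to_unicode():
    """Static mapping from every byte 0-255 to a printable, non-whitespace Unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:  # bytes that are whitespace/control characters get shifted above 255
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    cs = [chr(c) for c in cs]
    return dict(zip(bs, cs))

byte_encoder = bytes_to_unicode()
print(byte_encoder[32])  # 'Ġ': the space byte, which is why Ġ shows up all over vocab.bpe
# inside the GPT-2 encoder, each regex chunk is passed through this mapping before BPE:
#   token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
#   bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
```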
This is a somewhat superfluous step that is only necessary because the GPT-2 encoder performs BPE on Unicode characters instead of directly on bytes. The bytes_to_unicode() function gives a static mapping from each byte 0-255 to a unique Unicode character that is not a whitespace or control character, and the self.bpe() function then does its merging on the resulting characters rather than on raw bytes. It is an artificially confusing step that does not add anything to the algorithm.
Another thing to note about the GPT-2 tokenizer is the encoder and bpe_merges data loaded during __init__. They play the same roles as the vocab and merges dictionaries in our code above. In fact, when we call tiktoken.get_encoding('gpt2'), requests are made internally for the following files: https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json and https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe. Taking a peek, we can see that encoder.json looks like the following:
{"!": 0, "\"": 1, "#": 2, "$": 3, "%": 4, "&": 5, ..., "\u0120informants": 50254, "\u0120gazed": 50255, "<|endoftext|>": 50256}
and vocab.bpe
looks like the following:
#version: 0.2
Ġ t
Ġ a
h e
i n
r e
o n
Ġt he
e r
...
Com par
Ġampl ification
om inated
Ġreg ress
ĠColl ider
Ġinform ants
Ġg azed
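To take this peek yourself, something like the following works with only the standard library (the URL is the one listed above):

```python
import json
from urllib.request import urlopen

url = "https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json"
encoder = json.load(urlopen(url))
print(len(encoder))              # 50257 entries: 256 bytes + 50000 merges + 1 special token
print(encoder["<|endoftext|>"])  # 50256
```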
Notice the special token <|endoftext|>: 50256, which is used to denote the end of a document during training. Special tokens are not the result of BPE merges; they are artificial tokens that we can add to the tokenizer. As a result, they are also handled specially, before BPE encoding is applied.
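For example, with tiktoken a special token must be explicitly allowed before it is encoded as a single id:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
# without allowed_special, tiktoken raises an error when the special token appears in the text
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]
```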