Let's build the GPT Tokenizer
Andrej Karpathy · 2 minute read
Tokenization is fundamental to large language models: it shapes everything from ordinary language processing to simple arithmetic, and the choice of tokenization method significantly influences a model's efficiency and effectiveness. Unicode code points, UTF-8 encoding, and Byte Pair Encoding are the key concepts, along with choosing a vocabulary size that balances computational cost against model performance.
Insights
- Tokenization is a fundamental process in language models, converting text into tokens for model input.
- GPT-family models such as GPT-2 use byte-level Byte Pair Encoding for more efficient tokenization.
- Unicode code points, with over 150,000 characters across 161 scripts, are essential for handling different languages and special characters (see the sketch after this list).
- The Byte Pair Encoding algorithm compresses byte sequences by repeatedly merging the most common adjacent pair into a new token, expanding the vocabulary while shortening the sequence.
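To make the byte-level starting point concrete, here is a minimal Python sketch (the sample string is chosen here for illustration, not taken from the video) showing how characters map to Unicode code points, how UTF-8 turns them into bytes, and how counting adjacent byte pairs is the first step of BPE:

```python
# Minimal sketch of the raw material a byte-level tokenizer starts from.
from collections import Counter

text = "héllo 👋"

# Unicode code points: one integer per character.
print([ord(ch) for ch in text])        # e.g. 233 for 'é', 128075 for the emoji

# UTF-8 encoding: each code point becomes 1-4 bytes.
raw_bytes = list(text.encode("utf-8"))
print(raw_bytes)                       # longer than the code-point list

# BPE's first step: count adjacent byte pairs to find the most common one.
print(Counter(zip(raw_bytes, raw_bytes[1:])).most_common(1))
```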
Recent questions
What is tokenization in language models?
Tokenization is the process of converting text into sequences of tokens, which are fundamental units used in language models for processing and analysis. It involves breaking down text into smaller components to facilitate language understanding and modeling.
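As a concrete illustration, the sketch below uses the tiktoken library (an assumption; the summary does not name a specific tokenizer implementation) to round-trip a sentence through GPT-2's tokenizer:

```python
# Minimal sketch of the text -> tokens -> text round trip, assuming tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Tokenization is fundamental in large language models.")
print(tokens)                              # a list of integer token ids
print([enc.decode([t]) for t in tokens])   # the string chunk behind each id
print(enc.decode(tokens))                  # decoding recovers the original text
```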
How does tokenization impact language models?
Tokenization significantly affects the efficiency and effectiveness of language models. Different tokenization methods can influence various functionalities like arithmetic tasks, language processing, and handling special characters. The design of tokenizers, such as the GPT-4 tokenizer, plays a crucial role in enhancing the performance of language models for specific languages and tasks.
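One way to see this effect is to compare token counts for the same kind of content across languages and across the GPT-2 and GPT-4 tokenizers (again assuming the tiktoken library; the sample sentences are illustrative):

```python
# Rough comparison of token counts across tokenizers and languages.
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")            # GPT-2's BPE tokenizer
cl100k = tiktoken.get_encoding("cl100k_base")   # the GPT-4 tokenizer

samples = {
    "English": "Hello, how are you doing today?",
    "Korean": "안녕하세요, 오늘 어떻게 지내세요?",
}
for language, text in samples.items():
    print(language, "gpt2:", len(gpt2.encode(text)),
          "cl100k_base:", len(cl100k.encode(text)))
```

Fewer tokens for the same content means a longer effective context window and cheaper processing, which is one reason tokenizer design matters so much for non-English text.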
What are some challenges with tokenization in language models?
Tokenization issues can lead to challenges in large language models, affecting tasks like spelling, non-English language processing, and simple arithmetic. Unstable tokens, mismatches between the data used to train the tokenizer and the data used to train the model, and unpredictable behaviour when tokens are split at unusual boundaries can all hurt the performance and reliability of language models.
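A small sketch of why spelling and arithmetic are awkward for token-based models: the model only sees token chunks, whose boundaries do not line up with individual letters or digits (the choice of tokenizer here is an assumption):

```python
# Sketch, assuming the tiktoken library, of opaque token boundaries.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

word = "tokenization"
print([enc.decode([t]) for t in enc.encode(word)])     # chunks, not letters

number = "1234567890"
print([enc.decode([t]) for t in enc.encode(number)])   # digits grouped arbitrarily

# The same word with and without a leading space is a different token sequence,
# one source of unpredictable behaviour around token boundaries.
print(enc.encode("egg"), enc.encode(" egg"))
```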
How does Byte Pair Encoding improve tokenization?
Byte Pair Encoding (BPE) is a tokenization algorithm that compresses byte sequences by identifying and replacing frequently occurring pairs of tokens. By iteratively creating new tokens and expanding the vocabulary, BPE reduces the overall sequence length, enhancing the efficiency of tokenization in language models like GPT-2.
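The core training loop can be sketched in a few lines of Python. This is a simplified illustration on a toy string, not the exact GPT-2 implementation (which also pre-splits text with a regex pattern and handles special tokens):

```python
# Simplified BPE training sketch: repeatedly merge the most common pair.
from collections import Counter

def most_common_pair(ids):
    """Return the most frequent adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                  # tiny toy corpus
ids = list(text.encode("utf-8"))      # start from raw bytes (vocab of 256)
merges = {}
for k in range(3):                    # number of merges sets the final vocab size
    pair = most_common_pair(ids)
    new_id = 256 + k                  # each merge adds one token beyond the bytes
    merges[pair] = new_id
    ids = merge(ids, pair, new_id)

print(len(text.encode("utf-8")), "->", len(ids))  # the sequence gets shorter
print(merges)                                     # the learned merge rules
```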
Why is understanding tokenization crucial for language models?
Understanding tokenization is crucial for language models due to its significant impact on model performance, efficiency, and functionality. Tokenization influences various aspects of language processing, including handling special characters, supporting different languages, and optimizing arithmetic tasks. By grasping the complexities of tokenization, researchers and developers can improve the design and implementation of language models for diverse applications.