Let's build GPT: from scratch, in code, spelled out.

Andrej Karpathy · 2 min read

ChatGPT is an AI system that can handle a wide range of text-based tasks, from writing haikus to explaining HTML, showcasing its versatility and probabilistic nature. It is based on the Transformer architecture introduced in 2017; the video walks through building such a model from scratch, covering tokenization, batch processing, and the self-attention mechanism.

Insights

  • ChatGPT is an AI system that allows text-based interaction for tasks like writing haikus about AI's importance, showcasing creativity and engagement.
  • Based on the Transformer architecture, ChatGPT demonstrates versatility by explaining HTML to a dog and writing release notes for Chess 2, highlighting its broad applicability.
  • Training a Transformer model involves tokenization, chunk-based data processing, batch dimension optimization, and context length variation, essential for efficient and effective model training.

Recent questions

  • What is ChatGPT?

    ChatGPT is an AI system that allows text-based interaction through tasks like writing haikus about AI's importance for prosperity. It generates responses based on prompts, showcasing its probabilistic nature and versatility for tasks like explaining HTML or writing release notes.

  • How does Tokenization work?

    Tokenization converts raw text into a sequence of integers, using either character-level encoding (as in the video) or sub-word encodings like SentencePiece. For the Tiny Shakespeare data, the tokenized text is then split into training and validation sets to monitor overfitting.

  • What is the Transformer architecture?

    The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," revolutionized AI applications. It is the basis for systems like ChatGPT and for nanoGPT, which trains a small Transformer to generate text character by character.

  • Why is Positional Embedding important?

    Positional embeddings encode each token's position, complementing the token embeddings that encode identity; together they form the input to the self-attention blocks. Because the position table only covers block-size positions, the context must be cropped to the last block-size tokens during generation.

  • How does Layer Norm optimize neural networks?

    Layer Norm normalizes each row of the input (each example) to zero mean and unit standard deviation. Unlike BatchNorm it needs no running buffers, and in the Transformer it is applied before each transformation (pre-norm), leading to improved performance.

Summary

00:00

AI Systems: ChatGPT and nanoGPT

  • ChatGPT is an AI system that allows interaction through text-based tasks, such as writing haikus about the importance of understanding AI for prosperity.
  • ChatGPT generates responses sequentially based on prompts, showcasing its probabilistic nature and ability to provide varied outcomes.
  • ChatGPT has been used for diverse tasks like explaining HTML to a dog or writing release notes for Chess 2, showcasing its versatility.
  • The system is based on the Transformer architecture, introduced in a 2017 paper titled "Attention is All You Need," which revolutionized AI applications.
  • nanoGPT is a repository for training Transformer-based language models like GPT-2, requiring only Python and a basic knowledge of calculus and statistics.
  • The repository allows training on text datasets like Tiny Shakespeare, enabling the generation of Shakespeare-like text character by character.
  • Tokenization involves converting raw text to sequences of integers, with options like character-level encoding or sub-word encodings like SentencePiece.
  • The training data from Tiny Shakespeare is tokenized and split into training and validation sets to prevent overfitting (see the sketch after this list).
  • Transformers are trained on chunks of data rather than the entire dataset, with random sampling of sequences for efficient training.
  • The process involves feeding chunks of text sequences into the Transformer model for learning patterns and generating text akin to the input dataset.
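
A minimal sketch of the character-level tokenizer and train/validation split described above (the file path `input.txt` is an assumption; any plain-text corpus works):

```python
import torch

# Read the Tiny Shakespeare corpus (path is an assumption; any plain-text file works).
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Character-level tokenizer: every unique character becomes one integer token.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> string
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

# Encode the whole corpus and hold out the last 10% as a validation set
# to monitor overfitting.
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
```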

14:57

"Training Transformer with Chunked Data and Batches"

  • Training involves breaking data into chunks with a maximum length, often referred to as block size.
  • Each chunk contains multiple examples for training a Transformer network.
  • Code illustrates how inputs and targets are structured for training the Transformer: the targets are the inputs shifted by one position (see the sketch after this list).
  • Training on various context lengths up to the block size is crucial for the Transformer's effectiveness.
  • Batch dimension is introduced for efficiency in processing multiple chunks simultaneously.
  • Code demonstrates how batches of text chunks are sampled and processed independently.
  • The batch of input data is fed into a neural network, starting with a bigram language model.
  • The bigram language model predicts the next character based on individual token identities.
  • Cross-entropy loss is used to evaluate the quality of predictions in the bigram language model.
  • Generation function extends the input data to predict and generate new tokens based on the model's output.
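
A sketch of the chunked batching and the bigram baseline described above, assuming the `train_data` tensor from the previous sketch; each chunk of `block_size` tokens yields one training example at every context length, since targets are just inputs shifted by one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
batch_size = 4   # how many independent chunks are processed in parallel
block_size = 8   # maximum context length of a chunk

def get_batch(data):
    # Sample random starting offsets, then stack the chunks into a (B, T) batch.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets: inputs shifted by one
    return x, y

class BigramLanguageModel(nn.Module):
    """Predicts the next character from the current token's identity alone."""

    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads off the logits for the next token from a lookup table.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)   # (B, T, vocab_size)
        if targets is None:                        # targets are optional to allow generation
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

xb, yb = get_batch(train_data)               # assumes train_data from the previous sketch
model = BigramLanguageModel(vocab_size=65)   # 65 unique characters in Tiny Shakespeare
logits, loss = model(xb, yb)
```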

30:26

Optimizing training with the Adam optimizer and GPU

  • To convert logits to probabilities, use softmax after focusing on the last step in the time dimension.
  • Use torch.multinomial to sample one prediction for each batch row, resulting in a (B, 1) array (see the sketch after this list).
  • If targets are not provided, set them as optional (none by default) to avoid errors in loss creation.
  • Train the model using the Adam optimizer, with a recommended learning rate of 3E-4 for small networks.
  • Increase batch size to 32 for better optimization results.
  • During training, evaluate loss over multiple batches to reduce noise and get accurate loss estimates.
  • Ensure the model runs on GPU for faster computations by moving data and model parameters to the device.
  • Use the torch.no_grad context manager (or decorator) during loss evaluation so gradients are not stored, improving memory efficiency.
  • Convert the code to a script for simplicity, including hyperparameters, data loading, model creation, and training loop.
  • Prepare for the Transformer model by introducing self-attention blocks and mathematical tricks for efficient token communication.
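
A sketch of the evaluation and generation steps described above, assuming the `model` and `get_batch` from the previous sketch; `eval_iters` is an illustrative hyperparameter:

```python
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)                                    # model from the previous sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # Adam-family optimizer

@torch.no_grad()           # no gradients are needed for evaluation: saves memory
def estimate_loss(data, eval_iters=200):
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):   # average over many batches to reduce noise in the estimate
        xb, yb = get_batch(data)
        _, loss = model(xb.to(device), yb.to(device))
        losses[k] = loss.item()
    model.train()
    return losses.mean()

def generate(idx, max_new_tokens):
    # idx is a (B, T) tensor of token indices holding the current context.
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                   # (B, T, vocab_size)
        logits = logits[:, -1, :]                # focus on the last step in the time dimension
        probs = F.softmax(logits, dim=-1)        # convert logits to probabilities
        idx_next = torch.multinomial(probs, num_samples=1)   # one sample per batch row -> (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)  # append to the running sequence
    return idx
```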

45:47

Efficient Matrix Aggregation for Token Embeddings

  • xprev has shape (number of past elements, C), a two-dimensional slice containing the embeddings of the previous tokens.
  • The average over the zeroth (time) dimension is computed to produce a one-dimensional vector stored in xbow (x "bag of words").
  • Matrix multiplication is introduced as a more efficient method for aggregation, exemplified with a toy example using matrices A, B, and C.
  • The lower triangular portion of a matrix can be extracted using torch.tril, enhancing efficiency.
  • Manipulating matrix elements allows for weighted averages, demonstrated by normalizing rows of matrix A and multiplying with matrix B.
  • Vectorization is employed to efficiently aggregate past elements using weighted sums specified in a T by T array.
  • Batched matrix multiplication is utilized to aggregate past elements efficiently, resulting in identical tensors xbow and xbow2.
  • The combination of masked_fill, softmax, and normalization produces the same weighted averages for aggregation (see the sketch after this list).
  • Affinities between tokens are data-dependent, influencing the aggregation process through matrix multiplication.
  • Positional embeddings are introduced to encode not only token identities but also their positions, preparing for the self-attention block's development.
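
A sketch of the three equivalent ways to average over past tokens discussed above, using toy shapes (B=4, T=8, C=2) as in the video; the masked-softmax version is the form that carries over into self-attention:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loop, averaging x[b, :t+1] for every position t.
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xprev = x[b, :t + 1]             # (t+1, C): everything up to and including t
        xbow[b, t] = xprev.mean(dim=0)   # average over the time dimension

# Version 2: the same average via matrix multiplication with a normalized
# lower-triangular weight matrix.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)   # rows sum to 1 -> weighted average
xbow2 = wei @ x                            # (T, T) @ (B, T, C) -> (B, T, C)

# Version 3: build the same weights with masked_fill + softmax, the form
# self-attention will use (affinities start at zero, hence a uniform average).
tril = torch.tril(torch.ones(T, T))
wei3 = torch.zeros(T, T)
wei3 = wei3.masked_fill(tril == 0, float('-inf'))  # future tokens cannot be attended to
wei3 = F.softmax(wei3, dim=-1)
xbow3 = wei3 @ x

assert torch.allclose(xbow, xbow2, atol=1e-6) and torch.allclose(xbow, xbow3, atol=1e-6)
```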

01:02:04

"Self-Attention: Data-Dependent Information Aggregation"

  • The video focuses on implementing a small self-attention for a single individual head.
  • The code example involves changing the number of channels from 2 to 32, resulting in a 4x8 arrangement of tokens, each with 32 dimensions.
  • The current code performs a simple average of past and current tokens by creating a lower triangular structure to mask out and normalize the weight matrix.
  • Initially initializing affinities between tokens to zero leads to a uniform structure in the weight matrix, resulting in a simple average.
  • Self-attention addresses the need for data-dependent information gathering from past tokens by emitting query and key vectors for each token.
  • Affinities between tokens are calculated by taking the dot product of queries and keys, allowing for data-dependent interactions (see the sketch after this list).
  • The weighted aggregation process involves multiplying the values with the weights, resulting in a data-dependent aggregation of information.
  • The upper triangular masking is applied to prevent certain nodes from communicating, followed by exponentiation and normalization to create a distribution for aggregation.
  • In a single self-attention head, values are produced by propagating a linear module on top of the input, which are then aggregated instead of the raw tokens.
  • Attention is described as a communication mechanism in a directed graph where nodes aggregate information via a weighted sum in a data-dependent manner.
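
A sketch of the single self-attention head described above, using the video's toy shapes (B=4, T=8, C=32) and a head size of 16; the 1/sqrt(head_size) scaling is introduced in the next section and is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32
head_size = 16
x = torch.randn(B, T, C)

# Keys, queries, and values are all linear projections of the same input.
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)    # (B, T, head_size): what each token contains
q = query(x)  # (B, T, head_size): what each token is looking for

# Affinities: dot products between queries and keys -> data-dependent weights.
wei = q @ k.transpose(-2, -1)    # (B, T, T)

# Decoder-style masking: a token may only attend to itself and earlier positions.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# Aggregate the values (not the raw tokens) with the data-dependent weights.
v = value(x)
out = wei @ v    # (B, T, head_size)
```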

01:17:44

"Softmax, Self-Attention, and Multi-Head Implementation"

  • Multiplying by one over the square root of the head size preserves the variance of the weights, crucial for the diffuse distribution needed for softmax.
  • Softmax convergence towards one-hot vectors is prevented by ensuring weights are not too extreme at initialization.
  • Softmax sharpens towards the maximum value if weights are too extreme, leading to undesired peakiness.
  • The head module is created to implement a single head of self-attention, with key, query, and value linear layers.
  • A lower triangular matrix is registered as a buffer (not a parameter) for the causal mask in the self-attention implementation.
  • The self-attention mechanism involves calculating keys, queries, and attention scores, followed by normalization and aggregation.
  • The self-attention head is integrated into a language model for processing token and position embeddings.
  • The positional embedding table only covers block size positions, so the context must be cropped to the last block size tokens during generation.
  • Lowering the learning rate and increasing iterations improve the validation loss in training the network.
  • Implementing multi-head attention involves running multiple self-attention heads in parallel and concatenating their outputs (see the sketch after this list).
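
A sketch of the `Head` module and multi-head attention described above; hyperparameters are passed explicitly here to keep the snippet self-contained (the video's script defines them as globals), and the scaling by head_size**-0.5 keeps the pre-softmax weights diffuse at initialization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # The causal mask is not a parameter, so register it as a buffer.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)      # (B, T, head_size)
        q = self.query(x)    # (B, T, head_size)
        # Scaled attention scores, masked so each position only sees the past.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)
        return wei @ v       # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads of self-attention run in parallel, outputs concatenated."""

    def __init__(self, n_embd, num_heads, head_size, block_size):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, num_heads * head_size)
```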

01:32:52

Optimizing deep neural networks with Layer Norm

  • Layer Norm is a helpful innovation for optimizing deep neural networks, similar to BatchNorm.
  • BatchNorm normalizes individual columns of the input to zero mean and unit standard deviation; Layer Norm instead normalizes the rows.
  • Starting from a BatchNorm implementation, Layer Norm amounts to changing the normalization from columns to rows.
  • Layer Norm therefore normalizes each individual example, e.g. each 100-dimensional row vector, independently (see the sketch after this list).
  • The Layer Norm operation eliminates the need for maintaining running buffers.
  • The Layer Norm is applied before the transformation in the Transformer, deviating slightly from the original paper.
  • Each Transformer block uses two Layer Norms, one before self-attention and one before the feed-forward layer, with a final Layer Norm added at the end of the network; all normalize over the embedding dimension.
  • The addition of Layer Norms leads to a slight improvement in performance in the Transformer.
  • Scaling up the model involves adjusting hyperparameters like batch size, block size, learning rate, embedding dimension, number of heads, and dropout rate.
  • The validation loss significantly improves after scaling up the neural net in the Transformer.
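
A hand-rolled sketch of the row-wise normalization Layer Norm performs (in the model itself one simply uses `nn.LayerNorm(n_embd)`, applied before the self-attention and feed-forward transformations):

```python
import torch

class LayerNorm1d:
    """Minimal Layer Norm: normalize each row (each example) to zero mean and unit std.
    Unlike BatchNorm there are no running buffers to maintain."""

    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)   # learnable scale
        self.beta = torch.zeros(dim)   # learnable shift

    def __call__(self, x):
        xmean = x.mean(dim=1, keepdim=True)  # per-row mean (BatchNorm would use dim=0)
        xvar = x.var(dim=1, keepdim=True)    # per-row variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        return self.gamma * xhat + self.beta

x = torch.randn(32, 100)              # 32 examples, each a 100-dimensional vector
out = LayerNorm1d(100)(x)
print(out[0].mean(), out[0].std())    # roughly 0 and 1 for each row
```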

01:48:23

"Training and Fine-Tuning GPT Models"

  • The GPT model consists of token embeddings, position embeddings, a stack of blocks, a final Layer Norm, and a final linear layer; its parameters are separated into weight-decayed and non-weight-decayed groups for the optimizer (see the sketch after this list).
  • Training a ChatGPT-style model involves two stages: pre-training a Transformer language model on a large chunk of the internet, then fine-tuning to align it as an assistant; by contrast, the model trained in the video has only around 10 million parameters.
  • The largest Transformer model in the GPT-3 paper has 175 billion parameters and was trained on 300 billion tokens, a vast increase in scale over the model trained here.
  • Fine-tuning a model like ChatGPT involves collecting specific training data resembling question-answer formats, ranking responses, training a reward model, and using reinforcement learning to optimize response generation.
  • Further stages beyond language modeling, like sentiment detection or task performance, require additional fine-tuning steps such as supervised fine-tuning or more complex reward-based alignment processes like in ChatGPT.
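
An illustrative sketch of the weight-decay split mentioned in the first bullet, following the common nanoGPT pattern of decaying only matrix-shaped parameters (linear weights, embedding tables) while biases and Layer Norm gains are left undecayed; the exact grouping in the video's code may differ in detail:

```python
import torch

def configure_optimizer(model, weight_decay=0.1, learning_rate=3e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    # 2D+ tensors (linear weights, embedding tables) get weight decay;
    # 1D tensors (biases, LayerNorm gain/shift) do not.
    decay_params = [p for p in params if p.dim() >= 2]
    nodecay_params = [p for p in params if p.dim() < 2]
    optim_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0},
    ]
    return torch.optim.AdamW(optim_groups, lr=learning_rate)
```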