High-Level Overview
Vocabulary
I will write $\mathcal{V}$ for the vocabulary: a set of tokens. $\mathcal{V}$ could contain the alphabet, the digits 0-9, and the remaining symbols on your keyboard. It could additionally contain all the words in the Oxford dictionary, as well as all the suffixes and prefixes such as "ing" and "un".
In practice, the number of tokens in the vocabulary is around 50,000 (GPT-2, for example, uses 50,257). In code I will write n_vocab for the size of the vocabulary; in math formulas I will use the symbol $n_{\text{vocab}}$.
Token Indices
In reality the LLM does not work directly with the tokens, since these are typically strings and we want to work with numbers. Instead, we sort the vocabulary into a long sequence of tokens (the order is not important) and identify each token with its position in that sequence. Every token therefore corresponds to an integer in $\{0, 1, \ldots, n_{\text{vocab}} - 1\}$, called its token index. The inputs to the LLM are token indices.
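As a toy illustration, here is how a vocabulary can be turned into a token-index mapping in Python; the tokens below are made up for the example, not the ones a real tokenizer would use:

```python
# Toy vocabulary: in practice it would contain tens of thousands of tokens.
vocab = ["a", "b", "un", "ing", "hello", "world", " "]

# Assign each token its position in the (arbitrarily ordered) vocabulary.
token_to_index = {token: i for i, token in enumerate(vocab)}
index_to_token = {i: token for token, i in token_to_index.items()}

print(token_to_index["hello"])   # 4
print(index_to_token[4])         # "hello"
```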
Embeddings
I hope you agree that it is a bit awkward to work with integers. Internally, the LLM will associate to each of these indices a vector. These vectors are known as the embeddings of the tokens, and they have length $n_{\text{embd}}$ (n_embd in code). This happens internally in the LLM; remember, the inputs to the LLM are the token indices.
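A minimal sketch of this lookup, assuming PyTorch; the sizes below are illustrative choices of mine, not fixed by the text:

```python
import torch
import torch.nn as nn

n_vocab, n_embd = 50257, 768          # illustrative sizes

# One learnable vector of length n_embd per token index.
embedding = nn.Embedding(n_vocab, n_embd)

token_indices = torch.tensor([464, 3290, 318])   # arbitrary example indices
vectors = embedding(token_indices)               # shape: (3, n_embd)
print(vectors.shape)                             # torch.Size([3, 768])
```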
Large Language Model
A Large Language Model (LLM) is a function that maps a sequence of token indices to an un-normalized log-probability vector over the set of token indices, representing the un-normalized log-probability of the next token index in the sequence.
The LLM accepts sequences of token indices of length up to $n_{\text{ctx}}$, the context length (n_ctx in code). An LLM is a parametrized function $f_\theta$, where $\theta$ denotes the parameters (the weights) of the model.
Notice that we typically call the output of the LLM the logits. It is very easy to recover probabilities from them: one just needs to feed them through the softmax function, $\operatorname{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$.
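For instance, assuming PyTorch, turning a logit vector into a probability distribution over the vocabulary looks like this (the logits here are random placeholders, not the output of a real model):

```python
import torch

n_vocab = 50257                          # illustrative vocabulary size
logits = torch.randn(n_vocab)            # placeholder output of the LLM

probs = torch.softmax(logits, dim=-1)    # exponentiate and normalize
print(probs.sum())                       # ~1.0: a valid probability distribution
```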
Training Data
The training data for an LLM is typically a huge amount of text scraped from the internet. For simplicity, imagine that you have a very large .txt file, which you read into memory as a very large string. Before performing any training, you need to transform this large string into numbers. The string is then tokenized: it is converted into a sequence of token indices.
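As a sketch, assuming the tiktoken library and its GPT-2 encoding (this particular tokenizer, and the file name data.txt, are my choices for the example, not something the text prescribes):

```python
import tiktoken

# Read the entire training corpus as one big string.
with open("data.txt", "r", encoding="utf-8") as f:   # placeholder path
    text = f.read()

enc = tiktoken.get_encoding("gpt2")
token_indices = enc.encode(text)      # list of integers in [0, n_vocab)

print(len(text), "characters ->", len(token_indices), "token indices")
```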
Training Dynamics
The parameters $\theta$ of the LLM are learned from the training data described above.
Training happens in batches. The researcher will choose a batch size $B$ and then, for each element of the batch:
- Sample a position uniformly at random from the tokenized training set.
- Construct two sequences of length $n_{\text{ctx}}$ starting at that position: $x = (x_1, \ldots, x_{n_{\text{ctx}}})$ and $y = (x_2, \ldots, x_{n_{\text{ctx}}+1})$.
The second sequence is shifted by one with respect to the first. This is because we train the LLM using self-supervised learning, which in this case means that we want the LLM to learn to predict the next token index given a previous sequence of token indices of length at most $n_{\text{ctx}}$. These two sequences together contain $n_{\text{ctx}}$ training examples: for $t = 1, \ldots, n_{\text{ctx}}$, $y_t$ corresponds to the next token index that appears after $(x_1, \ldots, x_t)$ in the training data. I will generally call $x$ the input and $y$ the target. When using batching, we repeat this process $B$ times.
In practice, the input and the target of the LLM are each a matrix of shape $(B, n_{\text{ctx}})$.
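A minimal sketch of this batching step, assuming PyTorch and that the tokenized corpus is already a 1-D tensor of token indices; the names get_batch, B, and n_ctx are mine, not fixed by the text:

```python
import torch

n_ctx, B = 8, 4                              # illustrative context length and batch size
data = torch.randint(0, 50257, (10_000,))    # placeholder for the tokenized corpus

def get_batch(data, B, n_ctx):
    # Sample B starting positions, leaving room for the shifted target.
    starts = torch.randint(0, len(data) - n_ctx - 1, (B,))
    x = torch.stack([data[s : s + n_ctx] for s in starts])          # inputs,  shape (B, n_ctx)
    y = torch.stack([data[s + 1 : s + 1 + n_ctx] for s in starts])  # targets, shape (B, n_ctx)
    return x, y

x, y = get_batch(data, B, n_ctx)
print(x.shape, y.shape)   # torch.Size([4, 8]) torch.Size([4, 8])
```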
Loss Function
How do we learn $\theta$?
- Feed $x$ through $f_\theta$ and obtain a vector of logits of size $n_{\text{vocab}}$ for each training example in the batch and for each time-step from $1$ to $n_{\text{ctx}}$. That is, we have $B \cdot n_{\text{ctx}}$ un-normalized log-probability vectors, which we can stack into a single tensor of shape $(B, n_{\text{ctx}}, n_{\text{vocab}})$.
- Compute the cross-entropy loss (F.cross_entropy in PyTorch):
$$\mathcal{L}(\theta) = -\frac{1}{B \, n_{\text{ctx}}} \sum_{b=1}^{B} \sum_{t=1}^{n_{\text{ctx}}} \log p_{b,t}[y_{b,t}], \qquad p_{b,t} = \operatorname{softmax}\big(f_\theta(x_b)_t\big),$$
where $p_{b,t}[y_{b,t}]$ is the $y_{b,t}$-entry of the vector $p_{b,t}$. That is, we compute the average negative log-probability of the correct token index ($y_{b,t}$) under the model.
We then back-propagate the gradient of this loss to update $\theta$.
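Putting this together, a minimal sketch of one training step, assuming PyTorch and reusing get_batch from the batching sketch above; the stand-in model here (an embedding followed by a linear layer) and the optimizer choice are only there to make the example run, a real LLM is far more elaborate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_vocab, n_embd, n_ctx, B = 50257, 64, 8, 4        # illustrative sizes

# Stand-in model: anything that maps (B, n_ctx) token indices
# to (B, n_ctx, n_vocab) logits fits this recipe.
model = nn.Sequential(nn.Embedding(n_vocab, n_embd), nn.Linear(n_embd, n_vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

data = torch.randint(0, n_vocab, (10_000,))        # placeholder tokenized corpus
x, y = get_batch(data, B, n_ctx)                   # from the batching sketch above

logits = model(x)                                  # shape (B, n_ctx, n_vocab)

# F.cross_entropy expects (N, C) logits and (N,) integer targets,
# so we flatten the batch and time dimensions together.
loss = F.cross_entropy(logits.view(-1, n_vocab), y.view(-1))

optimizer.zero_grad()
loss.backward()      # back-propagate the gradient of the loss
optimizer.step()     # update the parameters theta
```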
Text Generation
Once the model is trained, generation is straightforward: the user provides a prompt, a string of some length.
- If the prompt is too long to fit in the context, then we simply crop the string and keep its final characters; otherwise we leave it unchanged.
- This (potentially truncated) string will be tokenized and reshaped to have shape $(1, T)$, where $T \le n_{\text{ctx}}$ is the number of token indices in the prompt.
- This will be fed through the network to obtain $T$ logit vectors. We are only interested in generating text that follows the user-provided prompt, so really we are only interested in the last logit vector, as this contains the un-normalized log-probabilities for the index of the next token.
- We now simply compute the normalized probabilities over the token indices with the softmax and sample a new token index from them.
We then use this token index to grab the corresponding token, which we append to the (potentially cropped) user prompt, and we repeat from the first step.
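A hedged sketch of this loop, assuming PyTorch, the tiktoken GPT-2 tokenizer, and a trained model that maps a (1, T) tensor of token indices to (1, T, n_vocab) logits; model, n_ctx, and the function name generate are stand-ins of mine, not part of the text:

```python
import torch
import tiktoken

enc = tiktoken.get_encoding("gpt2")

@torch.no_grad()
def generate(model, prompt, n_new_tokens, n_ctx):
    token_indices = enc.encode(prompt)
    for _ in range(n_new_tokens):
        # Keep at most the final n_ctx token indices so the input fits the context.
        context = token_indices[-n_ctx:]
        x = torch.tensor(context).unsqueeze(0)        # shape (1, T)
        logits = model(x)                             # shape (1, T, n_vocab)
        next_logits = logits[0, -1]                   # only the last position matters
        probs = torch.softmax(next_logits, dim=-1)    # normalized probabilities
        next_index = torch.multinomial(probs, num_samples=1).item()
        token_indices.append(next_index)              # append and repeat
    return enc.decode(token_indices)

# Example usage (with a trained model):
# print(generate(model, "Once upon a time", n_new_tokens=50, n_ctx=1024))
```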
Content of the Course
In the rest of the course, I aim to cover:
- Tokenizers: How do we map text to token indices?
- Embeddings: How do we map a token index $i$ to its embedding $e_i$?
- Positional Encoding: How do we encode positional information into the embeddings?
- LLM structure: What’s the inner structure of an LLM and what design options do we have?
- Unsupervised Pre-Training: How do we teach the model to sample sensible token indices?
- Supervised Fine-Tuning: After pre-training how do we specialize our LLM to our task?
- Alignment: After fine-tuning, how do we make sure that the LLM’s output is not harmful?