Fine-Tuning
As far as I know, the first paper to suggest Supervised Fine-Tuning was the paper introducing InstructGPT. However, I don't find their description of the fine-tuning procedure very clear; the paper introducing LoRA does a better job, in my opinion.
There are two key differences between unsupervised pre-training (UPT) and supervised fine-tuning (SFT):
- Training data:
  - SFT: Each example in our dataset consists of a pair of texts $(x, y)$, where $x$ is the prompt and $y$ is the answer that we wish our model to learn. For instance, $x$ could be a piece of text to summarise and $y$ could be its summary. As usual, these have been tokenized and are therefore sequences of token indices.
  - UPT: Each example in our dataset is simply one sequence of token indices.
- Loss function:
  - SFT: We use cross-entropy for next-token prediction, just like in UPT, except that we only do this for the tokens in $y$ and neglect those in $x$. Below, I will write $z_t$ for the logit, corresponding to the correct target token index $y_t$, that the language model produces when fed the concatenated token indices $(x, y_{<t})$, where we assume that $y_{<1}$ is empty and adds nothing to $x$. We are computing exactly the same loss as in UPT, except that we don't compute it for the entire sequence but only for the "target/answer" part (see the sketch after this list).
  - UPT: Standard cross-entropy for next-token prediction, over the entire sequence. Here the whole sequence basically takes the place of $y$, with $x$ empty.
Full Supervised Fine-Tuning
The automatic thing to do would then be to train all the parameters of the model on this loss.
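For concreteness, here is what that looks like, reusing the toy `model` and the `sft_loss` helper from the sketch above (the optimizer choice, learning rate, and single-pair dataset are again illustrative assumptions):

```python
import torch

# Full SFT: the optimizer tracks *all* model parameters, so every weight
# in the network is updated, and a separate full copy of the weights ends
# up being needed for each downstream task.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

pairs = [(x, y)]  # toy dataset of tokenized (prompt, answer) pairs
for epoch in range(3):
    for prompt, answer in pairs:
        optimizer.zero_grad()
        sft_loss(model, prompt, answer).backward()
        optimizer.step()
```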
Full fine-tuning works well in terms of performance, but it can be prohibitively expensive to do for every downstream task. A better approach is LoRA.