Fine-Tuning
Paper: Finetuned Language Models Are Zero-Shot Learners
Repo: FLAN Instruction Fine-tuning Repo
LLMs are good at few-shot learning but poor at zero-shot learning. The authors take a 137B-parameter pre-trained language model and perform instruction fine-tuning on it, obtaining FLAN (Fine-tuned LAnguage Net).
Instruction-FT vs Pretrain-FT vs Prompting
Given a pre-trained LLM, the authors distinguish between three ways of adapting it to a task:
- Pretrain Fine-Tuning (e.g. BERT, T5): Fine-tune the pre-trained LLM on a specific task A, then test it by performing inference on that same task. The resulting model is specialized to task A and requires many task-specific examples.
- Prompting (e.g. GPT-3): Use few-shot prompting or prompt engineering to improve model performance on task A, without updating the model's weights.
- Instruction Fine-Tuning: Instruction fine-tune on many tasks phrased as natural language instructions, then perform inference on an unseen task (see the sketch after this list).
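To make the contrast concrete, here is a minimal sketch of how the same sentiment task might be presented under prompting versus instruction fine-tuning. The prompt wording is illustrative and not taken from the paper.

```python
# Prompting (GPT-3 style): a few in-context examples, no weight updates.
few_shot_prompt = (
    "Review: A delightful, well-acted film. Sentiment: positive\n"
    "Review: Dull and far too long. Sentiment: negative\n"
    "Review: I would watch it again tomorrow. Sentiment:"
)

# Instruction fine-tuning (FLAN style): the task is phrased as a natural
# language instruction; the model is fine-tuned on many such instruction
# tasks and then evaluated zero-shot on an unseen one.
instruction_prompt = (
    "Is the sentiment of the following movie review positive or negative?\n"
    "Review: I would watch it again tomorrow."
)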
Data preparation
Rather than creating datasets from scratch, they take publicly available datasets from TensorFlow Datasets and, for each dataset, they manually write $10$ unique instruction templates. For diversity, up to three of these templates describe the “opposite” task. E.g. for question answering, one template could ask to generate a question from an answer.
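A minimal sketch of this templating idea, assuming a toy question-answering dataset; the template strings and field names are illustrative assumptions, not the paper's actual templates.

```python
import random

QA_TEMPLATES = [
    "Answer the following question.\n\nQuestion: {question}\n\nAnswer:",
    "{question}\n\nWhat is the answer to the question above?",
    # "Opposite" template: generate a question from the answer.
    "Write a question whose answer is: {answer}\n\nQuestion:",
]

def render_example(example: dict) -> str:
    """Pick one of the manually written templates at random and fill it in."""
    template = random.choice(QA_TEMPLATES)
    return template.format(**example)

print(render_example({"question": "Who wrote Hamlet?", "answer": "Shakespeare"}))
```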
Evaluation
Datasets are clustered by task type. To evaluate the model on a task cluster A, the pre-trained LLM is instruction fine-tuned on all other clusters and then tested on the datasets in cluster A.
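A sketch of this leave-one-cluster-out protocol. The cluster contents below are a small subset of the paper's task clusters, and `instruction_finetune` and `evaluate_zero_shot` are hypothetical callables standing in for the two stages described above, not functions from the FLAN repo.

```python
TASK_CLUSTERS = {
    "nli": ["anli_r1", "rte", "cb"],
    "sentiment": ["sst2", "imdb"],
    "summarization": ["xsum", "cnn_dailymail"],
}

def leave_one_cluster_out(pretrained_model, held_out,
                          instruction_finetune, evaluate_zero_shot):
    # Fine-tune on every dataset outside the held-out cluster...
    train_datasets = [
        ds
        for cluster, datasets in TASK_CLUSTERS.items()
        if cluster != held_out
        for ds in datasets
    ]
    model = instruction_finetune(pretrained_model, train_datasets)
    # ...then evaluate zero-shot on the held-out cluster's datasets.
    return {ds: evaluate_zero_shot(model, ds) for ds in TASK_CLUSTERS[held_out]}
```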
Classification tasks
Most tasks require generating text, which is perfect because that is what the pre-trained LM does well. Some tasks, however, require a “classification” output. In this case, they append a new token called OPTIONS at the end of the classification prompt, followed by a list of the available options (e.g. “yes” and “no”), so the model knows which outputs are admissible.
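A small sketch of appending the OPTIONS suffix; the exact formatting of the suffix is an assumption, since the paper only states that an OPTIONS token followed by the admissible answers is appended.

```python
def append_options(prompt: str, options: list) -> str:
    # Append the OPTIONS token and the list of admissible answers.
    return prompt + "\nOPTIONS:\n" + "\n".join(f"- {o}" for o in options)

print(append_options(
    "Premise: It is raining. Hypothesis: The ground is wet. "
    "Does the premise entail the hypothesis?",
    ["yes", "no"],
))
```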
Architecture
The base model is LaMDA-PT, a 137B-parameter decoder-only language model pre-trained on 2.49T BPE tokens with a 32k-token vocabulary built using the SentencePiece library. The pre-training data consists of web documents, dialog data and Wikipedia, with about 10% non-English text. FLAN is the instruction fine-tuned version of this model.
Training
All datasets except those in the held-out cluster are mixed together, and training batches are randomly sampled from this mixture. To balance datasets of very different sizes, the number of training examples per dataset is capped at $30k$.
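A toy, in-memory sketch of this mixing scheme: cap each dataset, pool everything outside the held-out cluster, and sample batches at random. The real setup mixes tf.data pipelines proportionally to (capped) dataset size; the function and variable names here are illustrative.

```python
import random

MAX_EXAMPLES_PER_DATASET = 30_000

def build_mixture(datasets: dict, held_out_cluster: set) -> list:
    """Pool capped datasets, skipping those in the held-out cluster."""
    mixture = []
    for name, examples in datasets.items():
        if name in held_out_cluster:
            continue  # never train on the held-out cluster
        mixture.extend(examples[:MAX_EXAMPLES_PER_DATASET])
    random.shuffle(mixture)
    return mixture

def sample_batch(mixture: list, batch_size: int = 32) -> list:
    """Randomly sample one training batch from the mixed pool."""
    return random.sample(mixture, batch_size)
```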