Landmark Papers
- Improving Language Understanding by Generative Pre-Training - Radford 2018, OpenAI - GPT, context $512$. This is a decoder-only architecture. Training consists of two stages:
- unsupervised pre-training: Uses the language modeling objective on unlabeled data to provide a good initialization for the parameters. They maximise the likelihood $L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$, where $u_1, \ldots, u_n$ are the tokens of the unlabeled corpus $\mathcal{U}$ and $k$ is the size of the context window.
- supervised fine-tuning: Uses the supervised objective on a manually annotated dataset $\mathcal{C}$ for the target task. Each sequence $x^1, \ldots, x^m$ in the supervised dataset has an associated label $y$. The sequence is fed through the Transformer to obtain the final activation $h_l^m$. This is then fed through another, task-specific, linear layer with matrix $W_y$, and then through a soft-max: $P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$, which gives the supervised objective $L_2(\mathcal{C})$. Some very structured tasks (entailment, similarity, question answering) require task-specific input transformations and small modifications at the end of the network. We then maximise a mix of the supervised and unsupervised objectives, $L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$; see the sketch after this entry.
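A minimal PyTorch sketch of this mixed fine-tuning objective. The names (`transformer`, `clf_head`, `lam`) and the shapes are illustrative assumptions, not the paper's actual code:

```python
import torch.nn as nn
import torch.nn.functional as F

class GPTFineTuner(nn.Module):
    """Sketch of GPT-style fine-tuning: a task-specific classification head
    on top of a pre-trained decoder, trained with L3 = L2 + lambda * L1."""

    def __init__(self, transformer, d_model, vocab_size, num_classes):
        super().__init__()
        self.transformer = transformer                              # pre-trained decoder stack
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # for the LM loss L1
        self.clf_head = nn.Linear(d_model, num_classes)             # task-specific W_y

    def forward(self, tokens, labels, lam=0.5):
        h = self.transformer(tokens)                 # (batch, seq_len, d_model)

        # L2: supervised loss from the final activation h_l^m through softmax(h W_y)
        l2 = F.cross_entropy(self.clf_head(h[:, -1, :]), labels)

        # L1: auxiliary language-modelling loss on the same sequences
        lm_logits = self.lm_head(h[:, :-1, :])
        l1 = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                             tokens[:, 1:].reshape(-1))

        # L3 = L2 + lambda * L1: the mixed fine-tuning objective
        return l2 + lam * l1
```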
- Language Models are Unsupervised Multitask Learners - Radford 2019, OpenAI - GPT-2, 1.5B parameters, pre-LN, context $1024$, batch size $512$. Scale allows GPT-2 to achieve SOTA on language-modelling datasets that it was not trained on, making it a zero-shot learner.
- Data: To improve on CommonCrawl, they only include documents that have been curated/filtered by humans: they scraped all outbound links from Reddit that received at least 3 karma, up to December 2017, and removed all Wikipedia documents to avoid training on data that the model is evaluated on. The resulting dataset, called WebText, contains 8 million documents totalling 40 GB of text.
- Encoding: They use a modification of Byte-Pair Encoding (BPE) that operates on raw UTF-8 bytes rather than Unicode code points, so the base vocabulary has only 256 symbols and no input is out-of-vocabulary; merges across character categories are prevented (with an exception for spaces) to avoid wasteful tokens. See the byte-level sketch after this entry.
- Residual Scaling: In deep networks contributions from the residual connections accumulate along the residual path, thus the weights of residual layers are scaled at initialization by $1/\sqrt{N}$, where $N$ is the number of residual layers; see the initialization sketch after this entry.
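A minimal sketch of the byte-level idea behind the encoding. The helper names and the single hand-picked merge are illustrative assumptions; the real tokenizer learns its merges from data:

```python
def to_byte_tokens(text: str) -> list[int]:
    """Base vocabulary: the 256 possible byte values of the UTF-8 encoding,
    so any string tokenizes without <UNK> tokens."""
    return list(text.encode("utf-8"))

def merge_pair(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """One BPE merge step: replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = to_byte_tokens("naïve café")        # non-ASCII text still maps to base tokens
tokens = merge_pair(tokens, (32, 99), 256)   # e.g. merge b" " + b"c" into a new token
print(tokens)
```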
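A PyTorch sketch of that initialization, assuming a hypothetical list of transformer blocks whose residual output projections are named `attn_proj` and `mlp_proj` (the attribute names and the 0.02 base standard deviation are assumptions for illustration):

```python
import math
import torch.nn as nn

def scale_residual_projections(blocks: nn.ModuleList) -> None:
    """Shrink the init of the residual-branch output projections by 1/sqrt(N),
    where N is the number of residual layers (two per block: attention + MLP),
    so the residual stream's variance stays roughly constant with depth."""
    n_residual = 2 * len(blocks)
    for block in blocks:
        for proj in (block.attn_proj, block.mlp_proj):   # hypothetical attribute names
            nn.init.normal_(proj.weight, mean=0.0, std=0.02 / math.sqrt(n_residual))
```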
- Language Models are Few-Shot Learners: