Landmark Papers

  • Improving Language Understanding by Generative Pre-Training - Radford 2018, OpenAI - GPT, context 512. This is a decoder-only architecture. Training consists of two stages:
    1. unsupervised pre-training: Uses the language modeling objective on unlabeled data to provide a good initialization for the parameters. They maximise the likelihood $\sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$, where $k$ is the context window.
    2. supervised fine-tuning: Uses a supervised objective with manually annotated datasets, depending on the target task. Each sequence $x^1, \dots, x^m$ in the supervised dataset has an associated label $y$. The sequence is fed through the Transformer to obtain the final activation $h_l^m$, which is then fed through an additional, task-specific linear layer with matrix $W_y$, followed by a softmax. Some very structured tasks require further modifications at the end of the network. We then maximise a mix of the supervised and unsupervised objectives (a code sketch of this combined objective follows this list): $\sum_{(x,y)} \log \operatorname{softmax}(h_l^m W_y) + \lambda \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$
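
A minimal PyTorch-style sketch of the combined fine-tuning objective above (minimizing the negative log-likelihood is equivalent to maximizing the likelihood). The `transformer`, `lm_head`, and `W_y` arguments are hypothetical stand-ins for the decoder stack, not the paper's actual module names:

```python
import torch.nn.functional as F

def gpt_finetune_loss(transformer, lm_head, W_y, tokens, label, lam=0.5):
    """Combined fine-tuning loss for one labelled sequence.

    tokens: LongTensor [1, m]; label: LongTensor [1] with the class index.
    `transformer` is assumed to return final-layer activations h_l of shape
    [1, m, d]; `lm_head` maps them to vocabulary logits; W_y is [d, n_classes].
    """
    h_l = transformer(tokens)                       # [1, m, d]

    # Supervised term: last position's activation h_l^m through W_y, softmax.
    # cross_entropy = -log softmax(h_l^m W_y)[y]
    class_logits = h_l[:, -1, :] @ W_y              # [1, n_classes]
    supervised = F.cross_entropy(class_logits, label)

    # Auxiliary language-modelling term on the same sequence:
    # predict token u_i from the preceding tokens (shift by one).
    lm_logits = lm_head(h_l)                        # [1, m, vocab]
    lm = F.cross_entropy(lm_logits[:, :-1].transpose(1, 2), tokens[:, 1:])

    # Minimizing supervised + lam * lm maximizes L2 + lambda * L1.
    return supervised + lam * lm
```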
  • Language Models are Unsupervised Multitask Learners - Radford 2019, OpenAI - GPT-2, 1.5B parameters, pre-LN, context 1024, batch size 512. Scale allows GPT-2 to achieve SOTA on language modelling datasets that it was not trained on, making it a zero-shot learner.
    • Data: To improve on the quality of CommonCrawl, they only include documents that have been curated/filtered by humans: they scraped all outbound links from Reddit which received at least 3 karma, up to December 2017, and removed all Wikipedia links to avoid training on data the model is evaluated on. The resulting dataset is called WebText and includes 8 million documents, with a total of 40GB.
    • Encoding: They use a modification of Byte-Pair Encoding (BPE) that operates on bytes (a plain BPE sketch follows this list).
    • Residual Scaling: In deep networks contributions along the residual path accumulate with depth, so the weights of residual layers are scaled at initialization by $1/\sqrt{N}$, where $N$ is the number of residual layers (see the initialization sketch after this list).
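
Since the notes only name BPE, here is a minimal sketch of plain BPE merge learning for reference; the byte-level alphabet and the extra merge restrictions GPT-2 adds on top of this are omitted, and the corpus and function name are illustrative:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words, most frequent pair first."""
    # Represent each word as a tuple of symbols (single characters to start).
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# learn_bpe(["low", "lower", "lowest", "newest"], num_merges=3)
# -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```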
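
And a minimal sketch of the $1/\sqrt{N}$ residual scaling at initialization, assuming a model whose residual branches end in projection layers named `attn_proj` and `mlp_proj` (illustrative names, not GPT-2's actual ones):

```python
import math
import torch

def scale_residual_projections(model, num_residual_layers):
    """Scale the output projection of every residual branch by 1/sqrt(N)
    at initialization, so the variance of the residual stream does not
    grow with the number N of accumulated residual layers."""
    scale = 1.0 / math.sqrt(num_residual_layers)
    with torch.no_grad():
        for name, param in model.named_parameters():
            # 'attn_proj' / 'mlp_proj' are assumed names for the layers
            # whose outputs are added back onto the residual stream.
            if name.endswith(("attn_proj.weight", "mlp_proj.weight")):
                param.mul_(scale)
```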
  • Language Models are Few-Shot Learners - Brown 2020, OpenAI - GPT-3, 175B parameters, context 2048.