Landmark Papers

  • Improving Language Understanding by Generative Pre-Training - Radford 2018, OpenAI - GPT, context 512. This is a decoder-only architecture. Training consists of two stages:
    1. unsupervised pre-training: Uses the language modeling objective on unlabeled data to provide a good initialization for the parameters. They maximise the likelihood $\sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$, where $k$ is the context window.
    2. supervised fine-tuning: Uses a supervised objective with manually annotated datasets, depending on the target task. Each sequence $x^1, \dots, x^m$ in the supervised dataset has an associated label $y$. The sequence is fed through the Transformer to obtain the final activation $h_l^m$, which is then fed through an additional, task-specific linear layer with matrix $W_y$, followed by a softmax. Some very structured tasks require further modifications at the end of the network. We then maximise a mix of the supervised and unsupervised objectives (a code sketch of this combined objective follows this list): $\sum_{(x,y)} \log \operatorname{softmax}(h_l^m W_y) + \lambda \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$
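
A minimal PyTorch-style sketch of the combined fine-tuning objective above (minimizing the negative log-likelihood is equivalent to maximizing the likelihood). The `transformer`, `lm_head`, and `W_y` arguments are hypothetical stand-ins for the decoder stack, not the paper's actual module names:

```python
import torch.nn.functional as F

def gpt_finetune_loss(transformer, lm_head, W_y, tokens, label, lam=0.5):
    """Combined fine-tuning loss for one labelled sequence.

    tokens: LongTensor [1, m]; label: LongTensor [1] with the class index.
    `transformer` is assumed to return final-layer activations h_l of shape
    [1, m, d]; `lm_head` maps them to vocabulary logits; W_y is [d, n_classes].
    """
    h_l = transformer(tokens)                       # [1, m, d]

    # Supervised term: last position's activation h_l^m through W_y, softmax.
    # cross_entropy = -log softmax(h_l^m W_y)[y]
    class_logits = h_l[:, -1, :] @ W_y              # [1, n_classes]
    supervised = F.cross_entropy(class_logits, label)

    # Auxiliary language-modelling term on the same sequence:
    # predict token u_i from the preceding tokens (shift by one).
    lm_logits = lm_head(h_l)                        # [1, m, vocab]
    lm = F.cross_entropy(lm_logits[:, :-1].transpose(1, 2), tokens[:, 1:])

    # Minimizing supervised + lam * lm maximizes L2 + lambda * L1.
    return supervised + lam * lm
```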
  • Language Models are Unsupervised Multitask Learners - Radford 2019, OpenAI - GPT-2, 1.5B parameters, pre-LN, context 1024, batch size 512. Scale allows GPT-2 to achieve SOTA on language modelling datasets that it was not trained on, making it a zero-shot learner.
    • Data: To improve on the quality of CommonCrawl, they only include documents that have been curated/filtered by humans: they scraped all outbound links from Reddit which received at least 3 karma, up to December 2017, and removed all Wikipedia links to avoid training on data the model is evaluated on. The resulting dataset is called WebText and includes 8 million documents, with a total of 40GB.
    • Encoding: They use a modification of Byte-Pair Encoding (BPE) that operates on bytes (a plain BPE sketch follows this list).
    • Residual Scaling: In deep networks contributions along the residual path accumulate with depth, so the weights of residual layers are scaled at initialization by $1/\sqrt{N}$, where $N$ is the number of residual layers (see the initialization sketch after this list).
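
Since the notes only name BPE, here is a minimal sketch of plain BPE merge learning for reference; the byte-level alphabet and the extra merge restrictions GPT-2 adds on top of this are omitted, and the corpus and function name are illustrative:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words, most frequent pair first."""
    # Represent each word as a tuple of symbols (single characters to start).
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# learn_bpe(["low", "lower", "lowest", "newest"], num_merges=3)
# -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```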
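
And a minimal sketch of the $1/\sqrt{N}$ residual scaling at initialization, assuming a model whose residual branches end in projection layers named `attn_proj` and `mlp_proj` (illustrative names, not GPT-2's actual ones):

```python
import math
import torch

def scale_residual_projections(model, num_residual_layers):
    """Scale the output projection of every residual branch by 1/sqrt(N)
    at initialization, so the variance of the residual stream does not
    grow with the number N of accumulated residual layers."""
    scale = 1.0 / math.sqrt(num_residual_layers)
    with torch.no_grad():
        for name, param in model.named_parameters():
            # 'attn_proj' / 'mlp_proj' are assumed names for the layers
            # whose outputs are added back onto the residual stream.
            if name.endswith(("attn_proj.weight", "mlp_proj.weight")):
                param.mul_(scale)
```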
  • Language Models are Few-Shot Learners - Brown 2020, OpenAI - GPT-3, 175B parameters, context 2048.