A brief on GPT-2 and GPT-3 models

Summarizing OpenAI's GPT-2 and GPT-3 models

Chamanth mvs
Jul 30, 2023

In one of my previous articles, I discussed the decoder-only transformer model, which is the GPT-1 model from OpenAI.

The GPT-2 model is designed by making minor changes to the architecture of the GPT-1 model. Similarly, the GPT-3 model is a slightly enhanced version of the GPT-2 model.

GPT-2 model

It is easier to understand the GPT-2 model when it is compared with the GPT-1 model, because GPT-2's architecture is based on GPT-1 with a few changes.

Change-1 : Layer-normalization layer was moved to the input of each sub-block.

(Right) Decoder-Transformer block in GPT-1 model and (Left) Decoder-Transformer block in GPT-2 model
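To make this change concrete, here is a minimal PyTorch-style sketch (my own illustration, not OpenAI's code) of the two block variants: GPT-1 applies layer normalization after each sub-block (post-LN), while GPT-2 moves it to the input of each sub-block (pre-LN).

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """GPT-1 style: layer normalization applied after each sub-block (post-LN)."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = self.ln1(x + self.attn(x, x, x, attn_mask=mask)[0])  # norm after attention
        x = self.ln2(x + self.mlp(x))                            # norm after feed-forward
        return x

class PreLNBlock(nn.Module):
    """GPT-2 style: layer normalization moved to the input of each sub-block (pre-LN)."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        h = self.ln1(x)                                          # norm before attention
        x = x + self.attn(h, h, h, attn_mask=mask)[0]
        x = x + self.mlp(self.ln2(x))                            # norm before feed-forward
        return x
```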

Change-2 : An additional layer normalization was added after the final self-attention block.

(Right) GPT-1 model overall architecture and (Left) GPT-2 model overall architecture

Change-3 : The GPT-1 model can use at most 512 input tokens, whereas the GPT-2 model can use up to 1024 input tokens. GPT-1 is trained with a vocabulary of around 40k tokens, whereas GPT-2 is trained with a vocabulary of 50,257 tokens.

→ Tokens roughly correspond to words. When it is said that the model can consider 1024 input tokens, it means that this is the maximum number of tokens that can be given as input to the model. What if the input sentence has more than 1024 words? The simplest technique is to truncate (discard every word beyond the limit); there are other techniques that can be used besides truncating. In practice, there can be at most 1022 content tokens in the sentence, because <start> and <end> are also counted as two tokens.

(Right) vocabulary matrix size and (Left) input token size
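As a small illustration, assuming the Hugging Face transformers library and the public "gpt2" checkpoint, truncating an over-long input to the 1024-token context limit could look like this:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # 50,257-token BPE vocabulary

long_text = "some very long document " * 2000      # far more than 1024 tokens
enc = tokenizer(long_text, truncation=True, max_length=1024)

print(len(enc["input_ids"]))   # 1024 -- everything beyond the limit is discarded
```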

→ Each word should be embedded; in simple terms, the word is represented as a list of numbers that captures its meaning. The smallest GPT-2 model uses an embedding size of 768 per word/token.
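A minimal sketch of this embedding lookup, assuming PyTorch (the token ids below are made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 768                 # GPT-2 small sizes
token_embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[464, 11571, 2746]])   # one 3-token sequence (illustrative ids)
vectors = token_embedding(token_ids)             # each id becomes a 768-dimensional vector
print(vectors.shape)                             # torch.Size([1, 3, 768])
```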

Change-4 : In GPT-1, the batch size used is 64, whereas in GPT-2 the batch size is increased to 512.

These are all the noticeable changes in terms of model/architecture from GPT-1 to GPT-2.

Types of GPT-2 models and training data

The researchers proposed four sizes of the GPT-2 model.

GPT-2 small : 117 million parameters, 12 decoder-transformer blocks, and an embedding size of 768 dimensions per token.

GPT-2 medium : 345 million parameters, 24 decoder-transformer blocks, and an embedding size of 1024 dimensions per token.

GPT-2 large : 762 million parameters, 36 decoder-transformer blocks, and an embedding size of 1280 dimensions per token.

GPT-2 very-large : 1.5 billion parameters, 48 decoder-transformer blocks, and an embedding size of 1600 dimensions per token.
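Collecting the figures above into a small Python configuration table makes the scaling pattern easy to see:

```python
# GPT-2 model family, as listed above: (decoder blocks, embedding size, approx. parameters)
GPT2_CONFIGS = {
    "small":      {"n_layers": 12, "d_model": 768,  "params": "117M"},
    "medium":     {"n_layers": 24, "d_model": 1024, "params": "345M"},
    "large":      {"n_layers": 36, "d_model": 1280, "params": "762M"},
    "very-large": {"n_layers": 48, "d_model": 1600, "params": "1.5B"},
}

for name, cfg in GPT2_CONFIGS.items():
    print(f"GPT-2 {name}: {cfg['n_layers']} blocks, "
          f"d_model={cfg['d_model']}, ~{cfg['params']} parameters")
```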

The GPT-2 very-large model has roughly 10 times more parameters than the GPT-1 model.

The GPT-2 model is trained on data scraped from web pages: almost 8 million web-page documents, a corpus the GPT-2 authors call WebText.

GPT-2 is focused on zero-shot learning

The GPT-2 model places a strong emphasis on zero-shot learning. The concept of zero-shot learning can be understood from the attached resource.

Example : Difference between fine-tuning-based learning and zero-shot learning

Let's assume there is a question-answering problem to be solved.

If this task is to be solved using the GPT-1 or BERT model, then the sequence of steps to follow would be

  1. Pretrain the model using Next Word Prediction or Next Sentence Prediction.
  2. Finetune the model with a Question-Answer corpus.
  3. Then provide Paragraph <delimiter> Question as input; the model generates an answer as output.

If the same task is to be solved using GPT-2 (Zero-shot learning)

  1. Pretrain the model using Next Word Prediction or Next Sentence Prediction.
  2. Then provide Paragraph <delimiter> Question as input; the model generates an answer as output.
  • There is no fine-tuning step in the zero-shot learning process (a sketch of this setup follows below).
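As a concrete sketch of the zero-shot setup, assuming the Hugging Face transformers library and the public "gpt2" checkpoint (the paragraph, question, and prompt format below are my own illustration): the paragraph and question are simply concatenated into one prompt, and the pretrained model continues the text, with no fine-tuning step.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # pretrained only, no fine-tuning

paragraph = "GPT-2 is a decoder-only transformer language model released by OpenAI in 2019."
question = "Who released GPT-2?"

# Paragraph <delimiter> Question, phrased so the model's continuation is the answer
prompt = f"{paragraph}\nQuestion: {question}\nAnswer:"

output = generator(prompt, max_new_tokens=20, do_sample=False)
print(output[0]["generated_text"])
```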

What is meant by a generative model with primed input?

In simple terms, it is like any other model that generates data based on the input. So, what is meant by primed? Priming means giving the model some context about what the output should be.

For example, suppose the question or prompt is: What is GPT?

What is GPT? primes the model with input, because to answer this question we are giving the model context (priming) that the question is about GPT.

Given What is GPT?, the model first outputs the word The; then What is GPT? The becomes the input to the model to generate the next word.

Given What is GPT? The, the model outputs the word GPT; then What is GPT? The GPT becomes the input again to generate the next word.

Given What is GPT? The GPT, the model outputs the word model; then What is GPT? The GPT model becomes the input again to generate the next word.

This process continues until the complete answer has been output by the model, as in the sketch below.
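The priming loop described above can be written as a simple greedy-decoding sketch, assuming the Hugging Face transformers GPT-2 model (real systems may instead use sampling or beam search):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("What is GPT?", return_tensors="pt")["input_ids"]  # the primed input

for _ in range(20):                                          # generate up to 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits                     # shape: (1, seq_len, vocab_size)
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=1)       # feed the growing text back in
    if next_id.item() == tokenizer.eos_token_id:             # stop at end-of-sequence
        break

print(tokenizer.decode(input_ids[0]))
```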

GPT models, or generative large language models in general, tend to fail when they are prompted with very specialized or rare content.

GPT-3 model

Very minor changes have been made to the architecture of the GPT-3 model when compared to the GPT-2 model.

At a higher level, the noticeable architectural change in GPT-3 compared to GPT-2 is the use of alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer; this is a slight change to the attention mechanism of each transformer block.
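As a rough illustration (not OpenAI's implementation), a locally banded attention pattern can be expressed as a mask that lets each position attend only to a small window of recent positions, while a dense layer attends to all previous positions; GPT-3 alternates layers of these two kinds.

```python
import torch

def dense_causal_mask(seq_len):
    """Dense attention: every token attends to all previous tokens (and itself)."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def banded_causal_mask(seq_len, window):
    """Locally banded attention: every token attends only to the last `window` tokens."""
    dense = dense_causal_mask(seq_len)
    band = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=-(window - 1))
    return dense & band

print(dense_causal_mask(6).int())
print(banded_causal_mask(6, window=3).int())   # 1s only in a diagonal band of width 3
```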

The GPT-3 model was released in 8 different sizes, ranging from GPT-3 Small to the full GPT-3 (which is the base of ChatGPT).

The GPT-3 model has 175 billion parameters, which is in fact more than 100 times the largest GPT-2 model.

The GPT-3 model is pretrained on 300 billion tokens (roughly, word pieces), and the data is Common Crawl internet data dated between 2016 and 2019, along with Wikipedia data; training the GPT-3 model is estimated to have cost around USD 5 million.

If you think about it logically, the fact that GPT-3 is trained on such huge data is the reason it is able to answer most web-based questions.

We have been speaking about large language models, but

What are large language models?

A large language model is a trained deep-learning model that understands and generates text in a human-like fashion.

It is actually a transformer model at a very large scale that does all the magic behind the scenes. These large models are trained on vast amounts of text so they can learn the patterns and structures of language; our GPT-3 model, for example, is trained on a huge corpus of internet data. The attention mechanism allows LLMs to capture long-range dependencies between words, so the model can understand context.

As described above, most LLMs have many millions (or even billions) of parameters to train and run, which makes them so complex and huge that they cannot be run on a single computer. So these LLMs are provided over an API or a web interface, such as the ChatGPT service, which is backed by a GPT-3 model trained on massive amounts of text data from the internet.

During the training process, the model learns the semantic and statistical relationships between words, phrases, and sentences, allowing it to generate coherent and contextually relevant responses when given a prompt or query.
