LLM and Fine-Tuning

Understanding more about Large Language models

Chamanth mvs
AI Mind


A Large Language Model (LLM) is a powerful kind of neural network that enables computers to understand and generate human language. LLMs are based on the Transformer architecture.

These LLMs are trained on massive datasets, a huge corpus of internet data, and show impressive results such as understanding complex, nuanced language and generating fluent text.

In traditional machine learning, a model is trained for one specific task, like sentiment analysis or machine translation, but an LLM can be used for a variety of tasks: chat, summarization, translation, analysis, and many other language-related applications. The best part is that you need not be good at machine learning to use LLMs.

LLMs are general purpose language models that can be pre-trained and then fine-tuned for specific purposes.

What do pre-training and fine-tuning mean?

We teach our children to sit, stand, walk, run, and jump, the basic motor skills that help them move and complete tasks efficiently.


This motor skill development supports cognitive, speech, and sensory development.


If a child becomes interested in athletics, he or she then undergoes special training in that particular sport.

A similar idea applies to large language models. An LLM is first trained for general purposes, learning to solve common language problems like text generation and summarization from vast amounts of publicly available data.

These large language models can then be tailored to solve specific problems in fields like finance or healthcare using the limited data available there.

As discussed above, large, general purpose, and pre-trained/fine-tuned can be considered the three defining features of large language models.

Large can be understood in two different ways:

  1. Large training dataset: LLMs are trained on a vast corpus of data containing millions of internet-based text documents and Wikipedia data.
  2. Large number of parameters: Parameters are the weights the model learns during training (not to be confused with hyperparameters, which are configuration settings chosen before training). These parameters encode the knowledge the machine acquires from training and define the model's skill in solving a problem.

General purpose refers to the model's ability to solve common problems, for two reasons:

  1. Commonality of human language: irrespective of the specific task, the words used in a language carry the same contextual meanings.
  2. Resource constraints: training such a huge language model on a huge dataset with so many parameters requires resources that only big tech companies can afford. So these tech giants create foundational language models that others can then use.

Pre-training and fine-tuning of large language models: LLMs are pre-trained for general purposes on larger datasets, then fine-tuned for task-specific problems with task-specific datasets.

Why should we use LLMs?

With the help of an LLM, a single model can be used for many different tasks, unlike traditional ML, where a separate model needs to be created for each specific task.

LLMs achieve decent performance on problem-specific tasks even with limited data (few-shot learning) or none at all (zero-shot learning).

In simple words, zero-shot learning solves a task without the model receiving any examples of that task, while few-shot learning solves a task by providing the model with just a few examples (very limited training data).
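
As a quick illustration, here is a hypothetical pair of prompts showing the difference (the sentiment task and wording are made up for this example):

```python
# Zero-shot: the model gets only a task description, no solved examples.
zero_shot_prompt = """Classify the sentiment of this review as positive or negative.
Review: The battery barely lasts two hours.
Sentiment:"""

# Few-shot: the same task, but with a few worked examples in the prompt.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.
Review: Absolutely loved the camera quality. Sentiment: positive
Review: The screen cracked within a week. Sentiment: negative
Review: The battery barely lasts two hours.
Sentiment:"""
```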

To understand more about these concepts, you can refer to this resource.

A large language model's performance continuously improves as it is trained with more data and more parameters. GPT-2 vs GPT-3 is the classic example: GPT-3 outperforms GPT-2 because of its increased parameter count and the additional data it was trained on.

For example, consider the task of classifying between a cricket bat and a hockey stick.

In traditional programming, rules can be hard-coded to perform this specific task, for instance:

cricket-bat : {used_for : sport, handle : yes, wide_bottom : yes}
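
As a minimal sketch, such hard-coded rules might look like this in Python (the attribute names follow the example above, and `curved_end` is a hypothetical addition, not a real API):

```python
def classify(item: dict) -> str:
    # Hand-written rules over the attributes chosen above.
    if item.get("handle") and item.get("wide_bottom"):
        return "cricket-bat"
    if item.get("handle") and item.get("curved_end"):  # hypothetical attribute
        return "hockey-stick"
    return "unknown"

print(classify({"used_for": "sport", "handle": True, "wide_bottom": True}))  # cricket-bat
```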

Using neural networks, the classification can be done with CNN-based models or object-detection algorithms like YOLO.

Using generative AI, content can be generated from text, images, audio, or video. When asked about a cricket bat through a prompt or a verbal question, a generative AI model can output everything it has learned about cricket bats.

All centered around Prompts

With LLMs, building a generative-AI-powered application has become simple, as one need not be an expert at assembling training examples to train a model. The LLM gives the desired response if the prompt is clear to it. It all boils down to prompt design: the process of creating a prompt that is clear, concise, and informative.

Prompt design and prompt engineering are similar concepts in NLP; the key differences between them are as follows.

Prompt design is the process of creating a prompt tailored to the specific task at hand.

It focuses on crafting high-quality conversational prompts that draw specific responses from an AI model.

Prompt engineering is the process of creating a prompt designed to improve performance, by applying domain-specific knowledge, providing examples of the desired output, or using carefully chosen keywords in the prompt.

Prompt engineering is a strategy for creating precise instructions that extract more performance from an AI model.
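
To make the distinction concrete, here is a hypothetical illustration: the first prompt is a plain, clear task statement (prompt design), while the second adds a persona, domain keywords, and an example of the desired output (prompt engineering):

```python
# Prompt design: a clear, concise, informative statement of the task.
designed_prompt = "Summarize the following customer complaint in one sentence."

# Prompt engineering: the same task, enriched with domain-specific framing
# and an example of the desired output format. {complaint_text} is a
# template slot to be filled in before sending the prompt.
engineered_prompt = """You are a support analyst. Summarize the following
customer complaint in one sentence, mentioning the product, the issue,
and the requested resolution.

Example:
Complaint: My router keeps dropping Wi-Fi every hour; I want a replacement.
Summary: Router repeatedly drops Wi-Fi; customer requests a replacement.

Complaint: {complaint_text}
Summary:"""
```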

Different types of Language models

There are three commonly used types of large language models:

Generic Language model

Generic language models, also called raw language models, are designed to generate text based on patterns and general knowledge learned during training.

These models excel at providing information, explanations, and summaries without specific instructions or constraints.

Image source: Google Cloud Tech

These models predict the next word based on the language in the training data.

In the above example, "The cat sat on" are the tokens, and the next token should be "the", the most likely next word. This kind of language model is similar to autocomplete in a search engine.
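
A minimal sketch of this next-token prediction, using the Hugging Face transformers library with GPT-2 standing in as a generic language model (the model choice and API usage are my assumptions, not something specified in this article):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Feed the partial sentence and inspect the distribution over the next token.
inputs = tokenizer("The cat sat on", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence, vocab)

next_id = int(logits[0, -1].argmax())  # highest-probability next token
print(tokenizer.decode([next_id]))     # e.g. " the"
```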

Instruction tuned Language model

Instruction-tuned language models are trained with additional supervision: they are fine-tuned on specific instructions.

These models excel at following directions and generating text according to given guidelines.

Image source: Google Cloud Tech

These language models are trained to generate a response based on the instructions given in the input.

From the examples above, it is clear that the LLM takes instructions in the form of a prompt and generates output based on the instructions provided.

Dialogue tuned Language model

These language models are trained to hold a dialogue by predicting the next response.

Image source: Google Cloud Tech

These dialogue-tuned models are a special case of instruction-tuned models, where prompts are typically framed as questions to the LLM.

These dialogue-tuned language models are trained to engage in conversational interactions. To prompt this type of model, you provide a conversation context with alternating user and model responses. It is important to clearly indicate which part of the dialogue is the user's input and which part is the model's response, as in the image example above.
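
For instance, a dialogue-style prompt might be laid out like this (the role labels and wording are illustrative; each chat model defines its own dialogue format):

```python
# A hypothetical dialogue prompt with alternating, clearly labelled turns.
dialogue_prompt = """User: I want to learn to play the guitar. Where do I start?
Model: Start with an acoustic guitar, learn a few open chords, and practice
switching between them for 15 minutes a day.
User: Which chords should I learn first?
Model:"""
```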

Chain-of-Thought Reasoning and Chain-of-Thought Prompting

Chain-of-Thought Prompting is about guiding the LLM to think step by step. Chain-of-thought reasoning allows models to decompose complex problems into intermediate steps that are solved individually.

Chain-of-Thought Prompting elicits this reasoning through a few-shot learning strategy: the model is given a few examples that spell out the reasoning process behind each answer.

The model is then expected to follow a similar chain of thought when answering the prompt. This approach is particularly effective for complex tasks that require a series of reasoning steps before a response can be generated.

The key advantage of Chain-of-Thought Prompting is its ability to enhance the performance of LLMs on tasks that require arithmetic, commonsense, and symbolic reasoning.

Here is a sample Chain-of-Thought prompt using a few-shot learning strategy:

Q: Satvik has 10 pens. He buys 5 more pen boxes. Each pen box contains 15 pens. How many pens does Satvik have now?
A: Satvik started with 10 pens. 5 pen boxes of 15 pens is 75 pens. 10 + 75 = 85. Therefore, Satvik has 85 pens, and the answer is 85.

Q: Praveen had 90 cows on his dairy farm. If he sold 15 to Prakash and bought twice that number more, how many cows does Farmer Praveen have now?

GPT model’s response:

A: Praveen sold 15 cows to Prakash, so he was left with 90 - 15 = 75 cows. He then bought twice the number of cows he sold, so he bought 2 * 15 = 30 cows. Adding these newly bought cows to his remaining cows, Praveen now has 75 + 30 = 105 cows.

Fine-Tuning a model

Tuning a model helps customize the model's responses using examples of the task you want the model to perform.

Tuning is the process of adapting a model to a new domain or set of custom use cases by training it on new data. Fine-tuning is the process of making small adjustments to achieve the desired output or performance.

In deep learning, fine-tuning involves reusing the weights of a trained neural network to train another model in the same domain, a form of transfer learning.

Fine-tuning speeds up training and can also compensate for a small dataset, without losing the vital information already contained in the pre-existing model.

In NLP terms, fine-tuning an LLM can be thought of like an accomplished linguist living in the USA who speaks several languages fluently and wants to become a tourist guide in the north of India. Even though his overall language skills are quite impressive, he still needs specialized knowledge of Hindi, the local culture, and the local slang to truly shine in his new role. That special training is what we call fine-tuning in our technical terminology.

Fine-tuning serves a similar purpose in the context of LLMs. Despite their impressive ability to generate text in various contexts, LLMs require fine-tuning to perform specific tasks or understand particular domains more accurately.

Fine-tuning trains the LLM on task-specific or domain-specific data, thereby enhancing its performance in those areas.

Fine-tuning can enrich a large language model with domain expertise. A simple analogy: if you are interested in learning artificial intelligence, reading a book on operating systems won't help you much; you would benefit far more from a book on artificial intelligence or machine learning.

If the use case involves reading medical transcripts, an LLM fine-tuned on medical data will perform better than the base model trained on general internet data; the base LLM might fail because it lacks the required medical knowledge. So fine-tuning becomes inevitable when dealing with specialized fields, sensitive data, or unique information that isn't well represented in the general training data.

There are different types of fine-tuning approaches, such as the feature-based approach and parameter-efficient fine-tuning.

Feature-based approach — Reuse the Features

This approach is typically used when your task is related, but not identical, to the task the LLM was originally pre-trained on.

In this approach, the pre-trained LLM acts as a feature extractor: for a given input, it outputs a fixed-size, n-dimensional array. A separate classifier network is then trained on top, either a simple model like logistic regression or a fully connected layer whose last layer has as many neurons as there are output classes.

The input to this classifier is that n-dimensional array, the output of the pre-trained model. For a classification task, the classifier outputs the probability of the input belonging to each class.

During training, only the weights of the separate classifier change while the weights of the pre-trained LLM stay frozen. This is why the pre-trained LLM is said to work as a feature extractor, with a separate network using those features to train a classification model.
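
A minimal sketch of the feature-based approach, using a frozen BERT-style encoder from Hugging Face transformers as the feature extractor and scikit-learn logistic regression as the separate classifier (the model name, the use of the [CLS] embedding, and the toy data are assumptions for illustration):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()  # frozen: we never backpropagate through it

def embed(texts):
    """Return one fixed-size vector per text (the [CLS] token embedding)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :].numpy()

texts = ["the movie was wonderful", "a complete waste of time"]
labels = [1, 0]  # toy sentiment labels: 1 = positive, 0 = negative

clf = LogisticRegression().fit(embed(texts), labels)  # only the classifier trains
print(clf.predict(embed(["what a great film"])))
```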

Finetuning-I approach — Partial re-train

This approach is typically used when you have a large labelled dataset.

In this approach, instead of creating a separate classifier, a few dense layers are appended at the end of the pre-trained LLM model.

During training, the weights of the pre-trained LLM are frozen (they do not change), and only the weights of the newly added dense layers are updated.
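
A PyTorch sketch of this partial fine-tuning (the base model and layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

base = AutoModel.from_pretrained("bert-base-uncased")

# Freeze every weight of the pre-trained model.
for param in base.parameters():
    param.requires_grad = False

# Append new dense layers; only these are updated during training.
head = nn.Sequential(
    nn.Linear(base.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 2),  # 2 output classes
)

# The optimizer only sees the new head's parameters.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
```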

Finetuning-II approach — Re-train entire model

This approach is typically used when your dataset differs significantly from the dataset the LLM was pre-trained on.

In this approach, all the weights of the entire model, including the pre-trained LLM weights, are updated during training. It is resource-intensive and expensive because every parameter of the LLM is involved.
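
In code, the only difference from the partial approach above is that nothing stays frozen, so the optimizer covers every weight (a sketch continuing the previous example):

```python
# Re-train the entire model: all pre-trained weights remain trainable.
for param in base.parameters():
    param.requires_grad = True

# The optimizer now updates the base model and the head together,
# which is what makes this approach so memory- and compute-hungry.
optimizer = torch.optim.AdamW(
    list(base.parameters()) + list(head.parameters()), lr=2e-5
)
```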

The problem with this approach is catastrophic forgetting: the model can lose already acquired pre-trained knowledge as the newly learned features overwrite the previously learnt ones.

Parameter Efficient Fine-tuning — Added Parameters

This type of approach is typically used when your computational resources are limited.

This is a newer approach compared to those above, whose major disadvantage is the sheer number of parameters in state-of-the-art LLMs. Previous articles demonstrated how many parameters GPT models can have: inference alone for a GPT model with 1.5 billion parameters requires at least 16 GB of RAM, and with 175 billion parameters it might require around 400 GB of RAM. Fine-tuning would therefore be quite expensive and resource-intensive for multibillion-parameter LLMs; the larger the base model, the more expensive it is to train all its layers.
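
A rough back-of-the-envelope for where such numbers come from (the precision choices here are my assumptions; actual requirements vary with the implementation):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate RAM needed just to hold the weights, ignoring activations."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(1.5e9, 4))   # ~6 GB in fp32; runtime overhead pushes it higher
print(weight_memory_gb(175e9, 2))   # ~350 GB in fp16, close to the ~400 GB above
```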

Parameter Efficient Fine-tuning (PEFT) solves this problem by training only a subset of parameters. This subset can consist of newly added parameters or selected parameters of the existing model.

This approach has different implementations. At a broad level, the various PEFT approaches can be classified into three main groups based on their underlying mechanism:

  1. Additive-based — (i) Adapters and (ii) Soft prompts
  2. Selection-based
  3. Re-parameterization-based

1. Additive methods: The idea of additive methods is to add extra parameters to an existing pre-trained model and train only those new parameters. Adding extra parameters increases training time, but memory efficiency can be improved with techniques like quantization and reducing the size of the gradients and optimizer states.

1.1. Adapter methods: Adapters are a type of additive PEFT method that adds small fully connected layers after Transformer sub-layers, as sketched below.
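
A sketch of the adapter idea: a small bottleneck of fully connected layers with a residual connection, inserted after a Transformer sub-layer (the dimensions are illustrative):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck added after a Transformer sub-layer.
    Only these weights train; the surrounding layers stay frozen."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection
```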

1.2. Soft prompting methods: In soft prompting approaches such as soft prompt tuning or prefix tuning, a small set of trainable parameters is introduced alongside the model input or prepended to each Transformer block.

2. Selective methods: The selective approach is the simplest, involving fine-tuning only the top layers of the network.

3. Reparameterization-based methods: These methods leverage low-rank representations to minimize the number of trainable parameters. The Low-Rank Adaptation (LoRA) paper adds trainable rank-decomposition matrices alongside each Transformer layer and trains only those newly added weights, as in the sketch below.
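
A minimal LoRA sketch using the Hugging Face peft library (the base model and hyperparameters are illustrative assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add trainable low-rank matrices alongside the attention weights;
# everything else stays frozen.
config = LoraConfig(
    r=8,                        # rank of the decomposition matrices
    lora_alpha=16,              # scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction is trainable
```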
