A step into Zero-Shot Learning

A conceptual understanding of Zero-Shot Learning

6 min read · Jul 29, 2023

Zero-Shot Learning is a subfield of transfer learning, a machine learning method in which a model developed for one task is reused as the starting point for a model on a different task.

Broadly, there are two types of transfer learning: homogeneous transfer learning, where the source and target data share the same feature space, and heterogeneous transfer learning, where the source and target data are represented in different feature spaces.

Zero-Shot Learning helps to solve a task without receiving any example of that task during the training phase. For example, suppose an image contains a cricket bat and the task is to recognize it, but the entire training data contains no image of a cricket bat. Zero-shot learning makes this task achievable.

In simple words, zero-shot learning refers to asking a model to do something it wasn’t explicitly trained to do.

In Zero-Shot Learning, a model is pre-trained on a set of seen-classes and then asked to generalize to a different set of unseen-classes without any additional training.

The goal of Zero-Shot Learning is to transfer the knowledge already learned by the model to the task of classifying the unseen classes.

Zero-shot learning is typically used when there is not enough training data available or when the data is imbalanced.

How does it work?

When I was reading about zero-shot learning, I was very curious to understand how it works. There is training data and there are unseen classes, and no samples from the unseen classes are used during model training. So how does a model trained only on seen-classes recognize unseen-class data?

To put it in simple words: how is it possible to recognize objects that have never been seen before?

The simple answer to the question is Auxiliary Information.

In Zero-Shot Learning, data is divided into three categories:

Seen Classes: Classes that have been used to train the model.
Unseen Classes: Classes that the model needs to be able to classify without any specific training. The data from these classes were not used during training.
Auxiliary Information: Word embeddings, semantic information, or descriptions of all of the unseen classes.

This Auxiliary Information is required to solve the Zero-Shot Learning problem because no labeled examples of the unseen classes are available.

Zero-shot learning builds models that can handle unseen-classes without training on labeled data from them, and it therefore involves two stages:

Training: In this phase the model learns by capturing as much knowledge as possible about the qualities of the data.
Inference: This is the prediction phase, where the knowledge learned during training is applied to classify samples into unseen-classes.

The knowledge from the seen-classes is transferred to the unseen-classes through a high-dimensional vector space, which is also called the semantic space.

For instance, in image classification the semantic space and the image go through two steps:

Joint embedding space: The semantic vectors and the visual feature vectors are projected into a shared space.
Highest similarity: The projected image features are matched against the semantic vector of each unseen-class, and the closest match is chosen.
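The two steps above can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions, not a method taken from a specific paper: the three attributes, the toy visual feature values, and the helper names (`fit_projection`, `classify_unseen`) are all hypothetical. A linear map from visual features into the semantic space is fitted on seen-classes only, and an image from an unseen-class is then assigned to the unseen-class whose attribute vector has the highest cosine similarity.

```python
import math

# Auxiliary information (assumed for illustration): one attribute vector per
# class in the semantic space (long_handle, flat_blade, has_strings).
seen_attrs = {
    "hockey stick":        [1.0, 0.0, 0.0],
    "table tennis paddle": [0.0, 1.0, 0.0],
    "tennis racket":       [0.0, 0.0, 1.0],
}
unseen_attrs = {
    "cricket bat":      [1.0, 1.0, 0.0],
    "badminton racket": [1.0, 0.0, 1.0],
}

# Made-up visual features for one training image per seen-class.
train_set = [
    ([0.9, 0.1, 0.0], "hockey stick"),
    ([0.2, 0.8, 0.1], "table tennis paddle"),
    ([0.1, 0.0, 0.9], "tennis racket"),
]


def project(W, x):
    """Map a visual feature vector into the semantic space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_projection(train_set, attrs, lr=0.5, epochs=200):
    """Step 1: learn the joint embedding with per-sample gradient descent."""
    dim_out = len(next(iter(attrs.values())))
    dim_in = len(train_set[0][0])
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for x, label in train_set:
            # Squared-error gradient between projection and class attributes.
            error = [p - t for p, t in zip(project(W, x), attrs[label])]
            for i in range(dim_out):
                for j in range(dim_in):
                    W[i][j] -= lr * error[i] * x[j]
    return W


def classify_unseen(x, W, candidates):
    """Step 2: pick the unseen-class with the highest similarity."""
    z = project(W, x)
    return max(candidates, key=lambda name: cosine(z, candidates[name]))


W = fit_projection(train_set, seen_attrs)
bat_like_image = [1.1, 0.9, 0.1]  # features of an image from no seen-class
prediction = classify_unseen(bat_like_image, W, unseen_attrs)
print(prediction)
```

With these toy numbers the bat-like image projects close to (1, 1, 0), so it is matched to "cricket bat" even though no cricket bat image was used to fit the projection. Real systems replace the hand-written attributes with word embeddings or text encodings and the linear map with a neural network.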

To understand the concept more intuitively across the two phases (training and inference), let’s consider an image classification example.

Training

If you read the text on the right side of the image, you would understand that there is a cricket bat, a cricket ball, and bails lying on the ground. Someone who knew nothing about cricket would learn that an object with a handle to hold, and a bottom part a few inches wider than the handle, is called a cricket bat. So the more images of a cricket bat you see, the easier it becomes to distinguish between a cricket bat, a hockey stick, a golf stick, and a tennis racket (objects related to outdoor sports).

CLIP (Contrastive Language-Image Pretraining) by OpenAI works in the way described above for zero-shot image classification: the text paired with each image acts as the Auxiliary Information.

Logically thinking, isn’t this training on labeled data? It is said that auxiliary information is not labeled data but rather a form of supervision that helps the model learn during the training stage. I still believe it indirectly makes the model learn the patterns of unseen-class data.

When a zero-shot learning model sees a sufficient number of image-text pairs, it becomes able to differentiate and understand phrases and how they correlate with certain patterns in the images.

Using contrastive learning, the technique behind CLIP, the zero-shot learning model accumulates a knowledge base good enough to make predictions on classification tasks.

In CLIP, an image encoder and a text encoder are trained together to predict the correct pairings of a batch of (image, text) training examples.

Figure: Adapted from the paper Learning Transferable Visual Models From Natural Language Supervision and modified based on our example.
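To make the idea of "predicting the correct pairings" a little more concrete, here is a rough sketch of a symmetric contrastive loss in the spirit of CLIP. It is a simplified illustration, not OpenAI's implementation: the toy 2-D vectors stand in for real encoder outputs, the function names are my own, and the fixed temperature of 0.07 is an assumption for this sketch (in CLIP the temperature is a learned parameter).

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    loss_images = cross_entropy_diagonal(logits)  # each image picks its text
    transposed = [list(col) for col in zip(*logits)]
    loss_texts = cross_entropy_diagonal(transposed)  # each text picks its image
    return (loss_images + loss_texts) / 2


image_embs = [[1.0, 0.0], [0.0, 1.0]]      # stand-ins for image encoder outputs
matched_texts = [[0.9, 0.1], [0.1, 0.9]]   # caption k describes image k
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # captions swapped

print(clip_style_loss(image_embs, matched_texts))
print(clip_style_loss(image_embs, shuffled_texts))
```

The loss is close to zero when every image is most similar to its own caption and grows large when the captions are swapped. Training pushes both encoders toward the first situation.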

Inference

After the training stage, the model has a good knowledge base of image-text pairings and can be used to make predictions.

A classification task needs to be set up by creating a list of all possible labels (hockey stick, golf stick, tennis racket, cricket bat, etc.) that the model could output. Each of these labels is encoded, as shown in the image above (from T1 to Tn), using the pre-trained text encoder from the training stage.

After all the labels have been encoded, the image is passed through the pre-trained image encoder. Cosine similarity is then computed between the image encoding and each text label encoding.

In image classification, zero-shot learning is achieved by classifying the image based on the label with the greatest similarity to the image.
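The inference steps can be sketched as follows. The encodings here are hypothetical numbers standing in for what the pre-trained text and image encoders would return, and `zero_shot_classify` is a name I made up for illustration. The only real logic is the cosine similarity and picking the label with the greatest score.

```python
import math

# Hypothetical outputs of the text encoder (T1..Tn), one per candidate label.
label_encodings = {
    "hockey stick":  [0.9, 0.1, 0.1],
    "golf stick":    [0.7, 0.0, 0.6],
    "tennis racket": [0.1, 0.1, 0.9],
    "cricket bat":   [0.2, 0.9, 0.1],
}

# Hypothetical output of the image encoder for the input image.
image_encoding = [0.3, 0.8, 0.2]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_encoding, label_encodings):
    """Score every label against the image and return the best match."""
    scores = {
        label: cosine_similarity(image_encoding, encoding)
        for label, encoding in label_encodings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


best_label, scores = zero_shot_classify(image_encoding, label_encodings)
print(best_label)
for label, score in scores.items():
    print(f"{label}: {score:.3f}")
```

Note that adding a new label only requires encoding one more piece of text. No retraining is involved, which is what makes the classification zero-shot.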

How is Zero-shot learning different from traditional machine learning?

As said above, zero-shot learning uses semantic embeddings to establish relationships between objects and their features, attributes, and context. Neural networks are used to generate these mappings, which are sometimes described as a knowledge graph. In contrast, traditional machine learning needs labeled training examples of a class before it can recognize that class. The zero-shot learning approach enables the model to generalize by learning from the correlations between different classes, attributes, and features, since it is built on the idea of predicting unseen data with a model that was trained on limited labeled data.

Traditional machine learning requires large labeled datasets to enable the model to learn and make predictions. These traditional ML models memorize the data and identify objects in the test dataset that they have seen during the training phase.

Zero-shot learning, in contrast, depends on conditional probabilities, that is, on the model’s ability to establish relationships between the seen-classes and the unseen-classes. The model can recognize and classify unseen-classes based on their attributes and features rather than by memorizing them.
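One way to make "conditional probabilities" concrete, and the way CLIP-style models do it, is to pass the similarity scores through a softmax so that they read as the probability of each label given the image. The sketch below assumes hypothetical cosine similarity scores and a temperature of 0.1 chosen only for illustration.

```python
import math


def label_probabilities(similarities, temperature=0.1):
    """Turn similarity scores into p(label | image) with a softmax."""
    scaled = {label: s / temperature for label, s in similarities.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - m) for label, v in scaled.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical cosine similarities between one image and each label.
similarities = {
    "cricket bat": 0.98,
    "hockey stick": 0.46,
    "golf stick": 0.41,
    "tennis racket": 0.36,
}

probs = label_probabilities(similarities)
for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

A lower temperature sharpens the distribution toward the best-matching label, while a higher one spreads the probability more evenly across labels.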

Challenges with Zero-shot learning

It relies heavily on the model’s initial training data quality.
It relies on learning a joint embedding space into which semantic attribute representations and semantic word vector representations can be projected. For instance, if an image caption needs to be learned to represent an unseen-class, then an image captioning model needs to be trained on the unseen-class and compared with the knowledge already learned by the semantic model. On the other hand, if the unseen-class is not present in the training data, the image captioning model needs to learn a new semantic attribute representation for that unseen-class.
Assume there is a pre-trained model (a large language model) that serves as the knowledge base, since it has been trained on a huge amount of text from many websites. Any type of task can be given to this LLM, leaving the model to infer on the given task. Zero-shot learning might not work if the task’s feature space representation is completely different from that of the trained model. So, if there are at least a few labeled samples, but not enough for fine-tuning, then few-shot learning can be the solution.

References

CLIP: Connecting text and images (openai.com)

research paper