Real-world applications

Summary:

  1. Discriminating among a fixed, unchangeable set of possible classes really sucks. Generalizing to new classes requires retraining, which is costly (in compute, in labeling).

  2. So why fix the classes? Let the model choose from a set of classes that’s different each batch, so it learns to generalize.

  3. Strong zero-shot results, often competitive with fully supervised baselines, across dozens of tasks that deal with generalization.

High level:

Like above, it really sucks to have to discriminate between some fixed set of \(N\) classes. Think of ImageNet-trained CNNs. You want to predict on some other (even closely related) dataset? You have to retrain the entire last layer, if not the entire model. This is a hassle, and requires quite a lot of data, even for simple tasks.

So instead, let the model have a (modular) set of labels to choose from, fed in as input. The model then must select which of these input labels each image it's fed belongs to.

This means, at inference, you can simply feed it whatever classes you want! Want to classify cat vs. dog? Feed it an image and two prompts: (‘This is an image of a cat’, ‘This is an image of a dog’).
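As a concrete illustration, here's roughly what that looks like with OpenAI's released `clip` package (just a sketch: the image path and prompts are placeholders, and `"ViT-B/32"` is one of several released model choices):

```python
import torch
import clip
from PIL import Image

# Zero-shot cat-vs-dog "classifier": the classes are just the two prompts.
model, preprocess = clip.load("ViT-B/32", device="cpu")

image = preprocess(Image.open("pet.jpg")).unsqueeze(0)   # placeholder image path
text = clip.tokenize(["This is an image of a cat",
                      "This is an image of a dog"])

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # scaled cosine similarities, shape [1, 2]
    probs = logits_per_image.softmax(dim=-1)   # probabilities over the two prompts

print(probs)  # whichever prompt matches the image better gets the higher probability
```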

How do we get this data? Just scrape it online, there’s plenty of captioned images out there.

Figure 1: The contrastive pretraining, and the zero-shot uses.

Low(er) level:

The paper actually isn’t much more than what was presented above.

  1. Take a batch of paired input texts and images.

  2. Encode each to a fixed-size representation, then apply a (separate, learned) linear projection to each.

  3. For each image, normalize its representation, then take the dot product against every (also normalized) text representation.

  4. Given this \(N \times N\) matrix of cosine similarities, scale it by a learned temperature, then apply cross-entropy!

Note that step (4) actually optimizes \(N^2\) terms: it maximizes the ‘diagonal’ and minimizes everything else.
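The whole training objective fits in a few lines. Here is a minimal PyTorch sketch in the spirit of the pseudocode from the paper; `image_encoder`, `text_encoder`, the projection matrices `W_i`/`W_t`, and `logit_scale` (the learned temperature, stored in log space) are placeholder names, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_encoder, text_encoder, W_i, W_t, logit_scale, images, texts):
    # 1-2. encode each modality and project into the shared embedding space
    image_features = image_encoder(images) @ W_i    # [N, d]
    text_features = text_encoder(texts) @ W_t       # [N, d]

    # 3. normalize, then take all pairwise dot products (cosine similarities)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale.exp() * image_features @ text_features.T   # [N, N]

    # 4. symmetric cross-entropy: the matched (diagonal) pairs are the targets
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)     # text -> image direction
    return (loss_i + loss_t) / 2
```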

There are still a couple of details that need to be dug into, namely: what are these encoders? Thankfully, they are simple as well, perhaps adding to the idea that simple ideas, given years of GPU time to train, can work.

Image encoder

The paper experimented with both ResNets (variants of RN50) and the Vision Transformer. Both of them output fixed-size representations by nature, so we get one of the above requirements completely for free.

Text encoder

For the text encoder itself, once again, there's nothing special. It's a simple transformer-based model whose fixed-size output is the activation at the EOS token position in the last layer. Interestingly, masked self-attention (as in LMs) was used in the text encoder, technically allowing it to be initialized from GPT-2 or a generic pretrained LM. However, this option was not used.
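To make the "fixed-size output is the EOS activation" point concrete, here's a rough sketch of that pooling step. The helper names are hypothetical; it assumes (as in CLIP's tokenizer) that the EOS token has the highest token id, so `argmax` over the token ids finds its position.

```python
import torch

def pool_text_features(hidden, token_ids, text_projection):
    # hidden: [batch, seq_len, width] last-layer activations of the causal text transformer
    # token_ids: [batch, seq_len] tokenized prompts; EOS assumed to have the highest token id
    eos_positions = token_ids.argmax(dim=-1)                        # EOS index per sequence
    pooled = hidden[torch.arange(hidden.shape[0]), eos_positions]   # [batch, width]
    return pooled @ text_projection                                 # project into the shared embedding space
```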

The Generative Part: CLIP + BigGAN

Here’s the part that makes it a paper for the generative-RG. We can use this to generate!

How is not immediately obvious, but consider the following setup: we want to generate an image for a given prompt, and we have some model (a GAN) capable of generating images from random noise.

We then randomly initialize some starting noise, \(z\). We pass \(z\) through the GAN to produce an output image. We then feed this image into the CLIP image encoder, and compute the cosine similarity between the CLIP-extracted image features and the CLIP encoding of the prompt we gave it.

We can then backpropagate through the image encoder and through the GAN, all the way back to \(z\), and optimize \(z\) to increase that cosine similarity!
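A minimal sketch of that loop, assuming a pretrained differentiable generator (e.g. BigGAN) and a loaded CLIP model. Here `gan`, `clip_model`, `latent_dim`, and the learning-rate/step-count choices are all placeholders, and the resize/normalization the generated image would need before entering CLIP is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def generate_from_prompt(gan, clip_model, text_features, steps=500, lr=0.05, latent_dim=128):
    # text_features: pre-computed, normalized CLIP embedding of the prompt, shape [1, d]
    z = torch.randn(1, latent_dim, requires_grad=True)      # random starting noise
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        image = gan(z)                                      # generate an image from z
        image_features = F.normalize(clip_model.encode_image(image), dim=-1)
        loss = -(image_features * text_features).sum()      # negative cosine similarity
        optimizer.zero_grad()
        loss.backward()                                     # gradients flow through CLIP and the GAN, back to z
        optimizer.step()

    return gan(z).detach()                                  # final image for the prompt
```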

Prompts for the example generations: “cosmic love and affection”, “fire in the sky”, “lonely house in the woods”, “marriage in the mountains”, “demon fire”, “the death of the lonesome astronomer”.