CLIP-Related Works 2

Brief notes on several works that build on CLIP:

  • CLIPasso
    The goal: given an image, generate a sketch made of a few simple curves that represents it.
    Starting from some initial control points, curves are rendered into a sketch image, and the function from points to sketch is differentiable, so the points can be optimized by gradient descent. To make the sketch represent the given image, two losses are designed. One is a semantic loss based on CLIP: the CLIP image embeddings of the sketch and the original image should be similar. The other is a geometric loss: the shallow-layer features of a VGG/ResNet on the two images should be similar. (A minimal loss sketch is given after this list.)
  • CLIP4Clip
    The goal is to use CLIP for video retrieval: given a text query, retrieve the associated videos. The idea is very simple; there are three ways to turn CLIP image-encoder outputs into a video embedding. First, compute the image embedding of each frame and mean-pool them. Second, fuse the frame embeddings with an LSTM or a 1D convolution. Third, prepend a [CLS] token to the frame embeddings, run them through a transformer, and take the output [CLS] embedding. (See the aggregation sketch after this list.)

  • ActionCLIP
    The datasets are annotated with video-label pairs. The method fine-tunes CLIP by introducing text prompts (to turn the bare label into a complete sentence) and image prompts (i.e., trainable image adapters). Similarly to CLIP4Clip, the CLIP image encoder is used to obtain video embeddings, and similarities are computed between the video embeddings and the text embeddings. The difference from CLIP is that the ground truth is not simply a diagonal (identity) matrix: it also has non-zero entries elsewhere, which happens when multiple videos in a batch share the same action. (A soft-target loss sketch is given after this list.)
    Therefore, they can do zero-shot action recognition.

  • AudioCLIP
    It directly reuses the CLIP idea: given a video-text-audio dataset, the three pairwise combinations (image-text, image-audio, text-audio) are used for contrastive learning. (See the three-way loss sketch after this list.)

  • PointCLIP
    They may fine-tune CLIP or simply use the frozen CLIP encoders to extract text and image embeddings. Each point-cloud label is converted into a sentence by prompting, and the point cloud is projected into images from several viewing angles, which yields image-text pairs. (A zero-shot inference sketch follows the list.)
    Therefore, they can do zero-shot point-cloud recognition.

  • DepthCLIP
    This can be quite simple: the original image is processed by the CLIP image encoder, while the ground-truth depth map is first converted into sentences such as "this object is far / unseen / close" by setting depth thresholds that assign each object to a coarse distance bin. (A thresholding sketch is given after the list.)
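
CLIPasso loss sketch. A minimal, illustrative PyTorch version of the two losses above, assuming the sketch image comes from a differentiable rasterizer (not shown) and using OpenAI's clip package plus a shallow torchvision VGG for the geometric term; the layer choice and weighting are assumptions, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F
import clip                                   # pip install git+https://github.com/openai/CLIP.git
from torchvision.models import vgg16

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
shallow_cnn = vgg16(weights="IMAGENET1K_V1").features[:9].to(device).eval()  # early VGG layers

def clipasso_style_loss(sketch, photo, geo_weight=1.0):
    """sketch, photo: (B, 3, 224, 224) tensors already preprocessed for CLIP.
    `sketch` is assumed to come from a differentiable rasterizer, so this loss
    back-propagates all the way into the stroke control points."""
    # Semantic loss: cosine distance between the CLIP image embeddings.
    e_s = F.normalize(clip_model.encode_image(sketch), dim=-1)
    e_p = F.normalize(clip_model.encode_image(photo), dim=-1)
    semantic = 1.0 - (e_s * e_p).sum(dim=-1).mean()

    # Geometric loss: match shallow-layer CNN features of the two images.
    geometric = F.mse_loss(shallow_cnn(sketch), shallow_cnn(photo))

    return semantic + geo_weight * geometric
```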
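
CLIP4Clip aggregation sketch. The three ways of collapsing per-frame CLIP embeddings (B, T, D) into one video embedding (B, D), written as a small PyTorch module; class names, layer sizes, and the two-layer transformer are illustrative choices, not the official configuration.

```python
import torch
import torch.nn as nn

class FrameAggregator(nn.Module):
    """Aggregate per-frame CLIP embeddings into a single video embedding."""
    def __init__(self, dim=512, mode="mean"):
        super().__init__()
        self.mode = mode
        if mode == "lstm":
            self.lstm = nn.LSTM(dim, dim, batch_first=True)
        elif mode == "transformer":
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, frame_emb):                         # frame_emb: (B, T, D)
        if self.mode == "mean":                           # 1) parameter-free mean pooling
            return frame_emb.mean(dim=1)
        if self.mode == "lstm":                           # 2) sequential fusion
            out, _ = self.lstm(frame_emb)
            return out[:, -1]                             # last hidden state as video embedding
        cls = self.cls.expand(frame_emb.size(0), -1, -1)  # 3) [CLS] token + transformer
        out = self.encoder(torch.cat([cls, frame_emb], dim=1))
        return out[:, 0]
```

The resulting video embedding is then compared to the CLIP text embedding with the usual cosine similarity for retrieval.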
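
ActionCLIP soft-target sketch. One way to express the non-diagonal ground truth: build a (B, B) target matrix that is 1 wherever a video and a text share the same action label, row-normalize it into a distribution, and train with a KL-divergence loss; formulating it as KL rather than plain cross-entropy is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def soft_target_contrastive_loss(video_emb, text_emb, labels, temperature=0.07):
    """video_emb, text_emb: (B, D) embeddings; labels: (B,) integer action ids.
    The target matrix has extra non-zero entries wherever two samples in the
    batch share the same action, so it is not simply the identity."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                    # (B, B) similarity logits

    gt = (labels[:, None] == labels[None, :]).float()   # 1 where labels match
    gt = gt / gt.sum(dim=1, keepdim=True)               # row-normalize to a distribution

    # Symmetric KL divergence between predicted and ground-truth distributions.
    loss_v2t = F.kl_div(F.log_softmax(logits, dim=1), gt, reduction="batchmean")
    loss_t2v = F.kl_div(F.log_softmax(logits.t(), dim=1), gt, reduction="batchmean")
    return (loss_v2t + loss_t2v) / 2
```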
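
AudioCLIP three-way loss sketch. A minimal version of contrastive learning over the three modality pairs; equal weighting of the three terms is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric CLIP-style contrastive loss between two batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def audioclip_style_loss(img_emb, txt_emb, aud_emb):
    """Sum the pairwise contrastive losses over the (image, text, audio) triplet."""
    return (info_nce(img_emb, txt_emb)
            + info_nce(img_emb, aud_emb)
            + info_nce(txt_emb, aud_emb))
```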
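
PointCLIP zero-shot sketch. Assuming the point cloud has already been projected into images from V viewpoints (the projection step is omitted), zero-shot recognition averages the CLIP logits across views; the prompt template and the optional per-view weights are illustrative.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def zero_shot_point_cloud(view_images, class_names, view_weights=None):
    """view_images: (V, 3, 224, 224) projections of ONE point cloud from V viewpoints,
    already preprocessed for CLIP. Returns class probabilities averaged over views."""
    prompts = clip.tokenize([f"a depth map of a {c}." for c in class_names]).to(device)
    txt = F.normalize(model.encode_text(prompts), dim=-1)                   # (C, D)
    img = F.normalize(model.encode_image(view_images.to(device)), dim=-1)   # (V, D)
    logits = 100.0 * img @ txt.t()                                          # (V, C)
    if view_weights is not None:                                            # optional (V,) tensor
        logits = logits * view_weights.to(device)[:, None]
    return logits.mean(dim=0).softmax(dim=-1)                               # (C,)
```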
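
DepthCLIP-style thresholding sketch. The depth-to-sentence step described above; the distance words, bin edges, and per-object masks are made up for illustration.

```python
import torch

# Illustrative distance words and bin boundaries (metres).
DEPTH_WORDS = ["extremely close", "close", "a little remote", "far", "unseen"]
DEPTH_EDGES = torch.tensor([1.0, 3.0, 7.0, 15.0])  # len(DEPTH_WORDS) - 1 boundaries

def depth_to_sentences(depth_map, object_masks):
    """depth_map: (H, W) ground-truth depths; object_masks: dict name -> (H, W) bool mask.
    Each object's mean depth is thresholded into a bin, which yields a sentence."""
    sentences = {}
    for name, mask in object_masks.items():
        mean_depth = depth_map[mask].mean()
        bin_idx = int(torch.bucketize(mean_depth, DEPTH_EDGES))
        sentences[name] = f"The {name} is {DEPTH_WORDS[bin_idx]}."
    return sentences
```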