Scaling Image Recognition
ViT-22B¹ is the new state-of-the-art model for image recognition on the ObjectNet benchmark.
Introduction
Google Research has published the ViT-22B¹ model, which offers state-of-the-art zero-shot image recognition capabilities.
On the ObjectNet dataset, the model outperforms CoCa, CLIP, and OpenCLIP in zero-shot accuracy (%):
- 69.7 OpenCLIP⁷
- 72.4 CLIP³
- 82.7 CoCa⁴
- 87.6 ViT-22B¹
This score is promising, considering the importance of object recognition for image, video, and 3D model generation and for reinforcement learning.
Let’s have a look at a few example images from ObjectNet:
These chairs appear in real-world-like environments, with random backgrounds, rotations, and viewpoints. Models that perform well on this benchmark are likely to generalize well in the real world.
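For readers wondering how a zero-shot accuracy number like the ones above is computed, here is a minimal sketch of CLIP-style zero-shot classification. It assumes the image and the class names have already been embedded into a shared vector space. The 3-dimensional vectors, the class names, and the helper functions `zero_shot_predict` and `zero_shot_accuracy` are all hypothetical and only for illustration. They are not the ViT-22B evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, class_embs):
    """Return the class whose text embedding is most similar to the image embedding."""
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))


def zero_shot_accuracy(samples, class_embs):
    """Percentage of (image embedding, true label) pairs classified correctly."""
    correct = sum(
        zero_shot_predict(emb, class_embs) == label for emb, label in samples
    )
    return 100.0 * correct / len(samples)


# Hypothetical toy embeddings; a real model would produce these with its
# text encoder (class names) and image encoder (images).
CLASS_EMBS = {
    "chair": [1.0, 0.1, 0.0],
    "mug": [0.0, 1.0, 0.1],
    "backpack": [0.1, 0.0, 1.0],
}

SAMPLES = [
    ([0.9, 0.2, 0.1], "chair"),
    ([0.1, 0.8, 0.3], "mug"),
    ([0.2, 0.1, 0.7], "backpack"),
    ([0.6, 0.7, 0.0], "chair"),  # unusual viewpoint: lands closer to "mug"
]

print(zero_shot_accuracy(SAMPLES, CLASS_EMBS))  # 75.0
```

No classifier is trained on the target classes here: the model only compares embeddings, which is why the quality of the image encoder matters so much for the final score.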
ViT-22B image embeddings are state-of-the-art. The model is therefore likely to trigger a new wave of state-of-the-art image, video, and 3D models that use it as their image encoder.