
Scaling Image Recognition

ViT-22B¹ is the new state-of-the-art model for image recognition on the ObjectNet benchmark.

3 min read · Feb 15, 2023

Introduction

Google Research has published the ViT-22B¹ model. It offers state-of-the-art zero-shot image recognition capabilities.

The model outperforms the CoCa, CLIP, and OpenCLIP models on the ObjectNet dataset with its impressive zero-shot accuracy:

69.7 OpenCLIP⁷
72.4 CLIP³
82.7 CoCa⁴
87.6 ViT-22B¹

This score is promising, considering the importance of object recognition for image, video, and 3D model generation and for reinforcement learning.

Let’s have a look at a few example images from ObjectNet:

These chairs appear in realistic environments, with random backgrounds, rotations, and viewpoints. Models that perform well on this benchmark are likely to generalize well to the real world.

ViT-22B image embeddings are state-of-the-art. The model will therefore likely trigger a new wave of state-of-the-art image, video, and 3D models that use it as their image encoder.
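To make the idea of zero-shot recognition with image embeddings more concrete, here is a minimal sketch of the general CLIP-style recipe: compare an image embedding with one embedding per class name and pick the most similar class. The function names and the tiny 3-dimensional vectors are made up for illustration. They are not outputs of ViT-22B, and this is not the evaluation code used by the authors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, class_embeddings):
    """Return the class name whose embedding is closest to the image embedding."""
    return max(
        class_embeddings,
        key=lambda name: cosine_similarity(image_embedding, class_embeddings[name]),
    )


# Hypothetical toy embeddings (assumption: real embeddings have far more
# dimensions and come from an image encoder and a matching text encoder).
class_embeddings = {
    "chair": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.8, 0.6],
    "bicycle": [0.1, 0.0, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

predicted = zero_shot_classify(image_embedding, class_embeddings)
print(predicted)  # prints "chair"
```

No class-specific training happens here: adding a new class only requires a new entry in class_embeddings, which is what makes the setup "zero-shot".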


Written by Teemu Maatta
