
Scaling Image Recognition

ViT-22B¹ is the new state-of-the-art model for image recognition on the ObjectNet benchmark.

Teemu Maatta
3 min read · Feb 15, 2023
Photo by Pawel Czerwinski on Unsplash

Introduction

Google Research has published the ViT-22B¹ model. It offers state-of-the-art zero-shot image recognition capabilities.

The model outperforms CoCa, CLIP, and OpenCLIP on the ObjectNet dataset in zero-shot accuracy (%):

  • 69.7 OpenCLIP⁷
  • 72.4 CLIP³
  • 82.7 CoCa⁴
  • 87.6 ViT-22B¹
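To make the size of the gap concrete, here is a small sketch that computes how many percentage points ViT-22B leads each of the other models by. It uses only the four scores quoted above; the function name `margins_over` is just an illustrative helper, not part of any library.

```python
# Zero-shot ObjectNet accuracy (%) as quoted in the list above.
scores = {"OpenCLIP": 69.7, "CLIP": 72.4, "CoCa": 82.7, "ViT-22B": 87.6}


def margins_over(baselines, leader="ViT-22B"):
    """Return the leader's lead over every other model, in percentage points."""
    top = baselines[leader]
    return {
        name: round(top - acc, 1)  # round to one decimal to hide float noise
        for name, acc in baselines.items()
        if name != leader
    }


margins = margins_over(scores)
for name, gap in margins.items():
    print(f"ViT-22B leads {name} by {gap} points")
```

The lead over CoCa, the closest competitor in this list, is 4.9 points; the lead over OpenCLIP is 17.9 points.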

This score is promising, considering the importance of object recognition for image, video, and 3D model generation and for reinforcement learning.

Let’s look at a few example images from ObjectNet:

These chairs appear in realistic environments, with random backgrounds, rotations, and viewpoints. Models that perform well on this benchmark are likely to generalize well in the real world.

ViT-22B image embeddings are state-of-the-art. The model will therefore likely trigger a new wave of state-of-the-art image, video, and 3D models that use it as their image encoder.
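To illustrate why embedding quality matters for zero-shot recognition, here is a minimal sketch of the general CLIP-style recipe: embed the image, embed a text prompt for each candidate label, and pick the label whose embedding is closest by cosine similarity. This is an assumption about the general technique, not a description of how ViT-22B was evaluated. The 3-dimensional vectors are made-up toy values standing in for real encoder outputs, and `zero_shot_label` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_label(image_embedding, label_embeddings):
    """Pick the label whose text embedding is most similar to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )


# Toy, hand-written vectors (assumption): a real system would obtain these
# from an image encoder and a text encoder trained into a shared space.
label_embeddings = {
    "chair": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.2, 0.9],
    "mug": [0.1, 0.9, 0.1],
}
image_embedding = [0.8, 0.3, 0.1]

predicted = zero_shot_label(image_embedding, label_embeddings)
print(predicted)
```

No classifier is trained on the target labels here, which is what makes the setup zero-shot: the better the encoder separates objects under odd backgrounds and viewpoints, the more often the nearest label is the right one.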

ViT-22B

