Claude 3 Haiku — Vision-Language Model
Claude 3 Haiku offers impressive Vision-Language Model (VLM) performance at a lower price.
Introduction
Anthropic today launched Claude 3 Haiku, its fastest and cheapest Vision-Language Model (VLM). Claude 3 handles:
- Both text and image inputs.
- Text as output.
Haiku input tokens are currently priced at $0.25 per million and output tokens at $1.25 per million. This may not seem significant at first, but Haiku is the cheapest high-performance VLM on the market!
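The listed prices make a quick back-of-the-envelope cost estimate easy. The sketch below is my own illustration (the per-million-token prices come from above; the function name is hypothetical):

```python
# Sketch: estimate request cost from Haiku's listed prices.
INPUT_PRICE_PER_MILLION = 0.25   # USD per 1M input tokens
OUTPUT_PRICE_PER_MILLION = 1.25  # USD per 1M output tokens

def haiku_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost of a single request in US dollars."""
    return (input_tokens * INPUT_PRICE_PER_MILLION
            + output_tokens * OUTPUT_PRICE_PER_MILLION) / 1_000_000

# A ~1600-token image plus a 1024-token response costs well under a cent:
print(round(haiku_cost(1600, 1024), 6))  # → 0.00168
```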
Let’s get started.
Haiku
I will start by importing the standard libraries recommended by Anthropic.
!pip install anthropic  # package installation; only needs to run once
import anthropic
import os
import base64
import httpx
First, I define the image URL that I will use in this tutorial.
image_url = "https://upload.wikimedia.org/wikipedia/commons/1/12/Haiku_de_L._M._Panero.jpg"
image_media_type = "image/jpeg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
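If the image lives on disk rather than at a URL, the same base64 encoding applies. A minimal sketch (the helper name is my own):

```python
import base64
from pathlib import Path

def encode_local_image(path: str) -> str:
    """Read a local image file and return its contents base64-encoded as text."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")
```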
I can now use the API to perceive this image.
- In this example, "anthropic_key" must be defined on your local computer as an environment variable (here, on Windows).
- The example uses a simple text prompt: "Describe this image."
client = anthropic.Anthropic(
    api_key=os.getenv("anthropic_key"),  # read the API key from an environment variable
)
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": image_media_type,
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe this image.",
                },
            ],
        }
    ],
)
I can then see the result:
print(response.content[0].text)
You should receive a response similar to the one below.
We can now compare this result with the original image:
This visual perception task was very simple with Claude 3, but it involves multiple layers of perception:
- Reading the image.
- Understanding, without any context, that the image is a haiku.
- Interpreting the meaning of a haiku written in a different language.
Is it not incredible that we can get something like this done at low latency?
Image specification
I recommend the following guidelines when uploading images:
- Keep images clear: not pixelated, blurry, or low quality.
- Place images before the text.
- Make sure any text in the image is large enough to be legible.
Accepted image input formats are JPEG, PNG, GIF, and WebP.
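To pick the right media_type for a local file, Python's standard mimetypes module can help. A small sketch assuming only the four accepted formats are sent (the function name is hypothetical):

```python
import mimetypes

# The four formats accepted by the API.
ACCEPTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def media_type_for(path: str) -> str:
    """Guess an image file's media type and verify it is an accepted format."""
    guessed, _ = mimetypes.guess_type(path)
    if guessed not in ACCEPTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image format for {path}: {guessed}")
    return guessed

print(media_type_for("haiku.jpg"))  # → image/jpeg
```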
At a 1:1 aspect ratio, the maximum image size is 1092px × 1092px, which equals roughly 1600 tokens.
For example, if your image is 1568px wide, its height is limited to 784px (a 2:1 aspect ratio).
Larger images will be scaled down by the API, and this extra step increases latency.
Claude 3 accepts up to 20 images per API request.
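These limits are easy to check programmatically before uploading. Below is a sketch of the arithmetic, assuming a 1568px long-edge limit and the approximation of about (width × height) / 750 tokens per image, which matches the roughly 1600 tokens for a 1092px × 1092px image; the function names are my own:

```python
MAX_LONG_EDGE = 1568  # assumed long-edge limit before the API scales images down

def clamp_dimensions(width: int, height: int) -> tuple[int, int]:
    """Scale dimensions down, preserving aspect ratio, so the longest edge fits."""
    long_edge = max(width, height)
    if long_edge <= MAX_LONG_EDGE:
        return width, height
    scale = MAX_LONG_EDGE / long_edge
    return round(width * scale), round(height * scale)

def estimate_tokens(width: int, height: int) -> int:
    """Rough token count for an image: (width * height) / 750."""
    return round(width * height / 750)

print(clamp_dimensions(3136, 1568))  # → (1568, 784)
print(estimate_tokens(1092, 1092))   # → 1590
```

Resizing locally with these target dimensions avoids the API's own downscaling step and the extra latency it adds.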
Conclusions
Haiku’s pricing makes it particularly attractive as a VLM. Its outputs are 24x cheaper than GPT-4V’s. In my opinion, this pricing is the single biggest enabler of LLM-based agents.
No other currently available VLM matches Claude 3 Haiku’s combination of latency, pricing, and quality of results.