Claude 3 Haiku — Vision-Language Model

Claude 3 Haiku offers impressive Vision-Language Model (VLM) performance at a lower price.

Teemu Maatta
3 min read · Mar 14, 2024

Introduction

Anthropic today launched Claude 3 Haiku, its fastest and cheapest Vision-Language Model (VLM). Claude 3 handles:

  • Both text and image inputs.
  • Text as output.

Haiku input tokens are currently priced at $0.25 per million and output tokens at $1.25 per million. These numbers may seem insignificant, but they make Haiku the cheapest high-performance VLM on the market!
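To put the pricing in concrete terms, here is a quick back-of-the-envelope calculation. It is only a sketch: the token counts are hypothetical example values, not measurements.

INPUT_PRICE_PER_MTOK = 0.25   # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 1.25  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single Haiku request."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# A ~1600-token image plus a short prompt, answered with ~300 tokens:
print(f"${estimate_cost(1700, 300):.4f}")  # roughly $0.0008 per request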

Let’s get started.

Haiku

I will start by importing the standard libraries recommended by Anthropic.

!pip install anthropic  # package installation, only needs to run once
import anthropic
import os
import base64
import httpx

First, I define an image URL, which I will use throughout this tutorial.

image_url = "https://upload.wikimedia.org/wikipedia/commons/1/12/Haiku_de_L._M._Panero.jpg"
image_media_type = "image/jpeg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
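The hard-coded media type works here, but if you want to reuse this for arbitrary URLs, a small helper can read the media type from the HTTP response instead. This is a minimal sketch; the fetch_image name is my own, not from the Anthropic documentation.

def fetch_image(url: str) -> tuple[str, str]:
    """Download an image and return its media type and base64-encoded data."""
    response = httpx.get(url)
    response.raise_for_status()  # fail early on a broken link
    media_type = response.headers["content-type"]  # e.g. "image/jpeg"
    data = base64.b64encode(response.content).decode("utf-8")
    return media_type, data

image_media_type, image_data = fetch_image(image_url)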

I can now call the API to have Claude perceive this image.

  • In this example, the “anthropic_key” must be defined as an environment variable on your local machine (in Windows, in my case).
  • The example uses the simple text prompt “Describe this image.”
client = anthropic.Anthropic(
    api_key=os.getenv("anthropic_key"),
)
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": image_media_type,
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe this image."
                },
            ],
        }
    ],
)

I can then see the result:

print(response.content[0].text)

You should receive a response similar to the one below.

Claude 3 Haiku response.
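Beyond the text itself, the response object also reports token usage, which makes it easy to verify the cost figures from the introduction. This relies on the usage field of the Anthropic Python SDK's message object:

print(response.usage.input_tokens)   # tokens consumed by the image and the prompt
print(response.usage.output_tokens)  # tokens generated in the description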

We can now compare this result with the original image. This visual perception task was very simple for Claude 3, but it involves multiple layers of perception:

  • Reading the image.
  • Understanding, without any context, that the image contains a haiku.
  • Interpreting the meaning of a haiku written in a different language.

Is it not incredible that we can get something like this done at such low latency?

Image specification

I recommend following these guidelines when uploading images:

  • Images are clear: not pixelated, blurred, or low quality.
  • Images are placed before the text.
  • Any text in the image is large enough to be legible.

Accepted image input formats are JPEG, PNG, GIF, and WebP.

The maximum image size at a 1:1 aspect ratio is 1092 x 1092 px. An image of this size equals roughly 1,600 tokens.

For example, if your image is 1568 px on its longer side, the shorter side is limited to 784 px at a 1:2 aspect ratio.

Larger images are scaled down by the API. This extra step increases latency, and you can avoid it by resizing images locally before upload, as sketched below.
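Here is a minimal sketch of such local resizing using Pillow (an additional dependency, not part of the Anthropic SDK); the 1568 px long-edge limit matches the figures above.

from io import BytesIO
from PIL import Image

MAX_EDGE = 1568  # longest edge the API accepts without scaling the image down

def shrink_to_fit(raw_bytes: bytes) -> bytes:
    """Downscale an image so that its longest edge is at most MAX_EDGE pixels."""
    image = Image.open(BytesIO(raw_bytes)).convert("RGB")
    image.thumbnail((MAX_EDGE, MAX_EDGE))  # preserves aspect ratio, only shrinks
    buffer = BytesIO()
    image.save(buffer, format="JPEG")
    return buffer.getvalue()

image_data = base64.b64encode(shrink_to_fit(httpx.get(image_url).content)).decode("utf-8")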

Claude 3 accepts up to 20 images per API request.
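Sending several images is just a matter of adding more image blocks to the content list. A minimal sketch, assuming a second image was fetched and base64-encoded into second_image_data the same way as above:

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                # Images first, as recommended above; then the text prompt.
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": second_image_data}},
                {"type": "text", "text": "Compare these two images."},
            ],
        }
    ],
)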

Conclusions

Haiku’s pricing makes it particularly attractive as a VLM. Its outputs are 24x cheaper than GPT-4V’s. In my opinion, this pricing is the single biggest enabler of LLM-based agents.

No other currently available VLM matches Claude 3 Haiku’s combination of latency, pricing, and quality.

