Claude 3 Haiku — Vision-Language Model
Claude 3 Haiku offers impressive Vision-Language Model (VLM) performance at a lower price.
Introduction
Anthropic today launched Claude 3 Haiku, its fastest and cheapest Vision-Language Model (VLM). Claude 3 handles:
- Both text and image inputs.
- Text as output.
Haiku input tokens are currently priced at $0.25 per million and output tokens at $1.25 per million. This may not seem significant at first, but Haiku is the cheapest high-performance VLM on the market!
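The listed prices make a quick back-of-the-envelope cost estimate easy. The sketch below is my own illustration (the per-million-token prices come from above; the function name is hypothetical):

```python
# Sketch: estimate request cost from Haiku's listed prices.
INPUT_PRICE_PER_MILLION = 0.25   # USD per 1M input tokens
OUTPUT_PRICE_PER_MILLION = 1.25  # USD per 1M output tokens

def haiku_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost of a single request in US dollars."""
    return (input_tokens * INPUT_PRICE_PER_MILLION
            + output_tokens * OUTPUT_PRICE_PER_MILLION) / 1_000_000

# A ~1600-token image plus a 1024-token response costs well under a cent:
print(round(haiku_cost(1600, 1024), 6))  # → 0.00168
```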
Let’s get started.
Haiku
I will start by importing the standard libraries recommended by Anthropic.
!pip install anthropic  # package installation; only needs to run once
import anthropic
import os
import base64
import httpx
First, I define the image URL that I will use in this tutorial.
image_url = "https://upload.wikimedia.org/wikipedia/commons/1/12/Haiku_de_L._M._Panero.jpg"
image_media_type = "image/jpeg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
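If the image lives on disk rather than at a URL, the same base64 encoding applies. A minimal sketch (the helper name is my own):

```python
import base64
from pathlib import Path

def encode_local_image(path: str) -> str:
    """Read a local image file and return its contents base64-encoded as text."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")
```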
I can now use the API to perceive this image.
- In this example, "anthropic_key" must be defined on your local computer as an environment variable (here, on Windows).
- The example uses a simple text prompt: "Describe this image."
client = anthropic.Anthropic(
    api_key=os.getenv("anthropic_key"),  # read the API key from an environment variable
)
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": image_media_type,
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe this image.",
                },
            ],
        }
    ],
)
I can then see the result:
print(response.content[0].text)
You should receive a response similar to the one below.
We can now compare this result with the original image:
This visual perception task was very simple with Claude 3, but it involves multiple layers of perception:
- Reading the image.
- Understanding, without any context, that the image is a haiku.
- Interpreting the meaning of a haiku written in a different language.
Is it not incredible that we can get something like this done at low latency?
Image specification
I recommend the following guidelines when uploading images:
- Keep images clear: not pixelated, blurry, or low quality.
- Place images before the text.
- Make sure any text in the image is large enough to be legible.
Accepted image input formats are JPEG, PNG, GIF, and WebP.
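To pick the right media_type for a local file, Python's standard mimetypes module can help. A small sketch assuming only the four accepted formats are sent (the function name is hypothetical):

```python
import mimetypes

# The four formats accepted by the API.
ACCEPTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def media_type_for(path: str) -> str:
    """Guess an image file's media type and verify it is an accepted format."""
    guessed, _ = mimetypes.guess_type(path)
    if guessed not in ACCEPTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image format for {path}: {guessed}")
    return guessed

print(media_type_for("haiku.jpg"))  # → image/jpeg
```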
At a 1:1 aspect ratio, the maximum image size is 1092px × 1092px, which equals roughly 1600 tokens.
For example, if your image is 1568px wide, its height is limited to 784px (a 2:1 aspect ratio).
Larger images will be scaled down by the API, and this extra step increases latency.
Claude 3 accepts up to 20 images per API request.
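These limits are easy to check programmatically before uploading. Below is a sketch of the arithmetic, assuming a 1568px long-edge limit and the approximation of about (width × height) / 750 tokens per image, which matches the roughly 1600 tokens for a 1092px × 1092px image; the function names are my own:

```python
MAX_LONG_EDGE = 1568  # assumed long-edge limit before the API scales images down

def clamp_dimensions(width: int, height: int) -> tuple[int, int]:
    """Scale dimensions down, preserving aspect ratio, so the longest edge fits."""
    long_edge = max(width, height)
    if long_edge <= MAX_LONG_EDGE:
        return width, height
    scale = MAX_LONG_EDGE / long_edge
    return round(width * scale), round(height * scale)

def estimate_tokens(width: int, height: int) -> int:
    """Rough token count for an image: (width * height) / 750."""
    return round(width * height / 750)

print(clamp_dimensions(3136, 1568))  # → (1568, 784)
print(estimate_tokens(1092, 1092))   # → 1590
```

Resizing locally with these target dimensions avoids the API's own downscaling step and the extra latency it adds.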
Conclusions
Haiku’s pricing makes it particularly attractive as a VLM. Its outputs are 24x cheaper than GPT-4V’s. In my opinion, this pricing is the single biggest enabler of LLM-based agents.
No other currently available VLM matches Claude 3 Haiku’s combination of latency, pricing, and quality of results.