OPEN KNOWLEDGE: AI

How to use GPT-4 Vision API?

OpenAI today released the API for GPT-4 Turbo with Vision. In this tutorial, I will walk you through building an application with this SOTA model.

Teemu Maatta
3 min read · Nov 6, 2023

Introduction

OpenAI today released several important features for the API:

  • GPT-4 Vision
  • 128k context window
  • Text-To-Speech (TTS)
  • Code interpreter
  • Knowledge retrieval

In this tutorial, I will demonstrate how to use each of them.
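The 128k context window needs nothing special in code: you simply call the new GPT-4 Turbo model name. Below is a minimal sketch, assuming the model identifier gpt-4-1106-preview, the openaikey environment variable used throughout this tutorial, and a placeholder path to a long text file (the client setup is explained in the next section):

import os
from openai import OpenAI  # OpenAI Python package v1.x

client = OpenAI(api_key=os.getenv("openaikey"))

# GPT-4 Turbo accepts up to 128k tokens of context, so a long document
# can be passed directly in the prompt. The file path is a placeholder.
with open("C:/your_file_path_here/long_document.txt", encoding="utf-8") as f:
    long_document = f.read()

response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # the new 128k-context GPT-4 Turbo model
    messages=[
        {"role": "system", "content": "Summarize the document in five bullet points."},
        {"role": "user", "content": long_document},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)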

GPT-4 Vision

The GPT-4 Vision model interprets multimodal inputs, text and images, in a single API call. This feature will be very important, for example, in programming robots.

!pip install --upgrade openai  # update to the latest OpenAI Python package
import os  # Python standard library
from openai import OpenAI  # official OpenAI Python package
from IPython.display import Audio  # ships with Jupyter/IPython, used to play audio later

client = OpenAI(
    api_key=os.getenv("openaikey"))

We can now send an image URL together with the text prompt:

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you think the person in the image is doing?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AA18Lnc8.img?w=1920&h=1080&q=60&m=2&f=jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)

I can print the response content simply:

print(response.choices[0].message.content)

The result is impressive: the GPT-4 Vision API recognizes that the person is giving a presentation with their hands crossed.

GPT-4 Vision model response to a web image. Image by Author.

We have now created an application that takes images and text as input. I can use the text instructions to define how the image should be interpreted.
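Publicly hosted URLs are not the only option: the same call accepts a local image as a base64-encoded data URL. A minimal sketch, assuming a placeholder file path and the client created above:

import base64

# Encode a local image so it can be sent inline as a data URL.
image_path = "C:/your_file_path_here/image.jpg"
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)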

Text-To-Speech (TTS)

OpenAI also released a Text-To-Speech API today. Let’s see an example.

I only need to define the text to be spoken, the output file path, the model name, and the voice to use via the voice parameter.

text_to_speech = "Hello! This sentence will be converted into speech."  # placeholder text; replace with your own
speech_file_path = "C:/your_file_path_here/filename.mp3"

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=text_to_speech,
)

I can then save and play the audio output:

response.stream_to_file(speech_file_path)  # write the audio to disk
Audio(speech_file_path)  # play it inside the notebook

I uploaded this file here, so you can listen to it.

These two steps enable an application that takes an image as input, interprets it, and finally generates voice output.
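As a rough sketch of how the two calls chain together, here is a small helper (the function name describe_image_aloud and the prompts are my own placeholders; it reuses the client created earlier):

def describe_image_aloud(image_url: str, speech_file_path: str) -> str:
    """Ask GPT-4 Vision to describe an image, then read the description aloud with TTS."""
    # Step 1: interpret the image.
    vision_response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in two sentences."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_tokens=300,
    )
    description = vision_response.choices[0].message.content

    # Step 2: turn the description into speech.
    speech_response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=description,
    )
    speech_response.stream_to_file(speech_file_path)
    return description

Calling describe_image_aloud() with an image URL and an output .mp3 path runs the whole flow end to end.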

Code interpreter

Code interpreter is another new feature within the Assistants API.

I wrote a separate tutorial about the Code interpreter here.
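As a minimal sketch of what that looks like with the Assistants API, reusing the client from above (the assistant name, instructions, and example question are illustrative placeholders):

import time

# Create an assistant that can write and run Python in a sandbox.
assistant = client.beta.assistants.create(
    name="Data helper",
    instructions="You are a data analyst. Write and run code to answer questions.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],
)

# Start a conversation thread and ask a question that needs computation.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What is the 50th Fibonacci number?",
)

# Run the assistant on the thread and poll until it finishes.
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# The newest message in the thread contains the assistant's answer.
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)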

Conclusions

OpenAI revealed its long-awaited GPT-4 Turbo with Vision, with a 128k context window and support for images. In fact, you can even interpret videos by sending the model individual frames.
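A minimal sketch of that idea, assuming the opencv-python package, a placeholder video path, the client from the tutorial above, and sampling every 50th frame:

import base64
import cv2  # opencv-python, used to read video frames

video = cv2.VideoCapture("C:/your_file_path_here/video.mp4")
frames = []
index = 0
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    if index % 50 == 0:  # sample every 50th frame to keep the request small
        _, buffer = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buffer).decode("utf-8"))
    index += 1
video.release()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "These are frames from a video. Describe what happens."},
                *[
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                    for f in frames
                ],
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)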

OpenAI has also launched Text-To-Speech (TTS), which enables building applications with audio output.

In this tutorial, I combined all three modalities into a single application flow using only the OpenAI Python library.

This article is in the series: Open Knowledge: AI.

The aim is to share at least 10% of my articles free of the Medium paywall for everybody.

