OPEN KNOWLEDGE: AI
How to use the GPT-4 Vision API?
OpenAI released today the API for GPT-4 Turbo with Vision. In this tutorial, I will build an application using this SOTA model.
Introduction
OpenAI released today important features for the API:
- GPT-4 Vision
- 128k context window
- Text-To-Speech (TTS)
- Code interpreter
- Knowledge retrieval
In this tutorial, I will demonstrate how to use each of them.
GPT-4 Vision
The GPT-4 Vision model interprets multimodal inputs, text and images, in a single API call. This feature will be very important, for example, in programming robots.
!pip install --upgrade openai # upgrade to the latest OpenAI Python package
import os # part of the Python standard library
from openai import OpenAI # official OpenAI Python package
from IPython.display import Audio # available in Jupyter/Colab notebooks
client = OpenAI(
    api_key=os.getenv("openaikey")) # reads the API key from the "openaikey" environment variable
We can now send the image URL together with the text prompt:
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you think the person in the image is doing?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AA18Lnc8.img?w=1920&h=1080&q=60&m=2&f=jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)
I can simply print the output object:
print(response.choices[0].message.content)
The result is simply impressive. The GPT-4 Vision API recognizes that the person is giving a presentation with their hands crossed.
We have now created an application that takes images and text as input. The text instructions define how the model should interpret the image.
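The same request also works with a local image, if it is first base64-encoded into a data URL. Below is a minimal sketch of this variant; the file name local_image.jpg is only a placeholder for any image on disk.
import base64 # part of the Python standard library

# Read a local image and encode it as a base64 data URL (the file name is a placeholder).
with open("local_image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

local_response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(local_response.choices[0].message.content)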
Text-To-Speech (TTS)
OpenAI released today the Text-To-Speech API as well. Let’s see an example.
I only need to define the output file path, the input text (here, the description generated by the Vision call above), the model name, and the voice to be used with the voice parameter.
speech_file_path = "C:/your_file_path_here/filename.mp3"

# Use the text generated by the GPT-4 Vision call above as the input for speech synthesis.
text_to_speech = response.choices[0].message.content

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=text_to_speech
)
I can then write the audio output to a file and play it:
response.stream_to_file(speech_file_path)
Audio(speech_file_path)
I uploaded this file here, so you can listen to it.
These two steps enable creating an application that takes an image as input, interprets it, and finally generates a voice output.
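To make this flow reusable, the two calls can be wrapped into a single helper function. The sketch below only repeats the same two steps under the same assumptions as above; the function name describe_image_aloud and the default output path are placeholders of my own.
def describe_image_aloud(image_url, out_path="description.mp3"):
    """Describe an image with GPT-4 Vision and speak the description with TTS (sketch)."""
    # Step 1: ask the Vision model to describe the image behind the given URL.
    vision_response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What do you think the person in the image is doing?"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_tokens=300,
    )
    description = vision_response.choices[0].message.content

    # Step 2: turn the description into speech and save it to a file.
    speech_response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=description,
    )
    speech_response.stream_to_file(out_path)
    return description
Calling the function with any image URL writes the spoken description to the given file and returns the text.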
Code interpreter
The Code interpreter is another new functionality within the Assistants API.
I have written a separate tutorial about the Code interpreter here.
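As a quick orientation before that tutorial, the sketch below shows the general shape of enabling the code_interpreter tool through the Assistants API; the assistant name, instructions and example question are placeholders of my own.
# Create an assistant with the Code interpreter tool enabled (name and instructions are placeholders).
assistant = client.beta.assistants.create(
    name="Data helper",
    instructions="You write and run Python code to answer questions.",
    tools=[{"type": "code_interpreter"}],
    model="gpt-4-1106-preview",
)

# Start a thread, add a user message and run the assistant on the thread.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What is the 20th Fibonacci number? Compute it with code.",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
# The run executes asynchronously, so its status has to be polled before reading the reply.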
Conclusions
OpenAI revealed its long-awaited GPT-4 Turbo with Vision, with a 128k context window and support for images. In fact, you can even interpret videos with this model by sending it individual frames.
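A minimal sketch of that frame-based idea is shown below; it assumes the opencv-python package is installed, and the video file name is only a placeholder.
import base64
import cv2  # assumes the opencv-python package is installed

# Read a local video (the file name is a placeholder) and base64-encode every frame.
video = cv2.VideoCapture("my_video.mp4")
frames = []
while True:
    ok, frame = video.read()
    if not ok:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    frames.append(base64.b64encode(buffer).decode("utf-8"))
video.release()

# Send every 30th frame to GPT-4 Vision in a single request and ask for a summary.
video_response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "These are frames from a video. Describe what happens in it."},
                *[
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                    for f in frames[::30]
                ],
            ],
        }
    ],
    max_tokens=300,
)
print(video_response.choices[0].message.content)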
OpenAI has as well launched Text-To-Speech (TTS), which enables building applications with audio.
In this tutorial, I combined all three modalities into a single application flow using only the OpenAI Python library.
This article is in the series: Open Knowledge: AI.
The aim is to share at least 10% of my articles for free to everybody, without the Medium paywall.