How to use GPT-4 Vision API?

OpenAI today released the API for GPT-4 Turbo with Vision. In this tutorial, I will build an application using this state-of-the-art (SOTA) model.

Teemu Maatta
3 min read · Nov 6, 2023
Photo by David Travis on Unsplash


OpenAI today released several important features for the API:

  • GPT-4 Vision
  • 128k context window
  • Text-To-Speech (TTS)
  • Code interpreter
  • Knowledge retrieval

In this tutorial, I will demonstrate how to use several of them.

GPT-4 Vision

The GPT-4 Vision model interprets multimodal inputs, both text and images, in a single API call. This feature will be very important, for example, in programming robots.

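!pip install --upgrade openai  # update to the latest OpenAI Python package
import os  # part of the Python standard library
from openai import OpenAI  # official OpenAI Python package
from IPython.display import Audio  # ships with IPython/Jupyter, not with Python itself
# Assumes the API key is stored in the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))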

We can now send the image URL together with the text prompt:

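response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed: the vision model name available at launch
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you think the person in the image is doing?"},
            {"type": "image_url", "image_url": {"url": ""}},  # the image URL goes here
        ],
    }],
    max_tokens=300,  # assumed value
)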

I can simply print the output object:
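A minimal sketch of reading the answer text out of the response, assuming the response follows the Chat Completions layout (`choices[0].message.content`). The helper name `first_reply_text` and the stand-in response object are hypothetical and only there so the snippet runs without an API call.

```python
from types import SimpleNamespace


def first_reply_text(response) -> str:
    """Return the text of the first choice in a Chat Completions response."""
    return response.choices[0].message.content


# Stand-in object with the same attribute layout, so the sketch runs offline.
# With a real call, pass the `response` returned by the API instead.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="A person is presenting."))]
)
print(first_reply_text(fake_response))
```

Printing the whole `response` object also works, but it includes metadata such as token usage alongside the answer.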


The result is simply impressive: the GPT-4 Vision API recognizes that the person is giving a presentation with hands crossed.

GPT-4 Vision model response on web-image. Image by Author.

We have now created an application that takes images and text as input. I can use the text instructions to define how the image is interpreted.
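To reuse the same request with different instructions or images, the message can be built by a small helper. This is a sketch under assumptions: the function names `build_vision_messages` and `image_to_data_url` are hypothetical, and the data-URL variant assumes the image is a local JPEG that you want to send as base64 instead of a web link.

```python
import base64
from pathlib import Path


def image_to_data_url(path: str, mime_type: str = "image/jpeg") -> str:
    """Encode a local image file as a base64 data URL."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_messages(prompt: str, image_url: str) -> list:
    """Return one user message that pairs a text prompt with one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`. Changing only the `prompt` changes how the same image is interpreted.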

Text-To-Speech (TTS)

OpenAI also released a Text-To-Speech API today. Let’s see an example.

I only need to provide the file path, the model name, and the voice, which is set with the voice parameter.

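speech_file_path = "C:/your_file_path_here/filename.mp3"
# Assumed values: "tts-1" and "alloy" are launch-era model and voice names; the input text is a placeholder
response = client.audio.speech.create(model="tts-1", voice="alloy", input="Text to read aloud")
response.stream_to_file(speech_file_path)  # save the generated audio to the file path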

I can then play the audio output in the notebook, for example with Audio(speech_file_path).


I uploaded this file here, so you can listen to it.

These two steps make it possible to create an application that takes images as input, interprets them, and finally generates voice output.
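The two steps can be chained into one function. This is a sketch, not the exact code used above: the function name `describe_image_aloud` is hypothetical, the defaults `gpt-4-vision-preview`, `tts-1`, `alloy`, and `max_tokens=300` are assumptions based on the names available at launch, and `client` is expected to be an `OpenAI()` instance from the v1 Python package.

```python
def describe_image_aloud(client, image_url, prompt, out_path,
                         vision_model="gpt-4-vision-preview",
                         tts_model="tts-1", voice="alloy"):
    """Ask the vision model about an image, then save the answer as speech."""
    # Step 1: send the text prompt and the image in one chat request.
    chat = client.chat.completions.create(
        model=vision_model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=300,  # assumed value
    )
    description = chat.choices[0].message.content

    # Step 2: turn the text answer into audio and write it to out_path.
    speech = client.audio.speech.create(
        model=tts_model, voice=voice, input=description
    )
    speech.stream_to_file(out_path)
    return description
```

Passing the client in as an argument keeps the function easy to try with a stub object before spending API credits. Model names change over time, so check which ones your account currently offers.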

Code interpreter

Code interpreter is another new functionality within the Assistants API.

I wrote a separate tutorial on the Code interpreter here.


OpenAI revealed its long-awaited GPT-4 Turbo with Vision, which has a 128k context window and support for images. In fact, you can even interpret videos using this model.

OpenAI has also launched Text-To-Speech (TTS), which enables building applications with audio.

In this tutorial, I combined all three modalities into a single application flow using only the OpenAI Python API library.

This article is in the series: Open Knowledge: AI.

The aim is to share at least 10% of my articles for free with everybody, without the Medium paywall.




