GPT-4o (“omni”) — for agents to understand emotions
OpenAI releases GPT-4o (“omni”) with significantly lower latency and better vision understanding.
Introduction
Today OpenAI released the new GPT-4o model, which accepts text, audio, and image inputs. The model is already available via the API.
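As a minimal sketch (assuming the official `openai` Python SDK and the `gpt-4o` model identifier), a plain text call looks like any other chat completion:

```python
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

# "gpt-4o" is the model identifier used at launch; adjust if it changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize GPT-4o in one sentence."}],
)
print(response.choices[0].message.content)
```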
The three core features are:
- Lower latency
- Better visual understanding
- Lower pricing
The improved visual understanding is especially useful in agentic flows, where vision is a core component.
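To illustrate why this matters for vision-heavy agentic flows, here is a hedged sketch of passing an image (for example, a screenshot an agent just captured) together with an instruction; the URL and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# An agent might pass a screenshot URL (or a base64 data URL) plus an instruction.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which button should I click to open settings?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/screenshot.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```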
Let’s review the model in detail. I will not include a full usage guide here, because there is little that is new in the API itself.
GPT-4o
We currently know nothing about the model architecture. I plan to add details when a technical paper is released.
We do know that GPT-4o is the first OpenAI model to use a unified omni-modal approach, which is another way of saying that all modalities are trained together in a single model.
In terms of coding skills, GPT-4o is marketed as very strong.
However, I do want to note that some users on X report lower performance compared to GPT-4.
The model is marketed as low latency, which I think will be a crucial factor for agentic flows.
The other piece is pricing: GPT-4o is 50% cheaper to operate than GPT-4.
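As a rough back-of-the-envelope sketch (the per-token rates below are illustrative assumptions, not quoted from OpenAI's pricing page), halving GPT-4 Turbo's rates gives an idea of what the savings look like per request:

```python
# Illustrative assumption: GPT-4o at half of GPT-4 Turbo's per-token rates.
GPT4_TURBO = {"input": 10.00, "output": 30.00}   # USD per 1M tokens (assumed)
GPT4O      = {"input":  5.00, "output": 15.00}   # USD per 1M tokens (assumed)

def cost(prices, input_tokens, output_tokens):
    """Estimate the cost of a single request in USD."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# Example: one agentic step with a 3k-token prompt and a 500-token reply.
print(f"GPT-4 Turbo: ${cost(GPT4_TURBO, 3_000, 500):.4f}")  # -> $0.0450
print(f"GPT-4o:      ${cost(GPT4O, 3_000, 500):.4f}")       # -> $0.0225
```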
Conclusions
I recommend checking out the new GPT-4o model.
Overall, the model appears to be a general improvement in important aspects like pricing, latency, and vision capabilities.
So I do think the model will gain traction as a GPT-4 replacement for some jobs and as a base model for agentic flows.