GPT-4o (“omni”) — for agents to understand emotions
OpenAI releases GPT-4o (“omni”) with significantly lower latency and better vision understanding.
Introduction
Today OpenAI released the new GPT-4o model, which accepts text, audio, and image inputs. The model is already available via the API.
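As a minimal sketch (assuming the official `openai` Python SDK and the `gpt-4o` model identifier), a plain text call looks like any other chat completion:

```python
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

# "gpt-4o" is the model identifier used at launch; adjust if it changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize GPT-4o in one sentence."}],
)
print(response.choices[0].message.content)
```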
The three core features are:
- Lower latency
- Better visual understanding
- Lower pricing
The improved visual understanding is especially useful in agentic flows, where vision is a core component.
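To illustrate why this matters for vision-heavy agentic flows, here is a hedged sketch of passing an image (for example, a screenshot an agent just captured) together with an instruction; the URL and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# An agent might pass a screenshot URL (or a base64 data URL) plus an instruction.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which button should I click to open settings?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/screenshot.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```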
Let’s review the model in detail. I will not include a full usage guide here, because there is little that is new in the API itself.
GPT-4o
We currently know nothing about the model architecture. I plan to add details when a technical paper is released.
We do know that GPT-4o is the first OpenAI model to use a unified omni-modal approach, which is another way of saying that all modalities are trained together in a single model.
In terms of coding skills, GPT-4o is marketed as very strong.
However, I do want to note that some users on X report lower performance compared to GPT-4.
The model is marketed as low latency, which I think will be a crucial factor for agentic flows.
The other piece is pricing: GPT-4o is 50% cheaper to operate than GPT-4.
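As a rough back-of-the-envelope sketch (the per-token rates below are illustrative assumptions, not quoted from OpenAI's pricing page), halving GPT-4 Turbo's rates gives an idea of what the savings look like per request:

```python
# Illustrative assumption: GPT-4o at half of GPT-4 Turbo's per-token rates.
GPT4_TURBO = {"input": 10.00, "output": 30.00}   # USD per 1M tokens (assumed)
GPT4O      = {"input":  5.00, "output": 15.00}   # USD per 1M tokens (assumed)

def cost(prices, input_tokens, output_tokens):
    """Estimate the cost of a single request in USD."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# Example: one agentic step with a 3k-token prompt and a 500-token reply.
print(f"GPT-4 Turbo: ${cost(GPT4_TURBO, 3_000, 500):.4f}")  # -> $0.0450
print(f"GPT-4o:      ${cost(GPT4O, 3_000, 500):.4f}")       # -> $0.0225
```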
Conclusions
I recommend checking out the new GPT-4o model.
Overall, the model appears to be a general improvement in important aspects like pricing, latency, and vision capabilities.
So I do think the model will gain traction as a GPT-4 replacement for some jobs and as a base model for agentic flows.