Real World Environments

Real World Environments (RWEs) are defined by continuous event spaces, open-ended action spaces and an unlimited number of actors. Autonomous Agents operate in them.

Teemu Maatta
7 min read · Jun 13, 2024
Photo by Jacek Dylag on Unsplash

Introduction

Autonomous agents interact in Real World Environments (RWEs), which differ from traditional software environments.

Traditional software is built as if it were deployed in a strict, rule-based environment.

We write code in a reproducible manner. A function is programmed to work the exact same way every time. If the real-world use case is well determined, such a function tends to work with high accuracy.

Let’s think of a practical example. I will call this function “get_weather”. The function executes an API call based on the input city and outputs the current temperature at that location. This function could be the core functionality of a mobile app. It works as long as the city name is one of the accepted values on the server side.
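As a rough illustration, here is a minimal sketch of such a function. The endpoint URL, the accepted city list and the response field names are hypothetical placeholders, not a real weather service.

```python
import requests

# Hypothetical weather endpoint and city list: placeholders, not a real service.
WEATHER_API_URL = "https://api.example.com/v1/weather"
ACCEPTED_CITIES = {"Madrid", "Helsinki", "Paris"}


def get_weather(city: str) -> float:
    """Return the current temperature (in °C) for a known city."""
    if city not in ACCEPTED_CITIES:
        # Strict rule-based behaviour: anything outside the accepted list fails.
        raise ValueError(f"Unknown city: {city}")
    response = requests.get(WEATHER_API_URL, params={"city": city}, timeout=10)
    response.raise_for_status()
    return response.json()["temperature_celsius"]
```

The function is fully reproducible: the same input always triggers the same call, and anything outside the accepted list is rejected.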

The software industry has grown to a massive extent using these kinds of reproducible functions. Code once, replicate everywhere. Each day, we use hundreds of such functions on our mobile phones and computers.

Although this software paradigm enables automation at a massive scale, we can easily identify the limits of such strict, rule-based programs.

Let’s review the sample “get_weather” function. We would need a dedicated new piece of software if we wanted to know the difference in temperature between two cities. Humans perform such tasks naturally: we can easily read a weather map of Europe in the local newspaper, look at the temperature labels above different cities and tell exactly the difference between any two of them. For a strict rule-based program, this is not possible without new code.
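To make this concrete, covering the new intent means writing yet another hand-coded function on top of the sketch above; temperature_difference below is again only an illustration.

```python
def temperature_difference(city_a: str, city_b: str) -> float:
    """Return how much warmer city_a is than city_b, in °C."""
    # Every new user intent needs another explicitly programmed function.
    return get_weather(city_a) - get_weather(city_b)
```

And the next intent, say the warmest of three cities or tomorrow’s forecast, needs yet another one.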

Software developers built thousands of skills for home assistants such as Alexa. It was a massive undertaking: an attempt to cover the infinite number of user intents arising in the real world.

Home speakers were supposed to become the new “microwave” of the home, but their use has decreased over time.

Home assistants tried to solve every single user intent that human language has to offer through a small set of programmed skills.

The microwave has a very precise use case: it defrosts and heats food quickly. Everyone in a family knows how to use it, it takes only a short time and it is effective.

Voice assistants failed because they could not handle complex situations in the room, such as a request to play a song while the children of the family are playing and speaking. In such situations the voice assistant cannot determine the correct user intent, because of the background noise, and it may also struggle to understand whose voice it should follow.

Such Real World Environments (RWEs) require a human-like understanding of what is going on in the environment, who is involved, what is sensed and what the possible actions are.

Real World Environments (RWEs)

The rapid progress in AI research has provoked different views on the future of automation.

Some experts claim that LLM-based autonomous agents need architectural advances before we can use them reliably in production environments.

I see this criticism as partly valid: we still lack generic agents that work reliably and repeatedly in an open-ended world.

However, I believe this is less due to missing architectural advances and more due to the complex nature of the Real World Environment (RWE) in which they operate.

We are so used to programming rule-based tools that we lack the skills to develop autonomous agents.

I think a good example of this skills gap is the fact that so few AI researchers are aware of the autonomous-agents research of the past decades.

For example, many AI researchers working in prominent AI labs, with the largest compute resources in the world, are simply not able to define an autonomous agent.

Real World Environments (RWEs) are defined by:

  • Continuous event space,
  • Open-ended action space and
  • Unlimited number of actors.

Let’s take a practical example of the home assistant.

New LLM-based assistants offer significantly more built-in knowledge. I can ask one to translate a sentence between languages, and it will likely succeed.

The situation becomes significantly more complex when I am not alone in the room. Any background noise could lead the home assistant to misinterpret my request.

This is why I recommend testing your AI product demo in the presence of your family members or friends. This simple change requires a significantly more generic solution in order to work.

We can improve the speech-to-text pipeline to reduce the occurrence of this type of error. However, we will encounter many additional events that are impossible to foresee.

For example, we cannot tell in advance whether a web page will pop up a warning window, or whether an autonomous car will be involved in a traffic accident.

Overall, it is impossible to foresee all the unexpected events that arise.

Neither can we foresee all the possible actors who might become part of the use case. Even at home, with typically only a few users, we will still occasionally invite friends over.

You might think the total number of users is limited by the total population of the world. However, even this is no longer true. Bot accounts are a reality in social media, and new frauds are being invented using voice technology.

In theory, we have the tools to connect AI agents that can interact with anybody in the world.

Traditional voice assistants never operated in such a large pool of actors.

Autonomous agents are already starting to operate with massive user bases through chatbots and AI personas.

The world is open-ended. I like to think about this from the perspective of your home. We might start building an autonomous agent that is able to handle tasks in the kitchen. The kitchen is a limited space, but it already includes a massive number of different actions we could perform inside it: we could clean the kitchen, cook food, change a window, buy food and so on.

The next step would be to go outside the kitchen into the living room, or to go outside the home. This increases the available action space exponentially.

One might be tempted to conclude that the challenges of Real World Environments (RWEs) make it impossible to deploy an autonomous agent successfully.

However, I see this challenge more as a lack of understanding than as an underlying issue with the technology.

Successful Autonomous Agents

The number one rule to remember is that Autonomous agents are not programmed like traditional software.

Traditional software programs exact rules: for example, send an API request with the exact parameters required. The API call can fail only for specific reasons.

LLMs include lots of special roles that we can prompt. We can ask the LLM to act like a musician, which works great in the music domain but conditions the agent to work only in that domain.

I have seen many examples of people claiming to have built an Autonomous agent that only works in a few use cases, because the developers only programmed a few roles.

Autonomous agents cannot be programmed with a single role in mind. Instead, your system must use dynamic roles, where the role is defined automatically once the context is known.

Humans working with numbers pay attention to small details, such as the place of a comma, while a person playing football focuses on vision and body coordination and might not even hear what the coach is trying to yell at them. So it is natural to ask LLMs to take different roles.

You cannot decide upfront what roles the LLM will use. The LLM has to learn to figure them out by itself.
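As a minimal sketch of what dynamic role selection could look like, assuming a placeholder call_llm(prompt) helper that wraps whichever LLM API you actually use:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (an assumption, not a specific library)."""
    raise NotImplementedError


def answer_with_dynamic_role(user_request: str) -> str:
    # Step 1: let the model infer the most suitable role from the context,
    # instead of hard-coding a single role upfront.
    role = call_llm(
        "In one short phrase, name the expert role best suited to handle "
        f"this request: {user_request}"
    )
    # Step 2: answer the request while acting in the inferred role.
    return call_llm(f"You are {role}. Handle the following request:\n{user_request}")
```

The role is chosen at runtime from the request itself, so the same agent can act as a translator, an accountant or a coach without any of those roles being programmed in advance.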

Next, I want to talk about events and actions.

While testing an agent that posts tweets to X.com, I realised it sometimes clicked an incorrect button, so I had to stop it. I was not expecting this, so I reviewed what was going on.

The reason was that the browser sometimes displayed a notification pop-up, which a normal user would simply close.

Traditional software would program an extra check: if the pop-up notification window opens, the robot clicks to close it.
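In pseudocode, that traditional fix would look something like the sketch below, where the helper functions are hypothetical stand-ins for a browser-automation framework:

```python
# Hypothetical stand-ins for a browser-automation framework (assumptions).
def browser_shows_popup() -> bool: ...
def close_popup() -> None: ...
def click_button(label: str) -> None: ...
def type_text(text: str) -> None: ...


def post_tweet_with_hardcoded_check(text: str) -> None:
    # One specific, foreseen event gets one specific, hand-written rule.
    if browser_shows_popup():
        close_popup()
    click_button("Compose")
    type_text(text)
    click_button("Post")
```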

We cannot program Autonomous agents this way, trying to foresee every possible event. If we tried such an approach, we would keep adding longer and longer prompts.

Eventually, this would collapse under its own weight.

Autonomous agents are instead programmed to reason about events.

If the agent sees a pop-up window, it should reason about what the message is, whether it is relevant to the task and how it should react.

You cannot program a Vision Language Model (VLM) agent to take actions directly from an image; this causes the VLM to hallucinate facts. Instead, it is a better idea to first only reason about the image. The produced reasoning can then be passed to an LLM, which can produce a proper action.
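A rough sketch of that separation, assuming placeholder call_vlm and call_llm helpers rather than any specific model API:

```python
def call_vlm(image_bytes: bytes, prompt: str) -> str:
    """Placeholder for a Vision Language Model call (an assumption)."""
    raise NotImplementedError


def call_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call (an assumption)."""
    raise NotImplementedError


def decide_next_action(screenshot: bytes, task: str) -> str:
    # Step 1: perception. The VLM only describes and reasons about the screen;
    # it does not choose the action itself.
    observation = call_vlm(
        screenshot,
        "Describe what is visible on this screen, including any pop-up windows, "
        "and whether they are relevant to the task: " + task,
    )
    # Step 2: action. The LLM turns that reasoning into a concrete next step.
    return call_llm(
        f"Task: {task}\nObservation: {observation}\n"
        "Decide the single best next action and return it as a short command."
    )
```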

Human imagination is similarly prone to errors.

If you close your eyes, you are able to move your hand towards an object you know exists in the room, such as a painting, but you will be surprised by how inaccurately your brain imagines exact positions, even in familiar environments such as your home. This estimate, based on your internal model of the environment, is significantly less accurate in unfamiliar environments.

This explains why it is so important to separate perception from reasoning, and why the human brain does not do everything in a single brain region either.

Conclusion

In this article, we have gone through the challenges of Real World Environments (RWEs).

I have explained the three key dimensions of complexity:

  • Continuous event spaces,
  • Open-ended action spaces and
  • An unlimited number of actors.

These dimensions make programming Autonomous Agents an especially challenging task, but not an impossible one.


