Perception in NLP

Natural Language Processing is already used across a wide range of products on the internet, from Google Translate to Amazon's Alexa platform. Despite the profound usefulness of these applications, we cannot forget that they are still limited in terms of actual comprehension and automate only part of the human activity involved.

Teemu Maatta
Oct 6, 2020

NLP technology is not yet at the human level, but progress is measurable. First, the NLP market is estimated to reach 9.9 billion USD in 2020, with a promising forecast for the future. Such investment is an important source of funding for NLP research projects in terms of purchasing computing power and paying salaries. Secondly, NLP research is surging: the volume of published NLP academic papers has been roughly doubling yearly, according to data from arXiv.org*:

  • 2018: 299 articles
  • 2019: 506 articles
  • 2020: 1001 articles

Thirdly, new education programs are being made available to expand the NLP community on various fronts, such as the NLP programs from DeepLearning.ai, the Udacity NLP Nanodegree, or Kaggle's platform with its numerous NLP datasets for practical use. These platforms can educate hundreds of students each month on cutting-edge technology, compared to university studies that take years. One cannot underestimate the impact these learning tools are having on the supply of talent in the NLP market. Fourthly, new datasets are constantly published to drive new research, such as the Spotify Podcast dataset or the various COVID-19 related datasets.

Based on these aspects of the current state of the NLP market, the outlook for the future appears very promising, despite the unrealistic claims about NLP that are sometimes reported in the popular media.

Where are we in the NLP journey?

A wonderful research paper by Bisk et al. 2020 structures this question into a five-step framework, which we will present and discuss more broadly in this article. It succeeds in explaining the state of NLP research, as well as its past and future, with a clear structure. The framework introduces five steps in the NLP journey:

  1. Corpus
  2. Internet
  3. Perception
  4. Embodiment
  5. Social

In the past, NLP research was dataset focused: it required creating manually crafted datasets for each particular NLP research question. For example, researchers would need to spend hours gathering enough question and answer pairs that could be fed into an NLP model.

This step is called the “Corpus” stage, which, as the word suggests, is defined by the work spent on gathering the actual data. Each time a dataset is made available, it allows creating models optimized for it. The downside used to be that the data generation was particularly laborious.
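To make the Corpus stage concrete, here is a toy illustration of what a small hand-built question-answer corpus might look like on disk. The file name and fields are illustrative, not taken from any specific benchmark.

```python
# Toy sketch of a manually crafted question-answer corpus.
# The field names and file name are illustrative assumptions.
import json

qa_corpus = [
    {"question": "What colour is the sky on a clear day?", "answer": "blue"},
    {"question": "How many legs does a spider have?", "answer": "eight"},
]

# Persist the hand-gathered pairs so a model can later be trained on them.
with open("qa_corpus.json", "w", encoding="utf-8") as f:
    json.dump(qa_corpus, f, indent=2)
```

Every pair in such a corpus had to be written or checked by hand, which is exactly why this stage scaled so poorly.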

Web scraping tools have allowed us to create purpose-specific datasets from the Internet, e.g. forums, chats, product reviews, etc., without having to manually gather all the data. This stage is called the “Internet” stage. It is defined by the massive amount of data now available to us that we can feed into these models. The obvious direct impact is on model performance: larger datasets tend to improve performance, but they also allow building more complex models that can better fit particular use cases.
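As a rough illustration of this stage, the sketch below shows how a public review page could be scraped into a small text dataset. The URL and the CSS selector are placeholders, and it assumes the `requests` and `beautifulsoup4` packages are installed.

```python
# Minimal web-scraping sketch: collect review texts into a CSV dataset.
# The URL and the ".review-text" selector are hypothetical placeholders.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/product/123/reviews"  # placeholder address

response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the visible text of each review element on the page.
reviews = [tag.get_text(strip=True) for tag in soup.select(".review-text")]

# Store the scraped texts as a one-column dataset for later NLP training.
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    writer.writerows([r] for r in reviews)
```

The same pattern, repeated across many pages and sites, is what lets a single researcher assemble datasets that would have taken months to gather by hand in the Corpus stage.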

Especially in terms of NLP research, the current technology is mature enough to let us tackle very specific types of NLP problems by helping us create datasets tailored to them.

What is the next step in NLP research?

The next frontier in NLP research is defined by the paradigm we can witness in ongoing work:

  • Web crawling methods have allowed us to extract more data.
  • Large enterprises are able to build massive computational infrastructure for larger machine learning models.

Yet the recent performance gains seem to offer only limited incremental benefit. NLP is able to beat human-level performance on some use cases, but it is unlikely that simply adding more data and more computing power is a quick solution, or a solution at all, for beating human-level performance on a wide variety of NLP tasks.

Bisk et al. 2020 propose the “Perception” layer as the next frontier of NLP research. Next, we will see what this means.

Perception

Perception is defined by the Oxford dictionary as “the ability to see, hear or become aware of something through the senses” and “the way something is regarded, understood or interpreted”.

Humans perceive language not only from text; we process all our senses: we listen to the tone of voice, we visualize the situation, and we use other senses such as touch and smell to interpret language. A good example of non-verbal communication is body language. Body language quickly reveals if we are nervous, through the movements we make or the tone of voice we transmit.

Let’s think about this from a practical point of view. “Blue” is an adjective with various meanings. If we say the sea is blue, we know it means the colour. We can also describe a person as being blue, meaning sadness. One approach is to treat these senses in our language models as separate words. So if we build a sufficiently large model, we could perhaps manage such complexity within our NLP models. One could argue that we only need to add enough data, and the model will just figure out the word embeddings for all the possible senses and meanings.

Now, let’s consider another word: “red”. Red is a colour too, but a red face may refer to a person being either ashamed or angry. So if we speak of a person having a red face, we cannot really tell which meaning is intended unless we know the context. Human communication is supported by body language, which allows us to easily pull in the context: by looking at the face, listening to the tone of voice, and sensing the overall posture and body movements.
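To make the ambiguity concrete, the sketch below compares the vectors a contextual model produces for the word “blue” in different sentences. It assumes the Hugging Face `transformers` and `torch` packages and the `bert-base-uncased` checkpoint, and is only an illustration of sense-dependent embeddings learned from text, not the approach proposed by Bisk et al.

```python
# Sketch: the same word gets different contextual embeddings in BERT,
# depending on the sentence it appears in.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

sea = word_vector("the sea was deep blue", "blue")
mood = word_vector("after the news she felt blue", "blue")
sky = word_vector("the sky is blue today", "blue")

cos = torch.nn.functional.cosine_similarity
print(cos(sea, sky, dim=0))   # expected to be higher: both are the colour sense
print(cos(sea, mood, dim=0))  # expected to be lower: "blue" as sadness
```

Even this text-only trick, however, can only exploit the surrounding words; it has no access to the face, voice, or gesture that would settle the meaning in a real conversation.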

The current NLP models can seek context from text only. Although these models are impressive in their capability to learn from text, we must admit their limits in terms of performing at the human level. As we try to push NLP technology into real life, we are gradually discovering these practical limitations of using text data alone.

Bisk et al. 2020 point out that better NLP models need the dimension of “Perception”, which can be derived from different types of inputs such as audio recordings, images, and videos.

If we just consider the importance of body language in our lives, this idea of not relying only on a larger text corpus seems the right way forward. We can think of numerous situations where our facial expressions offer visual cues.

Small children tend to show understanding of body language sooner than of actual spoken language. For example, most parents will know how babies sense the emotions around them, despite lacking the capability to speak.

Body language and tone of voice are currently at the sidelines of NLP research. Interestingly, a similar, though reversed, approach has been used when Computer Vision (CV) models are trained using language input. An example of this type of “sentence + image” model is presented by Zhou et al. 2019.

Bisk et al. 2020 point out that new benchmarks are required to properly measure models that incorporate multiple sources of information: audio, visual, tactile, etc.

The creation of these new benchmarks is to be expected within NLP research in the coming years. The challenge here is first to build the practice of creating these datasets with existing tools such as web crawling, and then for the NLP community to start applying them to develop new models. For example, we may start seeing datasets where film data is divided into video, audio, and text. Such datasets will likely require getting used to processing larger amounts of data and more efficient computer hardware, as we need to feed the video, audio, and text data into these NLP models.
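As a hedged sketch of what such a film-derived dataset could look like in code, the example below pairs video frames, an audio waveform, and subtitle text per clip using PyTorch. The directory layout, file names, and tensor shapes are assumptions for illustration, not an existing benchmark.

```python
# Sketch of a multimodal dataset: each item combines video, audio and text.
# The "clip_*/frames.pt", "waveform.pt" and "subtitles.txt" layout is assumed.
from pathlib import Path
import torch
from torch.utils.data import Dataset

class FilmClipDataset(Dataset):
    """Each item pairs video frames, an audio waveform and the transcript."""

    def __init__(self, root: str):
        self.clips = sorted(Path(root).glob("clip_*"))

    def __len__(self) -> int:
        return len(self.clips)

    def __getitem__(self, idx: int) -> dict:
        clip = self.clips[idx]
        return {
            "video": torch.load(clip / "frames.pt"),    # e.g. (T, C, H, W) tensor
            "audio": torch.load(clip / "waveform.pt"),  # e.g. (samples,) tensor
            "text": (clip / "subtitles.txt").read_text(encoding="utf-8"),
        }
```

Even in this toy form, it is clear why such data demands more storage and compute than a text-only corpus: every clip carries two additional, much heavier modalities alongside the transcript.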

What is left to discover?

Once we reach the point of Perception within NLP, one might wonder: what is still left to discover once we integrate text with video and audio?

There is still plenty to be discovered within NLP besides Perception. Bisk et al. 2020 see “Embodiment” and “Social” as the final stages. Embodiment will transform the way NLP models understand mental models such as shapes or stability by acting in the world, while the Social aspect refers to theory of mind, where desires and identities change the way people act. For example, effective communication requires taking people's emotions into account.

  • *Source: arXiv.org. Example search criterion: “NLP” (all fields). Period: 15.09.2019–15.09.2020.
  • The aim of this article is not to provide a summary; in case you are interested, we strongly recommend reading the original work on the five World Scopes: Bisk et al. 2020.
