What Brings Virtual Conversations to Life?
By Quinn Agen
Speech recognition technology is woven so deeply into our daily lives that we barely notice it. Every time you interact with a virtual assistant or a smart speaker, you are using automated speech recognition technology.
Commentators estimate that spending on automated speech and voice recognition technology across a wide range of deployments will grow to $26.8 billion by 2025. This growth is driven in part by the increased use of smart payment systems and strong consumer demand for smart devices. It seems that Alexa and Siri are not going to dominate the virtual assistant market for long.
Some automated speech recognition solutions have more potential than others. Solutions with 95% accuracy can add the capability to parse the relationships between words and determine the underlying meaning of an utterance, a sentence, or a combination of words. This is called natural language processing (NLP).
NLP is revolutionizing how we interact with chatbots, but it does not operate in a vacuum. It requires high and consistent accuracy from the speech recognition technology, and the two must work hand in hand. The role of speech recognition is to identify words, while NLP makes sense of what is said. Speech technology has had considerable room for improvement over the last decade, and many companies have struggled to increase their speech recognition accuracy, which has limited their use of NLP. When speech recognition accuracy is high, the potential of NLP is endless.
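The article does not say how its accuracy figure is measured. One widely used convention reports accuracy as one minus the word error rate (WER), where WER is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below assumes that convention; the function name word_error_rate and the sample transcripts are illustrative only.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = re.findall(r"[a-z0-9']+", reference.lower())
    hyp = re.findall(r"[a-z0-9']+", hypothesis.lower())
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "May I have a large number 2 with a diet coke?"
hypothesis = "May I have a large number 2 with a diet coat"
wer = word_error_rate(reference, hypothesis)
accuracy = 1.0 - wer
```

Here one word out of eleven is wrong, so the accuracy under this convention is roughly 91%. A single misheard word is enough to change the drink, which is why downstream NLP depends so heavily on the recognizer.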
Prepare for Complexity
To start, the solution must effectively distinguish the customer’s voice from environmental and other noise, such as a car honking, breathing, and background conversations taking place at the same time as the customer’s.
To respond to and maintain a conversation with the customer, the solution must navigate complex interactions and continue to process speech accurately.
Let’s take the example of a fast-food drive-through to demonstrate different levels of complexity for NLP.
Simple command: May I have a large number 2 with a diet coke?
Medium command: May I have a large number 2 with a diet coke? No pickles on the burger.
Complex command: May I have a large number 2 with a diet coke? No pickles on the burger. I only want a little bit of ice in the diet coke and I would like to substitute the french fries with onion rings.
Editing sandwich options and making substitutions using AI involves a level of complexity that can only be handled by sophisticated, highly accurate speech recognition combined with an effective NLP implementation.
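To make the three complexity levels concrete, here is a minimal rule-based sketch of turning a transcribed command into a structured order. It is not how a production NLP system works, and every name in it (ParsedOrder, parse_order, the DRINKS list) is hypothetical. It assumes the speech recognition step has already produced accurate text.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical drink menu; longer names come first so "diet coke" wins over "coke".
DRINKS = ("diet coke", "coke", "sprite")


@dataclass
class ParsedOrder:
    size: Optional[str] = None
    meal_number: Optional[int] = None
    drink: Optional[str] = None
    removed: list = field(default_factory=list)
    substitutions: dict = field(default_factory=dict)


def parse_order(utterance: str) -> ParsedOrder:
    """Extract a few order details from transcribed text using fixed rules."""
    text = utterance.lower()
    order = ParsedOrder()

    size = re.search(r"\b(small|medium|large)\b", text)
    if size:
        order.size = size.group(1)

    meal = re.search(r"\bnumber (\d+)\b", text)
    if meal:
        order.meal_number = int(meal.group(1))

    order.drink = next((d for d in DRINKS if d in text), None)

    # "No pickles on the burger" -> remove "pickles"
    order.removed = re.findall(r"\bno ([a-z]+)", text)

    # "substitute the french fries with onion rings" -> {"french fries": "onion rings"}
    swap = re.search(
        r"substitute (?:the )?([a-z ]+?) with ([a-z ]+?)(?:[.?!,]|$)", text
    )
    if swap:
        order.substitutions[swap.group(1)] = swap.group(2)
    return order


simple = parse_order("May I have a large number 2 with a diet coke?")
complex_order = parse_order(
    "May I have a large number 2 with a diet coke? No pickles on the burger. "
    "I only want a little bit of ice in the diet coke and I would like to "
    "substitute the french fries with onion rings."
)
```

The sketch deliberately leaves the "little bit of ice" request unhandled. Each new kind of modification needs another hand-written rule, which is one reason the complex command calls for trained models rather than pattern matching.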
Create the Conversation Context
An added layer of complexity beyond accurately transcribing speech to text is taking into account the context of the conversation. When developing natural language understanding (NLU) models, it is important to create context rules (properties that have value within the overall meaning of the text) and to develop a dialog logic, meaning the reasoning and actions behind how a human or system drives a dialog towards an objective, such as fulfilling a food order in the drive-through or making a payment on the phone. Context rules allow for multiple context streams within a single dialog with a customer. NLU’s purpose is to identify meaning and understand the customer’s intent, and context carries a lot of weight in understanding intent.
For example, dialog logic allows the AI to recognize that when a customer says they want “5 number 3s,” 5 refers to the number of orders and 3 represents the meal option. Even though they are both numeric values, the AI has been trained to assign meanings specific to this context.
Customer: May I have 5 number 3s, 4 with a sprite and 1 with a coke?
Response: So, you would like 5 number 3s and would like 4 of these meals to come with a sprite and 1 of these meals to come with a coke? Would that be all?
In this case, both the speech recognition engine and NLP were trained to respond to and understand the customer in the context of a fast-food drive-through.
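The idea that two numbers in the same phrase carry different meanings can be sketched with a single hand-written context rule. This is an illustration only: the article describes trained models, whereas the hypothetical interpret_meal_order function below uses fixed patterns, and it assumes the utterance has already been transcribed correctly.

```python
import re


def interpret_meal_order(utterance: str) -> dict:
    """Assign roles to the numbers in a phrase like '5 number 3s'."""
    text = utterance.lower()

    # Context rule: the number before "number" is a quantity,
    # the number after it is the meal option.
    meal = re.search(r"(\d+) number (\d+)s?\b", text)
    if meal is None:
        raise ValueError("no '<quantity> number <meal>' phrase found")
    quantity, meal_number = int(meal.group(1)), int(meal.group(2))

    # Context rule: a number followed by "with a <drink>" says how many
    # of the meals come with that drink.
    drinks = {
        name: int(count)
        for count, name in re.findall(
            r"(\d+) with an? ([a-z ]+?)(?= and\b|[,.?!]|$)", text
        )
    }

    return {
        "quantity": quantity,
        "meal_number": meal_number,
        "drinks": drinks,
        # Sanity check a dialog could use before confirming the order.
        "drinks_match_quantity": sum(drinks.values()) == quantity,
    }


result = interpret_meal_order(
    "May I have 5 number 3s, 4 with a sprite and 1 with a coke?"
)
```

For the customer request above, the result records a quantity of 5, meal number 3, and drinks split as 4 sprite and 1 coke, which is the information the confirmation response repeats back.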
Keep the Conversation Going
It is important to note that most conversations do not end after transcribing a single customer request. If a customer has a follow-up request, the speech recognition engine must have a dialog management capability to keep the conversation going. For example, dialog management allows the technology to have a human-like conversation with a beginning, middle, and end by using memory from the previous interaction.
Customer command: May I have a large number 2 with a diet coke?
Response: So, you would like a large sized meal number 2 with a diet coke? Would that be all?
Customer: Yes, that is all.
Response: Okay, the total for your order is $10.75. Please proceed to the next window.
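The exchange above can be sketched as a small state machine that remembers what was ordered between turns. This is a simplified illustration under stated assumptions, not a real dialog manager: the class name DriveThroughDialog, its states, and the PRICES table are all hypothetical, and only the $10.75 total is taken from the example.

```python
import re

# Hypothetical price table; only the $10.75 total comes from the example dialog.
PRICES = {("large", 2): 10.75}


class DriveThroughDialog:
    """Tracks the conversation state and remembers items across turns."""

    def __init__(self) -> None:
        self.state = "ordering"
        self.items = []  # memory carried from one turn to the next

    def respond(self, utterance: str) -> str:
        text = utterance.lower().strip()

        if self.state == "ordering":
            match = re.search(
                r"(small|medium|large) number (\d+) with an? ([a-z ]+?)[?.!]?$",
                text,
            )
            if match is None:
                return "Sorry, I did not catch that. What would you like?"
            size, meal, drink = match.group(1), int(match.group(2)), match.group(3)
            if (size, meal) not in PRICES:
                return "Sorry, that item is not on the menu."
            self.items.append((size, meal, drink))
            self.state = "confirming"
            return (
                f"So, you would like a {size} sized meal number {meal} "
                f"with a {drink}? Would that be all?"
            )

        if self.state == "confirming":
            if text.startswith("yes"):
                # The total relies on items remembered from the earlier turn.
                total = sum(PRICES[(size, meal)] for size, meal, _ in self.items)
                self.state = "done"
                return (
                    f"Okay, the total for your order is ${total:.2f}. "
                    "Please proceed to the next window."
                )
            self.state = "ordering"
            return "What else would you like?"

        return "Your order is complete."


dialog = DriveThroughDialog()
first_reply = dialog.respond("May I have a large number 2 with a diet coke?")
second_reply = dialog.respond("Yes, that is all.")
```

The second reply can only be produced because the first turn was stored, which is the role memory plays in dialog management.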
Sophisticated AI can be adapted to function seamlessly across almost any industry. However, many organizations struggle to integrate different capabilities of speech recognition technology into one cohesive solution. The three most important capabilities of an effective speech recognition solution are 1) the ability to customize speech recognition models to the business, 2) accurate voice activity detection that only responds to the customer’s voice (and ignores other noise), and 3) the ability to work hand in hand with NLU to communicate and adapt the conversation in real time. Organizations that tailor their speech recognition and NLP capabilities to customer needs can benefit greatly from this innovation, saving money, reducing customer service time, and improving the overall customer experience.