
Voice Is the New Default Interface

This article was published on July 27, 2021

Not long ago, calling “what’s the weather like today?” into an empty room would have earned you concerned looks from your neighbors. Today, they’d barely give it a passing thought. That’s because we’re on the verge of an enormous shift in how we interact with computers, one even more profound than the switch from green-screen terminals to mice and menus.

Talking to technology has gone from sci-fi dream to a regular part of how we interact with computers. Research predicted that by the end of 2018 there would be more than 100 million voice-activated smart speakers worldwide, with adoption in China well ahead of the West. But Alexa, Siri, and friends are just the beginning.

That imminent shift from curiosity to commonplace is nicely summarised by Andrew Ng––Stanford professor in AI and robotics, and former Chief Scientist at Baidu––when he says he hopes to have “grandchildren who are mystified at how, back in 2016, if you were to say ‘Hi’ to your microwave oven, it would rudely sit there and ignore you.”

So, as a developer or a product manager, how do you prepare yourself for a market where voice is the new default interface for software?

Adapting to Context

Not all technologies come to us fully formed. Even the telephone took a while to find its feet. But voice interfaces are interesting in that norms are already forming. For example, there are three domains, broadly speaking, within which voice interfaces exist today:

  • Business-specific: such as voice-activated IVRs and virtualized contact center agents.
  • Device-specific: this is where a particular app or device has its own voice interface, such as a smart TV that lets you name the channel you want to watch.
  • Ecosystem: today this largely means Alexa Skills and Google Home Actions.

Each of these contexts has a profound effect on how end-users perceive our software, and each is as different from the others as web is from mobile, or mobile from native desktop. From the UX of the interface, through the design and branding incorporated within it, to the capabilities on offer, the way we present our software in each domain will be driven by context-specific needs.

The concrete outcome is that we can’t build a one-size-fits-all voice interface and hope to deploy it everywhere. Let’s look at each of those concerns––UX, brand, and functionality––in turn.

UX

As an industry, we’ve made enormous improvements in both how we understand and how we respond to user needs. The best web apps today are intuitive and exciting to use thanks to the efforts of user experience professionals who have built a corpus of knowledge around how to make software that improves the lives of humans.

The key areas of UX are often summarised as a set of pillars. While not everything in web and mobile UX applies to voice interfaces, academics and practitioners have adapted those pillars into best-practice guidelines for voice user experience.

Let’s look at some of those guidelines for voice UX.

1. Intent Discovery

Language provides its speakers multiple ways to convey the same message. For example, “I love you” and “Love is what I feel for you” and “I emoji heart you” generally mean the same thing. Our voice interfaces must be able to handle this wonderful variety and make sense of the ebb and flow of human intent.
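To make that concrete, here’s a minimal sketch of intent matching that scores an utterance against sample phrases. The intents, samples, and threshold are invented for illustration; a production system would use a trained natural-language-understanding model.

```python
# A minimal sketch of intent discovery, using token overlap against sample
# phrases. The intents and threshold here are illustrative only.

INTENTS = {
    "declare_love": ["i love you", "love is what i feel for you"],
    "check_weather": ["what's the weather like today", "will it rain today"],
}

def tokens(text):
    """Lowercase and strip simple punctuation so phrasings compare cleanly."""
    return {word.strip("?!.,") for word in text.lower().split()}

def match_intent(utterance):
    """Return the intent whose sample phrase best overlaps the utterance."""
    spoken = tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, samples in INTENTS.items():
        for sample in samples:
            sample_tokens = tokens(sample)
            score = len(spoken & sample_tokens) / len(sample_tokens)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= 0.5 else None

print(match_intent("I love you"))           # declare_love
print(match_intent("What's the weather?"))  # check_weather
```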

2. Signposting

We’ve become really good at providing subtle cues in visual interfaces that indicate where we are in an application: breadcrumbs, colour changes, and so on all tell us where we are in the world of the app. Research by Microsoft shows that a common complaint from frequent users of voice interfaces is that they don’t know what the tool can do, because there’s nothing to see that shows what’s possible. Signposting where users are and what they can do at that point is essential, but it’s a difficult trade-off between sharing enough and being irritatingly verbose.
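As a rough illustration, a signposting reply might be generated from the current dialogue state, capped at a few options to keep it brief. The states and options below are invented:

```python
# A minimal sketch of signposting: say what the user can do next, trimmed
# to a handful of options so the prompt never becomes irritatingly verbose.

NEXT_STEPS = {
    "home": ["check a balance", "pay a bill", "report a problem"],
    "payments": ["pay a saved payee", "add a new payee", "go back"],
}

def signpost(state, max_options=3):
    options = NEXT_STEPS.get(state, [])[:max_options]
    if not options:
        return "I'm not sure where we are. Let's start over."
    if len(options) == 1:
        return f"You can {options[0]}."
    return f"You can {', '.join(options[:-1])}, or {options[-1]}."

print(signpost("home"))
# -> "You can check a balance, pay a bill, or report a problem."
```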

3. Error Handling

We can’t expect people to prepare carefully worded statements before interacting with our voice interfaces, and we may even catch a comment directed at someone or something else while listening for a reply. When the speech we receive isn’t what we expect, or isn’t something we can respond to, our software must handle the interaction gently and fall back to, “Sorry, I didn’t catch that,” or, if the request hits a known deficiency in the software, “I can’t do that yet, but I’m hard at work on improving myself!”
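A minimal sketch of that fallback logic, reusing a match_intent() like the one sketched earlier (the unsupported-intent set and handler are hypothetical):

```python
# Gentle fallback handling: unrecognised speech gets a retry prompt,
# recognised-but-unsupported requests get an honest "not yet".

UNSUPPORTED = {"international_transfer"}  # known deficiencies

def handle(intent):
    return f"Okay, starting {intent.replace('_', ' ')}."  # real work elsewhere

def respond(utterance):
    intent = match_intent(utterance)  # see the earlier sketch
    if intent is None:
        return "Sorry, I didn't catch that. Could you say it another way?"
    if intent in UNSUPPORTED:
        return "I can't do that yet, but I'm hard at work on improving myself!"
    return handle(intent)
```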

4. Variety and Human Nature

Hearing the same response repeatedly begins to tear down the wall and reveal the robot working the microphone. A better experience varies its replies at each point of the conversation; “sorry, I didn’t catch that,” is just one way to communicate a misunderstanding. One of the goals of great UX is that the user never notices the UX. Making our voice interface behave more like a human, perhaps by using different ways to say the same thing, keeps our users thinking about what they want to achieve rather than how clunky our voice UX is.
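One simple way to achieve that variety, sketched below, is to rotate between phrasings while avoiding whatever the user heard last:

```python
import random

# A minimal sketch of varied replies: pick a phrasing at random, but never
# repeat the one the user just heard, so the robot stays hidden a bit longer.

MISUNDERSTOOD_REPLIES = [
    "Sorry, I didn't catch that.",
    "Hmm, I didn't quite follow. Could you rephrase?",
    "Apologies, could you say that once more?",
]

_last_reply = None

def misunderstood_reply():
    global _last_reply
    candidates = [r for r in MISUNDERSTOOD_REPLIES if r != _last_reply]
    _last_reply = random.choice(candidates)
    return _last_reply

print(misunderstood_reply())
print(misunderstood_reply())  # guaranteed to differ from the first
```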

5. Handholding

Libraries and APIs give us shortcuts when building software. Shared culture does something similar for speech. We humans make a ton of assumptions about our conversational partner’s knowledge and understanding each time we talk. Our voice interfaces need to work around that gracefully. For example, “Order more kitty litter” should spark a conversation that unpacks all the intent in that statement. “Okay, which kind? Where should we ship it? Which method should we use to pay for the litter?”
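That conversation amounts to slot filling: asking one follow-up question per gap until the order is complete. A minimal sketch, with invented slot names:

```python
# Slot filling for "Order more kitty litter": the spoken intent arrives with
# implicit gaps, and the interface asks one question per missing slot.

REQUIRED_SLOTS = {
    "variant": "Okay, which kind?",
    "shipping_address": "Where should we ship it?",
    "payment_method": "Which method should we use to pay for the litter?",
}

def next_prompt(order):
    """Return the next question to ask, or None once the order is complete."""
    for slot, question in REQUIRED_SLOTS.items():
        if slot not in order:
            return question
    return None

order = {"product": "kitty litter"}
print(next_prompt(order))      # "Okay, which kind?"
order["variant"] = "clumping"
print(next_prompt(order))      # "Where should we ship it?"
```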

Brand

How do you build and sustain a brand when people’s primary interaction with your software is through a voice interface? Wally Olins was one of the 20th century’s leading experts on brand and he had this to say:

“Overall, because branding is about creating and sustaining trust it means delivering on promises. The best and most successful brands are completely coherent. Every aspect of what they do and what they are reinforces everything else.”

In other words, brand is about far more than visual design. A brand is like a personality, and that makes it supremely communicable by voice. When we’re building business- or device-specific voice interfaces, we can control precisely how our brand is reflected in that interface. However, when our software is part of another company’s ecosystem, the challenge will be to break through the brand and persona of that ecosystem.

Functionality and Use Cases

If we get the UX and brand right, there still remains the question of when and where to use voice interfaces. If the world to come is one where voice is the default way to interact with software, then perhaps that question is moot.

Just as some software remains best suited to the command-line rather than a graphical interface, there will be software that lends itself naturally to a voice interface. This provides us an opportunity to drive our software and services into situations where previously computing would have been difficult or impossible.

Today, we’re seeing two primary ways that voice interfaces are being used:

  • Supplementing existing computing usage: research appears to show that people tend to use Siri and Google Assistant to do the things they’d do anyway on their phones, while Alexa is mostly used to play music, set timers, and listen to the news.
  • Bringing existing computing into new situations: some people with disabilities, artists covered in paint, and many others find themselves less able to use computers as deployed today. Voice interfaces open computing to those people and scenarios.

However, the future of voice––or conversational––interfaces lies in ubiquitous computing. The shift began the day we moved from punch-cards to teletype terminals. In the early days of computing, using a computer was an event: people had to book time on a machine. Personal computers democratised computing, but they were at once sufficiently cumbersome and delicate that using one still had to be thought of as an event in itself. With powerful mobile devices, we’re closer than ever to ubiquitous computing. The next step is to decouple software from any one device in particular. While that’s, perhaps, impossible for now, voice interfaces get us close enough; speaking into a room or down a phone line turns computing into something that’s “just there.”

So, what will that look like?

Ubiquitous Computing and the Voice Interface

Imagine it’s 2028; close enough to be a world that’s not all that different from today.

You’re driving to work, or rather, your car is cruising along the highway keeping itself in the correct lane, staying a safe distance from the car in front, and following the traffic-adjusted sat-nav route. You pay attention but mostly out of habit.

“How’s the traffic looking?” you ask the car.

“Lighter than usual today. We’ll be at the office in just under fifteen minutes.”

Great, you think, just enough time to get something done.

“Call the bank,” you say and in a few moments you’re speaking to Sandra. She greets you by name and asks how she can help.

“I need to make a complaint, please, Sandra,” you say.

“Of course, let me connect you to someone who can help,” Sandra says. You know she’s not a human customer service agent but you like that there’s never a queue to speak to her and she can help with most things.

After a few moments, you’re connected to a human agent who asks about your complaint. There’s a detail you can’t remember. That’s when Sandra chips back in, “Excuse me, I can help with that.” Then she fills in the missing detail for both you and the human agent.

Within a few minutes, the matter is dealt with and your car is pulling into the parking lot outside your office.

Perhaps this glimpse into the near future will prove to be a little off in some of the details but there are a couple of aspects that we’re likely to see even sooner:

  • Interacting effortlessly and seamlessly with all manner of services through a voice interface will be commonplace.
  • Virtual assistants will become ubiquitous.

Today’s virtual assistants live trapped like genies in our phones and smart speakers. We summon them with the right magical phrase but they’re confined to that particular device and the contexts within which it can operate. Soon, this will seem odd.

In our example above, the bank has a virtual assistant called Sandra. Sandra not only offers first-line help to customers, but also listens in to calls between customers and human agents, ready for the moment when she can help either side of the call. Here at Nexmo, we demonstrated how IBM’s Watson can integrate with our own Voice API to provide just this sort of assistance.
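As a rough sketch of that pattern, the webhook below answers an inbound call, captures the caller’s speech, and hands the transcript to an assistant. The endpoint paths, payload handling, and the analyse_with_assistant() helper are assumptions rather than the exact Voice API contract, so check the documentation for the precise NCCO and event formats:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Answer an inbound call with an NCCO that speaks a greeting and captures
# the caller's speech, then hand the transcript to an assistant service.

@app.route("/webhooks/answer")
def answer_call():
    ncco = [
        {"action": "talk", "text": "Hello, how can I help you today?"},
        {
            "action": "input",
            "type": ["speech"],
            "eventUrl": [request.url_root + "webhooks/speech"],
        },
    ]
    return jsonify(ncco)

@app.route("/webhooks/speech", methods=["POST"])
def on_speech():
    event = request.get_json() or {}
    results = event.get("speech", {}).get("results", [])
    transcript = results[0].get("text", "") if results else ""
    return jsonify([{"action": "talk", "text": analyse_with_assistant(transcript)}])

def analyse_with_assistant(transcript):
    # Placeholder for the AI hand-off (Watson, in the demo mentioned above).
    return "Let me connect you to someone who can help with that."
```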

Similarly, such an assistant could simply act as a smart minutes-taker, sending transcripts of meetings to participants with each person’s action items specifically highlighted for them.
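A small sketch of that idea, with a hypothetical transcript format and delivery stub:

```python
# Each participant receives the transcript with their own action items
# pulled out. The action-item data and delivery stub are hypothetical.

ACTION_ITEMS = [
    {"assignee": "amira", "item": "Send the Q3 forecast to finance."},
    {"assignee": "ben", "item": "Book the user-testing sessions."},
]

def deliver_minutes(to, body):
    print(f"To {to}:\n{body}\n")  # stand-in for email or chat delivery

def personalised_minutes(transcript, participant):
    own = [a["item"] for a in ACTION_ITEMS if a["assignee"] == participant]
    highlighted = "\n".join(f"  * {item}" for item in own) or "  (none)"
    return f"{transcript}\n\nYour action items:\n{highlighted}"

for person in ("amira", "ben"):
    deliver_minutes(person, personalised_minutes("[meeting transcript]", person))
```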

Stay Agile Because No One Yet Knows the Future

While voice interfaces seem like an entirely new world, they’re pretty much just a new way of doing what technology has always done: make humans more efficient. We’re a little way off from having virtual assistants everywhere but, honestly, it’s closer than you might think.

So, the challenge for businesses preparing today for the voice-led future is to:

  • Start experimenting now with voice, to learn how it fits their business.
  • Stay nimble and not go too far down any one path just yet, in order to remain open to new developments and opportunities.

Staying agile means not locking yourself into off-the-shelf solutions, especially now, when we’re not even sure how usage of voice interfaces will evolve. But that doesn’t mean your dev team needs to build everything in-house. Instead, you can combine cloud communication APIs, such as Nexmo’s, with AI APIs from providers such as Amazon. That way your dev team can focus on building your unique value into your voice offerings, while ensuring the technology driving it underneath is state of the art.
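As one hedged sketch of that composition, the code below keeps your own logic in the middle while delegating intent resolution to a managed AI service (Amazon Lex here); the bot identifiers are placeholders you’d replace with your own:

```python
import boto3

# Pass a caller's transcribed speech (e.g. from a Voice API webhook) to
# Amazon Lex for intent resolution, keeping your business logic in between.

lex = boto3.client("lexv2-runtime", region_name="us-east-1")

def resolve_intent(session_id, utterance):
    """Send one utterance to Lex and return the intent it resolved."""
    response = lex.recognize_text(
        botId="YOUR_BOT_ID",            # placeholder
        botAliasId="YOUR_BOT_ALIAS_ID", # placeholder
        localeId="en_US",
        sessionId=session_id,
        text=utterance,
    )
    # Your unique value lives here: act on the intent with your own logic.
    return response["sessionState"]["intent"]["name"]
```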

Vonage staff
