By Ewan MacLeod in android — Oct 22, 2014

Voice recognition - has it come of age?

“Hey Siri, wake me up at 7 a.m.”…that’s about all I use Apple’s Siri for at the moment, short of asking what song is currently playing (which always amazes me).

It’s not that voice recognition on the iPhone is bad – most of the time it provides reliable and accurate information (at least when I’m in the UK; abroad it’s another story) – but it’s usually not much quicker than navigating to an app or typing in a search query manually. What’s more, the responses can at times be boring and unimaginative.

Much like Joaquin Phoenix’s character Theodore Twombly in the excellent Spike Jonze movie “Her”, I would like an intelligent assistant with a personality that not only responds to requests, but enquires how I’m feeling and provides useful suggestions based around my interests and behaviour.

“Mr. Theodore Twombly, welcome to the world’s first artificially intelligent operating system, OS1. We’d like to ask you a few basic questions before the operating system is initiated. This will help create an OS to best fit your needs.”

I want to believe that Siri actually thinks for herself. I wouldn’t even mind the prosaic tones of HAL from Stanley Kubrick’s 1968 masterpiece “2001: A Space Odyssey”, as long as the homicidal tendencies could be toned down. The first step is to replace Siri’s stilted male voice (the default UK option) with the dulcet tones of OS1 (Scarlett Johansson). Problem nearly solved.

When Apple released iOS 8 in September, I had hoped that Siri would become a little more like Microsoft’s Cortana, nudging me with helpful little suggestions and even the odd witty remark.

Is that too much to ask?

HAL-9000 — The future of voice recognition?

Voice recognition grows up

It was only a few years ago that the idea of speaking to our smartphones seemed like the stuff of science fiction. It’s not that voice recognition software wasn’t available (it had been around for years on desktop computers, for example Dragon NaturallySpeaking), but it just wasn’t well suited to early smartphones, which had meagre amounts of memory and limited processing power. Then there’s the network issue – without always-on 4G networks, all the speech processing had to be done on the phone itself instead of being sent to remote servers.

Desktop PC voice recognition software has never achieved much popularity apart from with speech technology enthusiasts – perhaps because there are other input devices to hand that do the job just as well (keyboards and mice!), and also because those systems needed a training period to recognise speech effectively. They just didn’t work well out of the box.

Mobile phones, on the other hand, are with us every day; voice recognition then becomes a much more useful feature that can draw on contextual clues such as your location and activity. For example, suppose you’re in a bar and you ask your smartphone to “Get me home, I’m drunk”; it could then deduce that you need a taxi from your location to home.
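To make the bar example concrete, here is a minimal sketch of how spoken words and contextual clues might be combined into a single request. It is only a toy illustration: the `Context` class, the `resolve_intent` function, the action names and the keyword rules are all hypothetical, and real assistants rely on far more sophisticated language understanding than a couple of keyword checks.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Contextual clues the phone already knows (values are made up)."""
    location: str       # a place category such as "bar" or "office"
    home_address: str


def resolve_intent(utterance: str, ctx: Context) -> dict:
    """Toy rule-based resolver: combine the spoken words with context."""
    text = utterance.lower()
    wants_home = "home" in text
    # Assumption for illustration: saying "drunk" or being in a bar
    # means the user should not be offered driving directions.
    cannot_drive = "drunk" in text or ctx.location == "bar"

    if wants_home and cannot_drive:
        return {"action": "book_taxi", "from": ctx.location, "to": ctx.home_address}
    if wants_home:
        return {"action": "navigate", "from": ctx.location, "to": ctx.home_address}
    # Fall back to a plain web search when no rule matches.
    return {"action": "web_search", "query": utterance}


ctx = Context(location="bar", home_address="12 Example Street")
print(resolve_intent("Get me home, I'm drunk", ctx))
```

The same words spoken in a different place would produce a different action, which is the point of drawing on context rather than the utterance alone.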

Cortana demonstration — Cortana being demonstrated at Microsoft’s Build conference

A more natural way to interact

Nowadays, of course, nearly everyone with a smartphone is familiar with Siri, Cortana and Google Now as a natural way to search the web, send messages, or simply to elicit a humorous response.

Today’s voice assistants were built upon decades of research that laid the groundwork for what was to come. Siri was originally conceived as an artificially intelligent “do engine” that would allow people to hold a conversation with the Internet, pulling in relevant information from multiple sources to anticipate what you wanted before you even wanted it.

Siri was eventually launched as a standalone iPhone app with a development team of just 24 people, before Apple bought the company in 2010, scuppering a possible deal with Verizon that would have put Siri on every Android phone on its network.

Despite the initial scepticism and claims that voice recognition on a smartphone would not be useful, Microsoft and Google were compelled to create their own digital assistants.

“I don’t believe your phone should be an assistant…Your phone is a tool for communicating…you shouldn’t be communicating with the phone; you should be communicating with somebody on the other side of the phone.” – Andy Rubin, Google

How things have changed…

In the Microsoft camp, Cortana was named after a character in the popular Halo video game. On first impressions, it seems like a more convivial hybrid that combines elements of Siri and Google Now. Cortana even has its own virtual notebook that holds everything she knows about you much in the same way that a real human assistant would, and even though it’s not a real privacy control, it represents her view of you and can be altered at any time.

Despite the best marketing efforts by Google, Apple and Microsoft, many people simply never use their respective digital assistants, a fact highlighted in a survey by Intelligent Voice which found that 85 percent of iPhone respondents have never even used Siri.

Siri Response — Siri often provides a humorous response.

Harnessing the power of the cloud

In Microsoft’s case, Cortana is the coming together of the vast amounts of data held by the various Bing services. In other words, Microsoft didn’t just create Cortana overnight – there’s a vast amount of backend data and complex systems that power her ‘intelligence’.

All of today’s most popular voice recognition services rely on the cloud to process your voice, analysing the words and then inferring meaning – and that’s the hard part, because meaning varies with context and human speech is inherently fluid. Without persistent network access, there’s a limit to what Siri, Cortana and Google Now can do and how much meaningful information they can provide.

Siri uses cloud-based speech recognition technology from Nuance – the same company that powers voice recognition on Samsung’s Gear line of smart watches. In fact, wearables are an area in which cloud processing (via a connected smartphone) will be essential, as those devices typically lack much processing power.

So it’s largely thanks to cloud services that voice recognition and virtual assistants are feasible today. Take Siri as an example – in iOS 8 your speech is recorded, compressed and streamed to the remote server for processing, and the recognised text is then displayed in real time as you speak.
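The record, compress and stream steps can be sketched in a few lines. This is a rough illustration only, not Apple’s or Nuance’s actual protocol: the chunk size, the use of `zlib` as a stand-in for a proper speech codec, and the `fake_server` function (which merely counts bytes instead of recognising words) are all assumptions made for the example.

```python
import zlib
from typing import Iterable, Iterator

# Assumed chunk size: 100 ms of 16 kHz, 16-bit mono audio.
CHUNK_BYTES = 3200


def chunk_audio(pcm: bytes, size: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Split recorded audio into small pieces so it can be sent while you speak."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def compress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Compress each chunk before it leaves the phone (zlib stands in for a speech codec)."""
    for chunk in chunks:
        yield zlib.compress(chunk)


def fake_server(packets: Iterable[bytes]) -> Iterator[str]:
    """Stand-in for the remote recogniser: reports bytes received, not real text."""
    received = 0
    for packet in packets:
        received += len(zlib.decompress(packet))
        yield f"partial result after {received} bytes"


recording = bytes(8000)  # silent placeholder audio
for update in fake_server(compress_stream(chunk_audio(recording))):
    print(update)
```

Because each stage is a generator, a result comes back for every chunk sent rather than only at the end of the recording, which is roughly why the text can appear on screen while you are still speaking.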

All digital assistants are not created equal

At one time, Siri was regarded as one of the best smartphone assistants around with superior voice recognition capabilities. This helped Apple to promote Siri over the competition as a standout feature that was unique to iPhone customers, but more recently Google Now appears to have the upper hand.

A study published this week by Stone Temple Consulting tested over 3000 queries on all three platforms, comparing the accuracy and the number of mistakes. Ultimately, Google Now was declared the winner with 88% of its queries completely (correctly) answered, followed by Siri at 53% and Cortana in last place at 40%. However, it’s just as easy to find other studies which claim Cortana beats Siri, or Siri beats everything else.

Answered Queries — Google Now came out on top as the most capable smartphone assistant.

Clearly, there is substantial room for improvement for all smartphone digital assistants and the voice recognition and processing technologies that drive them.

Where is voice recognition going?

Voice recognition will benefit from advancements in related technologies such as facial recognition and gesture control; with these combined data sources, our smartphones will be able to determine more accurately the context of what we mean. Voice recognition capabilities will also benefit consumer areas such as television and the automotive industry – Apple launched its Siri-enabled CarPlay initiative earlier in 2014, and rumours of a voice-controlled Apple TV still circulate, mainly because Steve Jobs told biographer Walter Isaacson that he had “finally cracked the TV interface”.

Currently, however, voice control works well on smartphones because we speak directly into the microphone; unfortunately, these other scenarios are often prone to noise and interference. A future in which you could ask Siri to play a movie and turn out the lights without having to speak into a smartphone would be intuitive and convenient.

With the advent of wearables and smart watches, voice recognition should become indispensable as a way to interact with such small devices.

But unlike Theodore Twombly, the day when we fall in love with our smartphone OS still seems a long way off…