Speech Recognition and Artificial Intelligence

Every so often, I’ll get a little pissed off and start wondering aloud, “Where the hell are my talking computers?”

Seriously, though – it’s 2008. Ten years ago, we were sure that by now, speech recognition would have surpassed the keyboard as the primary means of input. Hell, we’ve been predicting it for so long, it’s become something of a hollow prediction – a lot like the “flying cars” one.

But really, why aren’t we all talking to our computers? The answer, in my opinion, is that we haven’t developed artificial intelligence enough yet.

Why is artificial intelligence important for speech recognition, you ask? Let me explain.

We’ve had “basic” speech recognition for some time now. “Dragon NaturallySpeaking” has been touted to me as the be-all, end-all of speech recognition software since somewhere around 1998 – and I’m still not using it. Nor is anyone else – at least not on any large scale. And there’s a very, very good reason for that – it’s simply not good enough.

Now, I’m not saying that speech recognition isn’t getting better at recognizing words and so forth, but at this point, using your computer via voice commands is a bit like trying to operate it through the front panel of the original Altair 8800. Oh sure, each individual switch works quite well – but try teaching your grandmother to check her email by flicking eight toggle switches on a front panel with a few lights on it. That’s about where voice recognition is right now.

You see, there’s a very important “missing piece,” which is context. Or, to put it another way, consciousness.

In order for a speech recognition system to follow instructions given by a human being in plain speech, that system needs to actually understand plain human speech – which, more often than not, means understanding the context in which it’s used. And to understand context like that, you need a rudimentary consciousness – something that has awareness – not necessarily of itself, but of what it’s working with. And we simply don’t have that yet.

Take an example.

Imagine you’re composing a message. You’re going to send it to your friend, “Bob.” Here’s how you’d use voice commands today:

Command mode. Open Email. Compose message. Dictation mode. “Hi Bob comma how are you doing today question mark capital I am doing just fine comma we enjoyed dinner with you last week period command mode backspace word backspace word command mode” Alt, File, S, Tab, Tab, Tab, Enter. Close Program.

And that’s with minimal errors – in reality, you’d be using the “backspace” or “undo” command quite often. And because speech recognition has no context, no consciousness, you need to tell it explicitly when you move from giving commands about what to do with the computer (basically, using voice commands as a slow and unreliable mouse pointer) to “dictation mode,” where it just writes what you say – basically acting like a bad transcriber. It’s slow, cumbersome, and unreliable. And until it becomes faster and easier (and, to a certain extent, cheaper) than using a keyboard and a mouse, it will remain a fringe method of input.
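To make that clunkiness concrete, here’s a toy sketch in Python of the mode-switching bookkeeping described above. There’s no real recognizer here – the token names are invented for illustration – but it shows the core problem: the *user* has to track which mode the machine is in, because the machine can’t tell commands from dictation on its own.

```python
# A toy model of mode-switched voice input: every token is either a
# command or dictated text, and the user must announce which is which.
# Token names ("dictation mode", "backspace word") are invented here.

class VoiceSession:
    def __init__(self):
        self.mode = "command"   # sessions start in command mode
        self.buffer = []        # dictated words so far

    def hear(self, token):
        # Mode switches are honored in either mode.
        if token == "dictation mode":
            self.mode = "dictation"
        elif token == "command mode":
            self.mode = "command"
        elif self.mode == "dictation":
            self.buffer.append(token)       # transcribe verbatim
        elif token == "backspace word":
            if self.buffer:
                self.buffer.pop()           # delete last dictated word
        # ...every other command ("open email", "send") would go here

    def text(self):
        return " ".join(self.buffer)


session = VoiceSession()
for tok in ["dictation mode", "Hi", "Bob", "command mode",
            "backspace word", "dictation mode", "Robert"]:
    session.hear(tok)
print(session.text())  # -> Hi Robert
```

Notice that fixing a single word took three explicit mode switches – and if you forget one, the command itself gets typed into your email.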

Contrast this with a voice command session with a computer equipped with speech recognition and a rudimentary AI:

Computer, begin new email to Bob. “Hey Bob, how are you doing today? I am doing just fine, we enjoyed dinner with you last week.” Send message.

Which one do you think most people could adapt to more quickly – the first one, or the second one?

Remember also that we haven’t even touched upon corrections. With AI, you could say “no, wait, make that ‘I’m doing just fine’” and the computer would know (based partly on your emphasis on “I’m,” and partly due to its awareness of the sentence structure itself and the context in which it was used) which phrase to replace. Just you try that with today’s speech recognition!
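Even without real awareness, you can get a faint shadow of that behavior from plain string similarity. Here’s a crude sketch in Python (using the standard library’s difflib – no AI, no understanding of emphasis or grammar, just character matching) that guesses which phrase a correction is meant to replace. The function name and example sentence are mine, not from any real product:

```python
import difflib

def correct(sentence, replacement):
    """Replace the phrase in `sentence` most similar to `replacement`.

    A crude stand-in for contextual awareness: slide a window of the
    same word-length as the replacement across the sentence and pick
    the window with the highest character-level similarity.
    """
    words = sentence.split()
    n = len(replacement.split())
    best_i, best_score = 0, 0.0
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        score = difflib.SequenceMatcher(
            None, window.lower(), replacement.lower()).ratio()
        if score > best_score:
            best_i, best_score = i, score
    return " ".join(words[:best_i] + replacement.split() + words[best_i + n:])


# Suppose the recognizer misheard "dinner" as "winner":
print(correct("we enjoyed winner with you last week", "dinner with you"))
# -> we enjoyed dinner with you last week
```

It works for a lucky misrecognition like this one, but it falls apart the moment the correction differs in length or meaning rather than spelling – which is exactly why real context, not string matching, is the missing piece.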

I’m not sure if AI research is being pursued as much as it should be – I have a sneaking suspicion it’s not (probably due to fear of runaway AI and other ethical concerns). And maybe that’s a good thing, in the long run. But I’d like to see this sort of thing happen, and happen soon. Because I’m tired of typing – I want to talk to my computer.

I mean, seriously… it’s 2008! Wasn’t this sort of thing supposed to happen like 7 years ago, at least? Whatever happened to “life imitating art”?

I’m waiting…