(Let me apologize before you start for the length of this blog entry. If it were a magazine article I’d spend hours more trying to edit it to perhaps half its current length. But this is a blog, and the thing about blogs is that they are usually stream of consciousness rather than highly thought through and edited. And when I stream my thoughts, well…)
One of my recent postings brought up a reply that essentially says “touch and gestures is old thinking, I want a speech-based user interface”. Ah, wouldn’t we all? Generalized speech recognition is one of the “Holy Grails” of Computer Science. I can still remember one of my friends returning from Carnegie Mellon University for a summer break in the mid-1970s and going on about how generalized speech recognition (he’d been working on Hearsay-II) was right around the corner. 35ish years later and we are still not quite there. I still pick on him about it. A couple of years ago I teased Microsoft’s Chief Research Officer (and former CMU professor), Rick Rashid, about this. Rick correctly pointed out that we have come a long way and that speech recognition is now entering widespread, if more targeted, use. So I’m going to talk about the evolution of the computer User Interface, where we seem to be with speech, and why speech may never become the primary mode of computer interaction.
When it comes to direct human interaction with computers the way it all started was by taking existing tools and figuring out how to wire them up to the computer. We had typewriters, so by hooking a typewriter to the computer you could input commands and data to it and the computer could print its output. We had oscilloscopes, so by hooking one up to the computer we could output more graphical information. We had to create a language you talked to the computer in, and those command line (aka command shell) languages became the primary means of interacting with computers in the 1960s, 70s, and 80s. Even today Linux, Windows, Mac OS, etc. all have command line languages and they are often used to perform more esoteric operations on the systems. The nice thing about command line languages is that they are dense and precise. The bad thing is that they are unnatural (requiring wizard-level experts who have trained on and utilized them for years).
These three attributes, density (how much information can be conveyed in a small space), precision (how unambiguous is the information conveyed), and how natural (to the way humans think and work) can be used to evaluate any style of computer interaction. The ideal would be for interactions to be very dense, very precise, and very natural. The reality is that these three attributes work against one another and so all interaction styles are a compromise.
As far back as the 1960s researchers were looking for a more natural style of computer interaction than command lines. And obviously Science Fiction writers were there too. For example in the original Star Trek we see interactive graphic displays, tablet style computers, sensor-based computers (e.g., Tricorder) and computers with full speech recognition. Who can forget Teri Garr’s amazement at seeing a speech controlled typewriter in 1968’s “Assignment: Earth” episode? Yet these were all truly science fiction at the time. Interestingly Star Trek never showed use of a computer mouse, and in the Star Trek Movie “The Voyage Home” when Scotty sees one he has no idea what it is. I find that interesting because the computer mouse was invented in 1963, although most people would never see one until the 1990s.
The command line world wasn’t static and continued to evolve. As video terminals began to replace typewriter-style terminals (or “teletypes”) they evolved from being little more than glass teletypes to being capable of displaying forms for data input and displaying crude graphics for output. Some more human-oriented command languages, such as the Digital Command Language (DCL) appeared. Some command line processors (most notably that of DEC’s TOPS-20) added auto-completion and in-line help, making command lines much easier to use by non-experts. Of all these only Forms altered the basic Density, Precision, Naturalness equation by allowing Task Workers (e.g., order entry clerks) to make use of computers. After all, filling out forms is something that humans have been doing for at least a couple of centuries.
In the 1960s and 1970s Stanford Research Institute’s ARC and Xerox’s PARC continued to work on better ways to interact with computers and produced what we now know as the Graphical User Interface (GUI), based on Windows, Icons, Menus, and Pointers (WIMP). While WIMP is far less dense than command line based systems, it maintains their precision. Density is still important however, which is why keyboard shortcuts were added to Microsoft Windows. But most importantly, WIMP is far more natural to use than command lines due to the desktop paradigm and visual cues it provides. It was GUI/WIMP that allowed computers to fully transition from the realm of computer specialists to “a computer on every desk and in every home”.
Work continued on how to make computers even more natural to use. One of the first big attempts was Pen Computing and Handwriting Recognition, which had its roots in the 1940s (or as far back as 1888 if you want to stretch things). There was a big push to bring this style to the mainstream in the late 1980s and early 1990s, but it failed. High costs, poor handwriting recognition, and other factors kept pen computing from catching on. It was neither dense nor precise enough. This style enjoyed a bit of a renaissance in the late 1990s with the introduction of the Palm Pilot, which eschewed general handwriting recognition in favor of a stylized pen input technique known as Graffiti. The Palm Pilot was also a limited function device, which allowed it to be well tuned for Pen use. This led to further use of a Pen (aka Stylus) in many PDAs and Smartphones. However, the more general purpose the platform (e.g., Smartphones or a PC) the more tedious (lack of density) Pen use became. In other words, the use of a Pen as just another pointer in a WIMP system was just not very interesting.
This finally brings us to the user interface paradigm that will dominate this decade, Touch and Gestures (Touch). Touchscreens have been around for many years, at least back to the 70s. But they generally had limited applicability (e.g., the check-in kiosk at the airport). When Apple introduced the iPhone, dropping WIMP and bypassing Pen Computing in favor of a Touch-based UI, it really did change the world. To be fair Microsoft introduced these at the same time, but in a very limited production product known as Surface. So Apple gets the real credit. Touch trades away density and precision to achieve a massive leap in how natural it is for a human to interact with the computer. The tradeoff works really well for content consumption, but is not good for content creation. So WIMP, which is a great content creation paradigm, is likely to live on despite the rise of Touch. The place most users probably notice Touch’s precision problems is when a series of links on a web page are stacked on top of one another. Your finger can’t quite touch the right one (there it is, lack of precision). If you are lucky you can use a gesture to expand and then position the page so you can touch the right link (requiring more operations, which is less dense than WIMP would allow), but sometimes even this doesn’t work. Now expand this to something like trying to draw a schematic, or a blueprint, and you can see the problems with Touch and why WIMP will continue to survive. For another example consider how much easier it is to book a complex travel itinerary (tons of navigation and data input) on your PC versus doing the same on your iPad. It is one of the few activities where I feel compelled to put down my iPad and move to my PC. Writing this blog is another. Touch is great for quick high-level navigation to content you want to view. It is painful for performing precise and/or detailed input.
Speech-based user interface research dates back to the 1950s, but took off in the 1970s. You can really split this into Speech output and Speech recognition. As I pointed out earlier, the big joke here is that generalized speech recognition is always right around the corner. And has been for almost 40 years. But speech synthesis output has been commercially successful since 1984’s introduction of DECtalk. DECtalk was a huge hit and 27 years later you can still hear “Perfect Paul” (or “Carlos” as he was known to WBCN listeners, which included so many DECies that most of us forgot the official name), DECtalk’s default voice, from time to time. But what about Speech recognition?
If you own a Windows XP, Windows Vista, or Windows 7 PC then you have built-in speech recognition. Ditto for the last few versions of Office. How many of you know that? How many of you have tried it? How many use it on a regular basis? I’d love it if Microsoft would publish the usage statistics, but I already know they would indicate insignificant usage. My father used to call me up and say “hey, I saw a demo of this thing called Dragon that would let me write letters by just talking into the computer”. He did this more than once, and each time I told him he had that capability in Microsoft Word, but to my knowledge he never actually tried it. I did meet a lawyer who threw away her tape recorder and began using Dragon NaturallySpeaking for dictation, but I think she was a special case. Frankly, in all the years I’ve heard about speech recognition she is the only layperson (or non-physically challenged person) I’ve met who uses it on such a general and regular basis. More on her situation later. Meanwhile my own attempts to use this feature demonstrated its weakness. It works great until you have to correct something, then its use becomes extremely tedious (lack of precision and density), and complex changes require the use of a pointing device (or better put, you go back to WIMP).
It’s not just that you can do dictation in Microsoft Word or other applications, you can control your Microsoft Windows machine with it. However, I can’t see many people doing this for two reasons. One is speech’s lack of both density and precision. The other is that layering speech on top of a WIMP system makes everything about speech’s lack of density and precision worse. File->Save As->… is just too tedious a command structure to navigate with speech. But the most important indictment of speech as the primary form of computer interaction is that it is far less natural than people assume.
Think about how annoying it is for someone to take a cell phone call in a restaurant. Or why do you suppose that most U.S. Airlines have decided not to install microcells on their planes so you can use your cell phone in flight (and even those with in-flight WiFi are blocking Skype and VOIP services)? And how proper is it for you to whip out your cell phone and take a call in the middle of a meeting? Or think about how hard it is to understand someone in a crowded bar, at a rock concert, in an amusement park, or on a manufacturing floor. Now imagine talking to your computer in those same circumstances. Your co-workers, fellow diners, or seatmates will want to clobber you if you sit around talking to your computer. And you will want to slit your own throat after a few experiences trying to get your computer to understand you in a noisy environment. Speech is a highly flawed communications medium that is made acceptable, in human to human interaction, by a set of compensating mechanisms that don’t exist in a human to computer interaction.
I recently read about a study that showed that in a human to human conversation comprehension rises dramatically when you can see the face of the person you are talking to. Our brains use lip-reading as a way to autocorrect what we are hearing. Now maybe our computers will eventually do that using their cameras, but today they are missing this critical clue. In a human to human interaction body language is also being used as a concurrent secondary communication channel along with speech. Computers don’t currently see this body language, nor could they merge it with the audio stream if they did. In human to human communications the lack of visual cues is what makes an audio conference so much less effective than a video conference, and a video conference so much less effective than an immersive experience like Cisco’s Telepresence system, and Telepresence somewhat less effective than in-person meetings. And when you are sitting in a meeting and need to say something to another participant you don’t speak to them, you slip them a note (or email, instant message, or txt them even though they are sitting next to you).
I use speech recognition on a regular basis in a few limited cases. One of the ones I marvel at is United Airlines’ voice response system (VRP). It is almost flawless. In this regard it proves something we’ve long known. You can do generalized speech recognition (that is, where the system hasn’t been trained to recognize an individual’s voice) on a restricted vocabulary or you can do individualized recognition on a broader vocabulary. For example, getting dictation to work requires that you spend 15 or more minutes training the software to recognize your voice. I imagine that specialized dictation (à la medical or legal) takes longer. United has a limited vocabulary and so it works rather well. My other current usage is Windows Phone 7’s Bing search. I try to use speech recognition with it all the time, and it works maybe 70% of the time. There are two problems. The first is that if there is too much noise (e.g., other conversation) around me then it can’t pick up what I’m saying. The bigger one is that if I say a proper noun it will often not come close to the word I’m trying to search on. Imagine all the weird autocorrect behaviors you’ve seen, on steroids. Autocorrect is a great way to think about speech recognition, because after software converts raw sound into words that sound similar it uses dictionary lookups and grammatical analysis to guess at what the right words are. I suggest a visit to http://damnyouautocorrect.com/ for a humorous (and, warning, sometimes offensive) look at just how off course these techniques can take you.
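To make the autocorrect comparison a bit more concrete, here is a toy Python sketch of the dictionary-lookup step. It is only an illustration under some loud assumptions: the word list and the `guess_word` function are made up for this post, and spelling similarity (via the standard library’s `difflib`) stands in for the acoustic similarity a real recognizer would use. It is not how Bing, Dragon, or United’s system actually work.

```python
import difflib

# Hypothetical, tiny vocabulary. A real recognizer's dictionary is vastly
# larger, but it is still finite, which is the point of this sketch.
VOCABULARY = [
    "weather", "whether", "seattle", "settle",
    "restaurant", "reservation", "flight", "fight",
]


def guess_word(heard, vocabulary=VOCABULARY, cutoff=0.6):
    """Snap a rough transcription to the most similar known word.

    Assumption: spelling similarity is used here as a stand-in for
    "sounds similar". Returns None when nothing in the vocabulary is
    at least `cutoff` similar to what was heard.
    """
    matches = difflib.get_close_matches(
        heard.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for heard in ["flite", "xyzzy"]:
        print(heard, "->", guess_word(heard))
```

A word that is close to something in the dictionary gets pulled toward it, and a word with nothing similar in the dictionary gets no guess at all. A proper noun that isn’t in the dictionary faces exactly those two outcomes: no match, or the nearest ordinary word, which is roughly what I see when I search for names. It also hints at why a restricted vocabulary like United’s works so well, since there are far fewer wrong words to land on.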
Let’s get to the bottom line. Speech has horrible precision, poor density, and there are social factors that make it natural in only certain situations.
So what is the future of speech? Well first of all I think the point uses of it will continue to grow dramatically. Things like United Airlines’ VRP. Or the lawyer I mentioned. She used to dictate into a tape recorder then pay a transcription service to transcribe the tape. She would then go back over the transcript and make corrections. The reason that a switch to Dragon NaturallySpeaking worked for her is that the correction process took her no more time than did fixing the errors the transcription service introduced. And it was a lot cheaper to have Dragon do the initial transcription than to pay a service. So certainly there are niches where speech recognition will continue to make inroads.
The bigger future for speech is not as a standalone user interface technology but rather as part of a full human to humanoid-style of interaction. I can say “play” or touch “play” to play a video. I can merge sensory inputs, just as humans do, to figure out what is really being communicated. I can use a keyboard and/or pointer when greater precision is required, just as humans grab white boards and other tools when they can’t communicate with words and gestures alone. And I can project output on any display (the same one you use as your TV, your phone, a dedicated monitor, the display panel on your oven, the speakers on your TV or audio components, etc.). This is the totality of a Natural User Interface (NUI). Speech doesn’t become truly successful as a user interface paradigm of its own. It shines as part of the NUI that will dominate the next decade.
I really think it will take another 8-10 years for a complete multi-sensor NUI (née Humanoid UI) to become standard fare, but Microsoft has certainly kicked off the move with the introduction of Kinect. It’s primitive, but it’s the best prototype of the future of computing that most of us can get our hands on. Soon we’ll be seeing it on PCs, Tablets, and Phones. And a decade from now we’ll all be wondering how we ever lived without it.