Evidence-Based Information, Training and Tools
for Optimizing the Usability of Computer Systems
Speech Recognition
February 2000
Why is it taking so long for speech to be used as a primary input method?

Introduction

Automatic speech recognition technology has been under development for over 25 years, but it has not yet come into widespread use. One of the main reasons speech recognition has not gained greater acceptance is that speech recognition errors are fundamentally different from keying errors. Most keying errors can be traced back to the user, whereas most speech errors are traced back to the computer's mis-recognition of the speech. In the latter case, the user's input simply does not match the computer's output.

Even though people can dictate faster than they can type, actual throughput is usually much slower with automatic speech recognition systems than with keying. A major problem is that error correction takes much longer with speech. The most common correction methods for speech input are: (a) deleting and repeating the last phrase, …

Multimodal Correction

Past studies have suggested that switching modality could speed up the interactive correction of recognition errors. Suhm, Myers and Waibel (1999) at Carnegie Mellon University found that switching between modalities eliminated repeated recognition errors. When users simply repeated their speech to correct errors, correction accuracy was much lower than when they switched to a different modality (keyboard and mouse). The effectiveness of keyboard correction depended on the user's typing skill. For example, the fastest typists using “keyboard and mouse” made almost three times as many corrections per minute as subjects who corrected errors using “voice-only.” The authors concluded that multimodal correction strategies could reliably expedite error correction in speech user interfaces.

Throughput

Throughput is the number of correct words produced per minute.
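Because throughput counts only correct words, every second spent fixing errors comes directly out of it. The following is a minimal Python sketch of that relationship, built on simplifying assumptions that are not taken from the studies discussed here: every recognition error is eventually found and fixed, and each fix takes a fixed average time. Under those assumptions, minutes per correct word = 1 / speaking rate + (1 − accuracy) × seconds per correction / 60. The names `DictationSession` and `throughput_wpm` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictationSession:
    """Hypothetical parameters for one dictation session (all assumed)."""
    speaking_rate_wpm: float       # uncorrected words spoken per minute
    accuracy: float                # fraction of words recognized correctly (0..1)
    seconds_per_correction: float  # average time to fix one mis-recognized word


def throughput_wpm(session: DictationSession) -> float:
    """Correct words per minute, ASSUMING every error is found and fixed.

    Minutes per word = dictation time + expected correction time per word.
    """
    if not 0.0 <= session.accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if session.speaking_rate_wpm <= 0 or session.seconds_per_correction < 0:
        raise ValueError("speaking rate must be positive; correction time must be non-negative")
    dictation_min = 1.0 / session.speaking_rate_wpm
    correction_min = (1.0 - session.accuracy) * session.seconds_per_correction / 60.0
    return 1.0 / (dictation_min + correction_min)


if __name__ == "__main__":
    # 107 wpm is the speaking rate reported for Karat et al.'s fastest users;
    # the accuracy and correction times are illustrative assumptions, not study data.
    voice_only = DictationSession(107, 0.90, 30.0)
    multimodal = DictationSession(107, 0.90, 10.0)  # assumed 3x faster fixes
    for name, s in [("voice-only", voice_only), ("multimodal", multimodal)]:
        print(f"{name:>10}: {throughput_wpm(s):5.1f} correct words/min")
```

With these assumed values (90% accuracy, 30 versus 10 seconds per fix), the sketch prints about 16.9 versus 38.4 correct words per minute. The numbers are not a prediction of any study's results; the point is that at realistic accuracy levels, correction time rather than speaking rate tends to dominate throughput.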
The key variables are: (a) the accuracy of the speech recognition system, …

Lewis (1999) at IBM evaluated the performance of participants using a speech recognition dictation system. Participants were trained in one of two correction strategies, either “voice-only” or “voice, keyboard and mouse.” In both cases, users spoke at about 105 uncorrected words per minute. The multimodal (voice, keyboard and mouse) corrections were made three times faster than the “voice-only” corrections and yielded 63% more throughput.

Keyboard Correction Faster

Karat et al. (1999) at IBM evaluated three speech recognition products, with users correcting errors using either “voice-only” or “keyboard and mouse.” Participants were native English speakers with good typing skills. Each person trained one of the speech recognition systems to recognize his or her voice more readily and then completed two tasks: copying text from a novel and composing replies to questions. The fastest users spoke at an average of 107 uncorrected words per minute, which resulted in about 25 corrected words per minute. The “keyboard-mouse” group completed almost three times as many words per minute as the “voice-only” group.

Participants observed that they were usually aware when a typing error occurred, but were much less confident that they would notice when a speech error occurred. As a result, users must either glance constantly at the display for errors or rely heavily on proofreading after they finish speaking.

Conclusions

It seems that the primary reasons developers are avoiding speech for input are that: (a) speech recognition systems are still somewhat unreliable, and …
References

Karat, C.M., Halverson, C., Horn, D. and Karat, J., Patterns of Entry and Correction in Large Vocabulary Continuous Speech Recognition Systems, CHI 99 Conference Proceedings, 568-575 (1999).

Lewis, J.R., Effect of Error Correction Strategy on Speech Dictation Throughput, Proceedings of the Human Factors and Ergonomics Society, 457-461 (1999).

Suhm, B., Myers, B. and Waibel, A., Model-Based and Empirical Evaluation of Multimodal Interactive Error Correction, CHI 99 Conference Proceedings, 584-591 (1999).
Contact Dr. Bob Bailey at (801) 201-2002 or bob@webusability.com. Copyright 2002-2005.