Hi Henrik / Eric,

Although I don't want to get embroiled in your discussions, I have a question to ask on this.
How does voice recognition work? Does it use word parts, as in a TTS engine like eSpeak but in reverse, or does it maintain a dictionary of actual words? I presume that the problems you are corresponding about are not in the way the STT engine works but in the way it interprets the input?

Fascinating discussion - thanks

Ian

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Eric S. Johansson
Sent: 23 February 2007 19:45
To: Henrik Nilsen Omma
Cc: Ubuntu Accessibility Mailing List
Subject: Re: Voice Recognition for Linux

Henrik Nilsen Omma wrote:
> Eric S. Johansson wrote:
> Looks like the original text got caught in a spam filter somewhere
> because of the attachment (I found it in the web archives). No worries
> about the tone. We are having a frank technical discussion and need to
> speak directly to get our points across. So my turn :) ...

Thanks for the understanding, but it always helps to be polite.

> I think you are too caught up in the current working model of NS to see
> how things can be done differently.

You haven't seen the comments I've made in the past about speech user interfaces and what Dragon has done wrong. I have proposed many things that should be fixed, but the current command model is not one of them.

> I have not studied the details of voice recognition and voice models,
> ... but I do appreciate the need for custom voice model training over time.
> There is a need for feedback, but it does _not_ need to be real-time.
> Personally, I would prefer it not to be real time. NS does in theory
> tout this as a feature when they claim that you can record speech on a
> voice recorder and dump it into NS for transcription. I have no idea
> whether that actually works.

Okay, I should probably attempt to capture some of the user experience issues. Correction of misrecognitions is something people debate a lot. If you don't correct misrecognitions, you'll most likely get the same thing over and over again.
The output of the language and recognition model is probabilistic, so misrecognitions will change from time to time, but it'll basically be the same kind of misrecognition (yes, all uncorrected). The user is then faced with a choice: do you correct the recognition engine or do you edit the document? In both cases, it's painful. But then you get the odd case where the misrecognition is completely unintelligible and you don't have any idea what the hell you said. Then you have no choice but to go back, listen to what was said at that phrase, and make a correction. This is a very real user experience. I have spoken with people who write documents in Microsoft Word; they'll go back to page 5 out of 20, see something that's garbled, and play it back so they can figure out what they said. They usually don't correct heavy garbling but just say it again and get a more consistent recognition from that point forward, courtesy of the incremental training.

In theory, you can dictate into most applications using something called natural text. It's direct text injection with a history of what was said (audio and recognition). You can do limited correction by Select-and-Say, and it even sort of kind of works if it's a full native Microsoft Windows application. Tools like Thunderbird, gaim, and Emacs don't work so well. How they feel is for a later discussion.

But you have this nice tool, that's almost right, called the dictation box. It's a little window which has full editing and correction capability using the voice model of NaturallySpeaking. When you are done with your dictation, you can inject it into the application it's associated with. The wonderful thing about the dictation box is that making corrections significantly improves accuracy. If I dictated into nothing but the dictation box for a week, I would have a significantly more accurate system and a lower level of frustration over misrecognitions.
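(An aside from the editor: the "same misrecognition over and over until you correct it" behavior can be sketched with a toy model. This is purely illustrative, nothing like the actual NaturallySpeaking internals; the class, the audio token, and the weights are all invented.)

```python
# Toy sketch of why an uncorrected misrecognition is stable: the model is
# probabilistic, but the argmax over candidate words doesn't move until a
# correction shifts weight onto the right word. All names/numbers invented.

class ToyRecognizer:
    def __init__(self):
        # score[audio_token][word]: strength of the audio -> word mapping
        self.score = {"eh-ler-nd": {"he learned": 0.6, "you learn to": 0.4}}

    def recognize(self, audio):
        # Same audio, same winner: the same misrecognition keeps coming back
        candidates = self.score[audio]
        return max(candidates, key=candidates.get)

    def correct(self, audio, right_word):
        # Incremental training: an explicit correction boosts the right word
        self.score[audio][right_word] = self.score[audio].get(right_word, 0) + 0.5

r = ToyRecognizer()
before = r.recognize("eh-ler-nd")        # persistent error: "he learned"
r.correct("eh-ler-nd", "you learn to")   # one correction in the dictation box
after = r.recognize("eh-ler-nd")         # now "you learn to", from here on
```

Editing the document instead of correcting would leave the weights untouched, which is the point Eric is making: only correction feeds the model.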
If I had whatever magic the dictation box uses on all of my applications, I would be ecstatic. I wouldn't need to retrain every six months. But it's not sufficient; why is, again, a conversation for a future time.

If you want to migrate away from incremental recognition, you'll need to look to NaturallySpeaking 3 or NaturallySpeaking 4 for the user experience. You would probably lose 1 to 2% (or more) on the accuracy, which is really significant. Believe me, there's a huge difference between 99% and 99.5% recognition accuracy in actual operating conditions. It's also important to note that Dragon changed the incremental correction model a couple of times. The last time I was in touch with Dragon employees (before the Bakers got greedy), they were really convinced that incremental training, properly done, gave a significantly better user experience, and I would have to say, from what I hear and from what I have experienced, I think they were right. Maybe they were drinking their own Kool-Aid, maybe they were onto something. I am no stranger to figuring out interesting ways to get the signals you need to do something right, so I trust them.

But independent of your desire, you may not be able to turn it off. You may have users who know how it works making your life uncomfortable because you have made their life less pleasant. You will have me demanding the highest possible accuracy. :-)

I think at this point it would be a really good idea for you to go purchase a copy of NaturallySpeaking 9 Preferred. Get a really good headset. The one that comes in the box is a piece of crap. No, seriously, it's really bad. I can give you some recommendations on headsets (VXI mostly), but I really, really love my VXI Bluetooth wireless headset. It is just so sweet. It has some flaws, but it's really sweet.

> I don't really want to interact with the voice engine all the time, I
> want it to mostly stay out of my way.
> I don't want to look at the little
> voice level bar when I'm speaking or read the early guesses of the voice
> engine. I want to look out the window or look at the spreadsheet that
> I'm writing an email about :) The fact that NS updates the voice model
> incrementally is actually a bad feature. I don't want that. If I have a
> cold one day or there is noise outside or the mic is a bit displaced the
> profile gets damaged. That's probably why you have to start a fresh one
> every six months.

Can you use your keyboard without the Delete or Backspace key? Or even the arrow keys? The correction dialog I'm talking about is as core to your daily operation as those keys are. As for changing focus, sure, you can do it, but only if you have an application which is sufficiently speech aware to record your audio track at the same time and be able to play back a segment you think is an error. It's the only way you'll make corrections, unless you have a memory which is a few orders of magnitude better than mine. I should also note that if you don't have a clear and accurate indication of what's a misrecognition error, correcting something that is right can make your user model go bad quickly. At least, so I am told. Of course, I've never done anything like that, no, no way. Uh-huh.

> Instead of saving my voice profile every day, I would like to save up a
> log of all the mistakes that were made during the week. I would then sit
> down for a session of training to help NS cope with those words and
> phrases better. I would first take a backup of my voice profile, then
> say a few sample sentences to make sure everything was generally working
> OK. I would then read passages from the log and do the needed correction
> and re-training. I would save the profile and start using the new one
> for the next week. I would also save profiles going back four weeks, and
> once a month I would do a brief test with the stored up profiles to see
> if it had degraded over time.
> If it had, I would roll back to an older
> one and perhaps do some training from recent logs too. There is no
> reason a voice profile should just automatically go bad over time.

Now you're thinking like a geek. Ordinary users eventually learn when to save a profile based on the type and number of corrections they make. They don't test them; they just save them and count on the system to automatically back up every few saves. I don't save mine every day, and I only save my profile when I correct really persistent misrecognitions. If I'm getting a cold or hay fever, I definitely don't save, but I also suffer from reduced recognition for a few days.

User reluctance to put in the effort is the reason why you train on a document once at the beginning. I usually choose a couple of different documents to train on after a month on a new model, but I am a rarity. I described this behavior in a white paper I wrote called "spam filters are like dogs". You have expert trainers and you have people whose dogs crap on the neighbors' lawns. Same category of animals, with roughly the same skill potential, but very different training models. NaturallySpeaking is trying to take advantage of the "less formal" behaviors for training, and they're doing a pretty good job at succeeding with those signals. Don't force the ordinary user to train at an expert level. It won't work, it will just piss them off, and it will discourage if not drive away the moderately expert user who wants to work in the way they are comfortable.

> The fact that you have to constantly interact with the voice engine is
> not a feature, it's a bug! It's just that you have adapted your
> dictation to work around it. It's not at all clear that interactive
> correction is better than batched correction. It certainly should not be
> seen as a blocker for a project like this going forward. I wouldn't want
> to spend years on a project simply to replicate NS on Linux. There is
> plenty of room for improvement in the current system.
You constantly interact with your computer and expect from it a bunch of feedback. This is no different. You're not looking at speech levels, but you may be looking at load averages, the time of day, alerts about e-mail coming in, the cursor position in an editor buffer, color changes for syntax highlighting. These are all forms of feedback. Incremental training and looking at recognition sequences are just different forms of feedback. He learned to incorporate it in your operation. ("He learned" is a persistent misrecognition error that mostly shows up when using natural text. Because I'm not in a place where I can correct it often enough, it keeps showing up. If I were in the dictation box right now, it would be mostly gone. This is why incremental recognition correction is so very, very important. Batch training has never made this go away, and I've tried. The only thing that has succeeded has been incremental correction in one context.)

> OK, now for some replies:

You mean the above weren't enough? :-)

>> There is a system that art exists that does exactly what you've
>> opposed.
> [assuming you meant 'proposed' here] Unlikely. If a system with that
> level of usability existed it would already be in widespread use.
>
>> While it was technically successful, it has failed in that nobody
>> but the originator uses it and even he admits this model has some
>> serious shortcomings.
>
> What system, where? What was the model and what were the shortcomings?

http://eepatents.com/ but the package is no longer visible. Ed took it down a while ago. His package used xinput direct injection. He used a Windows application with a window to receive the dictation information and inject it into the virtual machine. He was able to do straight injection of text, limited by what NaturallySpeaking put out. I think he did some character sequence translations, but I'm not sure. He couldn't control the mouse, couldn't switch windows, and had only global commands, not application-specific commands.
I could be wrong on some of these points, but that's basically what I remember. There was also a bunch of other stuff, like being complicated to set up, etc., but that can be fixed relatively easily, especially if you remove the dependency on Twisted. To my mind, it's the same as what you're proposing. And there is general agreement that it is only a starting point for the very committed/dedicated.

>> The reason I insist on feedback is very simple. A good speech
>> recognition environment lets you correct recognition errors
>> and create application-specific and application-neutral commands.
> Yes, we agree that you need correction. The application-specific
> features can be implemented in this model too, in the same way that Orca
> uses scripting.

I don't know how Orca uses scripting. Pointers? Seriously though, I want a grammar and the ability to associate methods with the grammar. I do know I'm not the only one, because there is a fair number of people who have built grammars using the NaturallySpeaking Visual Basic environment, natpython, and a couple of macro packages built on top of natpython. Even if you convince me, you'll have to convince them.

> You would still have to correct the mistake at some point. I would
> prefer to just dictate on and come back and correct all the mistakes at
> the end. One should read through before sending in any case ;)

Oh, I understand, but in my experience, if I don't pay attention to what the recognition system is saying, my speech gets sloppy and my recognition accuracy drops significantly until I have something which is completely unrecognizable at the end. Also, I'm probably "special" in this case, but even when I was typing, I continually looked back at the document as far as the screen permitted, searching for errors. It seems to help me keep producing written speech and identifying where I'm using spoken speech for writing. I know other people, like you, want to just dictate and not look back.
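(Editor's aside on the grammar point above: here is a minimal, invented sketch of what "a grammar with methods associated" could look like, including the per-application activation Eric describes later. This is not the natpython/NatLink or NaturallySpeaking API; every name here is made up.)

```python
# Hypothetical grammar object: spoken phrases map to handler methods, and a
# grammar can be scoped so it only fires in one application (cf. "Emacs
# commands that are only active when running Emacs").

class Grammar:
    def __init__(self, app=None):
        self.app = app      # None means the grammar is global
        self.rules = {}     # spoken phrase -> handler function

    def rule(self, phrase):
        # Decorator associating a handler method with a spoken phrase
        def register(handler):
            self.rules[phrase] = handler
            return handler
        return register

    def handle(self, phrase, active_app):
        # Application-specific grammars fire only in their own application
        if self.app is not None and self.app != active_app:
            return None
        handler = self.rules.get(phrase)
        return handler() if handler else None

emacs = Grammar(app="emacs")

@emacs.rule("next buffer")
def next_buffer():
    return "C-x b"   # keystroke sequence the command would send

sent = emacs.handle("next buffer", active_app="emacs")      # fires
ignored = emacs.handle("next buffer", active_app="firefox")  # stays quiet
```

The design choice being illustrated: commands live in user-extensible grammars bound to handlers, rather than a fixed command set baked into the engine.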
Some of them will turn their chair around and stare at a painting on the wall while they dictate. But there are those, like me, that can't.

> And I think that is a serious design-flaw for two (related) reasons: It
> gradually corrupts your voice files AND it makes the reader constantly
> worry about whether that is happening. You have to make sure to speak as
> correctly as possible at all times and always make sure to stop
> immediately and correct all the mistakes. Otherwise your profile will be
> hosed. I repeat: that is a bug, not a feature. You end up adapting more
> to the machine than the machine adapts to you. *That is a bug.*

It's a feature... Seriously, get NaturallySpeaking and play with the dictation box as well as natural-text-driven applications. When you have something that is Select-and-Say enabled, you don't need to pay attention all the time; you can go back a paragraph or two or three and fix your errors. The only time you need to pay attention is when you are using natural text, which is one way Nuance forces you to toe the line when it comes to applications. That is a bug!

> I think this is an NS bug too. I don't want natural editing, I only want
> natural dictation. I want two completely separate modes: pure dictation
> and pure editing. If I say 'cut that' I want the words 'cut that' to be
> typed. To edit I want to say: 'Hal: cut that bit'. Why? Because that
> would improve overall recognition and would remove the worry that you
> might delete a paragraph by mistake. NS would only trigger its special
> functions on a single word, and otherwise just do its best to
> transcribe. You would of course select that word to be one that it would
> never get wrong. (You could argue that natural editing is a feature, but
> the fact that you cannot easily configure it to use the modes I
> described is a design-flaw.)

A few things are very important in this paragraph. Prefacing a command is something I will really fight against.
It is a horrible thing to impose on the user because it adds extra vocal load and cognitive load. VoiceCoder has a "yo" command model for certain commands, and I just refuse to use them; I type rather than say them, that sequence is so repellent to me. I have also had significant experience with modal commands with DragonDictate, which is why I have such a strong reaction against the command preface, and this is why Dragon Systems went away from them. Remember, Dragon was a technology-dedicated company, and I know for a fact that some of the employees were quite smart. If Dragon's research group does something and sticks with it, there's probably a good reason for it.

I think part of our differences comes from modal versus non-modal user interfaces. I like Emacs; it's non-modal (mostly). Other people like vi, which is exceptionally modal. Non-modal user interfaces are preferable in these circumstances if the indicator to activate some command or different course of action is relatively natural. For example, if I say "don't show dictation box" I just get text. But if I say "show dictation box" with a pause before the text as well as after, up comes the dictation box. Same words, but the simple addition of natural-length pauses allows NaturallySpeaking to identify the command and activate it only when it's asked for. Yes, it's training, but minimal training, and it applies everywhere when separating commands from text. This works for NaturallySpeaking commands and my private commands.

There is one additional form of mode switching in NaturallySpeaking, and that's the switching of commands based on which program is active and its state (i.e., running dialog boxes or something equivalent). That's why I have Emacs commands that are only active when running Emacs.

> Precisely. It's because they don't want to fiddle with the program, they
> just want to dictate.

But those that just dictate get unacceptable results. Try it.
When you get NaturallySpeaking running, just dictate, never ever correct, and see what happens. Then try it the other way around, using the dictation box whenever possible.

---eric

--
Speech-recognition in use. It makes mistakes, I correct some.

--
Ubuntu-accessibility mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-accessibility
