As I was constructing my response, and was almost finished, it hit me what's wrong with the proposed model: it is the equivalent of raw natural text. Fully working natural text is already somewhat poor. Broken natural text, which cannot correct consistently, is horrible and ruins voice models. What you're proposing has even less functionality than broken natural text.

Henrik Nilsen Omma wrote:
Eric S. Johansson wrote:
This is one half of the solution needed. Not only do you need to propagate text to Linux, but you also need to provide enough context back to Windows so that NaturallySpeaking can select different grammars. It would be nice to also modify the text injected into Linux, because Nuance really screwed the pooch on natural text.

My point is that this is actually all you need. It has the advantage of being quite simple from a coding point of view: you need the transmitter on the Windows system that NS feeds into (a version already exists), and you need a GNOME or KDE app on the other end with solid usability and configurability. The latter is something that our open source community is quite good at making, and there are good tools like Mono, Python, and Qt4 that can be used.
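A minimal sketch of the Linux-side receiver, assuming (my assumption, not the existing transmitter's actual protocol) that the Windows side sends newline-delimited UTF-8 text over a TCP socket; the injection callback is a placeholder for whatever keystroke-injection mechanism the desktop app ends up using:

```python
# Hypothetical receiver for the NS text stream. Protocol, host, and port
# are illustrative assumptions; only the shape of the design is the point.
import socket

def handle_stream(data: bytes, inject):
    """Split a received byte chunk into utterances and inject each one."""
    for line in data.decode("utf-8").splitlines():
        if line:  # skip blank keep-alive lines
            inject(line)

def serve(host="0.0.0.0", port=5050, inject=print):
    """Listen for the Windows-side transmitter and inject what it sends."""
    with socket.create_server((host, port)) as srv:
        conn, _addr = srv.accept()
        with conn:
            while chunk := conn.recv(4096):
                handle_stream(chunk, inject)
```

Keeping the parsing (`handle_stream`) separate from the socket loop makes it easy to test and to swap in a richer protocol later, e.g. one that carries the context information back to Windows.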

There is a system already in existence that does exactly what you've proposed. While it was technically successful, it has failed in that nobody but the originator uses it, and even he admits this model has some serious shortcomings.

The reason I insist on feedback is very simple. A good speech recognition environment lets you correct recognition errors and create application-specific and application-neutral commands.

One point I've obviously glossed over is training. You'll need to do some training to improve the recognition rate. Under my proposed scheme you would need to do the training natively under Windows. I'm quite happy to do that, actually. I would rather not worry about training during my daily work with the system, but would collect the mistakes over a week or so and spend an hour or two doing just training. With the system I'm proposing you could make the Linux client recognise a 'must-train-this-later' command, which would cause it to save the past few lines to a log file.
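The 'must-train-this-later' idea could be sketched like this (class and file names are my invention): the client keeps a rolling buffer of the most recent recognised lines, and the spoken command appends that buffer to a log for the weekly batch-training session under Windows:

```python
# Hypothetical sketch of the 'must-train-this-later' command handler.
from collections import deque
from datetime import datetime

class TrainLaterLog:
    def __init__(self, path="train_later.log", keep=5):
        self.path = path
        self.recent = deque(maxlen=keep)  # last few recognised lines

    def record(self, line):
        """Called for every line of recognised text."""
        self.recent.append(line)

    def flag(self):
        """Called when the user says 'must-train-this-later'."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"--- {datetime.now().isoformat()} ---\n")
            f.writelines(line + "\n" for line in self.recent)
```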

Modern systems train incrementally. This improves the user experience because you don't have to put up with continual misrecognitions. Apparently they also train incrementally on what's not corrected, which means batch correction is not a good thing. Another example is what's happening with me right now. There are a bunch of small words and misrecognized endings that are cropping up with increasing frequency. If Nuance hadn't screwed up, and had left natural text in a working state, I would be able to correct them as I dictate into this Thunderbird window. But no, it's so broken that I make corrections by hand, and as a result the misrecognitions get cast in stone and I need to scrap the user profile and start retraining from scratch about every six months. Do not subject users to this kind of frustration and time waste. They will drop the system in a heartbeat if you do.

I have no problem leaving the entire user interface for correction etc. in Windows. The only trouble is how do you make it visible if you're running a virtual machine full-screen? Don't run the virtual machine full-screen?

This is a difficult task. There is a very nice package called VoiceCode, spearheaded by Alain Desilets up at NRC-IIT in conjunction with David Fox.
Do you have a link to this work? I'd be interested to see.

http://voicecode.iit.nrc.ca/VoiceCode/public/ywiki.cgi

Something else you might want to see is a full Select-and-Say interface to Emacs:

http://emacs-vr-mode.sourceforge.net/

These two things should keep you out of trouble for a while.  :-)


We don't need any of that. We just accept a text stream from NS, running in pure dictation mode, and create our events based on that. All we are after is the excellent recognition engine. The GUI we leave behind.

...
You don't. You set this all up first on the native system along with the initial setup. If you notice that it's not working as it should you open the VMware window or the VNC session where NS is running and make a few adjustments to it directly.

The graphical user interface is an integral part of the dictation process. For example, I pay attention to the little floating box which shows partial recognition states. It gives me an early warning on how I am speaking and the quality of the recognition. It also gives me the ability to terminate a recognition sequence if NaturallySpeaking loses its mind. The little recognition box floats inside the window of the active application, so if it's not in the window and I'm not getting any text injected, I know it's time to reset/restart NaturallySpeaking.

If you look at a system running NaturallySpeaking over VNC, the dictation box is usually not visible. If it is visible, it usually does not show information dynamically, because it updates far faster than VNC can cope with.

I think we'd be better off finding some way of overlaying the user interface from NaturallySpeaking on top of a Linux virtual machine screen. It sucks, but you might get it done faster than your very desirable but overly optimistic wish.

So I disagree that this is easier or faster. It sounds very messy. You would need to capture and transmit bits of the screen or something. A lot of work to copy an already poor user interface.

The only truly hideous piece of the user interface is the training dialogue, and David has shown how to replace that with something more useful. The user interface elements that are quite useful are the audio level indicator, the partial recognition information, and the ability to terminate the recognition sequence.

I've attached a very small (<6k) image showing the final recognition state of an utterance. Normally in the upper left-hand corner there is a red dot. Click on that red dot and the recognition sequence terminates. The microphone in the taskbar turns red, indicating it's in the off state. The little bar in the lower left-hand corner is the audio intensity meter. It's yellow now, indicating that no one is speaking. When it's green, I'm speaking at the right level, and when it's red, I'm talking too loud. The text in the middle of the box changes as the recognition engine changes its evaluation. Like I said, it's damned useful feedback that helps me modify how I speak and interact with the engine in real time.


In any event, take a look at the VoiceCode UI for making corrections. I really like it. It's the best correction interface I've seen so far. David Fox is responsible for that wonderful creation.
Sounds interesting. URL?

See the VoiceCode URL above. It's in the user manual.

...except you only have to say "delete line" and not "Macro delete line".
If those are phrases that are active in NS's dictation mode then I'm proposing to generally stay away from them and use your own custom commands. Of course if you get them working reliably, then you can use them and have the transmitter be clever enough to realise that a line has just been deleted, etc.

No, the grammar I gave you was a custom grammar. It didn't need a preamble of "macro". It demonstrates how you can create a more natural speech user interface. You can also overlay NaturallySpeaking commands with your own actions, so that you can say "cut that" and "paste that" for applications that don't use ^c/^v for cutting and pasting. As you know, this is desirable because it reduces the number of distinct commands a user must remember and eliminates the need for the user to be context-smart. Computers are much better at being context-smart than we are.
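The overlay idea can be sketched as a per-application command table (application names and keystrokes here are illustrative assumptions, not NaturallySpeaking's actual configuration): the same spoken phrase resolves to different keystrokes depending on which application is active, so the computer carries the context instead of the user:

```python
# Hypothetical command overlay: one spoken phrase, per-application actions.
COMMAND_MAP = {
    "cut that":   {"default": "ctrl+x", "emacs": "C-w"},
    "paste that": {"default": "ctrl+v", "emacs": "C-y"},
}

def resolve(command: str, active_app: str) -> str:
    """Return the keystroke to send for a spoken command in this app."""
    per_app = COMMAND_MAP[command]
    return per_app.get(active_app, per_app["default"])
```

The user only ever remembers "cut that" and "paste that"; the table is where the context-smarts live, and adding a new application is one dictionary entry.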

>> We have negotiated for rights to a speech recognition engine. I don't know if it's better than the Sphinx group but it is open source, and the developer is still interested in seeing it have a life.
You have negotiated the rights to an open source speech engine? In what sense? A transfer of copyright?

I'll get the details, but if memory serves, we got the rights assigned back to the original developer, and he is going to license it under some form of the GPL. Then, if you add a whole bunch of work, you might have something useful. I'd estimate the development time to be roughly 3 to 4 years if you had five fully funded, full-time developers. Which means it'll probably take longer, given that I'm an optimist when it comes to schedules. :-)

But as with all these things it's important to make a start. A journey of a thousand miles starts with the ground under you, and all that :)

I agree. Let me finish up my project specification and get it signed off by the board of directors at OSSRI, and from there we can start soliciting contributions etc.

But also consider this: ever wonder why the acceptance rate for speech recognition is only one user in five? Granted, I only have a small sample, but all of the doctors I've talked to about speech recognition tell me stories of purchasing a very expensive package only to drop it in a few months and go back to human transcription. Obviously recognition accuracy is part of the problem, but the other half is usability. Can a transcriptionist detect errors and correct them without seriously interrupting their workflow? Can they eliminate persistent errors quickly and effectively? These are just a couple of the higher-level issues that will hit us as we go forward.

--
Speech-recognition in use.  It makes mistakes, I correct some.

<<attachment: dictation_dialog.JPG>>

-- 
Ubuntu-accessibility mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-accessibility
