As I was constructing my response, and was almost finished, it hit me what's wrong with the proposed model: it is the equivalent of raw natural text. Fully working natural text is already somewhat poor. Broken natural text, which cannot correct consistently, is horrible and ruins voice models. What you're proposing has even less functionality than broken natural text.

Henrik Nilsen Omma wrote:
Eric S. Johansson wrote:
This is one half of the solution needed. Not only do you need to propagate text to Linux, but you also need to provide enough context back to Windows so that NaturallySpeaking can select different grammars. It would be nice to also modify the text injected into Linux, because Nuance really screwed the pooch on natural text.

My point is that this is actually all you need. It has the advantage of being quite simple from a coding point of view: you need the transmitter on the Windows system that NS feeds into (a version already exists), and you need a GNOME or KDE app on the other end with solid usability and configurability. The latter is something that our open source community is quite good at making, and there are good tools like Mono, Python, and Qt4 that can be used.
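A minimal sketch of the Linux-side receiver, assuming (my assumption, not the existing transmitter's actual protocol) that the Windows side sends newline-delimited UTF-8 text over a TCP socket; the injection callback is a placeholder for whatever keystroke-injection mechanism the desktop app ends up using:

```python
# Hypothetical receiver for the NS text stream. Protocol, host, and port
# are illustrative assumptions; only the shape of the design is the point.
import socket

def handle_stream(data: bytes, inject):
    """Split a received byte chunk into utterances and inject each one."""
    for line in data.decode("utf-8").splitlines():
        if line:  # skip blank keep-alive lines
            inject(line)

def serve(host="0.0.0.0", port=5050, inject=print):
    """Listen for the Windows-side transmitter and inject what it sends."""
    with socket.create_server((host, port)) as srv:
        conn, _addr = srv.accept()
        with conn:
            while chunk := conn.recv(4096):
                handle_stream(chunk, inject)
```

Keeping the parsing (`handle_stream`) separate from the socket loop makes it easy to test and to swap in a richer protocol later, e.g. one that carries the context information back to Windows.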

There is a system already in existence that does exactly what you've proposed. While it was technically successful, it has failed in that nobody but the originator uses it, and even he admits this model has some serious shortcomings.

The reason I insist on feedback is very simple. A good speech recognition environment lets you correct recognition errors and create application-specific and application-neutral commands.

One point I've obviously glossed over is training. You'll need to do some training to improve the recognition rate. Under my proposed scheme you would need to do the training natively under Windows. I'm quite happy to do that, actually. I would rather not worry about training during my daily work with the system, but would collect the mistakes over a week or so and spend an hour or two doing just training. With the system I'm proposing you could make the Linux client recognise a 'must-train-this-later' command, which would cause it to save the past few lines to a log file.
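The 'must-train-this-later' idea could be sketched like this (class and file names are my invention): the client keeps a rolling buffer of the most recent recognised lines, and the spoken command appends that buffer to a log for the weekly batch-training session under Windows:

```python
# Hypothetical sketch of the 'must-train-this-later' command handler.
from collections import deque
from datetime import datetime

class TrainLaterLog:
    def __init__(self, path="train_later.log", keep=5):
        self.path = path
        self.recent = deque(maxlen=keep)  # last few recognised lines

    def record(self, line):
        """Called for every line of recognised text."""
        self.recent.append(line)

    def flag(self):
        """Called when the user says 'must-train-this-later'."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"--- {datetime.now().isoformat()} ---\n")
            f.writelines(line + "\n" for line in self.recent)
```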

Modern systems train incrementally. This improves the user experience because you don't have to put up with continual misrecognitions. Apparently they also train incrementally on what's not corrected, which means batch correction is not a good thing. Another example is what's happening with me right now. There are a bunch of small words and misrecognized endings that are cropping up with increasing frequency. If Nuance hadn't screwed up, and had left natural text in a working state, I would be able to correct them as I dictate into this Thunderbird window. But no, it's so broken that I make corrections by hand, and as a result the misrecognitions get cast in stone and I need to scrap the user profile and start retraining from scratch about every six months. Do not subject users to this kind of frustration and time waste. They will drop the system in a heartbeat if you do.

I have no problem leaving the entire user interface for correction etc. in Windows. The only trouble is how do you make it visible if you're running a virtual machine full-screen? Don't run the virtual machine full-screen?

This is a difficult task. There is a very nice package called VoiceCode, spearheaded by Alain Desilets up at NRC-IIT in conjunction with David Fox.
Do you have a link to this work? I'd be interested to see.

http://voicecode.iit.nrc.ca/VoiceCode/public/ywiki.cgi

Something else you might want to see is a full Select-and-Say interface to Emacs:

http://emacs-vr-mode.sourceforge.net/

These two things should keep you out of trouble for a while.  :-)


We don't need any of that. We just accept a text stream from NS, running in pure dictation mode, and create our events based on that. All we are after is the excellent recognition engine. The GUI we leave behind.

...
You don't. You set this all up first on the native system along with the initial setup. If you notice that it's not working as it should you open the VMware window or the VNC session where NS is running and make a few adjustments to it directly.

The graphical user interface is an integral part of the dictation process. For example, I pay attention to the little floating box which shows partial recognition states. It gives me an early warning on how I am speaking and the quality of the recognition. It also gives me the ability to terminate a recognition sequence if NaturallySpeaking loses its mind. The little recognition box floats inside the window of the active application, so if it's not in the window and I'm not getting any text injected, I know it's time to reset/restart NaturallySpeaking.

If you look at a system running NaturallySpeaking over VNC, the dictation box is usually not visible. If it is visible, it usually does not show information dynamically, because it updates far faster than VNC can cope with.

I think we'd be better off finding some way of overlaying the user interface from NaturallySpeaking on top of a Linux virtual machine screen. It sucks, but you might get it done faster than your very desirable but overly optimistic wish.

So I disagree that this is easier or faster. It sounds very messy. You would need to capture and transmit bits of the screen or something. A lot of work to copy an already poor user interface.

The only truly hideous piece of the user interface is the training dialogue, and David has shown how to replace that with something more useful. The user interface elements that are quite useful are the audio level indicator, the partial recognition information, and the ability to terminate the recognition sequence.

I've attached a very small (<6k) image showing the final recognition state of an utterance. Normally in the upper left-hand corner there is a red dot. Click on that red dot and the recognition sequence terminates. The microphone in the taskbar turns red, indicating it's in the off state. The little bar in the lower left-hand corner is the audio intensity meter. It's yellow now, indicating that no one is speaking. When it's green, I'm speaking at the right level, and when it's red, I'm talking too loud. The text in the middle of the box changes as the recognition engine changes its evaluation. Like I said, it's damned useful feedback that helps me modify how I speak and interact with the engine in real time.


In any event, take a look at the VoiceCode UI for making corrections. I really like it. It's the best correction interface I've seen so far. David Fox is responsible for that wonderful creation.
Sounds interesting. URL?

See the VoiceCode URL above. It's in the user manual.

...except you only have to say "delete line" and not "Macro delete line".
If those are phrases that are active in NS's dictation mode then I'm proposing to generally stay away from them and use your own custom commands. Of course if you get them working reliably, then you can use them and have the transmitter be clever enough to realise that a line has just been deleted, etc.

No, the grammar I gave you was a custom grammar. It didn't need a preamble of "macro". It demonstrates how you can create a more natural speech user interface. You can also overlay NaturallySpeaking commands with your own actions, so that you can say "cut that" and "paste that" for applications that don't use ^c/^v for cutting and pasting. As you know, this is desirable because it reduces the number of distinct commands a user must remember and eliminates the need for the user to be context-smart. Computers are much better at being context-smart than we are.
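The overlay idea can be sketched as a per-application command table (application names and keystrokes here are illustrative assumptions, not NaturallySpeaking's actual configuration): the same spoken phrase resolves to different keystrokes depending on which application is active, so the computer carries the context instead of the user:

```python
# Hypothetical command overlay: one spoken phrase, per-application actions.
COMMAND_MAP = {
    "cut that":   {"default": "ctrl+x", "emacs": "C-w"},
    "paste that": {"default": "ctrl+v", "emacs": "C-y"},
}

def resolve(command: str, active_app: str) -> str:
    """Return the keystroke to send for a spoken command in this app."""
    per_app = COMMAND_MAP[command]
    return per_app.get(active_app, per_app["default"])
```

The user only ever remembers "cut that" and "paste that"; the table is where the context-smarts live, and adding a new application is one dictionary entry.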

>> We have negotiated for rights to a speech recognition engine. I don't know if it's better than the Sphinx group but it is open source, and the developer is still interested in seeing it have a life.
You have negotiated the rights to an open source speech engine? In what sense? A transfer of copyright?

I'll get the details, but if memory serves, we got the rights assigned back to the original developer, and he is going to license it under some form of the GPL. Then, if you add a whole bunch of work, you might have something useful. I'd estimate the development time to be roughly 3 to 4 years if you had five fully funded, full-time developers. Which means it'll probably take longer, given that I'm an optimist when it comes to schedules. :-)

But as with all these things it's important to make a start. A journey of a thousand miles starts with the ground under you, and all that :)

I agree. Let me finish up my project specification and get it signed off by the board of directors at OSSRI, and from there we can start soliciting contributions etc.

But also consider this: ever wonder why the acceptance rate for speech recognition is only one user in five? Granted, I only have a small sample, but all of the doctors I've talked to about speech recognition tell me stories of purchasing a very expensive package only to drop it in a few months and go back to human transcription. Obviously recognition accuracy is part of the problem, but the other half is usability. Can a transcriptionist detect errors and correct them without seriously interrupting their workflow? Can they eliminate persistent errors quickly and effectively? These are just a couple of the higher-level issues that will hit us as we go forward.

--
Speech-recognition in use.  It makes mistakes, I correct some.

<<attachment: dictation_dialog.JPG>>

-- 
Ubuntu-accessibility mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-accessibility
