Right, everyone. Not everyone is familiar with speech recognition, so here are some basics, a few facts, and my notes on the three 'champions'.
Open-source speech recognition systems could be better. There are really only two contenders, described below; GnomeVoiceControl et al. either died under the complexity of this technology, or could only make sense of short (usually one-word) commands, which limited their usefulness.

Speech recognition is a hideously advanced technology, and NO-ONE has perfected it yet, though many (commercial) companies are close. Most Android devices can record snippets of voice data, send them back to Google for analysis, and receive the result back as text. These snippets are very limited and only sometimes accurate; this is most likely because Google's system (licensed from Nuance Communications) is not trained to each individual voice.

Speech recognition requires a low-noise environment (or a very good, but expensive, microphone to compensate). Only 'trained' systems, like Dragon, can really make sense of a human speaking normally; for the best accuracy you have to speak in a flat, monotone voice, very clearly and distinctly.

Speech recognition systems generally require two distinct models to operate. The first is the vocal (acoustic) model, which picks human speech out from environmental sounds; in some systems these vocal models can be trained to better recognize particular users, or to better recognize (and ignore) sounds from their environments. The second is the language model, which sifts through the data filtered by the vocal model and ascribes sounds to words. (There's a toy sketch of this two-model pipeline after the Julius notes below.) There are currently no complete GPL models; this is why VoxForge exists.

I am currently unaware of any speech recognition system capable of recognizing and identifying multiple users at the same time; this will need to be built for Wintermute. Both Microsoft's and Apple's operating systems have built-in speech recognition.

This technology is nowhere near as simple as a file browser or a music player; in terms of complexity, it's on the level of an operating-system kernel, and you need a HELL of a lot of knowledge to build it.

Below are my notes on the three speech recognition engines. The first two are open-source; the last is not, but it can at least show us, at the user-experience level, where we should generally be heading. In terms of licenses, both open-source systems are just the usual "include this notice and disclaimer", with the addition of "if you edit the code, say when and by whom" for Sphinx.

*Sphinx*
1. BSD-licensed project; open-source version.
2. Mostly used as a research project; expect spaghetti code.
3. Functioning models; some parts fully in the public domain.
4. No GUI; mostly console; third-party GUIs are available (see the sketch after the Julius notes for what driving it from code looks like).
5. The latest version is written in Java.
6. Very, VERY fast recognition.
7. The most complete system available to us.
8. Used by multiple robotics labs around the world.

*Julius*
1. Julius started development in 1997.
2. Julius' English models are complete, but they are under the HTK license; they are not open-source and cannot be redistributed without permission.
3. Julius can only run with HTK models. There is apparently a F/OSS one, but it's Japanese (and before you ask: no, it isn't possible to just 're-code' it; it's a model, and the whole thing would need to be re-created).
4. Julius itself is open-source, but under what appears to be a custom license (this needs to be checked); if so, that could make things difficult for us.
5. Used by the open-source 'Q.bo' robotics project.
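Since that two-model split is the heart of any engine we'd build on, here's a toy Python sketch of the idea. Every number, word list and probability in it is invented purely for illustration; a real engine scores phoneme sequences with hidden Markov models over vastly larger models.

# Toy illustration of the two-model pipeline described above.
# All candidates and scores here are invented for the example.

# 1) Vocal (acoustic) model output: for each audio segment, a few
#    candidate words with how well they matched the sound, 0..1.
acoustic_candidates = [
    [("recognise", 0.70), ("wreck a nice", 0.65)],
    [("speech", 0.80), ("beach", 0.75)],
]

# 2) Language model: probability of a word following the previous one.
#    Made-up numbers standing in for counts from a text corpus.
bigram = {
    ("<s>", "recognise"): 0.30, ("<s>", "wreck a nice"): 0.01,
    ("recognise", "speech"): 0.40, ("recognise", "beach"): 0.02,
    ("wreck a nice", "speech"): 0.05, ("wreck a nice", "beach"): 0.30,
}

def decode(candidates):
    """Greedy decode: at each step keep the word whose combined
    acoustic * language score is highest."""
    prev, result = "<s>", []
    for options in candidates:
        best = max(options,
                   key=lambda wc: wc[1] * bigram.get((prev, wc[0]), 1e-6))
        result.append(best[0])
        prev = best[0]
    return " ".join(result)

print(decode(acoustic_candidates))  # -> "recognise speech"

The acoustic side alone can't tell 'recognise speech' from 'wreck a nice beach'; the language model breaks the tie, which is exactly why the lack of complete GPL models hurts us.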
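And for a feel of what "no GUI; mostly console" means in practice for Sphinx, this is roughly the shape of a live-dictation loop. I'm assuming the third-party pocketsphinx Python bindings and their bundled default US-English models here; I haven't verified package names on any particular distro, so treat it as a sketch, not a recipe.

# Live dictation with PocketSphinx -- assumes the third-party Python
# bindings and their bundled default US-English models are installed.
from pocketsphinx import LiveSpeech

# LiveSpeech opens the default microphone and yields one recognized
# hypothesis per detected utterance.
for phrase in LiveSpeech():
    print(phrase)

Wrappers at about this level are what the third-party GUIs are built on.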
*Dragon NaturallySpeaking;* a commercial (closed-source) speech recognition program:

1. Uses some simple tests (volume check, quality check) to make modifications to audio input.
2. Has no capacity to recognize multiple speakers; it supports a number of users, but these require switching between accounts.
3. Has a number of modes (dictation mode, spell mode, command mode, numeral mode), as it appears their system is not capable of distinguishing between numbers as words and numbers as digits; sometimes it accidentally triggers a command when you meant to dictate something, etc. These modes have to be set manually. There *is* a 'normal mode' which can dictate, spell, enumerate and accept commands, but its frequency of misunderstandings is, of course, higher than the pre-set modes.
4. The quality of speech input depends on the microphone; a microphone operating in a noisy environment (say, next to a hard drive, as some laptop microphones are) is nowhere near as effective as a headset microphone.
5. When the program has finished checking volume and quality, it has the user read aloud a prepared section of text and, after doing so, appears to use some form of hidden Markov model to adjust itself.
6. It has functions to pull in and establish the user's writing style from multiple sources (email program, word processors, local documents) in an attempt to expand its vocabulary and to better guess the context of the user's words.
7. Recognition rate of Dragon after installation with no training is 60-70%. Recognition after one hour's training shoots to 90% (dependent on the environment). Estimated recognition rate after Dragon has processed six hours of training with complete analysis of local data sources: 99.65%.
8. It uses a 'best guess' method of establishing context: whether it should write '1' or 'one', or 'where' rather than 'ware', by getting a basic idea of what was said and where it was mentioned; prior examination of the user's previous writing appears to help here. (A toy sketch of the idea follows this list.)
9. It requires the user to speak punctuation; if a modern grammar checker is used with it, however, full documents can be grammatically correct with only a slight margin of error.
10. The system cannot easily handle a change in tempo, pitch or other alterations to your voice. This makes for a rather nasty model of computing, as someone speaking with a stuffy nose would have a very high error rate.
11. Dragon is processor-intensive. Its modern version (version 11) requires the machine to be active but idle at a user-defined time, at which point the program will use the full resources of the computer to re-tune itself on a more in-depth level. The initial training data provided during account setup takes about half an hour to process on a modern machine, despite being only 10 minutes' worth of voice recordings; at that roughly 3:1 ratio, six hours (360 minutes) of data works out to around 18 hours of processing, i.e. most of a day. It is also worth noting that the data computed by Dragon in this 'optimization' process can reach several dozen gigabytes in size; I'm unsure about Dragon 11, but Dragon 9 (an earlier version) would take about 10 minutes' worth of recordings and output 5 gigabytes of data.
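Point 8 is worth dwelling on, since Wintermute will need the same trick. Here's a toy Python sketch of best-guess homophone resolution using bigram counts harvested from the user's own writing; the corpus and homophone sets are invented for illustration, and Dragon presumably does something far more sophisticated than raw bigrams.

# Toy sketch of 'best guess' homophone resolution, as in point 8.
from collections import Counter

# Stand-in for text harvested from the user's email/documents.
corpus = ("where did you put the ware from the warehouse "
          "i wrote one letter and 1 invoice yesterday "
          "where is the one thing i asked for").split()

# Count how often each word follows each other word.
bigrams = Counter(zip(corpus, corpus[1:]))

def best_guess(prev_word, homophones):
    """Pick whichever candidate most often followed prev_word
    in the user's own writing."""
    return max(homophones, key=lambda w: bigrams[(prev_word, w)])

print(best_guess("the", ["where", "ware"]))   # -> 'ware'
print(best_guess("wrote", ["one", "1"]))      # -> 'one'

With no matching context the counts all tie at zero and the guess is essentially arbitrary, which is one plausible reason an untrained profile does so poorly (see point 7).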
-Dante

--
-Danté Ashton
Vi Veri Veniversum Vivus Vici

Sent from Ubuntu