On Apr 17, 7:26 am, Nick White <[email protected]> wrote:
> On Mon, Apr 16, 2012 at 06:38:01PM +0200, zdenko podobny wrote:
> > I think in 3.02 will provide solution this cases: you can use more than one
> > language for OCR. e.g. you can run something like this:
> >
> > tesseract image output -l grc+ell
>
> Ah, that's a very good idea, and will indeed be useful. However for
> my usecase (a script which is mostly the same, but with additions,
> and an older version of the language), it would be useful to only
> use one set of dictionary files (rather than presumably the union of
> grc & ell, in the above example).

The main difficulty for you will be any characters that are not already
trained. There's no easy way to "just add a few characters"; you basically
have to do a full retrain. If you're okay living within the unicharset
already trained for a given language, you can just swap in your own
dictionary files, either by using combine_tessdata(1) and wordlist2dawg(1),
or by specifying a user-words file (to augment the dictionary), which is
described, as zdenko mentioned, on the man page.

-david

http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html
http://tesseract-ocr.googlecode.com/svn/trunk/doc/wordlist2dawg.1.html
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
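
For the combine_tessdata/wordlist2dawg route, a rough sketch of the steps might look like the following. It assumes the component names used in the man page examples (<lang>.word-dawg, <lang>.unicharset) and the -u (unpack) and -o (overwrite) options of combine_tessdata; the function names, paths, and the DRY_RUN switch are made up for illustration. By default it only prints the commands, so you can check them against your installed version before running anything.

```shell
#!/bin/sh
# Sketch only: rebuild the word-list DAWG for an already trained language
# and put it back into that language's traineddata file. The unicharset is
# reused unchanged, so every word must use characters already trained.

# Print each command; execute it only when DRY_RUN is set to 0.
# (DRY_RUN and run are hypothetical helpers, not part of Tesseract.)
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# swap_dictionary LANG TESSDATA_DIR WORDLIST WORKDIR (hypothetical helper)
swap_dictionary() {
    lang=$1      # language code, e.g. grc
    tessdata=$2  # directory holding <lang>.traineddata
    wordlist=$3  # plain text file, one word per line
    workdir=$4   # existing scratch directory for the unpacked components

    # 1. Unpack all components as <workdir>/<lang>.*
    run combine_tessdata -u "$tessdata/$lang.traineddata" "$workdir/$lang."

    # 2. Compile the word list against the unpacked unicharset.
    run wordlist2dawg "$wordlist" "$workdir/$lang.word-dawg" "$workdir/$lang.unicharset"

    # 3. Overwrite only the word-dawg component in the traineddata file.
    run combine_tessdata -o "$tessdata/$lang.traineddata" "$workdir/$lang.word-dawg"
}
```

Calling `swap_dictionary grc /path/to/tessdata my_words.txt /tmp/grc_work` prints the three commands; setting DRY_RUN=0 first would actually run them. Back up the original traineddata file beforehand, since step 3 modifies it in place.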

