Re: Specifying different dictionary files [was: Getting usable source files from traineddata files]

David Eger Tue, 17 Apr 2012 18:13:18 -0700

On Apr 17, 7:26 am, Nick White <[email protected]> wrote:
> On Mon, Apr 16, 2012 at 06:38:01PM +0200, zdenko podobny wrote:
> > I think 3.02 will provide a solution for these cases: you can use more than one
> > language for OCR, e.g. you can run something like this:
>
> > tesseract image output -l grc+ell
>
> Ah, that's a very good idea, and will indeed be useful. However for
> my use case (a script which is mostly the same, but with additions,
> and an older version of the language), it would be useful to only
> use one set of dictionary files (rather than presumably the union of
> grc & ell, in the above example).


The main difficulty for you will be any characters that are not
already trained.  There's no easy way to "just add a few characters";
you basically have to do a full retrain.  If you're okay living within
the unicharset already trained for a given language, you can just swap
in your own dictionary files, either by using combine_tessdata(1) and
wordlist2dawg(1), or by specifying a user-words file (to augment the
dictionary), which, as zdenko mentioned, is described on the man page.
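
Roughly, the combine_tessdata/wordlist2dawg route could look like the
sketch below.  The function names, the language code, the paths and the
DRY_RUN switch are all placeholders I made up for illustration, and I am
assuming the main dictionary component is called "word-dawg", as it is in
the 3.0x traineddata files I have looked at; check the man pages linked
below against your version before relying on it.

```shell
#!/bin/sh
# Sketch only: replace the word dictionary inside an existing traineddata
# file. Everything named here (run, swap_dictionary, DRY_RUN, the paths)
# is hypothetical. Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

swap_dictionary() {
    lang=$1       # language code, e.g. grc
    tessdata=$2   # directory holding $lang.traineddata
    wordlist=$3   # plain-text word list, one word per line
    work=$4       # scratch directory for the unpacked components

    # Unpack the components; -u writes files named "$work/$lang.<component>".
    run combine_tessdata -u "$tessdata/$lang.traineddata" "$work/$lang." || return 1

    # Build a new DAWG against the existing unicharset. Words that use
    # characters outside that unicharset cannot be encoded.
    run wordlist2dawg "$wordlist" "$work/$lang.word-dawg" "$work/$lang.unicharset" || return 1

    # Overwrite only the word-dawg component in the traineddata file.
    # Assumption: the component is named "word-dawg" in your version.
    run combine_tessdata -o "$tessdata/$lang.traineddata" "$work/$lang.word-dawg" || return 1
}
```

Something like "DRY_RUN=1; swap_dictionary grc tessdata words.txt work"
prints the three commands so you can inspect them first.  Keep a backup of
the original traineddata file, since -o modifies it in place.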

-david

http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html
http://tesseract-ocr.googlecode.com/svn/trunk/doc/wordlist2dawg.1.html
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
