Re: Allowing tesseract to recognize a lot of additional words (similar to .user-words but maybe not)

Nick White Wed, 18 Sep 2013 04:44:42 -0700

On Tue, Sep 17, 2013 at 11:57:04AM -0700, [email protected] wrote:
> I need to add words to the list of words recognized by tesseract. The problem
> is that the list could be lengthy, and I'm concerned that putting them all in
> a .user-words file will make the OCR process very slow (I'm assuming it does
> the equivalent of wordlist2dawg on the .user-words file at each init()). So I
> had thought to take my list and "compile" it into a .traineddata file, but of
> course I'm missing the config, unicharset, unicharambigs, inttemp, pffmtable,
> and normproto files.
> 
> I know that all my words will come from the same language. Can I take the
> existing .traineddata file for that language, extract the config, unicharset,
> unicharambigs, inttemp, pffmtable, and normproto files, and use them in my own
> .traineddata file?


Yes, that should work fine. An alternative which would achieve the
same result would be to unpack the .traineddata, use dawg2wordlist
to expand the word dawg back out to a wordlist, append your list,
run wordlist2dawg to build a new dawg from the combined list, and
then pack that dawg back into the .traineddata.
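
A rough sketch of that second route, assuming the Tesseract training
tools (combine_tessdata, dawg2wordlist, wordlist2dawg) are on your
PATH and that the language data contains a word-dawg component. The
function names and file names here are placeholders of my own, not
anything tesseract defines, so adjust them to your setup:

```shell
#!/bin/sh
# Sketch only: extend the word list inside an existing .traineddata file.
# Assumes combine_tessdata, dawg2wordlist and wordlist2dawg are installed.

# Merge two word lists (one word per line) into a third, dropping duplicates.
# $1 = existing list, $2 = extra words, $3 = output file
merge_wordlists() {
    cat "$1" "$2" | sort -u > "$3"
}

# Unpack, extend, and repack the word dawg of one language.
# $1 = language code (placeholder, e.g. eng), $2 = file with the extra words
extend_traineddata() {
    lang=$1
    extra=$2

    # Unpack every component of the traineddata to ./$lang.*
    combine_tessdata -u "$lang.traineddata" "$lang." || return 1

    # Expand the existing dawg back out to a plain word list.
    dawg2wordlist "$lang.unicharset" "$lang.word-dawg" "$lang.wordlist" || return 1

    # Append the new words to the expanded list.
    merge_wordlists "$lang.wordlist" "$extra" "$lang.wordlist.new" || return 1

    # Compile the combined list back into a dawg.
    wordlist2dawg "$lang.wordlist.new" "$lang.word-dawg" "$lang.unicharset" || return 1

    # Overwrite only the word-dawg component in the traineddata file.
    combine_tessdata -o "$lang.traineddata" "$lang.word-dawg"
}
```

You would call it as something like `extend_traineddata eng mywords.txt`
from a scratch directory holding a copy of eng.traineddata, so the
original stays untouched if anything goes wrong.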

Nick

