Allowing tesseract to recognize a lot of additional words (similar to .user-words but maybe not)

chris Tue, 17 Sep 2013 18:21:20 -0700

I need to add words to the list of words that tesseract recognizes. The 
problem is that the list could be lengthy, and I'm concerned that putting 
all of the words in a .user-words file will make the OCR process very slow 
(I'm assuming it does the equivalent of wordlist2dawg on the .user-words 
file on each init()). I had thought to take my list and "compile" it into 
a .traineddata file instead, but of course I'm missing the config, 
unicharset, unicharambigs, inttemp, pffmtable, and normproto files.


I know that all my words will come from the same language. Can I take the 
existing .traineddata file for that language, extract the config, 
unicharset, unicharambigs, inttemp, pffmtable, and normproto files, and use 
them in my own .traineddata file?

Maybe another way to ask the question is this: are the config, unicharset, 
unicharambigs, inttemp, pffmtable, and normproto files dependent on the 
word list, or are they dependent only on the language and font?
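
To make the idea concrete, here is a rough sketch of the steps I have in 
mind, written as a small Python helper. It is untested against a real 
install and rests on assumptions: that combine_tessdata -u unpacks a 
.traineddata file, that wordlist2dawg takes a word list, an output dawg, 
and a unicharset, and that combine_tessdata -o overwrites a single 
component in place. The names build_commands and run_all and the paths are 
made up for illustration. It also assumes that replacing the word-dawg is 
the right component to swap, which is really part of my question.

```python
import subprocess
from pathlib import Path


def build_commands(traineddata, wordlist, workdir):
    """Return the argv lists for the three steps; nothing is executed here.

    traineddata should be a *copy* of the stock file, because the last
    step (assumed behaviour of `combine_tessdata -o`) modifies it in place.
    """
    traineddata = Path(traineddata)
    lang = traineddata.name.split(".")[0]        # "eng" from "eng.traineddata"
    prefix = str(Path(workdir) / lang) + "."     # e.g. "work/eng."
    return [
        # 1. Unpack every component of the file into <workdir>/<lang>.*
        ["combine_tessdata", "-u", str(traineddata), prefix],
        # 2. Compile the word list against the extracted unicharset
        ["wordlist2dawg", str(wordlist),
         prefix + "word-dawg", prefix + "unicharset"],
        # 3. Overwrite only the word-dawg component inside the copy
        ["combine_tessdata", "-o", str(traineddata), prefix + "word-dawg"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling run_all(build_commands("eng.traineddata", "mywords.txt", "work")) 
would then do the work once, ahead of time, rather than on every init(). 
Since the new word-dawg would replace the stock one, I assume my word list 
would also have to include the original dictionary words.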

Thanks,
Chris

