On Tue, Sep 17, 2013 at 11:57:04AM -0700, [email protected] wrote:
> I need to add words to the list of words recognized by tesseract; the problem
> is that the list of words I'm adding could be lengthy, and I'm concerned that
> if I put them all in a .user-words file that the OCR process will be very slow
> (I'm assuming it does the equivalent of wordlist2dawg on the .user-words file
> each init()) so I had thought to take my list and "compile" it into a
> .traineddata file, but of course I'm missing the config, unicharset,
> unicharambigs, inttemp, pffmtable, and normproto files.
>
> I know that all my words will come from the same language, can I take the
> existing .traineddata file for that language, extract the config, unicharset,
> unicharambigs, inttemp, pffmtable, and normproto files, and use them in my own
> .traineddata file?

Yes, that should work fine. An alternative which would achieve the same
result would be to unpack the .traineddata, use dawg2wordlist to expand
the dawg back out to a wordlist, append your list, then wordlist2dawg
it back again.

Nick

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
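
For anyone wanting to script the alternative described in the reply (unpack, expand the dawg, append, rebuild), here is a rough Python sketch. It only wraps the stock command-line tools combine_tessdata, dawg2wordlist and wordlist2dawg. Several things in it are assumptions rather than anything stated in this thread: that the main dictionary component is named <lang>.word-dawg (as in 3.0x-era traineddata files), that the argument order matches the tools' usage messages for your version, and the scratch file name <lang>.wordlist.txt, which is made up for illustration. Run it on a copy of the .traineddata file, not the installed one.

```python
import subprocess
from pathlib import Path


def merge_words(existing, extra):
    """Append new words to the existing list, dropping blanks and duplicates."""
    seen = set()
    merged = []
    for word in list(existing) + list(extra):
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


def plan_commands(lang, workdir):
    """Return (before, after, wordlist): commands to run around the merge step."""
    work = Path(workdir)
    traineddata = str(work / f"{lang}.traineddata")  # a COPY of the stock file
    prefix = str(work / f"{lang}.")      # combine_tessdata -u writes <prefix><component>
    unicharset = prefix + "unicharset"
    dawg = prefix + "word-dawg"          # assumed component name, check your unpacked files
    wordlist = prefix + "wordlist.txt"   # hypothetical scratch file
    before = [
        # Unpack every component of the traineddata file.
        ["combine_tessdata", "-u", traineddata, prefix],
        # Expand the dawg back out to a plain word list.
        ["dawg2wordlist", unicharset, dawg, wordlist],
    ]
    after = [
        # Compile the merged word list back into a dawg.
        ["wordlist2dawg", wordlist, dawg, unicharset],
        # Overwrite just that component inside the traineddata file.
        ["combine_tessdata", "-o", traineddata, dawg],
    ]
    return before, after, wordlist


def rebuild_traineddata(lang, workdir, extra_words_path):
    """Run the whole round trip; requires the tesseract training tools on PATH."""
    before, after, wordlist = plan_commands(lang, workdir)
    for cmd in before:
        subprocess.run(cmd, check=True)
    existing = Path(wordlist).read_text(encoding="utf-8").splitlines()
    extra = Path(extra_words_path).read_text(encoding="utf-8").splitlines()
    merged = merge_words(existing, extra)
    Path(wordlist).write_text("\n".join(merged) + "\n", encoding="utf-8")
    for cmd in after:
        subprocess.run(cmd, check=True)
```

With a copy of eng.traineddata placed in a directory called work, something like rebuild_traineddata("eng", "work", "my-words.txt") would do the round trip. Words containing characters that are not in the language's unicharset may be rejected by wordlist2dawg, so it is worth checking its output.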

