Hello all, I was able to train some new fonts thanks to the help I've got here.
The Wiki is somewhat vague when it comes to dictionaries. It mentions a few dictionaries, as well as the concern about their licenses.

Looking at both aspell and ispell, each provides word lists of different sizes. Ispell is simpler: it has an extra-large list and a medium list. Inside the file american.med+ I see that some words are followed by a slash (/) and then something else, like MS. For instance:

    abc
    abdicate/DGNS
    abdomen/MS
    abdominal/Y
    abduct/DGS
    abduction/MS
    abductor/MS

I wonder whether that is going to generate bad results with the wordlist2dawg application.

http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html has some pointers, saying that you need to use it as:

    wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset

In this case, if I am using ispell, the command should be:

    wordlist2dawg american.med+ eng.freq-dawg eng.unicharset

But I wonder whether I should use the med+ file or the xlg file instead.

There are also some instructions that take into account the length of each word:

    wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

Does anyone with experience on this have some pointers?

Thanks in advance.
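For reference, here is a minimal sketch of how the part after the slash could be stripped before the list is passed to wordlist2dawg. It rests on two assumptions that I have not confirmed: that the text after the slash is only an ispell affix flag, and that wordlist2dawg expects one plain word per line. The function name `strip_affix_flags` is hypothetical. Note that this keeps only the base words; any inflected forms that the flags may stand for are not generated.

```python
def strip_affix_flags(lines):
    """Return the base words from entries such as 'abdomen/MS'.

    Assumption: everything after the first '/' is an affix flag
    and can be dropped. Blank lines and duplicates are skipped,
    and the original order is preserved.
    """
    words = []
    seen = set()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        # Keep only the part before the first slash.
        word = entry.split("/", 1)[0]
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


if __name__ == "__main__":
    # Sample entries taken from american.med+ as quoted in this post.
    sample = ["abc", "abdicate/DGNS", "abdomen/MS", "abdominal/Y"]
    for word in strip_affix_flags(sample):
        print(word)
```

The cleaned list could then be written to a new file and that file given to wordlist2dawg in place of american.med+.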

