Hi, I am training Tesseract to OCR English-language documents from the 17th and 18th centuries and earlier. Using a proper dictionary is important because spelling was not standardized until halfway through the 18th century, so my word lists include a LOT of alternative spellings. I have a word frequency file, generated from materials of this period, that is over 800,000 words long. I also have some other word lists, including proper names and a general dictionary with alternative spellings of words. These lists are also several hundred thousand words long.
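For concreteness, here is a minimal sketch of how a frequent-words list could be cut from a frequency file and compared against a main word list. The "word count" line format, the function names, and the sample data are all assumptions for illustration; they are not my actual files and not anything Tesseract prescribes.

```python
from collections import Counter


def parse_frequencies(lines):
    """Parse 'word count' lines into a Counter; malformed lines are skipped."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Assumed format: exactly two fields, a word and an integer count.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        counts[parts[0]] += int(parts[1])
    return counts


def top_n_words(counts, n):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in counts.most_common(n)]


def overlap(frequent_words, word_list):
    """Return the set of words that appear in both lists."""
    return set(frequent_words) & set(word_list)


# Hypothetical sample data with period spelling variants.
sample = ["the 900", "and 700", "vertue 40", "virtue 35", "bad line here"]
counts = parse_frequencies(sample)
frequent = top_n_words(counts, 3)
shared = overlap(frequent, ["vertue", "virtue", "the"])
```

Varying `n` and looking at the size of `shared` would at least show how much the two lists overlap at each cutoff, though it says nothing about how Tesseract itself treats the duplicates.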
I'm now trying to decide how many words to put in my frequently used words list. Is there an advantage to having more or fewer words in this file compared to the word list? Is there any problem with overlap between the two lists, i.e. words that appear in both? Is there a better way to handle alternative spellings than just having them all available in the word list?

Thanks for any help,
Matt

