Hi Matt,

> I am training Tesseract to OCR 17th and 18th century or earlier English
> language documents.

Cool, that sounds very interesting to me :)

> I'm trying now to decide how many words to have in my frequently used words
> list.

I had the same concern when training Ancient Greek. I discuss it briefly in my article[0], but I'll answer your questions as best I can below too.

> Is there an advantage to have more or less words in this file compared to
> the word list?

The way it works is that words in the freq-words list are weighted higher when Tesseract looks for likely matches. So if there are too many, you're likely to get less common variants switched to the most common ones. There's no easy, exact number to give. I settled on only the few hundred most common words for the freq-words list, as this is what the other trainings tended to have, and it seems to work well.

On the off chance it's useful to you, I created a Bourne shell script[1] that takes a word list in 'number-of-occurrences word' format and outputs two word lists, one freq-words and one all-words. (Even less likely to be useful, this[2] is the hacky script that I used to generate the original word list from an XML TEI Greek corpus.)

> Is there any problem with having overlap in the two lists, i.e.
> words that appear in both lists?

That's fine, I believe. If it's in both lists it will be treated as a frequent word, which is correct.

> Is there a better way to handle alternative
> spellings than just having them all available in the word list?

Nope. If you have any ideas, by all means share them.

Let us know how your project gets on!

Nick

0. http://www.eutypon.gr/eutypon/pdf/e2012-29/e29-a01.pdf
1. https://gitorious.org/ancient-greek-training-for-tesseract/tesstrainingtools/blobs/master/wordlistparse.sh
2. https://gitorious.org/ancient-greek-training-for-tesseract/grctrainingtools/blobs/master/wordlistfromperseus.sh

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
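
For anyone who would rather not read shell, the freq-words / all-words split described in the reply above can be sketched roughly as below. This is not the wordlistparse.sh script linked in the message, just an illustration of the same idea in Python. The function name `split_word_lists` and the default cut-off `n_frequent=300` are assumptions made for the example ("a few hundred" is the only guidance given), and the input is assumed to be one 'count word' pair per line.

```python
from collections import Counter


def split_word_lists(lines, n_frequent=300):
    """Split 'count word' lines into (freq_words, all_words).

    n_frequent is a hypothetical cut-off; tune it for your own corpus.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines rather than guessing at them.
        if len(parts) != 2 or not parts[0].isdecimal():
            continue
        # Sum the counts if the same word is listed more than once.
        counts[parts[1]] += int(parts[0])

    # Most frequent first; ties broken alphabetically for a stable order.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    freq_words = ranked[:n_frequent]

    # The full list keeps every word, including the frequent ones,
    # since overlap between the two lists is fine.
    all_words = sorted(counts)
    return freq_words, all_words


if __name__ == "__main__":
    sample = ["120 the", "95 and", "3 vertue", "2 virtue", "1 vertve"]
    freq, everything = split_word_lists(sample, n_frequent=2)
    print("\n".join(freq))
    print("\n".join(everything))
```

Alternative spellings such as "vertue" and "virtue" simply appear as separate entries in the full list, which matches the advice above. Keeping the cut-off small means rarer variants stay out of the frequent list and so are less likely to be overridden by the common form.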

