Hi,

I am training Tesseract to OCR English-language documents from the 17th 
and 18th centuries and earlier. Using a proper dictionary is important 
because spelling was not standardized until halfway through the 18th 
century, so my word lists include a LOT of alternative spellings. I have a 
word frequency file, generated from materials of this period, that is over 
800,000 words long. I also have some other word lists, including proper 
names and a general dictionary with alternative spellings of words. These 
lists are also several hundred thousand words long. 

I'm now trying to decide how many words to put in my frequently used words 
list. Is there an advantage to having more or fewer words in this file 
compared to the word list? Is there any problem with overlap between the 
two lists, i.e. words that appear in both? Is there a better way to handle 
alternative spellings than just making them all available in the word 
list?
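
To make the question concrete, here is a rough Python sketch of how the two 
lists could be derived and how their overlap could be measured. Everything 
in it is an assumption for illustration: the "word count" line format of 
the frequency file, the function names, and the top_n cutoff are 
hypothetical, and it says nothing about what Tesseract itself prefers.

```python
from collections import Counter


def parse_frequency_lines(lines):
    """Parse 'word count' lines into a Counter (assumed file format).

    Blank or malformed lines are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        counts[parts[0]] += int(parts[1])
    return counts


def split_word_lists(counts, extra_words, top_n):
    """Return (frequent, general, overlap).

    frequent: the top_n words by count, most frequent first.
    general:  every word from the frequency data plus extra_words, sorted.
    overlap:  words that would appear in both lists.
    """
    frequent = [word for word, _ in counts.most_common(top_n)]
    general = sorted(set(counts) | set(extra_words))
    overlap = set(frequent) & set(general)
    return frequent, general, overlap


# Tiny made-up sample, including variant spellings such as "doe" and "sonne".
sample = ["the 120", "and 95", "doe 40", "do 38", "sonne 12", "son 9"]
counts = parse_frequency_lines(sample)
frequent, general, overlap = split_word_lists(
    counts, extra_words=["London", "sonne"], top_n=3
)
print(frequent)
print(len(general), len(overlap))
```

In this sketch every frequent word is also in the general list, which is 
exactly the overlap I am asking about. Filtering with something like 
`[w for w in general if w not in overlap]` would remove it if overlap turns 
out to be a problem.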

Thanks for any help,
Matt

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
