Hi Matt,

> I am training Tesseract to OCR 17th and 18th century or earlier English
> language documents.

Cool, that sounds very interesting to me :)

> I'm trying now to decide how many words to have in my frequently used words
> list.

I had the same concern when training Tesseract for Ancient Greek. I
discuss it briefly in my article[0], but I'll answer your questions
as best I can below too.

> Is there an advantage to have more or less words in this file compared to
> the word list?

The way it works is that words in the freq-words list are weighted
higher when Tesseract looks for likely matches. So if the list is too
long, less common variants are likely to get switched to the most
common ones. There's no easy, exact number for this, though.

I settled on only the most popular few hundred words for the
freq-words list, as this is what the other trainings tended to have,
and it seems to work well.

On the off chance it's useful to you, I created a Bourne shell
script[1] that takes a word list in 'number-of-occurrences word'
format and outputs two word lists, one freq-words and one
all-words. (Even less likely to be useful, this[2] is the hacky
script that I used to generate the original word list from an XML
TEI Greek corpus.)
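
For illustration only, here is a minimal Python sketch of the same
idea: read 'count word' lines, then produce a short freq-words list
and a full all-words list. It is not the script at [1]. The function
name split_word_list and the default cutoff of 300 are assumptions
made for this example; the only guidance above is "a few hundred".

```python
from collections import Counter


def split_word_list(lines, freq_count=300):
    """Split 'count word' lines into (freq_words, all_words).

    freq_count is a hypothetical cutoff for the frequent list,
    not a value that Tesseract prescribes.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        count, word = parts
        try:
            counts[word] += int(count)  # sum counts for repeated words
        except ValueError:
            continue  # first field was not a number
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    freq_words = ranked[:freq_count]
    # The full list keeps the frequent words too, so the lists overlap.
    all_words = sorted(ranked)
    return freq_words, all_words


if __name__ == "__main__":
    sample = ["120 the", "95 and", "3 doth", "1 vniuersall"]
    freq, everything = split_word_list(sample, freq_count=2)
    print("\n".join(freq))
    print("\n".join(everything))
```

Each returned list can then be written out one word per line. Note
that alternative spellings are simply treated as separate words here.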

> Is there any problem with having overlap in the two lists, i.e.
> words that appear in both lists?

That's fine, I believe. If it's in both lists it will be treated as
a frequent word, which is correct.

> Is there a better way to handle alternative
> spellings than just having them all available in the word list?

Nope. If you have any ideas, by all means share them.

Let us know how your project gets on!

Nick

0. http://www.eutypon.gr/eutypon/pdf/e2012-29/e29-a01.pdf
1. https://gitorious.org/ancient-greek-training-for-tesseract/tesstrainingtools/blobs/master/wordlistparse.sh
2. https://gitorious.org/ancient-greek-training-for-tesseract/grctrainingtools/blobs/master/wordlistfromperseus.sh

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
