When I look at the word frequencies in eng.cube.word-freq, they look more like what I would expect from analyzing a web corpus rather than a corpus of printed materials (of any era).
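For anyone who wants to repeat the check, here is a minimal sketch in Python. It assumes the file has one entry per line, with the word first and an integer value last, and that entries are already in rank order; I have not verified the exact layout, so treat that as an assumption. The `WEB_TERMS` set and both function names are made up for illustration and are not part of Tesseract.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical set of words typical of web navigation boilerplate.
WEB_TERMS: Set[str] = {"links", "search", "home", "login", "email", "click"}


def parse_entries(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse 'word value' lines, skipping anything that does not fit.

    Assumed layout: whitespace-separated, word first, integer last.
    """
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or one-token line
        try:
            value = int(parts[-1])
        except ValueError:
            continue  # last token is not an integer
        entries.append((parts[0], value))
    return entries


def flag_web_terms(
    entries: List[Tuple[str, int]],
    top_n: int = 100,
    web_terms: Set[str] = WEB_TERMS,
) -> List[Tuple[int, str, int]]:
    """Return (rank, word, value) for suspect words among the first top_n entries.

    Rank is the 1-based position in the file, which assumes the file is
    already sorted by rank.
    """
    hits = []
    for rank, (word, value) in enumerate(entries[:top_n], start=1):
        if word.lower() in web_terms:
            hits.append((rank, word, value))
    return hits
```

To use it, read the lines of eng.cube.word-freq with `open(path, encoding="utf-8")`, pass them to `parse_entries`, and pass the result to `flag_web_terms`.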
The list starts off okay:

#1 the 13675
#2 of 15222
#3 and 15473
#4 to 15694
#5 a 17149

but then we have:

#29 Links 24448
#34 Search 25779
#37 Home 25853

which seem suspiciously like high-frequency terms from web boilerplate. If you look at the Google N-grams data <https://books.google.com/ngrams/graph?content=Links%2Cwill%2Call%2CA%2CThis%2Csearch%2Chas%2Ccan%2CHome&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CLinks%3B%2Cc0%3B.t1%3B%2Cwill%3B%2Cc0%3B.t1%3B%2Call%3B%2Cc0%3B.t1%3B%2CA%3B%2Cc0%3B.t1%3B%2CThis%3B%2Cc0%3B.t1%3B%2Csearch%3B%2Cc0%3B.t1%3B%2Chas%3B%2Cc0%3B.t1%3B%2Ccan%3B%2Cc0%3B.t1%3B%2CHome%3B%2Cc0>, you can see that the frequency of "Links" is orders of magnitude lower.

How much of an impact does the word-freq list have on the OCR? Would I get better results on printed documents if I tuned the word frequencies to match their contents?

Tom

