[tesseract-ocr] Corpus for word frequencies in eng.cube.word-freq ?

Tom Morris Wed, 01 Jul 2015 12:00:59 -0700

When I look at the word frequencies in eng.cube.word-freq, they look more
like what I would expect from a web corpus than from a corpus of printed
materials (of any era).


The list starts off okay:

#1 the 13675
#2 of 15222
#3 and 15473
#4 to 15694
#5 a 17149

but then we have:

#29 Links 24448
#34 Search 25779
#37 Home 25853

which seem suspiciously like high frequency terms from web boilerplate.

If you look at the Google N-grams data
<https://books.google.com/ngrams/graph?content=Links%2Cwill%2Call%2CA%2CThis%2Csearch%2Chas%2Ccan%2CHome&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CLinks%3B%2Cc0%3B.t1%3B%2Cwill%3B%2Cc0%3B.t1%3B%2Call%3B%2Cc0%3B.t1%3B%2CA%3B%2Cc0%3B.t1%3B%2CThis%3B%2Cc0%3B.t1%3B%2Csearch%3B%2Cc0%3B.t1%3B%2Chas%3B%2Cc0%3B.t1%3B%2Ccan%3B%2Cc0%3B.t1%3B%2CHome%3B%2Cc0>,
you can see that the frequency of "Links" in printed books is orders of
magnitude lower.

How much of an impact does the word-freq list have on the OCR?  Would I get
better results on printed documents if I tuned the word frequencies to
match their contents?
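
A rough way to check the mismatch against a sample of the target documents
is sketched below. It assumes each line of the word list is "word value"
with the most frequent words first (as the excerpt above suggests, minus the
"#N" rank annotations), and that the sample text is plain text. The function
names and the factor of 10 are illustrative choices, not anything taken from
Tesseract.

```python
import re
from collections import Counter


def parse_word_list(lines):
    """Return the words of a 'word value' list in file order.

    Assumption: the file is ordered most frequent first; the numeric
    column is ignored here.
    """
    words = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            words.append(parts[0])
    return words


def corpus_ranks(text):
    """Map each word in text to its rank by count (1 = most frequent)."""
    counts = Counter(re.findall(r"[A-Za-z']+", text))
    # Break ties alphabetically so the ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


def suspicious_words(list_words, ranks, top_n=50, factor=10):
    """Flag top-of-list words that are missing or far rarer in the corpus.

    A word is flagged if it never occurs in the corpus, or if its corpus
    rank is more than `factor` times its rank in the list (an arbitrary
    threshold chosen for illustration).
    """
    flagged = []
    for list_rank, word in enumerate(list_words[:top_n], start=1):
        corpus_rank = ranks.get(word)
        if corpus_rank is None or corpus_rank > factor * list_rank:
            flagged.append((word, list_rank, corpus_rank))
    return flagged
```

Matching is case-sensitive on purpose, since the list itself seems to
distinguish "Links" from "links". To use it, read the word list with
open(...).read().splitlines() and pass the text of the sample documents to
corpus_ranks(); words such as "Links" or "Home" should show up in the
flagged output if the printed sample rarely contains them.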

Tom

