Interesting. I wonder, though: if we have 4 million English documents and 250 in Urdu, would the Urdu words score badly when compared against n-gram statistics for the entire corpus?
hossman wrote:
> Since you are dealing with multiple languages, and multiple variant usages
> of languages (i.e., olde english), I wonder if one way to try and
> generalize the idea of "unlikely" letter combinations into a math problem
> (instead of a grammar/spelling problem) would be to score all the hapax
> legomenon words in your index based on the frequency of (character)
> N-grams in each of those words, relative to the entire corpus, and then
> eliminate any of the hapax legomenon words whose score is below some
> cutoff threshold (that you'd have to pick arbitrarily, probably by
> eyeballing the sorted list of words and their contexts to decide if they
> are legitimate)?
>
> -Hoss

--
View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871353.html
Sent from the Solr - User mailing list archive at Nabble.com.
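For what it's worth, here is a rough Python sketch of what Hoss describes: build character-trigram counts over the whole vocabulary, score each word by the mean log-frequency of its trigrams, and flag low scorers. The toy corpus, n-gram size, smoothing, and word-boundary markers are all illustrative assumptions, not anything taken from a real Solr index.

```python
from collections import Counter
import math

def char_ngrams(word, n=3):
    # pad with boundary markers so word-initial/final patterns count too
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_ngram_counts(words, n=3):
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w, n))
    return counts

def score(word, counts, total, n=3):
    # mean log-probability of the word's trigrams; add-one smoothing keeps
    # unseen trigrams from producing -inf
    grams = char_ngrams(word, n)
    return sum(math.log((counts[g] + 1) / (total + 1)) for g in grams) / len(grams)

# toy vocabulary, including "olde english" style variants
corpus = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
          "thee", "quicke"]
counts = build_ngram_counts(corpus)
total = sum(counts.values())

# a plausible spelling variant vs. likely OCR junk: the variant shares
# trigrams with the corpus, so it scores higher than the junk token
for w in ["throughe", "rn0use"]:
    print(w, round(score(w, counts, total), 2))
```

The cutoff would still be picked by eyeballing the sorted score list, as Hoss says; this only turns "looks like garbage" into a number you can sort on.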