Interesting. I wonder, though: if we have 4 million English documents and 250 in Urdu, would the Urdu words score badly when compared against n-gram statistics for the entire corpus?
hossman wrote:
> Since you are dealing with multiple languages, and multiple variant usages
> of languages (i.e., olde english), I wonder if one way to try and
> generalize the idea of "unlikely" letter combinations into a math problem
> (instead of a grammar/spelling problem) would be to score all the hapax
> legomenon words in your index based on the frequency of (character)
> N-grams in each of those words, relative to the entire corpus, and then
> eliminate any of the hapax legomenon words whose score is below some
> cutoff threshold (that you'd have to pick arbitrarily, probably by
> eyeballing the sorted list of words and their contexts to decide if they
> are legitimate)?
>
> -Hoss

--
View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871353.html
Sent from the Solr - User mailing list archive at Nabble.com.
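For what it's worth, here is a rough Python sketch of what Hoss describes: build character-trigram counts over the whole vocabulary, score each word by the mean log-frequency of its trigrams, and flag low scorers. The toy corpus, n-gram size, smoothing, and word-boundary markers are all illustrative assumptions, not anything taken from a real Solr index.

```python
from collections import Counter
import math

def char_ngrams(word, n=3):
    # pad with boundary markers so word-initial/final patterns count too
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_ngram_counts(words, n=3):
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w, n))
    return counts

def score(word, counts, total, n=3):
    # mean log-probability of the word's trigrams; add-one smoothing keeps
    # unseen trigrams from producing -inf
    grams = char_ngrams(word, n)
    return sum(math.log((counts[g] + 1) / (total + 1)) for g in grams) / len(grams)

# toy vocabulary, including "olde english" style variants
corpus = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
          "thee", "quicke"]
counts = build_ngram_counts(corpus)
total = sum(counts.values())

# a plausible spelling variant vs. likely OCR junk: the variant shares
# trigrams with the corpus, so it scores higher than the junk token
for w in ["throughe", "rn0use"]:
    print(w, round(score(w, counts, total), 2))
```

The cutoff would still be picked by eyeballing the sorted score list, as Hoss says; this only turns "looks like garbage" into a number you can sort on.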