We've been thinking about running some kind of classifier against each book to select those with a high percentage of dirty OCR for special processing. We haven't quite settled on a multilingual feature set yet, beyond the punctuation/alphanumeric-ratio and character-block ideas mentioned above.
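As a rough sketch of what such a multilingual feature set might look like (this is a hypothetical illustration, not an implementation anyone has settled on), one could compute the punctuation ratio, the alphanumeric ratio, and the entropy of Unicode character categories for each book. Dirty OCR tends to scatter characters across many categories (stray symbols, marks, digits inside words), which shows up in all three numbers:

```python
import math
import unicodedata
from collections import Counter

def ocr_quality_features(text):
    """Language-agnostic features for flagging dirty OCR:
    punctuation ratio, alphanumeric ratio, and the entropy of
    Unicode character categories (a hypothetical feature set)."""
    n = len(text)
    alnum = sum(c.isalnum() for c in text)
    punct = sum(unicodedata.category(c).startswith("P") for c in text)
    # Dirty OCR scatters characters across more Unicode categories
    # (stray symbols, math signs, digits in words), raising entropy.
    cats = Counter(unicodedata.category(c) for c in text)
    entropy = -sum((v / n) * math.log2(v / n) for v in cats.values())
    return {"punct_ratio": punct / n,
            "alnum_ratio": alnum / n,
            "category_entropy": entropy}

clean = ocr_quality_features("The quick brown fox jumps over the lazy dog.")
dirty = ocr_quality_features("Tlie qvick br0wn f*x ju.mps ov~er t!he l@zy d0g;;")
print(clean)
print(dirty)
```

On these two toy strings the dirty one scores higher on punctuation ratio and category entropy and lower on alphanumeric ratio, which is the kind of separation a downstream classifier could exploit.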
I'm not sure I understand your suggestion. Since real-word hapax legomena are generally quite common (perhaps 40-60% of unique word types), wouldn't using them as the "no" set give the classifier mixed signals?

Tom

Walter Underwood-2 wrote:
>
> Hmm, how about a classifier? Common words are the "yes" training set,
> hapax legomenons are the "no" set, and n-grams are the features.
>
> But why isn't the OCR program already doing this?
>
> wunder
>

--
View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871444.html
Sent from the Solr - User mailing list archive at Nabble.com.
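To make the concern concrete, here is a minimal sketch of the suggested training-set construction (the function names and the frequency threshold are illustrative assumptions, not anything from the thread). Note how the hapax "no" set ends up containing both genuine OCR garbage and perfectly real words, which is exactly the mixed signal in question:

```python
from collections import Counter

def build_training_sets(tokens, common_threshold=2):
    """Split word types per the suggestion: frequent words form the
    'yes' set, hapax legomena form the 'no' set. The threshold is an
    arbitrary illustrative choice."""
    freq = Counter(tokens)
    yes = {w for w, c in freq.items() if c >= common_threshold}
    no = {w for w, c in freq.items() if c == 1}
    return yes, no, freq

def char_ngrams(word, n=3):
    """Character n-gram features with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

tokens = "the cat sat on the mat the cat ran tlie qvick br0wn".split()
yes, no, freq = build_training_sets(tokens)
# 'no' mixes real hapaxes (sat, on, mat, ran) with OCR errors
# (tlie, qvick, br0wn) -- the mixed-signal problem.
print(yes, no, len(no) / len(freq))
print(char_ngrams("cat"))
```

Even in this tiny example the hapax set is mostly real words, so a classifier trained on it as the "no" class would be penalized for recognizing legitimate rare vocabulary.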