RE: Cleaning up dirty OCR
Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).

There are a couple of use cases where removing words that occur only once might be a problem. One is genealogical searches, where a user might want to retrieve a document even if their relative is mentioned only once in it. We have quite a few government documents and other resources such as the Lineage Book of the Daughters of the American Revolution. Another use case is humanities researchers doing phrase searching for quotes. If we remove one of the words in a quote because it occurs only once in a document, the phrase search will fail. For example, if someone were searching Macbeth and entered the phrase query "Eye of newt and toe of frog", it would fail if we had removed "newt" from the index, because "newt" occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found that, of about 5,000 unique words, about 3,000 occurred only once. Of these, about 1,800 were in the unix dictionary, so at least 1,800 of the words that would be removed are real words rather than OCR errors (a spot check of the words not in the unix /usr/share/dict/words file revealed that most of them were also real words rather than OCR errors). I also ran a quick check against a document with bad OCR: out of about 30,000 unique words, 20,000 occurred only once. Of those 20,000, only about 300 were in the unix dictionary, so your intuition that a lot of OCR errors will occur only once seems spot on. A quick look at the words not in the dictionary revealed a mix of technical terms, common names, and obvious OCR nonsense such as ffll.lj'slall'lm

I guess the question I need to answer is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above. When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index.

I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to?

Tom

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?

One approach would be to try
http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only
appear once in the document.

--
Robert Muir
rcm...@gmail.com
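For anyone who wants to reproduce that kind of spot check, here is a minimal sketch in Python. It assumes plain-text input and the unix wordlist at /usr/share/dict/words; the whitespace tokenization and punctuation stripping are simplifications for illustration, not what our analyzer actually does.

    import re
    from collections import Counter

    def hapax_report(text_path, wordlist="/usr/share/dict/words"):
        """Report (unique terms, terms occurring once, once-only terms found in the wordlist)."""
        with open(wordlist) as f:
            dictionary = {line.strip().lower() for line in f}
        with open(text_path) as f:
            tokens = f.read().lower().split()
        freqs = Counter(tokens)
        hapaxes = [t for t, c in freqs.items() if c == 1]
        in_dict = [t for t in hapaxes if t.strip(".,;:!?'\"()[]") in dictionary]
        return len(freqs), len(hapaxes), len(in_dict)

    # unique, once_only, once_only_in_dict = hapax_report("macbeth.txt")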
Re: Cleaning up dirty OCR
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom tburt...@umich.edu wrote:

> Thanks Robert, I've been thinking about this since you suggested it on
> another thread. One problem is that it would also remove real words.
> Apparently 40-60% of the words in large corpora occur only once
> (http://en.wikipedia.org/wiki/Hapax_legomenon).

You are correct. I really hate recommending you 'remove data', but at the
same time, as perhaps an intermediate step, this could be a brutally simple
approach to move you along.

> I guess the question I need to answer is whether the benefit of removing
> words that occur only once outweighs the costs in terms of the two use
> cases outlined above. When we get our new test server set up, sometime in
> the next month, I think I will go ahead and prune a test index of 500K docs
> and do some performance testing just to get an idea of the potential
> performance gains of pruning the index.

Well, one thing I did with Andrzej's patch is immediately relevance-test
this approach against some corpora I had. The results are on the JIRA issue,
and the test collection itself is in openrelevance. In my opinion the P@N is
probably overstated and the MAP values are probably understated (due to it
being a pooled relevance collection), but I think it's fair to say that for
that specific large text collection, pruning terms that appear only a single
time in a document does not hurt relevance. At the same time I will not
dispute that it could actually help P@N; I am just saying I'm not sold :)
Either way it's extremely interesting: cut your index size in half and get
the same relevance!

> I have some other questions about index pruning, but I want to do a bit
> more reading and then I'll post a question to either the Solr or Lucene
> list. Can you suggest which list I should post an index pruning question to?

I would recommend posting it to the JIRA issue:
http://issues.apache.org/jira/browse/LUCENE-1812

This way someone who knows more (Andrzej) could see it, too.

--
Robert Muir
rcm...@gmail.com
Re: Cleaning up dirty OCR
Thanks Simon,

We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking
for unlikely mixes of Unicode character blocks. For example, some of the CJK
material ends up with Cyrillic characters. (Except we would have to watch
out for any Russian-Chinese dictionaries :)

Tom

> There wasn't any completely satisfactory solution; there were a large
> number of two and three letter n-grams, so we were able to use a dictionary
> approach to eliminate those (names tend to be longer). We also looked for
> runs of punctuation, unlikely mixes of alpha/numeric/punctuation, and also
> eliminated longer words which consisted of runs of not-occurring-in-English
> bigrams.
>
> Hope this helps
>
> -Simon
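A rough sketch of how those checks might look, in Python. The transition threshold and the script-mixing rule are illustrative guesses on my part, not settings anyone in this thread has tested:

    import re
    import unicodedata

    PUNCT_RUN = re.compile(r"[^\w\s]{3,}")   # three or more punctuation chars in a row

    def char_class(c):
        return "A" if c.isalpha() else "D" if c.isdigit() else "P"

    def class_transitions(token):
        """Count alpha/digit/punct switches inside the token; OCR junk switches a lot."""
        labels = [char_class(c) for c in token]
        return sum(1 for a, b in zip(labels, labels[1:]) if a != b)

    def scripts(token):
        """Rough per-character script names, e.g. {'LATIN', 'CYRILLIC', 'CJK'}."""
        return {unicodedata.name(c, "?").split()[0] for c in token if c.isalpha()}

    def looks_like_ocr_junk(token, max_transitions=4):
        if PUNCT_RUN.search(token):
            return True
        if class_transitions(token) > max_transitions:   # e.g. ffll.lj'slall'lm
            return True
        if len(scripts(token) & {"LATIN", "CYRILLIC", "CJK"}) > 1:  # mixed scripts
            return True
        return False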
Re: Cleaning up dirty OCR
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West tburtonw...@gmail.com wrote:

> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
> looking for unlikely mixes of Unicode character blocks. For example, some
> of the CJK material ends up with Cyrillic characters. (Except we would have
> to watch out for any Russian-Chinese dictionaries :)

OK, this is a new one for me. I am just curious, have you figured out why
this is happening?

Separately, I would love to see some sort of character frequency data for
your non-English text. Are you OCR'ing that data too? Are you using Unicode
normalization or anything to prevent an explosion of terms that are really
the same?

--
Robert Muir
rcm...@gmail.com
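A tiny illustration of what Unicode normalization buys here, using Python's standard unicodedata module. NFKC plus case folding is just one possible choice of normalization form, not necessarily what anyone in this thread is running:

    import unicodedata

    def normalize_term(term):
        """Collapse compatibility variants (full-width forms, ligatures, etc.) and case."""
        return unicodedata.normalize("NFKC", term).casefold()

    print(normalize_term("Ｓｏｌｒ"))  # full-width Latin letters -> 'solr'
    print(normalize_term("ﬁle"))        # 'fi' ligature            -> 'file'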
Re: Cleaning up dirty OCR
: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
: looking for unlikely mixes of unicode character blocks. For example some of
: the CJK material ends up with Cyrillic characters. (except we would have to
: watch out for any Russian-Chinese dictionaries:)

Since you are dealing with multiple languages, and multiple variant usages
of languages (ie: olde english), I wonder if one way to generalize the idea
of unlikely letter combinations into a math problem (instead of a
grammar/spelling problem) would be to score all the hapax legomenon words in
your index based on the frequency of (character) N-grams in each of those
words, relative to the entire corpus, and then eliminate any of the hapax
legomenon words whose score is below some cutoff threshold (which you'd have
to pick arbitrarily, probably by eyeballing the sorted list of words and
their contexts to decide if they are legitimate)?

-Hoss
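A sketch of what that scoring could look like, in Python. The trigram size, the padding convention, and the simple averaging are arbitrary choices for illustration:

    from collections import Counter

    def char_ngrams(word, n=3):
        padded = f"_{word}_"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def hapax_ngram_scores(corpus_terms, hapax_terms, n=3):
        """Score each hapax term by the average corpus frequency of its character n-grams."""
        counts = Counter(g for w in corpus_terms for g in char_ngrams(w, n))
        total = sum(counts.values())
        def score(word):
            grams = char_ngrams(word, n)
            return sum(counts[g] for g in grams) / (total * max(len(grams), 1))
        return {w: score(w) for w in hapax_terms}

    # Eliminate hapaxes scoring below a hand-picked cutoff, chosen by eyeballing the list:
    # suspects = sorted(hapax_ngram_scores(all_terms, hapaxes).items(), key=lambda kv: kv[1])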
Re: Cleaning up dirty OCR
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:

> I wonder if one way to try and generalize the idea of unlikely letter
> combinations into a math problem (instead of a grammar/spelling problem)
> would be to score all the hapax legomenon words in your index

Hmm, how about a classifier? Common words are the "yes" training set, hapax
legomena are the "no" set, and n-grams are the features.

But why isn't the OCR program already doing this?

wunder
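One way to prototype that classifier, sketched in Python with scikit-learn (the library choice and parameters are assumptions for illustration; character n-grams of each word are the features). Note Tom's caveat below: many hapaxes are real words, so the "no" set is noisy.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_word_classifier(common_words, hapax_words):
        """Character n-gram classifier: frequent corpus words as 'clean', hapaxes as 'suspect'."""
        words = list(common_words) + list(hapax_words)
        labels = [0] * len(common_words) + [1] * len(hapax_words)
        model = make_pipeline(
            CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(words, labels)
        return model

    # model.predict_proba(["hamlet", "ffll.lj'slall'lm"])[:, 1] gives a per-token 'junk' score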
Re: Cleaning up dirty OCR
Interesting. I wonder though, if we have 4 million English documents and 250
in Urdu, whether the Urdu words would score badly when compared to n-gram
statistics for the entire corpus.

hossman wrote:
> Since you are dealing with multiple languages, and multiple variant usages
> of languages (ie: olde english), I wonder if one way to generalize the idea
> of unlikely letter combinations into a math problem (instead of a
> grammar/spelling problem) would be to score all the hapax legomenon words
> in your index based on the frequency of (character) N-grams in each of
> those words, relative to the entire corpus, and then eliminate any of the
> hapax legomenon words whose score is below some cutoff threshold (which
> you'd have to pick arbitrarily, probably by eyeballing the sorted list of
> words and their contexts to decide if they are legitimate)?
>
> -Hoss
Re: Cleaning up dirty OCR
We've been thinking about running some kind of a classifier against each
book to select books with a high percentage of dirty OCR for some kind of
special processing. We haven't quite figured out a multilingual feature set
yet, other than the punctuation/alphanumeric and character block ideas
mentioned above.

I'm not sure I understand your suggestion. Since real-word hapax legomena
are generally pretty common (maybe 40-60% of unique words), wouldn't using
them as the "no" set provide mixed signals to the classifier?

Tom

Walter Underwood-2 wrote:
> Hmm, how about a classifier? Common words are the "yes" training set, hapax
> legomena are the "no" set, and n-grams are the features.
>
> But why isn't the OCR program already doing this?
>
> wunder
Re: Cleaning up dirty OCR
: Interesting. I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.

Well, it doesn't have to be a strict ratio cutoff ... you could look at the
average frequency of all character N-grams in your index, consider any
N-gram whose frequency is more than X standard deviations below the average
to be suspicious, and eliminate any word that contains Y or more suspicious
N-grams. Or you could just start really simple and eliminate any word that
contains an N-gram that doesn't appear in *any* other word in your corpus.

I don't deal with a lot of multi-lingual stuff, but my understanding is that
this sort of thing gets a lot easier if you can partition your docs by
language -- and even if you can't, you could do some language detection on
the (dirty) OCRed text to get a language guess, then partition by language
and attempt to find the suspicious words in each partition.

-Hoss
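A sketch of both variants in Python, assuming the input is the list of unique terms in the index; the X and Y defaults are placeholders, not tested values:

    import statistics
    from collections import Counter

    def char_ngrams(word, n=3):
        padded = f"_{word}_"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def suspicious_words(unique_terms, n=3, x_stddevs=2.0, y_ngrams=2):
        """Flag words containing Y or more character n-grams that are rare for this corpus."""
        counts = Counter(g for w in unique_terms for g in char_ngrams(w, n))
        mean = statistics.mean(counts.values())
        sd = statistics.stdev(counts.values())
        # N-gram counts are heavily skewed, so the cutoff can go negative; flooring it at 1
        # also covers the simpler rule "this n-gram appears in no other word in the corpus"
        cutoff = max(mean - x_stddevs * sd, 1)
        flagged = []
        for w in unique_terms:
            rare = sum(1 for g in char_ngrams(w, n) if counts[g] <= cutoff)
            if rare >= y_ngrams:
                flagged.append(w)
        return flagged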
Re: Cleaning up dirty OCR
> I don't deal with a lot of multi-lingual stuff, but my understanding is
> that this sort of thing gets a lot easier if you can partition your docs by
> language -- and even if you can't, doing some language detection on the
> (dirty) OCRed text to get a language guess (and then partition by language
> and attempt to find the suspicious words in each partition)

And if you are really OCR'ing Urdu text and trying to search it
automatically, then this is your last priority.

--
Robert Muir
rcm...@gmail.com
Cleaning up dirty OCR
Hello all,

We have been indexing a large collection of OCR'd text: about 5 million
books in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR
error rate creates a relatively large number of meaningless unique terms.
(See http://www.hathitrust.org/blogs/large-scale-search/too-many-words)

We would like to remove some *fraction* of these nonsense words caused by
OCR errors prior to indexing. (We don't want to remove real words, so we
need some method with very few false positives.) A dictionary-based approach
does not seem feasible given the number of languages and the inclusion of
proper names, place names, and technical terms.

We are considering using some heuristics, such as looking for strings over a
certain length or strings containing more than some number of punctuation
characters. This paper has a few such heuristics:

Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. Automatic
Removal of "Garbage Strings" in OCR Text: An Implementation. In The 5th
World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando,
Florida, July 2001.
http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf

Can anyone suggest any practical solutions to removing some fraction of the
tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org
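To make the kind of heuristics we mean concrete, here is a rough Python sketch in the spirit of that paper; the specific rules and thresholds below are illustrative guesses, not the rules from Taghva et al.:

    import re

    REPEATED_CHAR = re.compile(r"(.)\1{3,}")   # the same character four or more times in a row

    def is_garbage_string(token, max_len=20, max_punct=2):
        """Very rough rule-of-thumb garbage-string checks; thresholds are illustrative only."""
        if len(token) > max_len:
            return True
        if sum(not c.isalnum() for c in token) > max_punct:
            return True
        if REPEATED_CHAR.search(token):
            return True
        # long Latin-script strings with no vowels at all are usually OCR noise
        letters = [c for c in token.lower() if c.isalpha()]
        if len(letters) > 6 and not any(c in "aeiouy" for c in letters):
            return True
        return False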
Re: Cleaning up dirty OCR
> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?

One approach would be to try
http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only
appear once in the document.

--
Robert Muir
rcm...@gmail.com
Re: Cleaning up dirty OCR
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir rcm...@gmail.com wrote:

>> Can anyone suggest any practical solutions to removing some fraction of
>> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that
> only appear once in the document.

In another life (and with another search engine) I also had to find a
solution to the dirty OCR problem. Fortunately it was only in English;
unfortunately the corpus contained many non-American/non-English names, so
we also had to be very conservative and reduce the number of false
positives.

There wasn't any completely satisfactory solution. There were a large number
of two and three letter n-grams, so we were able to use a dictionary
approach to eliminate those (names tend to be longer). We also looked for
runs of punctuation and unlikely mixes of alpha/numeric/punctuation, and
eliminated longer words which consisted of runs of not-occurring-in-English
bigrams.

Hope this helps

-Simon