RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
Thanks Robert,

I've been thinking about this since you suggested it on another thread.  One 
problem is that it would also remove real words. Apparently 40-60% of the words 
in large corpora occur only once 
(http://en.wikipedia.org/wiki/Hapax_legomenon).

There are a couple of use cases where removing words that occur only once might 
be a problem.  

One is for genealogical searches where a user might want to retrieve a document 
if their relative is only mentioned once in the document.  We have quite a few 
government documents and other resources such as the Lineage Book of the 
Daughters of the American Revolution.  

Another use case is humanities researchers doing phrase searching for quotes.  
In this case, if we remove one of the words in the quote because it occurs only 
once in a document, then the phrase search would fail.  For example, if someone 
were searching Macbeth and entered the phrase query "Eye of newt and toe of 
frog", it would fail if we had removed "newt" from the index, because "newt" 
occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found that, out 
of about 5,000 unique words, about 3,000 occurred only once.  Of these, about 
1,800 were in the unix dictionary, so at least 1,800 of the words that would be 
removed would be real words as opposed to OCR errors (a spot check of the words 
not in the unix /usr/share/dict/words file revealed that most of them were also 
real words rather than OCR errors).

I also ran a quick check against a document with bad OCR and, out of about 
30,000 unique words, 20,000 occurred only once.  Of those 20,000, only about 300 
were in the unix dictionary, so your intuition that a lot of OCR errors will 
occur only once seems spot on.  A quick look at the words not in the dictionary 
revealed a mix of technical terms, common names, and obvious OCR nonsense such 
as ffll.lj'slall'lm.
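
In case it is useful, here is roughly the kind of quick check I mean (a 
throwaway sketch, not the exact script behind the numbers above; the crude 
letters-and-apostrophes tokenization and the /usr/share/dict/words path are 
assumptions):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Count unique terms in a plain-text file, then report how many of the
// terms occurring exactly once are in the unix word list.
public class HapaxCheck {

    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get(args[0])))
                .toLowerCase();
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.split("[^a-z']+")) {
            if (!tok.isEmpty()) counts.merge(tok, 1, Integer::sum);
        }
        Set<String> dict = new HashSet<>();
        for (String w : Files.readAllLines(Paths.get("/usr/share/dict/words"))) {
            dict.add(w.toLowerCase());
        }
        long hapax = 0, hapaxInDict = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {
                hapax++;
                if (dict.contains(e.getKey())) hapaxInDict++;
            }
        }
        System.out.println("unique: " + counts.size() + ", occur once: " + hapax
                + ", of those in dictionary: " + hapaxInDict);
    }
}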

I guess the question I need to answer is whether the benefit of removing 
words that occur only once outweighs the cost for the two use cases 
outlined above.  When we get our new test server set up, sometime in the next 
month, I think I will go ahead and prune a test index of 500K docs and do some 
performance testing, just to get an idea of the potential performance gains of 
pruning the index.

I have some other questions about index pruning, but I want to do a bit more 
reading and then I'll post a question to either the Solr or Lucene list.  Can 
you suggest which list I should post an index pruning question to?

Tom


-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

 Can anyone suggest any practical solutions to removing some fraction of the 
 tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.


-- 
Robert Muir
rcm...@gmail.com


Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Robert,

 I've been thinking about this since you suggested it on another thread.  One 
 problem is that it would also remove real words. Apparently 40-60% of the 
 words in large corpora occur only once 
 (http://en.wikipedia.org/wiki/Hapax_legomenon.)


You are correct. I really hate recommending you 'remove data', but at
the same time, as perhaps an intermediate step, this could be a
brutally simple approach to move you along.


 I guess the question I need to determine is whether the benefit of removing 
 words that occur only once outweighs the costs in terms of the two use cases 
 outlined above.   When we get our new test server set up, sometime in the 
 next month, I think I will go ahead and prune a test index of 500K docs and 
 do some performance testing just to get an idea of the potential performance 
 gains of pruning the index.

Well, one thing I did with Andrzej's patch was to immediately
relevance-test this approach against some corpora I had. The results
are on the JIRA issue, and the test collection itself is in
openrelevance.

In my opinion the P@N numbers are probably overstated, and the MAP values are
probably understated (due to it being a pooled relevance collection),
but I think it's fair to say that for that specific large text collection,
pruning terms that only appear in the document a single time does not
hurt relevance.

At the same time, I will not dispute that it could actually help P@N; I
am just saying I'm not sold :)

Either way it's extremely interesting: cut your index size in half and
get the same relevance!


 I have some other questions about index pruning, but I want to do a bit more 
 reading and then I'll post a question to either the Solr or Lucene list.  Can 
 you suggest which list I should post an index pruning question to?


I would recommend posting it to the JIRA issue:
http://issues.apache.org/jira/browse/LUCENE-1812

This way someone who knows more (Andrzej) could see it, too.


-- 
Robert Muir
rcm...@gmail.com


Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

Thanks Simon,

We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
looking for unlikely mixes of Unicode character blocks.  For example, some of
the CJK material ends up with Cyrillic characters (though we would have to
watch out for any Russian-Chinese dictionaries :)
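
Something along these lines is what I have in mind for the character-block 
check (just a sketch; treating any single token that mixes CJK ideographs and 
Cyrillic as suspicious is an illustrative rule, not a tested one):

import java.util.*;

// Flag tokens whose letters come from an unlikely mix of Unicode blocks,
// e.g. CJK ideographs interleaved with Cyrillic.
public class BlockMixCheck {

    public static boolean looksSuspicious(String token) {
        Set<Character.UnicodeBlock> blocks = new HashSet<>();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                blocks.add(Character.UnicodeBlock.of(cp));
            }
            i += Character.charCount(cp);
        }
        boolean hasCjk =
                blocks.contains(Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
        boolean hasCyrillic =
                blocks.contains(Character.UnicodeBlock.CYRILLIC);
        return hasCjk && hasCyrillic;   // unlikely in clean text
    }
}

We would of course have to whitelist legitimately bilingual material (the 
Russian-Chinese dictionary case) before actually dropping anything.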

Tom



 
 
 There wasn't any completely satisfactory solution; there were a large
 number
 of two and three letter n-grams so we were able to use a dictionary
 approach
 to eliminate those (names tend to be longer).  We also looked for runs of
 punctuation,  unlikely mixes of alpha/numeric/punctuation, and also
 eliminated longer words which consisted of runs of not-occurring-in-English
 bigrams.
 
 Hope this helps
 
 -Simon




Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West tburtonw...@gmail.com wrote:

 Thanks Simon,

 We can probably implement your suggestion about runs of punctuation and
 unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
 looking for unlikely mixes of unicode character blocks.  For example some of
 the CJK material ends up with Cyrillic characters. (except we would have to
 watch out for any Russian-Chinese dictionaries:)


OK, this is a new one for me. I'm just curious: have you figured out
why this is happening?

Separately, I would love to see some character frequency data for your
non-English text. Are you OCR'ing that data too? Are you using Unicode
normalization or anything to prevent an explosion of terms that are
really the same?

-- 
Robert Muir
rcm...@gmail.com


Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter

: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
: looking for unlikely mixes of unicode character blocks.  For example some of
: the CJK material ends up with Cyrillic characters. (except we would have to
: watch out for any Russian-Chinese dictionaries:)

Since you are dealing with multiple languages, and multiple variant usages 
of languages (ie: olde english), I wonder if one way to try and generalize 
the idea of unlikely letter combinations into a math problem (instead of a 
grammar/spelling problem) would be to score all the hapax legomenon 
words in your index based on the frequency of (character) N-grams in 
each of those words, relative to the entire corpus, and then eliminate any of 
the hapax legomenon words whose score is below some cutoff threshold 
(one that you'd have to pick arbitrarily, probably by eyeballing the sorted 
list of words and their contexts to decide if they are legitimate)

?
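
Very roughly, something like this (character trigrams, a plain average of 
trigram frequencies, and underscore padding at word boundaries are all 
arbitrary choices here, not a recommendation):

import java.util.*;

// Score a hapax legomenon by the average corpus frequency of its
// character trigrams; words scoring below some hand-picked threshold
// become candidates for removal.
public class NgramScore {

    public static double score(String word, Map<String, Long> trigramCounts) {
        String padded = "_" + word.toLowerCase() + "_";
        if (padded.length() < 3) return 0.0;   // nothing to score
        double total = 0.0;
        int n = 0;
        for (int i = 0; i + 3 <= padded.length(); i++) {
            String tri = padded.substring(i, i + 3);
            total += trigramCounts.getOrDefault(tri, 0L);
            n++;
        }
        // low average => built from rare trigrams => suspicious
        return total / n;
    }
}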


-Hoss



Re: Cleaning up dirty OCR

2010-03-11 Thread Walter Underwood
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:

 I wonder if one way to try and generalize 
 the idea of unlikely letter combinations into a math problem (instead of a 
 grammar/spelling problem) would be to score all the hapax legomenon 
 words in your index


Hmm, how about a classifier? Common words are the "yes" training set, hapax 
legomenons are the "no" set, and n-grams are the features.
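
For instance, a toy version of what I'm picturing (character trigrams as 
features and a naive Bayes log-odds score; the add-one smoothing and the 
trigram choice are just illustrative):

import java.util.*;

// Toy trigram naive Bayes: train on common words as the "yes" class and
// on (presumed junk) hapax terms as the "no" class, then score new
// words by log-odds. Positive => looks like a real word.
public class TrigramClassifier {

    private final Map<String, Integer> yes = new HashMap<>();
    private final Map<String, Integer> no = new HashMap<>();
    private int yesTotal = 0, noTotal = 0;

    private static List<String> trigrams(String w) {
        String p = "_" + w.toLowerCase() + "_";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= p.length(); i++) {
            out.add(p.substring(i, i + 3));
        }
        return out;
    }

    public void train(String word, boolean isRealWord) {
        for (String t : trigrams(word)) {
            if (isRealWord) { yes.merge(t, 1, Integer::sum); yesTotal++; }
            else            { no.merge(t, 1, Integer::sum);  noTotal++; }
        }
    }

    public double logOdds(String word) {
        int vocab = yes.size() + no.size() + 1;   // crude smoothing denominator
        double s = 0.0;
        for (String t : trigrams(word)) {
            double pYes = (yes.getOrDefault(t, 0) + 1.0) / (yesTotal + vocab);
            double pNo  = (no.getOrDefault(t, 0) + 1.0) / (noTotal + vocab);
            s += Math.log(pYes / pNo);
        }
        return s;
    }
}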

But why isn't the OCR program already doing this?

wunder


Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

Interesting.  I wonder, though, if we have 4 million English documents and 250
in Urdu, whether the Urdu words would score badly when compared to n-gram
statistics for the entire corpus.


hossman wrote:
 
 
 
 Since you are dealing with multiple languages, and multiple variant usages 
 of languages (ie: olde english), I wonder if one way to try and generalize 
 the idea of unlikely letter combinations into a math problem (instead of a 
 grammar/spelling problem) would be to score all the hapax legomenon 
 words in your index based on the frequency of (character) N-grams in 
 each of those words, relative to the entire corpus, and then eliminate any of 
 the hapax legomenon words whose score is below some cutoff threshold 
 (one that you'd have to pick arbitrarily, probably by eyeballing the sorted 
 list of words and their contexts to decide if they are legitimate)
 
   ?
 
 
 -Hoss
 
 
 




Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

We've been thinking about running some kind of classifier against each book
to select books with a high percentage of dirty OCR for some kind of special
processing.  We haven't quite figured out a multilingual feature set yet, other
than the punctuation/alphanumeric and character block ideas mentioned above.

I'm not sure I understand your suggestion. Since real-word hapax legomenons
are generally pretty common (maybe 40-60% of unique words), wouldn't using
them as the "no" set provide mixed signals to the classifier?

Tom


Walter Underwood-2 wrote:
 
 
 Hmm, how about a classifier? Common words are the yes training set,
 hapax legomenons are the no set, and n-grams are the features.
 
 But why isn't the OCR program already doing this?
 
 wunder




Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter

: Interesting.  I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.  

Well, it doesn't have to be a strict ratio cutoff ... you could look at the 
average frequency of all character N-grams in your index, then 
consider any N-gram whose frequency is more than X stddevs below the average 
to be suspicious, and eliminate any word that contains Y or more 
suspicious N-grams.

Or you could just start really simple and eliminate any word that contains 
an N-gram that doesn't appear in *any* other word in your corpus.
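
A rough sketch of that first variant, with X and Y left as knobs you would 
tune by eyeballing the output (trigrams are again an arbitrary choice):

import java.util.*;

// Mark character N-grams whose corpus frequency is more than x standard
// deviations below the mean as suspicious, then flag any word containing
// y or more suspicious N-grams. Both thresholds are arbitrary.
public class SuspiciousNgrams {

    public static Set<String> findSuspicious(Map<String, Long> ngramCounts,
                                             double x) {
        Set<String> suspicious = new HashSet<>();
        if (ngramCounts.isEmpty()) return suspicious;
        double mean = 0.0;
        for (long c : ngramCounts.values()) mean += c;
        mean /= ngramCounts.size();
        double var = 0.0;
        for (long c : ngramCounts.values()) var += (c - mean) * (c - mean);
        double stddev = Math.sqrt(var / ngramCounts.size());
        for (Map.Entry<String, Long> e : ngramCounts.entrySet()) {
            if (e.getValue() < mean - x * stddev) suspicious.add(e.getKey());
        }
        return suspicious;
    }

    public static boolean flagWord(String word, Set<String> suspicious, int y) {
        String p = "_" + word.toLowerCase() + "_";
        int hits = 0;
        for (int i = 0; i + 3 <= p.length(); i++) {
            if (suspicious.contains(p.substring(i, i + 3))) hits++;
        }
        return hits >= y;
    }
}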

I don't deal with a lot of multi-lingual stuff, but my understanding is 
that this sort of thing gets a lot easier if you can partition your docs 
by language -- and even if you can't, you could do some language detection on 
the (dirty) OCRed text to get a language guess, then partition by language 
and attempt to find the suspicious words in each partition.


-Hoss



Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir

 I don't deal with a lot of multi-lingual stuff, but my understanding is
 that this sort of thing gets a lot easier if you can partition your docs
 by language -- and even if you can't, doing some language detection on the
 (dirty) OCRed text to get a language guess (and then partition by language
 and attempt to find the suspicious words in each partition)


and if you are really OCR'ing Urdu text and trying to search it automatically,
then this is your last priority.

-- 
Robert Muir
rcm...@gmail.com


Cleaning up dirty OCR

2010-03-09 Thread Burton-West, Tom
Hello all,

We have been indexing a large collection of OCR'd text. About 5 million books 
in over 200 languages.  With 1.5 billion OCR'd pages, even a small OCR error 
rate creates a relatively large number of meaningless unique terms.  (See  
http://www.hathitrust.org/blogs/large-scale-search/too-many-words )

We would like to remove some *fraction* of these nonsense words caused by OCR 
errors prior to indexing. (We don't want to remove real words, so we need 
some method with very few false positives.)

A dictionary-based approach does not seem feasible given the number of 
languages and the inclusion of proper names, place names, and technical terms.  
We are considering using some heuristics, such as looking for strings over a 
certain length or strings containing more than some number of punctuation 
characters.

This paper has a few such heuristics:
Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. Automatic Removal 
of "Garbage Strings" in OCR Text: An Implementation. In The 5th World 
Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, 
July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf
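
For illustration, the kind of heuristics we are considering might look 
something like the following (the length cutoff, the regexes, and the 
punctuation ratio are arbitrary placeholders, not the rules from the paper):

import java.util.regex.Pattern;

// A few garbage-string heuristics: absurdly long tokens, runs of
// punctuation, digits wedged between letters, and tokens that are
// mostly punctuation. All thresholds are placeholders to be tuned.
public class GarbageHeuristics {

    private static final Pattern PUNCT_RUN = Pattern.compile("\\p{Punct}{3,}");
    private static final Pattern DIGIT_SANDWICH =
            Pattern.compile(".*\\p{L}.*\\p{N}.*\\p{L}.*");

    public static boolean looksLikeGarbage(String token) {
        if (token.length() > 25) return true;              // improbably long
        if (PUNCT_RUN.matcher(token).find()) return true;  // punctuation runs
        if (DIGIT_SANDWICH.matcher(token).matches()) return true;
        long punct = token.chars()
                .filter(c -> !Character.isLetterOrDigit(c)).count();
        return punct > token.length() / 2;                 // mostly punctuation
    }
}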

Can anyone suggest any practical solutions to removing some fraction of the 
tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org



Re: Cleaning up dirty OCR

2010-03-09 Thread Robert Muir
 Can anyone suggest any practical solutions to removing some fraction of the 
 tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.
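
Just to illustrate the "appears only once" idea (this is not the LUCENE-1812 
patch, which prunes an already-built index; it's a trivial per-document 
pre-filter you could run before the text reaches the index):

import java.util.*;

// Drop tokens that occur exactly once in a document. Two passes over
// the document's tokens: count, then keep only repeated terms.
public class HapaxDropper {

    public static List<String> dropHapax(List<String> docTokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : docTokens) counts.merge(t, 1, Integer::sum);
        List<String> kept = new ArrayList<>();
        for (String t : docTokens) {
            if (counts.get(t) > 1) kept.add(t);
        }
        return kept;
    }
}

The obvious cost is that real hapax legomena would be dropped along with the 
OCR junk.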


-- 
Robert Muir
rcm...@gmail.com


Re: Cleaning up dirty OCR

2010-03-09 Thread simon
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir rcm...@gmail.com wrote:

  Can anyone suggest any practical solutions to removing some fraction of
 the tokens containing OCR errors from our input stream?

 one approach would be to try
 http://issues.apache.org/jira/browse/LUCENE-1812

 and filter terms that only appear once in the document.


In another life (and with another search engine) I also had to find a
solution to the dirty OCR problem. Fortunately it was only in English;
unfortunately, the corpus contained many non-American/non-English names, so we
also had to be very conservative to reduce the number of false positives.

There wasn't any completely satisfactory solution; there were a large number
of two- and three-letter n-grams, so we were able to use a dictionary approach
to eliminate those (names tend to be longer).  We also looked for runs of
punctuation and unlikely mixes of alpha/numeric/punctuation, and
eliminated longer words which consisted of runs of not-occurring-in-English
bigrams.

Hope this helps

-Simon


 --