Re: OCR - Saving multi-term position

2014-07-03 Thread Charlie Hull
On 02/07/2014 15:19, Manuel Le Normand wrote: Hello, Many of our indexed documents are scanned and OCR'ed documents. Unfortunately we were not able to improve the OCR quality much (less than 80% word accuracy) for various reasons, a fact which badly hurts the retrieval quality. As we use an

OCR - Saving multi-term position

2014-07-02 Thread Manuel Le Normand
Hello, Many of our indexed documents are scanned and OCR'ed documents. Unfortunately we were not able to improve the OCR quality much (less than 80% word accuracy) for various reasons, a fact which badly hurts the retrieval quality. As we use an open-source OCR, we are thinking of changing every scanned

Re: OCR - Saving multi-term position

2014-07-02 Thread Michael Della Bitta
I don't have first-hand knowledge of how you implement that, but I bet a look at the WordDelimiterFilter would help you understand how to emit multiple terms with the same positions pretty easily. I've heard of this bag-of-word-variants approach to indexing poor-quality OCR output before for
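
To illustrate the idea of emitting several terms at the same position, here is a minimal sketch in plain Python. It does not use the Lucene API; it only models the convention that a token with a position increment of 0 is stacked on the previous token's position. The function names `stack_variants` and `absolute_positions` are hypothetical, and every slot is assumed to hold at least one candidate word.

```python
from typing import List, Tuple


def stack_variants(slots: List[List[str]]) -> List[Tuple[str, int]]:
    """Turn per-position candidate lists into (term, position_increment) pairs.

    The first candidate of each slot advances the position by 1;
    every further candidate gets increment 0, so it shares that position.
    Assumption: each slot contains at least one candidate.
    """
    tokens = []
    for variants in slots:
        for i, term in enumerate(variants):
            tokens.append((term, 1 if i == 0 else 0))
    return tokens


def absolute_positions(tokens: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Resolve position increments into absolute, zero-based positions."""
    position = -1
    resolved = []
    for term, increment in tokens:
        position += increment
        resolved.append((term, position))
    return resolved


if __name__ == "__main__":
    # Hypothetical OCR output: the second word has two plausible readings.
    slots = [["quick"], ["brown", "brawn"], ["fox"]]
    for term, position in absolute_positions(stack_variants(slots)):
        print(position, term)
```

Because both readings of the second word share one position, a phrase query could match either variant without shifting the positions of the words that follow.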

Re: OCR - Saving multi-term position

2014-07-02 Thread Erick Erickson
Problem here is that you wind up with a zillion unique terms in your index, which may lead to performance issues, but you probably already know that :). I've seen situations where running it through a dictionary helps. That is, does each term in the OCR match some dictionary? Problem here is that
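
The dictionary check mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the dictionary is a plain in-memory set of lowercase words, matching is case-insensitive, and the name `split_by_dictionary` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def split_by_dictionary(
    terms: Iterable[str], dictionary: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate OCR terms into those found in the dictionary and the rest.

    Assumption: `dictionary` holds lowercase words; terms are lowercased
    before the lookup, so the comparison is case-insensitive.
    """
    known, unknown = [], []
    for term in terms:
        if term.lower() in dictionary:
            known.append(term)
        else:
            unknown.append(term)
    return known, unknown


if __name__ == "__main__":
    # Hypothetical dictionary and OCR output.
    words = {"the", "modern", "index"}
    known, unknown = split_by_dictionary(["The", "rnodern", "index"], words)
    print("known:", known)
    print("unknown:", unknown)
```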

Re: OCR - Saving multi-term position

2014-07-02 Thread Manuel Le Normand
Thanks for your answers Erick and Michael. The term confidence level is an OCR output metric that gives, for every word, the odds that it is the actual scanned term. I would like the OCR program to output all the suspected words whose confidences sum to above ~90% that it is the actual term instead of
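
The selection rule described here (keep suspected words until their confidences add up to roughly 90%) might look like the sketch below. It assumes the OCR engine reports each candidate as a (word, confidence) pair with confidences between 0 and 1; the function name `select_candidates` and the default threshold of 0.9 are illustrative choices, not part of any OCR tool's interface.

```python
from typing import List, Tuple


def select_candidates(
    candidates: List[Tuple[str, float]], threshold: float = 0.9
) -> List[str]:
    """Keep the most likely candidates until their confidences reach the threshold.

    Candidates are taken in order of decreasing confidence. If the total
    never reaches the threshold, all candidates are returned.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    chosen = []
    total = 0.0
    for word, confidence in ranked:
        chosen.append(word)
        total += confidence
        if total >= threshold:
            break
    return chosen


if __name__ == "__main__":
    # Hypothetical OCR candidates for a single scanned word.
    print(select_candidates([("modem", 0.55), ("modern", 0.40), ("rnodem", 0.05)]))
    print(select_candidates([("cat", 0.97), ("cot", 0.03)]))
```

A word recognized with high confidence yields a single term, while an uncertain word yields several variants that can then be indexed at the same position.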

Re: OCR - Saving multi-term position

2014-07-02 Thread Jack Krupansky
Thanks for your answers Erick and Michael. The term confidence level is an OCR output metric which tells for every word what are the odds it's the actual scanned term. I wish the OCR prog to output all the suspected words that sum up to above ~90% of confidence

Re: OCR - Saving multi-term position

2014-07-02 Thread Koji Sekiguchi
Hi Manuel, I think OCR error correction is one of the well-known NLP tasks. I have thought in the past that it could be implemented using Lucene. Here is a brief idea: 1. You have a Lucene index. This existing index is built from correct (i.e. error-free) documents that are same domain of OCR