Thanks for your answers, Erick and Michael.

The confidence level is an OCR output metric that gives, for every word, the probability that it is the actual scanned term. Instead of outputting a single word, which is the default behaviour, I would like the OCR program to output all the "suspected words" whose confidences sum to above ~90%.
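
A minimal sketch of that selection in Python, assuming the OCR engine can return a list of (word, confidence) candidates for each scanned word. The function name, the input format and the confidence values are hypothetical and chosen for illustration only; they are not the API of any particular OCR program.

```python
def select_candidates(candidates, threshold=0.9):
    """Pick the most likely OCR candidates for one scanned word.

    candidates: list of (word, confidence) pairs (hypothetical input format).
    Returns the words, most confident first, whose cumulative confidence
    first reaches `threshold`. If the threshold is never reached, all
    candidates are returned.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    selected = []
    total = 0.0
    for word, confidence in ranked:
        selected.append(word)
        total += confidence
        if total >= threshold:
            break
    return selected


# Made-up confidences for a single scanned word.
variants = select_candidates(
    [("brawn", 0.05), ("brown", 0.55), ("drown", 0.02), ("browm", 0.38)]
)
print(variants)
```

All words returned for one scanned word would then be indexed at the same position, so a clearly recognised word yields a single term and only ambiguous words add extra terms to the index.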

I'm happy to hear this approach has been used before. I will implement an analyser that indexes these terms at the same position to enable positional queries. I hope it works well; if it does, I will open a Jira ticket for it.

If anyone else has experience with this use case, I'd love to hear about it.

Manuel

On Wed, Jul 2, 2014 at 7:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Problem here is that you wind up with a zillion unique terms in your
> index, which may lead to performance issues, but you probably already
> know that :).
>
> I've seen situations where running it through a dictionary helps. That
> is, does each term in the OCR match some dictionary? Problem here is
> that it then de-values terms that don't happen to be in the
> dictionary, names for instance.
>
> But to answer your question: No, there really isn't a pre-built
> analysis chain that i know of that does this. Root issue is how to
> assign "confidence"? No clue for your specific domain.
>
> So payloads seem quite reasonable here. Happens there's a recent
> end-to-end example, see:
> http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/
>
> Best,
> Erick
>
> On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta
> <michael.della.bi...@appinions.com> wrote:
> > I don't have first hand knowledge of how you implement that, but I bet a
> > look at the WordDelimiterFilter would help you understand how to emit
> > multiple terms with the same positions pretty easily.
> >
> > I've heard of this "bag of word variants" approach to indexing poor-quality
> > OCR output before for findability reasons and I heard it works out OK.
> >
> > Michael Della Bitta
> >
> > On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand
> > <manuel.lenorm...@gmail.com> wrote:
> >
> >> Hello,
> >> Many of our indexed documents are scanned and OCR'ed documents.
> >> Unfortunately we were not able to improve much the OCR quality (less than
> >> 80% word accuracy) for various reasons, a fact which badly hurts the
> >> retrieval quality.
> >>
> >> As we use an open-source OCR, we think of changing every scanned term
> >> output to its main possible variations to get a higher level of
> >> confidence.
> >>
> >> Is there any analyser that supports this kind of need or should I make up a
> >> syntax and analyser of my own, i.e. the payload syntax?
> >>
> >> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
> >>
> >> Thanks,
> >> Manuel