The problem here is that you wind up with a zillion unique terms in your index, which may lead to performance issues, but you probably already know that :).
I've seen situations where running it through a dictionary helps. That is,
does each term in the OCR output match some dictionary? The problem is that
this then devalues terms that don't happen to be in the dictionary, names
for instance.

But to answer your question: No, there really isn't a pre-built analysis
chain that I know of that does this. The root issue is how to assign
"confidence". No clue for your specific domain.

So payloads seem quite reasonable here. As it happens, there's a recent
end-to-end example, see:
http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/

Best,
Erick

On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:

> I don't have first-hand knowledge of how you implement that, but I bet a
> look at the WordDelimiterFilter would help you understand how to emit
> multiple terms with the same positions pretty easily.
>
> I've heard of this "bag of word variants" approach to indexing poor-quality
> OCR output before for findability reasons and I heard it works out OK.
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
>
> On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand <
> manuel.lenorm...@gmail.com> wrote:
>
>> Hello,
>> Many of our indexed documents are scanned and OCR'ed documents.
>> Unfortunately we were not able to improve the OCR quality much (less than
>> 80% word accuracy) for various reasons, a fact which badly hurts the
>> retrieval quality.
>>
>> As we use an open-source OCR, we are thinking of expanding every scanned
>> term in the output to its main possible variations to get a higher level
>> of confidence.
>>
>> Is there any analyser that supports this kind of need or should I make up a
>> syntax and analyser of my own, i.e. the payload syntax?
>>
>> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
>>
>> Thanks,
>> Manuel
>>
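
For what it's worth, here is a rough Python sketch of the two ideas in this
thread: generating the "term|position" text from Manuel's example, and a
crude dictionary check as one possible source of "confidence". Everything in
it is an assumption for illustration only: the function names, the input
shape (one list of OCR candidates per word position) and the 1.0/0.5 weights
are made up, and it only covers producing the text. The analyzer that would
consume this syntax at index time is not shown.

```python
def to_position_syntax(variants_per_word, delimiter="|"):
    """Emit every OCR candidate followed by the 1-based position it belongs to.

    variants_per_word: one list of candidate spellings per word position
    (hypothetical input shape, not something an OCR engine is known to emit).
    """
    tokens = []
    for position, variants in enumerate(variants_per_word, start=1):
        for variant in variants:
            tokens.append(f"{variant}{delimiter}{position}")
    return " ".join(tokens)


def dictionary_confidence(term, dictionary, hit=1.0, miss=0.5):
    """Return a crude confidence: `hit` if the term is in the dictionary,
    `miss` otherwise. The weights are arbitrary placeholders. As noted in
    the thread, this devalues names and other out-of-dictionary terms."""
    return hit if term.lower() in dictionary else miss


if __name__ == "__main__":
    candidates = [["The", "Tlne"], ["quick", "quiok"], ["browm", "brown"], ["fox"]]
    print(to_position_syntax(candidates))

    words = {"the", "quick", "brown", "fox"}
    for variants in candidates:
        for variant in variants:
            print(variant, dictionary_confidence(variant, words))
```

If the confidence were to travel with each term as a payload, the number
after the delimiter could carry that weight instead of the position, but
then stacking the variants at the same position would have to be handled
some other way.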