Thanks for your answers Erick and Michael.

The confidence level is an OCR output metric that gives, for every word,
the probability that it is the actual scanned term. Instead of outputting a
single word, which is the default behaviour, I would like the OCR program to
output all the "suspected words" whose confidences together sum to above
~90% for the actual term.
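
To make the ~90% cut-off concrete, here is a minimal sketch in Python. It
assumes the OCR engine can return a list of (word, confidence) candidates
for each scanned term; the function name select_candidates and the sample
confidences are made up for illustration and do not come from any OCR tool.

```python
def select_candidates(candidates, threshold=0.9):
    """Keep the most likely OCR candidates for one scanned term until
    their confidences sum to at least `threshold`.

    `candidates` is a list of (word, confidence) pairs. This input shape
    is an assumption; real OCR engines expose alternatives differently.
    """
    selected = []
    cumulative = 0.0
    # Walk the candidates from most to least confident.
    for word, confidence in sorted(candidates, key=lambda c: c[1], reverse=True):
        selected.append(word)
        cumulative += confidence
        if cumulative >= threshold:
            break
    # If the threshold is never reached, every candidate is returned.
    return selected


# Hypothetical confidences for one scanned term.
variants = select_candidates([("browm", 0.32), ("brown", 0.60), ("brovvn", 0.08)])
print(variants)  # ['brown', 'browm']
```

A term the OCR is already sure about (a single candidate above the
threshold) comes out as a single word, so the index only grows for the
doubtful terms.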

I'm happy to hear this approach has been used before. I will implement an
analyser that indexes these terms at the same position to enable positional
queries.
I hope it works out well. If it does, I will open a Jira ticket for it.
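
In Lucene, tokens share a position when every variant after the first is
emitted with a position increment of 0. A rough sketch of that mapping, in
plain Python rather than as an actual Lucene TokenFilter, assuming the
term|position syntax from my first mail; to_position_increments is a
hypothetical helper name.

```python
def to_position_increments(text, delimiter="|"):
    """Turn 'term|position' chunks into (term, position_increment) pairs.

    The first variant at a position gets the gap to the previous position
    as its increment; further variants at the same position get 0, which
    stacks them on top of the first one.
    """
    tokens = []
    previous_position = 0
    for chunk in text.split():
        term, _, position_text = chunk.rpartition(delimiter)
        position = int(position_text)
        increment = position - previous_position
        if increment < 0:
            raise ValueError("positions must not decrease: " + chunk)
        tokens.append((term, increment))
        previous_position = position
    return tokens


line = "The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4"
for term, increment in to_position_increments(line):
    print(term, increment)
```

With variants stacked like this, a phrase query for "quick brown" should
match whichever variant the OCR got right, since both "browm" and "brown"
sit one position after "quick".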

If anyone else has had experience with this use case, I'd love to hear about it.

Manuel


On Wed, Jul 2, 2014 at 7:28 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Problem here is that you wind up with a zillion unique terms in your
> index, which may lead to performance issues, but you probably already
> know that :).
>
> I've seen situations where running it through a dictionary helps. That
> is, does each term in the OCR match some dictionary? Problem here is
> that it then de-values terms that don't happen to be in the
> dictionary, names for instance.
>
> But to answer your question: No, there really isn't a pre-built
> analysis chain that I know of that does this. Root issue is how to
> assign "confidence"? No clue for your specific domain.
>
> So payloads seem quite reasonable here. Happens there's a recent
> end-to-end example, see:
> http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/
>
> Best,
> Erick
>
> On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta
> <michael.della.bi...@appinions.com> wrote:
> > I don't have first hand knowledge of how you implement that, but I bet a
> > look at the WordDelimiterFilter would help you understand how to emit
> > multiple terms with the same positions pretty easily.
> >
> > I've heard of this "bag of word variants" approach to indexing
> > poor-quality OCR output before for findability reasons and I heard it
> > works out OK.
> >
> > Michael Della Bitta
> >
> >
> > On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand <
> > manuel.lenorm...@gmail.com> wrote:
> >
> >> Hello,
> >> Many of our indexed documents are scanned and OCR'ed documents.
> >> Unfortunately, for various reasons, we were not able to improve the OCR
> >> quality much (less than 80% word accuracy), which badly hurts the
> >> retrieval quality.
> >>
> >> As we use an open-source OCR, we are thinking of changing every scanned
> >> term output to its main possible variations to get a higher level of
> >> confidence.
> >>
> >> Is there any analyser that supports this kind of need, or should I make
> >> up a syntax and analyser of my own, i.e. the payload syntax?
> >>
> >> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
> >>
> >> Thanks,
> >> Manuel
> >>
>
