Re: OCR - Saving multi-term position

Manuel Le Normand Wed, 02 Jul 2014 09:58:37 -0700

Thanks for your answers Erick and Michael.

The term confidence level is an OCR output metric that gives, for every
word, the probability that it is the actual scanned term. Instead of the
default behaviour of outputting a single word, I would like the OCR program
to output all the "suspected words" whose confidences sum to above ~90%.
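
To make the selection rule concrete, here is a minimal Python sketch of what
I have in mind. The function name, the (word, confidence) input format and
the sample confidence values are all made up for illustration; they are not
the output format of any particular OCR engine.

```python
def select_candidates(candidates, threshold=0.9):
    """Pick the most likely words until their confidences sum to >= threshold.

    `candidates` is a list of (word, confidence) pairs for ONE scanned term.
    This input shape is an assumption, not a real OCR engine's API.
    """
    # Consider the most likely readings first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = []
    total = 0.0
    for word, confidence in ranked:
        selected.append(word)
        total += confidence
        if total >= threshold:
            break
    # If the threshold is never reached, every candidate is returned.
    return selected


# Hypothetical confidences for one scanned word.
variants = select_candidates([("quiok", 0.25), ("quick", 0.70), ("qulck", 0.05)])
```

A word the OCR is already sure about (a single candidate above the
threshold) stays a single term, so only the doubtful words add extra terms
to the index.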


I'm happy to hear this approach has been used before. I will implement an
analyser that indexes these terms at the same position to enable positional
queries.
I hope it works out well. If it does, I will open a Jira ticket for it.
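
As a rough illustration of the idea (plain Python, not Lucene/Solr code),
the sketch below parses the "term|position" syntax from my first mail,
gives every extra variant a position increment of 0, and checks that a
phrase query still matches across variants. All function names are
hypothetical, lowercasing is my own assumption, and the parser assumes the
positions in the input never decrease.

```python
def parse_variants(text):
    """Turn 'The|1 Tlne|1 quick|2' into (term, position_increment) pairs."""
    tokens = []
    last_pos = 0
    for raw in text.split():
        term, _, pos = raw.rpartition("|")
        pos = int(pos)
        # The first term at a position advances by 1; further variants at
        # the same position get an increment of 0.
        tokens.append((term.lower(), pos - last_pos))
        last_pos = pos
    return tokens


def build_index(tokens):
    """Build a toy positional index: term -> set of positions."""
    index = {}
    position = 0
    for term, increment in tokens:
        position += increment
        index.setdefault(term, set()).add(position)
    return index


def phrase_match(index, phrase):
    """Return True if the words of `phrase` occur at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return False
    for start in index.get(words[0], ()):
        if all(start + i in index.get(w, ()) for i, w in enumerate(words)):
            return True
    return False


tokens = parse_variants("The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4")
index = build_index(tokens)
```

With this layout, phrase_match(index, "the quick brown fox") succeeds even
though the OCR also emitted "tlne", "quiok" and "browm", because the
variants share positions instead of pushing later words further away.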

If anyone else has experience with this use case, I'd love to hear about it.

Manuel


On Wed, Jul 2, 2014 at 7:28 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Problem here is that you wind up with a zillion unique terms in your
> index, which may lead to performance issues, but you probably already
> know that :).
>
> I've seen situations where running it through a dictionary helps. That
> is, does each term in the OCR match some dictionary? Problem here is
> that it then de-values terms that don't happen to be in the
> dictionary, names for instance.
>
> But to answer your question: No, there really isn't a pre-built
> analysis chain that I know of that does this. Root issue is how to
> assign "confidence"? No clue for your specific domain.
>
> So payloads seem quite reasonable here. Happens there's a recent
> end-to-end example, see:
> http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/
>
> Best,
> Erick
>
> On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta
> <michael.della.bi...@appinions.com> wrote:
> > I don't have first hand knowledge of how you implement that, but I bet a
> > look at the WordDelimiterFilter would help you understand how to emit
> > multiple terms with the same positions pretty easily.
> >
> > I've heard of this "bag of word variants" approach to indexing poor-quality
> > OCR output before for findability reasons and I heard it works out OK.
> >
> > Michael Della Bitta
> >
> > Applications Developer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
> > <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> > w: appinions.com <http://www.appinions.com/>
> >
> >
> > On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand <
> > manuel.lenorm...@gmail.com> wrote:
> >
> >> Hello,
> >> Many of our indexed documents are scanned and OCR'ed documents.
> >> Unfortunately, for various reasons we were not able to improve the OCR
> >> quality much (less than 80% word accuracy), which badly hurts
> >> retrieval quality.
> >>
> >> As we use an open-source OCR, we are thinking of expanding every scanned
> >> term in its output to its most likely variations to get a higher level of
> >> confidence.
> >>
> >> Is there any analyser that supports this kind of need, or should I make up
> >> a syntax and analyser of my own, i.e. the payload syntax?
> >>
> >> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
> >>
> >> Thanks,
> >> Manuel
> >>
>
