Sent: Tuesday, 29 January, 2008 8:00:56 AM
Subject: Re: Lucene to index OCR text
On Tuesday 29 January 2008 03:32:08, Daniel Noll wrote:
> On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> > There is no way to do exact phrase matching on OCR data, because no
> > correction of OCR data will be perfect. Otherwise the OCR would have made
> > the correction...
> >
>
> The
On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> There is no way to do exact phrase matching on OCR data, because no
> correction of OCR data will be perfect. Otherwise the OCR would have made
> the correction...
>
The problem I see with a fuzzy query is that if you have the fuzziness set
Sent: Friday, January 25, 2008 7:31 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene to index OCR text
Thanks everyone for their ideas and suggestions! Some had occurred to us
but were discarded because we feel our solution needs to be automated --
45 million pages is a lot to thrust on any human-driven effort.
I like Itamar's idea of doing "competing" OCR, and keeping the best
result. Unfortunate
That is brilliant!
On Jan 25, 2008 6:12 AM, mark harwood <[EMAIL PROTECTED]> wrote:
> Probably not a practical solution for you to set up but I love this idea:
> http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html
>
> ----- Original Message -----
> From: Renaud Waldura <[EMAIL PROTECTE
Probably not a practical solution for you to set up but I love this idea:
http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html
----- Original Message -----
From: Renaud Waldura <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 25 January, 2008 1:43:06 AM
Subject: Lucene
re
too big?
Itamar.
-----Original Message-----
From: Paul Elschot [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 25, 2008 10:27 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene to index OCR text
On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote:
> > I've been poking around the list archives and didn't really come up against
> > anything interesting. Anyone using Lucene to index OCR text? Any
> > strategies/algorithms/packages you recommend?
> >
> > I have a large collection (10^7 doc
> I've been poking around the list archives and didn't really come up against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OCR. We
> index/search/etc. with Luc
Lots of luck to you, because I haven't a clue. My company deals with
OCR data and we haven't had a single workable idea. Of course, our
data sets are minuscule compared to what you're talking about, so we
haven't tried to heuristically clean up the data.
But given that Google is scanning the entir