subject:"Re\: Lucene to index OCR text"

Re: Lucene to index OCR text

2008-01-29 Thread mark harwood

org Sent: Tuesday, 29 January, 2008 8:00:56 AM Subject: Re: Lucene to index OCR text On Tuesday 29 January 2008 03:32:08 Daniel Noll wrote: > On Friday 25 January 2008 19:26:44 Paul Elschot wrote: > > There is no way to do exact phrase matching on OCR data, be

Re: Lucene to index OCR text

2008-01-29 Thread Paul Elschot

On Tuesday 29 January 2008 03:32:08 Daniel Noll wrote: > On Friday 25 January 2008 19:26:44 Paul Elschot wrote: > > There is no way to do exact phrase matching on OCR data, because no > > correction of OCR data will be perfect. Otherwise the OCR would have made > > the correction... > > > > The

Re: Lucene to index OCR text

2008-01-28 Thread Daniel Noll

On Friday 25 January 2008 19:26:44 Paul Elschot wrote: > There is no way to do exact phrase matching on OCR data, because no > correction of OCR data will be perfect. Otherwise the OCR would have made > the correction... > The problem I see with a fuzzy query is that if you have the fuzziness set
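
The concern about the fuzziness setting can be illustrated with a minimal sketch of edit-distance matching over OCR-damaged terms. This is not Lucene's API: the class `OcrFuzzyMatch`, its methods, and the `maxEdits` threshold are hypothetical names chosen for illustration, and the sample terms are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: plain edit-distance matching
// used only to show how a fuzziness threshold behaves on OCR-style errors.
class OcrFuzzyMatch {

    // Classic Levenshtein distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns every indexed term within maxEdits edits of the query term.
    static List<String> matches(String query, List<String> terms, int maxEdits) {
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            if (editDistance(query, term) <= maxEdits) {
                result.add(term);
            }
        }
        return result;
    }
}
```

With the made-up terms `search`, `scarch`, `starch`, and `church`, a query for `search` at `maxEdits = 1` picks up the OCR misread `scarch` but also the unrelated real word `starch`, while `maxEdits = 0` misses the misread entirely. A single threshold cannot separate the two cases, which is the kind of trade-off a fuzziness setting forces.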

RE: Lucene to index OCR text

2008-01-25 Thread Renaud Waldura

PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, January 25, 2008 7:31 AM To: java-user@lucene.apache.org Subject: Re: Lucene to index OCR text Thanks everyone for their ideas and suggestions! Some had occurred to us but were discarded because we feel our solution needs to be automated -- 45 million

Re: Lucene to index OCR text

2008-01-25 Thread waldura

Thanks everyone for their ideas and suggestions! Some had occurred to us but were discarded because we feel our solution needs to be automated -- 45 million pages are a lot of thrust on any human-driven effort. I like Itamar's idea of doing "competing" OCR, and keeping the best result. Unfortunate
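
The "competing" OCR idea could be automated roughly as follows. This is only a sketch under a stated assumption: the thread does not say how the best result would be judged, so the dictionary-hit ratio used here is a stand-in heuristic, and the class `OcrCandidatePicker` and its methods are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: pick the OCR output with the most dictionary words.
// The scoring heuristic is an assumption, not something the thread specifies.
class OcrCandidatePicker {

    // Fraction of tokens in the text that appear in the dictionary.
    static double dictionaryHitRatio(String text, Set<String> dictionary) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (dictionary.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Returns the highest-scoring candidate, or null if there are none.
    // On a tie, the earlier candidate wins.
    static String pickBest(List<String> candidates, Set<String> dictionary) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = dictionaryHitRatio(candidate, dictionary);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```

Each candidate would be the output of a different OCR engine for the same page. The dictionary is assumed to contain lowercase words, and the tokenizer only handles ASCII letters and digits, so a real collection would need language-appropriate tokenization.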

Re: Lucene to index OCR text

2008-01-25 Thread Erick Erickson

That is brilliant! On Jan 25, 2008 6:12 AM, mark harwood <[EMAIL PROTECTED]> wrote: > Probably not a practical solution for you to set up but I love this idea: > http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html > > - Original Message > From: Renaud Waldura <[EMAIL PROTECTE

Re: Lucene to index OCR text

2008-01-25 Thread mark harwood

Probably not a practical solution for you to set up but I love this idea: http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html - Original Message From: Renaud Waldura <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 25 January, 2008 1:43:06 AM Subject: Lucene

RE: Lucene to index OCR text

2008-01-25 Thread Itamar Syn-Hershko

re too big? Itamar. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Friday, January 25, 2008 10:27 AM To: java-user@lucene.apache.org Subject: Re: Lucene to index OCR text On Friday 25 January 2008 03:46:23 Kyle Maxwell wrote: > > I've been poking around

Re: Lucene to index OCR text

2008-01-25 Thread Paul Elschot

On Friday 25 January 2008 03:46:23 Kyle Maxwell wrote: > > I've been poking around the list archives and didn't really come up against > > anything interesting. Anyone using Lucene to index OCR text? Any > > strategies/algorithms/packages you recommend? > > > > I have a large collection (10^7 doc

Re: Lucene to index OCR text

2008-01-24 Thread Kyle Maxwell

> I've been poking around the list archives and didn't really come up against > anything interesting. Anyone using Lucene to index OCR text? Any > strategies/algorithms/packages you recommend? > > I have a large collection (10^7 docs) that's mostly the result of OCR. We > index/search/etc. with Luc

Re: Lucene to index OCR text

2008-01-24 Thread Erick Erickson

Lots of luck to you, because I haven't a clue. My company deals with OCR data and we haven't had a single workable idea. Of course, our data sets are minuscule compared to what you're talking about, so we haven't tried to heuristically clean up the data. But given that Google is scanning the entir

11 matches
