Re: [CODE4LIB] "best" OCR package?

2009-02-05 Thread Lars Aronsson
On February 2, Walter Lewis wrote:

> The "good" news from the perspective of searching is that a 
> reasonable percentage of those errors will affect terms that are 
> either rarely used in searching or are repeated correctly in the 
> vicinity.

This is why OCR should be done by a search engine company (such as 
Google), which has statistics on what real people really search 
for, and can improve the OCR process as it goes.  Software 
developing companies such as ABBYY or Omnipage never get that kind 
of feedback from actual users.  They only represent a fraction of 
the entire feedback loop.  None of my experience of scanning old 
Swedish and Danish books with ABBYY FineReader ever got back to 
ABBYY; they never asked for any of that feedback.

I have no idea to what degree Google Book Search does this right, 
but by controlling the entire scan-search loop they have one 
less excuse for failing.
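
To make the feedback-loop idea concrete, here is a minimal sketch of how search statistics could nudge OCR output. It is purely illustrative: the function name, the `query_counts` table and the similarity cutoff are assumptions made for this example, not a description of what Google or any OCR vendor actually does.

```python
import difflib


def correct_token(token, query_counts, cutoff=0.8):
    """Replace an OCR token with a similar term that real users search for.

    query_counts is a hypothetical dict mapping search terms to how often
    they were queried. Tokens that already appear in it are left alone.
    """
    if token in query_counts:
        return token
    # Find known search terms that look similar to the OCR token.
    candidates = difflib.get_close_matches(
        token, list(query_counts), n=5, cutoff=cutoff
    )
    if not candidates:
        return token  # nothing similar enough; keep the OCR output
    # Among the similar terms, prefer the one searched for most often.
    return max(candidates, key=lambda term: query_counts[term])


if __name__ == "__main__":
    counts = {"stockholm": 120, "stockholms": 5, "copenhagen": 80}
    # "stockhoIm" has a capital I where the OCR misread a lowercase l.
    print(correct_token("stockhoIm", counts))
```

A desktop OCR product has no `query_counts` to consult, which is the gap described above.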


-- 
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se


Re: [CODE4LIB] "best" OCR package? [SEC=UNCLASSIFIED]

2009-02-04 Thread Dyer, Renata
Emmanuel,
I have used Microsoft Office Document Imaging, which works really well with TIFF 
files. Most scanners, if not all, will scan to TIFF, which you can then 
convert to text, RTF or Word files easily.
The other one I have used is Pro Millennium, which is compatible with MS Word, Excel, 
etc.
I would highly recommend both of them.

Renata



Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Walter Lewis

Gabriel Farrell wrote:
> On Tue, Feb 03, 2009 at 10:09:54AM -0500, Walter Lewis wrote:
>> If we had to correct it all: a) it would never get done and b) it would
>> be better than some of the originals which are rife with typographic
>> errors.
>
> Hence the genius of Distributed Proofreaders [1] and reCAPTCHA [2].
>
> [1] http://www.pgdp.net/c/
> [2] http://recaptcha.net/learnmore.html

I have tremendous respect for the genius behind these projects, but a 
Victorian four-page village newspaper has enough text for your 
average government report.  Put four together and you get a three-decker 
novel. The folks at Distributed Proofreaders rarely sign up for the 
labours of Hercules (and, according to my sources, he only hung in there 
for twelve tasks).


Then you have to deal with the fact that OCRing some of the microfilm 
I've seen is probably not statistically different from invoking a random 
token generator ...


Walter


Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Gabriel Farrell
On Tue, Feb 03, 2009 at 10:09:54AM -0500, Walter Lewis wrote:
> If we had to correct it all: a) it would never get done and b) it would  
> be better than some of the originals which are rife with typographic 
> errors.

Hence the genius of Distributed Proofreaders [1] and reCAPTCHA [2].

[1] http://www.pgdp.net/c/
[2] http://recaptcha.net/learnmore.html


Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Walter Lewis

Karen Coyle wrote:
> I know that 98% is impressive, but I always like to remember that with 
> an average of 2000 characters per page that means 40 potential errors 
> per book page. Just to give us some perspective on the level of 
> cleanup that will be needed for books being digitized today.

The "good" news from the perspective of searching is that a reasonable 
percentage of those errors will affect terms that are either rarely used 
in searching or are repeated correctly in the vicinity. 

The bad news:  phrase search is compromised. Screen readers for the 
visually impaired are compromised. Relevance that depends on term 
clustering is compromised.
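
A rough sketch of why phrase search suffers, assuming (and this is a simplification, not a measured result) that character errors are independent and that every non-space character is equally likely to be misread. The function name is illustrative only.

```python
def phrase_intact_probability(phrase, char_accuracy):
    """Chance that every non-space character of a phrase was read correctly.

    Assumes independent, uniformly distributed character errors, which real
    OCR output does not strictly follow; treat the result as a ballpark.
    """
    n_chars = len(phrase.replace(" ", ""))
    return char_accuracy ** n_chars


if __name__ == "__main__":
    p = phrase_intact_probability("three-decker novel", 0.98)
    print(f"{p:.2f}")
```

Under those assumptions, the longer the phrase, the lower the chance that an exact-match search finds it, even when single-word recall still looks acceptable.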


If we had to correct it all: a) it would never get done and b) it would 
be better than some of the originals which are rife with typographic errors.


Walter
 who still regrets the Swedish Chef OCR of most microfilm newspaper projects


Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Karen Coyle

Randy Stern wrote:
> Abbyy Finereader and Nuance Omnipage are the two leading commercial 
> OCR products. Both can achieve 98% + character accuracy on most 
> book-like material scanned at 300 dpi.


I know that 98% is impressive, but I always like to remember that with 
an average of 2000 characters per page that means 40 potential errors 
per book page. Just to give us some perspective on the level of cleanup 
that will be needed for books being digitized today.
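
The arithmetic behind that figure, as a small sketch (the function name is made up for illustration; the 2000 characters per page and 98% accuracy are the numbers quoted above):

```python
def expected_errors_per_page(chars_per_page, char_accuracy):
    """Expected number of misread characters on a single page."""
    return chars_per_page * (1.0 - char_accuracy)


if __name__ == "__main__":
    # 2000 characters at 98% character accuracy
    print(round(expected_errors_per_page(2000, 0.98)))
```

Swapping in 0.99 or 0.995 shows how much each extra fraction of a percent matters at book scale.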


kc

--
---
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234



Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Walter Lewis

Randy Stern wrote:
> Abbyy Finereader and Nuance Omnipage are the two leading commercial 
> OCR products. Both can achieve 98% + character accuracy on most 
> book-like material scanned at 300 dpi.

At 07:37 AM 2/3/2009 -0500, Nicole Engard wrote:
> I'm with Christian - I loved Abbyy FineReader when I used it at both
> my previous libraries.  It's very accurate and it's affordable if
> you're not using it for mass digitization :) but we never got the
> server contract because like Christian said - it is quite expensive.

Abbyy's engine is actually quite affordable for mass digitization 
efforts as well.  Indeed, if you look closely at the outputs from the 
Internet Archive you'll see they use it extensively.  The desktop model 
requires bodies to handle the inputs and outputs; the server version can 
be built into a workflow.  Once you get past the time to set it up, the 
cost per page is *very* low (from memory, ~1 to 2 cents per page).


Walter Lewis


Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Randy Stern
Abbyy Finereader and Nuance Omnipage are the two leading commercial OCR 
products. Both can achieve 98% + character accuracy on most book-like 
material scanned at 300 dpi.


- Randy Stern (who formerly worked in the OCR industry)

At 07:37 AM 2/3/2009 -0500, Nicole Engard wrote:

I'm with Christian - I loved Abbyy FineReader when I used it at both
my previous libraries.  It's very accurate and it's affordable if
you're not using it for mass digitization :) but we never got the
server contract because like Christian said - it is quite expensive.



Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Nicole Engard
I'm with Christian - I loved Abbyy FineReader when I used it at both
my previous libraries.  It's very accurate and it's affordable if
you're not using it for mass digitization :) but we never got the
server contract because like Christian said - it is quite expensive.

---

Nicole C. Engard
Open Source Evangelist, LibLime
(888) Koha ILS (564-2457) ext. 714
n...@liblime.com
AIM/Y!/Skype: nengard

http://liblime.com
http://blogs.liblime.com/open-sesame/



On Tue, Feb 3, 2009 at 6:23 AM, MJ Ray  wrote:
> Alberto Accomazzi  wrote:
>> [...] I know about OCRopus but I have a feeling that
>> commercial products still have a significant edge over public domain
>> packages. [...]
>
> OCRopus is released under the Apache License 2.0, which allows
> commercial development.  It is not a public domain package.
> Feel free to use it as a commercial product without fear.
>
> Hope that helps,
> --
> MJ Ray (slef)
> Webmaster for hire, statistician and online shop builder for a small
> worker cooperative http://www.ttllp.co.uk/ http://mjr.towers.org.uk/
> (Notice http://mjr.towers.org.uk/email.html) tel:+44-844-4437-237
>


Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread MJ Ray
Alberto Accomazzi  wrote:
> [...] I know about OCRopus but I have a feeling that 
> commercial products still have a significant edge over public domain 
> packages. [...]

OCRopus is released under the Apache License 2.0, which allows
commercial development.  It is not a public domain package.
Feel free to use it as a commercial product without fear.

Hope that helps,
-- 
MJ Ray (slef)
Webmaster for hire, statistician and online shop builder for a small
worker cooperative http://www.ttllp.co.uk/ http://mjr.towers.org.uk/
(Notice http://mjr.towers.org.uk/email.html) tel:+44-844-4437-237


Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Christian Mahnke
Hello,

2009/2/3 Alberto Accomazzi 

> Sorry if this is a bit off-topic, but I was wondering if any of you clever
> fellows have a recommendation for an OCR package, possibly with a native
> linux port.  I know about OCRopus but I have a feeling that commercial
> products still have a significant edge over public domain packages.  So what
> are you using and/or do you know what the big guys (google, IA, microsoft)
> are using?
>

We are using the Abbyy Finereader Engine [1], which also has a Linux port
available. But it's quite expensive for mass digitization efforts, since
it's licensed on a per-page basis.

Best,
Christian

[1] http://www.abbyy.com/sdk/


Re: [CODE4LIB] "best" OCR package?

2009-02-03 Thread Emmanuel Di Pretoro
Hi,

This isn't a recommendation, since I have never tried it, but I've heard a lot
of good things about Tesseract. It is currently developed by Google, but I
don't know whether they use it.

Some links:
 - http://code.google.com/p/tesseract-ocr/
 - http://en.wikipedia.org/wiki/Tesseract_%28software%29
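
For anyone who wants to try it, a minimal sketch of driving the command-line tool from Python. It assumes a `tesseract` binary is installed and on the PATH, uses the basic `tesseract <image> <outputbase> [-l <lang>]` invocation, and the helper names and file names are made up for this example.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for a basic tesseract run."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]  # language code, e.g. "eng"
    return cmd


def run_ocr(image_path, output_base, lang=None):
    """Run tesseract and return the path of the text file it writes.

    Raises subprocess.CalledProcessError if tesseract exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    cmd = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(cmd, check=True)
    return output_base + ".txt"  # tesseract appends .txt to the output base


if __name__ == "__main__":
    # "page.tif" is a placeholder; point this at a real scanned image.
    print(build_tesseract_command("page.tif", "page", lang="eng"))
```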

Hope this helps,

Emmanuel Di Pretoro

2009/2/3 Alberto Accomazzi 

> Sorry if this is a bit off-topic, but I was wondering if any of you clever
> fellows have a recommendation for an OCR package, possibly with a native
> linux port.  I know about OCRopus but I have a feeling that commercial
> products still have a significant edge over public domain packages.  So what
> are you using and/or do you know what the big guys (google, IA, microsoft)
> are using?
>
> Thanks,
> -- Alberto
>
>
> --
> Dr. Alberto Accomazzi  aaccomazzi(at)cfa harvard edu
> Project Manager
> NASA Astrophysics Data System  ads.harvard.edu
> Harvard-Smithsonian Center for Astrophysics  www.cfa.harvard.edu
> 60 Garden St, MS 67, Cambridge, MA 02138, USA
>


[CODE4LIB] "best" OCR package?

2009-02-02 Thread Alberto Accomazzi
Sorry if this is a bit off-topic, but I was wondering if any of you 
clever fellows have a recommendation for an OCR package, possibly with a 
native linux port.  I know about OCRopus but I have a feeling that 
commercial products still have a significant edge over public domain 
packages.  So what are you using and/or do you know what the big guys 
(google, IA, microsoft) are using?


Thanks,
-- Alberto


--
Dr. Alberto Accomazzi  aaccomazzi(at)cfa harvard edu
Project Manager
NASA Astrophysics Data System  ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics  www.cfa.harvard.edu
60 Garden St, MS 67, Cambridge, MA 02138, USA