OCR Form tools

2011-12-08 Thread Adam Vande More
I have thousands of forms equivalent to invoices that I'd like to put into
a database.  I'm thinking I would like to have some OCR app/tool scan these
forms, and then generate a CSV with each field.  Does anyone have
recommendations on software for this?

-- 
Adam Vande More
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: OCR Form tools

2011-12-08 Thread Ryan Coleman
I don't know about the second half, you might have to write something up in 
Perl for that but I am using tesseract (graphics/tesseract) to do some old 
document converting for my father… works pretty slick.
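
For the "second half" (turning OCR text into CSV), a minimal sketch in Python rather than Perl might look like the following. The field labels and patterns (Invoice No, Date, Total) are hypothetical; real forms would need their own patterns, and OCR noise will break some matches, so missing fields are left empty rather than guessed.

```python
import csv
import io
import re

# Hypothetical field labels; real forms will need their own patterns.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(ocr_text):
    """Return one dict per form; fields the OCR missed are left empty."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        row[name] = match.group(1) if match else ""
    return row


def rows_to_csv(rows):
    """Serialize extracted rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(FIELD_PATTERNS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Each form's OCR output would go through extract_fields() once, and the collected rows through rows_to_csv() at the end; rows with empty fields are the ones worth checking by hand.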



Re: OCR...

2009-01-29 Thread Andrew Gould
On Wed, Jan 28, 2009 at 8:23 PM, Gary Kline kl...@thought.org wrote:

 well, i just got an email from a david hazard who said to look on
 their website; i replied that i had and couldn't find their test
 suite.  if/when this guy replies, i'll share.

gary


Start here:

http://www.abbyy.com/sdk/?param=59956

I will try again, as well.

Andrew


Re: OCR...

2009-01-29 Thread Reko Turja


Gary Kline wrote:

 well, damage is probably done.  how can i check the resolution?
 i tried to increase it by creating huge ppm and tif files, but
 then that's really absurd since there can only be just so much
 data per image.  i _could_ try xv and jpeg and smoothing image

Yeah, if the image resolution is already at 72 DPI, there's sadly no
trick in the world that can reliably recover the lost information.
I've read some horrid low-resolution scans in FineReader, and it
can grab much of the information nicely. With low resolution, be
prepared to correct problem spots manually, though. The only reliable
way to guesstimate the resolution is the font size at 100% zoom on
the screen. If the 10pt text is about 10 pixels high, the information
has probably been stored at 72 DPI to save space.
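
The rule of thumb above follows from one point being 1/72 inch, so text of a given point size is roughly (points x DPI / 72) pixels tall. A small sketch of the arithmetic, assuming the nominal font size is known (10pt in this thread) and the text height is measured at 100% zoom; the function name is made up for illustration, and the result is only a rough estimate because point size is not exactly glyph height:

```python
def estimate_dpi(text_height_px, font_size_pt):
    """Rough scan resolution from the height of text measured at 100% zoom.

    One point is 1/72 inch, so height_px ~= font_size_pt * dpi / 72.
    """
    if font_size_pt <= 0:
        raise ValueError("font size must be positive")
    return 72.0 * text_height_px / font_size_pt
```

By the same arithmetic, 10pt text scanned at 300 DPI would come out at roughly 42 pixels tall.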


I wasn't aware of the FreeBSD/Linux backend, but if that works it'd be
great. I haven't visited their website in ages, as the version I
have does the job I got it for.


-Reko 




Re: OCR...

2009-01-29 Thread Andrew Gould
On Thu, Jan 29, 2009 at 7:11 AM, Reko Turja reko.tu...@liukuma.net wrote:


 Wasn't aware of the FreeBSD/Linux backend, but if that works it'd be great
 - haven't myself visited their website in ages as the version I have does
 the job I got it for.

 -Reko


I may have used the wrong term.  I think it used to be an SDK, but now
there's a product with extended platform support for Linux and FreeBSD.

Andrew


Re: OCR...

2009-01-28 Thread Michel Talon

Gary Kline wrote:

 well, i'm ashamed to admit that i've put at least a dozen hours in
 trying, then re-re-retrying to OCR a imaged pdf file with as many
 open source ocr packages as i can find.

I have seen good results with tesseract which is in the ports and free.
Otherwise with OmniPage for commercial software (it runs under wine).


-- 

Michel TALON



Re: OCR...

2009-01-28 Thread Reko Turja

Gary Kline wrote:

 so what is the best commercial/shareware that can read a 10pt-font
 file?  (( also, when i have time to get back into actually hacking,
 this [[turning imaged pdf into OCR'able ascii or 8859-1]] is going to
 be a first target.  any idea which team i should go with?  gOCR looks
 best so far to me. ))


ABBYY FineReader. OmniPage hasn't been able to catch up with it in several
years, either feature- or quality-wise. No idea if FineReader runs under an
emulator, though.  If the file is already a PDF at 72 DPI with text as
graphics, most of the damage has already been done, and it will be
extremely hard to OCR.


-Reko 




Re: OCR...

2009-01-28 Thread Gary Kline
On Wed, Jan 28, 2009 at 12:08:55PM +0200, Reko Turja wrote:
 AABBYY Finereader - Omnipage haven't been able to catch it in several 
 years either feature or qualitywise. No idea if Finereader runs under 
 emulator though.  If the file is already a PDF and 72 DPI with text as 
 graphics most of the damage has already been done, and it will be 
 extremely hard to OCR.
 

well, damage is probably done.  how can i check the resolution?
i tried to increase it by creating huge ppm and tif files, but
then that's really absurd since there can only be just so much
data per image.  i _could_ try xv and jpeg and smoothing image to
refine, but too much hassle.  

(i used gocr -m 130 and saw the glyphs it (presumably) saw.
seemed pretty much okay to my eyes.  but then i'm not a computer
program.  [MAYBE :)]

gary



 -Reko 
 

-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org   http://transfinite.thought.org
The 2.23a release of Jottings: http://jottings.thought.org/index.php



Re: OCR...

2009-01-28 Thread Andrew Gould

At one point, the ABBYY folks were offering a back-end that ran on
FreeBSD.  I tried to get the free download, but it never happened.  (They
misplaced my signed, faxed license agreement, and I finally got tired of the
back-and-forth prerequisite communication.)

ABBYY also no longer supports Mac OS X.  I use an old version and like it a
lot.

Andrew


Re: OCR...

2009-01-28 Thread Gary Kline
On Wed, Jan 28, 2009 at 01:32:57PM -0600, Andrew Gould wrote:
 At one point in time, the Abby folks were offering a back-end that ran on
 FreeBSD.  I tried to get the free download; but it never happened.  (They
 misplaced my signed, faxed license agreement and I finally got tired of the
 back-and-forth prerequisite communication.)
 
 Abby also no longer supports Mac OS X.  I use an old version and like it a
 lot.
 


OK, now i know what to expect.  I found their site and signed up
to get the linux version; trial.  not likely to go any
further

gary


 Andrew



Re: OCR...

2009-01-28 Thread Andrew Gould
On Wed, Jan 28, 2009 at 5:09 PM, Gary Kline kl...@thought.org wrote:

 OK, now i know what to expect.  I found their site and signed up
 to get the linux version; trial.  not likely to go any
 further

 gary

I'm rooting for you!  :-)


Re: OCR...

2009-01-28 Thread Gary Kline
On Wed, Jan 28, 2009 at 07:33:41PM -0600, Andrew Gould wrote:
 I'm rooting for you!  :-)


well, i just got an email from a david hazard who said to look on
their website; i replied that i had and couldn't find their test
suite.  if/when this guy replies, i'll share.

gary





OCR...

2009-01-27 Thread Gary Kline

guys,

well, i'm ashamed to admit that i've put at least a dozen hours in
trying, then re-re-retrying to OCR an imaged pdf file with as many
open source ocr packages as i can find.  before i quit for supper
tonight, i finally threw in the towel.  realized that i would have
been THROUGH with all 181 pages of the text on Aristotle if i had
just read the bloody thing.  but anyway, i'm done.  there simply is
no freeware OCR that works on a 'nix computer//real computer.

so what is the best commercial/shareware that can read a 10pt-font
file?  (( also, when i have time to get back into actually hacking,
this [[turning imaged pdf into OCR'able ascii or 8859-1]] is going to
be a first target.  any idea which team i should go with?  gOCR looks
best so far to me. ))

gary





Re: any way to turn a pdf file into something OCR-able?

2008-12-02 Thread Roland Smith
On Mon, Dec 01, 2008 at 08:23:09PM -0500, Robert Huff wrote:
 
 Roland Smith writes:
 
 pdftotext fail on the large [32MB] file I've got.  Is there any
 other way I can translate this huge textfile to ascii or html or
 text?
   
 
   Please define fail in this context? I've used pdftotxt on
   documents exceeding 40MB. However there are of course things that
   don't work;
   
   1) Some PDFs are just wrappers around JPEG images. In this case
   there is no text for pdftotext to convert = epic fail.
 
   In this case convert from the ImageMagick port will get you a
 series of .jpg/.gif/.whatever.  Read the manual carefully before
 attempting; also note this can be a slow process.

Which still doesn't give plain text. But in this case one would need an
OCR app.

There is a new one available in ports called cuneiform. It is supposed
to be quite good, but I haven't had the need to try it yet. 

I've tried gocr and tesseract in the past but was not really impressed
with them. For short documents it's easier to do the OCR with the Mk I
eyeball & brain. :-) You'll have to completely check an OCR-ed document
for errors anyway.

Roland
-- 
R.F.Smith   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)




Re: any way to turn a pdf file into something OCR-able?

2008-12-02 Thread Gary Kline
On Tue, Dec 02, 2008 at 02:07:30AM +0100, Roland Smith wrote:
 1) Some PDFs are just wrappers around JPEG images. In this case there is
 no text for pdftotext to convert = epic fail.
 


It probably was a pdf wrapped around a jpeg.   I was able to convert
another pdf to plaintext in a flash.   (*sigh*)  it wasn't a total
waste of time because I found the entire text transferred to buggy
ASCII somewhere [[ thanks to some prof ]].  So, if I ever want to run
aspell against a 900-page file, at least I have that option!

gary


 


Re: any way to turn a pdf file into something OCR-able?

2008-12-02 Thread Gary Kline
On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
 Gary Kline [EMAIL PROTECTED] writes:
 
  pdftotext fail on the large [32MB] file I've got.  Is there any other
  way I can translate this huge textfile to ascii or html or text?
 
 I wrote some code using Python PDF library 'pypdf' to split a multipage
 PDF scan into individual pages, then used the tesseract OCR to convert
 to text.  Not 100% of course, and it really got confused by pages that
 were not right-side-up, but not a bad start for pages that are really
 scans -- images -- rather than PDF representation of text. 
 
 Sadly, I haven't gotten it into a suitable state to release. 
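
Since that code was never released, here is only a rough sketch of the OCR half of such a pipeline, not Chris's actual program. It assumes the PDF has already been split and rasterized into one image file per page (the pypdf splitting step is not shown), that the tesseract binary is on the PATH, and it relies only on the basic "tesseract IMAGE OUTPUTBASE" invocation, which writes OUTPUTBASE.txt. The page-NNNN naming is made up for illustration. It makes no attempt to detect pages that are not right-side-up.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base):
    """Build the basic invocation; tesseract writes '<output_base>.txt'."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pages(image_paths, work_dir):
    """OCR page images in order and return the joined text.

    Returns None when tesseract is not installed. Pages are separated
    by form feeds so page boundaries stay visible in the result.
    """
    if shutil.which("tesseract") is None:
        return None
    work_dir = Path(work_dir)
    chunks = []
    for number, image in enumerate(image_paths, start=1):
        base = work_dir / f"page-{number:04d}"  # hypothetical naming scheme
        subprocess.run(tesseract_command(image, base), check=True)
        text_file = base.with_suffix(".txt")
        chunks.append(text_file.read_text(encoding="utf-8", errors="replace"))
    return "\f".join(chunks)
```

Keeping one text file per page makes it easier to re-run only the pages that came out badly.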


Well, sounds hopeful for when I scan around 200 pages of pre-1923
journal articles.  These are in columnar form, IIRC.

--Be WONDERFUL if there were some kind of hardware to translate old
books and journals automagically.  ... .

gary





any way to turn a pdf file into something OCR-able?

2008-12-01 Thread Gary Kline

Guys,

pdftotext fails on the large [32MB] file I've got.  Is there any other
way I can translate this huge textfile to ascii or html or text?

thanks,

gary




Re: any way to turn a pdf file into something OCR-able?

2008-12-01 Thread Roland Smith
On Mon, Dec 01, 2008 at 03:14:43PM -0800, Gary Kline wrote:
   pdftotext fail on the large [32MB] file I've got.  Is there any
   other way I can translate this huge textfile to ascii or html or
   text?

Please define "fail" in this context. I've used pdftotext on documents
exceeding 40MB. However, there are of course things that don't work:

1) Some PDFs are just wrappers around JPEG images. In this case there is
no text for pdftotext to convert = epic fail.

2) If the text contains ligatures etc. you should use the proper
encoding that contains such characters (e.g. '-enc UTF-8') or you will
lose them.

3) Things like equations will not render well, if at all. This also
depends on the encoding.
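
Points 1 and 2 can be checked mechanically. The sketch below assumes pdftotext is on the PATH and uses only the '-enc UTF-8' option mentioned above plus '-' as the output name to get the text on stdout. The 20-character cutoff for "this PDF probably contains only images" is an arbitrary assumption, and the function names are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, encoding="UTF-8"):
    """Build the pdftotext invocation; '-' sends the text to stdout."""
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def looks_image_only(extracted_text, min_chars=20):
    """Heuristic: almost no non-whitespace text suggests scanned pages.

    min_chars is an arbitrary cutoff, not something pdftotext defines.
    """
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def extract_text(pdf_path):
    """Return the PDF's text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

If looks_image_only() is true for the extracted text, the file is likely case 1 and needs an OCR tool rather than pdftotext.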

Roland


Re: any way to turn a pdf file into something OCR-able?

2008-12-01 Thread Robert Huff

Roland Smith writes:

  pdftotext fail on the large [32MB] file I've got.  Is there any
  other way I can translate this huge textfile to ascii or html or
  text?
  

  Please define fail in this context? I've used pdftotxt on
  documents exceeding 40MB. However there are of course things that
  don't work;
  
  1) Some PDFs are just wrappers around JPEG images. In this case
  there is no text for pdftotext to convert = epic fail.

In this case convert from the ImageMagick port will get you a
series of .jpg/.gif/.whatever.  Read the manual carefully before
attempting; also note this can be a slow process.


Robert Huff




Re: any way to turn a pdf file into something OCR-able?

2008-12-01 Thread Olivier Nicole
   1) Some PDFs are just wrappers around JPEG images. In this case
   there is no text for pdftotext to convert = epic fail.
 
   In this case convert from the ImageMagick port will get you a
 series of .jpg/.gif/.whatever.  Read the manual carefully before
 attempting; also note this can be a slow process.

pdfimages (from ports graphics/xpdf) can also do that, maybe at a
lesser cost.
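
A minimal sketch of driving pdfimages from a script, assuming the binary is on the PATH. It uses the usual "pdfimages -j PDF-file image-root" form, where -j writes embedded JPEGs out as .jpg files and the outputs are named image-root-NNN.ext. The function names and the "page" prefix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root):
    """'-j' keeps embedded JPEGs as .jpg instead of converting them."""
    return ["pdfimages", "-j", str(pdf_path), str(image_root)]


def extract_images(pdf_path, out_dir, prefix="page"):
    """Extract embedded images; return sorted paths, or None without pdfimages."""
    if shutil.which("pdfimages") is None:
        return None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out_dir / prefix), check=True)
    # pdfimages names its outputs '<root>-NNN.<ext>'.
    return sorted(out_dir.glob(prefix + "-*"))
```

Because this copies the images out of the PDF rather than re-rendering each page, it avoids resampling the scans.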

Bests,

Olivier


Re: best OCR scanner??

2005-09-02 Thread Gary Kline
On Thu, Sep 01, 2005 at 08:07:26PM -0700, Gary Kline wrote:
 People,
 
 I want to scan ~400 pp of an out-of-print and out-of-copyright 
 book (from 1913) and need to know what the best scanner is
 and if there has been substantial improvement in OCR 
 software in recent years.  This book has few footnotes 
 or different typefaces, so it should make things easier.
 
 Oh, and if there is something that plugs into DOS/DOZE
 and just works, super.  I'll use my W2K box.  (Hopefully,
 something that plugs into COM0 or COM1. USB okay too.)
 
 thanks for any clues; I've never used a scanner before!
 --yea, no kidding:-)
 
... just a postscript here to the list: my 
interest in the best scanner obviously applies to 
the Unix realm, too.  Just FWIW.

gary



-- 
   Gary Kline [EMAIL PROTECTED]   www.thought.org Public service Unix



Re: best OCR scanner??

2005-09-02 Thread Nikolas Britton

Any scanner will work when you're scanning a two-tone document! The only
thing that matters is the OCR software, and there is only one game in
town: OmniPage Pro by ScanSoft.

BTW it's faster (and won't damage the book) to photograph the book,
then crop, convert to B&W, fix white balance, contrast, etc. in Photoshop
or GIMP, and then import the photos into the OCR software. The
OCR software should produce fewer errors too.

After all is done post the book on gutenberg, http://www.gutenberg.org/

Oh, you should be able to find some tips about scanning books at the
gutenberg site too.
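
[Editor's sketch] The "convert to B&W" step above can also be scripted rather than done by hand in an image editor. The following is a minimal sketch, not anything posted in the thread: it assumes each photo has already been cropped and exported as an 8-bit greyscale binary PGM file, and it picks the black/white cutoff with Otsu's method, which is an assumption of this example. The function names are hypothetical.

```python
import re
import sys

# Binary PGM header: magic, width, height, maxval, with optional '#' comments.
_COMMENTS = rb"(?:#[^\n]*\n\s*)*"
_HEADER = re.compile(
    rb"P5\s+" + _COMMENTS + rb"(\d+)\s+" + _COMMENTS + rb"(\d+)\s+"
    + _COMMENTS + rb"(\d+)\s"
)


def read_pgm(data):
    """Parse an 8-bit binary PGM (P5); return (width, height, pixels)."""
    match = _HEADER.match(data)
    if not match:
        raise ValueError("not a binary PGM (P5) file")
    width, height, maxval = (int(g) for g in match.groups())
    if maxval > 255:
        raise ValueError("only 8-bit PGM is handled in this sketch")
    pixels = data[match.end():match.end() + width * height]
    if len(pixels) != width * height:
        raise ValueError("truncated raster")
    return width, height, pixels


def otsu_threshold(pixels):
    """Return the grey level that best separates dark ink from light paper."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark = sum_dark = 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance; the maximum marks the best split.
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def to_pbm(width, height, pixels, threshold):
    """Pack pixels into a binary PBM (P4); pixels <= threshold become ink."""
    row_bytes = (width + 7) // 8
    out = bytearray(f"P4\n{width} {height}\n".encode("ascii"))
    for y in range(height):
        row = bytearray(row_bytes)
        for x in range(width):
            if pixels[y * width + x] <= threshold:
                row[x // 8] |= 0x80 >> (x % 8)  # most significant bit first
        out += row
    return bytes(out)


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as src:
        w, h, px = read_pgm(src.read())
    with open(sys.argv[2], "wb") as dst:
        dst.write(to_pbm(w, h, px, otsu_threshold(px)))
```

Run as `python binarize.py page.pgm page.pbm` (the script name is made up). A single global threshold only works if the page is evenly lit; photos with shadows near the spine will probably still need manual correction.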


Re: best OCR scanner??

2005-09-02 Thread Roger Merritt

At 08:07 PM 9/1/2005 -0700, Gary Kline wrote:

People,

I want to scan ~400 pp of an out-of-print and out-of-copyright
book (from 1913) and need to know what the best scanner is
and if there has been substantial improvement in OCR
software in recent years.  This book has few footnotes
or different typefaces, so it should make things easier.

Oh, and if there is something that plugs into DOS/DOZE
and just works, super.  I'll use my W2K box.  (Hopefully,
something that plugs into COM0 or COM1. USB okay too.)

thanks for any clues; I've never used a scanner before!
--yea, no kidding:-)


I happen to have some recent experience on a Windoze machine that may be 
useful. Of the several programs that Google found for me, the one that met 
my needs best was TextBridge. The others put every paragraph into a 
separate text box, which made correcting layout and formatting a nightmare. 
TextBridge (at least the current version) seems to do a good job as long as 
the print is reasonably clear. All the OCR programs I tried had problems 
putting pictures in the right place. I don't know what's available for 
FreeBSD, since I use my boxen for gateways, not even connected to printers. 
I should warn you, though, scanning isn't quick -- figure about two minutes 
per page (YMMV) plus any formatting fixup you have to do afterward. There 
are industrial-strength applications out there, but they cost.


Can't offer advice about hardware -- I've got an Epson flatbed, pretty 
inexpensive but works well.


--
Roger



Re: best OCR scanner??

2005-09-02 Thread Gary Kline
On Fri, Sep 02, 2005 at 03:21:03PM +0700, Roger Merritt wrote:
 At 08:07 PM 9/1/2005 -0700, Gary Kline wrote:
 People,
 
 I want to scan ~400 pp of an out-of-print and out-of-copyright
 book (from 1913) and need to know what the best scanner is
 and if there has been substantial improvement in OCR
 software in recent years.  This book has few footnotes
 or different typefaces, so it should make things easier.
 
 Oh, and if there is something that plugs into DOS/DOZE
 and just works, super.  I'll use my W2K box.  (Hopefully,
 something that plugs into COM0 or COM1. USB okay too.)
 
 thanks for any clues; I've never used a scanner before!
 --yea, no kidding:-)
 
 I happen to have some recent experience on a Windoze machine that may be 
 useful. Of the several programs that Google found for me, the one that met 
 my needs best was TextBridge. The others put every paragraph into a 
 separate text box, which made correcting layout and formatting a nightmare. 
 TextBridge (at least the current version) seems to do a good job as long as 
 the print is reasonably clear. All the OCR programs I tried had problems 
 putting pictures in the right place. I don't know what's available for 
 FreeBSD, since I use my boxen for gateways, not even connected to printers. 
 I should warn you, though, scanning isn't quick -- figure about two minutes 
 per page (YMMV) plus any formatting fixup you have to do afterward. There 
 are industrial-strength applications out there, but they cost.
 
 Can't offer advice about hardware -- I've got an Epson flatbed, pretty 
 inexpensive but works well.

Doesn't sound very encouraging... :(

-gary

 
 -- 
 Roger
 

-- 
   Gary Kline [EMAIL PROTECTED]   www.thought.org Public service Unix



Re: best OCR scanner??

2005-09-02 Thread Gary Kline
On Fri, Sep 02, 2005 at 03:01:12AM -0500, Nikolas Britton wrote:
 On 9/1/05, Gary Kline [EMAIL PROTECTED] wrote:
  People,
  
  I want to scan ~400 pp of an out-of-print and out-of-copyright
  book (from 1913) and need to know what the best scanner is
  and if there has been substantial improvement in OCR
  software in recent years.  This book has few footnotes
  or different typefaces, so it should make things easier.
  
  Oh, and if there is something that plugs into DOS/DOZE
  and just works, super.  I'll use my W2K box.  (Hopefully,
  something that plugs into COM0 or COM1. USB okay too.)
  
 
 Any scanner will work when you're scanning a two-tone document! The only
 thing that matters is the OCR software, and there is only one game in
 town: OmniPage Pro by ScanSoft.

Well, the book I want to scan is from 1913: just text.
Does this scanner work with FreeBSD? or only Windows?

 
 BTW it's faster (and won't damage the book) to photograph the book,
 then crop, convert to B&W, fix white balance, contrast, etc. in Photoshop
 or GIMP, and then import the photos into the OCR software. The
 OCR software should produce fewer errors too.

Okay, can do; thanks.

 
 After all is done post the book on gutenberg, http://www.gutenberg.org/
 
 Oh, you should be able to find some tips about scanning books at the
 gutenberg site too.


Yep; that's my idea.  I've volunteered for PG, just never
at the scanning level.

gary

 

-- 
   Gary Kline [EMAIL PROTECTED]   www.thought.org Public service Unix



Re: best OCR scanner??

2005-09-02 Thread Roland Smith
On Fri, Sep 02, 2005 at 12:46:27AM -0700, Gary Kline wrote:
 On Thu, Sep 01, 2005 at 08:07:26PM -0700, Gary Kline wrote:
  People,
  
  I want to scan ~400 pp of an out-of-print and out-of-copyright 
  book (from 1913) and need to know what the best scanner is

Any scanner that works with SANE (http://www.sane-project.org/) scanner
support framework should do. For supported hardware see:
http://www.sane-project.org/sane-mfgs.html 

Epson seems to have the most supported scanners. I've got an Epson
Perfection 1650, which works fine.

  and if there has been substantial improvement in OCR 
  software in recent years.  This book has few footnotes 
  or different typefaces, so it should make things easier.

There are several free OCR programs. I've used gocr
(http://jocr.sourceforge.net/ and no, that's not a typo) and ocrad
(http://www.gnu.org/software/ocrad/ocrad.html)

Ocrad works ok, but you'll definitely have to correct errors, depending
on the quality of the pictures/scans. 

HTH,

Roland
-- 
R.F.Smith (http://www.xs4all.nl/~rsmith/) Please send e-mail as plain text.
public key: http://www.xs4all.nl/~rsmith/pubkey.txt
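
[Editor's sketch] For a ~400-page book, the OCR programs named above would most likely be driven from a script, one page image at a time. The following is a minimal sketch, not something posted in the thread. It assumes the pages have already been scanned to PNM files with zero-padded names so that they sort in page order, and that the OCR program prints recognised text to standard output when given an image file (ocrad and gocr both do this by default). The function name `ocr_pages` and the file layout are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def ocr_pages(image_paths, ocr_cmd=("ocrad",)):
    """Run an OCR command on each page image and return the combined text.

    ocr_cmd is the program (plus any options) to run; the image path is
    appended as the last argument. Pages are joined with a form feed so
    page boundaries stay visible while proofreading.
    """
    chunks = []
    for path in image_paths:
        result = subprocess.run(
            [*ocr_cmd, str(path)],
            capture_output=True,
            text=True,
            check=True,  # stop at the first page the OCR program rejects
        )
        chunks.append(result.stdout)
    return "\f".join(chunks)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumed layout: one directory of page-0001.pnm, page-0002.pnm, ...
    pages = sorted(Path(sys.argv[1]).glob("*.pnm"))
    sys.stdout.write(ocr_pages(pages))
```

Run as `python ocr_book.py scans/ > book.txt` (script and directory names are made up). Pass `ocr_cmd=("gocr",)` to compare the two programs on the same scans. Either way the output still needs proofreading by hand.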




Re: best OCR scanner??

2005-09-02 Thread Bill Campbell
On Fri, Sep 02, 2005, Gary Kline wrote:
...
   Well, the book I want to scan is from 1913: just text.
   Does this scanner work with FreeBSD? or only Windows?

As somebody else suggested, you may well be better off ``scanning'' books
with a digital camera than with a scanner.  It's often difficult to get a
book to lie flat enough on a scanner bed to get good scans.

I've been planning on getting a photographic copy table that holds the
camera at a fixed distance above its bed.  I think it would also work best
to have a flat glass or plastic sheet that can hold the page flat while
it's being photographed, with something to keep the opposite page out of the
camera's way.

I have to admit that I do all my scanning and OCR on an OS X system, only
marginally related to FreeBSD.  I use an older HP Scanjet with automatic
document feeder (ADF), and the HP software will scan straight to PDF
documents.  The Readiris OCR software can then OCR the PDF file, making it
fairly easy to deal with multiple pages.

At one point we developed a Perl/Tk program that worked with Vividata's
scanning and OCR software to scan and OCR large documents from high-end
Ricoh scanners with ADF.

Bill
--
INTERNET:   [EMAIL PROTECTED]  Bill Campbell; Celestial Software LLC
UUCP:   camco!bill  PO Box 820; 6641 E. Mercer Way
FAX:(206) 232-9186  Mercer Island, WA 98040-0820; (206) 236-1676
URL: http://www.celestial.com/

Imagine if every Thursday your shoes exploded if you tied them the usual
way.  This happens to us all the time with computers, and nobody thinks of
complaining.
-- Jef Raskin http://jefraskin.com/


Re: best OCR scanner??

2005-09-02 Thread Nikolas Britton
On 9/2/05, Gary Kline [EMAIL PROTECTED] wrote:
 On Fri, Sep 02, 2005 at 03:01:12AM -0500, Nikolas Britton wrote:
  On 9/1/05, Gary Kline [EMAIL PROTECTED] wrote:
   People,
  
   I want to scan ~400 pp of an out-of-print and out-of-copyright
   book (from 1913) and need to know what the best scanner is
   and if there has been substantial improvement in OCR
   software in recent years.  This book has few footnotes
   or different typefaces, so it should make things easier.
  
   Oh, and if there is something that plugs into DOS/DOZE
   and just works, super.  I'll use my W2K box.  (Hopefully,
   something that plugs into COM0 or COM1. USB okay too.)
  
 
  Any scanner will work when you're scanning a two-tone document! The only
  thing that matters is the OCR software, and there is only one game in
  town: OmniPage Pro by ScanSoft.
 
 Well, the book I want to scan is from 1913: just text.
 Does this scanner work with FreeBSD? or only Windows?

The OCR software? It works on Windows and Mac OS X. The software isn't
cheap, though: the current full version, 15, retails for $500. You may
be able to find a demo version, so you can try before you buy, if you
look in the right places.

 
 
  BTW it's faster (and won't damage the book) to photograph the book,
  then crop, convert to B&W, fix white balance, contrast, etc. in Photoshop
  or GIMP, and then import the photos into the OCR software. The
  OCR software should produce fewer errors too.
 
 Okay, can do; thanks.

Have you ever seen a spy (movies) use a scanner to copy top secret
documents? :-)

I would just make a jig out of wood to hold the digital camera, with a
flat bottom to hold the book. It would be best if you had a 35mm AF
SLR camera with something like a 20-50mm macro lens, but any camera should
work. If you have an SLR camera but no macro lens, you can try flipping
your lens around.

 
 
  After all is done post the book on gutenberg, http://www.gutenberg.org/
 
  Oh, you should be able to find some tips about scanning books at the
  gutenberg site too.
 
 
 Yep; that's my idea.  I've volunteered for PG, just never
 at the scanning level.
 

Cool.


best OCR scanner??

2005-09-01 Thread Gary Kline
People,

I want to scan ~400 pp of an out-of-print and out-of-copyright 
book (from 1913) and need to know what the best scanner is
and if there has been substantial improvement in OCR 
software in recent years.  This book has few footnotes 
or different typefaces, so it should make things easier.

Oh, and if there is something that plugs into DOS/DOZE 
and just works, super.  I'll use my W2K box.  (Hopefully,
something that plugs into COM0 or COM1. USB okay too.)

thanks for any clues; I've never used a scanner before!
--yea, no kidding:-)

gary


-- 
   Gary Kline [EMAIL PROTECTED]   www.thought.org Public service Unix
