OCR Form tools
I have thousands of forms equivalent to invoices that I'd like to put into a database. I'm thinking I would like to have some OCR app/tool scan these forms, and then generate a CSV with each field. Does anyone have recommendations on software for this? -- Adam Vande More ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org
Re: OCR Form tools
I don't know about the second half; you might have to write something up in Perl for that. But I am using tesseract (graphics/tesseract) to convert some old documents for my father, and it works pretty slick.

On Dec 8, 2011, at 10:10 AM, Adam Vande More wrote:
> I have thousands of forms equivalent to invoices that I'd like to put into a database. I'm thinking I would like to have some OCR app/tool scan these forms, and then generate a CSV with each field. Does anyone have recommendations on software for this?
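For the "second half" (turning OCR output into a CSV), a minimal sketch in Python rather than Perl might look like the following. It assumes a reasonably recent `tesseract` binary on the PATH that accepts `stdout` as its output name. The field names and regular expressions (`invoice_no`, `date`, `total`) are purely hypothetical and would have to be adapted to whatever labels the real forms carry.

```python
import csv
import io
import re
import subprocess

# Hypothetical field patterns; adjust to the labels printed on the real forms.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\.?\s*[:#]?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*:?\s*(\d{1,2}/\d{1,2}/\d{2,4})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def ocr_image(image_path):
    """Run tesseract on one scanned page and return the recognized text."""
    # Assumption: "tesseract IMAGE stdout" prints the text instead of writing a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_fields(text):
    """Pull each known field out of the OCR text; missing fields stay empty."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def rows_to_csv(rows):
    """Serialize a list of field dicts as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Each scanned form would go through `ocr_image()` and then `extract_fields()`, and the collected rows through `rows_to_csv()`. Since OCR output is rarely perfect, rows with empty fields are probably worth checking by hand.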
Re: OCR...
On Wed, Jan 28, 2009 at 8:23 PM, Gary Kline kl...@thought.org wrote:
> well, i just got an email from a david hazard who said to look on their website; i replied that i had and couldn't find their test suite. if/when this guy replies, i'll share.
> gary

Start here: http://www.abbyy.com/sdk/?param=59956

I will try again, as well.

Andrew
Re: OCR...
From: Gary Kline kl...@thought.org
Sent: Thursday, January 29, 2009 4:23 AM
To: Andrew Gould andrewlylego...@gmail.com
Cc: Reko Turja reko.tu...@liukuma.net; FreeBSD Mailing List freebsd-questions@freebsd.org
Subject: Re: OCR...

> well, damage is probably done. how can i check the resolution? i tried to increase it by creating huge ppm and tif files, but then that's really absurd since there can only be just so much data per image. i _could_ try xv and jpeg and smoothing image

Yeah, if the image resolution is already 72 DPI, there's sadly no trick in the world that can reliably bring back the lost information. I've read some horrid low-resolution scans in FineReader, and it can grab much of the information nicely. With low resolution, though, be prepared to correct problem spots manually.

The only reliable way to guesstimate the resolution is the font size at 100% zoom on screen. If the text is about 10 pixels high, the information has probably been stored at 72 DPI to save space.

I wasn't aware of the FreeBSD/Linux backend, but if that works it'd be great. I haven't visited their website in ages, as the version I have does the job I got it for.

-Reko
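The rule of thumb above (10 pt text about 10 pixels high suggests 72 DPI) follows from one typographic point being 1/72 inch. A small Python sketch of that arithmetic is below. The function name is made up for illustration, and the result is only a rough estimate, because the height measured on screen is the glyph height and not the exact nominal font size.

```python
def estimate_dpi(text_height_px, font_size_pt):
    """Rough scan resolution from how tall the text appears at 100% zoom."""
    if font_size_pt <= 0:
        raise ValueError("font size must be positive")
    # One typographic point is 1/72 inch, so pixels-per-inch = px / (pt / 72).
    return 72.0 * text_height_px / font_size_pt


print(estimate_dpi(10, 10))  # 10 pt text, 10 px high: 72.0
```

By the same arithmetic, 10 pt text that measures roughly 42 pixels high would point to a scan of about 300 DPI.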
Re: OCR...
On Thu, Jan 29, 2009 at 7:11 AM, Reko Turja reko.tu...@liukuma.net wrote:
> Wasn't aware of the FreeBSD/Linux backend, but if that works it'd be great - haven't myself visited their website in ages as the version I have does the job I got it for.
> -Reko

I may have used the wrong term. I think it used to be an SDK, but now there's a product with extended platform support for Linux and FreeBSD.

Andrew
Re: OCR...
Gary Kline wrote:
> well, i'm ashamed to admit that i've put at least a dozen hours in trying, then re-re-retrying to OCR an imaged pdf file with as many open source ocr packages as i can find.

I have seen good results with tesseract, which is in the ports and free. Otherwise, for commercial software, with OmniPage (it runs under wine).

-- Michel TALON
Re: OCR...
> so what is the best commercial/shareware that can read a 10pt-font file? (( also, when i have time to get back into actually hacking, this [[turning imaged pdf into OCR'able ascii or 8859-1]] is going to be a first target. any idea which team i should go with. gOCR looks best so far to me.

ABBYY FineReader. OmniPage hasn't been able to catch up with it in several years, either feature-wise or quality-wise. No idea if FineReader runs under an emulator, though.

If the file is already a 72 DPI PDF with the text stored as graphics, most of the damage has already been done, and it will be extremely hard to OCR.

-Reko
Re: OCR...
On Wed, Jan 28, 2009 at 12:08:55PM +0200, Reko Turja wrote:
> If the file is already a PDF and 72 DPI with text as graphics most of the damage has already been done, and it will be extremely hard to OCR.

well, damage is probably done. how can i check the resolution? i tried to increase it by creating huge ppm and tif files, but then that's really absurd since there can only be just so much data per image. i _could_ try xv and jpeg and smoothing image to refine, but too much hassle.

(i used gocr -m 130 and saw the glyphs it (presumably) saw. seemed pretty much okay to my eyes. but then i'm not a computer program. [MAYBE :)])

gary

--
Gary Kline kl...@thought.org http://www.thought.org Public Service Unix
http://jottings.thought.org http://transfinite.thought.org
The 2.23a release of Jottings: http://jottings.thought.org/index.php
Re: OCR...
On Wed, Jan 28, 2009 at 1:22 PM, Gary Kline kl...@thought.org wrote:
> well, damage is probably done. how can i check the resolution? i tried to increase it by creating huge ppm and tif files, but then that's really absurd since there can only be just so much data per image.

At one point in time, the ABBYY folks were offering a back-end that ran on FreeBSD. I tried to get the free download, but it never happened. (They misplaced my signed, faxed license agreement and I finally got tired of the back-and-forth prerequisite communication.)

ABBYY also no longer supports Mac OS X. I use an old version and like it a lot.

Andrew
Re: OCR...
On Wed, Jan 28, 2009 at 01:32:57PM -0600, Andrew Gould wrote:
> At one point in time, the Abby folks were offering a back-end that ran on FreeBSD. I tried to get the free download; but it never happened. (They misplaced my signed, faxed license agreement and I finally got tired of the back-and-forth prerequisite communication.)
>
> Abby also no longer supports Mac OS X. I use an old version and like it a lot.

OK, now i know what to expect. I found their site and signed up to get the trial of the linux version. not likely to go any further.

gary
Re: OCR...
On Wed, Jan 28, 2009 at 5:09 PM, Gary Kline kl...@thought.org wrote:
> OK, now i know what to expect. I found their site and signed up to get the trial of the linux version. not likely to go any further.
> gary

I'm rooting for you! :-)
Re: OCR...
On Wed, Jan 28, 2009 at 07:33:41PM -0600, Andrew Gould wrote:
> I'm rooting for you! :-)

well, i just got an email from a david hazard who said to look on their website; i replied that i had and couldn't find their test suite. if/when this guy replies, i'll share.

gary
OCR...
guys,

well, i'm ashamed to admit that i've put at least a dozen hours in trying, then re-re-retrying to OCR an imaged pdf file with as many open source ocr packages as i can find. before i quit for supper tonight, i finally threw in the towel. realized that i would have been THROUGH with all 181 pages of the text on Aristotle if i had just read the bloody thing. but anyway, i'm done. there simply is no freeware that runs on a 'nix computer//real computer.

so what is the best commercial/shareware that can read a 10pt-font file?

(( also, when i have time to get back into actually hacking, this [[turning imaged pdf into OCR'able ascii or 8859-1]] is going to be a first target. any idea which team i should go with. gOCR looks best so far to me. ))

gary
Re: any way to turn a pdf file into something OCR-able?
On Mon, Dec 01, 2008 at 08:23:09PM -0500, Robert Huff wrote:
> Roland Smith writes:
> > 1) Some PDFs are just wrappers around JPEG images. In this case there is no text for pdftotext to convert = epic fail.
>
> In this case convert from the ImageMagick port will get you a series of .jpg/.gif/.whatever. Read the manual carefully before attempting; also note this can be a slow process.

That still doesn't give plain text, though. For that one would need an OCR app. There is a new one available in ports called cuneiform. It is supposed to be quite good, but I haven't had the need to try it yet. I've tried gocr and tesseract in the past but was not really impressed with them.

For short documents it's easier to do the OCR with the Mk I eyeball and brain. :-) You'll have to check an OCR-ed document completely for errors anyway.

Roland

--
R.F.Smith http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)
Re: any way to turn a pdf file into something OCR-able?
On Tue, Dec 02, 2008 at 02:07:30AM +0100, Roland Smith wrote:
> Please define fail in this context? I've used pdftotext on documents exceeding 40MB. However there are of course things that don't work;
>
> 1) Some PDFs are just wrappers around JPEG images. In this case there is no text for pdftotext to convert = epic fail.

It probably was a pdf wrapped around a jpeg. I was able to convert another pdf to plaintext in a flash.

(*sigh*) it wasn't a total waste of time because I found the entire text transferred to buggy ASCII somewhere [[ thanks to some prof ]]. So, if I ever want to run aspell against a 900-page file, at least I have that option!

gary

--
Gary Kline [EMAIL PROTECTED] http://www.thought.org Public Service Unix
http://jottings.thought.org http://transfinite.thought.org
Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php
Re: any way to turn a pdf file into something OCR-able?
On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline [EMAIL PROTECTED] writes:
> > pdftotext fail on the large [32MB] file I've got. Is there any other way I can translate this huge textfile to ascii or html or text?
>
> I wrote some code using Python PDF library 'pypdf' to split a multipage PDF scan into individual pages, then used the tesseract OCR to convert to text. Not 100% of course, and it really got confused by pages that were not right-side-up, but not a bad start for pages that are really scans -- images -- rather than PDF representation of text. Sadly, I haven't gotten it into a suitable state to release.

Well, sounds hopeful for when I scan around 200 pages of pre-1923 journal articles. These are in columnar form, IIRC. It would be WONDERFUL if there were some kind of hardware to translate old books and journals automagically. ...

gary
any way to turn a pdf file into something OCR-able?
Guys,

pdftotext fails on the large [32MB] file I've got. Is there any other way I can translate this huge textfile to ascii or html or text?

thanks,

gary
Re: any way to turn a pdf file into something OCR-able?
On Mon, Dec 01, 2008 at 03:14:43PM -0800, Gary Kline wrote:
> pdftotext fail on the large [32MB] file I've got. Is there any other way I can translate this huge textfile to ascii or html or text?

Please define "fail" in this context? I've used pdftotext on documents exceeding 40MB. However, there are of course things that don't work:

1) Some PDFs are just wrappers around JPEG images. In this case there is no text for pdftotext to convert = epic fail.
2) If the text contains ligatures etc., you should use a proper encoding that contains such characters (e.g. '-enc UTF-8') or you will lose them.
3) Things like equations will not render well, if at all. This also depends on the encoding.

Roland
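Points 1 and 2 above can be checked mechanically. The following Python sketch assumes the `pdftotext` binary is installed and on the PATH. It runs with `-enc UTF-8` and treats a PDF whose output contains almost no visible characters as a probable image-only scan. The helper names and the 20-character cutoff are arbitrary choices for illustration, not part of pdftotext.

```python
import subprocess


def pdftotext_command(pdf_path, encoding="UTF-8"):
    """Build a pdftotext call that writes the text to standard output."""
    # "-enc" selects the output encoding; "-" as the output file means stdout.
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def looks_image_only(text, min_chars=20):
    """Guess whether a PDF had no text layer from pdftotext's output."""
    # Scanned PDFs usually yield only whitespace and form feeds.
    visible = [ch for ch in text if not ch.isspace()]
    return len(visible) < min_chars


def extract_text(pdf_path):
    """Return the PDF's text, or None if the PDF seems to contain only scans."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return None if looks_image_only(text) else text
```

A `None` result from `extract_text()` would be the cue to fall back to an OCR tool instead of retrying pdftotext with other options.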
Re: any way to turn a pdf file into something OCR-able?
Roland Smith writes:
> > pdftotext fail on the large [32MB] file I've got. Is there any other way I can translate this huge textfile to ascii or html or text?
>
> Please define fail in this context? I've used pdftotext on documents exceeding 40MB. However there are of course things that don't work;
>
> 1) Some PDFs are just wrappers around JPEG images. In this case there is no text for pdftotext to convert = epic fail.

In this case convert from the ImageMagick port will get you a series of .jpg/.gif/.whatever. Read the manual carefully before attempting; also note this can be a slow process.

Robert Huff
Re: any way to turn a pdf file into something OCR-able?
> > 1) Some PDFs are just wrappers around JPEG images. In this case there is no text for pdftotext to convert = epic fail.
>
> In this case convert from the ImageMagick port will get you a series of .jpg/.gif/.whatever. Read the manual carefully before attempting; also note this can be a slow process.

pdfimages (from ports graphics/xpdf) can also do that, maybe at a lesser cost.

Bests,
Olivier
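Putting the suggestions in this thread together, a scanned PDF could be handled by dumping its page images with pdfimages and passing each one to tesseract. The Python sketch below assumes both binaries are on the PATH and that the installed tesseract can read the image formats pdfimages writes (older tesseract releases only accepted TIFF). The function names and the `page` file prefix are invented for illustration.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root):
    """Build the pdfimages call that dumps every embedded image."""
    # "-j" keeps JPEG (DCT) images as .jpg files instead of converting them.
    return ["pdfimages", "-j", str(pdf_path), str(image_root)]


def tesseract_command(image_path):
    """Build the tesseract call for one image; output goes to <image stem>.txt."""
    image_path = Path(image_path)
    # tesseract appends ".txt" to the output base name itself.
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def ocr_pdf(pdf_path, work_dir):
    """Extract the page images from a scanned PDF and OCR them in name order."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, work_dir / "page"), check=True)
    pages = []
    for image in sorted(work_dir.glob("page-*")):
        if image.suffix == ".txt":
            continue  # skip text files left over from an earlier run
        subprocess.run(tesseract_command(image), check=True)
        text_file = image.with_suffix(".txt")
        pages.append(text_file.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(pages)
```

The result still needs proofreading, as noted elsewhere in the thread, and pages that are rotated or set in columns may come out scrambled.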
Re: best OCR scanner??
On Thu, Sep 01, 2005 at 08:07:26PM -0700, Gary Kline wrote:
> People,
>
> I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is and if there has been substantial improvement in OCR software in recent years. This book has few footnotes or different typefaces, so it should make things easier.
>
> Oh, and if there is something that plugs into DOS/DOZE and just works, super. I'll use my W2K box. (Hopefully, something that plugs into COM0 or COM1. USB okay too.)
>
> thanks for any clues; I've never used a scanner before! --yea, no kidding:-)

... just a postscript here to the list: my interest in the best scanner obviously applies to the Unix realm, too. Just FWIW.

gary

--
Gary Kline [EMAIL PROTECTED] www.thought.org Public service Unix
Re: best OCR scanner??
On 9/1/05, Gary Kline [EMAIL PROTECTED] wrote:
> I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is and if there has been substantial improvement in OCR software in recent years.

Any scanner will work when you're scanning a two-tone document! The only thing that matters is the OCR software, and there is only one game in town: OmniPage Pro by ScanSoft.

BTW, it's faster (and won't damage the book) to photograph the book, then crop, convert to BW, and adjust white balance, contrast, etc. in Photoshop or GIMP, and then import the photos into the OCR software. The OCR software should produce fewer errors too.

After all is done, post the book on Gutenberg, http://www.gutenberg.org/. Oh, and you should be able to find some tips about scanning books at the Gutenberg site too.
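The "convert to BW" step mentioned above is, at its simplest, a threshold on grayscale values. The poster did this in Photoshop or GIMP. The tiny Python sketch below only illustrates the idea on a grid of 0-255 gray pixels, with the function name and the cutoff of 128 as arbitrary assumptions.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map 0-255 grayscale pixels to pure black (0) or white (255)."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_rows
    ]
```

Photographs with uneven lighting usually need a threshold chosen per image (or per region) rather than one fixed value.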
Re: best OCR scanner??
At 08:07 PM 9/1/2005 -0700, Gary Kline wrote:
> I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is and if there has been substantial improvement in OCR software in recent years.
>
> thanks for any clues; I've never used a scanner before! --yea, no kidding:-)

I happen to have some recent experience on a Windoze machine that may be useful. Of the several programs that Google found for me, the one that met my needs best was Textbridge. The others put every paragraph into a separate text box, which made correcting layout and formatting a nightmare. Textbridge (at least the current version) seems to do a good job as long as the print is reasonably clear. All the OCR programs I tried had problems putting pictures in the right place.

I don't know what's available for FreeBSD, since I use my boxen for gateways, not even connected to printers.

I should warn you, though, scanning isn't quick: figure about two minutes per page (YMMV) plus any formatting fixup you have to do afterward. There are industrial-strength applications out there, but they cost.

Can't offer advice about hardware. I've got an Epson flatbed, pretty inexpensive but works well.

-- Roger
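To put the two-minutes-per-page figure in perspective for a book of about 400 pages, here is the arithmetic as a short Python sketch. The function name is made up, and the estimate leaves out the formatting fixup mentioned above.

```python
def scanning_hours(pages, minutes_per_page=2.0):
    """Estimate total scanning time in hours, excluding formatting fixup."""
    return pages * minutes_per_page / 60.0


print(round(scanning_hours(400), 1))  # about 13.3 hours for 400 pages
```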
Re: best OCR scanner??
On Fri, Sep 02, 2005 at 03:21:03PM +0700, Roger Merritt wrote: At 08:07 PM 9/1/2005 -0700, Gary Kline wrote: People, I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is and if there has been substantial improvement in OCR software in recent years. This book has few footnotes or different typefaces, so it should make things easier. Oh, and if there is something that plugs into DOS/DOZE and just works, super. I'll use my W2K box. (Hopefully, something that plugs into COM0 or COM1. USB okay too.) thanks for any clues; I've never used a scanner before! --yea, no kidding:-) I happen to have some recent experience on a Windoze machine that may be useful. Of the several programs that Google found for me, the one that met my needs best was Textbridge. The others put every paragraph into a separate text box, which made correcting layout and formatting a nightmare. Textbridge (at least the current version) seems to do a good job as long as the print is reasonably clear. All the OCR programs I tried had problems putting pictures in the right place. I don't know what's available for FreeBSD, since I use my boxen for gateways, not even connected to printers. I should warn you, though, scanning isn't quick -- figure about two minutes per page (YMMV) plus any formatting fixup you have to do afterward. There are industrial-strength applications out there, but they cost. Can't offer advice about hardware -- I've got an Epson flatbed, pretty inexpensive but works good. Doesn't sound very encouraging... :( -gary -- Roger -- Gary Kline [EMAIL PROTECTED] www.thought.org Public service Unix
Re: best OCR scanner??
On Fri, Sep 02, 2005 at 03:01:12AM -0500, Nikolas Britton wrote: On 9/1/05, Gary Kline [EMAIL PROTECTED] wrote: People, I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is and if there has been substantial improvement in OCR software in recent years. This book has few footnotes or different typefaces, so it should make things easier. Oh, and if there is something that plugs into DOS/DOZE and just works, super. I'll use my W2K box. (Hopefully, something that plugs into COM0 or COM1. USB okay too.) Any scanner will work when you're scanning a two-tone document! The only thing that matters is the OCR software, and there is only one game in town: OmniPage Pro by ScanSoft. Well, the book I want to scan is from 1913: just text. Does this scanner work with FreeBSD? or only Windows? BTW it's faster (and won't damage the book) to photograph the book, then crop, convert to BW, and adjust white balance, contrast, etc. in Photoshop or GIMP, and then import the photos into the OCR software. The OCR software should produce fewer errors too. Okay, can do; thanks. After all is done, post the book on Gutenberg (http://www.gutenberg.org/). Oh, you should be able to find some tips about scanning books at the Gutenberg site too. Yep; that's my idea. I've volunteered for PG, just never at the scanning level. gary -- Gary Kline [EMAIL PROTECTED] www.thought.org Public service Unix
Re: best OCR scanner??
On Fri, Sep 02, 2005 at 12:46:27AM -0700, Gary Kline wrote: On Thu, Sep 01, 2005 at 08:07:26PM -0700, Gary Kline wrote: People, I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is Any scanner that works with the SANE (http://www.sane-project.org/) scanner support framework should do. For supported hardware see: http://www.sane-project.org/sane-mfgs.html Epson seems to have the most supported scanners. I've got an Epson Perfection 1650, which works fine. and if there has been substantial improvement in OCR software in recent years. This book has few footnotes or different typefaces, so it should make things easier. There are several free OCR programs. I've used gocr (http://jocr.sourceforge.net/ and no, that's not a typo) and ocrad (http://www.gnu.org/software/ocrad/ocrad.html). Ocrad works OK, but you'll definitely have to correct errors, depending on the quality of the pictures/scans. HTH, Roland -- R.F.Smith (http://www.xs4all.nl/~rsmith/) Please send e-mail as plain text. public key: http://www.xs4all.nl/~rsmith/pubkey.txt
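A minimal sketch of how the free tools named above might be driven over a stack of page images, one text file per page. It assumes the pages already exist as PNM files (hypothetical names like page001.pnm) and that `ocrad -o OUTPUT INPUT` and `gocr -i INPUT -o OUTPUT` behave as their manuals describe; check `ocrad --help` and `gocr -h` on your own system before relying on the flags. The helper names are invented for the example.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(page, tool="ocrad"):
    """Build the command line that OCRs one PNM page into a .txt file."""
    text_file = str(Path(page).with_suffix(".txt"))
    if tool == "ocrad":
        # Assumed usage: ocrad -o OUTPUT INPUT
        return ["ocrad", "-o", text_file, str(page)]
    if tool == "gocr":
        # Assumed usage: gocr -i INPUT -o OUTPUT
        return ["gocr", "-i", str(page), "-o", text_file]
    raise ValueError(f"unknown OCR tool: {tool}")


def ocr_pages(pages, tool="ocrad"):
    """Run the chosen OCR program on every page, stopping on the first failure."""
    if shutil.which(tool) is None:
        raise RuntimeError(f"{tool} is not installed or not on PATH")
    for page in pages:
        subprocess.run(ocr_command(page, tool), check=True)


# Example: show what would be run for one page without running it.
print(ocr_command("page001.pnm"))
```

Keeping one text file per page makes the inevitable hand correction easier, since each file can be proofread against its own image.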
Re: best OCR scanner??
On Fri, Sep 02, 2005, Gary Kline wrote: ... Well, the book I want to scan is from 1913: just text. Does this scanner work with FreeBSD? or only Windows? As somebody else suggested, you may well be better off ``scanning'' books with a digital camera than with a scanner. It's often difficult to get a book to lie flat enough on a scanner bed to get good scans. I've been planning on getting a photographic copy table that holds the camera at a fixed distance above its bed. I think it would also work best to have a flat glass or plastic sheet that can hold the page flat while it's being photographed, with something to keep the opposite page out of the camera's way. I have to admit that I do all my scanning and OCR on an OS X system, only marginally related to FreeBSD. I use an older HP Scanjet with automatic document feeder (ADF), and the HP software will scan straight to PDF documents. The Readiris OCR software can then OCR the PDF file, making it fairly easy to deal with multiple pages. At one point we developed a Perl/Tk program that worked with Vividata's scanning and OCR software to scan and OCR large documents from high-end Ricoh scanners with ADF. Bill -- INTERNET: [EMAIL PROTECTED] Bill Campbell; Celestial Software LLC UUCP: camco!bill PO Box 820; 6641 E. Mercer Way FAX:(206) 232-9186 Mercer Island, WA 98040-0820; (206) 236-1676 URL: http://www.celestial.com/ Imagine if every Thursday your shoes exploded if you tied them the usual way. This happens to us all the time with computers, and nobody thinks of complaining. -- Jef Raskin http://jefraskin.com/
Re: best OCR scanner??
On 9/2/05, Gary Kline [EMAIL PROTECTED] wrote: On Fri, Sep 02, 2005 at 03:01:12AM -0500, Nikolas Britton wrote: On 9/1/05, Gary Kline [EMAIL PROTECTED] wrote: People, I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is and if there has been substantial improvement in OCR software in recent years. This book has few footnotes or different typefaces, so it should make things easier. Oh, and if there is something that plugs into DOS/DOZE and just works, super. I'll use my W2K box. (Hopefully, something that plugs into COM0 or COM1. USB okay too.) Any scanner will work when you're scanning a two-tone document! The only thing that matters is the OCR software, and there is only one game in town: OmniPage Pro by ScanSoft. Well, the book I want to scan is from 1913: just text. Does this scanner work with FreeBSD? or only Windows? The OCR software? It works on Windows and Mac OS X. The software isn't cheap though; the current full version, 15, retails for $500. You may be able to find a demo version, so you can try before you buy, if you look in the right places. BTW it's faster (and won't damage the book) to photograph the book, then crop, convert to BW, and adjust white balance, contrast, etc. in Photoshop or GIMP, and then import the photos into the OCR software. The OCR software should produce fewer errors too. Okay, can do; thanks. Have you ever seen a spy (in the movies) use a scanner to copy top secret documents? :-) I would just make a jig out of wood to hold the digital camera, with a flat bottom to hold the book. It would be best if you had a 35mm AF SLR camera with something like a 20-50mm macro lens, but any camera should work. If you have an SLR camera but no macro lens, you can try flipping your lens around. After all is done, post the book on Gutenberg (http://www.gutenberg.org/). Oh, you should be able to find some tips about scanning books at the Gutenberg site too. Yep; that's my idea. 
I've volunteered for PG, just never at the scanning level. Cool.
best OCR scanner??
People, I want to scan ~400 pp of an out-of-print and out-of-copyright book (from 1913) and need to know what the best scanner is and if there has been substantial improvement in OCR software in recent years. This book has few footnotes or different typefaces, so it should make things easier. Oh, and if there is something that plugs into DOS/DOZE and just works, super. I'll use my W2K box. (Hopefully, something that plugs into COM0 or COM1. USB okay too.) thanks for any clues; I've never used a scanner before! --yea, no kidding:-) gary -- Gary Kline [EMAIL PROTECTED] www.thought.org Public service Unix