[Vo]:Text under image Acrobat files

Jed Rothwell Fri, 04 Sep 2009 14:00:26 -0700

Michel Jullian wrote:

For pdfs with low fi underlying ocr, in my experience saving them as
pure image files and then re-ocr-ing them with the latest version of
Acrobat often improves the ocr quality considerably.

Yes. This is because OCR accuracy has improved. It may also be because many text-under-image Acrobat files were generated with the Hewlett-Packard scanning program. This is a good program overall, but companies that specialize in OCR do a better job.

Also, did you know you can batch-ocr any number of pdfs at a time?


With what program?

I think it would be only a matter of a few days of automated computer
work to make your whole collection of many thousands of CF and
peripheral papers searchable.

They are mostly searchable now. But there would be no point, since I cannot upload them.

A thought regarding copyright issues, rather than seeking uploading
permission for every single paper, would there be a big risk in
uploading everything and then removing only those the copyright
holders ask you to remove?

There would be a big risk because when a publisher notices one paper they tend to look around and find the others, and then they ask me to remove all of them. This has happened to me on a couple of occasions.

For those, how about functioning like a real library, where the
library card holders can download copyrighted material?

I am trying to arrange something along these lines. I need institutional support. I am negotiating with several institutions, slo-o-o-w-l-y.


. . .

For people unfamiliar with Acrobat, let me explain that a text-under-image Acrobat file is a facsimile of the original document with the text apparently aligned under the image. You can view the text alone by two methods:

1. Use PDF Professional or some other specialist program and select "save file as text" (or "save as Microsoft Word").

2. For a quick check, put a block around the text, copy it, and then paste it into a text editor. If nothing appears on your screen, try pasting again into a graphics program such as IrfanView. If an image appears, then you are looking at an image-only Acrobat file. If nothing appears, then you are looking at a copy-protected Acrobat file.
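The quick check in method 2 boils down to a small decision rule. Here is a rough sketch of it in Python. The function name and its two arguments are made up for illustration; they stand for what you saw when you pasted by hand, and nothing here reads an Acrobat file.

```python
def classify_paste_test(text_paste: str, image_paste_shows_image: bool) -> str:
    """Apply the two-step paste test by hand and record the outcome.

    text_paste: whatever appeared when you pasted into a text editor
                (hypothetical input, typed in or captured by the user).
    image_paste_shows_image: True if pasting into a graphics program
                such as IrfanView produced an image.
    """
    # Step 1: text appeared, so the file has a text layer.
    if text_paste.strip():
        return "has text layer"
    # Step 2: no text, but an image appeared.
    if image_paste_shows_image:
        return "image only"
    # Neither paste produced anything.
    return "copy protected"


if __name__ == "__main__":
    print(classify_paste_test("Excess heat was observed", False))
    print(classify_paste_test("", True))
    print(classify_paste_test("", False))
```

Note that the first result only tells you a text layer exists. It does not tell you how good the underlying OCR is.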

If you have a copy-protected Acrobat file, print it on paper and then scan it in and OCR it yourself. A variation on this method is to install a print server program that outputs disk images, Acrobat files and various other formats. In other words, you print from the Acrobat file into another Acrobat file. All of the paid Acrobat utilities from Adobe, Nuance and other vendors have such drivers.

Printing to paper and then scanning in degrades image quality but it has no effect on text and no measurable effect on OCR accuracy. Actually, sometimes the OCR works better after you do this. Printing from Acrobat to an Acrobat print server should have no effect on quality. If it does, check the settings.

To summarize, it is so simple to overcome Acrobat copy protection you wonder why they even bother with it.

I am looking at the Acrobat creation screen here and I find . . . It is possible to restrict a file even from printing, or to specify "low resolution printing only." Surely that would defeat the whole purpose of Acrobat.

By the way, when you scan to images, rather than scanning directly to Acrobat, always make the image resolution at least 300 dpi, grayscale. Never use black and white bitmap. Color is seldom needed and may actually degrade OCR quality. Scanning at resolutions higher than 300 dpi does not improve OCR quality, except with very small fine print, and even then it does not help much.
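To get a feel for what these settings cost in disk space, here is a back-of-the-envelope sketch in Python. It assumes a US letter page (8.5 x 11 inches), 8 bits per pixel for grayscale, 24 for color and 1 for black and white bitmap, and it counts uncompressed pixels only. Real files will be smaller after compression. The function name is just for illustration.

```python
def scan_size(width_in, height_in, dpi=300, bits_per_pixel=8):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, total_bits // 8


if __name__ == "__main__":
    # Assumed page size: US letter, 8.5 x 11 inches.
    for label, dpi, bpp in [
        ("300 dpi grayscale", 300, 8),
        ("300 dpi color", 300, 24),
        ("600 dpi grayscale", 600, 8),
        ("300 dpi black and white", 300, 1),
    ]:
        w, h, size = scan_size(8.5, 11, dpi, bpp)
        print(f"{label}: {w} x {h} pixels, {size:,} bytes uncompressed")
```

Under these assumptions a 300 dpi grayscale page is 2550 x 3300 pixels, about 8.4 million bytes before compression. Color triples that and 600 dpi quadruples it, which is a lot of extra data for little or no gain in OCR quality.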


- Jed
