Michel Jullian wrote:

For pdfs with low fi underlying ocr, in my experience saving them as
pure image files and then re-ocr-ing them with the latest version of
Acrobat often improves the ocr quality considerably.

Yes. This is because OCR accuracy has improved. It may also be because many text-under-image Acrobat files were generated with the Hewlett-Packard scanning program. This is a good program overall, but companies that specialize in OCR do a better job.


Also, did you know you can batch-ocr any number of pdfs at a time?

With what program?


I think it would be only a matter of a few days of automated computer
work to make your whole collection of many thousands of CF and
peripheral papers searchable.

They are mostly searchable now. But there would be no point, since I cannot upload them.


A thought regarding copyright issues, rather than seeking uploading
permission for every single paper, would there be a big risk in
uploading everything and then removing only those the copyright
holders ask you to remove?

There would be a big risk, because when a publisher notices one paper they tend to look around and find the others, and then they ask me to remove all of them. This has happened to me on a couple of occasions.




For those, how about functioning like a real library, where the
library card holders can download copyrighted material?

I am trying to arrange something along these lines. I need institutional support. I am negotiating with several institutions, slo-o-o-w-l-y.

. . .

For people unfamiliar with Acrobat, let me explain that a text-under-image Acrobat file is a facsimile of the original document with the OCR text aligned, invisibly, under the image. You can view the text alone by two methods:

1. Use PDF Professional or some other specialist program and select "save file as text" (or "save as Microsoft Word").

2. For a quick check, select a block of text, copy it, and then paste it into a text editor. If nothing appears on your screen, try pasting again into a graphics program such as IrfanView. If an image appears, then you are looking at an image-only Acrobat file. If nothing appears, then you are looking at a copy-protected Acrobat file.
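
The quick check in method 2 boils down to a small decision table. Here is a minimal sketch of it in Python. The function name and the result labels are made up for illustration, and the two flags are simply what you observe when you paste; nothing here reads the file itself.

```python
def classify_acrobat_file(text_pasted: bool, image_pasted: bool) -> str:
    """Classify a PDF from the copy-and-paste check.

    text_pasted:  True if pasting into a text editor produced text.
    image_pasted: True if pasting into a graphics program produced an image.
    The returned labels are illustrative, not official Acrobat terms.
    """
    if text_pasted:
        # Text came across, so the file carries a text layer.
        return "text layer present"
    if image_pasted:
        # Only a picture came across: there is no text to search.
        return "image only"
    # Neither paste produced anything: copying is blocked.
    return "copy protected"


if __name__ == "__main__":
    # Example: nothing in the text editor, but an image in the graphics program.
    print(classify_acrobat_file(text_pasted=False, image_pasted=True))
```

The text editor is tried first because a successful text paste settles the question; the graphics program only matters when the text paste comes up empty.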

If you have a copy-protected Acrobat file, print it on paper and then scan it in and OCR it yourself. A variation on this method is to install a print server program that outputs disk images, Acrobat files and various other formats. In other words, you print from the Acrobat file into another Acrobat file. All of the paid Acrobat utilities from Adobe, Nuance and other vendors have such drivers.

Printing to paper and then scanning in degrades image quality but it has no effect on text and no measurable effect on OCR accuracy. Actually, sometimes the OCR works better after you do this. Printing from Acrobat to an Acrobat print server should have no effect on quality. If it does, check the settings.

To summarize, it is so simple to overcome Acrobat copy protection that you wonder why they even bother with it.

I am looking at the Acrobat creation screen here and I find . . . It is possible to restrict a file even from printing, or to specify "low resolution printing only." Surely that would defeat the whole purpose of Acrobat.

By the way, when you scan to images, rather than scanning directly to Acrobat, always set the resolution to at least 300 dpi, grayscale. Never use black and white bitmap. Color is seldom needed and may actually degrade OCR quality. Scanning at resolutions higher than 300 dpi does not improve OCR quality, except with very small fine print, and even then it does not help much.
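
To give a feel for what these settings cost in pixels and disk space, here is a rough sketch in Python. It assumes a US letter page (8.5 x 11 inches), which is my example and not a requirement, and it computes raw uncompressed sizes only; real TIFF, PNG or JPEG files will be smaller. The function name is hypothetical.

```python
def scan_size(width_in: float, height_in: float, dpi: int, bits_per_pixel: int):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes; row padding used by real formats is ignored.
    return width_px, height_px, (total_bits + 7) // 8


if __name__ == "__main__":
    # Letter-size page at 300 dpi, in the three common scan modes.
    for label, bits in [("black and white", 1), ("grayscale", 8), ("color", 24)]:
        w, h, size = scan_size(8.5, 11, 300, bits)
        print(f"{label:>15}: {w} x {h} px, about {size / 1e6:.1f} MB uncompressed")
```

Color takes three times the space of grayscale at the same resolution, and doubling the resolution quadruples the pixel count, which is one more reason not to go past 300 dpi grayscale unless the print is very small.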

- Jed
