Michel Jullian wrote:
For PDFs with low-fidelity underlying OCR, in my experience saving
them as pure image files and then re-OCR-ing them with the latest
version of Acrobat often improves the OCR quality considerably.
Yes. This is because OCR accuracy has improved. It may also be
because many text-under-image Acrobat files were generated with the
Hewlett-Packard scanning program. This is a good program overall but
companies that specialize in OCR do a better job.
Also, did you know you can batch-OCR any number of PDFs at a time?
With what program?
I think it would be only a matter of a few days of automated computer
work to make your whole collection of many thousands of CF and
peripheral papers searchable.
They are mostly searchable now. But there would be no point, since I
cannot upload them.
A thought regarding copyright issues: rather than seeking uploading
permission for every single paper, would there be a big risk in
uploading everything and then removing only those the copyright
holders ask you to remove?
There would be a big risk because when a publisher notices one paper
they tend to look around and find the others, and then they ask me
to remove all of them. This has happened to me on a couple of occasions.
For those, how about functioning like a real library, where the
library card holders can download copyrighted material?
I am trying to arrange something along these lines. I need
institutional support. I am negotiating with several institutions,
slo-o-o-w-l-y.
. . .
For people unfamiliar with Acrobat, let me explain that a
text-under-image Acrobat file is a facsimile of the original document
with text apparently aligned under the image. You can view the text
alone by two methods:
1. Use PDF Professional or some other specialist program and select
"save file as text" (or "save as Microsoft Word").
2. For a quick check, put a block around the text, copy it, and then
paste it into a text editor. If nothing appears on your screen, try
pasting again into a graphics program such as IrfanView. If an image
appears, you are looking at an image-only Acrobat file. If nothing
appears, you are looking at a copy-protected Acrobat file.
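
The quick check in method 2 boils down to a small decision rule. Here
is a rough sketch of it in Python. The function and its two arguments
are hypothetical: they just record what you saw when you pasted by
hand. Nothing here opens or inspects an Acrobat file.

```python
from enum import Enum


class PdfKind(Enum):
    HAS_TEXT = "text can be copied out (e.g. text under image)"
    IMAGE_ONLY = "image only"
    COPY_PROTECTED = "copy protected"


def classify_by_paste(text_editor_got_text: bool,
                      graphics_program_got_image: bool) -> PdfKind:
    """Mirror the manual copy-and-paste check.

    Both arguments are observations made by hand (assumed inputs):
    whether the paste into a text editor produced text, and whether
    the paste into a graphics program produced an image.
    """
    if text_editor_got_text:
        # Text came through, so the file carries copyable text.
        return PdfKind.HAS_TEXT
    if graphics_program_got_image:
        # No text, but an image pasted: the file is a bare scan.
        return PdfKind.IMAGE_ONLY
    # Neither paste produced anything.
    return PdfKind.COPY_PROTECTED


if __name__ == "__main__":
    print(classify_by_paste(False, True).value)
```

The order of the tests matters: the graphics-program result is only
consulted when the text editor shows nothing, as in the manual check.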
If you have a copy-protected Acrobat file, print it on paper and then
scan it in and OCR it yourself. A variation on this method is to
install a print server program that outputs disk images, Acrobat
files and various other formats. In other words, you print from the
Acrobat file into another Acrobat file. All of the paid Acrobat
utilities from Adobe, Nuance and other vendors have such drivers.
Printing to paper and then scanning in degrades image quality but it
has no effect on text and no measurable effect on OCR accuracy.
Actually, sometimes the OCR works better after you do this. Printing
from Acrobat to an Acrobat print server should have no effect on
quality. If it does, check the settings.
To summarize, it is so simple to overcome Acrobat copy protection that
you wonder why they even bother with it.
I am looking at the Acrobat creation screen here and I find . . . It
is possible to restrict a file even from printing, or to specify "low
resolution printing only." Surely that would defeat the whole
purpose of Acrobat.
By the way, when you scan to images, rather than scanning directly to
Acrobat, always set the resolution to at least 300 dpi, grayscale.
Never use black and white bitmap. Color is seldom needed and may
actually degrade OCR quality. Scanning at resolutions higher than 300
dpi does not improve OCR quality, except with very small fine print,
and even then it does not help much.
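
To put rough numbers on those settings, here is a small arithmetic
sketch in Python. The page size (8.5 x 11 inch letter paper) and the
bytes per pixel (1 for 8-bit grayscale, 3 for 24-bit color) are my
assumptions, and the sizes are for uncompressed images, so real files
will be smaller.

```python
def scan_size(width_in: float, height_in: float, dpi: int,
              bytes_per_pixel: int) -> tuple[int, int, int]:
    """Return (width_px, height_px, uncompressed_bytes) for one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # Assumed page: US letter, 8.5 x 11 inches.
    for label, dpi, bpp in [
        ("300 dpi grayscale", 300, 1),
        ("300 dpi color", 300, 3),
        ("600 dpi grayscale", 600, 1),
    ]:
        w, h, size = scan_size(8.5, 11, dpi, bpp)
        print(f"{label}: {w} x {h} pixels, {size / 1e6:.1f} MB raw")
```

Doubling the resolution quadruples the raw data per page, and color
triples it, which is a lot to pay for little or no gain in OCR.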
- Jed