Summary:

Scanned PDF documents with an OCR text layer do not render the same in
pdfbox as in other PDF viewers.

Detail:

I am a pdfbox and pdf newbie working with a large set of .pdf files
from scanned documents. The documents are basically pages from books
of photos with captions.

The scanner software is running OCR on the captions and storing the
text in a layer behind the scanned pages.

When I view these .pdf files with OS X Preview or Win32 Acrobat Reader
I only see the scanned image.

When I render these .pdf files with pdfbox PDFReader or PDFToImage, the
text layer is rendered on top of the page image. Not surprisingly, in
most cases the text is staggered.

This looks like a bug to me.

For my application, I think I can work around this using
ExtractImages, ExtractTextByArea, etc.

I know nothing about the PDF format, so I am wondering:

Q: Is there a tag in the PDF format that indicates that the OCR text
layer should not be rendered?
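
To make the question concrete, here is a minimal sketch of the kind of
check I have in mind. It assumes the tag in question is the text
rendering mode operator `Tr` with mode 3 (invisible text, neither filled
nor stroked); whether the scanner software actually writes the OCR layer
that way is an assumption I have not verified. The class and method
names are hypothetical, and the input is assumed to be an already
decoded page content stream, not the raw .pdf file.

```java
class InvisibleTextCheck {

    // Returns true if the decoded content stream sets text rendering
    // mode 3 through the Tr operator, e.g. "3 Tr".
    // Simplification: tokens are split on whitespace only, so a literal
    // string such as "(3 Tr)" inside the stream would be misread.
    static boolean setsInvisibleText(String contentStream) {
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr") && tokens[i - 1].equals("3")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical content stream fragments, written by hand.
        String ocrLayer = "BT 3 Tr /F1 12 Tf 72 700 Td (Caption text) Tj ET";
        String visible = "BT 0 Tr /F1 12 Tf 72 700 Td (Caption text) Tj ET";
        System.out.println(setsInvisibleText(ocrLayer));
        System.out.println(setsInvisibleText(visible));
    }
}
```

If the OCR layer in my files does turn out to be marked this way, a
renderer that ignores the mode would draw the text exactly as I am
seeing with pdfbox.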


Michael
