Summary: Scanned PDF documents with an OCR text layer do not render the same in PDFBox as in other PDF viewers.
Detail: I am new to both PDFBox and PDF, and I am working with a large set of .pdf files produced from scanned documents. The documents are basically pages from books of photos with captions. The scanner software runs OCR on the captions and stores the text in a layer behind the scanned page image.

When I view these .pdf files with OS X Preview or Win32 Acrobat Reader, I see only the scanned image. When I render them with PDFBox's PDFReader or PDFToImage, the text layer is rendered on top of the page image. Not surprisingly, in most cases the text is staggered. This looks like a bug to me.

For my application, I think I can work around this using ExtractImages, ExtractTextByArea, etc. I know nothing about the PDF format, though, so my question is:

Q: Is there a tag in the PDF format that indicates that the OCR text layer should not be rendered?

Michael

