Re: Handling Graphics from Scanned PDF

Andreas Lehmkuehler Sun, 09 Dec 2012 07:38:04 -0800

Hi,

On 09.12.2012 15:36, Eliot Kimber wrote:

Yes, I believe this is a masked image. I did a close reading of the PDF 1.7
spec and I think that's what I have.


The sample I'm testing with can be found here:

https://dl.dropbox.com/u/20078596/pdfScannedPageWithMaskedImage.pdf

Here are the dictionary entries for the three XObjects in the document:

9 0 obj
<</BitsPerComponent
8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length
19570/Name/image_bg0/Subtype/Image/Type/XObject/Width 850>>

10 0 obj
<</BitsPerComponent
8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length
8521/Mask 11 0 R/Name/image_fg0/Subtype/Image/Type/XObject/Width 850>>

11 0 obj
<</BitsPerComponent 1/DecodeParms<</Columns 2550/K
-1>>/Filter/CCITTFaxDecode/Height 3300/ImageMask true/Length
10266/Name/image_sel/Subtype/Image/Type/XObject/Width 2550>>

So if I understand what this is saying, object 11 is the image mask applied
to object 10.

Correct. FYI: have you tried the PDFDebugger that comes with PDFBox? It's a tool for inspecting the content of a PDF in a hierarchical tree view.
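
For illustration only: the link between objects 10 and 11 is the "/Mask 11 0 R" entry in object 10's dictionary. Below is a minimal sketch, assuming the dictionary has already been dumped as text as in the listing above. The class and method names are made up for this example and are not part of PDFBox; real code would read the entry through the PDFBox COS classes rather than with a regular expression. The pattern only handles the indirect-reference form and ignores the array form that /Mask takes for color key masking.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
final class MaskRefSketch {

    // Matches an indirect reference following /Mask, e.g. "/Mask 11 0 R".
    // "/ImageMask" and "/SMask" do not match because the slash must come
    // directly before "Mask".
    private static final Pattern MASK_REF =
            Pattern.compile("/Mask\\s+(\\d+)\\s+(\\d+)\\s+R");

    /** Returns the object number the /Mask entry refers to, or -1 if there is none. */
    static int maskObjectNumber(String dictionaryText) {
        Matcher m = MASK_REF.matcher(dictionaryText);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```

Applied to the three dictionaries above, only the text of object 10 yields an object number; objects 9 and 11 have no /Mask entry.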

In my test code I made a little StreamEngine that simply reports on all
XObjects and writes any PDXObjectImage objects to the file system. This is
the output I get on this test document:

processOperator(): objectName="image_bg0"
processOperator(): object type="PDJpeg"
processOperator(): image class=PDJpeg
processOperator(): imageWidth="850"
processOperator(): imageHeight="1100"
Creating file
/var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_bg0_0.jpg
processOperator(): objectName="image_fg0"
processOperator(): object type="PDJpeg"
processOperator(): image class=PDJpeg
processOperator(): imageWidth="850"
processOperator(): imageHeight="1100"
Creating file
/var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_fg0_1.jpg

Where the objectName="image_bg0" line will be emitted for any XObject of any
type.

So it looks like the ImageMask object is not being reported as an XObject.

That's correct too. The mask is not a "standalone" XObjectImage; it is part of the image_fg0 image and represents that image's alpha channel. image_bg0 is painted first, then image_fg0 is painted on top of it. Because of the alpha channel, most of image_fg0 is treated as transparent and doesn't overwrite anything.
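
To make the painting order concrete, here is a rough sketch in plain Java (java.awt.image only, no PDFBox). It assumes the mask has already been decoded into a boolean grid in which true means "paint the foreground here"; decoding the CCITT stream is not shown, and which sample value means "paint" depends on the mask's Decode array (as far as I read the spec, sample value 0 paints by default). The class name, method name, and grid representation are invented for this example. Because the mask (2550x3300) has three times the resolution of the two images (850x1100), the output is produced at the mask's size and the images are sampled with a simple nearest-neighbour mapping.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not a PDFBox class.
final class MaskedImageSketch {

    /**
     * Combines background and foreground at the resolution of the mask.
     * paint[y][x] == true  -> take the pixel from the foreground image
     * paint[y][x] == false -> keep the pixel from the background image
     */
    static BufferedImage compose(BufferedImage bg, BufferedImage fg, boolean[][] paint) {
        int height = paint.length;
        int width = paint[0].length;
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                BufferedImage src = paint[y][x] ? fg : bg;
                // All three images cover the same area on the page, so map the
                // mask coordinate to the (smaller) source image proportionally.
                int sx = x * src.getWidth() / width;
                int sy = y * src.getHeight() / height;
                out.setRGB(x, y, src.getRGB(sx, sy));
            }
        }
        return out;
    }
}
```

The result is a single page image at mask resolution, which is probably the better input for something like QR code detection than either extracted JPEG on its own.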

I don't have a clue why the scanned picture is split into two parts. At least the most recent trunk version of PDFBox is able to handle this after fixing and improving the mask handling; see [1] for further details.


Maybe you should just use the combined image generated by PDFToImage.

Thanks,

Eliot

On 12/9/12 6:58 AM, "Andreas Lehmkuehler" <[email protected]> wrote:

Hi,

On 06.12.2012 18:48, Eliot Kimber wrote:

I am trying to find QR codes on PDFs that are scanned page images. My code
works fine for scans produced by my OfficeJet and for page images produced
out of Acrobat but scans produced by my client's eCopy ShareScan device
(according to the PDF metadata) are not usable.

Looking into the PDF data stream, each page is represented by two images, a
"bg" image that is what I would expect for the page image, but very faint
grey, and a "fg" image that reflects the page content but with lots of grey
and ghosting.

Sounds like masked images, but that's just a guess.

The PDF renderer must be combining these two images in some way to provide
the clear image I see in Acrobat.

Is there something I can find in the PDF data stream that will tell me how
these images are combined and, if so, can anyone point me in the right
direction for processing these images? I am pretty new to Java image
processing so I'm not sure where to look or what to look for.

The images themselves are reported by PDFBox as PDJpeg objects.

I can provide a sample PDF page if it's needed.

Due to some restrictions you can't attach it to a posting. Please post a
download link to a public location or create an issue in JIRA [1].


Thanks,

Eliot



BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX


BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-1445
