At least with the 1.7.1 release, the combined image generated by PDFToImage is not usable: it doesn't reflect the application of the mask (at least, I think that's what's happening).
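
For reference, the masking discussed in the quoted thread below can be sketched in plain Java. This is only an illustration of the idea (paint the background, then paint the foreground wherever the mask allows it), not PDFBox's implementation. The class MaskCompositeSketch, the method composite, and the boolean[][] mask representation are hypothetical, and the convention that true means "masked out" is an assumption. Because the mask in the sample (2550x3300) is larger than the two images (850x1100), the sketch renders at mask resolution and samples both images by nearest neighbour.

```java
import java.awt.image.BufferedImage;

/** Sketch only: composite a masked foreground image over a background image. */
final class MaskCompositeSketch {

    /**
     * @param bg   background image, conceptually painted first
     * @param fg   foreground image, painted only where the mask allows it
     * @param mask mask[y][x]; true = masked out (background stays visible),
     *             false = foreground is painted (assumed convention)
     * @return a new image with the dimensions of the mask
     */
    static BufferedImage composite(BufferedImage bg, BufferedImage fg, boolean[][] mask) {
        int height = mask.length;
        int width = mask[0].length;
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Pick the source image for this output pixel.
                BufferedImage src = mask[y][x] ? bg : fg;
                // Map mask coordinates to source coordinates (nearest neighbour),
                // since the mask and the images may differ in resolution.
                int sx = x * src.getWidth() / width;
                int sy = y * src.getHeight() / height;
                out.setRGB(x, y, src.getRGB(sx, sy));
            }
        }
        return out;
    }
}
```

Calling composite(bg, fg, mask) with the decoded image_bg0, image_fg0, and image_sel data would give one page image at mask resolution. Decoding the CCITT mask stream into the boolean array is left out of this sketch.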
I'm not sure how to build the pdfbox-app jar, which is what my application uses. I have successfully built pdfbox.jar from the latest source, but I didn't see anything in build.xml for producing pdfbox-app.jar.

Thanks,

Eliot

On 12/9/12 9:37 AM, "Andreas Lehmkuehler" wrote:

> Hi,
>
> On 09.12.2012 15:36, Eliot Kimber wrote:
>> Yes, I believe this is a masked image. I did a close reading of the PDF 1.7
>> spec and I think that's what I have.
>>
>> The sample I'm testing with can be found here:
>>
>> https://dl.dropbox.com/u/20078596/pdfScannedPageWithMaskedImage.pdf
>>
>> Here are the dictionary entries for the three XObjects in the document:
>>
>> 9 0 obj
>> <</BitsPerComponent 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length 19570/Name/image_bg0/Subtype/Image/Type/XObject/Width 850>>
>>
>> 10 0 obj
>> <</BitsPerComponent 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length 8521/Mask 11 0 R/Name/image_fg0/Subtype/Image/Type/XObject/Width 850>>
>>
>> 11 0 obj
>> <</BitsPerComponent 1/DecodeParms<</Columns 2550/K -1>>/Filter/CCITTFaxDecode/Height 3300/ImageMask true/Length 10266/Name/image_sel/Subtype/Image/Type/XObject/Width 2550>>
>>
>> So if I understand what this is saying, object 11 is the image mask applied
>> to object 10.
> Correct. FYI: did you ever try the PDFDebugger which comes with PDFBox? It's a
> tool to inspect the content of a PDF using a hierarchical tree view.
>
>> In my test code I made a little StreamEngine that simply reports on all
>> XObjects and writes any PDXObjectImage objects to the file system.
>> This is the output I get on this test document:
>>
>> processOperator(): objectName="image_bg0"
>> processOperator(): object type="PDJpeg"
>> processOperator(): image class=PDJpeg
>> processOperator(): imageWidth="850"
>> processOperator(): imageHeight="1100"
>> Creating file /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_bg0_0.jpg
>> processOperator(): objectName="image_fg0"
>> processOperator(): object type="PDJpeg"
>> processOperator(): image class=PDJpeg
>> processOperator(): imageWidth="850"
>> processOperator(): imageHeight="1100"
>> Creating file /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_fg0_1.jpg
>>
>> The objectName="..." line is emitted for any XObject of any type.
>>
>> So it looks like the ImageMask object is not being reported as an XObject.
> That's correct too. The mask is not a "standalone" XObjectImage; it's part of
> the image_fg0 image. The mask represents the alpha channel of the image.
> image_bg0 is painted first. image_fg0 is painted on top of image_bg0; due to
> the alpha channel, most of image_fg0 is treated as transparent and doesn't
> overwrite anything.
>
> I don't have a clue why the scanned picture is split into two parts. At least
> the most recent trunk version of PDFBox is able to handle this after the mask
> handling was fixed and improved; see [1] for further details.
>
> Maybe you should just use the combined image generated by PDFToImage.
>
>> Thanks,
>>
>> Eliot
>>
>> On 12/9/12 6:58 AM, "Andreas Lehmkuehler" wrote:
>>
>>> Hi,
>>>
>>> On 06.12.2012 18:48, Eliot Kimber wrote:
>>>> I am trying to find QR codes on PDFs that are scanned page images. My code
>>>> works fine for scans produced by my OfficeJet and for page images produced
>>>> out of Acrobat, but scans produced by my client's eCopy ShareScan device
>>>> (according to the PDF metadata) are not usable.
>>>>
>>>> Looking into the PDF data stream, each page is represented by two images: a
>>>> "bg" image that is what I would expect for the page image, but very faint
>>>> grey, and a "fg" image that reflects the page content but with lots of grey
>>>> and ghosting.
>>> Sounds like masked images, but that's just a guess.
>>>
>>>> The PDF renderer must be combining these two images in some way to provide
>>>> the clear image I see in Acrobat.
>>>>
>>>> Is there something I can find in the PDF data stream that will tell me how
>>>> these images are combined and, if so, can anyone point me in the right
>>>> direction for processing these images? I am pretty new to Java image
>>>> processing so I'm not sure where to look or what to look for.
>>>>
>>>> The images themselves are reported by PDFBox as PDJpeg objects.
>>>>
>>>> I can provide a sample PDF page if it's needed.
>>> Due to some restrictions you can't attach it to a posting. Please post a
>>> download link referring to a public location or create an issue on jira [1]
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Eliot
>>>>
>>>
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> [1] https://issues.apache.org/jira/browse/PDFBOX
>>
>
> BR
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-1445
>

--
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/

