Re: Handling Graphics from Scanned PDF

Eliot Kimber Sun, 09 Dec 2012 07:45:54 -0800

At least using the 1.7.1 release, the combined image is not usable--it
doesn't reflect the application of the mask (at least I think that's what's
happening).
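
For reference, here is a minimal sketch of how I understand the combination
is supposed to work, using only java.awt and not PDFBox. It assumes the three
images (bg, fg, and the 1-bit mask) have already been decoded into
BufferedImage objects. The class and method names are hypothetical. Which
mask value means "paint" depends on how the mask was decoded, so the polarity
is a parameter rather than something I'm asserting. Because the mask
(2550x3300) is higher resolution than the images (850x1100) and all of them
cover the same page area, both images are scaled up to the mask size first.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical helper: paints fg over bg wherever the mask allows it. */
final class MaskedImageCompositor {

    /** Scales src to w x h pixels using bilinear interpolation. */
    public static BufferedImage scale(BufferedImage src, int w, int h) {
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }

    /**
     * Returns an image at the mask's resolution: bg everywhere, overwritten
     * by fg at each pixel the mask selects.
     *
     * @param darkMeansPaint assumption to verify against your decoder: true
     *        if dark mask pixels mark where fg is painted, false if light
     *        mask pixels do
     */
    public static BufferedImage composite(BufferedImage bg, BufferedImage fg,
                                          BufferedImage mask,
                                          boolean darkMeansPaint) {
        int w = mask.getWidth();
        int h = mask.getHeight();
        BufferedImage out = scale(bg, w, h);      // background is painted first
        BufferedImage fgScaled = scale(fg, w, h); // foreground at mask resolution
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Blue channel is enough to tell black from white in a 1-bit mask.
                boolean dark = (mask.getRGB(x, y) & 0xFF) < 128;
                if (dark == darkMeansPaint) {
                    out.setRGB(x, y, fgScaled.getRGB(x, y));
                }
            }
        }
        return out;
    }
}
```

If the result looks like a negative of what Acrobat shows, flipping
darkMeansPaint is the first thing I would try.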


I'm not sure how to build the pdfbox-app jar, which is what my application
uses. I have successfully built the pdfbox.jar from the latest source but I
didn't see anything in the build.xml for producing the pdfbox-app.jar.

Thanks,

Eliot



On 12/9/12 9:37 AM, "Andreas Lehmkuehler" <[email protected]> wrote:

> Hi,
> 
> On 09.12.2012 15:36, Eliot Kimber wrote:
>> Yes, I believe this is a masked image. I did a close reading of the PDF 1.7
>> spec and I think that's what I have.
>> 
>> The sample I'm testing with can be found here:
>> 
>> https://dl.dropbox.com/u/20078596/pdfScannedPageWithMaskedImage.pdf
>> 
>> Here are the dictionary entries for the three XObjects in the document:
>> 
>> 9 0 obj
>> <</BitsPerComponent
>> 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length
>> 19570/Name/image_bg0/Subtype/Image/Type/XObject/Width 850>>
>> 
>> 10 0 obj
>> <</BitsPerComponent
>> 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length
>> 8521/Mask 11 0 R/Name/image_fg0/Subtype/Image/Type/XObject/Width 850>>
>> 
>> 11 0 obj
>> <</BitsPerComponent 1/DecodeParms<</Columns 2550/K
>> -1>>/Filter/CCITTFaxDecode/Height 3300/ImageMask true/Length
>> 10266/Name/image_sel/Subtype/Image/Type/XObject/Width 2550>>
>> 
>> So if I understand what this is saying, object 11 is the image mask applied
>> to object 10.
> Correct. FYI: did you ever try the PDFDebugger which comes with PDFBox? It's a
> tool to inspect the content of a PDF using a hierarchical tree view.
> 
>> In my test code I made a little StreamEngine that simply reports on all
>> XObjects and writes any PDXObjectImage objects to the file system. This is
>> the output I get on this test document:
>> 
>> processOperator(): objectName="image_bg0"
>> processOperator(): object type="PDJpeg"
>> processOperator(): image class=PDJpeg
>> processOperator(): imageWidth="850"
>> processOperator(): imageHeight="1100"
>> Creating file
>> /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_bg0_0.jpg
>> processOperator(): objectName="image_fg0"
>> processOperator(): object type="PDJpeg"
>> processOperator(): image class=PDJpeg
>> processOperator(): imageWidth="850"
>> processOperator(): imageHeight="1100"
>> Creating file
>> /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_fg0_1.jpg
>> 
>> Where the objectName="image_bg0" line will be emitted for any XObject of any
>> type.
>> 
>> So it looks like the ImageMask object is not being reported as an XObject.
> That's correct too. The mask is not a "standalone" XObjectImage, it's part of
> the fg_0 image. The mask represents the alpha channel of the image.
> bg_0 is painted first. fg_0 is painted on top of bg_0; due to the alpha
> channel, most of fg_0 is treated as transparent and doesn't overwrite
> anything.
> 
> I don't have a clue why the scanned picture is split into two parts. At
> least the most recent trunk version of PDFBox is able to handle this after
> improving the mask handling, see [1] for further details.
> 
> Maybe you should just use the combined image generated by PDFToImage.
> 
>> Thanks,
>> 
>> Eliot
>> 
>> On 12/9/12 6:58 AM, "Andreas Lehmkuehler" <[email protected]> wrote:
>> 
>>> Hi,
>>> 
>>> On 06.12.2012 18:48, Eliot Kimber wrote:
>>>> I am trying to find QR codes on PDFs that are scanned page images. My code
>>>> works fine for scans produced by my OfficeJet and for page images produced
>>>> out of Acrobat but scans produced by my client's eCopy ShareScan device
>>>> (according to the PDF metadata) are not usable.
>>>> 
>>>> Looking into the PDF data stream, each page is represented by two images, a
>>>> "bg" image that is what I would expect for the page image, but very faint
>>>> grey, and a "fg" image that reflects the page content but with lots of grey
>>>> and ghosting.
>>> Sounds like masked images, but that's just a guess.
>>> 
>>>> The PDF renderer must be combining these two images in some way to provide
>>>> the clear image I see in Acrobat.
>>>> 
>>>> Is there something I can find in the PDF data stream that will tell me how
>>>> these images are combined and, if so, can anyone point me in the right
>>>> direction for processing these images? I am pretty new to Java image
>>>> processing so I'm not sure where to look or what to look for.
>>>> 
>>>> The images themselves are reported by PDFBox as PDJpeg objects.
>>>> 
>>>> I can provide a sample PDF page if it's needed.
>>> Due to some restrictions you can't attach it to a posting. Please post a
>>> download link referring to a public location or create an issue on JIRA [1].
>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Eliot
>>>> 
>>> 
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>>> [1] https://issues.apache.org/jira/browse/PDFBOX
>> 
> 
> BR
> Andreas Lehmkühler
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX-1445
> 

-- 
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/
