Image extraction

Dan Fulea Wed, 30 Sep 2020 22:35:55 -0700

Hello,
I am using PDFBox to handle PDFs, and it does its job well most of the
time.
However, I encounter strange behaviour when extracting images embedded in
some PDFs.
I start with the following code (I think it is taken from one of your
tutorials):
for (PDPage page : list) {
    PDResources pdResources = page.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        PDXObject o = pdResources.getXObject(c);
        if (o instanceof PDImageXObject) {
            imageCount++;
            WRITEIMAGE(o,....);//WRITING IMAGE TO DISK GOES HERE
        }
    }
}
This is clean, logical and seems natural, but it poses a problem:
we always obtain DOUBLED images for each real image in the PDF. One image is
the good one; the other is some kind of "negative" of the good one. Moreover,
the image order (the index of the images as they appear in the PDF from top
to bottom) is scrambled.


The second approach follows this tutorial:
https://www.tutorialkart.com/pdfbox/how-to-get-location-and-size-of-images-in-pdf/

The image-writing routine goes inside the processOperator method, just
before the following line:
System.out.println("\nImage [" + objectName.getName() + "]");
With this approach, we get the correct image count (no duplicates), in the
correct order. This is exactly what I want.
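
The difference between the two approaches can be sketched with a toy model in plain Java. It uses no PDFBox classes, and every name in it (ImageOrderSketch, the "Im1m" entry, the sample data) is hypothetical. It only illustrates two general facts: a resources dictionary has no drawing order and can list XObjects that the page never draws, whereas a content stream invokes each drawn XObject with a "Do" operator in sequence. Whether this is the cause of the doubled images described above is an assumption, not something confirmed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these are PDFBox classes; all names are hypothetical.
class ImageOrderSketch {

    // Hypothetical page resources: XObject name -> subtype.
    // "Im1m" stands for an assumed extra image that the page never draws.
    static Map<String, String> sampleXObjects() {
        Map<String, String> xObjects = new HashMap<>();
        xObjects.put("Im1", "Image");
        xObjects.put("Im1m", "Image");
        xObjects.put("Im2", "Image");
        xObjects.put("Fm1", "Form");
        return xObjects;
    }

    // Hypothetical content stream as {operator, operand} pairs, in drawing order.
    static List<String[]> sampleOperators() {
        return Arrays.asList(
            new String[] {"Do", "Im2"},
            new String[] {"Do", "Fm1"},
            new String[] {"Do", "Im1"});
    }

    // Approach 1: enumerate every image entry in the resources dictionary.
    // The result includes undrawn entries, in whatever order the map yields.
    static List<String> fromResources(Map<String, String> xObjects) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, String> entry : xObjects.entrySet()) {
            if ("Image".equals(entry.getValue())) {
                found.add(entry.getKey());
            }
        }
        return found;
    }

    // Approach 2: walk the operators and record an image only when a "Do"
    // operator actually draws it, so the result follows the drawing order.
    static List<String> fromContentStream(List<String[]> operators,
                                          Map<String, String> xObjects) {
        List<String> found = new ArrayList<>();
        for (String[] op : operators) {
            if ("Do".equals(op[0]) && "Image".equals(xObjects.get(op[1]))) {
                found.add(op[1]);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> xObjects = sampleXObjects();
        System.out.println("From resources:      " + fromResources(xObjects));
        System.out.println("From content stream: "
                + fromContentStream(sampleOperators(), xObjects));
    }
}
```

In this model the first method returns three names in an unspecified order, including the undrawn one, while the second returns only the two drawn images in the order they are drawn.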

Although the two approaches look similar, why does the first one behave
so strangely?
Which way do you recommend for extracting the images?
I am uncomfortable not fully understanding these issues.

Please help me understand better, thank you,
Dan Fulea
