Hello,

I am using PDFBox to handle PDFs, and it does its job well most of the time. However, I see strange behaviour when extracting images embedded in some PDFs.

I started with the following code (I think it is taken from one of your tutorials):

    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);
            if (o instanceof PDImageXObject) {
                imageCount++;
                WRITEIMAGE(o,....); // writing the image to disk goes here
            }
        }
    }

This is clean, logical, and seems natural, but it has a problem: for each real image in the PDF we always get two images. One is correct; the other is some kind of "negative" of the correct one. Moreover, the image order (the index of the images as they appear in the PDF from top to bottom) is scrambled.
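
To pin down what "negative" means here, the following is a minimal sketch in plain Java (no PDFBox calls) that checks whether one image is an exact per-pixel inverse of another. The class name `NegativeCheckSketch` and the method `isNegative` are hypothetical helpers introduced only for illustration. Whether the extra image really is an exact inverse is an assumption that this check would confirm or rule out; alpha is ignored.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox.
final class NegativeCheckSketch {

    /**
     * Returns true if both images have the same dimensions and every
     * red, green and blue channel value in b equals 255 minus the
     * corresponding value in a. The alpha channel is ignored.
     */
    static boolean isNegative(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;
        }
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int rgbA = a.getRGB(x, y) & 0xFFFFFF; // drop alpha
                int rgbB = b.getRGB(x, y) & 0xFFFFFF;
                // XOR with 0xFFFFFF inverts each 8-bit channel (255 - value).
                if ((rgbA ^ 0xFFFFFF) != rgbB) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

The two extracted files could be loaded with `javax.imageio.ImageIO.read(File)` and passed to `isNegative`. If it returns false, the second image is something other than a plain inversion of the first.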

The second approach follows this tutorial:

https://www.tutorialkart.com/pdfbox/how-to-get-location-and-size-of-images-in-pdf/

There, I write the image inside the processOperator method, just before the following line:

    System.out.println("\nImage [" + objectName.getName() + "]");

With this approach, I get the correct image count (no duplicates), and the images come out in the correct order. This is exactly what I want.

The two approaches look similar, so why does the first one behave so strangely? Which way do you recommend for extracting images? I am uncomfortable not fully understanding these issues.

Please help me understand this better.

Thank you,
Dan Fulea
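
P.S. To make my mental model of the second approach explicit, here is a toy sketch in plain Java with no PDFBox dependency. It assumes a drastically simplified content stream (whitespace-separated tokens, no strings, comments, or inline images), and the class name `DrawOrderSketch` is hypothetical. It collects XObject names in the order in which the `Do` operator paints them, which is roughly what an operator-driven approach sees. It is not how PDFBox implements `processOperator`, and it may or may not explain the duplicates I get with the first approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not part of PDFBox.
final class DrawOrderSketch {

    /**
     * Returns the names that appear as the operand of "Do",
     * in the order they occur in the (simplified) content stream.
     */
    static List<String> paintedNames(String contentStream) {
        List<String> names = new ArrayList<>();
        String previous = null;
        for (String token : contentStream.trim().split("\\s+")) {
            // "/Name Do" paints the XObject registered under Name.
            if (token.equals("Do") && previous != null && previous.startsWith("/")) {
                names.add(previous.substring(1));
            }
            previous = token;
        }
        return names;
    }
}
```

In this model, a name that is listed in the resources but never painted does not show up at all, and a name painted twice shows up twice. The first approach, by contrast, only enumerates the names registered in the resources, regardless of whether or in what order they are painted.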