[Oops, replied directly to Damien last time so adding back in the list so everyone can make derisive comments about my code :-), and maybe suggest a better approach.]

There may be a better way to do it, but I get a page as a stream, then iterate over the stream. Here's some code I ripped out of some other stuff I have:

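// Imports below use the PDFBox 2.0.x package names.
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDPage;

// getPage() is zero-based, so index 1 is the second page of the document.
PDPage page = doc.getPage(1);
PDFStreamParser parser = new PDFStreamParser(page);
List<COSBase> operands = new ArrayList<COSBase>();
Object token;

while ((token = parser.parseNextToken()) != null)
{
    // Operands come before their operator, so collect them until one shows up.
    if (token instanceof COSBase)
    {
        operands.add((COSBase) token);

        continue;
    }

    if (!(token instanceof Operator))
        throw new IllegalArgumentException("Unknown token " + token);

    String opName = ((Operator) token).getName();

    if (opName.equals("Do")) // Draw object; its single operand is the XObject name
        System.out.println("Invoke XObject <" + ((COSName) operands.get(0)).getName() + ">");

    // The operator consumed its operands, so start fresh for the next one.
    operands.clear();
}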

If I drop that in some code to open a file and parse it as a PDF document, when run on your document it outputs:

Invoke XObject <Im41>
Invoke XObject <Im41>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im45>

So it draws the first image twice, the second image three times, and the third image once. If all you need is a count, just use a Map of some sort to count the occurrences. If you actually need to know where they're drawn, that's harder, and you'll basically have to parse the rest of the operators and track things like graphics states and transformation matrices. For example, the few operators before the first Do operator to draw Im41 are:

          q  - Save graphics state
            gs - Set graphics state to <GSa>
            cm - Concat
              [Scale_X:    0.333333, Shear_X:    0.000000, 0
               Shear_Y:    0.000000, Scale_Y:    0.333333, 0
               Offset_X: 145.000000, Offset_Y: 843.000000, 1]
              to transformation matrix
            cs - Set non-stroking color space to color space <CSp>
            scn - Set non-stroking color to <0, 0, 0>
            gs - Set graphics state to <GSa>
            cm - Concat
              [Scale_X:   12.000000, Shear_X:    0.000000, 0
               Shear_Y:    0.000000, Scale_Y:  -15.000000, 0
               Offset_X:   0.000000, Offset_Y:  15.000000, 1]
              to transformation matrix
            Do - Invoke XObject <Im41>

And that's inside other graphics state transformations.
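
To make the matrix tracking a bit more concrete, here is a rough sketch (plain Java, no PDFBox) that concatenates the two cm matrices from the dump above and maps the image's unit square through the result. The class and method names are made up for illustration, and it assumes the matrix in effect at the q is the identity, which is not true for your page since, as noted, this all sits inside other transformations. The numbers it prints are therefore relative to whatever the enclosing state is.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of PDFBox.
class CtmSketch {
    // A PDF matrix [a b 0; c d 0; e f 1] is stored as {a, b, c, d, e, f}.
    // The cm operator premultiplies: newCTM = m x oldCTM.
    static double[] concat(double[] m, double[] ctm) {
        return new double[] {
            m[0] * ctm[0] + m[1] * ctm[2],
            m[0] * ctm[1] + m[1] * ctm[3],
            m[2] * ctm[0] + m[3] * ctm[2],
            m[2] * ctm[1] + m[3] * ctm[3],
            m[4] * ctm[0] + m[5] * ctm[2] + ctm[4],
            m[4] * ctm[1] + m[5] * ctm[3] + ctm[5]
        };
    }

    // Map a point (x, y) through the matrix.
    static double[] apply(double[] ctm, double x, double y) {
        return new double[] {
            ctm[0] * x + ctm[2] * y + ctm[4],
            ctm[1] * x + ctm[3] * y + ctm[5]
        };
    }

    public static void main(String[] args) {
        Deque<double[]> saved = new ArrayDeque<>();
        // Assumption: identity here; the real page has outer transforms.
        double[] ctm = {1, 0, 0, 1, 0, 0};

        saved.push(ctm.clone());                                          // q
        ctm = concat(new double[] {0.333333, 0, 0, 0.333333, 145, 843}, ctm); // first cm
        ctm = concat(new double[] {12, 0, 0, -15, 0, 15}, ctm);               // second cm

        // Do: an image is always painted into the unit square (0,0)-(1,1).
        double[] p0 = apply(ctm, 0, 0);
        double[] p1 = apply(ctm, 1, 1);
        System.out.printf("corners: (%.3f, %.3f) and (%.3f, %.3f)%n",
                          p0[0], p0[1], p1[0], p1[1]);

        ctm = saved.pop();                                                // a later Q restores the saved matrix
    }
}
```

Under that identity assumption the image lands in a box roughly 4 units wide and 5 units high with its lower-left corner near (145, 843). The negative Scale_Y in the second matrix just means the image is flipped vertically inside that box. In real code the q/Q/cm cases would be extra branches next to the Do check in the token loop.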

As I said, this may not be the best way to do this, but it works, so that's one advantage. :-) It was also written against an older version of PDFBox so there may be things in newer versions that would help. Anyways, it should give you a start.
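
For the counting case, a minimal sketch of the Map idea follows. It is plain Java with a hard-coded list standing in for the names the loop prints, and the class name is made up. In the real loop you would call counts.merge(name, 1, Integer::sum) in place of the println.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
class XObjectCounter {
    // Tally how often each XObject name is drawn, keeping first-seen order.
    static Map<String, Integer> count(List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The names from the output above, in the order they were drawn.
        List<String> drawn = Arrays.asList("Im41", "Im41", "Im43", "Im43", "Im43", "Im45");
        System.out.println(count(drawn)); // prints {Im41=2, Im43=3, Im45=1}
    }
}
```

Note this only counts Do operators in the page's own content stream. If an image is drawn from inside a form XObject, you would have to parse that form's stream the same way.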

Brian

On 11/13/19 10:51 AM, Damien Levasseur wrote:

Thank you for your quick answer. Here is the document; I need the yellow cards on page 2.

How do you suggest iterating over the document? A loop over the resources only gave me one instance of each image.

This is how I use it:

getImagesFromResources(document.getPage(1).getResources());

void getImagesFromResources(PDResources pdResources) throws IOException {
    String dstPath = CybeleConfig.getPath() + "/local/uploaded/tmp/";
    int imgIndex = 1;
    for (COSName name : pdResources.getXObjectNames()) {
        PDXObject xObject = pdResources.getXObject(name);

        if (xObject instanceof PDFormXObject) {
            getImagesFromResources(((PDFormXObject) xObject).getResources());

        } else if (xObject instanceof PDImageXObject) {
            PDImageXObject image = (PDImageXObject)xObject;

            String filename = dstPath + "extracted-image-" + imgIndex + ".png";
            ImageIO.write(image.getImage(), "png", new File(filename));
            imgIndex++;
        }
    }
}

Thank you for your help

On 13/11/2019 at 18:35, Brian L. Matthews wrote:
On 11/13/19 12:33 AM, Damien Levasseur wrote:
Hello all,

When I extract images (version 2.0.17, using PDResources, COSName, PDXObject, PDImageXObject), I correctly get all the distinct images, but an image that is used several times is extracted only once. In the PDF file I'm trying to work on, there is one image repeated 3 times, and I want to get each of those.

How can I get a list of resources instead of a Dictionary? Or get the number of occurrences or the position of a repeated image?

Thanks


This is partially a guess, but I'm assuming whatever wrote the PDF did that as a size optimization, and there isn't any way to know how many times an image is referenced without iterating over the document. As far as I know, there are no "back-references" associated with a resource pointing to everywhere it's used.

Brian

--
Regards,

*Damien LEVASSEUR*
Software engineer
Ingénieur Développeur

------------------------------------------------------------------------
*EdenWeb*
55bis Rue de Rennes
35510 Cesson-Sévigné    Phone: +33 2 99 83 03 05
E-mail: supp...@edenweb.fr <mailto:supp...@edenweb.fr>
Website: www.edenweb.fr <http://www.edenweb.fr>



