Re: Extract images and get occurrence of same image

Brian L. Matthews Wed, 13 Nov 2019 12:03:01 -0800

[Oops, replied directly to Damien last time, so adding the list back in so everyone can make derisive comments about my code :-), and maybe suggest a better approach.]

There may be a better way to do it, but I get a page as a stream, then iterate over the stream. Here's some code I ripped out of some other stuff I have:


PDPage page = doc.getPage(1); // page index is zero-based, so this is page 2
PDFStreamParser parser = new PDFStreamParser(page);
List<COSBase> operands = new ArrayList<COSBase>();
Object token;

while ((token = parser.parseNextToken()) != null)
{
    // Operands come first; collect them until the operator shows up
    if (token instanceof COSBase)
    {
        operands.add((COSBase) token);

        continue;
    }

    if (!(token instanceof Operator))
        throw new IllegalArgumentException("Unknown token " + token);

    String opName = ((Operator) token).getName();

    if (opName.equals("Do")) // Draw object
        System.out.println("Invoke XObject <" + ((COSName) operands.get(0)).getName() + ">");

    // The operands belonged to this operator, so start fresh for the next one
    operands.clear();
}

If I drop that into some code that opens a file and parses it as a PDF document, then run it on your document, it outputs:


Invoke XObject <Im41>
Invoke XObject <Im41>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im45>
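
When only a tally is needed, the names passed to Do can be collected and counted in a Map. The following is a minimal sketch, not part of the original code: the class name XObjectUseCounter and the method countUses are hypothetical, and the names are fed in as plain strings so the example runs without PDFBox. In the parser loop, the same merge call would sit where the println is.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts how often each XObject name was invoked.
class XObjectUseCounter {

    // Returns name -> number of uses, in the order names were first seen.
    static Map<String, Integer> countUses(List<String> invokedNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : invokedNames) {
            // Insert 1 for a new name, otherwise add 1 to the existing count
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The names from the output above, in drawing order
        List<String> invoked = Arrays.asList(
                "Im41", "Im41", "Im43", "Im43", "Im43", "Im45");
        System.out.println(countUses(invoked)); // prints {Im41=2, Im43=3, Im45=1}
    }
}
```

A LinkedHashMap is used here only so the counts print in drawing order; any Map works if order does not matter.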

So it draws the first image twice, the second image 3 times, and the third image once. If all you need is a count, just use a Map of some sort to count the occurrences. If you actually need to know where they're drawn, that's harder, and you'll basically have to parse the rest of the operators and track things like graphics states and transformation matrices. For example, the few operators before the first Do operator to draw Im41 are:


            q  - Save graphics state
            gs - Set graphics state to <GSa>
            cm - Concat
              [Scale_X:    0.333333, Shear_X:    0.000000, 0
               Shear_Y:    0.000000, Scale_Y:    0.333333, 0
               Offset_X: 145.000000, Offset_Y: 843.000000, 1]
              to transformation matrix
            cs - Set non-stroking color space to color space <CSp>
            scn - Set non-stroking color to <0, 0, 0>
            gs - Set graphics state to <GSa>
            cm - Concat
              [Scale_X:   12.000000, Shear_X:    0.000000, 0
               Shear_Y:    0.000000, Scale_Y:  -15.000000, 0
               Offset_X:   0.000000, Offset_Y:  15.000000, 1]
              to transformation matrix
            Do - Invoke XObject <Im41>

And that's inside other graphics state transformations.
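
To illustrate what tracking the transformation matrix involves, here is a rough sketch that replays only the q and cm operators from the listing above using java.awt.geom.AffineTransform. The class CtmTracker and its methods are hypothetical names, not PDFBox API, and the sketch assumes an image XObject is drawn into the unit square (0,0)-(1,1) of the current user space. It ignores the enclosing transformations, so the numbers are relative to whatever space this fragment sits in.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: tracks the current transformation matrix (CTM)
// across q, Q and cm operators.
class CtmTracker {
    private final Deque<AffineTransform> saved = new ArrayDeque<>();
    private AffineTransform ctm = new AffineTransform(); // starts as identity

    // q: push a copy of the current matrix
    void save() {
        saved.push(new AffineTransform(ctm));
    }

    // Q: go back to the most recently saved matrix
    void restore() {
        ctm = saved.pop();
    }

    // cm: the new matrix [a b c d e f] is applied to coordinates first,
    // then the previous CTM, which is what concatenate() does.
    void concat(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    // Maps a point from the current user space to the outermost tracked space.
    Point2D map(double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        CtmTracker tracker = new CtmTracker();
        tracker.save();                                        // q
        tracker.concat(0.333333, 0, 0, 0.333333, 145, 843);    // first cm
        tracker.concat(12, 0, 0, -15, 0, 15);                  // second cm
        // Opposite corners of the unit square the image is drawn into
        System.out.println(tracker.map(0, 0));
        System.out.println(tracker.map(1, 1));
    }
}
```

With the two matrices from the listing, the corners land at roughly (145, 848) and (149, 843), so within this fragment Im41 covers about 4 by 5 units. The negative Scale_Y in the second matrix flips the image vertically, which is why the (0,0) corner ends up above the (1,1) corner.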

As I said, this may not be the best way to do this, but it works, so that's one advantage. :-) It was also written against an older version of PDFBox, so there may be things in newer versions that would help. Anyways, it should give you a start.


Brian

On 11/13/19 10:51 AM, Damien Levasseur wrote:

Thank you for your quick answer. Here is the document; I need the yellow cards on page 2.
How do you suggest iterating over the document? A loop over the resources only gave me one instance of each image.
This is how I use it:

getImagesFromResources(document.getPage(1).getResources());

void getImagesFromResources(PDResources pdResources) throws IOException {
    String dstPath = CybeleConfig.getPath() + "/local/uploaded/tmp/";
    int imgIndex = 1;
    for (COSName name : pdResources.getXObjectNames()) {
        PDXObject xObject = pdResources.getXObject(name);

        if (xObject instanceof PDFormXObject) {
            getImagesFromResources(((PDFormXObject)xObject).getResources());
        } else if (xObject instanceof PDImageXObject) {
            PDImageXObject image = (PDImageXObject)xObject;
            String filename = dstPath + "extracted-image-" + imgIndex + ".png";
            ImageIO.write(image.getImage(), "png", new File(filename));
            imgIndex++;
        }
    }
}

Thank you for your help

On 13/11/2019 at 18:35, Brian L. Matthews wrote:
On 11/13/19 12:33 AM, Damien Levasseur wrote:
Hello all,
When I extract images (version 2.0.17, using PDResources, COSName, PDXObject, PDImageXObject), I correctly get all distinct images, but the same image is extracted only once. In the PDF file I'm trying to work on, there is one image repeated 3 times, and I wanted to get that.
How can I get a list of resources instead of a Dictionary? Or get the number of occurrences or the position of a repeated image?
Thanks
This is partially a guess, but I'm assuming whatever wrote the PDF did that as a size optimization, and there isn't any way to know how many times an image is referenced without iterating over the document. As far as I know, there are no "back-references" associated with a resource pointing to everywhere it's used.
Brian
--
Regards,

*Damien LEVASSEUR*
Software engineer
Ingénieur Développeur

------------------------------------------------------------------------
*EdenWeb*
55bis Rue de Rennes
35510 Cesson-Sévigné    Phone: +33 2 99 83 03 05
E-mail: supp...@edenweb.fr <mailto:supp...@edenweb.fr>
Website: www.edenweb.fr <http://www.edenweb.fr>
