[Oops, replied directly to Damien last time so adding back in the list
so everyone can make derisive comments about my code :-), and maybe
suggest a better approach.]
There may be a better way to do it, but I get a page as a stream, then
iterate over the stream. Here's some code I ripped out of some other
stuff I have:
    // getPage() is zero-based, so index 1 is the second page
    PDPage page = doc.getPage(1);
    PDFStreamParser parser = new PDFStreamParser(page);
    List<COSBase> operands = new ArrayList<COSBase>();
    Object token;
    while ((token = parser.parseNextToken()) != null)
    {
        // Operands come before their operator, so collect them until one shows up
        if (token instanceof COSBase)
        {
            operands.add((COSBase) token);
            continue;
        }
        if (!(token instanceof Operator))
            throw new IllegalArgumentException("Unknown token " + token);
        String opName = ((Operator) token).getName();
        if (opName.equals("Do")) // Draw object
            System.out.println("Invoke XObject <" +
                ((COSName) operands.get(0)).getName() + ">");
        // Start fresh for the next operator
        operands.clear();
    }
If I drop that in some code to open a file and parse it as a PDF
document, when run on your document it outputs:
Invoke XObject <Im41>
Invoke XObject <Im41>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im43>
Invoke XObject <Im45>
So it draws the first image twice, the second image 3 times, and the
third image once. If all you need is a count, just use a Map of some
sort to count the occurrences. If you actually need to know where
they're drawn, that's harder, and you'll basically have to parse the rest
of the operators and track things like graphics states and transformation
matrices. For example, the few operators before the first Do operator to
draw Im41 are:
q - Save graphics state
gs - Set graphics state to <GSa>
cm - Concat
    [Scale_X: 0.333333, Shear_X: 0.000000, 0
     Shear_Y: 0.000000, Scale_Y: 0.333333, 0
     Offset_X: 145.000000, Offset_Y: 843.000000, 1]
    to transformation matrix
cs - Set non-stroking color space to color space <CSp>
scn - Set non-stroking color to <0, 0, 0>
gs - Set graphics state to <GSa>
cm - Concat
    [Scale_X: 12.000000, Shear_X: 0.000000, 0
     Shear_Y: 0.000000, Scale_Y: -15.000000, 0
     Offset_X: 0.000000, Offset_Y: 15.000000, 1]
    to transformation matrix
Do - Invoke XObject <Im41>
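To give an idea of what tracking the matrix involves, here is a rough
sketch that uses only java.awt.geom (no PDFBox) and concatenates the two
cm matrices listed above in the order they appear. The class and method
names (CtmSketch, ctmBeforeDo) are made up for illustration, and starting
from the identity matrix is an assumption, so the numbers are only
relative to whatever transformation is already in effect. An image is
painted into the unit square, so mapping (0,0) and (1,1) through the
result gives two opposite corners of where it lands.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class CtmSketch {
    // Hypothetical helper: applies the two "cm" operands from the listing,
    // in order, on top of whatever matrix was in effect before them.
    static AffineTransform ctmBeforeDo(AffineTransform start) {
        AffineTransform ctm = new AffineTransform(start);
        // The six cm operands (a b c d e f) map directly onto this
        // constructor's arguments (m00, m10, m01, m11, m02, m12).
        // concatenate() applies the new matrix first and the existing one
        // second, which is how cm combines with the current matrix.
        ctm.concatenate(new AffineTransform(0.333333, 0, 0, 0.333333, 145, 843));
        ctm.concatenate(new AffineTransform(12, 0, 0, -15, 0, 15));
        return ctm;
    }

    public static void main(String[] args) {
        // Assumption: start from the identity, ignoring any outer transforms
        AffineTransform ctm = ctmBeforeDo(new AffineTransform());
        Point2D corner00 = ctm.transform(new Point2D.Double(0, 0), null);
        Point2D corner11 = ctm.transform(new Point2D.Double(1, 1), null);
        System.out.println("(0,0) -> " + corner00);
        System.out.println("(1,1) -> " + corner11);
    }
}
```

With those two matrices the corners come out at roughly (145, 848) and
(149, 843), i.e. an area about 4 units wide and 5 units high, flipped
vertically by the negative Scale_Y.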
And that's inside other graphics state transformations.
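For the simpler case where a count is enough, something along these
lines would do. It's only a sketch: XObjectTally and tally are names I
made up, and it takes a plain list of names so it runs without PDFBox.
In the loop above you would do the same merge() call on the Do operand
instead of printing it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class XObjectTally {
    // Hypothetical helper: counts how often each XObject name was invoked,
    // keeping the names in the order they were first seen.
    static Map<String, Integer> tally(List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            // Adds 1 to the existing count, or starts at 1 for a new name
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The names printed by the Do loop for the document in question
        List<String> seen = Arrays.asList(
            "Im41", "Im41", "Im43", "Im43", "Im43", "Im45");
        System.out.println(tally(seen)); // prints {Im41=2, Im43=3, Im45=1}
    }
}
```

Keep in mind that Do is also used to paint form XObjects, not only
images, so you'd want to check what each name refers to before treating
the total as an image count.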
As I said, this may not be the best way to do this, but it works, so
that's one advantage. :-) It was also written against an older version
of PDFBox so there may be things in newer versions that would help.
Anyways, it should give you a start.
Brian
On 11/13/19 10:51 AM, Damien Levasseur wrote:
Thank you for your quick answer. Here is the document; I need the
yellow cards on page 2.
How do you suggest iterating over the document? A loop over the
resources only gave me one instance of the image.
This is how I use it:
    getImagesFromResources(document.getPage(1).getResources());

    void getImagesFromResources(PDResources pdResources) throws IOException {
        String dstPath = CybeleConfig.getPath() + "/local/uploaded/tmp/";
        int imgIndex = 1;
        for (COSName name : pdResources.getXObjectNames()) {
            PDXObject xObject = pdResources.getXObject(name);
            if (xObject instanceof PDFormXObject) {
                getImagesFromResources(((PDFormXObject) xObject).getResources());
            } else if (xObject instanceof PDImageXObject) {
                PDImageXObject image = (PDImageXObject) xObject;
                String filename = dstPath + "extracted-image-" + imgIndex + ".png";
                ImageIO.write(image.getImage(), "png", new File(filename));
                imgIndex++;
            }
        }
    }
Thank you for your help
On 13/11/2019 at 18:35, Brian L. Matthews wrote:
On 11/13/19 12:33 AM, Damien Levasseur wrote:
Hello all,
When I extract images (version 2.0.17, using PDResources, COSName,
PDXObject, PDImageXObject), I correctly get all the distinct images, but
each image is extracted only once. In the PDF file I'm trying to
work on, there is one image repeated 3 times, and I wanted to get that.
How can I get a list of resources instead of a dictionary? Or get the
number of occurrences or the positions of a repeated image?
Thanks
This is partially a guess, but I'm assuming whatever wrote the PDF
did that as a size optimization, and there isn't any way to know how
many times an image is referenced without iterating over the
document. As far as I know, there are no "back-references" associated
with a resource pointing to everywhere it's used.
Brian
--
Regards,
*Damien LEVASSEUR*
Software engineer