On 2010-05-07 12:01, Antoni Mylka wrote:
Hello,

I'm experimenting with getting all images from a PDF using the approach
presented in

http://kickjava.com/src/org/pdfbox/ExtractImages.java.htm

I'm getting a lot of duplicates; it seems that the same physical images
are reused many times in a PDF file. I have two questions.

1. Is there some image pool in a PDF, so that I can iterate over a
single data structure?
2. If there is not (or, most probably, the PDF allows all kinds of
structures), then how can I get the byte offset from a PDXObjectImage
instance, so that I can store the offsets of already visited images?

Answering my own question:

I've come up with the following loop:

COSDocument cosDoc = pdDocument.getDocument();
List list = cosDoc.getObjectsByType(COSName.XOBJECT);
for (Object obj : list) {
    COSObject cosOb = (COSObject)obj;
    COSBase baseObject = cosOb.getObject();
    if (baseObject != null && baseObject instanceof COSStream) {
        COSStream st = (COSStream)baseObject;
        String subtype = st.getNameAsString(COSName.SUBTYPE);
        if (subtype != null && subtype.equalsIgnoreCase("image")) {
            processSingleImage(st);
        }
    }
}

It seems to do the trick: I can iterate over all images in a document. Once I have a COSStream, I can get the undecoded and the decoded image bytes with the following methods:

// should return a byte array that uniquely identifies the image
// and with minimal overhead; getFilteredStream() returns the bytes as
// stored in the file, whereas getUnfilteredStream() would decode them
private byte [] getUnDecodedImageBytes(COSStream st) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    IOUtil.writeStream(st.getFilteredStream(), baos);
    return baos.toByteArray();
}
        
// should return a byte array that can be saved to a file with a proper
// extension and make it viewable
private byte [] getDecodedImageBytes(COSStream st) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PDXObjectImage ximage = (PDXObjectImage)PDXObject.createXObject(st);
    ximage.write2OutputStream(baos);
    return baos.toByteArray();
}
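
Since the goal is to skip images that were already visited, one possible way to use the byte array from the first method above is to hash it and remember the digests. The following is only a minimal sketch under assumptions: the class ImageDeduplicator and its method isNew are hypothetical names, not part of PDFBox, and it relies only on java.security.MessageDigest from the JDK.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not a PDFBox class): remembers a digest of every
// byte array it has been shown, so repeated image content can be skipped.
class ImageDeduplicator {
    private final Set<String> seen = new HashSet<String>();

    // Returns true the first time this exact byte content is passed in,
    // and false for every later call with identical content.
    boolean isNew(byte[] imageBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Hex form of the digest is used as the set key
            String key = new BigInteger(1, md.digest(imageBytes)).toString(16);
            return seen.add(key);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Inside processSingleImage one could then call isNew(getUnDecodedImageBytes(st)) and return early when it yields false. This compares content rather than file offsets, so two separately stored copies of the same image data would also count as duplicates.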

So, two questions:
1. Is the loop correct? Are all images wrapped in an XObject, and are they always available via the getObjectsByType method of the COSDocument class, or are there PDF files with a different structure that require a different approach?
2. Can the byte[] methods be made faster?

Antoni Mylka