Hi,

I'm writing a program that extracts images from a PDF. I was inspired by the algorithm presented in the ExtractImages class:

http://kickjava.com/src/org/pdfbox/ExtractImages.java.htm

I tried it on an 8 MB ebook which turned out to contain on the order of 50K small images, all of them PNGs. My profiler revealed that 95% of the time was spent in the image.write2OutputStream method, the vast majority of it in Deflater, the class that compresses the image data when the PNG file is written.

My idea was:
 - get the basic raw, undecoded bytes of the image
 - compute a hash of them
 - if the hash hasn't been seen before, decode the full image; otherwise go on

My reasoning was that the same image must occur many times, so decoding only the unique ones would make it all faster.
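
To make the idea concrete, here is a minimal sketch of the "have I seen these bytes before" step. It assumes the raw bytes are already available as a byte[] (which is exactly the part I'm stuck on). RawImageDeduplicator, isNew and uniqueCount are names made up for illustration, not part of PDFBox, and SHA-256 is just one possible choice of hash.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a PDFBox class: remembers the hashes of raw image
// bytes so that each distinct image is decoded and written only once.
class RawImageDeduplicator {
    // Hex-encoded SHA-256 digests of every byte array seen so far.
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time these exact bytes are offered, false afterwards.
    boolean isNew(byte[] rawBytes) {
        return seen.add(toHex(sha256(rawBytes)));
    }

    // Number of distinct byte arrays offered so far.
    int uniqueCount() {
        return seen.size();
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RawImageDeduplicator dedup = new RawImageDeduplicator();
        // Stand-ins for the raw streams of three images, two of them identical.
        byte[][] rawImages = { {1, 2, 3}, {1, 2, 3}, {4, 5, 6} };
        for (byte[] raw : rawImages) {
            if (dedup.isNew(raw)) {
                // This is where the expensive decode and write would happen.
                System.out.println("decode and write image");
            } else {
                System.out.println("duplicate, skipping");
            }
        }
    }
}
```

Storing only the digest keeps memory use small even with 50K images, at the cost of an astronomically unlikely hash collision causing an image to be skipped.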

Now my question: how do I get the basic, raw, undecoded bytes from an instance of PDXObjectImage?

I tried

image.getPDStream().createInputStream()
image.getPDStream().getStream().getUnfilteredStream()

Both work on PDFs with embedded PNG files, but if I have embedded JPGs I get a warning:

Warning: DCTFilter.decode is not implemented yet, skipping this stream.

and the returned stream is empty.

What should I do?

Thanks in advance

Antoni Mylka
