Getting the raw, undecoded content of an Image

Antoni Mylka Wed, 28 Apr 2010 05:45:35 -0700

Hi,

I'm writing a program that extracts images from a PDF. I was inspired by the algorithm presented in the ExtractImages class:


http://kickjava.com/src/org/pdfbox/ExtractImages.java.htm

I tried it on an 8MB ebook which turned out to contain on the order of 50K small images. They were all PNGs. My profiler revealed that 95% of the time was spent in the image.write2OutputStream method, the vast majority of it in Deflater, the class that compresses the image data to produce a normal PNG file.


My idea was:
 - get the raw, undecoded bytes of the image
 - compute a hash of them
 - if the hash hasn't been seen before, decode the full image; otherwise go on

My reasoning was that the same image must occur many times, so decoding only the unique ones would make it all faster.
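
The idea above could be sketched roughly like this. It is only a minimal sketch and not PDFBox code: RawImageDeduplicator and isNew are hypothetical names, SHA-256 is an arbitrary choice of hash, and it assumes the raw bytes are already available as an InputStream, which is exactly the part I don't know how to get.

```java
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers digests of the raw image streams seen so far. */
class RawImageDeduplicator {
    // Hex-encoded SHA-256 digests of every stream passed to isNew().
    private final Set<String> seenDigests = new HashSet<>();

    /**
     * Reads the stream to the end and hashes its content.
     * Returns true only the first time a given content is seen,
     * so the caller can skip the expensive decode for repeats.
     */
    boolean isNew(InputStream rawBytes) throws IOException {
        MessageDigest sha;
        try {
            sha = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        int n;
        while ((n = rawBytes.read(buffer)) != -1) {
            sha.update(buffer, 0, n);
        }
        String hex = String.format("%064x", new BigInteger(1, sha.digest()));
        // Set.add returns false when the digest was already present.
        return seenDigests.add(hex);
    }

    /** Number of distinct contents seen so far. */
    int uniqueCount() {
        return seenDigests.size();
    }
}
```

In the extraction loop, the expensive image.write2OutputStream call would then run only when isNew(...) returns true for that image's raw stream.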

Now my question: how do I get the raw, undecoded bytes from an instance of PDXObjectImage?


I tried

image.getPDStream().createInputStream()
image.getPDStream().getStream().getUnfilteredStream()

Both work on PDFs with embedded PNG files, but if I have embedded JPGs I get a warning:


Warning: DCTFilter.decode is not implemented yet, skipping this stream.

and the returned stream is empty.

What to do?

Thanks in advance

Antoni Mylka
