Problem with extracted JPEG images with RGB colorspace (from a PDF)

Joe Ye Mon, 07 Dec 2015 04:55:10 -0800

Hi,


We've been using PDFBox to extract images from PDF files and recently
upgraded to PDFBox version 2.0.0-RC2. I noticed that the class
PDXObjectImage has been renamed/rewritten and that the method
PDXObjectImage.write2OutputStream, which we used to write images to disk,
no longer exists.



Therefore, I've been trying to use the new class PDImageXObject and to
follow your example org.apache.pdfbox.tools.ExtractImages#write2file in
order to extract images from a PDF and write them to disk. There appears to
be a code path (IOUtils.copy etc.) for the RGB or Gray colorspace that just
copies the unmodified JPEG stream. However, I have a couple of JPEG images
with RGB colorspace in a PDF, and when I used this code to extract them and
write them to disk, none of the image viewers I tried could open them,
suggesting that the images may be damaged…
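
One quick way to narrow this down might be to check whether the copied bytes
are framed like a JPEG file at all. The following is only a minimal sketch
using the Java standard library, not PDFBox code; the class name
JpegMarkerCheck and its methods are hypothetical names chosen for
illustration. It tests for the JPEG start-of-image marker (FF D8) at the
beginning and the end-of-image marker (FF D9) at the end of the data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox.
final class JpegMarkerCheck {

    /** True if the data begins with the JPEG start-of-image marker FF D8. */
    static boolean startsWithSoi(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8;
    }

    /**
     * True if the data ends with the JPEG end-of-image marker FF D9.
     * Some valid files carry trailing bytes after EOI, so a false result
     * here is a hint, not proof of damage.
     */
    static boolean endsWithEoi(byte[] data) {
        int n = data.length;
        return n >= 2
                && (data[n - 2] & 0xFF) == 0xFF
                && (data[n - 1] & 0xFF) == 0xD9;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JpegMarkerCheck <extracted-file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("starts with SOI (FF D8): " + startsWithSoi(data));
        System.out.println("ends with EOI (FF D9):   " + endsWithEoi(data));
    }
}
```

If the first check fails, the written bytes are probably not a plain JPEG
stream (for example, they might still be encoded by another stream filter),
which would explain why viewers reject the file.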



If I change the code to call ImageIOUtil.writeImage instead, then the
extracted images can be viewed without problems. However, I don't know the
implications of doing so, as the code suggests that the JPEG will be
converted.
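
To illustrate what "converted" would mean in practice, here is a rough sketch
using only javax.imageio from the Java standard library, on the assumption
that the conversion amounts to decoding the JPEG to pixels and encoding those
pixels again. It is not the PDFBox implementation; the class
JpegReencodeSketch and its methods are hypothetical. Because JPEG is lossy,
the re-encoded file is a new compression of the decoded pixels rather than a
copy of the original stream.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of PDFBox.
final class JpegReencodeSketch {

    /** Decodes JPEG bytes to pixels, then encodes those pixels as a new JPEG. */
    static byte[] reencode(byte[] jpegBytes) {
        try {
            BufferedImage pixels = ImageIO.read(new ByteArrayInputStream(jpegBytes));
            if (pixels == null) {
                throw new IOException("no ImageIO reader could decode the data");
            }
            return encode(pixels);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small in-memory JPEG so the sketch runs without an input file. */
    static byte[] sampleJpeg() {
        BufferedImage img = new BufferedImage(16, 8, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                img.setRGB(x, y, ((x * 16) << 16) | ((y * 32) << 8));
            }
        }
        try {
            return encode(img);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns {width, height} of the decoded image. */
    static int[] dimensions(byte[] jpegBytes) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(jpegBytes));
            return new int[] {img.getWidth(), img.getHeight()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static byte[] encode(BufferedImage img) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(img, "jpg", out)) {
            throw new IOException("no JPEG writer available");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = sampleJpeg();
        byte[] converted = reencode(original);
        int[] size = dimensions(converted);
        System.out.println("original bytes:  " + original.length);
        System.out.println("converted bytes: " + converted.length);
        System.out.println("converted size:  " + size[0] + "x" + size[1]);
    }
}
```

The image dimensions survive the round trip, but the pixel data has gone
through a second lossy compression step, so some quality loss is to be
expected compared with copying the original stream.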



Could you please suggest why IOUtils.copy did not work properly for RGB or
Gray images, and what the recommended/correct way to process them is?


Kind regards,

Joe
