Re: Looking for a way to iterate over images in a PDF

Tilman Hausherr Fri, 07 Apr 2017 14:33:18 -0700

On 07.04.2017 at 22:59, David Patterson wrote:

Tilman,


The ExtractImages sample code is a 1.8 artifact (I believe). It has a lot
of errors when compiled with 2.0.5 libraries.


Please try this one:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup

Tilman


1) two imports are no longer in the 2.0.5 library
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

2) missing methods or methods with different signatures:
PDDocument.loadNonSeq(    ** method not defined
PDDocument.load(    ** load now requires a File, not a String
document.openProtection(
document.getDocumentCatalog().getAllPages()    ** getAllPages is missing from the PDDocumentCatalog
resources.getXObjects()    ** where resources is a PDResources object
if (xobject instanceof PDXObjectImage)    ** PDXObjectImage is not defined
else if (xobject instanceof PDXObjectForm)    ** same with PDXObjectForm

Maybe a new ExtractImages2 program needs to be developed for the PDFBox 2
era.
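
For the items in the list above, here is a minimal sketch of how the same traversal might look against 2.0.x. It assumes PDDocument.load(File), PDDocument.getPages(), PDResources.getXObjectNames() with getXObject(name), and the renamed classes PDImageXObject and PDFormXObject. The class name ListImages and the helper listImages are made up for illustration. This is not the ExtractImages tool itself, and it does not guard against a form that references itself.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Hypothetical class name; written against the PDFBox 2.0.x API.
final class ListImages {

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ListImages <file.pdf>");
            return;
        }
        // 2.0: load(File) takes the place of load(String) and loadNonSeq(...)
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            int count = 0;
            // 2.0: getPages() takes the place of getDocumentCatalog().getAllPages()
            for (PDPage page : document.getPages()) {
                count += listImages(page.getResources());
            }
            System.out.println(count + " image(s) found");
        }
    }

    // Prints each image XObject reachable from the given resources
    // and returns how many were found.
    static int listImages(PDResources resources) throws IOException {
        if (resources == null) {
            return 0; // page or form without a resource dictionary
        }
        int count = 0;
        // 2.0: getXObjectNames() + getXObject(name) take the place of getXObjects()
        for (COSName name : resources.getXObjectNames()) {
            PDXObject xobject = resources.getXObject(name);
            if (xobject instanceof PDImageXObject) {        // was PDXObjectImage
                PDImageXObject image = (PDImageXObject) xobject;
                System.out.println(name.getName() + ": "
                        + image.getWidth() + "x" + image.getHeight()
                        + ", suffix " + image.getSuffix());
                count++;
            } else if (xobject instanceof PDFormXObject) {  // was PDXObjectForm
                // A form has its own resources, which may hold more images.
                count += listImages(((PDFormXObject) xobject).getResources());
            }
        }
        return count;
    }
}
```

For a password-protected file, 2.0 also has PDDocument.load(File, String), which covers what the openProtection call used to do.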

Dave Patterson




On Thu, Apr 6, 2017 at 5:02 PM, Tilman Hausherr <[email protected]>
wrote:

On 06.04.2017 at 21:22, David Patterson wrote:

I've got some PDFs to try to read. Many of them have images in them. I'd
like to be able to iterate over the images and determine their encoding
(png vs. jpeg vs. ?) and size.

I've found a sample that lets me iterate over the PDXObject entities, but
I'm missing a key piece to determine the size and format of the objects.

a) Is a PDXObject always an image, or could it be something else?

Yes, it could be a form. That's why all examples (e.g. ExtractImages.java)
always check the type and then cast to the image xobject type. That one
will give you the size and the filters.

Tilman
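
On the "png vs. jpeg" part of the question: the filters mentioned above are PDF stream filter names, not file formats. DCTDecode holds JPEG data and JPXDecode holds JPEG 2000, while FlateDecode, LZWDecode and RunLengthDecode only compress raw samples losslessly, so there is no PNG file stored inside the PDF. A rough sketch of that mapping in plain Java, with no PDFBox dependency, follows. The class FilterFormats, the method describe and the wording of the labels are hypothetical, and the filter names are passed in as plain strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns a stream's filter names into a rough format label.
final class FilterFormats {

    static String describe(List<String> filterNames) {
        if (filterNames == null || filterNames.isEmpty()) {
            return "raw samples, uncompressed";
        }
        // Image-specific filters first: these identify a real image codec.
        if (filterNames.contains("DCTDecode")) {
            return "JPEG";
        }
        if (filterNames.contains("JPXDecode")) {
            return "JPEG 2000";
        }
        if (filterNames.contains("CCITTFaxDecode")) {
            return "CCITT fax";
        }
        if (filterNames.contains("JBIG2Decode")) {
            return "JBIG2";
        }
        // General-purpose filters: the payload is raw samples, not a PNG file.
        if (filterNames.contains("FlateDecode")
                || filterNames.contains("LZWDecode")
                || filterNames.contains("RunLengthDecode")) {
            return "raw samples, lossless compression";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe(Arrays.asList("DCTDecode")));
        System.out.println(describe(Arrays.asList("ASCII85Decode", "FlateDecode")));
    }
}
```

An image saved from a Flate-compressed stream would normally be re-encoded (for example as PNG) by whatever tool does the extraction, which is why the answer to "what format is it" depends on the filter and not on a file extension.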

Here is the code I've got so far.

for ( PDPage aPage : pdfDocument.getPages() ) {
    PDResources pdResources = aPage.getResources();
    for ( COSName cosObject : pdResources.getXObjectNames() ) {
        PDXObject xObj = pdResources.getXObject( cosObject );
        System.out.println( "got an image maybe" );
    }
}

This is where I've gotten stumped. I've looked at lots of lists of
COS-whatever things, but it has not led me to "the answer."

Thanks for any guidance you can provide.

Dave Patterson

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



