Re: Looking for a way to iterate over images in a PDF

David Patterson Sat, 08 Apr 2017 05:48:09 -0700

Tilman,

Thanks. That works perfectly. Now I need to go through it in detail to
figure out how it extracts the image and metadata.


Dave Patterson

On Fri, Apr 7, 2017 at 5:32 PM, Tilman Hausherr <[email protected]>
wrote:

> On 07.04.2017 at 22:59, David Patterson wrote:
>
>> Tilman,
>>
>> The ExtractImages sample code is a 1.8 artifact (I believe). It has a lot
>> of errors when compiled with 2.0.5 libraries.
>>
>
> Please try this one:
> https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup
>
> Tilman
>
>
>
>> 1) two imports are no longer in the 2.0.5 library:
>>    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
>>    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
>>
>> 2) missing methods or methods with different signatures:
>>    PDDocument.loadNonSeq(                        ** method not defined
>>    PDDocument.load(                              ** load now requires a File, not a String
>>    document.openProtection(
>>    document.getDocumentCatalog().getAllPages()   ** getAllPages is missing from PDDocumentCatalog
>>    resources.getXObjects()                       ** where resources is a PDResources object
>>    if (xobject instanceof PDXObjectImage)        ** PDXObjectImage is not defined
>>    else if (xobject instanceof PDXObjectForm)    ** same with PDXObjectForm
>>
>> Maybe a new ExtractImages2 program needs to be developed for the PDFBox 2
>> era.
>>
>> Dave Patterson
>>
>>
>>
>>
>> On Thu, Apr 6, 2017 at 5:02 PM, Tilman Hausherr <[email protected]>
>> wrote:
>>
>> On 06.04.2017 at 21:22, David Patterson wrote:
>>>
>>>> I've got some PDFs to try to read. Many of them have images in them. I'd
>>>> like to be able to iterate over the images and determine their encoding
>>>> (png vs. jpeg vs. ?) and size.
>>>>
>>>> I've found a sample that lets me iterate over the PDXObject entities, but
>>>> I'm missing a key piece to determine the size and format of the objects.
>>>>
>>>> a) Is a PDXObject always an image, or could it be something else?
>>>>
>>> Yes, it could be a form. That's why all examples (e.g. ExtractImages.java)
>>> always check the type and then cast to the image xobject type. That one
>>> will give the size and the filters.
>>>
>>> Tilman
>>>
>>>
>>> Here is the code I've got so far.
>>>>
>>>> for ( PDPage aPage : pdfDocument.getPages() ) {
>>>>     PDResources pdResources = aPage.getResources();
>>>>     for ( COSName cosObject : pdResources.getXObjectNames() ) {
>>>>         PDXObject xObj = pdResources.getXObject( cosObject );
>>>>         System.out.println( "got an image maybe" );
>>>>     }
>>>> }
>>>>
>>>> This is where I've gotten stumped. I've looked at lots of lists of
>>>> COS-whatever things, but it has not led me to "the answer."
>>>>
>>>> Thanks for any guidance you can provide.
>>>>
>>>> Dave Patterson
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>
>
>
>
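
On the "png vs. jpeg vs. ?" part of the original question: the thread says the
image xobject "will give the size and the filters", and the stream filters are
what indicate how the image data is encoded. Below is a minimal sketch of that
mapping, assuming the filter names have already been read as plain strings so
that it runs with only the JDK. The class ImageFilterFormats, the method
formatFor and the returned labels are hypothetical names made up for this
illustration and are not part of PDFBox; the filter names themselves are the
standard ones from the PDF specification. Note that a PDF does not store PNG
files as such: Flate-compressed image data is raw samples, which tools usually
export as PNG.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a PDFBox class.
class ImageFilterFormats {

    /**
     * Maps the filter names found on an image stream to a rough
     * description of how the image data is encoded.
     */
    static String formatFor(List<String> filters) {
        if (filters == null || filters.isEmpty()) {
            return "raw";                 // no filter: uncompressed samples
        }
        if (filters.contains("DCTDecode")) {
            return "jpeg";                // baseline JPEG data
        }
        if (filters.contains("JPXDecode")) {
            return "jpeg2000";
        }
        if (filters.contains("CCITTFaxDecode")) {
            return "ccitt-fax";           // bilevel fax encoding
        }
        if (filters.contains("JBIG2Decode")) {
            return "jbig2";
        }
        if (filters.contains("FlateDecode")
                || filters.contains("LZWDecode")
                || filters.contains("RunLengthDecode")) {
            return "lossless-raster";     // typically exported as PNG
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(formatFor(Arrays.asList("DCTDecode")));
        System.out.println(formatFor(Arrays.asList("FlateDecode")));
        System.out.println(formatFor(Arrays.asList("ASCII85Decode", "DCTDecode")));
    }
}
```

The lossy and special-purpose filters are checked first because a stream can
carry several filters at once (for example an ASCII wrapper around JPEG data),
and the image-specific one is what decides the format. In PDFBox 2.x the image
class is PDImageXObject, which as far as I know offers getWidth(), getHeight()
and getSuffix() for this purpose, so a helper like the one above would only be
needed if different labels are wanted.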
