Re: Help with removing images from a PDF

Nicholas Tiong Sun, 14 Oct 2012 18:57:34 -0700

Hi Andreas,

I've commented out the 'Do' line, but I still cannot get rid of the images.


So far my code only opens the document, loads each page's resources, and
saves the document again. See code below.

This seems to be insufficient. Do I need to parse the PDF stream somehow?

Regards,
Nicholas Tiong

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

public class ExtractImages {
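    public static void main(String[] argv) throws COSVisitorException,
            InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Encrypted files are decrypted first (tries the empty password).
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                // The resources are only looked up here; nothing is removed
                // or changed, so the saved file still contains every image.
                PDResources resources = page.findResources();
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Release the file handles held by the document.
            document.close();
        }
    }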
}
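
On the question of parsing the stream: images are painted by the `Do` operator in a page's content stream, so one possible approach is to filter the parsed tokens and drop each `Do` together with its name operand. The sketch below shows only that filtering step on a simplified stand-in (plain strings instead of PDFBox token objects). `DoFilterSketch`, `stripImageDraws`, and the `/Im0` and `/Fm0` names are hypothetical. With PDFBox 1.x the tokens would presumably come from `PDFStreamParser` and be written back with `ContentStreamWriter`, and the same filter would also have to be applied to the content streams of form XObjects, since images nested in them are not reached from the page stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DoFilterSketch {

    /**
     * Returns a copy of the token list with every "/Name Do" pair removed
     * when /Name is listed as an image XObject. All other tokens are kept,
     * including "Do" calls that invoke form XObjects.
     * Operator names are case-sensitive: "Do", not "do".
     */
    public static List<String> stripImageDraws(List<String> tokens,
                                               Map<String, String> xobjectSubtypes) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if ("Do".equals(token) && !kept.isEmpty()) {
                // The operand of Do is the token written just before it.
                String operand = kept.get(kept.size() - 1);
                if ("Image".equals(xobjectSubtypes.get(operand))) {
                    kept.remove(kept.size() - 1); // drop the operand
                    continue;                     // drop the operator
                }
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical page resources: one image and one form XObject.
        Map<String, String> subtypes = new HashMap<String, String>();
        subtypes.put("/Im0", "Image");
        subtypes.put("/Fm0", "Form");

        // Hypothetical content stream, already split into tokens.
        List<String> tokens = Arrays.asList(
                "q", "100", "0", "0", "100", "50", "50", "cm",
                "/Im0", "Do", "Q", "/Fm0", "Do");

        System.out.println(stripImageDraws(tokens, subtypes));
    }
}
```

Keeping the `q`/`cm`/`Q` tokens around the removed pair leaves the graphics state balanced, which is why only the operand and the operator are dropped.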



On 15/10/12 8:23 AM, "Andreas Lehmkuehler" <andr...@lehmi.de> wrote:

>Hi,
>
>On 14.10.2012 22:47, Nicholas Tiong wrote:
>> Hi Andreas,
>>
>> Thanks for your help, but I am not sure where to find this 'Do' line in
>> pagedrawer.properties. I can see a package in the pdfbox jar file called
>> org.apache.pdfbox.util.operator.pagedrawer, but I'm not sure where the
>> 'Do' line is. I'm guessing it's somewhere within the Invoke.class file,
>> but I am unable to find it.
>You have to look here:
>
>org/apache/pdfbox/resources/PageDrawer.properties
>
>
>> Also, after disabling this, what operators would I need to run on the PDF
>> file?
>Just comment out that operator and all images should disappear when using
>PDFBox.
>
>> Thanks for your assistance.
>>
>> Regards,
>> Nicholas Tiong
>>
>> On 15/10/12 4:37 AM, "Andreas Lehmkuehler" <andr...@lehmi.de> wrote:
>>
>>> Hi,
>>>
>>> On 04.10.2012 02:58, Nicholas Tiong wrote:
>>>> Hi,
>>>>
>>>> I'm new here and I've just discovered PDFBox. My experience with coding
>>>> is fairly basic.
>>>>
>>>> Based on some sample code I found here,
>>>>
>>>> 
>>>> http://stackoverflow.com/questions/6831194/how-can-i-remove-all-images-drawings-from-a-pdf-file-and-leave-text-only-in-java
>>> That code removes only those images that are referenced directly within
>>> the resources of a page/document. Images that are part of another
>>> XObject won't be removed.
>>>
>>>> It seems that it should work for my purpose, which is to remove all
>>>> images from a PDF whilst preserving formatting. Basically, I plan to
>>>> print a large document in black and white on a laser printer without
>>>> pictures, and then run it through a colour inkjet for the pictures.
>>>>
>>>> Could anyone help me figure out why the code in the link above does not
>>>> work? It creates the 'stripped' file and throws no exceptions, but all
>>>> the images are still in it.
>>>>
>>>> I've found another piece of PDFBox code that extracts images and saves
>>>> them to file. It works for every individual picture in the document, so
>>>> I am certain the PDF is formatted correctly with the pictures embedded
>>>> in it.
>>>>
>>>> Any help would be much appreciated.
>>> I guess it's easier to deactivate the "draw image" operator. Commenting
>>> out the "Do" line in PageDrawer.properties should do the trick.
>>>
>>>> Regards,
>>>> Nicholas Tiong
>>>
>>> BR
>>> Andreas Lehmkühler
>
>BR
>Andreas Lehmkühler
>
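
Why commenting out the line has an effect can be illustrated with plain `java.util.Properties`, on the assumption that PDFBox loads PageDrawer.properties as an operator-to-class map: a line starting with `#` is treated as a comment, so the `Do` key disappears and no handler is registered for that operator. The entry below is an assumption modelled on the package name mentioned in this thread, not copied from the jar, and `OperatorMapSketch` and `loadOperatorMap` are hypothetical names. If this reading is right, the change only affects how PDFBox itself draws or prints pages; a file written with `document.save(...)` would still contain the images.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class OperatorMapSketch {

    /** Parses operator-to-class mappings from properties-file text. */
    public static Properties loadOperatorMap(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Assumed entry; the real file maps many more operators.
        String original =
                "Do=org.apache.pdfbox.util.operator.pagedrawer.Invoke\n";
        // The same entry with a leading '#', i.e. commented out.
        String edited =
                "#Do=org.apache.pdfbox.util.operator.pagedrawer.Invoke\n";

        System.out.println("original has Do: "
                + loadOperatorMap(original).containsKey("Do"));
        System.out.println("edited has Do:   "
                + loadOperatorMap(edited).containsKey("Do"));
    }
}
```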
