Re: How to flatedecode and find all acroform fields in a compressed PDF

Tilman Hausherr Tue, 19 May 2015 22:52:51 -0700

Hi,

"How to properly decompress the PDF using pdfbox APIs" - see the source code of WriteDecodedDoc:

https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/WriteDecodedDoc.java?view=markup&sortby=date
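
For reference, the body of a /FlateDecode stream is zlib-compressed data, so the decoding step itself can be sketched with the JDK alone. This is only a minimal sketch and not PDFBox code: the class name FlateSketch and both helper methods are made up for illustration, the deflate helper exists only to fabricate sample input, and any /DecodeParms predictor is ignored.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateSketch {

    /** Inflates zlib data, i.e. the bytes between "stream" and "endstream" of a /FlateDecode stream. */
    public static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read: the input is truncated or needs a dictionary.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Hypothetical helper, used only to build sample input; a real PDF already holds compressed bytes. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // A tiny page content stream, compressed and then decoded again.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```

Note that a page content stream decoded this way holds drawing operators only; form fields are reached from the document catalog, not from the page contents.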

How was the PDF from your customer decompressed - did your customer also use PDFBox? Or something else?

And I read in the first post that the decompressed customer file was OK, but not the compressed file... so the problem is to find out whether something is missing in the compressed file, or whether PDFBox has a bug that causes it to miss it.
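
One rough way to start that comparison without any PDF library is to count a few name tokens in the raw bytes of each file. The sketch below is an assumption-laden illustration only: the class name PdfTokenScan and the choice of tokens are made up here, and a count of zero for /AcroForm does not prove the entry is missing, because in a file that uses object streams (/ObjStm) the catalog itself may be stored compressed and so be invisible in the raw bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfTokenScan {

    /** Counts non-overlapping occurrences of a token in the raw bytes of a file. */
    public static int count(byte[] pdfBytes, String token) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int hits = 0;
        for (int i = text.indexOf(token); i >= 0; i = text.indexOf(token, i + token.length())) {
            hits++;
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Pass the compressed and the decompressed file as arguments to compare them.
        for (String path : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(path));
            System.out.println(path
                    + ": /AcroForm=" + count(bytes, "/AcroForm")
                    + " /ObjStm=" + count(bytes, "/ObjStm")
                    + " /FlateDecode=" + count(bytes, "/FlateDecode"));
        }
    }
}
```

If the decompressed customer file shows /AcroForm and the compressed one shows only /ObjStm entries, that would suggest the catalog sits inside an object stream, which narrows down where to look next.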


Tilman

PS: image didn't go through. Maybe upload it to imageshack.us.


On 20.05.2015 at 03:24, Balaji Venkatamohan wrote:

Thank you for your pointers and sorry about the image. I am attaching it with this email.

The point I am trying to make is that the PDF decompressed using WriteDecodedDoc is smaller than the original PDF given to us by our customers. Also, the decompressed PDF generated by PDFBox's WriteDecodedDoc did not have any PDAcroForm fields, whereas the decompressed PDF given to us by the customers does contain AcroForm fields. Hence I wanted to know how to properly decompress the PDF using pdfbox APIs. The reason why I was analyzing COSStream was to check whether the decompression of the compressed PDF was happening correctly when using PDFBox APIs.

I know it would have been difficult for you to help me without the actual PDFs. For that, I would like to thank you for your time and pointers.

On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr wrote:


    Hi,

    The image doesn't appear in the mailing list.

    This is all very confusing... /AcroForm is in the document
    catalog. I don't see how the page content stream is related to it.
    The best approach is to either go through the source code, or read
    the spec and then look at the PDF.

    To find out what's going on, you'd have to start from that
    /AcroForm entry and then compare the two files.

    It is really difficult to help you without the files. The cause
    could be a bug in pdfbox, or a malformed pdf...

    Some more ideas:
    - use loadNonSeq(file, null) instead of load(file)
    - try the unreleased 2.0 version, that one has some improvements
    in the acroform stuff. Note that the API is different.
    https://pdfbox.apache.org/download.cgi#scm
    https://pdfbox.apache.org/2.0/getting-started.html

    If you still need help, one possibility would be 1) post the
    smallest possible code that fails, and 2) post a small part of the
    raw PDF, i.e. the objects relevant to the field in your code.


    Tilman


    On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:

        Moreover, for every page of the compressed PDF (there are 3
        pages), I tried getting the COSStream of each page:

        PDPage firstPage = (PDPage)
                document.getDocumentCatalog().getAllPages().get(0);
        PDStream pdStream = firstPage.getContents();
        COSStream stream = pdStream.getStream();

        In the above code snippet, the object stream, when analyzed in
        debug mode, has the following:


        The line from the compressed PDF, as opened with Notepad++, is:

        <</Filter/FlateDecode/Length 5675>>stream

        From this point on, using the COSStream object for every page,
        how can I decompress and find out the acroform fields given
        that the unFilteredStream object is null for COSStream?
        

        On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan wrote:

            Thank you for your response Tilman.

            I had previously tried using WriteDecodedDoc on my compressed
            PDF, and I tried to get the number of AcroForm fields present
            in the output file generated by WriteDecodedDoc. The API still
            could not find the AcroForm fields in the generated
            decompressed file. Also, the decompressed file generated is
            75 KB, which is far less than the original decompressed file
            which I have (1.6 MB), though I could edit the AcroForm fields
            using Acrobat Reader.

            Thanks,
            Balaji



            On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr wrote:

                On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:

                    My question is: how do I flatedecode a PDF so that I can
                    find all the AcroForm fields within it? Any help or
                    pointers would be highly appreciated.


                You could try the WriteDecodedDoc option of the command
                line app:
                https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc

                Maybe you can get further ideas by comparing the two files
                with Notepad++... however, the two files might have their
                objects in a different order.

                Tilman




---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
