Re: How to flatedecode and find all acroform fields in a compressed PDF

Tilman Hausherr Fri, 22 May 2015 06:47:03 -0700

On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:

Hi,

Balaji Venkatamohan <[email protected]> wrote on 20 May 2015 at 03:24:


Thank you for your pointers and sorry about the image. I am attaching it
with this email.

The point I am trying to make is that the PDF, which was decompressed using
WriteDecodedDoc, is smaller in size than the original PDF given to us by
our customers.
Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did not
have any AcroForm fields, whereas the decompressed PDF given to us by the
customers does contain AcroForm fields. Hence I wanted to know how to
properly decompress the PDF using the PDFBox APIs. The reason I was
analyzing the COSStream was to check whether the compressed PDF was being
decompressed correctly by the PDFBox APIs.
I know it would have been difficult for you to help me without the actual
PDFs. For that, I would like to thank you for your time and pointers.

Maybe it's worth trying to share the files "visually" with us. Open both files
(compressed and decompressed) with PDFDebugger [1], post a screenshot of each
somewhere (Dropbox etc.) and share the link with us. Maybe that could shed some
light on your issue.


@Balaji: here's an example of what such a screenshot would look like:
http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png

Tilman


BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger

On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]>
wrote:

Hi,

The image doesn't appear in the mailing list.

This is all very confusing... /AcroForm is in the document catalog. I
don't see how the page content stream is related to it. Your best option
is to either go through the source code, or read the spec and then look
at the PDF.

To find out what's going on, you'd have to start from that /AcroForm entry
and then compare the two files.
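
As a rough starting point for that comparison, the sketch below scans the raw bytes of each file for the /AcroForm name and prints the byte offsets where it occurs. It uses only the Java standard library, not PDFBox; the class name AcroFormTokenScan and the method findToken are made up for this illustration. One caveat: if the catalog sits inside a compressed object stream (possible since PDF 1.5), the name will not show up in the raw bytes even though the entry exists, so an empty result is not proof that the form is missing.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class AcroFormTokenScan {

    // Returns the byte offsets at which the token occurs in the raw file data.
    // ISO-8859-1 maps every byte to exactly one char, so string indexes
    // are the same as byte offsets.
    static List<Integer> findToken(byte[] pdfBytes, String token) {
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<Integer> offsets = new ArrayList<>();
        int from = 0;
        while (true) {
            int idx = text.indexOf(token, from);
            if (idx < 0) {
                break;
            }
            offsets.add(idx);
            from = idx + token.length();
        }
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        // Pass both files on the command line, e.g. the compressed and the
        // decompressed PDF, and compare the reported offsets by eye.
        for (String name : args) {
            byte[] data = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": /AcroForm at offsets "
                    + findToken(data, "/AcroForm"));
        }
    }
}
```

The reported offsets can then be looked up in a text editor to see which object holds the entry in each file.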

It is really difficult to help you without the files. The cause could be a
bug in PDFBox, or a malformed PDF...

Some more ideas:
- use loadNonSeq(file, null) instead of load(file)
- try the unreleased 2.0 version, that one has some improvements in the
acroform stuff. Note that the API is different.
https://pdfbox.apache.org/download.cgi#scm
https://pdfbox.apache.org/2.0/getting-started.html

If you still need help, one possibility would be 1) post the smallest
possible code that fails, and 2) post a small part of the raw PDF, i.e. the
objects relevant to the field in your code.


Tilman


On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:

Moreover, for every page of the compressed PDF (there are 3 pages), I
tried getting the COSStream of the page:

    // PDFBox 1.8 API: fetch the content stream of the first page.
    PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
    PDStream pdStream = firstPage.getContents();
    COSStream stream = pdStream.getStream();

In the above code snippet, the stream object, when inspected in debug
mode, has the following:


The line from the compressed PDF, as opened with Notepad++, is:

<</Filter/FlateDecode/Length 5675>>stream

From this point on, using the COSStream object for every page, how can I
decompress it and find the AcroForm fields, given that the unFilteredStream
object of the COSStream is null?
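
For reference, /Filter/FlateDecode means the stream bytes are zlib/deflate data. The sketch below shows that decoding step with only the Java standard library (java.util.zip.Inflater). It is not how PDFBox does it internally; the class name FlateSketch and its methods are made up for this illustration, and the deflate helper exists only to produce demo input. Note that, as Tilman's reply points out, the form fields hang off the /AcroForm entry in the document catalog, so decoding a page content stream this way would yield page drawing operators rather than fields.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateSketch {

    // Inflates zlib-wrapped data, i.e. the bytes between "stream" and
    // "endstream" of a stream whose only filter is /FlateDecode.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read: the input is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Demo helper only: compresses data the same way a FlateDecode stream is stored.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Made-up content stream fragment used as sample data.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```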


On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <[email protected]> wrote:

     Thank you for your response Tilman.

     I had previously tried using WriteDecodedDoc on my compressed PDF,
     and I tried to get the number of AcroForm fields present in the
     output file generated by WriteDecodedDoc. The API still could not
     find the AcroForm fields in the generated decompressed file. Also,
     the generated decompressed file is 75 KB, which is far smaller than
     the original decompressed file which I have (1.6 MB), though I
     could edit the AcroForm fields using Acrobat Reader.

     Thanks,
     Balaji



     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr <[email protected]> wrote:

         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:

             My question is: how do I FlateDecode a PDF so that I can
             find all the AcroForm fields within it? Any help or
             pointers would be highly appreciated.


         You could try the WriteDecodedDoc option of the command line app
         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc

         Maybe you can get further ideas by comparing the two files
         with Notepad++; however, the two files might have their
         objects in a different order.

         Tilman




---------------------------------------------------------------------
         To unsubscribe, e-mail: [email protected]
         <mailto:[email protected]>
         For additional commands, e-mail: [email protected]
         <mailto:[email protected]>
