Re: How to flatedecode and find all acroform fields in a compressed PDF

Maruan Sahyoun Sun, 24 May 2015 02:39:03 -0700

Hi,

> On 23.05.2015 at 16:37, Balaji Venkatamohan <[email protected]> wrote:
> 
> Hi,
> 
> So AcroForms/Fields is an empty Array?
> 
> Yes, in the filled interview_compressed.pdf, the acroforms are not null but
> empty. Size of array is zero.
> 
> Also, I tried the qpdf command line tool to compress the file interview.pdf
> and the resultant compressed file size of 1.6MB was nowhere near the file
> size of interview_compressed.pdf (21 KB).


Do you think it would be possible to get a similar PDF file, or permission to
use this one internally, so that we have a sample to work from when looking at
a potential fix?

Although the PDF is not in line with the spec, Acrobat is able to handle it, so
we could look into getting a similar result.

BR
Maruan


> 
> Thanks,
> Balaji
> 
> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <[email protected]>
> wrote:
> 
>> Hi,
>> 
>>> On 22.05.2015 at 23:00, Balaji Venkatamohan <[email protected]> wrote:
>>> 
>>> I opened the interview_compressed in notepad++ and did not see any
>>> 'Acroform' text anywhere.
>>> However, as Maruan suggested, I entered some data into what looks like form
>>> fields of interview_compressed.pdf and saved it. When I opened this file in
>>> notepad++, I did see 'Acroform' text in it. I also noticed an increase in
>>> file size from 21 KB to ~530 KB.
>>> 
>>> I then ran this filled saved compressed PDF in pdfdebugger.java and saw
>>> that the field values were getting stored, not under Acroform fields but
>>> under Annotations.
>> 
>> 
>> 
>> So AcroForms/Fields is an empty Array?
>> 
>>> Please refer to this image:
>>> 
>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>> 
>>> So, whatever the compression technique was, it simply made all the Acroform
>>> fields disappear from the original PDF but retained all annotations which
>>> also contain the interactive forms and this helped reduce the file size so
>>> much? If this is the case, can pdfbox API also use similar compression
>>> technique to compress such a huge file into a smaller one?
>>> 
>>> 
>>> 
>>> 
>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>>> On 22.05.2015 at 21:54, Tilman Hausherr <[email protected]> wrote:
>>>>> 
>>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I used PdfDebugger to make the internal PDF structure of the two files,
>>>>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available
>>>>>> and I have uploaded my images to imageshack. Here are the five links:
>>>>>> 
>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>> 
>>>>>> The first two links are from the internal structure of interview.pdf
>>>>>> (original uncompressed file)
>>>>>> The third and fourth links are from the internal structure of
>>>>>> interview_compressed.pdf (compressed file)
>>>>>> The fifth link compares the file sizes of the two files and as you can
>>>>>> also see, the difference is huge.
>>>>>> 
>>>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>>> 
>>>>> Indeed... but this is needed - from the spec:
>>>>> 
>>>>> "The contents and properties of a document’s interactive form shall be
>>>>> defined by an interactive form dictionary that shall be referenced from
>>>>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>> Catalog”). Table 218 shows the contents of this dictionary."
>>>>> 
>>>> 
>>>> correct
>>>> 
>>>>>> fields listed even though opening the PDF in pdf reader allows me to
>>>>>> enter values in places which look like AcroForm fields and also save
>>>>>> them. Are there any other PDF 'types' similar to Acroform fields which
>>>>>> would enable users to fill data and which can be accessed in PdfBox APIs
>>>>>> without having to go through PDAcrofield?
>>>>> 
>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>> vague observation from me, I'm not the acroform specialist.
>>>> 
>>>> At first glance it looks like all the entries necessary to (re-)generate
>>>> the form fields are there. That's likely what is happening for this
>>>> document in Adobe Reader. It would be interesting to see what is being
>>>> saved after the form has been filled out and saved using Acrobat. We'd
>>>> need a test form to come up with an enhancement like this.
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> 
>>>>> 
>>>>> What you should do: use NOTEPAD++ to look whether there's "/AcroForm" in
>>>>> the "compressed" file.
>>>>> - if it is missing, tell the client (or your boss) just that
>>>>> - if it isn't missing, then there's some problem in PDFBox (try also the
>>>>> loadNonSeq I mentioned earlier)
>>>>> 
>>>>> Tilman
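
The Notepad++ check suggested above can also be scripted. The following is only a minimal sketch using the Java standard library, not PDFBox; the class and method names (AcroFormScan, indexOf, mentionsAcroForm) and the sample bytes are made up for illustration. One caveat to keep in mind: a plain byte search can only find the key if the object holding it is stored uncompressed, so a "not found" result does not prove the entry is absent when the file uses compressed object streams.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: looks for the literal name "/AcroForm" in raw PDF bytes.
class AcroFormScan {

    /** Returns the index of the first occurrence of needle in data, or -1. */
    static int indexOf(byte[] data, byte[] needle) {
        outer:
        for (int i = 0; i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** True if the uncompressed part of the file contains "/AcroForm". */
    static boolean mentionsAcroForm(byte[] pdfBytes) {
        byte[] key = "/AcroForm".getBytes(StandardCharsets.US_ASCII);
        return indexOf(pdfBytes, key) >= 0;
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, scan that file; otherwise scan a tiny made-up sample.
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "1 0 obj<</Type/Catalog/AcroForm 5 0 R>>endobj"
                        .getBytes(StandardCharsets.US_ASCII);
        System.out.println(mentionsAcroForm(bytes) ? "/AcroForm found" : "/AcroForm not found");
    }
}
```

Run against both interview.pdf and interview_compressed.pdf, this would give the same answer as the manual search, with the same limitation regarding compressed objects.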
>>>>> 
>>>>>> 
>>>>>> You can use qpdf , then use these options:
>>>>>> 
>>>>>> I will now try using this link to compress the original file.
>>>>>> 
>>>>>> Another strategy to think about - can your client generate a
>>>>>> non-confidential file, so that you can share it, and the "compressed"
>>>> file?
>>>>>> 
>>>>>> I wish I had direct communication with the clients but due to
>>>> bureaucracy,
>>>>>> I am having to go through multiple layers to get my message across to
>>>> them.
>>>>>> I will share more information as soon as I have them.
>>>>>> 
>>>>>> PS: i sent these image links to my personal email first to make sure
>>>> that I
>>>>>> can open them. I could and so I am hoping you all could too. If you
>> are
>>>>>> unable to open them, please let me know.
>>>>>> 
>>>>>> Thanks,
>>>>>> Balaji
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thank you for your pointers and sorry about the image. I am attaching
>>>>>>>>> it with this email.
>>>>>>>>> 
>>>>>>>>> The point I am trying to make is that the PDF, which was decompressed
>>>>>>>>> using WriteDecodedDoc, is smaller in size than the original PDF given
>>>>>>>>> to us by our customers.
>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>>>>>>>> not have any PDAcroform fields whereas the decompressed PDF given to
>>>>>>>>> us by the customers does contain Acroform fields. Hence I wanted to
>>>>>>>>> know how to properly decompress the PDF using pdfbox APIs. The reason
>>>>>>>>> why I was analyzing COSStream was to check if the decompression of
>>>>>>>>> the compressed PDF was happening correctly while using PDFBox APIs.
>>>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>>>> pointers.
>>>>>>>>> 
>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
>>>>>>>> both files (compressed and decompressed) with PDFDebugger [1] and post
>>>>>>>> a screenshot of both somewhere (dropbox etc.) and share the link with
>>>>>>>> us. Maybe that could shed some light on your issue.
>>>>>>>> 
>>>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>> 
>>>>>>> Tilman
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> BR
>>>>>>>> Andreas Lehmkühler
>>>>>>>> 
>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>> 
>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>> 
>>>>>>>>>> This is all very confusing... /acroform is in the document catalog.
>>>>>>>>>> I don't see how the page content stream is related to it. The best is
>>>>>>>>>> that you either go through the source code, or read the spec and then
>>>>>>>>>> look at the pdf.
>>>>>>>>>> 
>>>>>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>>>>>> entry and then compare the two files.
>>>>>>>>>> 
>>>>>>>>>> It is really difficult to help you without the files. The cause could
>>>>>>>>>> be a bug in pdfbox, or a malformed pdf...
>>>>>>>>>> 
>>>>>>>>>> Some more ideas:
>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
>>>>>>>>>> the acroform stuff. Note that the API is different.
>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>> 
>>>>>>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>>>>>>> possible code that fails, and 2) post a small part of the raw PDF,
>>>>>>>>>> i.e. the objects relevant to the field in your code.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Tilman
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>>> 
>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>>>>>>>> tried getting the COSStream for each of the pages:
>>>>>>>>>>> 
>>>>>>>>>>> // PDFBox 1.8 API: fetch the first page and its content stream
>>>>>>>>>>> PDPage firstPage = (PDPage)
>>>>>>>>>>>         document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>> PDStream pdStream = firstPage.getContents();
>>>>>>>>>>> COSStream stream = pdStream.getStream();
>>>>>>>>>>> 
>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>>>>>>> mode, has the following:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>>>>>>>> 
>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>> 
>>>>>>>>>>> From this point on, using the COSStream object for every page, how
>>>>>>>>>>> can I decompress and find out the acroform fields given that the
>>>>>>>>>>> unFilteredStream object is null for COSStream?
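
For readers wondering what /FlateDecode means for a stream such as the one quoted above: the stream bytes are zlib-compressed, and inflating them returns the page's drawing operators. As noted elsewhere in this thread, form fields are reached through the /AcroForm entry of the document catalog rather than through the page content stream, so inflating page streams is not expected to reveal fields. The sketch below only illustrates the decoding step with java.util.zip from the standard library. The names FlateDemo, inflate and deflate, and the sample content stream, are assumptions made for this example; it does not show how PDFBox decodes streams internally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo: round-trips a made-up page content stream through zlib.
class FlateDemo {

    /** Inflates zlib-wrapped data, the format used by the FlateDecode filter. */
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // truncated input or preset dictionary: stop instead of looping forever
            }
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    /** Compresses data so the demo has something to inflate. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Made-up content stream: text-drawing operators, no form field data.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] decoded = inflate(deflate(content));
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
    }
}
```

The decoded output consists of operators such as BT, Tf and Tj, which is why comparing the /AcroForm dictionaries of the two files is the more promising route.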
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>    Thank you for your response Tilman.
>>>>>>>>>>> 
>>>>>>>>>>>    I had previously tried using the WriteDecodedDoc for my compressed
>>>>>>>>>>>    PDF and I tried to get the number of acro form fields present in
>>>>>>>>>>>    the output file generated by WriteDecodedDoc. The API still could
>>>>>>>>>>>    not find the acro form fields in the generated decompressed file.
>>>>>>>>>>>    Also the decompressed file generated is 75 KB which is far less
>>>>>>>>>>>    than the original decompressed file which I have (1.6 MB) though I
>>>>>>>>>>>    could edit the acro form fields using acrobat reader.
>>>>>>>>>>> 
>>>>>>>>>>>    Thanks,
>>>>>>>>>>>    Balaji
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>    On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>>    <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>        On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>> 
>>>>>>>>>>>            My question is: how do I flatedecode a PDF so that I can
>>>>>>>>>>>            find all the acroform fields within it. Any help or
>>>>>>>>>>>            pointers would be highly appreciated.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>        You could try the WriteDecodedDoc option of the command line
>>>>>>>>>>>        app
>>>>>>>>>>>        https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>> 
>>>>>>>>>>>        Maybe you can have further ideas by comparing the two
>>>> files
>>>>>>>>>>>        with NOTEPAD++.... however the two files might have their
>>>>>>>>>>>        objects in different order.
>>>>>>>>>>> 
>>>>>>>>>>>        Tilman
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>> ---------------------------------------------------------------------
>>>>>>>>>>>        To unsubscribe, e-mail:
>>>> [email protected]
>>>>>>>>>>>        <mailto:[email protected]>
>>>>>>>>>>>        For additional commands, e-mail:
>>>> [email protected]
>>>>>>>>>>>        <mailto:[email protected]>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 
>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
