Re: How to flatedecode and find all acroform fields in a compressed PDF

Tilman Hausherr Wed, 27 May 2015 11:21:12 -0700

On 27.05.2015 at 20:02, Balaji Venkatamohan wrote:

Thanks Tilman for letting the website developer know about the shortcomings
of their compression technique.


The PDF owner did not share with us the information about which website
they used for compressing the PDF. My teammates helped in identifying this
website. I will let the customer know about this particular website and
will leave it to them regarding continuing to use this website for their
PDF documents.

Could you also answer the following question please?
Would Pdfbox API change its code to accommodate the incorrect condition
that annotation fields (editable fields) are outside acro form fields as
well? I know the PDF compressed by the website is incorrect and hence I
would understand if you don't go ahead with this.

As I said, I'm not the acroform specialist here. So I can't tell if it
is possible to repair these, and if there aren't side effects, e.g.
PDFs with annotations ending up being forms. Yes, we've done all sorts
of things to accommodate broken PDFs. But here the fault is known: it is
a website that deletes data from PDFs to "compress" them. The better
solution would be to have this guy fix his website, i.e. allow options
to decide what is to be removed, and what not. Another solution (which I
mentioned before) would be to have your customer compress his PDFs with
the method I mentioned in this thread, i.e. if this customer of yours
generates PDFs, but doesn't have the knowledge to compress the streams.
He could of course look into our source code (FlateFilter.java, it is
just 10 lines) and see how to compress them himself.
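For readers wondering what "compressing the streams" amounts to, here is a minimal sketch using only the JDK's java.util.zip.Deflater. It is not a copy of PDFBox's FlateFilter.java; the class name FlateEncodeSketch and the method flateEncode are made up for this illustration. It only shows that a FlateDecode stream body is ordinary zlib/deflate data.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical helper, not part of PDFBox: compresses the raw bytes of a
// PDF stream into zlib/deflate data, the encoding /FlateDecode expects.
public class FlateEncodeSketch {

    public static byte[] flateEncode(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(raw);
            deflater.finish(); // no more input will follow
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            // Drain the compressor until it reports that all output is written.
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }
}
```

Whoever writes the PDF still has to set /Filter /FlateDecode and the correct /Length in the stream dictionary. Nothing in this step touches /AcroForm, /Outlines or /Metadata, which is the difference from the website discussed in this thread.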


Tilman


Thanks,
Balaji


On Tue, May 26, 2015 at 10:45 PM, Tilman Hausherr <[email protected]>
wrote:

I just tested it. It also removes /Outlines and /Metadata and more
important data from PDF files.

So your client can't share the PDF with us, but he shared it with some website.

A little research shows that this website is owned by Lauri Lehtinen from
Tallinn, Estonia.
http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
https://www.linkedin.com/in/laurilehtinen
https://twitter.com/laurii

I also tweeted him.

Tilman


On 27.05.2015 at 03:06, Balaji Venkatamohan wrote:

Okay, I found out the online tool used by the customer to compress their
PDF.

It is : https://www.pdfcompress.com/

I don't need to rely on the PDF sent by the customer because all PDFs that
are available on the web, are compressed in the same manner by this tool,
that is, it gets rid of all acro form fields during compression.

For example, the f941 govt form available at this site:
http://www.irs.gov/pub/irs-pdf/f941.pdf
If we compress this using the online tool, the resultant file size is very
low, which is good. However, there are no acro form fields in the
compressed PDF.

Thanks,
Balaji



On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <[email protected]>
wrote:

  Hi,

  On 23.05.2015 at 16:37, Balaji Venkatamohan <[email protected]> wrote:

Hi,

So AcroForms/Fields is an empty Array?

Yes, in the filled interview_compressed.pdf, the acroforms are not null
but empty. Size of array is zero.

Also, I tried qpdf command line tool to compress the file interview.pdf
and the resultant compressed file size of 1.6MB was nowhere near the file
size of interview_compressed.pdf (21 KB).

Do you think it's possible to get a similar PDF file, or permission to
use it internally, so we have a sample to look at for a potential fix?

Although the PDF is not in line with the spec, Acrobat is able to handle
it, so we could look into getting a similar result.

BR
Maruan


  Thanks,

Balaji

On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <[email protected]> wrote:

  Hi,

  On 22.05.2015 at 23:00, Balaji Venkatamohan <[email protected]> wrote:

I opened the interview_compressed in notepad++ and did not see any
'Acroform' text anywhere.
However, as Maruan suggested, I entered some data into what looks like form
fields of interview_compressed.pdf and saved it. When I opened this file in
notepad++, I did see 'Acroform' text in it. I also noticed an increase in
file size from 21 KB to ~530 KB.

I then ran this filled saved compressed PDF in pdfdebugger.java and saw
that the field values were getting stored but not under Acroform fields
but under Annotations.


So AcroForms/Fields is an empty Array?

  Please refer to this image:

http://imageshack.com/a/img540/9951/QGLDtS.jpg

So, whatever the compression technique was, it simply made all the Acroform
fields disappear from the original PDF but retained all annotations which
also contain the interactive forms and this helped reduce the file size so
much? If this is the case, can pdfbox API also use similar compression
technique to compress such a huge file into a smaller one?




On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <[email protected]> wrote:

Hi,

  On 22.05.2015 at 21:54, Tilman Hausherr <[email protected]> wrote:

On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:

Hello,

I used PdfDebugger to make the internal PDF structure of the two files (1)
interview.pdf and (2) interview_compressed.pdf visually available and I
have uploaded my images to imageshack. Here are the five links:

http://imageshack.com/a/img538/8277/JghCpG.jpg
http://imageshack.com/a/img909/6140/KsYNGR.jpg
http://imageshack.com/a/img903/8644/mk15As.jpg
http://imageshack.com/a/img901/8610/NXe3mJ.jpg
http://imageshack.com/a/img673/8633/0GMdjQ.jpg

The first two links are from the internal structure of interview.pdf
(original uncompressed file)
The third and fourth links are from the internal structure of
interview_compressed.pdf (compressed file)
The fifth link compares the file sizes of the two files and as you can also
see, the difference is huge.

As you might notice, the file interview_compressed.pdf has no acroform

Indeed... but this is needed - from the spec:

"The contents and properties of a document’s interactive form shall be
defined by an interactive form dictionary that shall be referenced from the
AcroForm entry in the document catalogue (see 7.7.2, “Document Catalog”).
Table 218 shows the contents of this dictionary."

correct

fields listed even though opening the PDF in pdf reader allows me to enter
values in places which look like AcroForm fields and also save them. Are
there any other PDF 'types' similar to Acroform fields which would enable
users to fill data and which can be accessed in PdfBox APIs without having
to go through PDAcrofield?

Yes, annotations... there are some common parts, but this is just a
vague observation from me, I'm not the acroform specialist.

At first glance it looks like all the entries necessary to (re-)generate
the form fields are there. That's what's likely happening for this document
in Adobe Reader. It would be interesting to see what's being saved after
the form has been filled out and saved using Acrobat. We'd need a test form
to come up with an enhancement like this.

BR
Maruan


  What you should do: use NOTEPAD++ to look whether there's "/AcroForm" in
the "compressed" file.

- if it is missing, tell the client (or your boss) just that
- if it isn't missing, then there's some problem in PDFBox (try also the
  loadNonSeq I mentioned earlier)
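The manual check above can also be scripted. The following is only a sketch with plain JDK classes; AcroFormScanSketch and containsAcroFormKey are hypothetical names, not PDFBox API. It searches the raw file bytes for the literal name /AcroForm, so it has the same blind spot as a text editor: if the catalog sits inside a compressed object stream, the name will not be visible in the raw bytes even though it exists.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: a byte-level search for the
// PDF name "/AcroForm" (PDF names are case-sensitive).
public class AcroFormScanSketch {

    public static boolean containsAcroFormKey(byte[] pdfBytes) {
        byte[] needle = "/AcroForm".getBytes(StandardCharsets.US_ASCII);
        outer:
        for (int i = 0; i + needle.length <= pdfBytes.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (pdfBytes[i + j] != needle[j]) {
                    continue outer; // mismatch, try the next start offset
                }
            }
            return true; // every byte of the needle matched at offset i
        }
        return false;
    }
}
```

It could be called with the bytes of the file under test, for example the result of java.nio.file.Files.readAllBytes on interview_compressed.pdf.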

Tilman

  You can use qpdf , then use these options:

I will now try using this link to compress the original file.

Another strategy to think about - can your client generate a
non-confidential file, so that you can share it, and the "compressed"
file?

I wish I had direct communication with the clients but due to bureaucracy,
I am having to go through multiple layers to get my message across to
them.

I will share more information as soon as I have it.

PS: I sent these image links to my personal email first to make sure that I
can open them. I could and so I am hoping you all could too. If you are
unable to open them, please let me know.

Thanks,
Balaji


On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <[email protected]> wrote:

  On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:

Hi,

Balaji Venkatamohan <[email protected]> wrote on 20 May 2015 at 03:24:


Thank you for your pointers and sorry about the image. I am attaching it
with this email.

The point I am trying to make is that the PDF, which was decompressed using
WriteDecodedDoc, is smaller in size than the original PDF given to us by
our customers.

Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did not
have any PDAcroform fields whereas the decompressed PDF given to us by the
customers does contain Acroform fields. Hence I wanted to know how to
properly decompress the PDF using pdfbox APIs. The reason why I was
analyzing COSStream was to check if the decompression of the compressed PDF
was happening correctly while using PDFBox APIs.
I know it would have been difficult for you to help me without the actual
PDFs. For that, I would like to thank you for your time and pointers.

Maybe it's worth trying to share the file "visually" with us. Open both
files (compressed and decompressed) with PDFDebugger [1] and post a
screenshot of both somewhere (dropbox etc.) and share the link with us.
Maybe that could shed some light on your issue.

  @Balaji: here's an example of what such a screenshot would look like:

http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png

Tilman



  BR

Andreas Lehmkühler

[1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger

On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]> wrote:

Hi,

The image doesn't appear in the mailing list.

This is all very confusing... /acroform is in the document catalog. I
don't see how the page content stream is related to it. The best is that
you either go through the source code, or read the spec and then look at
the pdf.

To find out what's going on, you'd have to start from that /acroform entry
and then compare the two files.

It is really difficult to help you without the files. The cause could be a
bug in pdfbox, or a malformed pdf...

Some more ideas:
- use loadNonSeq(file, null) instead of load(file)
- try the unreleased 2.0 version, that one has some improvements in the
  acroform stuff. Note that the API is different.

https://pdfbox.apache.org/download.cgi#scm
https://pdfbox.apache.org/2.0/getting-started.html

If you still need help, one possibility would be 1) post the smallest
possible code that fails, and 2) post a small part of the raw PDF, i.e. the
objects relevant to the field in your code.


Tilman


On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:

Moreover, for every page of the compressed PDF (there are 3 pages), I
tried getting the COSStream for each page:

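// PDFBox 1.8 API; "document" is the already loaded PDDocument
PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
PDStream pdStream = firstPage.getContents();   // page content stream wrapper
COSStream stream = pdStream.getStream();       // underlying low-level stream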

In the above code snippet, the object stream, when analyzed in debug
mode, has the following:


The line from the compressed PDF as opened with Notepad++ is :

<</Filter/FlateDecode/Length 5675>>stream

From this point on, using the COSStream object for every page, how can I
decompress and find out the acroform fields given that the unFilteredStream
object is null for COSStream?
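To illustrate what decoding such a /FlateDecode stream involves, here is a minimal sketch using java.util.zip.Inflater from the JDK. It is independent of PDFBox, and the names FlateDecodeSketch and flateDecode are invented for this example. It assumes the caller has already cut out exactly the bytes between "stream" and "endstream" (5675 bytes in the line quoted above). As pointed out elsewhere in this thread, a page content stream decoded this way holds drawing operators; the form fields are reached through the /AcroForm entry of the document catalog, so decoding page streams will not reveal them.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox: inflates the body of a stream
// whose dictionary says /Filter /FlateDecode (zlib/deflate data).
public class FlateDecodeSketch {

    public static byte[] flateDecode(byte[] encoded) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(encoded);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and no way to continue: input is cut short
                // or needs a preset dictionary, which this sketch ignores.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end(); // release native zlib resources
        }
    }
}
```

The sketch does not handle predictors (/DecodeParms) or chains of several filters.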


On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <[email protected]> wrote:

     Thank you for your response Tilman.

     I had previously tried using the WriteDecodedDoc for my compressed
     PDF and I tried to get the number of acro form fields present in
     the output file generated by WriteDecodedDoc. The API still could
     not find the acro form fields in the generated decompressed file.
     Also the decompressed file generated is 75 KB which is far less
     than the original decompressed file which I have (1.6 MB) though I
     could edit the acro form fields using acrobat reader.

     Thanks,
     Balaji



     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
     <[email protected]> wrote:

         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:

             My question is: how do I flatedecode a PDF so that I can
             find all the acroform fields within it. Any help or
             pointers would be highly appreciated.


         You could try the WriteDecodedDoc option of the command line app
         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc

         Maybe you can have further ideas by comparing the two files
         with NOTEPAD++.... however the two files might have their
         objects in different order.

         Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
