Hi,

> On 23.05.2015 at 16:37, Balaji Venkatamohan <[email protected]> wrote:
>
> Hi,
>
>> So AcroForms/Fields is an empty Array?
>
> Yes, in the filled interview_compressed.pdf, the AcroForm fields are not null but
> empty. The size of the array is zero.
>
> Also, I tried the qpdf command line tool to compress the file interview.pdf, and
> the resulting compressed file size of 1.6 MB was nowhere near the file size
> of interview_compressed.pdf (21 KB).

Do you think it would be possible to get a similar PDF file, or permission to use this one internally, so that we have a sample to look at for a potential fix? Although the PDF is not in line with the spec, Acrobat is able to handle it, so we could look into getting a similar result.

BR
Maruan

>
> Thanks,
> Balaji
>
> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <[email protected]>
> wrote:
>
>> Hi,
>>
>>> On 22.05.2015 at 23:00, Balaji Venkatamohan <[email protected]> wrote:
>>>
>>> I opened interview_compressed.pdf in Notepad++ and did not see any
>>> 'Acroform' text anywhere.
>>> However, as Maruan suggested, I entered some data into what look like
>>> form fields in interview_compressed.pdf and saved it. When I opened this
>>> file in Notepad++, I did see 'Acroform' text in it. I also noticed an
>>> increase in file size from 21 KB to ~530 KB.
>>>
>>> I then ran this filled, saved, compressed PDF in pdfdebugger.java and saw
>>> that the field values were being stored, not under the AcroForm fields
>>> but under Annotations.
>>
>> So AcroForms/Fields is an empty Array?
>>
>>> Please refer to this image:
>>>
>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>>
>>> So, whatever the compression technique was, it simply made all the
>>> AcroForm fields disappear from the original PDF but retained all the
>>> annotations, which also contain the interactive forms, and this helped
>>> reduce the file size so much? If this is the case, can the PDFBox API
>>> also use a similar compression technique to compress such a huge file
>>> into a smaller one?
>>>
>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 22.05.2015 at 21:54, Tilman Hausherr <[email protected]> wrote:
>>>>>
>>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I used PdfDebugger to make the internal PDF structure of the two files,
>>>>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available,
>>>>>> and I have uploaded my images to imageshack. Here are the five links:
>>>>>>
>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>>
>>>>>> The first two links show the internal structure of interview.pdf
>>>>>> (the original uncompressed file).
>>>>>> The third and fourth links show the internal structure of
>>>>>> interview_compressed.pdf (the compressed file).
>>>>>> The fifth link compares the sizes of the two files and, as you can
>>>>>> see, the difference is huge.
>>>>>>
>>>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>>>
>>>>> Indeed... but this is needed - from the spec:
>>>>>
>>>>> "The contents and properties of a document’s interactive form shall be
>>>>> defined by an interactive form dictionary that shall be referenced from
>>>>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>> Catalog”). Table 218 shows the contents of this dictionary."
>>>>>
>>>>
>>>> correct
>>>>
>>>>>> fields listed even though opening the PDF in a PDF reader allows me to
>>>>>> enter values in places which look like AcroForm fields and also save them.
>>>>>> Are there any other PDF 'types' similar to AcroForm fields which would
>>>>>> enable users to fill in data and which can be accessed through the
>>>>>> PDFBox APIs without having to go through PDAcrofield?
>>>>>
>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>> vague observation from me, I'm not the acroform specialist.
>>>>
>>>> At first glance it looks like all the entries necessary to (re-)generate
>>>> the form fields are there. That's what's likely happening for this
>>>> document in Adobe Reader. It would be interesting to see what is being
>>>> saved after the form has been filled out and saved using Acrobat. We'd
>>>> need a test form to come up with an enhancement like this.
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>>
>>>>> What you should do: use NOTEPAD++ to check whether there's "/AcroForm"
>>>>> in the "compressed" file.
>>>>> - if it is missing, tell the client (or your boss) just that
>>>>> - if it isn't missing, then there's some problem in PDFBox (also try
>>>>>   the loadNonSeq I mentioned earlier)
>>>>>
>>>>> Tilman
>>>>>
>>>>>>
>>>>>> You can use qpdf, then use these options:
>>>>>>
>>>>>> I will now try using this link to compress the original file.
>>>>>>
>>>>>> Another strategy to think about - can your client generate a
>>>>>> non-confidential file, so that you can share it, and the "compressed"
>>>>>> file?
>>>>>>
>>>>>> I wish I had direct communication with the clients, but due to
>>>>>> bureaucracy I have to go through multiple layers to get my message
>>>>>> across to them. I will share more information as soon as I have it.
>>>>>>
>>>>>> PS: I sent these image links to my personal email first to make sure
>>>>>> that I can open them. I could, so I am hoping you all can too. If you
>>>>>> are unable to open them, please let me know.
>>>>>>
>>>>>> Thanks,
>>>>>> Balaji
>>>>>>
>>>>>>
>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thank you for your pointers and sorry about the image. I am
>>>>>>>>> attaching it with this email.
>>>>>>>>>
>>>>>>>>> The point I am trying to make is that the PDF which was decompressed
>>>>>>>>> using WriteDecodedDoc is smaller in size than the original PDF given
>>>>>>>>> to us by our customers.
>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox
>>>>>>>>> did not have any PDAcroform fields, whereas the decompressed PDF
>>>>>>>>> given to us by the customers does contain AcroForm fields. Hence I
>>>>>>>>> wanted to know how to properly decompress the PDF using the PDFBox
>>>>>>>>> APIs. The reason why I was analyzing COSStream was to check whether
>>>>>>>>> the decompression of the compressed PDF was happening correctly
>>>>>>>>> while using the PDFBox APIs.
>>>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>>>> pointers.
>>>>>>>>>
>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
>>>>>>>> both files (compressed and decompressed) with PDFDebugger [1], post a
>>>>>>>> screenshot of both somewhere (dropbox etc.) and share the link with
>>>>>>>> us. Maybe that could shed some light on your issue.
>>>>>>>>
>>>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>> BR
>>>>>>>> Andreas Lehmkühler
>>>>>>>>
>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>>
>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>>
>>>>>>>>>> This is all very confusing... /acroform is in the document catalog.
>>>>>>>>>> I don't see how the page content stream is related to it. The best
>>>>>>>>>> thing is that you either go through the source code, or read the
>>>>>>>>>> spec and then look at the PDF.
>>>>>>>>>>
>>>>>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>>>>>> entry and then compare the two files.
>>>>>>>>>>
>>>>>>>>>> It is really difficult to help you without the files. The cause
>>>>>>>>>> could be a bug in PDFBox, or a malformed PDF...
>>>>>>>>>>
>>>>>>>>>> Some more ideas:
>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
>>>>>>>>>>   the acroform stuff. Note that the API is different.
>>>>>>>>>>   https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>>   https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>>
>>>>>>>>>> If you still need help, one possibility would be 1) post the
>>>>>>>>>> smallest possible code that fails, and 2) post a small part of the
>>>>>>>>>> raw PDF, i.e. the objects relevant to the field in your code.
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>>>
>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
>>>>>>>>>>> pages), I tried getting the COSStream for the page:
>>>>>>>>>>>
>>>>>>>>>>> PDPage firstPage = (PDPage)
>>>>>>>>>>>     document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>> PDStream pdStream = firstPage.getContents();
>>>>>>>>>>> COSStream stream = pdStream.getStream();
>>>>>>>>>>>
>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in
>>>>>>>>>>> debug mode, has the following:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>>>>
>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>>
>>>>>>>>>>> From this point on, using the COSStream object for every page, how
>>>>>>>>>>> can I decompress and find the acroform fields, given that the
>>>>>>>>>>> unFilteredStream object is null for COSStream?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>     Thank you for your response, Tilman.
>>>>>>>>>>>
>>>>>>>>>>>     I had previously tried using WriteDecodedDoc on my compressed
>>>>>>>>>>>     PDF, and I tried to get the number of acro form fields present
>>>>>>>>>>>     in the output file generated by WriteDecodedDoc. The API still
>>>>>>>>>>>     could not find the acro form fields in the generated
>>>>>>>>>>>     decompressed file. Also, the decompressed file generated is
>>>>>>>>>>>     75 KB, which is far less than the original decompressed file
>>>>>>>>>>>     which I have (1.6 MB), though I could edit the acro form fields
>>>>>>>>>>>     using Acrobat Reader.
>>>>>>>>>>>
>>>>>>>>>>>     Thanks,
>>>>>>>>>>>     Balaji
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>     On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>>
>>>>>>>>>>>         My question is: how do I flatedecode a PDF so that I can
>>>>>>>>>>>         find all the acroform fields within it? Any help or
>>>>>>>>>>>         pointers would be highly appreciated.
>>>>>>>>>>>
>>>>>>>>>>>     You could try the WriteDecodedDoc option of the command line
>>>>>>>>>>>     app
>>>>>>>>>>>     https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>>
>>>>>>>>>>>     Maybe you can get further ideas by comparing the two files
>>>>>>>>>>>     with NOTEPAD++.... however the two files might have their
>>>>>>>>>>>     objects in a different order.
>>>>>>>>>>>
>>>>>>>>>>>     Tilman
>>>>>>>>>>>
>>>>>>>>>>>     ---------------------------------------------------------------------
>>>>>>>>>>>     To unsubscribe, e-mail: [email protected]
>>>>>>>>>>>     For additional commands, e-mail: [email protected]
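PS: Tilman's suggestion to look for "/AcroForm" with Notepad++ can also be scripted. What follows is only a rough sketch in plain Java with no PDFBox calls; the class name AcroFormScan and the method countToken are made up for illustration. It counts case-sensitive occurrences of a token in the raw bytes of a file. One caveat: if a file uses compressed object streams (PDF 1.5 and later), the catalog may be stored in compressed form, so zero hits in the raw bytes does not by itself prove that the entry is missing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: a scripted version of the
// "search the raw file for /AcroForm" check.
final class AcroFormScan {

    // Counts case-sensitive occurrences of an ASCII token in raw bytes.
    // PDF names are case-sensitive, so "/Acroform" does not match "/AcroForm".
    static int countToken(byte[] data, String token) {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        if (needle.length == 0) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + needle.length <= data.length; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java AcroFormScan.java <file.pdf>");
            return;
        }
        try {
            byte[] data = Files.readAllBytes(Paths.get(args[0]));
            // Only uncompressed occurrences are visible to this scan.
            System.out.println("/AcroForm occurrences in raw bytes: "
                    + countToken(data, "/AcroForm"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With Java 11 or later this can be run directly as `java AcroFormScan.java interview_compressed.pdf`. A non-zero count only says the name appears somewhere in the file; whether its Fields array is empty still has to be checked in PDFDebugger.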

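PPS: On the original question of how to "flatedecode" a stream: the body of a /FlateDecode stream is zlib/deflate data, which java.util.zip.Inflater can decode without any PDF library. The sketch below is an illustration under assumptions, not how PDFBox does it internally; the class name FlateSketch, its methods and the sample content are invented. It compresses a sample string the way a stream body would be stored and inflates it again. As Tilman pointed out, the /AcroForm entry lives in the document catalog, so decoding page content streams would not be expected to reveal form fields.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: round-trips bytes through zlib/deflate,
// the encoding used by /FlateDecode stream bodies.
final class FlateSketch {

    // Compresses raw bytes into zlib format (stand-in for a stored stream body).
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decodes zlib data; throws IllegalArgumentException on bad or truncated input.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and not finished: the input ended early or
                // needs a preset dictionary, which this sketch does not handle.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Invented sample: a tiny page content stream (drawing operators only).
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] decoded = inflate(deflate(content));
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
    }
}
```

To apply this to a real file one would first have to cut out exactly the bytes between "stream" and "endstream" for the object of interest, which is why a tool such as WriteDecodedDoc or PDFDebugger is usually the more practical route.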
