Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-23 Thread Andreas Lehmkühler
Hi,

I've improved the self-repair mechanism of the trunk based on Steve's report.

@Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue still
persist?

BR
Andreas Lehmkühler

 Steve Antoch sant...@yuzu.com wrote on 17 February 2015 at 00:05:
 
 
 
 Andreas-
 Thanks for the response.
 Sorry for sending directly.
 
 Yes, it tries to read from offset 112085940, but does not find the xrefstm
 there, so 
 that's when it goes searching.  It seems to be landing in the middle of
 something else (perhaps an image?)
 
 I tried running the preflight command on the file, and this is what it found
 there.
 This is in the middle of a whole series of repetitive byte patterns like
 these, which are interspersed with other sections of content that are also
 binary only.
 
 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 <preflight name="file.pdf">
   <executionTimeMS>2646</executionTimeMS>
   <isValid type="">false</isValid>
   <errors count="1">
     <error count="1">
       <code>1.0</code>
       <details>Syntax error, Error: Expected a long type at offset 112085940,
 instead got
 '6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'</details>
     </error>
   </errors>
 </preflight>
 
 The patterns seem to be:
 
 lots of these: 6lÙ³fÍ#155;
 interspersed between blocks that are similar to this:
 ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'
 
 It just so happens that the offset 112085940 falls right in the middle of a
 big block of those 6lÙ³fÍ#155; repetitive blocks.
 
 Not sure if that's any help. 
 
 Steve
 
 
 From: Andreas Lehmkühler andr...@lehmi.de
 Sent: Monday, February 16, 2015 3:34 AM
 To: users@pdfbox.apache.org
 Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
 (or variation of it still present)
 
 Hi,
 
  Steve Antoch sant...@yuzu.com wrote on 13 February 2015 at 23:34:
 
 
 
  Hi Tilman and Andreas--
 Please don't contact developers directly; use our mailing lists instead. I've
 copied the users list back in...
 
  I am working with Krasimir on this issue.
 
  Although we asked, we were denied permission to send the document out.
 :-(
 
  The failure is triggered when we attempt to use the Encrypt() class to
  password-protect the PDF.
  We end up with the "Expected a long type at offset 113884174, instead got
  'xref'" failure.
 
  I have debugged into the PDFBox code and found the offending parts.
 
  PDFBox is trying to parse an xref table located at 113884174.
 
  The problem we are seeing is that inside the trailer it finds the /XRefStm
  label, and its offset value is returned as 112085940 (which is what is given
  in the file).
  However, the checkXRefOffset() call made to verify it doesn't find the xref
  stream there, so it goes searching and ends up returning the closest xref
  offset it can find, which happens to be its own offset at 113884174.
 
 
  I believe that there is an error in PdfBox with respect to this fixup logic,
  even if it had found the 'correct' xref stream.
  That is because the fixup offset can NEVER work.  Every time it fixes up the
  location, it lands on a section which begins with xref.
  The next call is to skip the whitespace, but since there is never any there
  (it's already proven to be 'xref'), it does not advance the input stream.
  Then, the first call to parse that xrefstm always calls readObjectID(), which
  will always throw the exception because the bytes are always 'xref'.
 
  So, my questions are:
 
  1) Are these docs fixable or are they truly corrupt?
 Without having the PDF itself at hand, it's hard to give a definitive answer.
 But I guess there has to be a fix, as Adobe is able to open that PDF. I'll try
 to find one, following your description of the PDF.
 
  2) Is this xref issue a known issue with PDFBox?  I would try to create a
  document that displays the error but I honestly don't know how to do so
  (beyond sending the ones that we have that DO display it).
 Not until now
 
  3) Do you have any idea how these documents end up in this state if they are
  being edited by tools such as InDesign, Acrobat, etc.? Is there something I
  can do to identify them?
 There are a lot of more or less corrupt files in the wild. Those are created
 using different tools.
 
  4) If this is a truly corrupted document, why would Acrobat be able to open
  these files but pdfBox cannot?  Are these streams somehow ignorable?  I ask
  this because I saw this statement on a web page
   (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
  when
  I 

RE: setting permissions on a new document

2015-02-23 Thread Allison, Timothy B.
Alright.   After the exorcism, all is working.  I have no idea why it wasn't 
working before.  Thank you, Tilman!

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Friday, February 20, 2015 6:42 PM
To: users@pdfbox.apache.org
Subject: RE: setting permissions on a new document

Thank you, Tilman.  

Ha!  Sorry, I was just giving the minimal code.  The actual code was:

public static void main(String[] args) throws Exception {
    File f = new File("C:/temp/testPDF_protected.pdf");
    PDDocument document = new PDDocument();
    PDPage page = new PDPage();
    document.addPage(page);

    // Write a single line of text so the page is not empty
    PDFont font = PDType1Font.HELVETICA_BOLD;
    PDPageContentStream contentStream = new PDPageContentStream(document, page);
    contentStream.beginText();
    contentStream.setFont( font, 12 );
    contentStream.moveTextPositionByAmount( 100, 700 );
    contentStream.drawString( "Hello World" );
    contentStream.endText();
    contentStream.close();

    // Encrypt with owner password "owner" and user password "user"
    AccessPermission ap = new AccessPermission();
    ap.setReadOnly();
    StandardProtectionPolicy spp = new StandardProtectionPolicy("owner", "user", ap);
    document.protect(spp);
    document.save(f);
    document.close();
}

The error is "There was a problem reading this document (57)."  I can move the 
AccessPermission line before or after creating the page, and I get the same 
error.  If I don't create/add a page, I get the same error.

If I comment out the lines from AccessPermission through protect, Adobe Reader 
is able to open the file.

I generated the document on Windows with PDFBox 2.0 SNAPSHOT and Java 1.8.0_31.

For the record, I'm still sure I'm doing something wrong!  :)
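
A side note for the permission-checking test on the Tika side: with the PDF standard security handler, the permissions end up as a 32-bit integer (the /P entry of the encryption dictionary), where individual bits switch operations on or off. The sketch below is plain Java and does not use PDFBox. The class name PermissionBits and its helpers are made up for illustration, and only a few of the bits are shown.

```java
// Hypothetical helper, not part of PDFBox: shows how permission flags
// map onto bits of the /P integer in the encryption dictionary.
class PermissionBits {
    // Bit positions are 1-based, as in the PDF specification's /P table.
    static final int PRINT = 3;     // print the document
    static final int MODIFY = 4;    // modify the contents
    static final int EXTRACT = 5;   // copy or extract text and graphics
    static final int ANNOTATE = 6;  // add or modify annotations

    /** Returns true if the given 1-based bit is set in the /P value. */
    static boolean isSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    /** Returns a copy of the /P value with the given 1-based bit cleared. */
    static int clear(int p, int bit) {
        return p & ~(1 << (bit - 1));
    }

    public static void main(String[] args) {
        // -4 is 0xFFFFFFFC: every permission bit set, bits 1 and 2 clear.
        int p = -4;
        p = clear(p, MODIFY);
        System.out.println("P = " + p);
        System.out.println("print allowed:  " + isSet(p, PRINT));
        System.out.println("modify allowed: " + isSet(p, MODIFY));
    }
}
```

A test document for permission checking could then be described by which of these bits it is expected to have cleared.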

Best,

Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Friday, February 20, 2015 5:25 PM
To: users@pdfbox.apache.org
Subject: Re: setting permissions on a new document

Hi Tim,

add a page to the document.

PDPage page = new PDPage();
document.addPage(page);

Tilman

On 20.02.2015 at 22:12, Allison, Timothy B. wrote:
 All,
I'm trying to create a test doc for permission checking over on Tika,  
 when I try the most basic program:

  public static void main(String[] args) throws Exception {
      File f = new File("C:/temp/testPDF_protected.pdf");
      PDDocument document = new PDDocument();
      AccessPermission ap = new AccessPermission();
      ap.setReadOnly();
      StandardProtectionPolicy spp = new StandardProtectionPolicy("owner", "user", ap);
      document.protect(spp);
      document.save(f);
      document.close();
  }

 AdobeReader isn't able to open the file.  I'm sure that this is user 
 error...what am I doing wrong?

   Thank you.

 Best,

  Tim

 -
 To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
 For additional commands, e-mail: users-h...@pdfbox.apache.org






Importing FDF file issue

2015-02-23 Thread Rajeev Menon
Hello,

I am trying to import an FDF file into a PDF using PDFBox, but the PDF
file comes back with no values in the fields. I am using PDFBox version
1.8.8. From the release notes it appears that this issue was fixed in 1.8.8.

Here is the text from release notes.

[PDFBOX-2413] - Loaded FDF document returns null fields


Here is the line of code that I am using to populate the PDF.

acroForm.importFDF(fdfdoc);

I did a lot of research and troubleshooting, so I thought I would ask whether
anybody knows the status of this long-pending issue in PDFBox.

Thanks,
Rajeev.


Re: Importing FDF file issue

2015-02-23 Thread Maruan Sahyoun
Hi,

would you mind sharing your PDF template and FDF so we can have a look at it to 
replicate the issue? As the mailing list doesn't support attachments please 
upload the files to a public location.

BR
Maruan




Re: Importing FDF file issue

2015-02-23 Thread Rajeev Menon
Does that mean it is supposed to work? If that is the case, let me try to
use a simple PDF with just one field.

Thanks.







Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-23 Thread Steve Antoch
@Andreas-

I have downloaded the latest trunk and came close (it got much further) before 
failing.
However, I think I may have a fix for that failure:

The code is returning 0 when the xrefstm fixedOffset is not found.  However, 
the code still tries to load and parse from xref 0, resulting in a null 
reference exception later in parser.parse().

Thinking about this, I came up with the following:

// check for a XRef stream, it may contain some object ids of compressed objects
if (trailer.containsKey(COSName.XREF_STM))
{
    int streamOffset = trailer.getInt(COSName.XREF_STM);
    // check the xref stream reference
    fixedOffset = checkXRefStreamOffset(streamOffset, false);
    // <== fixedOffset comes back as 0 => not found
    if (fixedOffset > -1 && fixedOffset != streamOffset)
    {
        streamOffset = (int) fixedOffset;
        // <== streamOffset gets set to 0 here
        trailer.setInt(COSName.XREF_STM, streamOffset);
    }

    // I added this test because an xref stream starting at offset 0
    // can never happen, so we should simply skip it
    if (streamOffset > 0)
    {
        pdfSource.seek(streamOffset);
        skipSpaces();
        // <== this call ultimately throws a null ref exception
        //     if streamOffset == 0 on entry
        parseXrefObjStream(prev, false);
    }
}

Adding that, the file successfully parses.
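
The idea behind the guard can be pictured with a small standalone sketch: before seeking to a repaired offset, look at what actually sits there. This is plain Java and not PDFBox code; XrefOffsetCheck, classify and Kind are hypothetical names, and the object-header test is deliberately simplified (single spaces only, no comments or line breaks between the numbers).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not PDFBox code: decide what kind of
// cross-reference data, if any, starts at a given byte offset.
class XrefOffsetCheck {

    enum Kind { XREF_TABLE, OBJECT_HEADER, UNKNOWN }

    /** Classifies the bytes at the given offset without consuming them. */
    static Kind classify(byte[] pdf, long offset) {
        // Offset 0 holds the "%PDF-" header, so nothing xref-related starts there.
        if (offset <= 0 || offset >= pdf.length) {
            return Kind.UNKNOWN;
        }
        int pos = (int) offset;
        if (startsWith(pdf, pos, "xref")) {
            return Kind.XREF_TABLE;
        }
        // An xref stream is an indirect object: "<number> <generation> obj".
        int afterNumber = skipDigits(pdf, pos);
        if (afterNumber == pos || afterNumber >= pdf.length || pdf[afterNumber] != ' ') {
            return Kind.UNKNOWN;
        }
        int afterGeneration = skipDigits(pdf, afterNumber + 1);
        if (afterGeneration == afterNumber + 1 || afterGeneration >= pdf.length
                || pdf[afterGeneration] != ' ') {
            return Kind.UNKNOWN;
        }
        return startsWith(pdf, afterGeneration + 1, "obj") ? Kind.OBJECT_HEADER : Kind.UNKNOWN;
    }

    /** Returns true if the ASCII keyword appears at position pos. */
    static boolean startsWith(byte[] data, int pos, String keyword) {
        byte[] k = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos + k.length > data.length) {
            return false;
        }
        for (int i = 0; i < k.length; i++) {
            if (data[pos + i] != k[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the index of the first non-digit byte at or after pos. */
    static int skipDigits(byte[] data, int pos) {
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String sample = "%PDF-1.5\n12 0 obj\n<< /Type /XRef >>\nxref\n0 1\n";
        byte[] pdf = sample.getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(pdf, 0));
        System.out.println(classify(pdf, sample.indexOf("12 0 obj")));
        System.out.println(classify(pdf, sample.indexOf("xref")));
    }
}
```

With a check like this, an /XRefStm offset that turns out to point at an 'xref' keyword, or at nothing recognizable, could be skipped instead of being handed to the object parser.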

Also, there is this proposal that I put up on GitHub in a repo that I forked 
directly from PDFBox (it is the only change).
It relaxes the looping a bit to allow limited recursion.  I would appreciate 
your thoughts on it. 

https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd

Thank you so much!  You have been tremendously helpful.  I wish I could have 
given you the files, but unfortunately, they are proprietary and we cannot 
release them.  :-(

Best regards-
Steve



Re: Importing FDF file issue

2015-02-23 Thread Maruan Sahyoun
Honestly, I don't know, as I'm not using FDF. But I'm doing a lot with forms 
and PDFBox, so I can look into that. A test case would be great.

BR
Maruan
