Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-03-01 Thread Tilman Hausherr

On 25.02.2015 at 00:04, Steve Antoch wrote:

We are planning on running a large breadth test on approximately 108,000 pdfs 
starting tonight.  I will let you know how this test goes.  It will take about 
4 days to complete.


If possible, it would be nice to test preflight as well. Just use the 
code below, plus the preflight-app jar, and the levigo and jai plugins. 
It will produce one .txt file in the current directory for every 
exception that isn't a ValidationException. Most problems that occur 
with rendering will happen with preflight as well, and the test is 
faster. 10 files might be done in 24 hours.


If any exceptions occur, please post them here.

Tilman


package com.mycompany.preflightmasstest;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FilenameFilter;
import java.io.PrintWriter;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.exception.ValidationException;
import org.apache.pdfbox.preflight.parser.PreflightParser;

/**
 *
 * @author Tilman Hausherr
 */
public class PreflightMassTest
{
    public static void main(String[] args) throws FileNotFoundException
    {
        File dir = new File(args[0]);

        int total = 0;
        int failed = 0;
        File[] dirList = dir.listFiles(new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                return name.toLowerCase().endsWith(".pdf");
            }
        });
        for (File pdf : dirList)
        {
            ++total;

            System.out.println(pdf.getName());
            // just test that it doesn't crash
            try
            {
                // remove any report left over from a previous run
                new File(pdf.getName() + "-exception.txt").delete();
                PreflightParser parser = new PreflightParser(pdf);
                parser.parse();
                try (PreflightDocument preflightDocument = parser.getPreflightDocument())
                {
                    preflightDocument.validate();
                    preflightDocument.getResult();
                }
                parser.close();
            }
            catch (ValidationException e)
            {
                // validation failures are expected and are not counted
            }
            catch (Throwable e)
            {
                ++failed;
                // write the stack trace to <name>-exception.txt in the current directory
                try (PrintWriter pw = new PrintWriter(new File(pdf.getName() + "-exception.txt")))
                {
                    e.printStackTrace(pw);
                }
                System.out.flush();
                System.err.flush();
                System.err.print(pdf.getName() + " preflight fail: ");
                e.printStackTrace();
                System.out.flush();
                System.err.flush();
            }
            System.out.println("total: " + total + ", failed: " + failed
                    + ", percentage failed: " + (((float) failed) / total * 100.0) + "%");
        }
    }
}





Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-26 Thread Andreas Lehmkühler
Hi,

 Steve Antoch sant...@yuzu.com wrote on 25 February 2015 at 00:04:
 
 
 Hi Andreas-
 
 Thanks again.
 
 I downloaded and built the latest from trunk.  
 There was no change for the book I was testing.  I first tried it after taking
 out my if (streamOffset > 0) test, but the null reference exception still
 occurred.
OK, thanks again for testing. I've fixed the issue based on your analysis.

 We are planning on running a large breadth test on approximately 108,000 pdfs
 starting tonight.  I will let you know how this test goes.  It will take about
 4 days to complete.
Cool, I'm looking forward to seeing the results.

 With respect to the small change I made in my fork:
 https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
 
 The issue was a separate but fairly rare failure that we found in a small
 number (about 10) of our pdfs.
 Adobe and Pdfium (Chrome) were both able to open them but pdfBox was not due
 to disallowing nesting.  I figured that if Pdfium allows 64 levels of nesting,
 we might be able to relax this test from 0 levels to allowing 1 level and see
 if it worked.  Since it did, I wanted to run those changes by you for your
 comments.
Is there any chance of getting hold of a sample PDF? It would be good enough to
send it to me via private mail:

BR
Andreas Lehmkühler

 
 Thanks-
 Steve
 
 
 From: Andreas Lehmkühler andr...@lehmi.de
 Sent: Tuesday, February 24, 2015 3:30 AM
 To: users@pdfbox.apache.org
 Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
 (or variation of it still present)
 
 Hi Steve,
 
  Steve Antoch sant...@yuzu.com wrote on 23 February 2015 at 19:42:
 
 
  @Andreas-
 
  I have downloaded the latest trunk and came close (it got much further)
  before
  failing.
  However, I think I may have a fix for that failure:
 Thanks for the test
 
  The code is returning 0 when the xrefstm fixedOffset is not found.  However,
  the code still tries to load and parse from xref 0, resulting in a null
  reference exception later in parser.parse().
 Your analysis is correct, but I hope that my latest improvements eliminate
 such cases; see PDFBOX-2572 for details. Could you give the latest trunk
 (r1661747) a try?
 
  However, thinking about this, I came up with this:
 
  // check for a XRef stream, it may contain some object ids of compressed objects
  if (trailer.containsKey(COSName.XREF_STM))
  {
      int streamOffset = trailer.getInt(COSName.XREF_STM);
      // check the xref stream reference
      fixedOffset = checkXRefStreamOffset(streamOffset, false);
      // <== fixedOffset comes back as 0 = not found
      if (fixedOffset > -1 && fixedOffset != streamOffset)
      {
          streamOffset = (int) fixedOffset;
          // <== streamOffset gets set to 0 here
          trailer.setInt(COSName.XREF_STM, streamOffset);
      }

      // <== I added this test because an xref stream starting at
      //     offset 0 can never happen, so we should simply skip it
      if (streamOffset > 0)
      {
          pdfSource.seek(streamOffset);
          skipSpaces();
          // <== this call ultimately throws a null ref exception
          //     if streamOffset == 0 on entry
          parseXrefObjStream(prev, false);
      }
  }
 
  Adding that, the file successfully parses.
 
  Also, there was this proposal that I put up on github in a repo that I
  directly forked from pdfbox (it is the only change)
  It relaxes the looping a bit to allow limited recursion.  I would appreciate
  your thoughts on it.
 Is this change related to the discussed issue above?
 
  https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
 
  Thank you so much!  You have been tremendously helpful.  I wish I could have
  given you the files, but unfortunately, they are proprietary and we cannot
  release them.  :-(
 No need to worry, you are not the only one who is not allowed to share a
 specific pdf.
 
  Best regards-
  Steve
 
 BR
 Andreas Lehmkühler
 
 
  

Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-23 Thread Andreas Lehmkühler
Hi,

I've improved the self-repair mechanism of the trunk based on Steve's report.

@Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue still
persist?

BR
Andreas Lehmkühler

 Steve Antoch sant...@yuzu.com wrote on 17 February 2015 at 00:05:
 
 
 
 Andreas-
 Thanks for the response.
 Sorry for sending directly.
 
 Yes, it tries to read from offset 112085940, but does not find the xrefstm
 there, so 
 that's when it goes searching.  It seems to be landing in the middle of
 something else (perhaps an image?)
 
 I tried running the preflight command on the file, and this is what it found
 there.
 This is in the middle of a whole series of repetitive byte patterns like
 these, which is interspersed with other sections of content that is also
 binary only.
 
 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 <preflight name="file.pdf">
   <executionTimeMS>2646</executionTimeMS>
   <isValid type="false"></isValid>
   <errors count="1">
     <error count="1">
       <code>1.0</code>
       <details>Syntax error, Error: Expected a long type at offset 112085940,
 instead got
 '6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'</details>
     </error>
   </errors>
 </preflight>
 
 The patterns seem to be:
 
 lots of these: 6lÙ³fÍ#155;
 interspersed between blocks that are similar to this:
 ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'
 
 It just so happens that the offset 112085940 falls right in the middle of a
 big block of those 6lÙ³fÍ#155; repetitive blocks.
 
 Not sure if that's any help. 
 
 Steve
 
 

Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-23 Thread Steve Antoch
@Andreas-

I have downloaded the latest trunk and came close (it got much further) before 
failing.
However, I think I may have a fix for that failure:

The code is returning 0 when the xrefstm fixedOffset is not found.  However, 
the code still tries to load and parse from xref 0, resulting in a null 
reference exception later in parser.parse().

However, thinking about this, I came up with this:

// check for a XRef stream, it may contain some object ids of compressed objects
if (trailer.containsKey(COSName.XREF_STM))
{
    int streamOffset = trailer.getInt(COSName.XREF_STM);
    // check the xref stream reference
    fixedOffset = checkXRefStreamOffset(streamOffset, false);
    // <== fixedOffset comes back as 0 = not found
    if (fixedOffset > -1 && fixedOffset != streamOffset)
    {
        streamOffset = (int) fixedOffset;
        // <== streamOffset gets set to 0 here
        trailer.setInt(COSName.XREF_STM, streamOffset);
    }

    // <== I added this test because an xref stream starting at
    //     offset 0 can never happen, so we should simply skip it
    if (streamOffset > 0)
    {
        pdfSource.seek(streamOffset);
        skipSpaces();
        // <== this call ultimately throws a null ref exception
        //     if streamOffset == 0 on entry
        parseXrefObjStream(prev, false);
    }
}

Adding that, the file successfully parses.
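
The decision made by that guard can be shown in isolation. The following is only a minimal sketch and not PDFBox code: the class XRefStmGuard and the method resolveStreamOffset are hypothetical names, and the convention that the repair check reports "not found" as 0 and "no correction" as a negative value is an assumption based on the trace comments above.

```java
import java.util.OptionalLong;

/** Hypothetical illustration of the offset guard; not part of PDFBox. */
final class XRefStmGuard
{
    /**
     * @param declaredOffset the /XRefStm value read from the trailer
     * @param fixedOffset    result of the repair check; assumed here to be 0 when
     *                       nothing was found and negative when no correction applies
     * @return the offset to parse, or empty if the xref stream should be skipped
     */
    static OptionalLong resolveStreamOffset(long declaredOffset, long fixedOffset)
    {
        long streamOffset = declaredOffset;
        if (fixedOffset > -1 && fixedOffset != streamOffset)
        {
            // the repair check disagreed with the trailer, so its answer wins
            streamOffset = fixedOffset;
        }
        // a PDF begins with its "%PDF-" header, so an xref stream can never start at 0
        return streamOffset > 0 ? OptionalLong.of(streamOffset) : OptionalLong.empty();
    }

    public static void main(String[] args)
    {
        // repair check found nothing: the stream is skipped instead of parsed at 0
        System.out.println(resolveStreamOffset(112085940L, 0L));
        // no correction reported: the declared offset is used unchanged
        System.out.println(resolveStreamOffset(112085940L, -1L));
    }
}
```

Returning an empty result instead of 0 makes "skip the stream" an explicit outcome, so a caller cannot seek to offset 0 by accident.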

Also, there was this proposal that I put up on github in a repo that I directly 
forked from pdfbox (it is the only change)
It relaxes the looping a bit to allow limited recursion.  I would appreciate 
your thoughts on it. 

https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd

Thank you so much!  You have been tremendously helpful.  I wish I could have 
given you the files, but unfortunately, they are proprietary and we cannot 
release them.  :-(

Best regards-
Steve



Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-16 Thread Andreas Lehmkühler
Hi,

 Steve Antoch sant...@yuzu.com wrote on 13 February 2015 at 23:34:
 
 
 
 Hi Tilman and Andreas--
Please don't contact developers directly, use our mailing lists instead. I've
put the users list back into the boat...

 I am working with Krasimir on this issue.
 
 Although we asked, we were denied permission to send the document out.
:-(

 The failure is being triggered when we attempt to use the Encrypt() class to
 password protect the pdf.
 We end up with the Expected a long type at offset 113884174, instead got
 'xref' failure.
 
 I have debugged into the PDFBox code and found the offending parts.
 
 PdfBox is  trying to parse an xref table located at 113884174. 
 
 The problem we are seeing is that inside the trailer it finds the /XRefStm
 label, and its offset value is returned as 112085940 (which is what is given
 in the file).
 However, the checkXRefOffset() call made to verify it doesn't find the xref
 stream there, so it goes searching and ends up returning the closest xref
 offset it can find, which happens to be its own offset at
 113884174.
 
 
 I believe that there is an error in PdfBox with respect to this fixup logic,
 even if it had found the 'correct' xref stream.
 That is because the fixup offset can NEVER work.  Every time it fixes up the
 location, it lands on a section which begins with xref.
 The next call is to skip the whitespace, but since there is never any there
 (it's already proven to be 'xref'),  it does not advance the input stream. 
 Then, the first call to parse that xrefstm always calls readObjectID(), which
 always will throw the exception because the bytes are always 'xref'.
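
 The mismatch described above can be illustrated with a small probe. This is a sketch under stated assumptions, not PDFBox code: XrefTargetProbe, Kind and classify are hypothetical names. It relies only on two facts from the PDF format: a classic cross-reference table begins with the keyword xref, while a cross-reference stream is an indirect object and so begins with an "N G obj" header. An offset that lands on xref therefore cannot be read as an object ID.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical probe that reports what kind of structure starts at an offset. */
final class XrefTargetProbe
{
    enum Kind { XREF_TABLE, OBJECT_HEADER, UNKNOWN }

    // "<object number> <generation> obj", the start of any indirect object
    private static final Pattern OBJECT_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    static Kind classify(byte[] data, int offset)
    {
        if (offset < 0 || offset >= data.length)
        {
            return Kind.UNKNOWN;
        }
        // look at a short window only; ISO-8859-1 maps every byte to one char
        int end = Math.min(data.length, offset + 32);
        String window = new String(data, offset, end - offset, StandardCharsets.ISO_8859_1);
        if (window.startsWith("xref"))
        {
            return Kind.XREF_TABLE;
        }
        if (OBJECT_HEADER.matcher(window).lookingAt())
        {
            return Kind.OBJECT_HEADER;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args)
    {
        String text = "12 0 obj\n<< /Type /XRef >>\nxref\n0 1\n";
        byte[] sample = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(classify(sample, 0));
        System.out.println(classify(sample, text.indexOf("xref")));
    }
}
```

 A repair routine could use such a check to reject a candidate /XRefStm offset that points at a classic table, instead of handing it to the object parser.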
 
 So, my questions are:
 
 1) Are these docs fixable or are they truly corrupt?
Without having the PDF itself at hand, it's hard to give a 100% answer. But I
guess there has to be a fix, as Adobe is able to open that PDF. I'll try to find
one, following your description of the PDF.

 2) Is this xref issue a known issue with PdfBox?  I would try to create a
 document that displays the error but I honestly don't know how to do so (beyond
 sending the ones that we have that DO display it).
Not until now

 3) Do you have any idea how these documents end up in this state if they are
 being edited by tools such as InDesign, Acrobat, etc? Is there something I can
 do to identify them?  
There are a lot of more or less corrupt files in the wild. Those are created
using different tools.

 4) If this is a truly corrupted document, why would Acrobat be able to open
 these files but pdfBox cannot?  Are these streams somehow ignorable?  I ask
 this because I saw this statement on a web page
  (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/) when
 I was searching for answers on this:
Adobe implements a lot of self healing mechanisms to repair broken files and we
try to do so too, but it's complicated.

 – /XrefStm [integer]: specifies the offset from the beginning of the file to
 the cross-reference stream in the decoded stream. This is only present in
 hybrid-reference files, which is specified if we would also like to open
 documents even if the applications  don’t support compressed reference
 streams.
 
 Any light you can shed on this is appreciated.
 
 Thanks-
 Steve
 
 
 See below for the pertinent data and the code which is marked with the values
 as I traced through.
 
 I have confirmed that the byte offset of the word xref below is exactly at
 113884174.

Does the xref stream start at 112085940 (stream offset from the trailer
dictionary) or what did you find at that offset? 


 xref
 0 53641
 00 65535 f
 17 0 n
 
 <massive snip/>
 
 
 trailer
 <<
 /Size 53641
 /Root 1 0 R
 /XRefStm 112085940
 /Info 8 0 R
 /ID [<19790A83488211E283B50017F203355C> <E3DF7097A16969B08238787F19E7E219>]
 >>
 startxref
 113884174
 %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
 R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
 endobj
 
 
     protected COSDictionary parseXref(long startXRefOffset) throws IOException
     {
         pdfSource.seek(startXRefOffset);
         long startXrefOffset = parseStartXref();
         // check the startxref offset
         long fixedOffset = checkXRefOffset(startXrefOffset);
         if (fixedOffset > -1)
         {
             startXrefOffset = fixedOffset;
         }
         document.setStartXref(startXrefOffset);
         long prev = startXrefOffset;
         // parse whole chain of xref tables/object streams using PREV reference
         while (prev > -1) // <== prev here is 113884174.
         {
             // seek to xref table
             pdfSource.seek(prev);

             // skip white spaces
             skipSpaces();
             // -- parse xref
             if (pdfSource.peek() == X)
             {
                 // xref table and trailer
                 // use existing parser to parse xref table
                 parseXrefTable(prev);
                 // parse 

https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-09 Thread Krasimir Aleksandrov
ATT: Andreas Lehmkühler, Ilya Vasiuk

Hello,
My team is using a snapshot from /trunk of PDFBox and I'm seeing an instance (or
variation) of https://issues.apache.org/jira/browse/PDFBOX-2523 still present
even with the non-sequential parser. My stack trace looks exactly the same as
the one reported by Ilya, but instead of "instead got " I have "instead got
'xref'" (and the offset is different, of course). I have validated that I'm
using the new parser (I'm using load(), which uses the new parser; the
non-sequential parser is now hosted in COSParser.java, if I'm not mistaken).
The PDFs that report that error can be opened in Acrobat and PDFium-based
renderers.
Thoughts?

Thanks,
Krasimir