Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)
On 25.02.2015 at 00:04, Steve Antoch wrote:

> We are planning on running a large breadth test on approximately 108,000 pdfs starting tonight. I will let you know how this test goes. It will take about 4 days to complete.

If possible, it would be nice to test preflight as well. Just use the code below, plus the preflight-app jar and the levigo and jai plugins. It writes one .txt file to the current directory for every exception that isn't a ValidationException.

Most problems that occur with rendering will happen with preflight as well, and the test is faster. 10 files might be done in 24 hours.

If any exceptions occur, please post them here.

Tilman

    package com.mycompany.preflightmasstest;

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FilenameFilter;
    import java.io.PrintWriter;
    import org.apache.pdfbox.preflight.PreflightDocument;
    import org.apache.pdfbox.preflight.exception.ValidationException;
    import org.apache.pdfbox.preflight.parser.PreflightParser;

    /**
     * @author Tilman Hausherr
     */
    public class PreflightMassTest {
        public static void main(String[] args) throws FileNotFoundException {
            File dir = new File(args[0]);
            int total = 0;
            int failed = 0;
            // only look at files whose name ends in .pdf (case-insensitive)
            File[] dirList = dir.listFiles(new FilenameFilter() {
                @Override
                public boolean accept(File dir, String name) {
                    return name.toLowerCase().endsWith(".pdf");
                }
            });
            for (File pdf : dirList) {
                ++total;
                System.out.println(pdf.getName());
                // just test that it doesn't crash
                try {
                    // remove the report left over from a previous run
                    new File(pdf.getName() + "-exception.txt").delete();
                    PreflightParser parser = new PreflightParser(pdf);
                    parser.parse();
                    try (PreflightDocument preflightDocument = parser.getPreflightDocument()) {
                        preflightDocument.validate();
                        preflightDocument.getResult();
                    }
                    parser.close();
                } catch (ValidationException e) {
                    // validation failures are expected and not counted
                } catch (Throwable e) {
                    ++failed;
                    // write the stack trace to <name>-exception.txt in the current directory
                    try (PrintWriter pw = new PrintWriter(new File(pdf.getName() + "-exception.txt"))) {
                        e.printStackTrace(pw);
                    }
                    System.out.flush();
                    System.err.flush();
                    System.err.print(pdf.getName() + " preflight fail: ");
                    e.printStackTrace();
                    System.out.flush();
                    System.err.flush();
                }
                System.out.println("total: " + total + ", failed: " + failed
                        + ", percentage failed: " + (((float) failed) / total * 100.0) + "%");
            }
        }
    }
Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)
Hi,

Steve Antoch sant...@yuzu.com wrote on 25 February 2015 at 00:04:

> Hi Andreas-
> Thanks again. I downloaded and built the latest from trunk. There was no change for the book I was testing. I first tried it after taking out my if (streamOffset > 0) test, but the null reference exception still occurred.

OK, thanks again for testing. I've fixed the issue based on your analysis.

> We are planning on running a large breadth test on approximately 108,000 pdfs starting tonight. I will let you know how this test goes. It will take about 4 days to complete.

Cool, I'm looking forward to seeing the results.

> With respect to the small change I made in my fork:
> https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
> The issue was a separate but fairly rare failure that we found in a small number (about 10) of our pdfs. Adobe and Pdfium (Chrome) were both able to open them but pdfBox was not, due to disallowing nesting. I figured that if Pdfium allows 64 levels of nesting, we might be able to relax this test from 0 levels to allowing 1 level and see if it worked. Since it did, I wanted to run those changes by you for your comments.

Is there any chance of getting hold of a sample pdf? It would be good enough to send it to me via private mail.

BR
Andreas Lehmkühler

> Thanks-
> Steve
>
> From: Andreas Lehmkühler andr...@lehmi.de
> Sent: Tuesday, February 24, 2015 3:30 AM
> To: users@pdfbox.apache.org
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)
>
> Hi Steve,
>
> Steve Antoch sant...@yuzu.com wrote on 23 February 2015 at 19:42:
>
>> @Andreas- I have downloaded the latest trunk and came close (it got much further) before failing. However, I think I may have a fix for that failure:
>
> Thanks for the test.
>
>> The code is returning 0 when the xrefstm fixedOffset is not found. However, the code still tries to load and parse from xref 0, resulting in a null reference exception later in parser.parse().
>
> Your analysis is correct, but I hope that my last improvements should eliminate such cases, see PDFBOX-2572 for details. Could you give the latest trunk (r1661747) a try?
>
>> However, thinking about this, I came up with this:
>>
>>     // check for a XRef stream, it may contain some object ids of compressed objects
>>     if (trailer.containsKey(COSName.XREF_STM))
>>     {
>>         int streamOffset = trailer.getInt(COSName.XREF_STM);
>>         // check the xref stream reference
>>         fixedOffset = checkXRefStreamOffset(streamOffset, false); // <== fixedOffset comes back as 0 = not found
>>         if (fixedOffset > -1 && fixedOffset != streamOffset)
>>         {
>>             streamOffset = (int)fixedOffset; // <== streamOffset gets set to 0 here
>>             trailer.setInt(COSName.XREF_STM, streamOffset);
>>         }
>>         if (streamOffset > 0) // I added this test because an xref stream starting at
>>                               // offset 0 can never happen, so we should simply skip it
>>         {
>>             pdfSource.seek(streamOffset);
>>             skipSpaces();
>>             parseXrefObjStream(prev, false); // <== this call ultimately throws a null ref exception if streamOffset == 0 on entry
>>         }
>>     }
>>
>> Adding that, the file successfully parses.
>>
>> Also, there was this proposal that I put up on github in a repo that I directly forked from pdfbox (it is the only change). It relaxes the looping a bit to allow limited recursion. I would appreciate your thoughts on it.
>
> Is this change related to the issue discussed above?
>
>> https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
>>
>> Thank you so much! You have been tremendously helpful. I wish I could have given you the files, but unfortunately, they are proprietary and we cannot release them. :-(
>
> No need to worry, you are not the only one who is not allowed to share a specific pdf.
>
>> Best regards-
>> Steve
>
> BR
> Andreas Lehmkühler
Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)
Hi,

I've improved the self-repair mechanism of the trunk based on Steve's report.

@Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue still persist?

BR
Andreas Lehmkühler

Steve Antoch sant...@yuzu.com wrote on 17 February 2015 at 00:05:

> Andreas-
> Thanks for the response. Sorry for sending directly.
>
> Yes, it tries to read from offset 112085940, but does not find the xrefstm there, so that's when it goes searching. It seems to be landing in the middle of something else (perhaps an image?).
>
> I tried running the preflight command on the file, and this is what it found there. This is in the middle of a whole series of repetitive byte patterns like these, which are interspersed with other sections of content that are also binary only.
>
>     <?xml version="1.0" encoding="UTF-8" standalone="no"?>
>     <preflight name="file.pdf">
>       <executionTimeMS>2646</executionTimeMS>
>       <isValid type="false"></isValid>
>       <errors count="1">
>         <error count="1">
>           <code>1.0</code>
>           <details>Syntax error, Error: Expected a long type at offset 112085940, instead got '6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'</details>
>         </error>
>       </errors>
>     </preflight>
>
> The patterns seem to be lots of these:
>
>     6lÙ³fÍ#155;
>
> interspersed between blocks that are similar to this:
>
>     ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'
>
> It just so happens that the offset 112085940 falls right in the middle of a big block of those repetitive 6lÙ³fÍ#155; patterns. Not sure if that's any help.
>
> Steve
Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)
@Andreas-

I have downloaded the latest trunk and came close (it got much further) before failing. However, I think I may have a fix for that failure.

The code is returning 0 when the xrefstm fixedOffset is not found. However, the code still tries to load and parse from xref 0, resulting in a null reference exception later in parser.parse().

Thinking about this, I came up with the following:

    // check for a XRef stream, it may contain some object ids of compressed objects
    if (trailer.containsKey(COSName.XREF_STM))
    {
        int streamOffset = trailer.getInt(COSName.XREF_STM);
        // check the xref stream reference
        fixedOffset = checkXRefStreamOffset(streamOffset, false); // <== fixedOffset comes back as 0 = not found
        if (fixedOffset > -1 && fixedOffset != streamOffset)
        {
            streamOffset = (int)fixedOffset; // <== streamOffset gets set to 0 here
            trailer.setInt(COSName.XREF_STM, streamOffset);
        }
        if (streamOffset > 0) // I added this test because an xref stream starting at
                              // offset 0 can never happen, so we should simply skip it
        {
            pdfSource.seek(streamOffset);
            skipSpaces();
            parseXrefObjStream(prev, false); // <== this call ultimately throws a null ref exception if streamOffset == 0 on entry
        }
    }

Adding that, the file successfully parses.

Also, there was this proposal that I put up on github in a repo that I directly forked from pdfbox (it is the only change). It relaxes the looping a bit to allow limited recursion. I would appreciate your thoughts on it.

https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd

Thank you so much! You have been tremendously helpful. I wish I could have given you the files, but unfortunately, they are proprietary and we cannot release them. :-(

Best regards-
Steve
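For readers following along without the PDFBox source at hand, the decision Steve describes can be isolated into a few lines. This is only a sketch under stated assumptions: the class and method names are hypothetical and not part of PDFBox, and the `> -1 &&` comparison is a reading of the snippet in his mail, whose operators were lost in the archive. It takes the /XRefStm value from the trailer plus the result of the repair lookup and says whether there is anything worth parsing.

```java
import java.util.OptionalLong;

/** Hypothetical helper (not PDFBox API): decides whether an /XRefStm offset is usable. */
final class XRefStmOffsetGuard {
    private XRefStmOffsetGuard() {}

    /**
     * @param declaredOffset the /XRefStm value read from the trailer dictionary
     * @param fixedOffset    the result of the repair lookup; in this thread 0 meant "not found"
     * @return the offset to parse from, or empty if the xref stream should be skipped
     */
    static OptionalLong usableOffset(long declaredOffset, long fixedOffset) {
        long offset = declaredOffset;
        // Adopt the repaired offset only if the lookup returned a different, non-negative value.
        if (fixedOffset > -1 && fixedOffset != declaredOffset) {
            offset = fixedOffset;
        }
        // A PDF file begins with its "%PDF-" header, so an xref stream can never start at offset 0.
        return offset > 0 ? OptionalLong.of(offset) : OptionalLong.empty();
    }
}
```

With the values from this thread, `usableOffset(112085940L, 0L)` comes back empty, so the caller would skip the xref stream instead of seeking to offset 0.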
Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)
Hi,

Steve Antoch sant...@yuzu.com wrote on 13 February 2015 at 23:34:

> Hi Tilman and Andreas--

Please don't contact developers directly, use our mailing lists instead. I've put the users list back into the boat...

> I am working with Krasimir on this issue. Although we asked, we were denied permission to send the document out. :-(
>
> The failure is being triggered when we attempt to use the Encrypt() class to password protect the pdf. We end up with the "Expected a long type at offset 113884174, instead got 'xref'" failure.
>
> I have debugged into the PDFBox code and found the offending parts. PdfBox is trying to parse an xref table located at 113884174. The problem we are seeing is that inside the trailer it finds the /XRefStm label, and its offset value is returned as 112085940 (which is what is given in the file). However, the checkXRefOffset() call made to verify it doesn't find the xref stream there, so it goes searching and ends up returning the closest xref offset it can find, which happens to be its own offset at 113884174.
>
> I believe that there is an error in PdfBox with respect to this fixup logic, even if it had found the 'correct' xref stream. That is because the fixup offset can NEVER work. Every time it fixes up the location, it lands on a section which begins with xref. The next call is to skip the whitespace, but since there is never any there (it's already proven to be 'xref'), it does not advance the input stream. Then, the first call to parse that xrefstm always calls readObjectID(), which will always throw the exception because the bytes are always 'xref'.
>
> So, my questions are:
>
> 1) Are these docs fixable or are they truly corrupt?

Without getting hold of the pdf itself it's hard to give a 100% answer. But I guess there has to be a fix, as Adobe is able to open that pdf. I'll try to find one, following your description of the pdf.

> 2) Is this xref issue a known issue with PdfBox? I would try to create a document that displays the error but I honestly don't know how to do so (beyond sending the ones that we have that DO display it).

Not until now.

> 3) Do you have any idea how these documents end up in this state if they are being edited by tools such as InDesign, Acrobat, etc? Is there something I can do to identify them?

There are a lot of more or less corrupt files in the wild. Those are created using different tools.

> 4) If this is a truly corrupted document, why would Acrobat be able to open these files but pdfBox cannot? Are these streams somehow ignorable? I ask this because I saw this statement on a web page (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/) when I was searching for answers on this:

Adobe implements a lot of self-healing mechanisms to repair broken files and we try to do so too, but it's complicated.

> – /XrefStm [integer]: specifies the offset from the beginning of the file to the cross-reference stream in the decoded stream. This is only present in hybrid-reference files, which is specified if we would also like to open documents even if the applications don’t support compressed reference streams.
>
> Any light you can shed on this is appreciated.
>
> Thanks-
> Steve
>
> See below for the pertinent data and the code, which is marked with the values as I traced through. I have confirmed that the byte offset of the word xref below is exactly at 113884174.

Does the xref stream start at 112085940 (stream offset from the trailer dictionary) or what did you find at that offset?

>     xref
>     0 53641
>     00 65535 f
>     17 0 n
>     <massive snip/>
>     trailer
>     << /Size 53641 /Root 1 0 R /XRefStm 112085940 /Info 8 0 R /ID [<19790A83488211E283B50017F203355C> <E3DF7097A16969B08238787F19E7E219>] >>
>     startxref
>     113884174
>     %%EOF
>     1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0 R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>> endobj
>
>     protected COSDictionary parseXref(long startXRefOffset) throws IOException
>     {
>         pdfSource.seek(startXRefOffset);
>         long startXrefOffset = parseStartXref();
>         // check the startxref offset
>         long fixedOffset = checkXRefOffset(startXrefOffset);
>         if (fixedOffset > -1)
>         {
>             startXrefOffset = fixedOffset;
>         }
>         document.setStartXref(startXrefOffset);
>         long prev = startXrefOffset;
>         // parse whole chain of xref tables/object streams using PREV reference
>         while (prev > -1) // <== prev here is 113884174.
>         {
>             // seek to xref table
>             pdfSource.seek(prev);
>             // skip white spaces
>             skipSpaces();
>             // -- parse xref
>             if (pdfSource.peek() == X)
>             {
>                 // xref table and trailer
>                 // use existing parser to parse xref table
>                 parseXrefTable(prev);
>                 // parse
https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)
ATT: Andreas Lehmkühler, Ilya Vasiuk

Hello,

My team is using a snapshot from /trunk of PDFBox and I'm seeing an instance (or variation) of https://issues.apache.org/jira/browse/PDFBOX-2523 still present, even with the non-sequential parser. My stack looks exactly the same as the one reported by Ilya, except that the "instead got" part of my message reads "instead got 'xref'" (and the offset is different, of course).

I have validated that I'm using the new parser: I'm calling load(), which uses it (the non-sequential parser is now hosted in cosparser.java, if I'm not mistaken).

The PDFs that report that error can be opened in Acrobat and in PDFium-based renderers.

Thoughts?

Thanks,
Krasimir
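The "instead got 'xref'" message in this thread comes down to what the bytes at a given offset look like: a classic cross-reference table starts with the keyword xref, while a cross-reference stream is an indirect object and so starts with an "N G obj" header. The following is a rough, self-contained sketch for checking that on a byte buffer. It does not use or mirror PDFBox code; the class, enum, and method names are made up for illustration, and an "N G obj" header only shows that an object starts there, not that it is an xref stream.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical probe (not PDFBox API): reports what kind of construct starts at an offset. */
final class XRefTargetProbe {
    enum Kind { XREF_TABLE, INDIRECT_OBJECT, OTHER }

    private XRefTargetProbe() {}

    static Kind classify(byte[] data, int offset) {
        if (offset < 0 || offset >= data.length) {
            return Kind.OTHER;
        }
        int pos = skipWhile(data, offset, true);
        // A classic cross-reference table begins with the keyword "xref".
        if (startsWith(data, pos, "xref")) {
            return Kind.XREF_TABLE;
        }
        // An indirect object begins with "<object number> <generation> obj".
        int afterNum = skipWhile(data, pos, false);
        int afterGap1 = skipWhile(data, afterNum, true);
        int afterGen = skipWhile(data, afterGap1, false);
        int afterGap2 = skipWhile(data, afterGen, true);
        boolean shaped = afterNum > pos && afterGap1 > afterNum
                && afterGen > afterGap1 && afterGap2 > afterGen;
        return shaped && startsWith(data, afterGap2, "obj") ? Kind.INDIRECT_OBJECT : Kind.OTHER;
    }

    // Advances past PDF whitespace (whitespace == true) or ASCII digits (whitespace == false).
    private static int skipWhile(byte[] data, int pos, boolean whitespace) {
        while (pos < data.length
                && (whitespace ? isPdfWhitespace(data[pos]) : isDigit(data[pos]))) {
            pos++;
        }
        return pos;
    }

    private static boolean isPdfWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }

    private static boolean startsWith(byte[] data, int pos, String keyword) {
        byte[] k = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos + k.length > data.length) {
            return false;
        }
        for (int i = 0; i < k.length; i++) {
            if (data[pos + i] != k[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If a repair step that is looking for an xref stream lands on an offset this probe classifies as XREF_TABLE, reading an object number there will fail, which matches the symptom described in the report.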