Hi, Am 28.02.2015 um 17:53 schrieb Andreas Lehmkuehler <[email protected]>:
> Am 28.02.2015 um 17:49 schrieb Maruan Sahyoun: >> Hi, >> >> Am 28.02.2015 um 17:32 schrieb Andreas Lehmkuehler <[email protected]>: >> >>> Hi >>> >>> Am 28.02.2015 um 16:47 schrieb Tilman Hausherr: >>>> Hi Andrea, >>>> >>>> While a speed improvement in parsing of large files would be much >>>> appreciated >>>> (especially by the TIKA users), there are several problems with your >>>> change: >>> +1 >>> >>>> - don't do changes that need JDK7 or higher even if they are cool. We use >>>> JDK6 >>>> currently. >>>> >>>> - regressions: >>>> >>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf >>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0 >>>> at >>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696) >>>> at >>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639) >>>> at >>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600) >>>> at >>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346) >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373) >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811) >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757) >>>> at >>>> org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201) >>>> at >>>> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>> at >>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>>> at >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> >>>> at java.lang.reflect.Method.invoke(Method.java:606) >>>> at junit.framework.TestCase.runTest(TestCase.java:176) >>>> at junit.framework.TestCase.runBare(TestCase.java:141) >>>> at junit.framework.TestResult$1.protect(TestResult.java:122) >>>> at junit.framework.TestResult.runProtected(TestResult.java:142) >>>> at junit.framework.TestResult.run(TestResult.java:125) >>>> at junit.framework.TestCase.run(TestCase.java:129) >>>> at junit.framework.TestSuite.runTest(TestSuite.java:255) >>>> at junit.framework.TestSuite.run(TestSuite.java:250) >>>> at junit.textui.TestRunner.doRun(TestRunner.java:116) >>>> at junit.textui.TestRunner.start(TestRunner.java:183) >>>> at junit.textui.TestRunner.main(TestRunner.java:137) >>>> at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393) >>>> >>>> >>>> Error converting file PDFBOX-2599.pdf >>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0 >>>> at >>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696) >>>> at >>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639) >>>> at >>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600) >>>> at >>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346) >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373) >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811) >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757) >>>> at >>>> org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201) >>>> at >>>> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>> at >>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>>> at >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> >>>> at java.lang.reflect.Method.invoke(Method.java:606) >>>> at junit.framework.TestCase.runTest(TestCase.java:176) >>>> at junit.framework.TestCase.runBare(TestCase.java:141) >>>> at junit.framework.TestResult$1.protect(TestResult.java:122) >>>> at junit.framework.TestResult.runProtected(TestResult.java:142) >>>> at junit.framework.TestResult.run(TestResult.java:125) >>>> at junit.framework.TestCase.run(TestCase.java:129) >>>> at junit.framework.TestSuite.runTest(TestSuite.java:255) >>>> at junit.framework.TestSuite.run(TestSuite.java:250) >>>> at junit.textui.TestRunner.doRun(TestRunner.java:116) >>>> at junit.textui.TestRunner.start(TestRunner.java:183) >>>> at junit.textui.TestRunner.main(TestRunner.java:137) >>>> at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393) >>>> >>>> >>>> - why change only one of the members of that cosobjectkey class to int? >>>> According to the spec, both are integers. Maybe there's a good reason, but >>>> I'd >>>> like to know. >>> ASFAIK there is no good reason not to change both to int. >> >> as the offset is a 10 digit number is that really covered being an int? > It's about the object number not the offset. We are using a long for the > offset. The spec is quite clear about those numbers. They have to be integers > and the max value for an integer within a pdf is 2^31-1 due to the fact that > the assumed default platform for a conforming reader should be 32-bit. > > BTW, I've changed the object/generation number to int. Yes, but that's a should in the spec and not a shall so it's recommended but might not be followed. > >> >> BR >> Maruan >> >>> >>>> - even if you get rid of the regressions, a remaining problem is that >>>> - Andreas L. is currently working on some parser stuff in PDFBOX-2527 >>> That's not a problem. For now I'm focused on the parsing process itself and >>> am working on one last piece, the rebuild mechanism. >>> >>>> - your change is too big to evaluate (I'm speaking only for myself >>>> there). >>>> It would be better to first submit only small refactorings in PDFBOX-2576, >>>> and >>> >>> I agree. We should try to break up the patch into smaller pieces if >>> possible. Let's start with the long -> int change >>> >>>> then the optimization you mention (or the other way around). The parser is >>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics >>>> have >>>> also flagged it as too complex). I did some refactorings a few weeks ago >>>> there >>>> (splitting methods), but stopped because I couldn't come up with names for >>>> the >>>> new methods. I just didn't understand what they were doing. >>>> >>>> Tilman >>> >>> BR >>> Andreas Lehmkühler >>> >>>> >>>> Am 27.02.2015 um 16:34 schrieb Andrea Vacondio: >>>>> Hi, >>>>> few days ago I was profiling PDFBox when loading medium/large size >>>>> documents and I think I found something. >>>>> If you try loading the document >>>>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see >>>>> it takes quite some time and that's mostly spent in the >>>>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every >>>>> time >>>>> an object contained in an unparsed object stream is found, the >>>>> XrefTrailerResolver performs a full scan of the xref entries found in the >>>>> document, in this case hundreds of thousands. If the object streams are >>>>> many (like in the given doc), it performs many full scans resulting in >>>>> poor >>>>> performance. >>>>> I'm trying to get familiar with the PDFBox code and I decided to try and >>>>> fix this herehttps://github.com/torakiki/sambox/tree/xref >>>>> As you can see I refactored a bit extracting some classes and covered the >>>>> expect behaviour with unit tests. I tested it with few random docs, >>>>> loading >>>>> and saving them back and the output is exactly the same with or without my >>>>> changes. The pdf_reference_1-7.pdf doc loads in half of the time, same as >>>>> this >>>>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf >>>>> it takes half the time. Other kind of docs loads in a comparable amount of >>>>> time and even profiling memory usage it seems comparable if not a little >>>>> less. >>>>> Maybe someone wants to take a look? >>>>> >>>>> I understand my changes look a bit invasive and the issue could probably >>>>> be >>>>> fixed differently, on the other hand the couple BaseParser+COSParser looks >>>>> like a big intimidating monster to a newcomer like me and it's quite >>>>> difficult to follow the expected behaviour so I thought this might be a >>>>> chance to start breaking them down in smaller, distilled classes... >>>>> something a little more manageable and testable... anyway, grab what you >>>>> like, leave what you don't :) >>>>> >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: [email protected] >>>> For additional commands, e-mail: [email protected] >>>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [email protected] >>> For additional commands, e-mail: [email protected] >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]

