Re: Xref parsing performance

Maruan Sahyoun Sat, 28 Feb 2015 09:11:29 -0800

Hi,

On 28.02.2015 at 17:53, Andreas Lehmkuehler <[email protected]> wrote:


> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>> Hi,
>> 
>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <[email protected]> wrote:
>> 
>>> Hi
>>> 
>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>> Hi Andrea,
>>>> 
>>>> While a speed improvement in parsing of large files would be much appreciated
>>>> (especially by the TIKA users), there are several problems with your change:
>>> +1
>>> 
>>>> - don't make changes that need JDK7 or higher, even if they are cool. We
>>>> currently use JDK6.
>>>> 
>>>> - regressions:
>>>> 
>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>     at junit.framework.TestResult.run(TestResult.java:125)
>>>>     at junit.framework.TestCase.run(TestCase.java:129)
>>>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>> 
>>>> 
>>>> Error converting file PDFBOX-2599.pdf
>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>     at junit.framework.TestResult.run(TestResult.java:125)
>>>>     at junit.framework.TestCase.run(TestCase.java:129)
>>>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>> 
>>>> 
>>>> - why change only one of the members of the COSObjectKey class to int?
>>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>>> like to know.
>>> AFAIK there is no good reason not to change both to int.
>> 
>> As the offset is a 10-digit number, is that really covered by an int?
> It's about the object number not the offset. We are using a long for the 
> offset. The spec is quite clear about those numbers. They have to be integers 
> and the max value for an integer within a pdf is 2^31-1 due to the fact that 
> the assumed default platform for a conforming reader should be 32-bit.
> 
> BTW, I've changed the object/generation number to int.

Yes, but that's a "should" in the spec and not a "shall", so it's recommended
but might not be followed.
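
To make the int vs. long point above concrete, here is a minimal Java sketch. The
class and method names (XrefNumberRanges, fitsInInt) are made up for illustration
and are not part of PDFBox. It only relies on two things mentioned in this thread:
an xref offset is a 10-digit number, and 2^31-1 is the largest value an int can hold.

```java
class XrefNumberRanges {
    // Largest value a 10-digit decimal offset field can express
    // (hypothetical constant, for illustration only).
    static final long MAX_TEN_DIGIT_OFFSET = 9999999999L;

    // True if the value can be stored in a Java int without overflow.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // A 10-digit offset can exceed 2^31-1, so offsets need a long.
        System.out.println(fitsInInt(MAX_TEN_DIGIT_OFFSET)); // prints false
        // An object number within the 2^31-1 limit fits in an int.
        System.out.println(fitsInInt(Integer.MAX_VALUE));    // prints true
    }
}
```

Whether a parser should also tolerate object numbers above that limit is exactly
the "should" vs. "shall" question; the sketch does not settle it.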


> 
>> 
>> BR
>> Maruan
>> 
>>> 
>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>> That's not a problem. For now I'm focused on the parsing process itself and 
>>> am working on one last piece, the rebuild mechanism.
>>> 
>>>>    - your change is too big to evaluate (I'm speaking only for myself there).
>>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>> 
>>> I agree. We should try to break up the patch into smaller pieces if 
>>> possible. Let's start with the long -> int change
>>> 
>>>> then the optimization you mention (or the other way around). The parser is
>>>> indeed a tricky part of the code (and SonarQube and Software Diagnostics have
>>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>>> (splitting methods), but stopped because I couldn't come up with names for the
>>>> new methods. I just didn't understand what they were doing.
>>>> 
>>>> Tilman
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>>>> 
>>>> On 27.02.2015 at 16:34, Andrea Vacondio wrote:
>>>>> Hi,
>>>>> a few days ago I was profiling PDFBox while loading medium/large
>>>>> documents and I think I found something.
>>>>> If you try loading the document
>>>>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
>>>>> it takes quite some time, and that time is mostly spent in
>>>>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
>>>>> an object contained in an unparsed object stream is found, the
>>>>> XrefTrailerResolver performs a full scan of the xref entries found in the
>>>>> document, in this case hundreds of thousands. If there are many object
>>>>> streams (as in the given doc), it performs many full scans, resulting in
>>>>> poor performance.
>>>>> I'm trying to get familiar with the PDFBox code and I decided to try to
>>>>> fix this here: https://github.com/torakiki/sambox/tree/xref
>>>>> As you can see, I refactored a bit, extracting some classes, and covered the
>>>>> expected behaviour with unit tests. I tested it with a few random docs,
>>>>> loading and saving them back, and the output is exactly the same with or
>>>>> without my changes. The pdf_reference_1-7.pdf doc loads in half the time,
>>>>> and so does this one:
>>>>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>> Other kinds of docs load in a comparable amount of time, and memory usage
>>>>> under profiling also seems comparable, if not a little lower.
>>>>> Maybe someone wants to take a look?
>>>>> 
>>>>> I understand my changes look a bit invasive and the issue could probably be
>>>>> fixed differently. On the other hand, the BaseParser+COSParser pair looks
>>>>> like a big intimidating monster to a newcomer like me, and it's quite
>>>>> difficult to follow the expected behaviour, so I thought this might be a
>>>>> chance to start breaking them down into smaller, distilled classes...
>>>>> something a little more manageable and testable... anyway, grab what you
>>>>> like, leave what you don't  :)
>>>>> 
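
The bottleneck described in the original message (one full scan of all xref
entries per object stream) can be avoided by grouping the entries once and then
answering each lookup from a map. The following is only a sketch under stated
assumptions: it is neither the sambox patch nor the actual PDFBox code. The class
ObjectStreamIndex and its input map (object number -> number of the containing
object stream) are hypothetical; only the method name getContainedObjectNumbers
is taken from the thread, and its signature here is assumed. It avoids JDK7-only
syntax.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ObjectStreamIndex {
    // object stream number -> object numbers stored inside that stream
    private final Map<Integer, Set<Integer>> byStream =
            new HashMap<Integer, Set<Integer>>();

    // compressedEntries is a hypothetical input:
    // object number -> number of the object stream that contains it.
    ObjectStreamIndex(Map<Integer, Integer> compressedEntries) {
        // Single pass over all entries; done once, not once per lookup.
        for (Map.Entry<Integer, Integer> entry : compressedEntries.entrySet()) {
            Set<Integer> contained = byStream.get(entry.getValue());
            if (contained == null) {
                contained = new TreeSet<Integer>();
                byStream.put(entry.getValue(), contained);
            }
            contained.add(entry.getKey());
        }
    }

    // Map lookup instead of a full scan of the xref entries.
    Set<Integer> getContainedObjectNumbers(int objectStreamNumber) {
        Set<Integer> contained = byStream.get(objectStreamNumber);
        if (contained == null) {
            return Collections.<Integer>emptySet();
        }
        return Collections.unmodifiableSet(contained);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> entries = new HashMap<Integer, Integer>();
        entries.put(10, 5); // object 10 lives in object stream 5
        entries.put(11, 5); // object 11 lives in object stream 5
        entries.put(20, 7); // object 20 lives in object stream 7
        ObjectStreamIndex index = new ObjectStreamIndex(entries);
        System.out.println(index.getContainedObjectNumbers(5)); // prints [10, 11]
    }
}
```

The trade-off is one extra map held in memory in exchange for turning many full
scans into one; whether that matches the memory figures reported above would
have to be measured.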
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 