OutOfMemoryError in PDExtendedGraphicsState#getLineDashPattern

Andreas Hubold Tue, 20 Mar 2018 14:36:07 -0700

Hi,

I'm getting an OutOfMemoryError from PDFBox when parsing a certain PDF using the Apache Tika App v1.17, which uses PDFBox 2.0.8 internally. This is reproducible even with an 8 GB heap.

The OutOfMemoryError happens in org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState#getLineDashPattern, which contains this piece of suspicious code:


        COSArray dp = (COSArray) dict.getDictionaryObject( COSName.D );
        if( dp != null )
        {
            COSArray array = new COSArray();
            dp.addAll(dp);

The last line seems to be wrong. It appends all elements from 'dp' to 'dp' again, effectively duplicating the elements in the list. Maybe it should be 'array.addAll(dp)' or something like that?
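To illustrate why I suspect this line, here is a minimal sketch using a plain java.util.ArrayList instead of COSArray (according to the stack trace, COSArray.addAll ends up in ArrayList.addAll). The class and method names are made up for this example, and it assumes that getLineDashPattern is called repeatedly on the same dictionary, which I have not verified. Under that assumption, every call doubles the array, so only a few dozen calls would be needed to exhaust any heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of PDFBox or Tika.
public class SelfAddAllDemo {

    // Mimics the suspicious pattern: the list appends its own contents
    // to itself, so each call doubles its size.
    static int sizeAfterBuggyCalls(List<Float> dp, int calls) {
        for (int i = 0; i < calls; i++) {
            dp.addAll(dp);
        }
        return dp.size();
    }

    // Sketch of the alternative suggested above: copy into a fresh list
    // and leave the source list untouched.
    static List<Float> copyOf(List<Float> dp) {
        List<Float> array = new ArrayList<>();
        array.addAll(dp);
        return array;
    }

    // Number of doublings after which the element count no longer fits
    // into an int, which is the upper bound for an ArrayList.
    static int doublingsToExceedIntRange(int initialSize) {
        if (initialSize <= 0) {
            throw new IllegalArgumentException("initialSize must be positive");
        }
        long size = initialSize;
        int calls = 0;
        while (size <= Integer.MAX_VALUE) {
            size *= 2;
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        // A two-element dash array, e.g. [3 2], chosen only for illustration.
        List<Float> dp = new ArrayList<>(Arrays.asList(3f, 2f));

        System.out.println(sizeAfterBuggyCalls(dp, 10));   // 2 * 2^10 = 2048
        System.out.println(doublingsToExceedIntRange(2));  // 30

        List<Float> source = new ArrayList<>(Arrays.asList(3f, 2f));
        List<Float> copy = copyOf(source);
        System.out.println(source.size() + " " + copy.size()); // 2 2
    }
}
```

With the copying variant the source list keeps its size no matter how often the method is called, which is what I would expect from a getter.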

Can you confirm that this is a bug? Should I open a JIRA ticket for this problem?

Do you know a workaround to avoid the crash, e.g. an option to skip some parts of the file for text extraction?


Here's the stacktrace:

[Full GC (Allocation Failure) 4225609K->4224664K(5989888K), 32,9544686 secs]

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3210)
    at java.util.Arrays.copyOf(Arrays.java:3181)
    at java.util.ArrayList.grow(ArrayList.java:261)
    at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
    at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
    at java.util.ArrayList.addAll(ArrayList.java:579)
    at org.apache.pdfbox.cos.COSArray.addAll(COSArray.java:124)
    at org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState.getLineDashPattern(PDExtendedGraphicsState.java:280)
    at org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState.copyIntoGraphicsState(PDExtendedGraphicsState.java:89)
    at org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters.process(SetGraphicsStateParameters.java:61)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)

I'm not yet sure if I can share the PDF. If needed, I can check that.

Best regards,
Andreas

