Tilman Hausherr created PDFBOX-5799:
---------------------------------------

             Summary: Page with thousands of content streams takes extremely long to render or extract
                 Key: PDFBOX-5799
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5799
             Project: PDFBox
          Issue Type: Bug
          Components: Rendering, Text extraction
    Affects Versions: 3.0.2 PDFBox
            Reporter: Tilman Hausherr


As reported by Erik Branks on the mailing list:
{quote}when attempting text extraction from the PDF at 
[https://d-nb.info/1324982411/34] , either using PDFBox 3.0.0 or PDFBox 
4.0.0-SNAPSHOT, the extraction uses about 1,8 GB heap memory and does not seem 
to terminate. I cancelled the extraction attempt after roughly 20 minutes. Is 
this another bad PDF or is there a bug in PDFBox?{quote}

This happens with pages 230 and 231 (maybe others). Both have thousands of 
content streams in the content stream array. The profiler suggests that most 
time is spent in {{SequenceRandomAccessRead.seek()}}.
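
To illustrate why seeking over a sequence of thousands of streams can become expensive, here is a minimal sketch. It is not the PDFBox implementation, which this report does not show. It assumes, purely for illustration, that a seek has to locate the sub-stream containing the target position. The class {{SegmentSeekSketch}} and its methods are hypothetical names. A linear scan costs O(n) per seek, while a binary search over precomputed start offsets costs O(log n).

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not PDFBox code: maps an absolute position to the
 * index of the segment (sub-stream) that contains it.
 */
final class SegmentSeekSketch {
    // starts[i] is the absolute offset at which segment i begins
    private final long[] starts;
    private final long totalLength;

    SegmentSeekSketch(long[] segmentLengths) {
        starts = new long[segmentLengths.length];
        long pos = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            // Assumption: empty segments are filtered out beforehand, so the
            // start offsets are strictly increasing.
            if (segmentLengths[i] <= 0) {
                throw new IllegalArgumentException("segment length must be positive");
            }
            starts[i] = pos;
            pos += segmentLengths[i];
        }
        totalLength = pos;
    }

    /** Linear scan: walks the segments from the start, O(n) per seek. */
    int findSegmentLinear(long position) {
        checkRange(position);
        int i = 0;
        while (i + 1 < starts.length && starts[i + 1] <= position) {
            i++;
        }
        return i;
    }

    /** Binary search over the start offsets, O(log n) per seek. */
    int findSegmentBinary(long position) {
        checkRange(position);
        int idx = Arrays.binarySearch(starts, position);
        // An exact hit is the segment start itself; otherwise binarySearch
        // returns -(insertionPoint) - 1, and the containing segment is the
        // one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    private void checkRange(long position) {
        if (position < 0 || position >= totalLength) {
            throw new IndexOutOfBoundsException("position " + position);
        }
    }
}
```

With segment lengths 10, 5 and 20, both methods map position 14 to segment 1 and position 15 to segment 2. Whether this is what makes {{seek()}} slow here would need to be confirmed against the profiler output.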

Rendering page 230 with PDFBox 2.0: 50 seconds

Rendering page 230 with PDFBox trunk: 2990 seconds

Rendering page 231 with PDFBox trunk: 4798 seconds 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

