Re: Text extraction from a certain PDF uses up multiple GB of memory

2023-12-14 Thread Andreas Lehmkühler

Looks like https://issues.apache.org/jira/browse/PDFBOX-5479

On 13.12.23 at 14:50, Tilman Hausherr wrote:

On 13.12.2023 11:23, Brangs, Erik wrote:

Hi,

we ran into problems when doing text extraction from the PDF at 
https://d-nb.info/1312454512/34 . We were using PDFBox 3.0.0 to extract the 
text and the text extraction used up multiple GB of memory. The problem can be 
reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room 
for improvement in text extraction in PDFBox for this case or is this just a 
badly generated PDF?

Yeah, it's a weird PDF: it has different font objects that point to 
the same font file (see FontFile2). So the font is opened each time and 
all tables are read and stored. And since 3.0 we read many more tables 
than in 2.0.
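
To illustrate the effect described above, here is a minimal sketch of the
difference between parsing a shared font file once per font object and
parsing it once per distinct stream. This is not PDFBox code: FontStream,
ParsedFont, FontCache and parseUncached are hypothetical stand-ins, and
copying the bytes only stands in for reading and storing the font tables.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are PDFBox classes.
final class SharedFontSketch {

    /** Stand-in for an embedded font program, e.g. the stream behind FontFile2. */
    static final class FontStream {
        final byte[] bytes;
        FontStream(byte[] bytes) { this.bytes = bytes; }
    }

    /** Stand-in for a fully parsed font whose tables are kept in memory. */
    static final class ParsedFont {
        final byte[] tables;
        ParsedFont(byte[] tables) { this.tables = tables; }
    }

    /** Parses on every call, so N font objects sharing one stream cost N copies. */
    static ParsedFont parseUncached(FontStream stream) {
        return new ParsedFont(stream.bytes.clone());
    }

    /** Parses each distinct stream object once and reuses the result. */
    static final class FontCache {
        // Keyed on object identity: "same stream object", not "equal bytes".
        private final Map<FontStream, ParsedFont> parsed = new IdentityHashMap<>();
        private int parseCount = 0;

        ParsedFont get(FontStream stream) {
            ParsedFont font = parsed.get(stream);
            if (font == null) {
                font = parseUncached(stream);
                parseCount++;
                parsed.put(stream, font);
            }
            return font;
        }

        int parseCount() { return parseCount; }
    }

    public static void main(String[] args) {
        FontStream shared = new FontStream(new byte[1024]);
        FontCache cache = new FontCache();
        // Three font objects that all point at the same font file.
        for (int i = 0; i < 3; i++) {
            cache.get(shared);
        }
        System.out.println("parses with cache: " + cache.parseCount());
    }
}
```

Without such a cache, every font object that references the same font file
holds its own copy of the parsed tables, so memory grows with the number of
font objects and not with the number of distinct font files.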

Tilman



-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



Text extraction from a certain PDF uses up multiple GB of memory

2023-12-13 Thread Brangs, Erik
Hi,

we ran into problems when doing text extraction from the PDF at 
https://d-nb.info/1312454512/34 . We were using PDFBox 3.0.0 to extract the 
text and the text extraction used up multiple GB of memory. The problem can be 
reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room 
for improvement in text extraction in PDFBox for this case or is this just a 
badly generated PDF?

-- 
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Telefon: +49 69 1525-1792
Telefax: +49 69 1525-1799
mailto:e.bra...@dnb.de
https://www.dnb.de

