Re: Parsing huge PDF (400Mb, 2700 pages)

Maruan Sahyoun Thu, 14 Nov 2019 10:23:53 -0800

Hmm, I am not running on AWS Lambda, but I do have an application handling PDFs 
with up to 30,000 pages, and it takes only 2 minutes.
Although the environments are not comparable, it would be good to get a better 
idea of the content of the PDFs. Maybe there is
something in there causing the long runtime.


Could you share it privately?

BR
Maruan

> Hi,
> 
> I’ve read about the on-demand parser. I might have a look.
> 
> Unfortunately, I am NOT allowed to share the PDF.
> 
> What I am trying to do is the following: I am writing an AWS Lambda for 
> parsing the PDF page by page. The text should be extracted and sent to 
> Elasticsearch.
> 
> Because of the Lambda environment, I have limited resources: 3GB of memory 
> and a maximum runtime of 15 minutes.
> 
> This setup works marvelously with the majority of the PDFs. With the ones 
> bigger than around 400MB, I am exceeding the time limit.
> 
> The problem is NOT Tika related, it is PDFBox related (I did a check). So I 
> will have to find another strategy for the time being.
> 
> Thanks to all for the feedback. Much appreciated.
> 
> Kind regards and have a nice evening,
> 
> christian
> 
> From: Tilman Hausherr <thaush...@t-online.de>
> Sent: Thursday, 14 November 2019 18:05
> To: u...@tika.apache.org
> Cc: users@pdfbox.apache.org
> Subject: Re: Parsing huge PDF (400Mb, 2700 pages)
> 
> The PDF can be much bigger than 3GB when decompressed.
> 
> What you could try
> 
> 1) using a scratch file (will be even slower) when opening the document
> 2) the on-demand parser, see
> https://issues.apache.org/jira/browse/PDFBOX-4569
> 
> There is a branch on the SVN server; you have to build from source.
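
As an illustration of option 1, here is a minimal sketch of opening a document with a scratch file and extracting text one page at a time. It assumes PDFBox 2.0.x (`PDDocument.load` with a `MemoryUsageSetting`); the class name `ScratchFileExtraction`, the method `extractPages`, and the `pageSink` callback (standing in for something like an Elasticsearch indexing call) are made up for this example and do not come from the thread. As noted above, buffering on disk trades speed for heap space, so this is unlikely to fix a runtime limit by itself.

```java
import java.io.File;
import java.io.IOException;
import java.util.function.BiConsumer;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ScratchFileExtraction {

    /**
     * Extracts the text of every page separately and hands it to pageSink
     * together with its 1-based page number.
     * pageSink is a hypothetical callback, e.g. a wrapper around an indexing call.
     */
    public static void extractPages(File pdf, File tempDir,
                                    BiConsumer<Integer, String> pageSink) throws IOException {
        // Assumption: PDFBox 2.0.x. Buffer PDF streams in temp files under
        // tempDir instead of keeping them in main memory.
        MemoryUsageSetting memory =
                MemoryUsageSetting.setupTempFileOnly().setTempDir(tempDir);

        try (PDDocument document = PDDocument.load(pdf, memory)) {
            PDFTextStripper stripper = new PDFTextStripper();
            int pageCount = document.getNumberOfPages();
            for (int page = 1; page <= pageCount; page++) {
                // Restrict the stripper to a single page (page numbers are 1-based).
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                pageSink.accept(page, stripper.getText(document));
            }
        }
    }
}
```

Handing each page to a callback, rather than collecting all pages first, keeps only one page of extracted text on the heap at a time. `MemoryUsageSetting.setupMixed(maxMainMemoryBytes)` is an alternative that uses main memory up to a limit before spilling to disk.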
> 
> Tilman
> 
> On 14.11.2019 at 17:15, Ribeaud, Christian (Ext) wrote:
> Good evening,
> 
> No, I am NOT using tika-server. And uh, I am a bit surprised to hear (read) 
> that PDFBox does NOT stream the PDF.
> So let’s wait for feedback from the PDFBox colleagues. Thanks anyway for yours.
> 
> christian
> 
> From: Tim Allison <talli...@apache.org>
> Sent: Thursday, 14 November 2019 15:07
> To: u...@tika.apache.org
> Cc: users@pdfbox.apache.org
> Subject: Re: Parsing huge PDF (400Mb, 2700 pages)
> 
> CC'ing colleagues on PDFBox...any recommendations?
> 
> Sergey's recommendation is great for documents that can be parsed via 
> streaming.  However, PDFBox does not currently parse PDFs in a streaming 
> mode.  It builds the full document tree -- PDFBox colleagues let me know if 
> I'm wrong.
> 
> On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin 
> <sberyoz...@gmail.com> wrote:
> Hi,
> Are you using tika-server? If so, and if you can submit the data as a 
> multipart/form-data payload, that may help: CXF (used by tika-server) 
> should make a best effort to save the multipart payloads to temporary 
> locations on disk, and thus minimize the memory requirements.
> 
> Cheers, Sergey
> 
> 
> On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext) 
> <christian.ribe...@novartis.com> wrote:
> Hi,
> 
> My application handles all kinds of documents (mainly PDFs). In very few 
> cases, huge PDFs (< 500MB) can be expected.
> 
> At around 400MB I am hitting a wall: parsing takes ages (although it is quite 
> fast at the beginning). I've tried several ideas, but none of them brought the 
> desired improvement.
> 
> I have the impression that memory plays a role. I have no more than 3GB (and 
> I think this should be enough, as we are streaming the document and using 
> an event-based XML parser).
> 
> Are there things I should be aware of?
> 
> Any hint would be very welcome. Thanks and have a nice day,
> 
> christian
> 
> 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
