Hmm, I'm not running on AWS Lambda, but I do have an application that handles PDFs with up to 30,000 pages, and it takes only 2 minutes. Although the environments are not comparable, it would be good to get a better idea of the content of the PDFs. Maybe there is something in there causing the long runtime.
Could you share it privately?

BR
Maruan

> Hi,
>
> I've read about the on-demand parser. I might have a look.
>
> Unfortunately, I am NOT allowed to share the PDF.
>
> What I am trying to do is the following: I am writing an AWS Lambda for
> parsing the PDF page by page. The text should be extracted and sent to
> Elasticsearch.
>
> Because of the Lambda environment, I have limited resources: 3 GB and 15 min
> runtime max.
>
> This setup works marvelously with the majority of the PDFs. With the ones
> bigger than around 400 MB, I am overrunning the time limit.
>
> The problem is NOT Tika related, it is PDFBox related (I did a check). So I
> will have to find another strategy for the time being.
>
> Thanks to all for the feedback. Much appreciated.
>
> Kind regards and have a nice evening,
>
> christian
>
> From: Tilman Hausherr <thaush...@t-online.de>
> Sent: Thursday, 14 November 2019 18:05
> To: u...@tika.apache.org
> Cc: users@pdfbox.apache.org
> Subject: Re: Parsing huge PDF (400Mb, 2700 pages)
>
> The PDF can be much bigger than 3GB when decompressed.
>
> What you could try
>
> 1) using a scratch file (will be even slower) when opening the document
> 2) the on-demand parser, see
> https://issues.apache.org/jira/browse/PDFBOX-4569
>
> there is a branch on the svn server, you have to build from source.
>
> Tilman
>
> On 14.11.2019 at 17:15, Ribeaud, Christian (Ext) wrote:
> Good evening,
>
> No, I am NOT using tika-server. And uh, I am a bit surprised to hear (read)
> that PDFBox does NOT stream the PDF.
> So let's wait for feedback from the PDFBox colleagues. Thanks anyway for yours.
>
> christian
>
> From: Tim Allison <talli...@apache.org>
> Sent: Thursday, 14 November 2019 15:07
> To: u...@tika.apache.org
> Cc: users@pdfbox.apache.org
> Subject: Re: Parsing huge PDF (400Mb, 2700 pages)
>
> CC'ing colleagues on PDFBox...any recommendations?
>
> Sergey's recommendation is great for documents that can be parsed via
> streaming. However, PDFBox does not currently parse PDFs in a streaming
> mode. It builds the full document tree -- PDFBox colleagues let me know if
> I'm wrong.
>
> On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin
> <sberyoz...@gmail.com> wrote:
> Hi,
> Are you using tika-server? If yes, and you can submit the data using a
> multipart/form-data payload, then it may help: CXF (used by tika-server)
> should make a best effort at saving the multipart payloads to temp
> locations on disk, and thus minimize the memory requirements.
>
> Cheers, Sergey
>
>
> On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext)
> <christian.ribe...@novartis.com> wrote:
> Hi,
>
> My application handles all kinds of documents (mainly PDFs). In a very few
> cases, you might expect huge PDFs (< 500MB).
>
> At around 400MB I am hitting the wall: parsing takes ages (although it is quite
> fast at the beginning). I've tried several ideas but none of them brought the
> desired improvement.
>
> I have the impression that memory plays a role. I have no more than 3GB (and
> I think this should be enough as we are streaming the document and using
> an event-based XML parser).
>
> Are there things I should be aware of?
>
> Any hint would be very welcome.
> Thanks and have a nice day,
>
> christian

--
Maruan Sahyoun
FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen
Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de
Managing Director: Maruan Sahyoun
Commercial register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827
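
Note on Tilman's remark that the PDF can be much bigger than 3GB when decompressed: PDF content streams are commonly Flate-compressed, so the size on disk says little about the size once the streams are decoded. The sketch below is only an illustration of that effect using Python's zlib module. It does not use PDFBox, and the sample data (sample_stream) is a made-up, highly repetitive stand-in for a page content stream, not anything taken from the PDF discussed in this thread. Real documents will usually compress far less well.

```python
import zlib


def flate_sizes(raw: bytes, level: int = 6) -> tuple[int, int]:
    """Return (compressed_size, decompressed_size) for a Flate (zlib) round trip."""
    compressed = zlib.compress(raw, level)
    restored = zlib.decompress(compressed)
    # Sanity check: the round trip must give back the original bytes.
    assert restored == raw
    return len(compressed), len(restored)


# Hypothetical stand-in for a repetitive page content stream (assumption,
# chosen only to make the expansion effect visible).
sample_stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 50_000

compressed_size, raw_size = flate_sizes(sample_stream)
expansion_factor = raw_size / compressed_size

print(f"compressed: {compressed_size} bytes")
print(f"decompressed: {raw_size} bytes")
print(f"expansion factor: {expansion_factor:.0f}x")
```

The point is only that the decoded size, not the file size, is what has to fit into memory (or a scratch file) once a parser decodes the streams.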