Well, PDF is not really easily streamable, as it's organized as a random-access format:

- the reference table for the objects forming the PDF is at the end of the file, so you have to read the last part first and then move back
- objects making up the content can be spread around the file
- pages can be organized in trees
- page resources such as images or fonts may be shared across pages
- the information/content of these resources may sit before or after the page objects
- PDFs can be changed incrementally, so information in one section might be superseded by a revision that comes later in the file
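
To illustrate the "read the last part first" point, here is a minimal sketch in plain Java (no PDFBox). It relies on the fact that a PDF ends with the keyword startxref, the byte offset of the most recent cross-reference section, and %%EOF. The class name StartXrefSketch, the method findStartXref and the 1024-byte tail size are assumptions made for this example; this is not how PDFBox implements its parser.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of PDFBox.
class StartXrefSketch {

    // Number of trailing bytes to inspect (an assumption for this sketch).
    private static final int TAIL_SIZE = 1024;

    /**
     * Returns the byte offset written after the last "startxref" keyword
     * in the tail of the file, or -1 if no such keyword/offset is found.
     */
    static long findStartXref(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            int tailLength = (int) Math.min(TAIL_SIZE, length);
            byte[] tail = new byte[tailLength];

            // Random access: jump straight to the end instead of reading from the start.
            raf.seek(length - tailLength);
            raf.readFully(tail);

            // ISO-8859-1 maps each byte to exactly one char, so indices stay aligned.
            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int keyword = text.lastIndexOf("startxref");
            if (keyword < 0) {
                return -1;
            }

            // Skip the whitespace between the keyword and the offset.
            int pos = keyword + "startxref".length();
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }

            // Collect the decimal digits of the offset.
            int start = pos;
            while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
                pos++;
            }
            if (start == pos) {
                return -1;
            }
            return Long.parseLong(text.substring(start, pos));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java StartXrefSketch <file.pdf>");
            return;
        }
        System.out.println("startxref offset: " + findStartXref(Paths.get(args[0])));
    }
}
```

A real parser would then seek to that offset, read the cross-reference section, and follow any earlier revisions from there, which is why a single forward pass over the bytes is not enough. In an incrementally updated file the last startxref wins, which is what lastIndexOf picks up here.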

... so it's more similar to building a DOM from an XML and handling that than to stream-parsing an XML.

That doesn't mean that there aren't ways to improve the current parsing ...

BR
Maruan

> Good evening,
>
> No, I am NOT using tika-server. And uh, I am a bit surprised to hear (read)
> that PDFBox does NOT stream the PDF.
> So let's wait for PDFBox colleagues' feedback. Thanks anyway for yours.
>
> christian
>
> From: Tim Allison <talli...@apache.org>
> Sent: Thursday, 14 November 2019 15:07
> To: u...@tika.apache.org
> Cc: users@pdfbox.apache.org
> Subject: Re: Parsing huge PDF (400Mb, 2700 pages)
>
> CC'ing colleagues on PDFBox...any recommendations?
>
> Sergey's recommendation is great for documents that can be parsed via
> streaming. However, PDFBox does not currently parse PDFs in a streaming
> mode. It builds the full document tree -- PDFBox colleagues let me know if
> I'm wrong.
>
> On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin
> <sberyoz...@gmail.com> wrote:
> Hi,
> Are you using tika-server? If yes, and you can submit the data using a
> multipart/form-data payload, then it may help: CXF (used by tika-server)
> should make a best effort at saving the multipart payloads to temp
> locations on the disk, and thus minimize the memory requirements.
>
> Cheers, Sergey
>
>
> On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext)
> <christian.ribe...@novartis.com> wrote:
> Hi,
>
> My application handles all kinds of documents (mainly PDFs). In a very few
> cases, you might expect huge PDFs (< 500MB).
>
> At around 400MB I am hitting the wall: parsing takes ages (although it is quite
> fast at the beginning). I've tried several ideas but none of them brought the
> desired improvement.
>
> I have the impression that memory plays a role. I have no more than 3GB (and
> I think this should be enough as we are streaming the document and using
> an event-based XML parser).
>
> Are there things I should be aware of?
>
> Any hint would be very welcome. Thanks and have a nice day,
>
> christian

--
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827