Re: Parsing huge PDF (400Mb, 2700 pages)

Maruan Sahyoun Thu, 14 Nov 2019 08:33:09 -0800

well - PDF is not really easily streamable, as

- it's organized as a random access format
- the reference table for the objects forming the PDF is at the end of the 
file, so you have to read the last parts first and then move back
- objects making up the content can be spread around the file
- pages can be organized in trees
- page resources such as images or fonts may be shared across pages
- the information/content of these resources may be sitting before or after the 
page objects
- PDFs can be incrementally changed so information in a section might be 
outdated by a revision which comes later in the file
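
To illustrate the "read the last parts first" point, here is a minimal sketch using only the Java standard library. It is not how PDFBox itself parses a file; the class name StartXrefSketch, the method findLastXrefOffset and the 1024-byte tail size are assumptions made for this example. It seeks to the end of the file, looks for the last "startxref" keyword, and reads the byte offset that follows it, which is where a reader would jump back to in order to find the reference table of the most recent revision.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class StartXrefSketch {

    /**
     * Returns the byte offset written after the last "startxref" keyword
     * found in the final tailSize bytes of the file, or -1 if none is found.
     */
    public static long findLastXrefOffset(RandomAccessFile file, int tailSize) throws IOException {
        long length = file.length();
        int toRead = (int) Math.min(tailSize, length);
        byte[] tail = new byte[toRead];

        // Jump straight to the end of the file: this needs random access,
        // which a forward-only stream cannot provide.
        file.seek(length - toRead);
        file.readFully(tail);

        // ISO-8859-1 maps every byte to exactly one char, so indexes stay aligned.
        String text = new String(tail, StandardCharsets.ISO_8859_1);

        // With incremental updates there can be several "startxref" entries;
        // the last one belongs to the most recent revision.
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }

        // Skip whitespace after the keyword, then collect the decimal digits.
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java StartXrefSketch <file.pdf>");
            return;
        }
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            System.out.println("xref offset: " + findLastXrefOffset(file, 1024));
        }
    }
}
```

Only after this backwards step can the objects (pages, fonts, images) be located, and they may sit anywhere in the file.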


...

so it's more similar to building a DOM from an XML document and handling that 
than to stream parsing XML.
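
For readers less familiar with the XML side of that comparison, a small sketch with the JDK's own XML APIs may help. The class and method names (DomVsStreamSketch, countWithStream, countWithDom) and the sample document are made up for this example. The stream version sees one event at a time and keeps nothing; the DOM version builds the whole tree in memory before it can answer, which is closer to what handling a PDF requires.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;

public class DomVsStreamSketch {

    /** Stream parsing: walk the events front to back, holding only a counter. */
    public static int countWithStream(byte[] xml, String name) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && name.equals(reader.getLocalName())) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    /** DOM parsing: the complete tree is built in memory, then queried. */
    public static int countWithDom(byte[] xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return doc.getElementsByTagName(name).getLength();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><page/><page/><page/></doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println("stream=" + countWithStream(xml, "page")
                + " dom=" + countWithDom(xml, "page"));
        // prints "stream=3 dom=3"
    }
}
```

Both give the same answer here, but only the first could run in constant memory on an arbitrarily large input.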

That doesn't mean that there aren't ways to improve the current parsing ...

BR
Maruan
  
> Good evening,
> 
> No, I am NOT using tika-server. And uh, I am a bit surprised to hear (read) 
> that PDFBox does NOT stream the PDF.
> So let’s wait for PDFBox colleagues feedback. Thanks anyway for yours.
> 
> christian
> 
> From: Tim Allison <talli...@apache.org>
> Sent: Thursday, 14 November 2019 15:07
> To: u...@tika.apache.org
> Cc: users@pdfbox.apache.org
> Subject: Re: Parsing huge PDF (400Mb, 2700 pages)
> 
> CC'ing colleagues on PDFBox...any recommendations?
> 
> Sergey's recommendation is great for documents that can be parsed via 
> streaming.  However, PDFBox does not currently parse PDFs in a streaming 
> mode.  It builds the full document tree -- PDFBox colleagues let me know if 
> I'm wrong.
> 
> On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin 
> <sberyoz...@gmail.com> wrote:
> Hi,
> Are you using tika-server ? If yes and you can submit the data using a 
> multipart/form-data payload then it may help, CXF (used by tika-server) 
> should do the best effort at saving the multipart payloads to the temp 
> locations on the disk, and thus minimize the memory requirements
> 
> Cheers, Sergey
> 
> 
> On Thu, Nov 14, 2019 at 10:21 AM Ribeaud, Christian (Ext) 
> <christian.ribe...@novartis.com> wrote:
> Hi,
> 
> My application handles all kinds of documents (mainly PDFs). In a very few 
> cases, you might expect huge PDFs (< 500MB).
> 
> By around 400MB I am hitting the wall, parsing takes ages (although quite 
> fast at the beginning). I've tried several ideas but none of them brought the 
> desired amelioration.
> 
> I have the impression that memory plays a role. I have no more than 3GB (and 
> I think this should be enough as we are streaming the document and using an 
> event-based XML parser).
> 
> Are there things I should be aware of?
> 
> Any hint would be very welcome. Thanks and have a nice day,
> 
> christian
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
