Processing large amounts of PDFs in parallel without running out of memory

Paul Pearcy Mon, 05 Dec 2011 17:17:36 -0800

Hello!
  First off, thanks to all who have contributed to this great library. It has 
made my life a lot easier :)


I am processing a large number of PDFs for search indexing. Starting with 
Tika 0.9, I have been hitting out-of-memory errors while processing them. 
The heap dumps indicate that most of the memory is used by pdfbox 
RandomAccessBuffers.

It appears that under the hood pdfbox can work with either a RandomAccessFile 
(http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html ) 
or a RandomAccessBuffer 
(http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html) 
and that tika uses RandomAccessBuffers for better performance. I'd like to 
sacrifice this performance for less RAM usage.

Is this possible?
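
To make the trade-off concrete, here is a rough sketch using only plain JDK 
classes (java.io.RandomAccessFile, not pdfbox's own io classes). The class and 
method names are made up for illustration, and I am only assuming that the 
pdfbox classes behave in a broadly similar way: a heap-backed buffer keeps 
every byte in RAM, while a file-backed reader seeks on disk and holds only 
what it is currently reading.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;

// Hypothetical illustration only; this is not pdfbox or Tika code.
public class RandomAccessTradeoff {

    // Heap-backed access: the whole document stays in a byte[] while it is open,
    // so memory use grows with document size times the number of parallel parses.
    static int readFromHeap(byte[] data, int offset) {
        return data[offset] & 0xFF;
    }

    // File-backed access: seek to the offset and read a single byte from disk.
    // Slower per read, but the document contents are not held on the heap.
    static int readFromDisk(File file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            return raf.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny sample file so both access paths read the same bytes.
        File tmp = File.createTempFile("sample", ".bin");
        tmp.deleteOnExit();
        byte[] payload = {10, 20, 30, 40, 50};
        Files.write(tmp.toPath(), payload);

        System.out.println(readFromHeap(payload, 3));
        System.out.println(readFromDisk(tmp, 3));
    }
}
```

Both calls return the same byte; the difference is only where the data lives 
while it is being read, which is the behaviour I would like to control.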

Previously, I was passing Tika a byte array. I switched to a File in the hope 
that it would use RandomAccessFile, but that did not appear to make a 
difference.

I have a hunch that using TikaInputStreams may address this, but I am not 
sure.

Thanks and Best Regards,
Paul



