Hello! First off, thanks to all who have contributed to this great library. It has made my life a lot easier :)
I am processing a large number of PDFs for search indexing, and starting with Tika version 0.9 I began hitting out-of-memory errors while processing them. The heap dumps I get indicate that most of the memory is used by PDFBox RandomAccessBuffers.

It appears that under the hood PDFBox can work with either a RandomAccessFile (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html) or a RandomAccessBuffer (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html), and that Tika uses RandomAccessBuffers for better performance. I'd like to sacrifice this performance for lower RAM usage. Is this possible?

Previously I was passing Tika a byte array. I switched to a File in the hope that it would use RandomAccessFile, but that didn't appear to make a difference. I have a hunch that using a TikaInputStream might address this, but I am not sure.

Thanks and Best Regards,
Paul
