Thanks Nick.
I believe I found the relevant code:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
    PDDocument pdfDocument =
        PDDocument.load(new CloseShieldInputStream(stream), true);
It seems that passing a RandomAccessFile rather than a RandomAccessBuffer to the
load method controls where pdfbox keeps its temporary data: on disk rather than
in memory.
Anybody have thoughts on whether it makes sense to do this based on the type of
the underlying stream the parse method receives? Not sure if there is a better
option for controlling this behavior.
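To make the question concrete, here is a rough sketch of the decision I have in
mind. It uses only the JDK so it runs standalone, and it is not Tika or PDFBox
code. The class ScratchChoice, the enum Scratch and the method choose are
hypothetical names, and java.io.FileInputStream is only a stand-in for "a
TikaInputStream that has a File". In the real parser the FILE branch would
presumably hand pdfbox a RandomAccessFile as scratch space and the BUFFER branch
would keep the current in-memory behavior; I have not verified the exact load
overloads for that.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

class ScratchChoice {
    /** Hypothetical labels for the two scratch strategies discussed in this thread. */
    enum Scratch { FILE, BUFFER }

    /**
     * Picks disk-backed scratch storage when the input already lives on disk,
     * and in-memory scratch storage otherwise.
     * Assumption: FileInputStream stands in for a file-backed TikaInputStream.
     */
    static Scratch choose(InputStream stream) {
        return (stream instanceof FileInputStream) ? Scratch.FILE : Scratch.BUFFER;
    }

    public static void main(String[] args) throws Exception {
        // An empty temp file is enough to exercise the file-backed branch.
        File tmp = File.createTempFile("sketch", ".pdf");
        tmp.deleteOnExit();
        try (InputStream onDisk = new FileInputStream(tmp);
             InputStream inMemory = new ByteArrayInputStream(new byte[0])) {
            System.out.println("file-backed stream -> " + choose(onDisk));
            System.out.println("in-memory stream   -> " + choose(inMemory));
        }
    }
}
```

The point of keying off the stream type is that callers who already spooled the
document to disk get the low-RAM path without any new configuration, while plain
streams keep the faster buffer path.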
Best Regards,
Paul
-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Monday, December 05, 2011 6:31 PM
To: [email protected]
Subject: Re: Processing large amounts of PDFs in parallel without running out
of memory
On Mon, 5 Dec 2011, Paul Pearcy wrote:
> It appears that under the hood pdfbox can work with either a
> RandomAccessFile
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html
> ) or a RandomAccessBuffer
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html)
> and that tika uses RandomAccessBuffers for better performance. I'd like
> to sacrifice this performance for less RAM usage.
>
> Is this possible?
I think it should be a fairly simple change: test whether we have a
TikaInputStream, and if it is one backed by a File, use the File
constructor to PDFBox rather than the stream one.
I don't know the PDFBox related code well though, so I'll wait for others
to comment on the sanity of this... :)
Nick