RE: Processing large amounts of PDFs in parallel without running out of memory

Paul Pearcy Mon, 12 Dec 2011 12:12:33 -0800

Thanks Nick.

I believe I found the relevant code:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

PDDocument pdfDocument =
            PDDocument.load(new CloseShieldInputStream(stream), true);

It seems that passing a RandomAccessFile rather than a RandomAccessBuffer to
the load method controls whether pdfbox keeps its temporary data on disk or
in memory.

Does anybody have thoughts on whether it makes sense to pick between the two
based on the type of the underlying stream the parse method receives? I'm not
sure if there is a better option for controlling this behavior.
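
To make the memory trade-off concrete, here is a minimal sketch that uses only
JDK classes. FileBackedScratch is a hypothetical name and is not PDFBox's
RandomAccessFile; it only illustrates the idea of spooling a stream to a temp
file so that random-access reads do not need the whole document on the heap.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is NOT PDFBox or Tika code.
// Copies a stream to a temp file and serves random-access reads from disk.
class FileBackedScratch implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;

    FileBackedScratch(InputStream in) throws IOException {
        file = File.createTempFile("scratch", ".tmp");
        file.deleteOnExit();
        raf = new RandomAccessFile(file, "rw");
        // Copy in small chunks so heap use stays bounded by the chunk size.
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            raf.write(chunk, 0, n);
        }
    }

    long length() throws IOException {
        return raf.length();
    }

    // Reads exactly len bytes starting at offset; throws EOFException if short.
    byte[] read(long offset, int len) throws IOException {
        byte[] out = new byte[len];
        raf.seek(offset);
        raf.readFully(out);
        return out;
    }

    @Override
    public void close() throws IOException {
        raf.close();
        file.delete();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 example bytes".getBytes(StandardCharsets.US_ASCII);
        try (FileBackedScratch scratch =
                 new FileBackedScratch(new ByteArrayInputStream(data))) {
            // Read three bytes from the middle without loading the rest.
            System.out.println(
                new String(scratch.read(5, 3), StandardCharsets.US_ASCII));
        }
    }
}
```

The cost is the expected one: every read goes through the file system instead
of a byte array, which is the performance being traded for lower RAM usage.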

Best Regards,
Paul

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Monday, December 05, 2011 6:31 PM
To: [email protected]
Subject: Re: Processing large amounts of PDFs in parallel without running out 
of memory

On Mon, 5 Dec 2011, Paul Pearcy wrote:
> It appears that under the hood pdfbox can work with either a
> RandomAccessFile
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html)
> or a RandomAccessBuffer
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html)
> and that tika uses RandomAccessBuffers for better performance. I'd like
> to sacrifice this performance for less RAM usage.
>
> Is this possible?

I think it should be a fairly simple change: test whether we have a
TikaInputStream, and if it is one backed by a File, hand that File to
PDFBox rather than the stream.
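
Roughly, that check could look like the sketch below. It uses only JDK
types: FileAwareStream is a hypothetical stand-in for TikaInputStream, and
LoadStrategy is a made-up name, so this shows the shape of the decision
and not the actual Tika or PDFBox calls.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FilterInputStream;
import java.io.InputStream;

// Hypothetical stand-in for TikaInputStream: a stream that may know its file.
class FileAwareStream extends FilterInputStream {
    private final File file; // null when there is no backing file

    FileAwareStream(InputStream in, File file) {
        super(in);
        this.file = file;
    }

    boolean hasFile() {
        return file != null;
    }

    File getFile() {
        return file;
    }
}

// Hypothetical helper: decides which kind of load the parser should attempt.
class LoadStrategy {
    enum Source { FILE, STREAM }

    static Source choose(InputStream stream) {
        // Step 1: is it the file-aware stream type at all?
        if (stream instanceof FileAwareStream) {
            FileAwareStream fas = (FileAwareStream) stream;
            // Step 2: does it actually have a backing file?
            if (fas.hasFile()) {
                // Step 3: prefer the file-based load.
                return Source.FILE;
            }
        }
        // Otherwise keep the existing stream-based behaviour.
        return Source.STREAM;
    }

    public static void main(String[] args) {
        InputStream plain = new ByteArrayInputStream(new byte[0]);
        System.out.println(choose(plain));
        System.out.println(choose(new FileAwareStream(plain, new File("example.pdf"))));
    }
}
```

A plain stream, or a file-aware one with no file behind it, falls through
to the stream path, so callers that only have an InputStream see no change.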

I don't know the PDFBox related code well though, so I'll wait for others
to comment on the sanity of this... :)

Nick

