On Mon, 5 Dec 2011, Paul Pearcy wrote:
It appears that under the hood PDFBox can work with either a
RandomAccessFile
(http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html)
or a RandomAccessBuffer
(http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html),
and that Tika uses RandomAccessBuffers for better performance. I'd like
to sacrifice this performance for less RAM usage.
Is this possible?
I think it should be a fairly simple change: test whether we have a
TikaInputStream and, if so, whether it is backed by a File. If it is,
use PDFBox's File-based constructor rather than the stream-based one.
I don't know the PDFBox-related code well though, so I'll wait for others
to comment on the sanity of this... :)
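
To make the idea concrete, here is a rough, stdlib-only sketch of the check
described above. It is not Tika or PDFBox code: FileAwareStream is a
hypothetical stand-in for a TikaInputStream that knows its backing File
(in Tika itself I believe the equivalent check would go through hasFile()
and getFile(), but treat that as an assumption), and readHeader() is only
an illustrative substitute for handing the source to a parser.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

public class PdfSourceChooser {

    /** Hypothetical stand-in for a stream that remembers the file it came from. */
    static class FileAwareStream extends FileInputStream {
        private final File file;

        FileAwareStream(File file) throws IOException {
            super(file);
            this.file = file;
        }

        File getFile() {
            return file;
        }
    }

    /** Which kind of backing store the parser would be asked to use. */
    enum Backing { FILE, MEMORY }

    /** Prefer the file when the stream exposes one; otherwise fall back to memory. */
    static Backing chooseBacking(InputStream in) {
        return (in instanceof FileAwareStream) ? Backing.FILE : Backing.MEMORY;
    }

    /** Reads up to n leading bytes, going through the file on disk when possible. */
    static byte[] readHeader(InputStream in, int n) throws IOException {
        if (chooseBacking(in) == Backing.FILE) {
            File f = ((FileAwareStream) in).getFile();
            // Random access on disk: only the requested bytes are held in RAM.
            try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
                byte[] head = new byte[(int) Math.min(n, raf.length())];
                raf.readFully(head);
                return head;
            }
        }
        // No file available: the whole stream is buffered in memory (Java 9+).
        byte[] all = in.readAllBytes();
        return Arrays.copyOf(all, Math.min(n, all.length));
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        File tmp = File.createTempFile("sketch", ".pdf");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), pdf);

        try (InputStream fromFile = new FileAwareStream(tmp);
             InputStream fromMemory = new ByteArrayInputStream(pdf)) {
            System.out.println(chooseBacking(fromFile));
            System.out.println(chooseBacking(fromMemory));
            System.out.println(
                new String(readHeader(fromFile, 5), StandardCharsets.US_ASCII));
        }
    }
}
```

The point of the sketch is only the branch in chooseBacking(): callers that
can hand over a File get the low-memory path, and everyone else keeps the
current in-memory behaviour.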
Nick