Re: RandomAccessReadBuffer performance issues with inputStreams in 3.0

bnncdv Mon, 18 Sep 2023 07:36:50 -0700

Thanks, I tried 3.0.1-SNAPSHOT and it does seem fixed.

Just in case, here is a basic example (cleanup etc. omitted for brevity):


> InputStream is = new FileInputStream(new File("/tests/big.pdf"));
> PDDocument doc = ...;
> //  PDDocument.load(is); //2.0.x
> //  Loader.loadPDF(new RandomAccessReadBuffer(is)); //3.0.x
>
> List<PDDocument> docs = new Splitter().split(doc); //timings here

With a ~70MB PDF file of 600 pages (created by joining a PDF with a
full-page image N times):
- 2.0.29: ~0.5 sec, ~300MB
- 3.0.0: ~7 sec, ~3500MB
- 3.0.1: ~0.9 sec, ~130MB

With a ~900MB PDF of 9600 pages (uncommon, but a real file sent by a
client):
- 2.0.29: ~3.5 sec, ~3800MB
- 3.0.0: out of memory exception after ~30 sec
- 3.0.1: ~0.9 sec, ~330MB

These are not exact timings, but they are good enough to compare (they
would vary/increase once the List is actually handled, which is not
relevant here). The high CPU usage probably depended on the Java/SDK
version: I assume it was linked to GC calls for the extra objects, and
their frequency etc. would vary per system, so it was indirectly fixed
as well.
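
For reference, rough numbers like the ones above can be collected with a
small harness along these lines. This is only a sketch using the standard
library: the names RoughBenchmark, Measurement and measure are made up for
illustration, the workload in main is a stand-in (in the real test it would
be the new Splitter().split(doc) call), and the heap figure is an
approximation since System.gc() is only a hint to the JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of PDFBox: measures wall-clock time and
// approximate heap growth of a single task.
final class RoughBenchmark {

    /** Result of one measured run. */
    static final class Measurement<T> {
        final T result;             // kept so the produced objects stay reachable
        final long elapsedMillis;   // wall-clock time of the task
        final long heapDeltaBytes;  // used heap after minus used heap before

        Measurement(T result, long elapsedMillis, long heapDeltaBytes) {
            this.result = result;
            this.elapsedMillis = elapsedMillis;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static <T> Measurement<T> measure(Callable<T> task) {
        System.gc(); // only a hint; the "before" figure is approximate
        long usedBefore = usedHeap();
        long start = System.nanoTime();
        T result;
        try {
            result = task.call();
        } catch (Exception e) {
            throw new IllegalStateException("measured task failed", e);
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        long usedAfter = usedHeap();
        return new Measurement<>(result, elapsedMillis, usedAfter - usedBefore);
    }

    public static void main(String[] args) {
        // Stand-in workload: allocate 600 buffers of 64 KiB each.
        Measurement<List<byte[]>> m = measure(() -> {
            List<byte[]> chunks = new ArrayList<>();
            for (int i = 0; i < 600; i++) {
                chunks.add(new byte[64 * 1024]);
            }
            return chunks;
        });
        System.out.printf("items=%d, ~%d ms, ~%d MB%n",
                m.result.size(), m.elapsedMillis, m.heapDeltaBytes / (1024 * 1024));
    }
}
```

The heap delta can even come out negative if a collection happens to run
during the task, so it is only useful for order-of-magnitude comparisons
like the ones above.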

***

Also, for 2.0 we typically use:
- PDDocument.load(is, MemoryUsageSetting.setupMixed(MAX_BYTES))
which seems to reduce/control memory a bit (at the cost of some CPU etc.).
Does 3.0 have a direct equivalent? I tried things like:
- Loader.loadPDF(rarb, null, null, null,
MemoryUsageSetting.setupMixed(MAX_BYTES).streamCache)
but it doesn't seem to change much. 2.0 may be using ScratchFile internally,
but I'm not sure how to set that up in 3.0.


Thanks.
