Re: RandomAccessReadBuffer performance issues with inputStreams in 3.0
Thanks, tried 3.0.1-SNAPSHOT and it does seem fixed. Just in case, here is a basic example (cleanup/etc. simplified):

> InputStream is = new FileInputStream(new File("/tests/big.pdf"));
> PDDocument doc = ...;
> // PDDocument.load(is); // 2.0.x
> // Loader.loadPDF(new RandomAccessReadBuffer(is)); // 3.0.x
>
> List<PDDocument> docs = new Splitter().split(doc); // timings here

With a ~70MB PDF of 600 pages (created by joining a PDF with a full-page image N times):
- 2.0.29: ~0.5 sec, ~300MB
- 3.0.0: ~7 sec, ~3500MB
- 3.0.1: ~0.9 sec, ~130MB

With a ~900MB PDF of 9600 pages (uncommon, but a real file sent by a client):
- 2.0.29: ~3.5 sec, ~3800MB
- 3.0.0: out-of-memory exception after ~30 sec
- 3.0.1: ~0.9 sec, ~330MB

Not exact timings, but good enough to compare (they would vary/increase after handling the List, but that's not relevant here). The high CPU probably depended on the Java/JDK version: I assume it was linked to GC calls for the extra objects, and their frequency etc. would vary per system, so it was fixed indirectly.

***

Also, for 2.0 we typically use:
- PDDocument.load(is, MemoryUsageSetting.setupMixed(MAX_BYTES))

which seems to reduce/control memory a bit (at the cost of some CPU etc.). Does 3.0 have a direct equivalent? I tried things like:
- Loader.loadPDF(rarb, null, null, null, MemoryUsageSetting.setupMixed(MAX_BYTES).streamCache)

but it doesn't seem to change much. 2.0 may be using ScratchFile internally, but I'm not sure how to set that up in 3.0?

Thanks.
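For reference, the rough numbers above came from simple wall-clock/heap snapshots rather than a profiler. A minimal stand-alone harness in that spirit might look like the sketch below (plain JDK, no PDFBox; the allocation loop is a stand-in for the actual Loader/Splitter calls, and the heap delta is only a hint since GC timing is nondeterministic):

```java
import java.util.concurrent.TimeUnit;

public class RoughBench {
    // Very rough wall-clock + heap measurement, in the spirit of the
    // "not exact timings" above. 'work' is where the real PDFBox calls
    // (Loader.loadPDF / Splitter.split) would go.
    static void measure(String label, Runnable work) {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // hint only; the heap delta is approximate
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long t0 = System.nanoTime();
        work.run();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        System.out.printf("%s: ~%d ms, ~%d MB delta%n",
                label, elapsedMs, (heapAfter - heapBefore) / (1024 * 1024));
    }

    public static void main(String[] args) {
        // stand-in workload: 12800 chunks * 4 KB = 50 MB of small buffers
        measure("allocate 50MB in 4KB chunks", () -> {
            byte[][] chunks = new byte[12800][];
            for (int i = 0; i < chunks.length; i++) {
                chunks[i] = new byte[4096];
            }
        });
    }
}
```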
Re: RandomAccessReadBuffer performance issues with inputStreams in 3.0
Am 28.08.23 um 13:30 schrieb bnncdv:

> When migrating from 2.0 to 3.0 I noticed some operations were very slow, mainly the Splitter tool. With a big-ish file it would take *a lot* more memory/cpu (jdk8).

What exactly are you doing? I've tried to reproduce the issue and I've been successful with regard to the memory footprint, but I can't confirm the higher CPU usage. I've split the PDF spec, a 32MB file with more than 1,300 pages, into 2-page PDFs, and I can't see any difference in CPU usage whether I use a file or an input stream. However, I was able to reproduce the regression with regard to memory consumption and fixed/optimized it in [1].

> I believe the culprit is RandomAccessReadBuffer with inputstreams. This fully reads the stream in 4KB chunks (not a problem),

We have to do that, as we need random access to the file. 2.0.x does the same.

> however every time createView(..) is called (on every PDPage access I think) it calls a clone RARB constructor, and all its ByteBuffer chunks are duplicate()'d, which for bigger files with many pages means *tons* of wasted objects + calls (even if the underlying buf is the same). Simplifying that, for example by reusing the parent bufferList rather than duplicating it, uses the expected cpu/memory (I don't know the implications though). From simple observations Splitter seems to take ~4x more cpu/heap. For example I'd assume with a 100MB file of 300 pages (normal enough if you deal with scanned docs) + inputstream: 100MB = 25,600 chunks of 4KB * 300 pages = ~7.7 million objects created+gc'd in a short time, at least. With smaller files (few pages) this isn't very noticeable, nor with RandomAccessReadBufferedFile (different handling). Passing a pre-read byte[] to RandomAccessReadBuffer works ok (minimal dupes).

RandomAccessReadBufferedFile has a builtin cache to avoid too many copies, see [1].

> RandomAccess.createBuffer(inputStream) in alpha3 was also ok but removed in beta1.

Alpha3 did the same as the final version 3.0.0. The removed method was redundant.

> Either way, I don't think code should be copying/duping so much and could be restructured, especially since the migration guide hints at using RandomAccessReadBuffer for inputStreams. Also, for RARB it'd make more sense to read chunks as needed in read() rather than all at once in the constructor I think (faster metadata querying). Incidentally, it may be useful to increase the default chunk size (or allow users to set it) to reduce fragmentation, since it's going to read the whole thing and PDFs < 4KB aren't that common, I'd say.

We have to read all data as we need random access to the pdf. In many cases one of the first steps is to jump to the end of the pdf to read the cross reference table/stream.

> (I don't have a publishable example at hand but it can be easily replicated by using the PDFMergerUtility and joining the same non-tiny PDF xN times, then splitting it.)

There has to be something special about your use case and/or pdf, as I can't reproduce the cpu issue, see above.

Andreas

> Thanks.

[1] https://issues.apache.org/jira/browse/PDFBOX-5685
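Andreas's point about jumping to the end of the file can be illustrated with plain JDK I/O. A PDF stores the byte offset of its cross-reference table after the "startxref" keyword near the end of the file, so a parser wants to seek there first; a plain InputStream can't seek, which is why its contents must be buffered. This is a sketch, not PDFBox code, and the toy trailer below is hypothetical:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TailScan {
    // Read the last bytes of the file and extract the xref offset that
    // follows the "startxref" keyword (this is what makes random access
    // essential: parsing starts near the *end* of the file).
    static long findStartXref(Path pdf) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf.toFile(), "r")) {
            int tailLen = (int) Math.min(1024, raf.length());
            raf.seek(raf.length() - tailLen); // jump to the end of the file
            byte[] tail = new byte[tailLen];
            raf.readFully(tail);
            String s = new String(tail, StandardCharsets.ISO_8859_1);
            int i = s.lastIndexOf("startxref");
            if (i < 0) throw new IOException("no startxref found");
            // the token after the keyword is the xref offset
            String rest = s.substring(i + "startxref".length()).trim();
            return Long.parseLong(rest.split("\\s+")[0]);
        }
    }

    public static void main(String[] args) throws IOException {
        // toy trailer, just enough structure for the scan above
        Path p = Files.createTempFile("toy", ".pdf");
        Files.write(p, "%PDF-1.7\n...objects...\nstartxref\n1234\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(findStartXref(p)); // prints 1234
    }
}
```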
RandomAccessReadBuffer performance issues with inputStreams in 3.0
When migrating from 2.0 to 3.0 I noticed some operations were very slow, mainly the Splitter tool. With a big-ish file it would take *a lot* more memory/cpu (jdk8).

I believe the culprit is RandomAccessReadBuffer with inputstreams. This fully reads the stream in 4KB chunks (not a problem); however, every time createView(..) is called (on every PDPage access, I think) it calls a clone RARB constructor, and all its ByteBuffer chunks are duplicate()'d, which for bigger files with many pages means *tons* of wasted objects + calls (even if the underlying buf is the same). Simplifying that, for example by reusing the parent bufferList rather than duplicating it, uses the expected cpu/memory (I don't know the implications, though).

From simple observations Splitter seems to take ~4x more cpu/heap. For example, I'd assume with a 100MB file of 300 pages (normal enough if you deal with scanned docs) + inputstream: 100MB = 25,600 chunks of 4KB * 300 pages = ~7.7 million objects created+gc'd in a short time, at least. With smaller files (few pages) this isn't very noticeable, nor with RandomAccessReadBufferedFile (different handling). Passing a pre-read byte[] to RandomAccessReadBuffer works ok (minimal dupes). RandomAccess.createBuffer(inputStream) in alpha3 was also ok but was removed in beta1.

Either way, I don't think the code should be copying/duping so much and it could be restructured, especially since the migration guide hints at using RandomAccessReadBuffer for inputStreams. Also, for RARB it'd make more sense to read chunks as needed in read() rather than all at once in the constructor, I think (faster metadata querying). Incidentally, it may be useful to increase the default chunk size (or allow users to set it) to reduce fragmentation, since it's going to read the whole thing anyway and PDFs < 4KB aren't that common, I'd say.

(I don't have a publishable example at hand, but it can be easily replicated by using the PDFMergerUtility to join the same non-tiny PDF xN times, then splitting the result.)

Thanks.
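The duplicate() churn described above can be illustrated with plain java.nio buffers (a sketch of the pattern, not PDFBox internals): duplicate() shares the backing array, so no bytes are copied, but each call still allocates a new wrapper object, and one wrapper per 4KB chunk per view adds up fast:

```java
import java.nio.ByteBuffer;

public class DuplicateChurn {
    public static void main(String[] args) {
        int chunks = 25_600; // ~100 MB at 4 KB per chunk
        int views = 300;     // e.g. one createView(..) per page
        ByteBuffer chunk = ByteBuffer.allocate(4096);
        chunk.put(0, (byte) 42);

        long wrappers = 0;
        for (int v = 0; v < views; v++) {
            for (int c = 0; c < chunks; c++) {
                // new wrapper object, shared backing bytes
                ByteBuffer dup = chunk.duplicate();
                assert dup.get(0) == 42 && dup != chunk;
                wrappers++;
            }
        }
        System.out.println(wrappers); // prints 7680000
    }
}
```

The per-object cost is tiny, but ~7.7 million short-lived wrappers in a tight loop is exactly the kind of allocation pattern that shows up as GC pressure on a large heap.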