Hi,

This is an interesting observation. I'd be quite interested in following this up, since I've also seen extraordinarily high GC thrashing when running my own tool multi-threaded against an abstracted high-level API on top of PDFBox. Judging from the brief amount of time I have spent reading the PDFBox source code, I believe it was written with stability in mind rather than speed. Having said that, I'm not exactly qualified to make such statements, since I'm merely a user of PDFBox.
Now, would you mind running your profiling with JFR and JMC? You need at least JDK 1.7u40, and you can enable basic flight recording with at least the following JVM arguments:

    -XX:+UnlockCommercialFeatures -XX:+FlightRecorder

The overhead of this kind of instrumentation is rather low (1%-2% additional runtime CPU/IO overhead), even for high sampling rates and deep stack traces. Reading your post, I assume you're technically fit enough that there is no need for me to explain this kind of instrumentation further. In the past, JFR has given me much better insight into performance issues under memory pressure, and its stack trace sampling is done beautifully. It's not quite as user-friendly and versatile as YourKit, but it does its job. Flight recording does not, however, account for CPU load, so don't read too much into the latencies. (I've put an example command line below the quoted message.)

You are certainly welcome to upload a sample PDF somewhere and share your piece of code, so that others can try to reproduce this. I won't be able to look at it for at least another week, notwithstanding that I'm very interested in seeing some memory and speed improvements in PDFBox.

Last but not least, how did you run your code in parallel? Calling PDFBox from multiple threads can result in nasty surprises for some methods. Make sure that, at the very least, each thread has its own PDDocument object, although judging from your problem description that does not seem to be the issue here.

Now, I'm not entirely sure how you are tackling your problem, but it could be worthwhile and interesting to break it down into the following algorithm, which would allow you some sort of parallelism (a rough sketch follows below the quoted message):

1. Split your input PDF into n smaller PDF documents of 1 to m pages each, with each worker thread splitting out its own set of pages. Memory pressure should be low for this step.

2. Run worker threads over all the PDFs produced by step 1, adding the footer to each and saving it again. This should contain the memory pressure if PDFBox has some non-linearity in memory usage as a function of the number of pages.

3. Merge the footer-enhanced PDFs into one final PDF.

You could even consider holding all the PDDocument instances in memory after splitting. Maybe this helps you pin the issue down further.

Best regards
Roberto

On Sat, Sep 26, 2015 at 10:47 PM, Adam Retter <[email protected]> wrote:
> Hi there,
>
> I am trying to add a Footer to each page of a PDF document. My test
> document is 100MB and consists of ~2000 pages.
>
> My approach so far is similar to -
>
>     try (final PDDocument doc = PDDocument.load(pdf.asFile)) {
>         final List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
>
>         for (final PDPage page : pages) {
>             try (final PDPageContentStream stream =
>                     new PDPageContentStream(doc, page, true, true, true)) {
>                 addMyFooter(doc, page, stream);
>             }
>         }
>
>         doc.save(resultFile);
>     }
>
> Processing the above with the JVM set to use "ParallelGC" and a 2GB
> heap takes 24 seconds here. Trying to run two of those operations in
> parallel on the same JVM results in the threads running for more than
> 16 minutes, after which I got bored and killed it; during that time
> the CPU was being absolutely hosed by the JVM.
>
> With a JVM set to use "ConcMarkSweepGC" and a 2GB heap, processing
> takes 17 seconds. When trying to run two of these operations in
> parallel on the same JVM, after about 3.5 minutes I get a
> java.lang.OutOfMemoryError: Java heap space.
>
> Finally, with a JVM set to use "G1GC" and a 2GB heap, processing again
> takes 17 seconds.
> Running two of these operations in parallel on the
> same JVM causes both to complete in about 23 seconds each. Pushing
> this harder, running three of these operations in parallel on the same
> JVM results in a java.lang.OutOfMemoryError: Java heap space. Just
> before the OOM, GC time accounts for all of the CPU time taken by the
> Java process.
>
> So what I believe is that this process seems to be generating huge
> amounts of GC churn, and it also uses a large amount of memory, up to
> 2GB for a single 100 MB PDF document.
>
> I don't really understand how processing a 100MB PDF can eat 2GB of
> memory; I guess many, many Java objects are the culprit (at least
> with regards to the GC churn).
>
> Is PDFBox suitable for processing larger PDF documents, and if so,
> what stupid thing am I doing that is eating all the RAM and destroying
> performance?
>
> Thanks Adam.
>
> --
> Adam Retter
>
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
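PS: this is roughly the command line I would use to capture a recording straight to a file. The jar name, input file, heap size and recording length are only placeholders; the first two flags are the ones that actually unlock JFR:

    java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
         -XX:StartFlightRecording=duration=120s,filename=pdfbox-footer.jfr \
         -Xmx2g -jar your-footer-tool.jar input.pdf

Open pdfbox-footer.jfr in Java Mission Control afterwards and look at the memory/GC pages and at the hot methods from the sampled stack traces.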

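PPS: and here is a rough, untested sketch of the split / footer / merge pipeline from steps 1 to 3, written against the PDFBox 1.8.x API (Splitter and PDFMergerUtility) as far as I remember it. The file names, chunk size and pool size are made up, and addMyFooter() is only a placeholder for your existing footer-drawing code:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
    import org.apache.pdfbox.util.PDFMergerUtility;
    import org.apache.pdfbox.util.Splitter;

    public class FooterPipeline {

        public static void main(String[] args) throws Exception {
            File input = new File("input.pdf");           // made-up file names
            File output = new File("with-footer.pdf");
            File workDir = new File("chunks");
            workDir.mkdirs();

            // 1. Split the input into small chunk documents (low memory pressure).
            List<File> chunks = split(input, workDir, 100);

            // 2. Stamp the footer onto each chunk in parallel; every worker opens
            //    its own PDDocument, so no PDFBox objects are shared across threads.
            List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
            for (final File chunk : chunks) {
                tasks.add(new Callable<Void>() {
                    public Void call() throws Exception {
                        addFooters(chunk);
                        return null;
                    }
                });
            }
            ExecutorService pool = Executors.newFixedThreadPool(2);
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // surfaces any worker failure
            }
            pool.shutdown();

            // 3. Merge the footer-enhanced chunks into the final document.
            merge(chunks, output);
        }

        // Step 1: split into documents of at most pagesPerChunk pages each.
        static List<File> split(File input, File workDir, int pagesPerChunk) throws Exception {
            List<File> chunkFiles = new ArrayList<File>();
            PDDocument doc = PDDocument.load(input);
            try {
                Splitter splitter = new Splitter();
                splitter.setSplitAtPage(pagesPerChunk);
                List<PDDocument> parts = splitter.split(doc);
                int i = 0;
                for (PDDocument part : parts) {
                    File chunkFile = new File(workDir, "chunk-" + (i++) + ".pdf");
                    part.save(chunkFile.getAbsolutePath());
                    part.close();
                    chunkFiles.add(chunkFile);
                }
            } finally {
                doc.close();
            }
            return chunkFiles;
        }

        // Step 2: add the footer to every page of one chunk and save it in place.
        static void addFooters(File chunkFile) throws Exception {
            PDDocument doc = PDDocument.load(chunkFile);
            try {
                List<?> pages = doc.getDocumentCatalog().getAllPages();
                for (Object p : pages) {
                    PDPage page = (PDPage) p;
                    PDPageContentStream stream =
                            new PDPageContentStream(doc, page, true, true, true);
                    try {
                        addMyFooter(doc, page, stream); // your existing footer code
                    } finally {
                        stream.close();
                    }
                }
                doc.save(chunkFile.getAbsolutePath());
            } finally {
                doc.close();
            }
        }

        // Step 3: merge all footer-enhanced chunks into one output PDF.
        static void merge(List<File> chunkFiles, File output) throws Exception {
            PDFMergerUtility merger = new PDFMergerUtility();
            for (File chunkFile : chunkFiles) {
                merger.addSource(chunkFile);
            }
            merger.setDestinationFileName(output.getAbsolutePath());
            merger.mergeDocuments();
        }

        // Placeholder for your footer-drawing code.
        static void addMyFooter(PDDocument doc, PDPage page, PDPageContentStream stream) throws Exception {
        }
    }

The intermediate chunks go via the file system here; as mentioned above, you could instead keep the split PDDocument instances in memory and skip the save/load round trip, at the cost of more heap.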
