Hi

This is an interesting observation. I'd be quite interested in following
this up, since I've also seen extraordinarily high GC thrashing when running
my tool with multiple threads against an abstracted high-level API on top of
PDFBox. Judging from the brief amount of time I have spent reading the
PDFBox source code, I believe it was written with stability in mind rather
than speed. Having said that, though, I'm not exactly qualified to make such
statements, since I'm merely a user of PDFBox.

Now, would you mind running your profiling with JFR and JMC? You need at
least JDK 1.7u40, and you can enable basic flight recording by passing at
least the following arguments to the JVM:

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder
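
If you also want a recording written to disk automatically, the launch could
look roughly like this; the recording duration, file name and jar name are
placeholders for your setup:

java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
     -XX:StartFlightRecording=duration=120s,filename=pdfbox-run.jfr \
     -jar your-tool.jar

The resulting .jfr file can then be opened in JMC.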


The overhead of this kind of instrumentation is rather low (1%-2% of
additional CPU at runtime), even for high sampling rates and deep stack
traces. Reading your post, I assume you're technically fit enough, so there
is no further need to explain this kind of instrumentation.

In the past, I have found that JFR instrumentation has given me much better
insight into such performance issues under memory pressure, and the stack
trace sampling is done beautifully. It's not quite as user-friendly and
versatile as YourKit, but it does its job. Flight recording does not,
however, account for CPU load, so don't read too much into the latencies.

You are certainly welcome to upload a sample PDF somewhere and share your
piece of code, so others can try to reproduce this. I won't be able to look
at this for at least another week, but I'm very interested in seeing some
memory and speed improvements for PDFBox.

Last but not least, how did you run your code in parallel? Calling PDFBox
from multiple threads can result in nasty surprises for some methods. Make
sure that each thread works on its own PDDocument instance at the very
least, which, judging from your problem description, seems to be what you
are doing already.
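
If it helps, here is a minimal sketch of that pattern; the file names are
hypothetical and the footer code is elided:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.pdfbox.pdmodel.PDDocument;

public class PerThreadDocuments {
  public static void main(final String[] args) throws Exception {
    final ExecutorService pool = Executors.newFixedThreadPool(2);
    for (final String name : new String[] { "first.pdf", "second.pdf" }) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            // Each worker loads its own PDDocument; nothing is shared between threads.
            final PDDocument doc = PDDocument.load(new File(name));
            try {
              // ... add the footer to each page, then doc.save(...) ...
            } finally {
              doc.close();
            }
          } catch (final Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
  }
}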

Now, I'm not entirely sure how you are tackling your problem; however, it
could be worthwhile and interesting to break it down into the following
algorithm, which would allow you some degree of parallelism (a rough sketch
follows below):

1. Split your input PDF into n smaller PDF documents of 1 to m pages each,
with each worker thread handling the set of pages it is supposed to split
out. Memory pressure should be low for this step.
2. Run worker threads over all the PDFs produced by the previous step, add
the footer, and save them again. This should contain memory pressure if
PDFBox's memory usage grows non-linearly with the number of pages.
3. Merge the footer-enhanced PDFs into one final PDF.

You could even consider holding all PDDocument entries in memory after
splitting.
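
To make the idea concrete, here is a rough, untested sketch of the three
steps against the 1.8-style API your snippet uses; the chunk size, file
names and the addMyFooter() call are placeholders:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.util.PDFMergerUtility;
import org.apache.pdfbox.util.Splitter;

public class FooterPipeline {
  public static void main(final String[] args) throws Exception {
    // Step 1: split the input into chunks of m pages each and write them to disk.
    final List<File> chunkFiles = new ArrayList<File>();
    final PDDocument source = PDDocument.load(new File("input.pdf"));
    try {
      final Splitter splitter = new Splitter();
      splitter.setSplitAtPage(50);                        // m = 50; pick what suits you
      final List<PDDocument> chunks = splitter.split(source);
      for (int i = 0; i < chunks.size(); i++) {
        final File chunkFile = new File("chunk-" + i + ".pdf");
        chunks.get(i).save(chunkFile);
        chunks.get(i).close();
        chunkFiles.add(chunkFile);
      }
    } finally {
      source.close();
    }

    // Step 2: stamp the footer onto each chunk, one PDDocument per worker thread.
    final ExecutorService pool = Executors.newFixedThreadPool(4);
    final List<Future<File>> stamped = new ArrayList<Future<File>>();
    for (final File chunkFile : chunkFiles) {
      stamped.add(pool.submit(new Callable<File>() {
        public File call() throws Exception {
          final File out = new File("footer-" + chunkFile.getName());
          final PDDocument doc = PDDocument.load(chunkFile);
          try {
            for (final Object p : doc.getDocumentCatalog().getAllPages()) {
              final PDPage page = (PDPage) p;
              final PDPageContentStream stream =
                  new PDPageContentStream(doc, page, true, true, true);
              // addMyFooter(doc, page, stream);           // your footer code here
              stream.close();
            }
            doc.save(out);
            return out;
          } finally {
            doc.close();
          }
        }
      }));
    }
    pool.shutdown();

    // Step 3: merge the footer-enhanced chunks back into one final PDF.
    final PDFMergerUtility merger = new PDFMergerUtility();
    for (final Future<File> f : stamped) {
      merger.addSource(f.get());                          // get() also waits for the worker
    }
    merger.setDestinationFileName("output-with-footers.pdf");
    merger.mergeDocuments();
  }
}

Writing the chunks to disk between the steps keeps each worker's PDDocument
independent and bounds how much of the document lives in memory at any one
time; if you hold the split PDDocument instances in memory instead, you can
skip the intermediate save/load, but keep an eye on the heap.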

Maybe this helps you pin the issue down further.

Best regards
Roberto

On Sat, Sep 26, 2015 at 10:47 PM, Adam Retter <[email protected]>
wrote:

> Hi there,
>
> I am trying to add a Footer to each page of a PDF document. My test
> document is 100MB and consists of ~2000 pages.
>
> My approach so far is similar to -
>
>
> try (final PDDocument doc = PDDocument.load(pdf.asFile)) {
>   final List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
>
>   for (final PDPage page : pages) {
>     try (final PDPageContentStream stream = new
> PDPageContentStream(doc, page, true, true, true)) {
>       addMyFooter(doc, page, stream);
>     }
>   }
>
>   doc.save(resultFile);
> }
>
>
> Processing the above with the JVM set to use "ParallelGC" and a 2GB
> heap, takes 24 seconds here. Trying to run two of those operations in
> parallel on the same JVM results in the threads running for more than
> 16 minutes after which I got bored and killed it, during that time the
> CPU was being absolutely hosed by the JVM.
>
> With a JVM set to use "ConcMarkSweepGC" and a 2GB heap, processing
> takes 17 seconds. When trying to run two of these operations in
> parallel on the same JVM after about 3.5 minutes I get a
> java.lang.OutOfMemoryError: Java heap space.
>
> Finally, with a JVM set to use "G1GC" and a 2GB heap processing again
> takes 17 seconds. Running two of these operations in parallel on the
> same JVM, causes both to complete in about 23 seconds each. Pushing
> this harder, running three of these operations in parallel on the same
> JVM results in a java.lang.OutOfMemoryError: Java heap space. Just
> before the OOM, GC time accounts for all of the CPU time taken by the
> Java process.
>
>
> So what I believe is that this process seems to be generating huge
> amounts of GC churn, and also uses a large amount of memory, up to 2GB
> for a single 100 MB PDF document.
>
> I don't really understand how trying to process a 100MB PDF can eat
> 2GB of memory, I guess many many Java objects are the culprit (at
> least with regards to the GC churn).
>
> Is PDFBox suitable for processing larger PDF documents, and if so,
> what stupid thing am I doing that is eating all the RAM and destroying
> performance?
>
> Thanks Adam.
>
>
> --
> Adam Retter
>
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk
>
