Dear PDFBox users,

I am using PDFBox to place overlays on lots of different input files. In general, this works very well and reliably. Thanks to everyone who worked on that!
However, there is one class of particularly awful input files, like the one at https://www.g-ba.de/downloads/40-268-7473/2021-03-18_ASV-RL_Anpassung-Appendizes-an-EBM_TrG.pdf. That’s a beast of more than 50 MB and 2000+ pages, full of complex tables with lots of cells.

When I try to put a single-page PDF as an overlay on it with PDFBox 2.0.27, I have to start the JVM with e.g. 8 GB of heap memory, and it maxes out a CPU core on my machine for about 6 minutes. The maximum resident set size as reported by `time` is in the range of 2.4 GB. The result file is about four times the size of the input file.

With a snapshot build of 3.0.0, the max RSS does not seem to go above 1 GB, but processing had not finished after 15 minutes, at which point I aborted it. Regarding 3.0.0, I had seen the remarks at https://pdfbox.apache.org/3.0/migration.html#reduced-memory-usage, so I thought it might be worth a try. Probably the overlay will end up traversing all pages anyway, so that may not make a big difference.

My questions are:

- Is there anything I can do to make processing of such files faster or more efficient?
- What may be the reasons for the increase in output file size, and can I do anything about it?

Thanks!

-mp.