oh - I see a major problem (it's a bit of a reverse memory leak, actually) that is involved in your disk space issue (it also results in really poor performance for image tiling operations):
The cache in TilingPaintFactory is based on WeakHashMap. WeakHashMap is designed around weak referenced *keys*, and the TilingPaintFactory.create() method is constructing a new key every call. This means that entries are not going to be in the cache for very long (especially with a generational garbage collector). In fact, in my testing, the cache is always empty. This results in a new BigBufferedImage (with a new disk cache) being created every time to the tile is painted. So I think at least some of the problem is that the cache is not implemented properly. If you are OK using an additional Apache dependency, I think that the cache should be changed to this (with strong keys and weak values): https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/map/ReferenceMap.html Next, the print job is running in stripes - so a single page gets rendered many times. Right now, a new PageDrawer is being constructed for each stripe in PDFRenderer.renderImage(...), which is resulting in a huge number of BufferedImages being created. Adding a cache for PageDrawer instances addresses this. Next, the key on the TilingPant cache in TilingPaintFactory is based on the xform affine transformation. It needs this (I think) to handle scaling - but with stripes, the offset is changing but the scaling isn't. So I think that the TilingPaintParameter.hashCode() and equals() needs to be adjusted to only consider scale - not translation. When I make these changes, it dramatically reduces the disk usage. I have submitted a pull request with these changes for your evaluation (this is NOT a complete fix for the problem - it still uses BigBufferedImage): https://github.com/apache/pdfbox/pull/135 - K Kevin Day *trumpet**p| *480.961.6003 x1002 *e| *ke...@trumpetinc.com www.trumpetinc.com <http://trumpetinc.com/> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog <http://trumpetinc.com/blog/>| Twitter <https://twitter.com/trumpetinc> Proud to be Great Place To Work <https://www.greatplacetowork.com/certified-company/7012667> certified since 2019 On Sun, Nov 7, 2021 at 9:36 AM Kevin Day <ke...@trumpetinc.com> wrote: > Ah yes - the file deletion in BigBufferedImage only happens during > finalize() (or with a shutdown hook). That's why you are blowing out your > disk space. We will have the same problem even if we use ScratchFile. > > This means that we have to talk about disposing the BufferedImage > instances - but BufferedImage does not have a dispose() method on it. > > If we add dispose on the buffered image, now we also need dispose() on the > Paint object, etc.... It gets very ugly... > > - K > > Kevin Day > > *trumpet**p| *480.961.6003 x1002 > *e| *ke...@trumpetinc.com > www.trumpetinc.com <http://trumpetinc.com/> > LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog > <http://trumpetinc.com/blog/>| Twitter <https://twitter.com/trumpetinc> > > Proud to be Great Place To Work > <https://www.greatplacetowork.com/certified-company/7012667> certified > since 2019 > > > On Sun, Nov 7, 2021 at 8:55 AM Kevin Day <ke...@trumpetinc.com> wrote: > >> There is another issue with BigBufferedImage - it does not work well on >> 32 bit JVMs. The problem is that it creates a single MappedByteBuffer for >> the entire size of a channel's data (for the problem file, this means that >> each buffer has 172,975,104 bytes). >> >> I think it would be better to create a DataBuffer implementation that >> uses org.apache.pdfbox.io.ScratchFile and create a PdfBoxBufferedImage... >> >> - K >> >> Kevin Day >> >> *trumpet**p| *480.961.6003 x1002 >> *e| *ke...@trumpetinc.com >> www.trumpetinc.com <http://trumpetinc.com/> >> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog >> <http://trumpetinc.com/blog/>| Twitter <https://twitter.com/trumpetinc> >> >> Proud to be Great Place To Work >> <https://www.greatplacetowork.com/certified-company/7012667> certified >> since 2019 >> >> >> On Sun, Nov 7, 2021 at 8:36 AM Tilman Hausherr <thaush...@t-online.de> >> wrote: >> >>> Am 07.11.2021 um 16:09 schrieb Kevin Day: >>> > ok - I think that maybe there are two things going on, then. The >>> first is >>> > that we are allocating a single buffer that is bigger than the default >>> heap >>> > (700MB > 500MB). I really think that the only two solutions are to move >>> > this off the heap or reduce the size of the allocation. >>> > >>> > Then the second issue is maybe a memory leak (I have not been able to >>> see >>> > this because of the first problem unless I set my max heap to a huge >>> number >>> > - and that makes memory leak analysis very difficult). I am going to >>> clone >>> > the PdfBox git repo now and swap that class in then do a little bit of >>> > profiling with jvisualvm to see if I can help find the memory leak. >>> > >>> > Question: The Git repo for PdfBox says that it is a mirror. If I do >>> wind >>> > up creating a pull request against this repo, will you be able to >>> accept it? >>> >>> Not directly but we can create a .diff / .patch file from the PR. >>> >>> In the meantime I ran my rendering regression checks which renders 1000 >>> files. The hard disk C: went full after using 260 GB :-( >>> >>> Tilman >>> >>> >>> > >>> > >>> > FYI - the use of sun.* classes is just to work around a long standing >>> > native resource leak bug in the Windows MappedByteBuffer >>> implementation - >>> > On Windows, MappedByteBuffer does not release the allocated native >>> resource >>> > until finalize() is called, and because the MappedByteBuffer itself >>> does >>> > not take up much heap space, it can stay on the heap for quite a long >>> time >>> > even after all references are gone. So the sun.* reference is just to >>> > force the release of those native resources. I used the same technique >>> > when I created the high throughput IO layer for iText. Instead of >>> > referencing the package directly, we can use reflection (and only use >>> it >>> > for winXXX platform) - this shouldn't be a big problem. >>> > >>> > K >>> > >>> > Kevin Day >>> > >>> > *trumpet**p| *480.961.6003 x1002 >>> > *e| *ke...@trumpetinc.com >>> > www.trumpetinc.com <http://trumpetinc.com/> >>> > LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog >>> > <http://trumpetinc.com/blog/>| Twitter < >>> https://twitter.com/trumpetinc> >>> > >>> > Proud to be Great Place To Work >>> > <https://www.greatplacetowork.com/certified-company/7012667> certified >>> > since 2019 >>> > >>> > >>> > On Sun, Nov 7, 2021 at 7:53 AM Tilman Hausherr <thaush...@t-online.de> >>> > wrote: >>> > >>> >> I tried it and it works pretty fast. The memory leak is still there >>> but >>> >> it comes later. >>> >> >>> >> Other drawbacks arethat it doesn't clean up after itself fast enough, >>> >> and it uses a sun.* class. >>> >> >>> >> Tilman >>> >> >>> >> Am 07.11.2021 um 14:31 schrieb Tilman Hausherr: >>> >>> Am 05.11.2021 um 22:09 schrieb Kevin Day: >>> >>>> ok - so should we be clamping the xstep in some way? Or at this >>> >>>> depth of >>> >>>> the algorithm do we not have enough context to actually tell that it >>> >>>> will >>> >>>> be outside the page/clipping region? >>> >>> Yes + yes, that is the problem. The context would have to include the >>> >>> region but also the current transformation matrix. Which could change >>> >>> despite using the same pattern. So it's tricky. >>> >>> >>> >>>> >>> >>>> I'm beginning to think that the BigBufferedImage might be the right >>> >>>> solution... This is very inefficient, but honestly, the PDF is not >>> >>>> exactly >>> >>>> well formed, so if these particular files wind up being slower to >>> render >>> >>>> b/c they have to swap image content to disk, I think that is OK... >>> >>> Maybe... the good thing is that we know when the image will be "too >>> >>> big". The license is OK: >>> >>> >>> >>> https://www.apache.org/legal/resolved.html >>> >>> >>> >>> But I'm still wondering whether we have a memory leak. If we have, >>> >>> then it should be fixed. >>> >>> >>> >>> Tilman >>> >>> >>> >>> >>> >>>> - K >>> >>>> >>> >>>> >>> >>>> Kevin Day >>> >>>> >>> >>>> *trumpet**p| *480.961.6003 x1002 >>> >>>> *e| *ke...@trumpetinc.com >>> >>>> www.trumpetinc.com <http://trumpetinc.com/> >>> >>>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet >>> Blog >>> >>>> <http://trumpetinc.com/blog/>| Twitter < >>> https://twitter.com/trumpetinc> >>> >>>> >>> >>>> Proud to be Great Place To Work >>> >>>> <https://www.greatplacetowork.com/certified-company/7012667> >>> Certified >>> >>>> 2020-2021 >>> >>>> >>> >>>> >>> >>>> On Fri, Nov 5, 2021 at 12:47 PM Tilman Hausherr < >>> thaush...@t-online.de> >>> >>>> wrote: >>> >>>> >>> >>>>> Am 05.11.2021 um 20:38 schrieb Kevin Day: >>> >>>>>> I do have a bit of experience with memory leaks (side effect of >>> >>>>>> writing >>> >>>>>> high performance java code, unfortunately!) >>> >>>>>> >>> >>>>>> >>> >>>>>> I am pretty sure that this is not a memory leak, though. If this >>> >>>>>> was a >>> >>>>>> memory leak, we would see calls happening multiple times before >>> >>>>>> failure. >>> >>>>>> >>> >>>>>> The failure happens on the very first call to create the buffered >>> >>>>>> image: >>> >>>>>> >>> >>>>>> BufferedImage image = new BufferedImage(rasterWidth, >>> >>>>> rasterHeight, >>> >>>>>> BufferedImage.TYPE_INT_ARGB); >>> >>>>>> >>> >>>>>> Nothing is even getting put into the weakcache. >>> >>>>>> >>> >>>>> That is done in TilingPaintFactory. The cache was needed for some >>> files >>> >>>>> that have used the pattern several times, to avoid recreating the >>> image >>> >>>>> for the TexturePaint. >>> >>>>> >>> >>>>> >>> >>>>>> I think the core issue is that the raster data for the image is >>> >>>>>> just too >>> >>>>>> big to store on the heap (without the heap being massive). It's >>> >>>>>> 13152x13152x4 bytes (there is an alpha channel on this as well) - >>> >>>>>> that's >>> >>>>>> almost 700MB - just for this one pattern. >>> >>>>>> >>> >>>>>> >>> >>>>>> Tilman wrote: "That >>> >>>>>> pattern has an XStep and YStep of 23438 although the image is >>> 2148 x >>> >>>>>> 440. Because of a matrix scale of 0.0673396 the image pattern >>> size is >>> >>>>>> 1578 x 1578 at 72 dpi. So at 1200 dpi the size would be about >>> 26300 x >>> >>>>>> 26300" >>> >>>>>> >>> >>>>>> I am fairly familiar with PDF (I contributed the parsing library >>> to >>> >>>>>> iText >>> >>>>>> back in the day), but I'm not very familiar with rendering tile >>> >>>>>> patterns, >>> >>>>>> so the above is not intuitive to me (yet). Is there some chance >>> that >>> >>>>>> PdfBox is doing a lot more work than is necessary for this >>> particular >>> >>>>> PDF? >>> >>>>>> Like is the matrix scale just absurdly bad in this file? For an >>> image >>> >>>>> that >>> >>>>>> is originally 2148x440, it seems like requiring a 700MB raster >>> >>>>>> should not >>> >>>>>> be necessary - but I do NOT have nearly enough experience with >>> this >>> >>>>>> area >>> >>>>> of >>> >>>>>> PDF... >>> >>>>> The ridiculous thing isn't the matrix, it's the XStep and YStep. >>> This >>> >>>>> goes well outside of the page. The 0.067 matrix scaling makes it >>> >>>>> less bad. >>> >>>>> >>> >>>>> A weakness in PDFBox is that we don't readjust the tile image, or >>> >>>>> use an >>> >>>>> alternative painting method that doesn't use TexturePaint when the >>> >>>>> image >>> >>>>> would be painted only once. >>> >>>>> >>> >>>>> Tilman >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>>> Thanks, >>> >>>>>> >>> >>>>>> - Kevin >>> >>>>>> >>> >>>>>> >>> >>>>>> Kevin Day >>> >>>>>> >>> >>>>>> *trumpet**p| *480.961.6003 x1002 >>> >>>>>> *e| *ke...@trumpetinc.com >>> >>>>>> www.trumpetinc.com <http://trumpetinc.com/> >>> >>>>>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet >>> >> Blog >>> >>>>>> <http://trumpetinc.com/blog/>| Twitter >>> >>>>>> <https://twitter.com/trumpetinc> >>> >>>>>> >>> >>>>>> Proud to be Great Place To Work >>> >>>>>> <https://www.greatplacetowork.com/certified-company/7012667> >>> >> Certified >>> >>>>>> 2020-2021 >>> >>>>>> >>> >>>>>> >>> >>>>>> On Fri, Nov 5, 2021 at 11:27 AM Tilman Hausherr >>> >>>>>> <thaush...@t-online.de> >>> >>>>>> wrote: >>> >>>>>> >>> >>>>>>> If you're experienced with memory leaks then it would be nice if >>> you >>> >>>>>>> could search. >>> >>>>>>> >>> >>>>>>> Things I tried that didn't help: >>> >>>>>>> - calling graphics.setPaint(null) after operations (to "lose" >>> >>>>>>> TilingPaint objects) >>> >>>>>>> - disabling the (weak) cache of TilingPaint objects >>> >>>>>>> - adding finalize to see if the TilingPaint class gets finalized >>> >>>>>>> (yes) >>> >>>>>>> - adding finalize to see if the HighResolutionImageIcon class >>> gets >>> >>>>>>> finalized (yes) >>> >>>>>>> >>> >>>>>>> Tilman >>> >>>>>>> >>> >>>>> >>> --------------------------------------------------------------------- >>> >>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org >>> >>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org >>> >>>>> >>> >>>>> >>> >>> >>> >>> --------------------------------------------------------------------- >>> >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org >>> >>> For additional commands, e-mail: users-h...@pdfbox.apache.org >>> >>> >>> >> >>> >> --------------------------------------------------------------------- >>> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org >>> >> For additional commands, e-mail: users-h...@pdfbox.apache.org >>> >> >>> >> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org >>> For additional commands, e-mail: users-h...@pdfbox.apache.org >>> >>>