oh - I see a major problem (it's a bit of a reverse memory leak, actually)
that is involved in your disk space issue (it also results in really poor
performance for image tiling operations):


The cache in TilingPaintFactory is based on WeakHashMap.  WeakHashMap is
designed around weak referenced *keys*, and the TilingPaintFactory.create()
method is constructing a new key every call.  This means that entries are
not going to be in the cache for very long (especially with a generational
garbage collector).  In fact, in my testing, the cache is always empty.
This results in a new BigBufferedImage (with a new disk cache) being
created every time to the tile is painted.

So I think at least some of the problem is that the cache is not
implemented properly.

If you are OK using an additional Apache dependency, I think that the cache
should be changed to this (with strong keys and weak values):
https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/map/ReferenceMap.html



Next, the print job is running in stripes - so a single page gets rendered
many times.  Right now, a new PageDrawer is being constructed for each
stripe in PDFRenderer.renderImage(...), which is resulting in a huge number
of BufferedImages being created.  Adding a cache for PageDrawer instances
addresses this.


Next, the key on the TilingPant cache in TilingPaintFactory is based on the
xform affine transformation.  It needs this (I think) to handle scaling -
but with stripes, the offset is changing but the scaling isn't.  So I think
that the TilingPaintParameter.hashCode() and equals() needs to be adjusted
to only consider scale - not translation.



When I make these changes, it dramatically reduces the disk usage.


I have submitted a pull request with these changes for your evaluation
(this is NOT a complete fix for the problem - it still uses
BigBufferedImage): https://github.com/apache/pdfbox/pull/135

- K

Kevin Day

*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/>
LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
<http://trumpetinc.com/blog/>| Twitter  <https://twitter.com/trumpetinc>

Proud to be Great Place To Work
<https://www.greatplacetowork.com/certified-company/7012667> certified
since 2019


On Sun, Nov 7, 2021 at 9:36 AM Kevin Day <ke...@trumpetinc.com> wrote:

> Ah yes - the file deletion in BigBufferedImage only happens during
> finalize() (or with a shutdown hook).  That's why you are blowing out your
> disk space.  We will have the same problem even if we use ScratchFile.
>
> This means that we have to talk about disposing the BufferedImage
> instances - but BufferedImage does not have a dispose() method on it.
>
> If we add dispose on the buffered image, now we also need dispose() on the
> Paint object, etc....  It gets very ugly...
>
> - K
>
> Kevin Day
>
> *trumpet**p| *480.961.6003 x1002
> *e| *ke...@trumpetinc.com
> www.trumpetinc.com <http://trumpetinc.com/>
> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
> <http://trumpetinc.com/blog/>| Twitter  <https://twitter.com/trumpetinc>
>
> Proud to be Great Place To Work
> <https://www.greatplacetowork.com/certified-company/7012667> certified
> since 2019
>
>
> On Sun, Nov 7, 2021 at 8:55 AM Kevin Day <ke...@trumpetinc.com> wrote:
>
>> There is another issue with BigBufferedImage - it does not work well on
>> 32 bit JVMs.  The problem is that it creates a single MappedByteBuffer for
>> the entire size of a channel's data (for the problem file, this means that
>> each buffer has 172,975,104 bytes).
>>
>> I think it would be better to create a DataBuffer implementation that
>> uses org.apache.pdfbox.io.ScratchFile and create a PdfBoxBufferedImage...
>>
>> - K
>>
>> Kevin Day
>>
>> *trumpet**p| *480.961.6003 x1002
>> *e| *ke...@trumpetinc.com
>> www.trumpetinc.com <http://trumpetinc.com/>
>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
>> <http://trumpetinc.com/blog/>| Twitter  <https://twitter.com/trumpetinc>
>>
>> Proud to be Great Place To Work
>> <https://www.greatplacetowork.com/certified-company/7012667> certified
>> since 2019
>>
>>
>> On Sun, Nov 7, 2021 at 8:36 AM Tilman Hausherr <thaush...@t-online.de>
>> wrote:
>>
>>> Am 07.11.2021 um 16:09 schrieb Kevin Day:
>>> > ok - I think that maybe there are two things going on, then.  The
>>> first is
>>> > that we are allocating a single buffer that is bigger than the default
>>> heap
>>> > (700MB > 500MB). I really think that the only two solutions are to move
>>> > this off the heap or reduce the size of the allocation.
>>> >
>>> > Then the second issue is maybe a memory leak (I have not been able to
>>> see
>>> > this because of the first problem unless I set my max heap to a huge
>>> number
>>> > - and that makes memory leak analysis very difficult).  I am going to
>>> clone
>>> > the PdfBox git repo now and swap that class in then do a little bit of
>>> > profiling with jvisualvm to see if I can help find the memory leak.
>>> >
>>> > Question:  The Git repo for PdfBox says that it is a mirror.  If I do
>>> wind
>>> > up creating a pull request against this repo, will you be able to
>>> accept it?
>>>
>>> Not directly but we can create a .diff / .patch file from the PR.
>>>
>>> In the meantime I ran my rendering regression checks which renders 1000
>>> files. The hard disk C: went full after using 260 GB :-(
>>>
>>> Tilman
>>>
>>>
>>> >
>>> >
>>> > FYI - the use of sun.* classes is just to work around a long standing
>>> > native resource leak bug in the Windows MappedByteBuffer
>>> implementation -
>>> > On Windows, MappedByteBuffer does not release the allocated native
>>> resource
>>> > until finalize() is called, and because the MappedByteBuffer itself
>>> does
>>> > not take up much heap space, it can stay on the heap for quite a long
>>> time
>>> > even after all references are gone.  So the sun.* reference is just to
>>> > force the release of those native resources.  I used the same technique
>>> > when I created the high throughput IO layer for iText.  Instead of
>>> > referencing the package directly, we can use reflection (and only use
>>> it
>>> > for winXXX platform) - this shouldn't be a big problem.
>>> >
>>> > K
>>> >
>>> > Kevin Day
>>> >
>>> > *trumpet**p| *480.961.6003 x1002
>>> > *e| *ke...@trumpetinc.com
>>> > www.trumpetinc.com <http://trumpetinc.com/>
>>> > LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
>>> > <http://trumpetinc.com/blog/>| Twitter  <
>>> https://twitter.com/trumpetinc>
>>> >
>>> > Proud to be Great Place To Work
>>> > <https://www.greatplacetowork.com/certified-company/7012667> certified
>>> > since 2019
>>> >
>>> >
>>> > On Sun, Nov 7, 2021 at 7:53 AM Tilman Hausherr <thaush...@t-online.de>
>>> > wrote:
>>> >
>>> >> I tried it and it works pretty fast. The memory leak is still there
>>> but
>>> >> it comes later.
>>> >>
>>> >> Other drawbacks arethat it doesn't clean up after itself fast enough,
>>> >> and it uses a sun.* class.
>>> >>
>>> >> Tilman
>>> >>
>>> >> Am 07.11.2021 um 14:31 schrieb Tilman Hausherr:
>>> >>> Am 05.11.2021 um 22:09 schrieb Kevin Day:
>>> >>>> ok - so should we be clamping the xstep in some way?  Or at this
>>> >>>> depth of
>>> >>>> the algorithm do we not have enough context to actually tell that it
>>> >>>> will
>>> >>>> be outside the page/clipping region?
>>> >>> Yes + yes, that is the problem. The context would have to include the
>>> >>> region but also the current transformation matrix. Which could change
>>> >>> despite using the same pattern. So it's tricky.
>>> >>>
>>> >>>>
>>> >>>> I'm beginning to think that the BigBufferedImage might be the right
>>> >>>> solution...  This is very inefficient, but honestly, the PDF is not
>>> >>>> exactly
>>> >>>> well formed, so if these particular files wind up being slower to
>>> render
>>> >>>> b/c they have to swap image content to disk, I think that is OK...
>>> >>> Maybe... the good thing is that we know when the image will be "too
>>> >>> big". The license is OK:
>>> >>>
>>> >>> https://www.apache.org/legal/resolved.html
>>> >>>
>>> >>> But I'm still wondering whether we have a memory leak. If we have,
>>> >>> then it should be fixed.
>>> >>>
>>> >>> Tilman
>>> >>>
>>> >>>
>>> >>>> - K
>>> >>>>
>>> >>>>
>>> >>>> Kevin Day
>>> >>>>
>>> >>>> *trumpet**p| *480.961.6003 x1002
>>> >>>> *e| *ke...@trumpetinc.com
>>> >>>> www.trumpetinc.com <http://trumpetinc.com/>
>>> >>>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet
>>> Blog
>>> >>>> <http://trumpetinc.com/blog/>| Twitter <
>>> https://twitter.com/trumpetinc>
>>> >>>>
>>> >>>> Proud to be Great Place To Work
>>> >>>> <https://www.greatplacetowork.com/certified-company/7012667>
>>> Certified
>>> >>>> 2020-2021
>>> >>>>
>>> >>>>
>>> >>>> On Fri, Nov 5, 2021 at 12:47 PM Tilman Hausherr <
>>> thaush...@t-online.de>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> Am 05.11.2021 um 20:38 schrieb Kevin Day:
>>> >>>>>> I do have a bit of experience with memory leaks (side effect of
>>> >>>>>> writing
>>> >>>>>> high performance java code, unfortunately!)
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> I am pretty sure that this is not a memory leak, though.  If this
>>> >>>>>> was a
>>> >>>>>> memory leak, we would see calls happening multiple times before
>>> >>>>>> failure.
>>> >>>>>>
>>> >>>>>> The failure happens on the very first call to create the buffered
>>> >>>>>> image:
>>> >>>>>>
>>> >>>>>>            BufferedImage image = new BufferedImage(rasterWidth,
>>> >>>>> rasterHeight,
>>> >>>>>> BufferedImage.TYPE_INT_ARGB);
>>> >>>>>>
>>> >>>>>> Nothing is even getting put into the weakcache.
>>> >>>>>>
>>> >>>>> That is done in TilingPaintFactory. The cache was needed for some
>>> files
>>> >>>>> that have used the pattern several times, to avoid recreating the
>>> image
>>> >>>>> for the TexturePaint.
>>> >>>>>
>>> >>>>>
>>> >>>>>> I think the core issue is that the raster data for the image is
>>> >>>>>> just too
>>> >>>>>> big to store on the heap (without the heap being massive). It's
>>> >>>>>> 13152x13152x4 bytes (there is an alpha channel on this as well) -
>>> >>>>>> that's
>>> >>>>>> almost 700MB - just for this one pattern.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Tilman wrote:  "That
>>> >>>>>> pattern has an XStep and YStep of 23438 although the image is
>>> 2148 x
>>> >>>>>> 440. Because of a matrix scale of 0.0673396 the image pattern
>>> size is
>>> >>>>>> 1578 x 1578 at 72 dpi. So at 1200 dpi the size would be about
>>> 26300 x
>>> >>>>>> 26300"
>>> >>>>>>
>>> >>>>>> I am fairly familiar with PDF (I contributed the parsing library
>>> to
>>> >>>>>> iText
>>> >>>>>> back in the day), but I'm not very familiar with rendering tile
>>> >>>>>> patterns,
>>> >>>>>> so the above is not intuitive to me (yet).  Is there some chance
>>> that
>>> >>>>>> PdfBox is doing a lot more work than is necessary for this
>>> particular
>>> >>>>> PDF?
>>> >>>>>> Like is the matrix scale just absurdly bad in this file?  For an
>>> image
>>> >>>>> that
>>> >>>>>> is originally 2148x440, it seems like requiring a 700MB raster
>>> >>>>>> should not
>>> >>>>>> be necessary - but I do NOT have nearly enough experience with
>>> this
>>> >>>>>> area
>>> >>>>> of
>>> >>>>>> PDF...
>>> >>>>> The ridiculous thing isn't the matrix, it's the XStep and YStep.
>>> This
>>> >>>>> goes well outside of the page. The 0.067 matrix scaling makes it
>>> >>>>> less bad.
>>> >>>>>
>>> >>>>> A weakness in PDFBox is that we don't readjust the tile image, or
>>> >>>>> use an
>>> >>>>> alternative painting method that doesn't use TexturePaint when the
>>> >>>>> image
>>> >>>>> would be painted only once.
>>> >>>>>
>>> >>>>> Tilman
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>> - Kevin
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Kevin Day
>>> >>>>>>
>>> >>>>>> *trumpet**p| *480.961.6003 x1002
>>> >>>>>> *e| *ke...@trumpetinc.com
>>> >>>>>> www.trumpetinc.com <http://trumpetinc.com/>
>>> >>>>>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet
>>> >> Blog
>>> >>>>>> <http://trumpetinc.com/blog/>| Twitter
>>> >>>>>> <https://twitter.com/trumpetinc>
>>> >>>>>>
>>> >>>>>> Proud to be Great Place To Work
>>> >>>>>> <https://www.greatplacetowork.com/certified-company/7012667>
>>> >> Certified
>>> >>>>>> 2020-2021
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Fri, Nov 5, 2021 at 11:27 AM Tilman Hausherr
>>> >>>>>> <thaush...@t-online.de>
>>> >>>>>> wrote:
>>> >>>>>>
>>> >>>>>>> If you're experienced with memory leaks then it would be nice if
>>> you
>>> >>>>>>> could search.
>>> >>>>>>>
>>> >>>>>>> Things I tried that didn't help:
>>> >>>>>>> - calling graphics.setPaint(null) after operations (to "lose"
>>> >>>>>>> TilingPaint objects)
>>> >>>>>>> - disabling the (weak) cache of TilingPaint objects
>>> >>>>>>> - adding finalize to see if the TilingPaint class gets finalized
>>> >>>>>>> (yes)
>>> >>>>>>> - adding finalize to see if the HighResolutionImageIcon class
>>> gets
>>> >>>>>>> finalized (yes)
>>> >>>>>>>
>>> >>>>>>> Tilman
>>> >>>>>>>
>>> >>>>>
>>> ---------------------------------------------------------------------
>>> >>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>> >>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>> >>>>>
>>> >>>>>
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>> >>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>> >>>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>> >>
>>> >>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>
>>>

Reply via email to