Another thought:

As I've worked with the code, I think that maybe one possible issue is
that TilingPaint is creating its image during construction.

Would it be better if TilingPaint constructed an image on demand
in createContext()?  That way you would have full information about the
clipping region.  Would that work better??  Maybe it would not be necessary
to create this huge image at all if we did it that way?

- Kevin


Kevin Day

*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/>
LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
<http://trumpetinc.com/blog/>| Twitter  <https://twitter.com/trumpetinc>

Proud to be Great Place To Work
<https://www.greatplacetowork.com/certified-company/7012667> certified
since 2019


On Sun, Nov 7, 2021 at 11:14 AM Kevin Day <ke...@trumpetinc.com> wrote:

> oh - I see a major problem (it's a bit of a reverse memory leak, actually)
> that is involved in your disk space issue (it also results in really poor
> performance for image tiling operations):
>
>
> The cache in TilingPaintFactory is based on WeakHashMap.  WeakHashMap is
> designed around weak referenced *keys*, and the TilingPaintFactory.create()
> method is constructing a new key every call.  This means that entries are
> not going to be in the cache for very long (especially with a generational
> garbage collector).  In fact, in my testing, the cache is always empty.
> This results in a new BigBufferedImage (with a new disk cache) being
> created every time to the tile is painted.
>
> So I think at least some of the problem is that the cache is not
> implemented properly.
>
> If you are OK using an additional Apache dependency, I think that the
> cache should be changed to this (with strong keys and weak values):
> https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/map/ReferenceMap.html
>
>
>
> Next, the print job is running in stripes - so a single page gets rendered
> many times.  Right now, a new PageDrawer is being constructed for each
> stripe in PDFRenderer.renderImage(...), which is resulting in a huge number
> of BufferedImages being created.  Adding a cache for PageDrawer instances
> addresses this.
>
>
> Next, the key on the TilingPant cache in TilingPaintFactory is based on
> the xform affine transformation.  It needs this (I think) to handle scaling
> - but with stripes, the offset is changing but the scaling isn't.  So I
> think that the TilingPaintParameter.hashCode() and equals() needs to be
> adjusted to only consider scale - not translation.
>
>
>
> When I make these changes, it dramatically reduces the disk usage.
>
>
> I have submitted a pull request with these changes for your evaluation
> (this is NOT a complete fix for the problem - it still uses
> BigBufferedImage): https://github.com/apache/pdfbox/pull/135
>
> - K
>
> Kevin Day
>
> *trumpet**p| *480.961.6003 x1002
> *e| *ke...@trumpetinc.com
> www.trumpetinc.com <http://trumpetinc.com/>
> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
> <http://trumpetinc.com/blog/>| Twitter  <https://twitter.com/trumpetinc>
>
> Proud to be Great Place To Work
> <https://www.greatplacetowork.com/certified-company/7012667> certified
> since 2019
>
>
> On Sun, Nov 7, 2021 at 9:36 AM Kevin Day <ke...@trumpetinc.com> wrote:
>
>> Ah yes - the file deletion in BigBufferedImage only happens during
>> finalize() (or with a shutdown hook).  That's why you are blowing out your
>> disk space.  We will have the same problem even if we use ScratchFile.
>>
>> This means that we have to talk about disposing the BufferedImage
>> instances - but BufferedImage does not have a dispose() method on it.
>>
>> If we add dispose on the buffered image, now we also need dispose() on
>> the Paint object, etc....  It gets very ugly...
>>
>> - K
>>
>> Kevin Day
>>
>> *trumpet**p| *480.961.6003 x1002
>> *e| *ke...@trumpetinc.com
>> www.trumpetinc.com <http://trumpetinc.com/>
>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
>> <http://trumpetinc.com/blog/>| Twitter  <https://twitter.com/trumpetinc>
>>
>> Proud to be Great Place To Work
>> <https://www.greatplacetowork.com/certified-company/7012667> certified
>> since 2019
>>
>>
>> On Sun, Nov 7, 2021 at 8:55 AM Kevin Day <ke...@trumpetinc.com> wrote:
>>
>>> There is another issue with BigBufferedImage - it does not work well on
>>> 32 bit JVMs.  The problem is that it creates a single MappedByteBuffer for
>>> the entire size of a channel's data (for the problem file, this means that
>>> each buffer has 172,975,104 bytes).
>>>
>>> I think it would be better to create a DataBuffer implementation that
>>> uses org.apache.pdfbox.io.ScratchFile and create a PdfBoxBufferedImage...
>>>
>>> - K
>>>
>>> Kevin Day
>>>
>>> *trumpet**p| *480.961.6003 x1002
>>> *e| *ke...@trumpetinc.com
>>> www.trumpetinc.com <http://trumpetinc.com/>
>>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet Blog
>>> <http://trumpetinc.com/blog/>| Twitter  <https://twitter.com/trumpetinc>
>>>
>>> Proud to be Great Place To Work
>>> <https://www.greatplacetowork.com/certified-company/7012667> certified
>>> since 2019
>>>
>>>
>>> On Sun, Nov 7, 2021 at 8:36 AM Tilman Hausherr <thaush...@t-online.de>
>>> wrote:
>>>
>>>> Am 07.11.2021 um 16:09 schrieb Kevin Day:
>>>> > ok - I think that maybe there are two things going on, then.  The
>>>> first is
>>>> > that we are allocating a single buffer that is bigger than the
>>>> default heap
>>>> > (700MB > 500MB). I really think that the only two solutions are to
>>>> move
>>>> > this off the heap or reduce the size of the allocation.
>>>> >
>>>> > Then the second issue is maybe a memory leak (I have not been able to
>>>> see
>>>> > this because of the first problem unless I set my max heap to a huge
>>>> number
>>>> > - and that makes memory leak analysis very difficult).  I am going to
>>>> clone
>>>> > the PdfBox git repo now and swap that class in then do a little bit of
>>>> > profiling with jvisualvm to see if I can help find the memory leak.
>>>> >
>>>> > Question:  The Git repo for PdfBox says that it is a mirror.  If I do
>>>> wind
>>>> > up creating a pull request against this repo, will you be able to
>>>> accept it?
>>>>
>>>> Not directly but we can create a .diff / .patch file from the PR.
>>>>
>>>> In the meantime I ran my rendering regression checks which renders 1000
>>>> files. The hard disk C: went full after using 260 GB :-(
>>>>
>>>> Tilman
>>>>
>>>>
>>>> >
>>>> >
>>>> > FYI - the use of sun.* classes is just to work around a long standing
>>>> > native resource leak bug in the Windows MappedByteBuffer
>>>> implementation -
>>>> > On Windows, MappedByteBuffer does not release the allocated native
>>>> resource
>>>> > until finalize() is called, and because the MappedByteBuffer itself
>>>> does
>>>> > not take up much heap space, it can stay on the heap for quite a long
>>>> time
>>>> > even after all references are gone.  So the sun.* reference is just to
>>>> > force the release of those native resources.  I used the same
>>>> technique
>>>> > when I created the high throughput IO layer for iText.  Instead of
>>>> > referencing the package directly, we can use reflection (and only use
>>>> it
>>>> > for winXXX platform) - this shouldn't be a big problem.
>>>> >
>>>> > K
>>>> >
>>>> > Kevin Day
>>>> >
>>>> > *trumpet**p| *480.961.6003 x1002
>>>> > *e| *ke...@trumpetinc.com
>>>> > www.trumpetinc.com <http://trumpetinc.com/>
>>>> > LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet
>>>> Blog
>>>> > <http://trumpetinc.com/blog/>| Twitter  <
>>>> https://twitter.com/trumpetinc>
>>>> >
>>>> > Proud to be Great Place To Work
>>>> > <https://www.greatplacetowork.com/certified-company/7012667>
>>>> certified
>>>> > since 2019
>>>> >
>>>> >
>>>> > On Sun, Nov 7, 2021 at 7:53 AM Tilman Hausherr <thaush...@t-online.de
>>>> >
>>>> > wrote:
>>>> >
>>>> >> I tried it and it works pretty fast. The memory leak is still there
>>>> but
>>>> >> it comes later.
>>>> >>
>>>> >> Other drawbacks arethat it doesn't clean up after itself fast enough,
>>>> >> and it uses a sun.* class.
>>>> >>
>>>> >> Tilman
>>>> >>
>>>> >> Am 07.11.2021 um 14:31 schrieb Tilman Hausherr:
>>>> >>> Am 05.11.2021 um 22:09 schrieb Kevin Day:
>>>> >>>> ok - so should we be clamping the xstep in some way?  Or at this
>>>> >>>> depth of
>>>> >>>> the algorithm do we not have enough context to actually tell that
>>>> it
>>>> >>>> will
>>>> >>>> be outside the page/clipping region?
>>>> >>> Yes + yes, that is the problem. The context would have to include
>>>> the
>>>> >>> region but also the current transformation matrix. Which could
>>>> change
>>>> >>> despite using the same pattern. So it's tricky.
>>>> >>>
>>>> >>>>
>>>> >>>> I'm beginning to think that the BigBufferedImage might be the right
>>>> >>>> solution...  This is very inefficient, but honestly, the PDF is not
>>>> >>>> exactly
>>>> >>>> well formed, so if these particular files wind up being slower to
>>>> render
>>>> >>>> b/c they have to swap image content to disk, I think that is OK...
>>>> >>> Maybe... the good thing is that we know when the image will be "too
>>>> >>> big". The license is OK:
>>>> >>>
>>>> >>> https://www.apache.org/legal/resolved.html
>>>> >>>
>>>> >>> But I'm still wondering whether we have a memory leak. If we have,
>>>> >>> then it should be fixed.
>>>> >>>
>>>> >>> Tilman
>>>> >>>
>>>> >>>
>>>> >>>> - K
>>>> >>>>
>>>> >>>>
>>>> >>>> Kevin Day
>>>> >>>>
>>>> >>>> *trumpet**p| *480.961.6003 x1002
>>>> >>>> *e| *ke...@trumpetinc.com
>>>> >>>> www.trumpetinc.com <http://trumpetinc.com/>
>>>> >>>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>| Trumpet
>>>> Blog
>>>> >>>> <http://trumpetinc.com/blog/>| Twitter <
>>>> https://twitter.com/trumpetinc>
>>>> >>>>
>>>> >>>> Proud to be Great Place To Work
>>>> >>>> <https://www.greatplacetowork.com/certified-company/7012667>
>>>> Certified
>>>> >>>> 2020-2021
>>>> >>>>
>>>> >>>>
>>>> >>>> On Fri, Nov 5, 2021 at 12:47 PM Tilman Hausherr <
>>>> thaush...@t-online.de>
>>>> >>>> wrote:
>>>> >>>>
>>>> >>>>> Am 05.11.2021 um 20:38 schrieb Kevin Day:
>>>> >>>>>> I do have a bit of experience with memory leaks (side effect of
>>>> >>>>>> writing
>>>> >>>>>> high performance java code, unfortunately!)
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> I am pretty sure that this is not a memory leak, though.  If this
>>>> >>>>>> was a
>>>> >>>>>> memory leak, we would see calls happening multiple times before
>>>> >>>>>> failure.
>>>> >>>>>>
>>>> >>>>>> The failure happens on the very first call to create the buffered
>>>> >>>>>> image:
>>>> >>>>>>
>>>> >>>>>>            BufferedImage image = new BufferedImage(rasterWidth,
>>>> >>>>> rasterHeight,
>>>> >>>>>> BufferedImage.TYPE_INT_ARGB);
>>>> >>>>>>
>>>> >>>>>> Nothing is even getting put into the weakcache.
>>>> >>>>>>
>>>> >>>>> That is done in TilingPaintFactory. The cache was needed for some
>>>> files
>>>> >>>>> that have used the pattern several times, to avoid recreating the
>>>> image
>>>> >>>>> for the TexturePaint.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>> I think the core issue is that the raster data for the image is
>>>> >>>>>> just too
>>>> >>>>>> big to store on the heap (without the heap being massive). It's
>>>> >>>>>> 13152x13152x4 bytes (there is an alpha channel on this as well) -
>>>> >>>>>> that's
>>>> >>>>>> almost 700MB - just for this one pattern.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Tilman wrote:  "That
>>>> >>>>>> pattern has an XStep and YStep of 23438 although the image is
>>>> 2148 x
>>>> >>>>>> 440. Because of a matrix scale of 0.0673396 the image pattern
>>>> size is
>>>> >>>>>> 1578 x 1578 at 72 dpi. So at 1200 dpi the size would be about
>>>> 26300 x
>>>> >>>>>> 26300"
>>>> >>>>>>
>>>> >>>>>> I am fairly familiar with PDF (I contributed the parsing library
>>>> to
>>>> >>>>>> iText
>>>> >>>>>> back in the day), but I'm not very familiar with rendering tile
>>>> >>>>>> patterns,
>>>> >>>>>> so the above is not intuitive to me (yet).  Is there some chance
>>>> that
>>>> >>>>>> PdfBox is doing a lot more work than is necessary for this
>>>> particular
>>>> >>>>> PDF?
>>>> >>>>>> Like is the matrix scale just absurdly bad in this file?  For an
>>>> image
>>>> >>>>> that
>>>> >>>>>> is originally 2148x440, it seems like requiring a 700MB raster
>>>> >>>>>> should not
>>>> >>>>>> be necessary - but I do NOT have nearly enough experience with
>>>> this
>>>> >>>>>> area
>>>> >>>>> of
>>>> >>>>>> PDF...
>>>> >>>>> The ridiculous thing isn't the matrix, it's the XStep and YStep.
>>>> This
>>>> >>>>> goes well outside of the page. The 0.067 matrix scaling makes it
>>>> >>>>> less bad.
>>>> >>>>>
>>>> >>>>> A weakness in PDFBox is that we don't readjust the tile image, or
>>>> >>>>> use an
>>>> >>>>> alternative painting method that doesn't use TexturePaint when the
>>>> >>>>> image
>>>> >>>>> would be painted only once.
>>>> >>>>>
>>>> >>>>> Tilman
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>>
>>>> >>>>>> - Kevin
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Kevin Day
>>>> >>>>>>
>>>> >>>>>> *trumpet**p| *480.961.6003 x1002
>>>> >>>>>> *e| *ke...@trumpetinc.com
>>>> >>>>>> www.trumpetinc.com <http://trumpetinc.com/>
>>>> >>>>>> LinkedIn <https://www.linkedin.com/company/trumpet-inc.>|
>>>> Trumpet
>>>> >> Blog
>>>> >>>>>> <http://trumpetinc.com/blog/>| Twitter
>>>> >>>>>> <https://twitter.com/trumpetinc>
>>>> >>>>>>
>>>> >>>>>> Proud to be Great Place To Work
>>>> >>>>>> <https://www.greatplacetowork.com/certified-company/7012667>
>>>> >> Certified
>>>> >>>>>> 2020-2021
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Fri, Nov 5, 2021 at 11:27 AM Tilman Hausherr
>>>> >>>>>> <thaush...@t-online.de>
>>>> >>>>>> wrote:
>>>> >>>>>>
>>>> >>>>>>> If you're experienced with memory leaks then it would be nice
>>>> if you
>>>> >>>>>>> could search.
>>>> >>>>>>>
>>>> >>>>>>> Things I tried that didn't help:
>>>> >>>>>>> - calling graphics.setPaint(null) after operations (to "lose"
>>>> >>>>>>> TilingPaint objects)
>>>> >>>>>>> - disabling the (weak) cache of TilingPaint objects
>>>> >>>>>>> - adding finalize to see if the TilingPaint class gets finalized
>>>> >>>>>>> (yes)
>>>> >>>>>>> - adding finalize to see if the HighResolutionImageIcon class
>>>> gets
>>>> >>>>>>> finalized (yes)
>>>> >>>>>>>
>>>> >>>>>>> Tilman
>>>> >>>>>>>
>>>> >>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>> >>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>> >>>>>
>>>> >>>>>
>>>> >>>
>>>> >>>
>>>> ---------------------------------------------------------------------
>>>> >>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>> >>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>> >>>
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>> >>
>>>> >>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>>
>>>>

Reply via email to