Hello all,

there we go. The report below is based on comparing the "before" and "after"
profiles in Eclipse Memory Analyzer. To sum it up:

- There are many 'CaffeineCache' instances that show up only in the 'after'
  profile and are reported as a leak suspect. Interestingly, when I instructed
  the memory analyzer to compare the two dumps directly, it did not derive any
  leak suspects from the after<>before comparison.

- Looking at the class histogram of the snapshot comparison, I assume the
  'shallow heap #2-#1' difference is the most relevant column for finding
  the critical path at a lower level.
  Interesting: shallow heap seems to increase most for HashMap$Node, String,
  long[] and byte[] (see the details below).

Please let me know if there is anything further I can do.
Florian


Overview
========

"Overview - Details" shows 291,9 MB before vs. 358,5 MB. I don't know what 
exactly this is referring to as `top` within the solr container and `kubectl` 
stated memory allocation around 1,4 GiB.
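
My assumption about that gap: MAT's overview only sums up the live objects
contained in the dump, while `top`/`kubectl` see the committed Java heap plus
metaspace, thread stacks, direct buffers and other native memory. A minimal,
purely illustrative sketch (it would have to run inside the Solr JVM) that
prints both figures via the standard MemoryMXBean:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapFigures {
        public static void main(String[] args) {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = mem.getHeapMemoryUsage();
            MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
            // "used" is closest to what MAT sums up from live objects;
            // "committed" plus non-heap/native memory is closer to what the
            // container-level tools report.
            System.out.printf("heap used      = %,d bytes%n", heap.getUsed());
            System.out.printf("heap committed = %,d bytes%n", heap.getCommitted());
            System.out.printf("non-heap used  = %,d bytes%n", nonHeap.getUsed());
        }
    }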

Dominator Tree
==============

 - Top 3 (before)
   JmxMBeanServer    (shallow heap = 40,  retained heap = 14.082.024, percentage = 4,60%)
   RingBuffer        (shallow heap = 144, retained heap = 7.340.424,  percentage = 2,40%)
   SolrMetricManager (shallow heap = 48,  retained heap = 4.296.328,  percentage = 1,40%)

 - Top 3 (after)
   JmxMBeanServer    (shallow heap = 40,  retained heap = 16.370.888, percentage = 4,34%)
   RingBuffer        (shallow heap = 144, retained heap = 7.340.424,  percentage = 1,95%)
   SolrMetricManager (shallow heap = 48,  retained heap = 4.981.712,  percentage = 1,33%)

The top entries stay fairly constant. However, there are many 'CaffeineCache'
instances that only show up in the 'after' dump, with retained heaps of e.g.
4.961.472, 4.119.528 and 3.516.368. They are also listed as a leak suspect
(see below). I assume these figures refer to bytes.

Leak Suspects - "before"
========================

Problem Suspect 1
-----------------

50.371 instances of javax.management.ObjectName, loaded by <system class 
loader> occupy 74.946.952 (24,49 %) bytes.

Most of these instances are referenced from one instance of 
java.util.HashMap$Node[], loaded by <system class loader>, which occupies 
14.042.960 (4,59 %) bytes. The instance is referenced by 
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener @ 
0x739f72c18, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Keywords

javax.management.ObjectName
java.util.HashMap$Node[]
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener
org.eclipse.jetty.webapp.WebAppClassLoader


Problem Suspect 2
-----------------

156 instances of org.apache.solr.update.UpdateLog, loaded by 
org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000 occupy 48.435.616 
(15,82 %) bytes.

Most of these instances are referenced from one instance of 
java.util.HashMap$Node[], loaded by <system class loader>, which occupies 1.040 
(0,00 %) bytes. The instance is referenced by org.apache.solr.core.SolrCores @ 
0x739427be0, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread has a local 
variable or reference to org.apache.solr.core.SolrCores @ 0x739427be0 which is 
on the shortest path to java.util.HashMap$Node[256] @ 0x743d2be90. The thread 
org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread keeps local 
variables with total size 224 (0,00 %) bytes.

The stacktrace of this Thread is available.

Keywords

org.apache.solr.update.UpdateLog
org.eclipse.jetty.webapp.WebAppClassLoader
java.util.HashMap$Node[]
org.apache.solr.core.SolrCores
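
A side note on Suspect 1: the number of javax.management.ObjectName instances
grows from 50.371 to 58.650 between the dumps. If it helps, the registered
MBeans could also be counted at runtime over JMX, without a heap dump. A rough
sketch; the JMX URL is hypothetical (depends on how remote JMX is exposed on
the pod), and I am assuming the JmxMetricsReporter registers its MBeans under
the "solr" domain:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import java.util.Set;

    public class CountSolrMBeans {
        public static void main(String[] args) throws Exception {
            // Hypothetical JMX service URL; adjust to the actual setup.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Assumption: Solr's JMX reporter uses the "solr" JMX domain.
                Set<ObjectName> names =
                        conn.queryNames(new ObjectName("solr:*"), null);
                System.out.println("registered Solr MBeans: " + names.size());
            }
        }
    }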


Leak Suspects - "after"
=======================

Note that Problem Suspect 3 only shows up in the 'after' dump.

Problem Suspect 1
-----------------

58.650 instances of javax.management.ObjectName, loaded by <system class 
loader> occupy 87.289.920 (23,22 %) bytes.

Most of these instances are referenced from one instance of 
java.util.HashMap$Node[], loaded by <system class loader>, which occupies 
16.268.824 (4,33 %) bytes. The instance is referenced by 
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener @ 
0x739f72c18, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Keywords

javax.management.ObjectName
java.util.HashMap$Node[]
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener
org.eclipse.jetty.webapp.WebAppClassLoader


Problem Suspect 2
-----------------

182 instances of org.apache.solr.update.UpdateLog, loaded by 
org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000 occupy 51.072.600 
(13,58 %) bytes.

Most of these instances are referenced from one instance of 
java.util.HashMap$Node[], loaded by <system class loader>, which occupies 2.064 
(0,00 %) bytes. The instance is referenced by org.apache.solr.core.SolrCores @ 
0x739427be0, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread has a local 
variable or reference to org.apache.solr.core.SolrCores @ 0x739427be0 which is 
on the shortest path to java.util.HashMap$Node[512] @ 0x74dc49eb8. The thread 
org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread keeps local 
variables with total size 224 (0,00 %) bytes.

The stacktrace of this Thread is available.

Keywords

org.apache.solr.update.UpdateLog
org.eclipse.jetty.webapp.WebAppClassLoader
java.util.HashMap$Node[]
org.apache.solr.core.SolrCores


Problem Suspect 3
-----------------

910 instances of org.apache.solr.search.CaffeineCache, loaded by 
org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000 occupy 47.949.360 
(12,75 %) bytes.

Most of these instances are referenced from one instance of 
java.util.HashMap$Node[], loaded by <system class loader>, which occupies 2.064 
(0,00 %) bytes. The instance is referenced by org.apache.solr.core.SolrCores @ 
0x739427be0, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread has a local 
variable or reference to org.apache.solr.core.SolrCores @ 0x739427be0 which is 
on the shortest path to java.util.HashMap$Node[512] @ 0x74dc49eb8. The thread 
org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread keeps local 
variables with total size 224 (0,00 %) bytes.

The stacktrace of this Thread is available.

Keywords

org.apache.solr.search.CaffeineCache
org.eclipse.jetty.webapp.WebAppClassLoader
java.util.HashMap$Node[]
org.apache.solr.core.SolrCores
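
Since Suspect 3 points at CaffeineCache instances reachable via SolrCores, one
cross-check that comes to mind (just an idea) would be to list the per-core
cache metrics through Solr's Metrics API while memory is high, to see for
which cores caches are still being reported. A small sketch using the JDK HTTP
client; host and port are assumptions to be adjusted:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DumpCacheMetrics {
        public static void main(String[] args) throws Exception {
            // Assumed host/port; adjust to the actual Solr service.
            String url =
                "http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE";
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Prints the JSON with cache metrics per core; caches belonging to
            // cores left over from REINDEXCOLLECTION should become visible here.
            System.out.println(response.body());
        }
    }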


Leak Suspects by Snapshot Comparison
====================================

No suspects were found here. However, the class histogram shows the following
information. The values below are copied 1:1 from the class histogram; the
columns are:
- class name
- objects #1 (i.e. the 'before' dump)
- objects #2-#1 (i.e. the difference after-before)
- shallow heap #1
- shallow heap #2-#1 (i.e. the shallow heap increase)
- retained heap #1
- retained heap #2-#1 (i.e. the retained heap increase)

org.apache.solr.search.CaffeineCache
      780   +130   106.080   +17.680   >= 17.647.480   ≈ +30.303.600

com.github.benmanes.caffeine.cache.SSLSMS
      780   +130   237.120   +39.520   >= 17.545.784   ≈ +29.743.920

com.github.benmanes.caffeine.cache.PSMW
      12.584   +17.638   503.360   +705.520   >= 5.620.496   ≈ +23.365.656

java.util.concurrent.CompletableFuture
      12.596   +17.625   302.304   +423.000   >= 5.119.520   ≈ +22.658.448

long[]
      19.771   +4.403   7.751.048   +17.844.576   >= 7.751.048   ≈ +17.844.576

org.apache.solr.search.BitDocSet
      354   +1.574   8.496   +37.776   >= 3.110.920   ≈ +17.260.584

org.apache.lucene.util.FixedBitSet
      369   +1.559   8.856   +37.416   >= 3.103.672   ≈ +17.221.640

java.util.HashMap
      119.490   +18.457   5.735.520   +885.936   >= 101.690.496   ≈ +16.595.992

java.util.HashMap$Node[]
      114.553   +17.410   11.168.712   +1.628.712   >= 97.420.632   ≈ +15.830.392

java.util.HashMap$Node
      793.117   +130.345   25.379.744   +4.171.040   >= 86.734.088   ≈ +14.241.040

java.lang.String
      1.121.440   +190.635   26.914.560   +4.575.240   >= 71.930.520   ≈ +12.518.456

javax.management.ObjectName
      50.551   +8.279   1.617.632   +264.928   >= 75.107.872   ≈ +12.342.968

byte[]
      1.214.028   +220.670   100.549.544   +11.897.224   >= 100.549.544   ≈ +11.897.224

java.lang.Object[]
      61.766   +24.761   5.426.192   +1.476.208   >= 57.114.176   ≈ +7.655.120

java.util.ArrayList
      56.277   +22.808   1.350.648   +547.392   >= 12.604.656   ≈ +6.154.880

com.codahale.metrics.Timer
      5.462   +904   131.088   +21.696   >= 7.941.264   ≈ +5.093.920

com.codahale.metrics.Histogram
      5.463   +904   131.112   +21.696   >= 5.843.248   ≈ +4.746.640

com.codahale.metrics.ExponentiallyDecayingReservoir
      5.463   +904   305.928   +50.624   >= 5.536.048   ≈ +4.695.800

java.util.concurrent.ConcurrentSkipListMap
      5.466   +904   262.368   +43.392   >= 4.313.592   ≈ +4.493.104

java.util.concurrent.ConcurrentSkipListMap$Index
      23.087   +24.681   554.088   +592.344   >= 4.010.920   ≈ +4.440.376

java.util.concurrent.ConcurrentHashMap
      25.133   +3.751   1.608.512   +240.064   >= 28.338.872   ≈ +3.965.624

java.util.concurrent.ConcurrentHashMap$Node[]
      15.109   +2.425   3.370.416   +631.184   >= 27.139.248   ≈ +3.947.888

java.util.concurrent.ConcurrentSkipListMap$Node
      44.005   +48.274   1.056.120   +1.158.576   >= 3.456.832   ≈ +3.848.032

org.apache.solr.search.QueryResultKey
      3.764   +5.522   210.784   +309.232   >= 5.585.824   ≈ +3.814.448

java.util.concurrent.ConcurrentHashMap$Node
      283.804   +55.058   9.081.728   +1.761.856   >= 24.163.208   ≈ +3.378.344

total: 25 of 15.027 entries shown; 15.002 more
      7.055.191   +1.375.017   306.092.872   +69.871.128
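
One last note on methodology: I took the dumps following the jattach hint from
the thread below. In case it is useful, an equivalent dump can also be
triggered programmatically from inside the JVM via the HotSpotDiagnostic
MXBean (roughly what jcmd/jattach do). A minimal sketch, the output path being
just an example:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class DumpHeap {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // live = true dumps only reachable objects, which is what MAT analyses.
            diag.dumpHeap("/var/solr/data/before-reindex.hprof", true);
        }
    }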

On 2025/05/16 12:38:55 Jason Gerlowski wrote:
> Hi Florian,
>
> Glad you were able to get the heap dumps as Jan suggested!
>
> I wouldn't sell yourself short - looking at heap dumps can be
> intimidating if you haven't done it before, but there's some really
> great analysis tools out there that summarize things in a pretty clear
> way. I'd open the heap dump in one of those (I personally recommend
> the "Eclipse Memory Analyzer Tool", sometimes called "MAT") and look
> at the biggest sections of the "Allocations" pie chart. The
> "Dominator Report" is another good place to look, if you see that in
> your tool of choice. You may recognize more of the Solr class and
> object names than you're giving yourself credit for, and if you don't
> you can always share them here and we can try and interpret.
>
> In general folks (myself included) are a little reluctant to click
> file-download links shared on the mailing list, mostly for security
> reasons. Open source projects are common targets of phishing attacks,
> and while every sign points to your sincerity and trustworthiness,
> it's hard to argue against prudence.
>
> Best,
>
> Jason
>
> On Mon, May 12, 2025 at 9:34 AM Schieder Florian
> <fl...@puzzleyou.de.invalid> wrote:
> >
> > Good morning,
> >
> > please excuse the late response. The jattach hint was very useful, thank 
> > you very much.
> > I have created some memory dumps before and after the REINDEXCOLLECTION 
> > jobs. Honestly, I think it won't help much if I take a deep dive into 
> > understanding the problem myself. I could share them via download with you 
> > if you want to.
> >
> > Florian
> >
> > On 2025/04/30 12:28:19 Jan Høydahl wrote:
> > > We provide jattach for these purposes in the official docker image: 
> > > https://solr.apache.org/guide/solr/latest/deployment-guide/solr-in-docker.html#debugging-with-jattach
> > >
> > > Jan
> > >
> > > > 30. apr. 2025 kl. 13:11 skrev Jason Gerlowski <ge...@gmail.com>:
> > > >
> > > > Hi Florian,
> > > >
> > > > I haven't heard any reports of a memory leak triggered by the
> > > > REINDEXCOLLECTION codepath, but such a leak is always possible. I'd
> > > > love to see what you find if you're able to take a heap dump!
> > > >
> > > > The typical way (afaik) to create heap dumps "on demand" is with jcmd
> > > > or jmap. If that's not available on the Solr pods, it might be
> > > > possible to "kubectl cp" one of those executables into your pods, and
> > > > then run it to produce a dump. If that doesn't work, then the other
> > > > option I'd recommend is making sure that Solr is running with the
> > > > "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file-or-dir-path>"
> > > > JVM flags so that a heap dump is produced when OOM is eventually hit.
> > > > (You'd want to make sure that the heap-dump path is something
> > > > persistent, so that the heap dump won't be wiped out by a
> > > > pod-restart.)
> > > >
> > > > Hopefully that helps a bit. Let us know if you're able to find anything!
> > > >
> > > > Best,
> > > >
> > > > Jason
> > > >
> > > > On Wed, Apr 30, 2025 at 6:55 AM Schieder Florian
> > > > <fl...@puzzleyou.de.invalid> wrote:
> > > >>
> > > >> Hello all,
> > > >>
> > > >> we are running Solr 9.8.1 on Kubernetes using the official Solr 
> > > >> Operator 0.9.1.
> > > >>
> > > >> After many asynchronous REINDEXCOLLECTION requests we are encountering 
> > > >> a significant RAM allocation increase which is itself neither 
> > > >> surprising nor problematic, but after those requests have successfully 
> > > >> completed, the consumption stays constant and does not decrease. 
> > > >> Manually restarting the SolrCloud deployment always resolves that 
> > > >> issue, so for me it seems like either the allocated blocks are 
> > > >> afterwards not being freed up correctly or, they *are* deallocated 
> > > >> correctly but after the requests completed, the heap itself is 
> > > >> fragmented too much.
> > > >>
> > > >> If the heap allocation increased and increased after several 
> > > >> REINDEXCOLLECTION requests and we did not restart the pods, this led 
> > > >> to OOM crashes for us in the past. In this case, the REINDEXCOLLECTION 
> > > >> async requests were interrupted which required manual intervention.
> > > >>
> > > >> It's easy for me to reproduce this RAM allocation behavior, but after 
> > > >> diving into how to create JVM memory dumps with 'jcmd' and 'jmap', it 
> > > >> seems to me like it's not trivially possible to create such a heap 
> > > >> dump with the current tools installed into the Solr pods. However, I'm 
> > > >> not that expert regarding Solr and the Java ecosystem, so if there are 
> > > >> any solutions to create memory dumps to help the Solr development team 
> > > >> tracing the RAM consumption issues explained above, any ideas are 
> > > >> welcome.
> > > >>
> > > >> Many thanks!
> > > >> Florian Schieder
> > >
> > >
>
