Hello all, here we go. The following report is based on comparing the "before" and "after" profiles using Eclipse Memory Analyzer. To sum it up:
- There are many 'CaffeineCache' instances that show up only in the 'after' profile, and they are reported as a leak suspect there. Interestingly, when I instructed the memory analyzer directly to compare the two snapshots, it did not derive any leak suspects from the after<>before comparison.
- Looking at the snapshot comparison class histogram, I assume the "shallow heap #2-#1" difference is the most relevant column for tracking down the critical path at a lower level. Interesting: shallow heap increases most for HashMap$Node, String, long[] and byte[] (see below in detail).

Please let me know if there is anything further I can do.

Florian

Overview
========
"Overview - Details" shows 291,9 MB before vs. 358,5 MB after. I don't know what exactly this number refers to, as `top` within the Solr container and `kubectl` both reported memory allocation of around 1,4 GiB.

Dominator Tree
==============
- Top 3 (before)
  JmxMBeanServer (shallow heap = 40, retained heap = 14.082.024, percentage = 4,60%)
  RingBuffer (shallow heap = 144, retained heap = 7.340.424, percentage = 2,40%)
  SolrMetricManager (shallow heap = 48, retained heap = 4.296.328, percentage = 1,40%)
- Top 3 (after)
  JmxMBeanServer (shallow heap = 40, retained heap = 16.370.888, percentage = 4,34%)
  RingBuffer (shallow heap = 144, retained heap = 7.340.424, percentage = 1,95%)
  SolrMetricManager (shallow heap = 48, retained heap = 4.981.712, percentage = 1,33%)

This seems very constant. However, there are many 'CaffeineCache' instances which only show up in the 'after' dump. Examples: retained heap 4.961.472, 4.119.528, 3.516.368. They are also listed as a leak suspect (see below). I assume these figures refer to bytes.

Leak Suspects - "before"
========================

Problem Suspect 1
-----------------
50.371 instances of javax.management.ObjectName, loaded by <system class loader> occupy 74.946.952 (24,49 %) bytes.

Most of these instances are referenced from one instance of java.util.HashMap$Node[], loaded by <system class loader>, which occupies 14.042.960 (4,59 %) bytes. The instance is referenced by org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener @ 0x739f72c18, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Keywords
javax.management.ObjectName
java.util.HashMap$Node[]
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener
org.eclipse.jetty.webapp.WebAppClassLoader

Problem Suspect 2
-----------------
156 instances of org.apache.solr.update.UpdateLog, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000 occupy 48.435.616 (15,82 %) bytes.

Most of these instances are referenced from one instance of java.util.HashMap$Node[], loaded by <system class loader>, which occupies 1.040 (0,00 %) bytes. The instance is referenced by org.apache.solr.core.SolrCores @ 0x739427be0, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread has a local variable or reference to org.apache.solr.core.SolrCores @ 0x739427be0 which is on the shortest path to java.util.HashMap$Node[256] @ 0x743d2be90. The thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread keeps local variables with total size 224 (0,00 %) bytes.

The stacktrace of this Thread is available. See stacktrace. See stacktrace with involved local variables.

Keywords
org.apache.solr.update.UpdateLog
org.eclipse.jetty.webapp.WebAppClassLoader
java.util.HashMap$Node[]
org.apache.solr.core.SolrCores
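To make Suspect 1 a bit more tangible for myself, here is a minimal, purely hypothetical Java sketch (not Solr's actual reporter code; the class and metric names are made up) of the pattern MAT seems to describe: MBeans that are registered per metric but never unregistered keep their ObjectName entries reachable from the MBeanServer's internal maps. If something along these lines happens per REINDEXCOLLECTION cycle, it would match the growth from 50.371 to 58.650 ObjectName instances between the dumps.

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Hypothetical illustration only: every registered MBean keeps its
    // ObjectName reachable from the MBeanServer's internal maps. If new
    // metric names are registered per core / per reindex cycle and never
    // unregistered, the ObjectName and HashMap$Node counts can only grow.
    public class MBeanGrowthSketch {

        public interface DummyMBean {
            int getValue();
        }

        public static class Dummy implements DummyMBean {
            @Override
            public int getValue() {
                return 42;
            }
        }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            for (int i = 0; i < 1_000; i++) {
                // The name pattern is invented; real Solr metric names differ.
                ObjectName name = new ObjectName("example:type=cache,name=cache" + i);
                server.registerMBean(new Dummy(), name);
                // Missing counterpart on core close: server.unregisterMBean(name);
            }
            System.out.println("registered MBeans: " + server.getMBeanCount());
        }
    }

Whether the Solr metrics reporter actually skips the unregister step somewhere is exactly what I cannot tell from the dump alone.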
Leak Suspects - "after"
=======================
Note that Suspect 3 only shows up here.

Problem Suspect 1
-----------------
58.650 instances of javax.management.ObjectName, loaded by <system class loader> occupy 87.289.920 (23,22 %) bytes.

Most of these instances are referenced from one instance of java.util.HashMap$Node[], loaded by <system class loader>, which occupies 16.268.824 (4,33 %) bytes. The instance is referenced by org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener @ 0x739f72c18, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Keywords
javax.management.ObjectName
java.util.HashMap$Node[]
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxListener
org.eclipse.jetty.webapp.WebAppClassLoader

Problem Suspect 2
-----------------
182 instances of org.apache.solr.update.UpdateLog, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000 occupy 51.072.600 (13,58 %) bytes.

Most of these instances are referenced from one instance of java.util.HashMap$Node[], loaded by <system class loader>, which occupies 2.064 (0,00 %) bytes. The instance is referenced by org.apache.solr.core.SolrCores @ 0x739427be0, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread has a local variable or reference to org.apache.solr.core.SolrCores @ 0x739427be0 which is on the shortest path to java.util.HashMap$Node[512] @ 0x74dc49eb8. The thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread keeps local variables with total size 224 (0,00 %) bytes.

The stacktrace of this Thread is available. See stacktrace. See stacktrace with involved local variables.

Keywords
org.apache.solr.update.UpdateLog
org.eclipse.jetty.webapp.WebAppClassLoader
java.util.HashMap$Node[]
org.apache.solr.core.SolrCores

Problem Suspect 3
-----------------
910 instances of org.apache.solr.search.CaffeineCache, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000 occupy 47.949.360 (12,75 %) bytes.

Most of these instances are referenced from one instance of java.util.HashMap$Node[], loaded by <system class loader>, which occupies 2.064 (0,00 %) bytes. The instance is referenced by org.apache.solr.core.SolrCores @ 0x739427be0, loaded by org.eclipse.jetty.webapp.WebAppClassLoader @ 0x739200000.

Thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread has a local variable or reference to org.apache.solr.core.SolrCores @ 0x739427be0 which is on the shortest path to java.util.HashMap$Node[512] @ 0x74dc49eb8. The thread org.apache.solr.core.CloserThread @ 0x7389df580 CloserThread keeps local variables with total size 224 (0,00 %) bytes.

The stacktrace of this Thread is available. See stacktrace. See stacktrace with involved local variables.

Keywords
org.apache.solr.search.CaffeineCache
org.eclipse.jetty.webapp.WebAppClassLoader
java.util.HashMap$Node[]
org.apache.solr.core.SolrCores
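Before the histogram, one more hypothetical sketch of what Suspects 2 and 3 seem to have in common: both the UpdateLog and the CaffeineCache instances are kept alive through a HashMap held by SolrCores. The Java example below is entirely made up (CoreRegistry/PerCoreState are my names, not Solr classes) and only illustrates the general retention pattern: as long as a per-core entry stays in such a registry map, everything reachable from it cannot be garbage collected, however large it is.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical illustration only: a long-lived registry map keeps every
    // value (and everything it references: update-log buffers, caches, ...)
    // strongly reachable until the entry is explicitly removed.
    public class CoreRetentionSketch {

        static class PerCoreState {
            final byte[] updateLogBuffer = new byte[64 * 1024];  // stand-in for an UpdateLog
            final Map<String, long[]> caches = new HashMap<>();  // stand-in for cache contents
        }

        static final Map<String, PerCoreState> registry = new HashMap<>();

        public static void main(String[] args) {
            // If a reindex cycle adds replacement cores but the superseded
            // entries are never removed, both generations stay reachable.
            for (int generation = 0; generation < 2; generation++) {
                for (int core = 0; core < 100; core++) {
                    registry.put("core" + core + "_gen" + generation, new PerCoreState());
                }
                // Missing here: registry.remove(...) for the previous generation.
            }
            System.out.println("retained per-core states: " + registry.size());
        }
    }

Whether Solr actually leaves stale entries behind in SolrCores after REINDEXCOLLECTION is exactly what I cannot tell from the dumps alone.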
Leak Suspects by Snapshot Comparison
====================================
No suspects were found. However, the class histogram shows the following information. The rows below are copied 1:1 from the class histogram; the columns are:
- class name
- objects #1 (i.e. the "before" dump)
- objects #2-#1 (i.e. difference after-before)
- shallow heap #1
- shallow heap #2-#1 (i.e. shallow heap increase)
- retained heap #1
- retained heap #2-#1 (i.e. retained heap increase)

org.apache.solr.search.CaffeineCache                      780        +130      106.080      +17.680      >= 17.647.480   ≈ +30.303.600
com.github.benmanes.caffeine.cache.SSLSMS                 780        +130      237.120      +39.520      >= 17.545.784   ≈ +29.743.920
com.github.benmanes.caffeine.cache.PSMW                   12.584     +17.638   503.360      +705.520     >= 5.620.496    ≈ +23.365.656
java.util.concurrent.CompletableFuture                    12.596     +17.625   302.304      +423.000     >= 5.119.520    ≈ +22.658.448
long[]                                                    19.771     +4.403    7.751.048    +17.844.576  >= 7.751.048    ≈ +17.844.576
org.apache.solr.search.BitDocSet                          354        +1.574    8.496        +37.776      >= 3.110.920    ≈ +17.260.584
org.apache.lucene.util.FixedBitSet                        369        +1.559    8.856        +37.416      >= 3.103.672    ≈ +17.221.640
java.util.HashMap                                         119.490    +18.457   5.735.520    +885.936     >= 101.690.496  ≈ +16.595.992
java.util.HashMap$Node[]                                  114.553    +17.410   11.168.712   +1.628.712   >= 97.420.632   ≈ +15.830.392
java.util.HashMap$Node                                    793.117    +130.345  25.379.744   +4.171.040   >= 86.734.088   ≈ +14.241.040
java.lang.String                                          1.121.440  +190.635  26.914.560   +4.575.240   >= 71.930.520   ≈ +12.518.456
javax.management.ObjectName                               50.551     +8.279    1.617.632    +264.928     >= 75.107.872   ≈ +12.342.968
byte[]                                                    1.214.028  +220.670  100.549.544  +11.897.224  >= 100.549.544  ≈ +11.897.224
java.lang.Object[]                                        61.766     +24.761   5.426.192    +1.476.208   >= 57.114.176   ≈ +7.655.120
java.util.ArrayList                                       56.277     +22.808   1.350.648    +547.392     >= 12.604.656   ≈ +6.154.880
com.codahale.metrics.Timer                                5.462      +904      131.088      +21.696      >= 7.941.264    ≈ +5.093.920
com.codahale.metrics.Histogram                            5.463      +904      131.112      +21.696      >= 5.843.248    ≈ +4.746.640
com.codahale.metrics.ExponentiallyDecayingReservoir       5.463      +904      305.928      +50.624      >= 5.536.048    ≈ +4.695.800
java.util.concurrent.ConcurrentSkipListMap                5.466      +904      262.368      +43.392      >= 4.313.592    ≈ +4.493.104
java.util.concurrent.ConcurrentSkipListMap$Index          23.087     +24.681   554.088      +592.344     >= 4.010.920    ≈ +4.440.376
java.util.concurrent.ConcurrentHashMap                    25.133     +3.751    1.608.512    +240.064     >= 28.338.872   ≈ +3.965.624
java.util.concurrent.ConcurrentHashMap$Node[]             15.109     +2.425    3.370.416    +631.184     >= 27.139.248   ≈ +3.947.888
java.util.concurrent.ConcurrentSkipListMap$Node           44.005     +48.274   1.056.120    +1.158.576   >= 3.456.832    ≈ +3.848.032
org.apache.solr.search.QueryResultKey                     3.764      +5.522    210.784      +309.232     >= 5.585.824    ≈ +3.814.448
java.util.concurrent.ConcurrentHashMap$Node               283.804    +55.058   9.081.728    +1.761.856   >= 24.163.208   ≈ +3.378.344
total: 25 of 15.027 entries; 15.002 more                  7.055.191  +1.375.017 306.092.872 +69.871.128
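For my own understanding of the CaffeineCache / BitDocSet / FixedBitSet / long[] rows above, here is a small, hedged sketch. It assumes the Caffeine library (com.github.benmanes.caffeine:caffeine) on the classpath; the sizes and names are invented and this is not Solr's CaffeineCache implementation. It only demonstrates that a Caffeine-backed cache keeps its values strongly reachable until entries are evicted or the cache itself becomes unreachable, so the +130 additional caches holding bitset-like values would plausibly show up exactly as the long[] growth in the histogram.

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    // Hedged sketch, not Solr code: a Caffeine cache holding bitset-like
    // long[] values. Values stay strongly reachable until they are evicted
    // or until the cache itself is released together with its owning core.
    public class CacheRetentionSketch {
        public static void main(String[] args) {
            Cache<String, long[]> filterCache = Caffeine.newBuilder()
                    .maximumSize(512)   // illustrative per-cache cap
                    .build();
            for (int q = 0; q < 512; q++) {
                // one bitset-sized value (~8 KiB) per cached query key
                filterCache.put("query-" + q, new long[1024]);
            }
            // Roughly 512 * 8 KiB ≈ 4 MiB retained by this one cache instance;
            // multiply by every cache that never becomes unreachable.
            System.out.println("entries retained: " + filterCache.estimatedSize());
        }
    }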
On 2025/05/16 12:38:55 Jason Gerlowski wrote:
> Hi Florian,
>
> Glad you were able to get the heap dumps as Jan suggested!
>
> I wouldn't sell yourself short - looking at heap dumps can be intimidating if you haven't done it before, but there's some really great analysis tools out there that summarize things in a pretty clear way. I'd open the heap dump in one of those (I personally recommend the "Eclipse Memory Analyzer Tool", sometimes called "MAT") and look at the biggest sections of the "Allocations" pie chart. The "Dominator Report" is another good place to look, if you see that in your tool of choice. You may recognize more of the Solr class and object names than you're giving yourself credit for, and if you don't you can always share them here and we can try and interpret.
>
> In general folks (myself included) are a little reluctant to click file-download links shared on the mailing list, mostly for security reasons. Open source projects are common targets of phishing attacks, and while every sign points to your sincerity and trustworthiness, it's hard to argue against prudence.
>
> Best,
>
> Jason
>
> On Mon, May 12, 2025 at 9:34 AM Schieder Florian <fl...@puzzleyou.de.invalid> wrote:
> >
> > Good morning,
> >
> > please excuse the late response. The jattach hint was very useful, thank you very much.
> > I have created some memory dumps before and after the REINDEXCOLLECTION jobs. Honestly, I think it won't help much if I take a deep dive into understanding the problem myself. I could share them via download with you if you want to.
> >
> > Florian
> >
> > On 2025/04/30 12:28:19 Jan Høydahl wrote:
> > > We provide jattach for these purposes in the official docker image:
> > > https://solr.apache.org/guide/solr/latest/deployment-guide/solr-in-docker.html#debugging-with-jattach
> > >
> > > Jan
> > >
> > > > On 30 Apr 2025, at 13:11, Jason Gerlowski <ge...@gmail.com> wrote:
> > > >
> > > > Hi Florian,
> > > >
> > > > I haven't heard any reports of a memory leak triggered by the REINDEXCOLLECTION codepath, but such a leak is always possible. I'd love to see what you find if you're able to take a heap dump!
> > > >
> > > > The typical way (afaik) to create heap dumps "on demand" is with jcmd or jmap. If that's not available on the Solr pods, it might be possible to "kubectl cp" one of those executables into your pods, and then run it to produce a dump. If that doesn't work, then the other option I'd recommend is making sure that Solr is running with the "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file-or-dir-path>" JVM flags so that a heap dump is produced when OOM is eventually hit. (You'd want to make sure that the heap-dump path is something persistent, so that the heap dump won't be wiped out by a pod-restart.)
> > > >
> > > > Hopefully that helps a bit. Let us know if you're able to find anything!
> > > >
> > > > Best,
> > > >
> > > > Jason
> > > >
> > > > On Wed, Apr 30, 2025 at 6:55 AM Schieder Florian <fl...@puzzleyou.de.invalid> wrote:
> > > >>
> > > >> Hello all,
> > > >>
> > > >> we are running Solr 9.8.1 on Kubernetes using the official Solr Operator 0.9.1.
> > > >>
> > > >> After many asynchronous REINDEXCOLLECTION requests we are encountering a significant RAM allocation increase which is itself neither surprising nor problematic, but after those requests have successfully completed, the consumption stays constant and does not decrease. Manually restarting the SolrCloud deployment always resolves that issue, so for me it seems like either the allocated blocks are afterwards not being freed up correctly or, they *are* deallocated correctly but after the requests completed, the heap itself is fragmented too much.
> > > >>
> > > >> If the heap allocation increased and increased after several REINDEXCOLLECTION requests and we did not restart the pods, this led to OOM crashes for us in the past. In this case, the REINDEXCOLLECTION async requests were interrupted which required manual intervention.
> > > >>
> > > >> It's easy for me to reproduce this RAM allocation behavior, but after diving into how to create JVM memory dumps with 'jcmd' and 'jmap', it seems to me like it's not trivially possible to create such a heap dump with the current tools installed into the Solr pods. However, I'm not that expert regarding Solr and the Java ecosystem, so if there are any solutions to create memory dumps to help the Solr development team tracing the RAM consumption issues explained above, any ideas are welcome.
> > > >>
> > > >> Many thanks!
> > > >> Florian Schieder