For systems I have access to, documentCache is disabled and has been for some time. I don't have any performance comparisons to share, but from what I've been able to gather, the documentCache was disabled as a result of performance testing a while ago.
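For reference, "disabled" here just means the stock documentCache entry in solrconfig.xml is turned off. A sketch, assuming the default sizes from the _default configset:

    <!-- documentCache turned off: remove the element or set enabled="false" -->
    <documentCache class="solr.CaffeineCache"
                   size="512"
                   initialSize="512"
                   autowarmCount="0"
                   enabled="false"/>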
Kevin Risden

On Tue, Feb 6, 2024 at 4:13 PM Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> TL;DR: Some limited testing suggests that documentCache adds overhead w/o
> benefit. Are there any folks who can report that their use cases perform
> significantly better with documentCache enabled?
>
>
> Background...
>
> I was asked to investigate a Solr 9.1 kubernetes deployment where a subset
> of the Solr pods (hosting only PULL replicas) were configured to scale
> up/down as needed using a kubernetes HPA (based on CPU load). The idea
> being that, as traffic was largely cyclical over the course of the day, they
> could save some compute cost by letting the system scale down pods when
> load was lower, and scale back up (still using the same Persistent
> Volumes) when load resumed the next day.
>
> For the most part this works fine, but in some cases -- not all -- they
> were seeing that as traffic ramped up, and the HPA (re)started a pod that
> had existing replicas on an existing Persistent Volume, the pod might
> experience several minutes of p99 query response times that were abysmal:
> 5+ seconds, compared to the typical 10-100ms. Even the p50 response
> times would be in the 5+ second range for several minutes!
>
> When I started reproducing this behavior under controlled load, what I
> found is that when this was happening, thread dumps showed a lot of
> threads blocked trying to "read through" the documentCache via
> CaffeineCache.computeIfAbsent.
>
>
> 100s of threads:
>
> "qtp392918519-19" #19 prio=5 os_prio=0 cpu=2853.50ms elapsed=129.15s tid=0x00007f04316f6cf0 nid=0x5e waiting on condition [0x00007f03f8e2f000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
>     - parking to wait for <0x00000004b1c6a6a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>     at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/Unknown Source)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.7/Unknown Source)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(java.base@17.0.7/Unknown Source)
>     at java.util.concurrent.locks.ReentrantLock$Sync.tryLockNanos(java.base@17.0.7/Unknown Source)
>     at java.util.concurrent.locks.ReentrantLock.tryLock(java.base@17.0.7/Unknown Source)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.lock(BoundedLocalCache.java:1510)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1492)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2212)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:2182)
>     at com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:316)
>     at com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:291)
>     at org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:209)
>     at org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
>     at org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
>
> ...seemingly blocked on one thread doing cleanup / eviction:
>
> "qtp392918519-960" #960 prio=5 os_prio=0 cpu=141.89ms elapsed=32.24s tid=0x00007f03f013ec30 nid=0x41f runnable [0x00007f01a946e000]
>    java.lang.Thread.State: RUNNABLE
>     at com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:66)
>     at com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:30)
>     at com.github.benmanes.caffeine.cache.AbstractLinkedDeque.unlink(AbstractLinkedDeque.java:139)
>     at com.github.benmanes.caffeine.cache.AccessOrderDeque.remove(AccessOrderDeque.java:53)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictFromWindow(BoundedLocalCache.java:694)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntries(BoundedLocalCache.java:671)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1634)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1602)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3620)
>     at org.apache.solr.search.CaffeineCache$$Lambda$656/0x00000008015b2870.execute(Unknown Source)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1575)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1545)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1477)
>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.replace(BoundedLocalCache.java:2485)
>     at com.github.benmanes.caffeine.cache.LocalAsyncCache.lambda$handleCompletion$7(LocalAsyncCache.java:216)
>     at com.github.benmanes.caffeine.cache.LocalAsyncCache$$Lambda$681/0x00000008015df980.accept(Unknown Source)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@17.0.7/Unknown Source)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@17.0.7/Unknown Source)
>     at java.util.concurrent.CompletableFuture.postComplete(java.base@17.0.7/Unknown Source)
>     at java.util.concurrent.CompletableFuture.complete(java.base@17.0.7/Unknown Source)
>     at org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:235)
>     at org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
>     at org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
>
> (NOTE: not all thread dumps I took included a RUNNABLE thread doing cleanup /
> eviction -- it's possible there _was_ a thread doing this cleanup that had
> already returned, and the waiting threads hadn't (all) had time to wake up
> yet.)
>
>
> I don't really understand the Caffeine cache internals (or how they are
> used in solr.CaffeineCache) well enough to fully understand *WHY* this is
> happening -- let alone why it seemed to only happen right after pod
> startup -- but based on some googling my two best theories are/were:
>
> 1) A bug and/or pathological behavior in the Caffeine cache (or in
>    solr.CaffeineCache) when the cache starts out "empty"
>
>    - I found some links (see [LINKS] below) that seemed to suggest this
>      might be possible, but...
>      * We seem to be following most of the best practices
>        (initialSize==size, using async, etc.)
>      * Since there is no autowarming of the documentCache (and this
>        collection had no newSearcher event listener configured), that seems
>        like it should impact newSearcher situations just as much as
>        firstSearcher
>      * But this deployment only ever saw problems on pod startup.
>    - I tried testing some changes to mitigate this:
>      * increases in size && initialSize
>      * explicit firstSearcher warming to pre-fill the cache with *ANY* docs
>        (in case the initialSize was being ignored -- see the sketch just
>        below this list)
>      ...but it didn't change the outcome
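For anyone who wants to try the same mitigation: a sketch of that kind of explicit firstSearcher warming, using the stock QuerySenderListener syntax in solrconfig.xml. The query, fl, and rows values here are illustrative, not the exact ones used in the test above:

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <!-- fetch stored fields for a batch of docs so the documentCache
             starts out non-empty -->
        <lst>
          <str name="q">*:*</str>
          <str name="fl">*</str>
          <str name="rows">100</str>
        </lst>
      </arr>
    </listener>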
>
> 2) Reading stored fields from a freshly mounted Persistent Volume (ie:
>    nothing in the OS filesystem cache) might be slow enough to cause
>    problems / long locking in LocalAsyncCache
>
>    - This occurred to me because of some Caffeine cache guidance I saw
>      advising that Callables passed in for Caffeine cache population
>      should be "fast"
>      * I was left wondering if the IO needed to read stored fields
>        from disk was "fast enough"
>      * And it would explain why the problem only seemed to happen
>        on pod startup
>    - But skimming the code in solr.CaffeineCache, I don't see how this
>      would affect us (a sketch of the pattern follows at the end of this
>      message):
>      * the only "blocking" operation should be:
>        - asyncCache.asMap().putIfAbsent(key, future);
>      * actually computing the result of that future shouldn't block anyone
>        except threads waiting on that exact key->future
>
> ----
>
> In any case: Completely disabling the documentCache made this problem go
> away. Which was not a surprise -- if you stop using codepaths with locks,
> you'll stop seeing a backlog of threads waiting on those locks.
>
> What *WAS* a surprise is that when I disabled the documentCache, there was
> a tiny but consistent improvement in the mean, p50, p75, p95 & p99 Solr
> response times (as measured by the application executing the HTTP request
> to Solr and reading the response back over the network).
>
> But this improvement wasn't just in the "obvious" case of the HPA scaling
> test (by eliminating the several minutes of huge outliers).
>
> Even in "steady state" tests, where a fixed number of Solr pods were used,
> and received a consistent, and modest, rate of requests that they could
> handle w/o maxing out CPU, the measured (mean, p50, p75, p95 & p99) Solr
> response times were still consistently a few milliseconds faster when the
> documentCache was disabled than they were when the documentCache was
> enabled.
>
>
> In short, for this Solr deployment, the documentCache only seemed to have
> downsides:
>
>    - Slightly slower overall response times
>    - Risk of major slowdowns when starting up new pods
>
>
> Hence the question I posed in the Subject of this email: Does
> documentCache still make sense in modern Solr?
>
> Has anyone (else) done any (recent) performance testing of their Solr use
> cases with and w/o documentCache enabled, and found that the documentCache
> adds value? If not, should we consider changing the default configset to
> mark it enabled="false", and let people enable it if and when they think
> it might be useful?
>
>
> [LINKS] related to write contention issues in Caffeine:
>
>  - https://github.com/ben-manes/caffeine/wiki/Faq#write-contention
>  - https://github.com/ben-manes/caffeine/issues/672
>    - cites Solr as a good example of avoiding these problems
>  - https://github.com/ben-manes/caffeine/issues/768
>  - https://github.com/ben-manes/caffeine/issues/527
>  - https://github.com/ben-manes/caffeine/issues/203
>
>
> -Hoss
> http://www.lucidworks.com/
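A footnote on theory 2) above: here is a minimal, self-contained sketch of the async read-through pattern being described (NOT Solr's actual CaffeineCache code -- the class name, key/value types, and sizes are illustrative). It shows why a slow loader should only block callers asking for the same key, and marks the one write that, per the stack traces above, can contend on Caffeine's eviction lock:

    import com.github.benmanes.caffeine.cache.AsyncCache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import java.util.concurrent.CompletableFuture;
    import java.util.function.Function;

    // Minimal sketch of an async read-through cache -- not Solr's actual
    // CaffeineCache implementation.
    public class ReadThroughSketch {

      private final AsyncCache<Integer, byte[]> docCache =
          Caffeine.newBuilder()
              .initialCapacity(512) // initialSize == size, per the cited best practice
              .maximumSize(512)
              .buildAsync();

      public byte[] doc(int docId, Function<Integer, byte[]> loader) {
        CompletableFuture<byte[]> future = new CompletableFuture<>();

        // The only shared-state write on the read path: publish an incomplete
        // future. Per the stack traces above, Caffeine's afterWrite() hook runs
        // on this path and may take (or wait on) the eviction lock.
        CompletableFuture<byte[]> existing = docCache.asMap().putIfAbsent(docId, future);
        if (existing != null) {
          // Another thread is already loading this docId: wait on *its* future.
          return existing.join();
        }

        try {
          // The (possibly slow, cold-disk) IO happens outside any cache lock,
          // so it should only ever block waiters on this exact key.
          future.complete(loader.apply(docId));
        } catch (RuntimeException e) {
          future.completeExceptionally(e);
          docCache.asMap().remove(docId, future); // don't cache the failure
          throw e;
        }
        return future.join();
      }
    }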