For the systems I have access to, documentCache is disabled and has been for
some time. I don't have any performance comparisons to share, but from what
I've been able to gather, the documentCache was disabled after
performance testing a while ago.

Kevin Risden


On Tue, Feb 6, 2024 at 4:13 PM Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> TL;DR: Some limited testing suggests that documentCache adds overhead w/o
> benefit.  Are there any folks who can report that their use cases perform
> significantly better with documentCache enabled?
>
>
> Background...
>
> I was asked to investigate a Solr 9.1 kubernetes deployment where a subset
> of the Solr pods (hosting only PULL replicas) were configured to scale
> up/down as needed using a kubernetes HPA (based on CPU load).  The idea
> being that, since traffic was largely cyclical over the course of the day,
> they could save some compute cost by letting the system scale down pods when
> load was lower, and scale back up (still using the same Persistent
> Volumes) when load resumed the next day.
>
> For the most part this works fine, but in some cases -- not all -- they
> were seeing that as traffic ramped up, and the HPA (re)started a pod that
> had existing replicas on an existing Persistent Volume, the pod might
> experience several minutes of p99 query response times that were abysmal:
> 5+ seconds, compared to the typical 10-100ms.   Even the p50 response
> times would be in the 5+ second range for several minutes!
>
> When I started reproducing this behavior under controlled load, what I
> found is that when this was happening, thread dumps showed a lot of
> threads blocked trying to "read through" the documentCache via
> CaffeineCache.computeIfAbsent.
>
>
> 100s of threads:
>
> "qtp392918519-19" #19 prio=5 os_prio=0 cpu=2853.50ms elapsed=129.15s tid=0x00007f04316f6cf0 nid=0x5e waiting on condition  [0x00007f03f8e2f000]
>     java.lang.Thread.State: TIMED_WAITING (parking)
>          at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
>          - parking to wait for  <0x00000004b1c6a6a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>          at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/Unknown Source)
>          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.7/Unknown Source)
>          at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(java.base@17.0.7/Unknown Source)
>          at java.util.concurrent.locks.ReentrantLock$Sync.tryLockNanos(java.base@17.0.7/Unknown Source)
>          at java.util.concurrent.locks.ReentrantLock.tryLock(java.base@17.0.7/Unknown Source)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.lock(BoundedLocalCache.java:1510)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1492)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2212)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:2182)
>          at com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:316)
>          at com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:291)
>          at org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:209)
>          at org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
>          at org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
>
> ...seemingly blocked on one thread doing cleanup / eviction:
>
> "qtp392918519-960" #960 prio=5 os_prio=0 cpu=141.89ms elapsed=32.24s tid=0x00007f03f013ec30 nid=0x41f runnable  [0x00007f01a946e000]
>     java.lang.Thread.State: RUNNABLE
>          at com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:66)
>          at com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:30)
>          at com.github.benmanes.caffeine.cache.AbstractLinkedDeque.unlink(AbstractLinkedDeque.java:139)
>          at com.github.benmanes.caffeine.cache.AccessOrderDeque.remove(AccessOrderDeque.java:53)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictFromWindow(BoundedLocalCache.java:694)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntries(BoundedLocalCache.java:671)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1634)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1602)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3620)
>          at org.apache.solr.search.CaffeineCache$$Lambda$656/0x00000008015b2870.execute(Unknown Source)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1575)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1545)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1477)
>          at com.github.benmanes.caffeine.cache.BoundedLocalCache.replace(BoundedLocalCache.java:2485)
>          at com.github.benmanes.caffeine.cache.LocalAsyncCache.lambda$handleCompletion$7(LocalAsyncCache.java:216)
>          at com.github.benmanes.caffeine.cache.LocalAsyncCache$$Lambda$681/0x00000008015df980.accept(Unknown Source)
>          at java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@17.0.7/Unknown Source)
>          at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@17.0.7/Unknown Source)
>          at java.util.concurrent.CompletableFuture.postComplete(java.base@17.0.7/Unknown Source)
>          at java.util.concurrent.CompletableFuture.complete(java.base@17.0.7/Unknown Source)
>          at org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:235)
>          at org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
>          at org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
>
> (NOTE: not all thread dumps I took included a RUNNABLE thread doing cleanup /
> eviction -- it's possible there _was_ a thread doing this cleanup that had
> already returned, and the waiting threads hadn't (all) had time to wake up
> yet.)
>
>
> I don't really understand the Caffeine cache internals (or how they are
> used in solr.CaffeineCache) enough to fully understand *WHY* this is
> happening -- let alone why it seemed to only happen right after pod
> startup -- but based on some googling my two best theories are/were:
>
> 1) A bug and/or pathological behavior in the Caffeine cache (or in
> solr.CaffeineCache) when the cache starts out "empty"
>
>   - I found some links (see [LINKS] below) that seemed to suggest this
>     might be possible, but...
>     * We seem to be following most of the best practices
>        (initialSize==size, using async, etc.)
>     * Since there is no autowarming of documentCache (and this collection
>       had no newSearcher event listener configured), this seems like it
>       should impact newSearcher situations just as much as firstSearcher
>     * But this deployment only ever saw problems on pod startup.
>   - I tried testing some changes to mitigate this:
>     * increases in size && initialSize
>     * explicit firstSearcher warming to pre-fill the cache with *ANY* docs
>       (in case the initialSize was being ignored)
>     ...but it didn't change the outcome
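>
> For reference, the kind of solrconfig.xml changes I was experimenting with
> looked roughly like this (the sizes and the warming query here are
> illustrative placeholders, not the actual deployment values):

```xml
<!-- illustrative sizes, not the actual deployment values -->
<documentCache class="solr.CaffeineCache"
               size="4096"
               initialSize="4096"
               autowarmCount="0"/>

<!-- hypothetical firstSearcher warming query, intended to pre-fill
     the cache with *some* docs on startup -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="rows">100</str></lst>
  </arr>
</listener>
```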
>
> 2) Reading stored fields from a freshly mounted Persistent Volume (ie:
> nothing in the OS filesystem cache) might be slow enough to cause problems
> / long locking in LocalAsyncCache
>
>   - This occurred to me because of some Caffeine cache guidance I saw
>     advising that Callables passed in for Caffeine cache population
>     should be "fast"
>     * I was left wondering if the IO needed to read stored fields
>       from disk was "fast enough"
>     * And it would explain why the problem only seemed to happen
>       on pod startup
>   - But skimming the code in solr.CaffeineCache, I don't see how this would
>     affect us:
>     * the only "blocking" operation should be:
>       - asyncCache.asMap().putIfAbsent(key, future);
>     * actually computing the result of that future shouldn't block anyone
>       except threads waiting on that exact key->future
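>
> To illustrate that last point, here's a stdlib-only sketch of the async
> read-through pattern as I understand it (a deliberate simplification of
> what solr.CaffeineCache layers on Caffeine's AsyncCache -- the class and
> method names here are hypothetical, not the real implementation):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical simplification of an async read-through cache: the cache
// stores futures, so a slow loader only stalls callers of the SAME key.
class AsyncReadThrough<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> map = new ConcurrentHashMap<>();

    V computeIfAbsent(K key, Function<K, V> loader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        // The only cross-key synchronization point is this map insertion;
        // it does not hold any lock while the loader runs.
        CompletableFuture<V> prior = map.putIfAbsent(key, mine);
        if (prior != null) {
            // Someone else is (or was) loading this key: wait on their future.
            return prior.join();
        }
        // We won the race: run the (possibly slow) load outside any lock.
        mine.complete(loader.apply(key));
        return mine.join();
    }
}
```

> i.e. in this model, a slow loader (say, cold stored-field IO) should only
> ever stall threads asking for that same key, not the whole cache -- which
> is why theory 2 didn't seem to explain the pile-up on its own.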
>
> ----
>
> In any case: Completely disabling the documentCache made this problem go
> away.  Which was not a surprise -- if you stop using codepaths with locks,
> you stop seeing a backlog of threads waiting on those locks.
>
> What *WAS* a surprise is that when I disabled the documentCache, there was
> a tiny but consistent improvement in the mean, p50, p75, p95 & p99 Solr
> response times (as measured by the application executing the HTTP request
> to solr and reading the response back over the network).
>
> But this improvement wasn't just in the "obvious" case of the HPA scaling
> test (by eliminating the several minutes of huge outliers).
>
> Even in "steady state" tests, where a fixed number of Solr pods were used,
> and received a consistent, and modest, rate of requests that they could
> handle w/o maxing out CPU, the measured (mean, p50, p75, p95 & p99) Solr
> response times were still consistently a few milliseconds faster when the
> documentCache was disabled than when it was enabled.
>
>
> In short, for this Solr deployment, the documentCache only seemed to have
> downsides:
>
>   - Slightly slower overall response times
>   - Risk of major slowdowns when starting up new pods
>
>
> Hence the question I posed on the Subject of this email: Does
> documentCache still make sense in modern Solr?
>
> Has anyone (else) done any (recent) performance testing of their Solr use
> cases with and w/o documentCache enabled, and found that the documentCache
> adds value?  If not, should we consider changing the default configset to
> mark it enabled="false", and let people enable it if and when they think
> it might be useful?
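>
> Concretely, I'm imagining the default configset would keep the cache
> definition but ship it disabled -- something like this (sizes shown are
> just the sort of values a default might carry, not a tested proposal):

```xml
<!-- sketch of a possible default: defined but disabled -->
<documentCache class="solr.CaffeineCache"
               size="512"
               initialSize="512"
               autowarmCount="0"
               enabled="false"/>
```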
>
>
>
>
> [LINKS] related to write contention issues in Caffeine:
>
>   - https://github.com/ben-manes/caffeine/wiki/Faq#write-contention
>   - https://github.com/ben-manes/caffeine/issues/672
>     - cites Solr as a good example of avoiding these problems
>   - https://github.com/ben-manes/caffeine/issues/768
>   - https://github.com/ben-manes/caffeine/issues/527
>   - https://github.com/ben-manes/caffeine/issues/203
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
