I’m sure the document cache was helpful years ago at Netflix, because people would use different queries (“batman”, “dark knight”, “the dark knight”) to get to the same movies. Our document cache hit rate was around 75%. Of course, that was with Solr 1.3 and spinning disks, but nothing was virtual; it was all bare metal.
If you are in a mass-market, hit-oriented business, the document cache might pay off. Where I work now, every client has a different need (legal support), so our cache hit rates are very small. It all comes back to the users.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 10, 2024, at 5:29 PM, Jan Høydahl <jan....@cominvent.com> wrote:
>
> I've always told clients that DocumentCache only makes sense if they can achieve a significant cache hit ratio and have slow disks. For most installs I've been involved with, that has not been the case, and we have disabled the cache. Interesting find that it may actually hurt performance, so I look forward to understanding whether there are use cases out there that actually justify it. Guess much has happened since it was invented - SSD, MMapDir, more RAM, etc.
>
> Jan
>
>> 6. feb. 2024 kl. 22:13 skrev Chris Hostetter <hossman_luc...@fucit.org>:
>>
>> TL;DR: Some limited testing suggests that documentCache adds overhead w/o benefit. Are there any folks who can report that their use cases perform significantly better with documentCache enabled?
>>
>> Background...
>>
>> I was asked to investigate a Solr 9.1 kubernetes deployment where a subset of the Solr pods (hosting only PULL replicas) were configured to scale up/down as needed using a kubernetes HPA (based on CPU load). The idea was that, since traffic was largely cyclical over the course of the day, they could save some compute cost by letting the system scale down pods when load was lower, and scale back up (still using the same Persistent Volumes) when load resumed the next day.
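For readers less familiar with the kubernetes side of the setup described above, a CPU-based HPA of the kind mentioned might look roughly like the manifest below. This is purely illustrative; the resource names, replica counts, and utilization threshold are hypothetical, not taken from the actual deployment:

```yaml
# Illustrative only: scale a StatefulSet of Solr PULL-replica pods
# up/down based on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: solr-pull-replicas   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: solr-pull          # hypothetical target
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Because the pods reuse their Persistent Volumes across scale-down/scale-up cycles, a freshly (re)started pod comes back with its index on disk but with a cold OS filesystem cache, which matters for the behavior described next.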
>>
>> For the most part this works fine, but in some cases -- not all -- they were seeing that as traffic ramped up and the HPA (re)started pods that had existing replicas on an existing Persistent Volume, a pod might experience several minutes of p99 query response times that were abysmal: 5+ seconds, compared to the typical 10-100ms. Even the p50 response times would be in the 5+ second range for several minutes!
>>
>> When I started reproducing this behavior under controlled load, what I found is that when this was happening, thread dumps showed a lot of threads blocked trying to "read through" the documentCache via CaffeineCache.computeIfAbsent.
>>
>> 100s of threads:
>>
>> "qtp392918519-19" #19 prio=5 os_prio=0 cpu=2853.50ms elapsed=129.15s tid=0x00007f04316f6cf0 nid=0x5e waiting on condition [0x00007f03f8e2f000]
>>    java.lang.Thread.State: TIMED_WAITING (parking)
>>     at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
>>     - parking to wait for <0x00000004b1c6a6a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>>     at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/Unknown Source)
>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.7/Unknown Source)
>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(java.base@17.0.7/Unknown Source)
>>     at java.util.concurrent.locks.ReentrantLock$Sync.tryLockNanos(java.base@17.0.7/Unknown Source)
>>     at java.util.concurrent.locks.ReentrantLock.tryLock(java.base@17.0.7/Unknown Source)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.lock(BoundedLocalCache.java:1510)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1492)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2212)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:2182)
>>     at com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:316)
>>     at com.github.benmanes.caffeine.cache.LocalAsyncCache$AsyncAsMapView.putIfAbsent(LocalAsyncCache.java:291)
>>     at org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:209)
>>     at org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
>>     at org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
>>
>> ...seemingly blocked on one thread doing cleanup / eviction:
>>
>> "qtp392918519-960" #960 prio=5 os_prio=0 cpu=141.89ms elapsed=32.24s tid=0x00007f03f013ec30 nid=0x41f runnable [0x00007f01a946e000]
>>    java.lang.Thread.State: RUNNABLE
>>     at com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:66)
>>     at com.github.benmanes.caffeine.cache.AccessOrderDeque.setPrevious(AccessOrderDeque.java:30)
>>     at com.github.benmanes.caffeine.cache.AbstractLinkedDeque.unlink(AbstractLinkedDeque.java:139)
>>     at com.github.benmanes.caffeine.cache.AccessOrderDeque.remove(AccessOrderDeque.java:53)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictFromWindow(BoundedLocalCache.java:694)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntries(BoundedLocalCache.java:671)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1634)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1602)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3620)
>>     at org.apache.solr.search.CaffeineCache$$Lambda$656/0x00000008015b2870.execute(Unknown Source)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1575)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1545)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1477)
>>     at com.github.benmanes.caffeine.cache.BoundedLocalCache.replace(BoundedLocalCache.java:2485)
>>     at com.github.benmanes.caffeine.cache.LocalAsyncCache.lambda$handleCompletion$7(LocalAsyncCache.java:216)
>>     at com.github.benmanes.caffeine.cache.LocalAsyncCache$$Lambda$681/0x00000008015df980.accept(Unknown Source)
>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@17.0.7/Unknown Source)
>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@17.0.7/Unknown Source)
>>     at java.util.concurrent.CompletableFuture.postComplete(java.base@17.0.7/Unknown Source)
>>     at java.util.concurrent.CompletableFuture.complete(java.base@17.0.7/Unknown Source)
>>     at org.apache.solr.search.CaffeineCache.computeAsync(CaffeineCache.java:235)
>>     at org.apache.solr.search.CaffeineCache.computeIfAbsent(CaffeineCache.java:250)
>>     at org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:259)
>>
>> (NOTE: not all thread dumps I took included a RUNNABLE thread doing cleanup / eviction -- it's possible there _was_ a thread doing this cleanup that had already returned, and the waiting threads hadn't (all) had time to wake up yet.)
>>
>> I don't really understand the Caffeine cache internals (or how they are used in solr.CaffeineCache) well enough to fully understand *WHY* this is happening -- let alone why it seemed to only happen right after pod startup -- but based on some googling my two best theories are/were:
>>
>> 1) A bug and/or pathological behavior in the Caffeine cache (or in solr.CaffeineCache) when the cache starts out "empty"
>>
>>    - I found some links (see [LINKS] below) that seemed to suggest this might be possible, but...
>>      * We seem to be following most of the best practices (initialSize==size, using async, etc.)
>>      * Since there is no autowarming of the documentCache (and this collection had no newSearcher event listener configured), this seems like it should impact newSearcher situations just as much as firstSearcher
>>      * But this deployment only ever saw problems on pod startup.
>>    - I tried testing some changes to mitigate this:
>>      * increases in size && initialSize
>>      * explicit firstSearcher warming to pre-fill the cache with *ANY* docs (in case the initialSize was being ignored)
>>      ...but it didn't change the outcome
>>
>> 2) Reading stored fields from a freshly mounted Persistent Volume (ie: nothing in the OS filesystem cache) might be slow enough to cause problems / long locking in LocalAsyncCache
>>
>>    - This occurred to me because of some Caffeine cache guidance I saw advising that the Callables passed in for cache population should be "fast"
>>      * I was left wondering if the IO needed to read stored fields from disk was "fast enough"
>>      * And it would explain why the problem only seemed to happen on pod startup
>>    - But skimming the code in solr.CaffeineCache, I don't see how this would affect us:
>>      * the only "blocking" operation should be:
>>        - asyncCache.asMap().putIfAbsent(key, future);
>>      * actually computing the result of that future shouldn't block anyone except threads waiting on that exact key->future
>>
>> ----
>>
>> In any case: completely disabling the documentCache made this problem go away. Which was not a surprise -- if you stop using codepaths with locks, you'll stop seeing a backlog of threads waiting on those locks.
>>
>> What *WAS* a surprise is that when I disabled the documentCache, there was a tiny but consistent improvement in the mean, p50, p75, p95 & p99 Solr response times (as measured by the application executing the HTTP request to Solr and reading the response back over the network).
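The futures-based read-through pattern discussed under theory #2 above (publish an incomplete future via putIfAbsent, then compute the value outside any lock) can be sketched with plain JDK types. This is a simplified stand-in, not Solr's actual CaffeineCache code; the class and method names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal async read-through cache sketch. The map insertion is the only
// shared-state operation; a slow loader only delays callers asking for
// the SAME key, because they all wait on that key's future.
public class AsyncReadThrough<K, V> {
  private final Map<K, CompletableFuture<V>> map = new ConcurrentHashMap<>();

  public V computeIfAbsent(K key, Function<K, V> loader) {
    CompletableFuture<V> future = new CompletableFuture<>();
    // Publish our (still incomplete) future; if another thread got there
    // first, wait on its result instead of loading again.
    CompletableFuture<V> existing = map.putIfAbsent(key, future);
    if (existing != null) {
      return existing.join(); // blocks only waiters on this exact key
    }
    try {
      V value = loader.apply(key); // potentially slow IO, outside any lock
      future.complete(value);
      return value;
    } catch (RuntimeException e) {
      map.remove(key, future); // don't cache failures
      future.completeExceptionally(e);
      throw e;
    }
  }

  public static void main(String[] args) {
    AsyncReadThrough<String, String> cache = new AsyncReadThrough<>();
    // Second call reuses the first call's future; the loader is not re-run.
    System.out.println(cache.computeIfAbsent("doc1", k -> k + ":loaded"));
    System.out.println(cache.computeIfAbsent("doc1", k -> k + ":reloaded"));
    // prints "doc1:loaded" twice
  }
}
```

Under this model, a cold Persistent Volume making stored-field reads slow should only pile up threads per key, which is why theory #2 alone doesn't explain the cross-key backlog; the observed contention was instead on the lock inside Caffeine's afterWrite/maintenance path that every cache write touches.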
>>
>> But this improvement wasn't just in the "obvious" case of the HPA scaling test (by eliminating the several minutes of huge outliers).
>>
>> Even in "steady state" tests, where a fixed number of Solr pods were used and received a consistent, and modest, rate of requests that they could handle w/o maxing out CPU, the measured (mean, p50, p75, p95 & p99) Solr response times were still consistently a few milliseconds faster when the documentCache was disabled than when it was enabled.
>>
>> In short, for this Solr deployment, the documentCache only seemed to have downsides:
>>
>> - Slightly slower overall response times
>> - Risk of major slowdowns when starting up new pods
>>
>> Hence the question I posed in the Subject of this email: Does documentCache still make sense in modern Solr?
>>
>> Has anyone (else) done any (recent) performance testing of their Solr use cases with and w/o documentCache enabled, and found that the documentCache adds value? If not, should we consider changing the default configset to mark it enabled="false", and let people enable it if and when they think it might be useful?
>>
>> [LINKS] related to write contention issues in Caffeine:
>>
>> - https://github.com/ben-manes/caffeine/wiki/Faq#write-contention
>> - https://github.com/ben-manes/caffeine/issues/672
>>   - cites Solr as a good example of avoiding these problems
>> - https://github.com/ben-manes/caffeine/issues/768
>> - https://github.com/ben-manes/caffeine/issues/527
>> - https://github.com/ben-manes/caffeine/issues/203
>>
>> -Hoss
>> http://www.lucidworks.com/
>
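As a concrete sketch of the default-configset change proposed in the thread, the documentCache entry in solrconfig.xml could ship with enabled="false". The attribute values below are illustrative (based on the familiar defaults), not a committed change:

```xml
<!-- Disabled by default; users who measure a real hit-rate benefit
     can flip enabled="true" (or remove the attribute). -->
<documentCache class="solr.CaffeineCache"
               size="512"
               initialSize="512"
               autowarmCount="0"
               enabled="false"/>
```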