[jira] [Comment Edited] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872983#comment-15872983 ] Ben Manes edited comment on SOLR-10141 at 2/18/17 5:08 AM: --- Thanks!!! I think I found the bug. It now passes your test case. The problem was due to put() stampeding over the value during the eviction. The [eviction routine|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L725] performed the following: # Read the key, value, etc # Conditionally removed in a computeIfPresent() block #* resurrected if a race occurred (e.g. was thought expired, but newly accessed) # Mark the entry as "dead" (using a synchronized (entry) block) # Notify the listener This failed because [putFast|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1521] can perform its update outside of a hash table lock (e.g. a computation). It synchronizes on the entry to update, checking first if it was still alive. This resulted in a race where the entry was removed from the hash table, the value updated, and entry marked as dead. When the listener was notified, it received the wrong value. The solution I have now is to expand the synchronized block on eviction. This passes your test and should be cheap. I'd like to review it a little more and incorporate your test into my suite. This is an excellent find. I've stared at the code many times and the race seems obvious in hindsight. was (Author: ben.manes): Thanks!!! I think I found the bug. It now passes your test case. The problem was due to put() stampeding over the value during the eviction. The [eviction routine|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L725] performed the following: # Read the key, value, etc # Conditionally removed in a computeIfPresent() block - resurrected if a race occurred (e.g. was thought expired, but newly accessed) # Mark the entry as "dead" (using a synchronized (entry) block) # Notify the listener This failed because [putFast|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1521] can perform its update outside of a hash table lock (e.g. a computation). It synchronizes on the entry to update, checking first if it was still alive. This resulted in a race where the entry was removed from the hash table, the value updated, and entry marked as dead. When the listener was notified, it received the wrong value. The solution I have now is to expand the synchronized block on eviction. This passes your test and should be cheap. I'd like to review it a little more and incorporate your test into my suite. This is an excellent find. I've stared at the code many times and the race seems obvious in hindsight. > Caffeine cache causes BlockCache corruption > > > Key: SOLR-10141 > URL: https://issues.apache.org/jira/browse/SOLR-10141 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Yonik Seeley > Attachments: SOLR-10141.patch, Solr10141Test.java > > > After fixing the race conditions in the BlockCache itself (SOLR-10121), the > concurrency test passes with the previous implementation using > ConcurrentLinkedHashMap and fail with Caffeine. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868132#comment-15868132 ] Yonik Seeley edited comment on SOLR-10141 at 2/15/17 4:45 PM: -- Adding a guard in the test code is easy enough (just check if "live" has already been set to false), but that then uncovers an additional problem: a memory leak since size() != (adds-removes) at the end (i.e. the removal listener is not called for all items). It looks like the removal listener is called the correct number of times, but not always with the correct value. My guess is that it's somehow related to concurrent use of equal keys with different values. was (Author: ysee...@gmail.com): Adding a guard in the test code is easy enough (just check if "live" has already been set to false), but that then causes an additional problem: a memory leak since size() != (adds-removes) at the end (i.e. the removal listener is not called for all items). It looks like the removal listener is called the correct number of times, but not always with the correct value. My guess is that it's somehow related to concurrent use of equal keys with different values. > Caffeine cache causes BlockCache corruption > > > Key: SOLR-10141 > URL: https://issues.apache.org/jira/browse/SOLR-10141 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Yonik Seeley > Attachments: SOLR-10141.patch > > > After fixing the race conditions in the BlockCache itself (SOLR-10121), the > concurrency test passes with the previous implementation using > ConcurrentLinkedHashMap and fail with Caffeine. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868099#comment-15868099 ] Yonik Seeley edited comment on SOLR-10141 at 2/15/17 4:07 PM: -- OK, so I finally tracked down the corruption failures with Caffeine to the removal listener being called more than once with the same value. The first time, the underlying block is released and then presumably reused for a different key. The next time (which should never happen), the underlying block is unlocked again and can hence be reused by an additional key and we get into a situation where multiple "live" keys point to the same underlying memory block (and corruption results). I'm going to come up with a simple unit test that directly tests the underlying Caffeine cache the same way we use it. was (Author: ysee...@gmail.com): OK, so I finally tracked down the corruption failures with Caffeine to the removal listener being called more than once with the same value. The first time, the underlying block is released and then presumably reused for a different key. The next time (which should never happen), the underlying block is unlocked again and can hence be reused by an additional key and we get into a situation where multiple "live" keys point to the same underlying memory block (and corruption results). > Caffeine cache causes BlockCache corruption > > > Key: SOLR-10141 > URL: https://issues.apache.org/jira/browse/SOLR-10141 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Yonik Seeley > > After fixing the race conditions in the BlockCache itself (SOLR-10121), the > concurrency test passes with the previous implementation using > ConcurrentLinkedHashMap and fail with Caffeine. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org