[jira] [Comment Edited] (SOLR-10141) Caffeine cache causes BlockCache corruption

2017-02-17 Thread Ben Manes (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872983#comment-15872983
 ] 

Ben Manes edited comment on SOLR-10141 at 2/18/17 5:08 AM:
---

Thanks!!! I think I found the bug. It now passes your test case.

The problem was due to put() stampeding over the value during the eviction. The 
[eviction 
routine|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L725]
 performed the following:
# Read the key, value, etc
# Conditionally removed in a computeIfPresent() block
#* resurrected if a race occurred (e.g. was thought expired, but newly accessed)
# Mark the entry as "dead" (using a synchronized (entry) block)
# Notify the listener

This failed because 
[putFast|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1521]
 can perform its update outside of a hash table lock (e.g. a computation). It 
synchronizes on the entry to update, checking first if it was still alive. This 
resulted in a race where the entry was removed from the hash table, the value 
updated, and entry marked as dead. When the listener was notified, it received 
the wrong value.

The solution I have now is to expand the synchronized block on eviction. This 
passes your test and should be cheap. I'd like to review it a little more and 
incorporate your test into my suite.

This is an excellent find. I've stared at the code many times and the race 
seems obvious in hindsight.


was (Author: ben.manes):
Thanks!!! I think I found the bug. It now passes your test case.

The problem was due to put() stampeding over the value during the eviction. The 
[eviction 
routine|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L725]
 performed the following:
# Read the key, value, etc
# Conditionally removed in a computeIfPresent() block
   - resurrected if a race occurred (e.g. was thought expired, but newly 
accessed)
# Mark the entry as "dead" (using a synchronized (entry) block)
# Notify the listener

This failed because 
[putFast|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1521]
 can perform its update outside of a hash table lock (e.g. a computation). It 
synchronizes on the entry to update, checking first if it was still alive. This 
resulted in a race where the entry was removed from the hash table, the value 
updated, and entry marked as dead. When the listener was notified, it received 
the wrong value.

The solution I have now is to expand the synchronized block on eviction. This 
passes your test and should be cheap. I'd like to review it a little more and 
incorporate your test into my suite.

This is an excellent find. I've stared at the code many times and the race 
seems obvious in hindsight.

> Caffeine cache causes BlockCache corruption 
> 
>
> Key: SOLR-10141
> URL: https://issues.apache.org/jira/browse/SOLR-10141
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Yonik Seeley
> Attachments: SOLR-10141.patch, Solr10141Test.java
>
>
> After fixing the race conditions in the BlockCache itself (SOLR-10121), the 
> concurrency test passes with the previous implementation using 
> ConcurrentLinkedHashMap and fail with Caffeine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-10141) Caffeine cache causes BlockCache corruption

2017-02-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868132#comment-15868132
 ] 

Yonik Seeley edited comment on SOLR-10141 at 2/15/17 4:45 PM:
--

Adding a guard in the test code is easy enough (just check if "live" has 
already been set to false), but that then uncovers an additional problem: a 
memory leak since size() != (adds-removes) at the end (i.e. the removal 
listener is not called for all items).

It looks like the removal listener is called the correct number of times, but 
not always with the correct value.  My guess is that it's somehow related to 
concurrent use of equal keys with different values.


was (Author: ysee...@gmail.com):
Adding a guard in the test code is easy enough (just check if "live" has 
already been set to false), but that then causes an additional problem: a 
memory leak since size() != (adds-removes) at the end (i.e. the removal 
listener is not called for all items).

It looks like the removal listener is called the correct number of times, but 
not always with the correct value.  My guess is that it's somehow related to 
concurrent use of equal keys with different values.

> Caffeine cache causes BlockCache corruption 
> 
>
> Key: SOLR-10141
> URL: https://issues.apache.org/jira/browse/SOLR-10141
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Yonik Seeley
> Attachments: SOLR-10141.patch
>
>
> After fixing the race conditions in the BlockCache itself (SOLR-10121), the 
> concurrency test passes with the previous implementation using 
> ConcurrentLinkedHashMap and fail with Caffeine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-10141) Caffeine cache causes BlockCache corruption

2017-02-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868099#comment-15868099
 ] 

Yonik Seeley edited comment on SOLR-10141 at 2/15/17 4:07 PM:
--

OK, so I finally tracked down the corruption failures with Caffeine to the 
removal listener being called more than once with the same value.
The first time, the underlying block is released and then presumably reused for 
a different key.  The next time (which should never happen), the underlying 
block is unlocked again and can hence be reused by an additional key and we get 
into a situation where multiple "live" keys point to the same underlying memory 
block (and corruption results).

I'm going to come up with a simple unit test that directly tests the underlying 
Caffeine cache the same way we use it.


was (Author: ysee...@gmail.com):
OK, so I finally tracked down the corruption failures with Caffeine to the 
removal listener being called more than once with the same value.
The first time, the underlying block is released and then presumably reused for 
a different key.  The next time (which should never happen), the underlying 
block is unlocked again and can hence be reused by an additional key and we get 
into a situation where multiple "live" keys point to the same underlying memory 
block (and corruption results).

> Caffeine cache causes BlockCache corruption 
> 
>
> Key: SOLR-10141
> URL: https://issues.apache.org/jira/browse/SOLR-10141
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Yonik Seeley
>
> After fixing the race conditions in the BlockCache itself (SOLR-10121), the 
> concurrency test passes with the previous implementation using 
> ConcurrentLinkedHashMap and fail with Caffeine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org