[jira] [Comment Edited] (IGNITE-12451) Introduce deadlock detection for cache entry reentrant locks

2020-09-28 Thread Mirza Aliev (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203137#comment-17203137
 ] 

Mirza Aliev edited comment on IGNITE-12451 at 9/28/20, 10:21 AM:
-

[~alex_pl] Thank you for the detailed explanation! I tend to agree that this 
solution is better. 

Also, I've discussed the solution with [~agoncharuk] and we agreed that this 
solution definitely should be benchmarked as far as atomic cache operations are 
lightweight, and even additional adding to HashMap might bring some degradation 
on huge batch sizes.

 

Also, note that _removeAll(Set m)_ operation is also affected 


was (Author: maliev):
[~alex_pl] Thank you for the detailed explanation! I tend to agree that this 
solution is better. 

Also, I've discussed the solution with [~agoncharuk] and we agreed that this 
solution definitely should be benchmarked as far as atomic cache operation is 
lightweight, and even additional adding to HashMap might bring some degradation 
on huge batch sizes.

 

Also, note that _removeAll(Set m)_ operation is also affected 

> Introduce deadlock detection for cache entry reentrant locks
> 
>
> Key: IGNITE-12451
> URL: https://issues.apache.org/jira/browse/IGNITE-12451
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7.6
>Reporter: Ivan Rakov
>Assignee: Mirza Aliev
>Priority: Major
> Fix For: 2.10
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level 
> deadlock in case of careless usage of JCache mass operations (putAll, 
> removeAll):
> 1. If two different user threads will perform putAll on the same two keys in 
> reverse order (primary node for which is the same), there's a chance that 
> sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be 
> passed as argument for putAll. Even if user threads have called mass 
> operations with two keys in the same order, HashMap iteration order is not 
> strictly defined, which may cause the same deadlock. 
> Local deadlock detection should mitigate this issue. We can create a wrapper 
> for ReentrantLock with logic that performs cycle detection in wait-for graph 
> in case we are waiting for lock acquisition for too long. Exception will be 
> thrown from one of the threads in such case, failing user operation, but 
> letting the system make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12451) Introduce deadlock detection for cache entry reentrant locks

2020-09-22 Thread Mirza Aliev (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200206#comment-17200206
 ] 

Mirza Aliev edited comment on IGNITE-12451 at 9/22/20, 4:42 PM:


[~alex_pl]

> Since only younger threads retry all locks

Could you please explain in more detail this phrase? As far as I can 
understand, only _older_ threads retry all locks, because the timeout has 
passed for old threads, while young threads still try to lock entry because 
their timeout has not happened. This means that there is a chance that the 
older thread will be deadlocked again with a new putAll thread that comes at 
the moment when the old thread started to take locks for its entries, and so 
on, this means that the older thread won't do any progress. 


was (Author: maliev):
> Since only younger threads retry all locks

Could you please explain in more detail this phrase? As far as I can 
understand, only _older_ threads retry all locks, because the timeout has 
passed for old threads, while young threads still try to lock entry because 
their timeout has not happened. This means that there is a chance that the 
older thread will be deadlocked again with a new putAll thread that comes at 
the moment when the old thread started to take locks for its entries, and so 
on, this means that the older thread won't do any progress. 

> Introduce deadlock detection for cache entry reentrant locks
> 
>
> Key: IGNITE-12451
> URL: https://issues.apache.org/jira/browse/IGNITE-12451
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7.6
>Reporter: Ivan Rakov
>Assignee: Mirza Aliev
>Priority: Major
> Fix For: 2.10
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level 
> deadlock in case of careless usage of JCache mass operations (putAll, 
> removeAll):
> 1. If two different user threads will perform putAll on the same two keys in 
> reverse order (primary node for which is the same), there's a chance that 
> sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be 
> passed as argument for putAll. Even if user threads have called mass 
> operations with two keys in the same order, HashMap iteration order is not 
> strictly defined, which may cause the same deadlock. 
> Local deadlock detection should mitigate this issue. We can create a wrapper 
> for ReentrantLock with logic that performs cycle detection in wait-for graph 
> in case we are waiting for lock acquisition for too long. Exception will be 
> thrown from one of the threads in such case, failing user operation, but 
> letting the system make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12451) Introduce deadlock detection for cache entry reentrant locks

2020-09-22 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200168#comment-17200168
 ] 

Aleksey Plekhanov edited comment on IGNITE-12451 at 9/22/20, 3:58 PM:
--

[~maliev], {{hasLockCollisions}} gives sometimes false-positive results, but I 
think it's not very often. If we already waited some time for lock most likely 
this thread already in deadlock and most likely thread that holds the lock we 
are trying to acquire blocking us. If not, we just retry and nothing bad 
happens, user operation is not affected by this retry.

Fair cycles searching is more resource consuming (affects performance and 
footprint). This solution doesn't add any footprint and has no performance 
impact, but I see no advantages of fair cycles searching here. 

This algorithm does garantee that system won't hang. Since only younger threads 
retry all locks, older threads don't retry and will always advance in progress 
(there will be no infinite retries).   

Throwing exception to user leads to bad user experience. If we can perform this 
operation without exception, why should we throw it? 



was (Author: alex_pl):
{{hasLockCollisions}} gives sometimes false-positive results, but I think it's 
not very often. If we already waited some time for lock most likely this thread 
already in deadlock and most likely thread that holds the lock we are trying to 
acquire blocking us. If not, we just retry and nothing bad happens, user 
operation is not affected by this retry.

Fair cycles searching is more resource consuming (affects performance and 
footprint). This solution doesn't add any footprint and has no performance 
impact, but I see no advantages of fair cycles searching here. 

This algorithm does garantee that system won't hang. Since only younger threads 
retry all locks, older threads don't retry and will always advance in progress 
(there will be no infinite retries).   

Throwing exception to user leads to bad user experience. If we can perform this 
operation without exception, why should we throw it? 


> Introduce deadlock detection for cache entry reentrant locks
> 
>
> Key: IGNITE-12451
> URL: https://issues.apache.org/jira/browse/IGNITE-12451
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7.6
>Reporter: Ivan Rakov
>Assignee: Mirza Aliev
>Priority: Major
> Fix For: 2.10
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level 
> deadlock in case of careless usage of JCache mass operations (putAll, 
> removeAll):
> 1. If two different user threads will perform putAll on the same two keys in 
> reverse order (primary node for which is the same), there's a chance that 
> sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be 
> passed as argument for putAll. Even if user threads have called mass 
> operations with two keys in the same order, HashMap iteration order is not 
> strictly defined, which may cause the same deadlock. 
> Local deadlock detection should mitigate this issue. We can create a wrapper 
> for ReentrantLock with logic that performs cycle detection in wait-for graph 
> in case we are waiting for lock acquisition for too long. Exception will be 
> thrown from one of the threads in such case, failing user operation, but 
> letting the system make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12451) Introduce deadlock detection for cache entry reentrant locks

2020-09-22 Thread Mirza Aliev (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200154#comment-17200154
 ] 

Mirza Aliev edited comment on IGNITE-12451 at 9/22/20, 3:30 PM:


I see that your solution is based on a _tryLock_ idea, we also came to that 
because real-time locks graph analyzing brought a significant performance drop. 

I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot 
of false-positive results, this method does not search for cycles in locks 
acquisition order. Also, it does not guarantee that system won't hang because 
there is a chance that the retry-lock-acquisition loop also will lead to 
deadlocks. 

So, I believe that the right way to handle deadlock and let the system continue 
to work is the following: we need to throw an exception when _tryLock_ timeouts 
and a user will see _CachePartialUpdateCheckedException_. This solution is 
almost developed in my PR and will be ready in a few days.

 

[~alex_pl] what do you think? 


was (Author: maliev):
I see that your solution is based on a _tryLock_ idea, we also came to that 
because real-time locks graph analyzing brought a significant performance drop. 

I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot 
of false-positive results, this method does not search for cycles in locks 
acquisition order. Also, it does not guarantee that system won't hang because 
there is a chance that the retry-lock-acquisition loop also will lead to 
deadlocks. 

So, I believe that the right way to handle deadlock and let the system continue 
to work is the following: we need to throw an exception when _tryLock_ timeouts 
and a user will see _CachePartialUpdateCheckedException_. This solution is 
almost developed in my PR and will be ready in a few days.

 

[~alex_pl] what do you think? 

 

 

 

> Introduce deadlock detection for cache entry reentrant locks
> 
>
> Key: IGNITE-12451
> URL: https://issues.apache.org/jira/browse/IGNITE-12451
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7.6
>Reporter: Ivan Rakov
>Assignee: Mirza Aliev
>Priority: Major
> Fix For: 2.10
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level 
> deadlock in case of careless usage of JCache mass operations (putAll, 
> removeAll):
> 1. If two different user threads will perform putAll on the same two keys in 
> reverse order (primary node for which is the same), there's a chance that 
> sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be 
> passed as argument for putAll. Even if user threads have called mass 
> operations with two keys in the same order, HashMap iteration order is not 
> strictly defined, which may cause the same deadlock. 
> Local deadlock detection should mitigate this issue. We can create a wrapper 
> for ReentrantLock with logic that performs cycle detection in wait-for graph 
> in case we are waiting for lock acquisition for too long. Exception will be 
> thrown from one of the threads in such case, failing user operation, but 
> letting the system make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12451) Introduce deadlock detection for cache entry reentrant locks

2020-09-22 Thread Mirza Aliev (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200154#comment-17200154
 ] 

Mirza Aliev edited comment on IGNITE-12451 at 9/22/20, 3:29 PM:


I see that your solution is based on a _tryLock_ idea, we also came to that 
because real-time locks graph analyzing brought a significant performance drop. 

I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot 
of false-positive results, this method does not search for cycles in locks 
acquisition order. Also, it does not guarantee that system won't hang because 
there is a chance that the retry-lock-acquisition loop also will lead to 
deadlocks. 

So, I believe that the right way to handle deadlock and let the system continue 
to work is the following: we need to throw an exception when _tryLock_ timeouts 
and a user will see _CachePartialUpdateCheckedException_. This solution is 
almost developed in my PR and will be ready in a few days.

 

[~alex_pl] what do you think? 

 

 

 


was (Author: maliev):
I see that your solution is based on a _tryLock_ idea, we also came to that 
because real-time locks graph analyzing brought a significant performance drop. 

I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot 
of false-positive results, this method does not search for cycles in locks 
acquisition order. Also, it does not guarantee that system won't hang because 
there is a chance that the retry-lock-acquisition loop also will lead to 
deadlocks. 

So, I believe that the right way to handle deadlock and let the system continue 
to work is the following: we need to throw an exception when _tryLock_ timeouts 
and a user will see _CachePartialUpdateCheckedException_. __ This solution is 
almost developed in my PR and will be ready in a few days.

 

[~alex_pl] what do you think? 

 

 

 

> Introduce deadlock detection for cache entry reentrant locks
> 
>
> Key: IGNITE-12451
> URL: https://issues.apache.org/jira/browse/IGNITE-12451
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7.6
>Reporter: Ivan Rakov
>Assignee: Mirza Aliev
>Priority: Major
> Fix For: 2.10
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level 
> deadlock in case of careless usage of JCache mass operations (putAll, 
> removeAll):
> 1. If two different user threads will perform putAll on the same two keys in 
> reverse order (primary node for which is the same), there's a chance that 
> sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be 
> passed as argument for putAll. Even if user threads have called mass 
> operations with two keys in the same order, HashMap iteration order is not 
> strictly defined, which may cause the same deadlock. 
> Local deadlock detection should mitigate this issue. We can create a wrapper 
> for ReentrantLock with logic that performs cycle detection in wait-for graph 
> in case we are waiting for lock acquisition for too long. Exception will be 
> thrown from one of the threads in such case, failing user operation, but 
> letting the system make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-12451) Introduce deadlock detection for cache entry reentrant locks

2019-12-16 Thread Mirza Aliev (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997239#comment-16997239
 ] 

Mirza Aliev edited comment on IGNITE-12451 at 12/16/19 12:33 PM:
-

[~Pavlukhin] I'm assuming that this note from javadoc of 
{{java.lang.management.ThreadMXBean#findDeadlockedThreads}} might be the reason 
to not use such approach.
{noformat}
This method is designed for troubleshooting use, but not for synchronization 
control. It might be an expensive operation.{noformat}
What do you think [~Pavlukhin] [~ivan.glukos]?


was (Author: maliev):
[~Pavlukhin] I'm assuming that these note from javadoc of 
{{java.lang.management.ThreadMXBean#findDeadlockedThreads}} might be the reason 
to not use such approach.
{noformat}
This method is designed for troubleshooting use, but not for synchronization 
control. It might be an expensive operation.{noformat}
What do you think [~Pavlukhin] [~ivan.glukos]?

> Introduce deadlock detection for cache entry reentrant locks
> 
>
> Key: IGNITE-12451
> URL: https://issues.apache.org/jira/browse/IGNITE-12451
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 2.7.6
>Reporter: Ivan Rakov
>Priority: Major
> Fix For: 2.9
>
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level 
> deadlock in case of careless usage of JCache mass operations (putAll, 
> removeAll):
> 1. If two different user threads will perform putAll on the same two keys in 
> reverse order (primary node for which is the same), there's a chance that 
> sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be 
> passed as argument for putAll. Even if user threads have called mass 
> operations with two keys in the same order, HashMap iteration order is not 
> strictly defined, which may cause the same deadlock. 
> Local deadlock detection should mitigate this issue. We can create a wrapper 
> for ReentrantLock with logic that performs cycle detection in wait-for graph 
> in case we are waiting for lock acquisition for too long. Exception will be 
> thrown from one of the threads in such case, failing user operation, but 
> letting the system make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)