[ 
https://issues.apache.org/jira/browse/IGNITE-12255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944589#comment-16944589
 ] 

Ignite TC Bot commented on IGNITE-12255:
----------------------------------------

{panel:title=Branch: [pull/6933/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=4658217&buildTypeId=IgniteTests24Java8_RunAll]

> Cache affinity fetching and calculation on client nodes may be broken in some 
> cases
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-12255
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12255
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5, 2.7
>            Reporter: Pavel Kovalenko
>            Assignee: Pavel Kovalenko
>            Priority: Critical
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have a cluster with server and client nodes.
> We dynamically start several caches on a cluster.
> Periodically we create and destroy some temporary cache in a cluster to move 
> up cluster topology version.
> At the same time, a random client node chooses a random existing cache and 
> performs operations on that cache.
> This leads to an exception on the client node saying that affinity is not 
> initialized for the cache during a cache operation, e.g.:
> Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]
> This exception means that the last affinity for the cache was calculated on 
> version [8,2], which is the cache start version. It happens because, while 
> creating/destroying a temporary cache, we don’t re-calculate affinity for 
> existing caches that have not yet been accessed on client nodes. 
> Re-calculation in this case is cheap - we just copy the affinity assignment 
> and increment the topology version.
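The "cheap" re-calculation described above can be sketched roughly as follows. This is an illustrative model only, not Ignite's internal API; the class and method names (AffinityHistorySketch, recalculateCheap, etc.) are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the affinity history maps a topology version
// ("major:minor") to a per-partition list of owning nodes.
class AffinityHistorySketch {
    private final Map<String, List<List<String>>> history = new HashMap<>();

    // Full calculation: run the affinity function (elided here) and store
    // the resulting assignment for the given topology version.
    void computeInitial(String topVer, List<List<String>> assignment) {
        history.put(topVer, assignment);
    }

    // Cheap re-calculation on a topology event that does not change the
    // cache's assignment: reuse the previous assignment under the new
    // topology version instead of invoking the affinity function again.
    void recalculateCheap(String oldTopVer, String newTopVer) {
        List<List<String>> last = history.get(oldTopVer);
        if (last != null)
            history.put(newTopVer, last);
    }

    List<List<String>> assignment(String topVer) {
        return history.get(topVer);
    }
}
```

If this is done for every topology event, a lookup for version 8:10 would find an entry instead of stopping at head 8:2.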
> As a solution, we need to fetch affinity for all caches when a client node 
> joins. We also need to re-calculate affinity for all affinity holders (not 
> only started caches or only configured caches) on the client node for every 
> topology event that happens in the cluster.
> This solution revealed an existing race between client node join and a 
> concurrent cache destroy.
> The race is the following:
> A client node (with some configured caches) joins the cluster, sending a 
> SingleMessage to the coordinator during the client PME. This SingleMessage 
> contains affinity fetch requests for all cluster caches. While the 
> SingleMessage is in flight, the server nodes finish the client PME and also 
> process and finish a cache destroy PME. When a cache is destroyed, its 
> affinity is cleared. When the SingleMessage is delivered to the coordinator, 
> the coordinator no longer has affinity for a requested cache because that 
> cache has already been destroyed. This leads to an assertion error on the 
> coordinator and unpredictable behavior on the client node.
> The race can be fixed with the following change:
> If the coordinator doesn’t have affinity for a cache requested by the client 
> node, it doesn’t fail the PME with an assertion error; it simply doesn’t send 
> affinity for that cache to the client node. When the client node receives the 
> FullMessage and sees that affinity for some requested cache is missing, it 
> closes the cache proxy for user interactions, which throws a CacheStopped 
> exception on every attempt to use that cache. This is safe behavior because 
> the cache destroy event should arrive on the client node soon and destroy 
> that cache completely.
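The proposed coordinator- and client-side handling can be sketched as below. Again, this is a simplified model under stated assumptions, not Ignite's actual classes; cache IDs, buildFullMessage, and proxiesToClose are illustrative names:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the fix for the join/destroy race.
class RaceFixSketch {
    // Coordinator side: instead of asserting that affinity exists for every
    // requested cache, silently skip caches whose affinity was cleared by a
    // concurrent destroy while the SingleMessage was in flight.
    static Map<Integer, List<String>> buildFullMessage(
            Set<Integer> requestedCacheIds,
            Map<Integer, List<String>> affinityOnCoordinator) {
        Map<Integer, List<String>> reply = new HashMap<>();
        for (int cacheId : requestedCacheIds) {
            List<String> aff = affinityOnCoordinator.get(cacheId);
            if (aff != null)
                reply.put(cacheId, aff);
        }
        return reply;
    }

    // Client side: any requested cache absent from the FullMessage is being
    // destroyed, so its proxy is closed and user operations fail fast.
    static Set<Integer> proxiesToClose(Set<Integer> requestedCacheIds,
                                       Map<Integer, List<String>> fullMessage) {
        Set<Integer> toClose = new HashSet<>(requestedCacheIds);
        toClose.removeAll(fullMessage.keySet());
        return toClose;
    }
}
```

The key design point is on the coordinator: omitting the entry is enough, because the client can distinguish "missing" from "present" and react locally, while the subsequent cache destroy event cleans up the proxy completely.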



--
This message was sent by Atlassian Jira
(v8.3.4#803005)