[ https://issues.apache.org/jira/browse/IGNITE-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandr Fedotov updated IGNITE-6256:
-------------------------------------
    Description: 
The assert is as follows:

exception="java.lang.AssertionError: null
 at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.removeNode(GridDhtPartitionTopologyImpl.java:1422)
 at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:490)
 at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:769)
 at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:504)
 at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1689)
 at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
 at java.lang.Thread.run(Thread.java:745)
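
Judging from the trace and the steps below, the assert fires because DiscoCache#oldestAliveServerNode returns null. A minimal, self-contained model of that check (the class and method bodies below are illustrative stand-ins, not the actual Ignite source) looks roughly like this:

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RemoveNodeAssertSketch {
    /** Stand-in for DiscoCache#oldestAliveServerNode: returns null when no server node is alive. */
    static Long oldestAliveServerNode(List<Long> aliveServerNodeOrders) {
        return aliveServerNodeOrders.stream().min(Comparator.naturalOrder()).orElse(null);
    }

    /** Stand-in for the relevant part of GridDhtPartitionTopologyImpl#removeNode. */
    static void removeNode(List<Long> aliveServerNodeOrders) {
        Long oldest = oldestAliveServerNode(aliveServerNodeOrders);

        // Mirrors the failing check at GridDhtPartitionTopologyImpl.java:1422 (Ignite 1.8):
        // with empty alives the lookup returns null and "java.lang.AssertionError: null" is thrown.
        assert oldest != null;
    }

    public static void main(String[] args) {
        removeNode(Collections.emptyList()); // run with -ea to reproduce the AssertionError shape
    }
}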

Below is the sequence of steps that leads to the assertion error:

1) A node becomes SEGMENTED once SegmentCheckWorker detects the segmentation, after an EVT_NODE_FAILED has been received.
2) It gets visibleRemoteNodes from its TcpDiscoveryNodesRing.
3) It clears the TcpDiscoveryNodesRing, leaving only itself in the list. The node ring is used to determine whether a node is alive during DiscoCache creation.
4) After that, the node initiates removal of all the nodes read in step 2.
5) For each such node, it sends an EVT_NODE_FAILED to the corresponding DiscoverySpiListener, providing a topology containing all the nodes except those already processed.
6) This event reaches GridDiscoveryManager.
7) The node is removed from the alive nodes of every DiscoCache in discoCacheHist.
8) A topology change is detected.
9) Creation of a new DiscoCache is attempted. At this moment no remote node is considered alive, because the TcpDiscoveryNodesRing has been cleared, so the resulting DiscoCache has empty alives (see the sketch after this list).
10) The event with the created DiscoCache and the new topology version is passed to DiscoveryWorker.
11) The event is eventually handled by DiscoveryWorker and recorded by DiscoveryWorker#recordEvent.
12) The recording is handled by GridEventStorageManager, which notifies every listener for this event type (EVT_NODE_FAILED).
13) One of the listeners is GridCachePartitionExchangeManager#discoLsnr. It creates a new GridDhtPartitionsExchangeFuture with the empty DiscoCache received with the event and enqueues it.
14) The future is eventually picked up by the exchange worker and initialized (GridDhtPartitionsExchangeFuture#init).
15) updateTopologies is called, which for each GridCacheContext gets its topology (GridDhtPartitionTopology) and calls GridDhtPartitionTopology#updateTopologyVersion.
16) The DiscoCache for each GridDhtPartitionTopology is assigned from that of the GridDhtPartitionsExchangeFuture. The assigned DiscoCache has empty alives at this point.
17) The distributed exchange is handled (GridDhtPartitionsExchangeFuture#distributedExchange).
18) For each cache context (GridCacheContext), GridDhtPartitionTopologyImpl#beforeExchange is called on its topology (GridDhtPartitionTopologyImpl).
19) It is determined that the node has left, and GridDhtPartitionTopologyImpl#removeNode is called to handle it.
20) An attempt is made to get the alive coordinator node by calling DiscoCache#oldestAliveServerNode.
21) null is returned, which results in the AssertionError.
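
To make steps 3 and 9 concrete, here is a minimal model of how building a DiscoCache against a ring that contains only the local node leaves no remote node alive. The collection names are hypothetical; only the filtering rule described in step 3 is taken from the report:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class EmptyAlivesSketch {
    public static void main(String[] args) {
        // After step 3 only the local node is left in the ring (stand-in for TcpDiscoveryNodesRing).
        Set<String> ring = Set.of("localNode");

        // Remote nodes carried by the EVT_NODE_FAILED topology (step 5).
        List<String> remoteNodes = List.of("remoteA", "remoteB");

        // During DiscoCache creation a node counts as alive only if the ring still contains it,
        // so every remote node is filtered out and the new DiscoCache ends up with empty alives (step 9).
        List<String> aliveRemoteNodes = remoteNodes.stream()
            .filter(ring::contains)
            .collect(Collectors.toList());

        System.out.println(aliveRemoteNodes); // prints [] -> DiscoCache#oldestAliveServerNode will return null
    }
}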

The fix should probably prevent initiating exchange futures once the local node has become segmented.
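
A rough sketch of that direction, with entirely hypothetical names (the segmented flag, onDiscoveryEvent, enqueueExchangeFuture) standing in for the real discovery listener and exchange queue, could look like this:

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the proposed guard: skip enqueueing exchange futures
// once the local node is known to be segmented. None of these names are taken
// from the actual Ignite code; they only show where the check could live.
public class SegmentationGuardSketch {
    private final AtomicBoolean segmented = new AtomicBoolean();

    /** Called when SegmentCheckWorker detects segmentation (step 1). */
    void onSegmented() {
        segmented.set(true);
    }

    /** Stand-in for GridCachePartitionExchangeManager#discoLsnr (step 13). */
    void onDiscoveryEvent(String evt) {
        if (segmented.get()) {
            // The node is already out of the cluster; EVT_NODE_FAILED events produced
            // while flushing the cleared ring should not start a partition exchange.
            System.out.println("Skipping exchange for " + evt + ": local node is segmented");
            return;
        }
        enqueueExchangeFuture(evt);
    }

    void enqueueExchangeFuture(String evt) {
        System.out.println("Enqueued exchange future for " + evt);
    }

    public static void main(String[] args) {
        SegmentationGuardSketch lsnr = new SegmentationGuardSketch();
        lsnr.onDiscoveryEvent("EVT_NODE_FAILED(remoteA)"); // enqueued
        lsnr.onSegmented();
        lsnr.onDiscoveryEvent("EVT_NODE_FAILED(remoteB)"); // skipped
    }
}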

> When a node becomes segmented an AssertionError is thrown during 
> GridDhtPartitionTopologyImpl.removeNode
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-6256
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6256
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.8
>            Reporter: Alexandr Fedotov
>             Fix For: 2.3
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
