[jira] [Comment Edited] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially

2020-09-22 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199980#comment-17199980
 ] 

Vladimir Steshin edited comment on IGNITE-13465 at 9/22/20, 7:43 PM:
-

The problem is that connectionRecoveryTimeout can be spent entirely on a single 
next node. If two nodes in a row fail at the same time, the preceding nodes may 
become segmented one by one. 

I suggest slicing connectionRecoveryTimeout so that several next nodes can be 
tried in an attempt to reconnect to the ring. 
To avoid overly small per-node timeouts, we should introduce a constant, such as 
100ms, as the minimum timeout for an attempt to connect to one next node in the ring.
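
The slicing idea above can be sketched as follows. This is only an illustration, 
not the actual patch: the class name, the method name, the even split and the 
MIN_PER_NODE_TIMEOUT_MS constant are assumptions made for the example, and the 
100ms value is taken from the "like 100ms" suggestion.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Ignite: splits a recovery timeout across next nodes. */
final class RecoveryTimeoutSlicer {
    /** Assumed floor for one connection attempt, per the "like 100ms" suggestion. */
    static final long MIN_PER_NODE_TIMEOUT_MS = 100;

    /**
     * Splits the total recovery timeout evenly across candidate next nodes.
     * If an even share would drop below the floor, fewer nodes are tried.
     * If the total budget itself is below the floor, one node gets all of it.
     */
    static List<Long> slice(long connRecoveryTimeoutMs, int candidateNodes) {
        List<Long> slices = new ArrayList<>();

        if (connRecoveryTimeoutMs <= 0 || candidateNodes <= 0)
            return slices;

        // How many nodes the budget can cover while honoring the floor (at least one).
        long affordable = Math.max(1, connRecoveryTimeoutMs / MIN_PER_NODE_TIMEOUT_MS);
        int nodesToTry = (int) Math.min(candidateNodes, affordable);

        // Integer division: any remainder of the budget is left unused.
        long perNode = connRecoveryTimeoutMs / nodesToTry;

        for (int i = 0; i < nodesToTry; i++)
            slices.add(perNode);

        return slices;
    }
}
```

With a 10000ms budget and 4 candidate nodes each attempt gets 2500ms, so one 
unresponsive next node can no longer consume the whole timeout. With a 250ms 
budget and 5 candidates only 2 nodes are tried, at 125ms each.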



was (Author: vladsz83):
The problem is that connectionRecoveryTimeout can be spent entirely on a single 
next node. If two nodes fail at the same time, the preceding nodes may become 
segmented one by one. 

I suggest slicing connectionRecoveryTimeout so that several next nodes can be 
tried in an attempt to reconnect to the ring. 
To avoid overly small per-node timeouts, I suggest introducing a constant, such as 
100ms, as the minimum timeout for an attempt to connect to one next node in the ring.


> Ignite cluster falls apart if two nodes segmented sequentially
> --
>
> Key: IGNITE-13465
> URL: https://issues.apache.org/jira/browse/IGNITE-13465
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.9
> Reporter: Aleksey Plekhanov
> Assignee: Vladimir Steshin
> Priority: Blocker
> Fix For: 2.9
>
> Attachments: GridSequentionNodesFailureTest.java
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> After ticket IGNITE-13134, sequential segmentation of nodes leads to 
> segmentation of other nodes in the cluster.
> A reproducer is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially

2020-09-22 Thread Vladimir Steshin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199980#comment-17199980
 ] 

Vladimir Steshin edited comment on IGNITE-13465 at 9/22/20, 11:24 AM:
--

The problem is that connectionRecoveryTimeout can be spent entirely on a single 
next node. If two nodes fail at the same time, the preceding nodes may become 
segmented one by one. 

I suggest slicing connectionRecoveryTimeout so that several next nodes can be 
tried in an attempt to reconnect to the ring. 
To avoid overly small per-node timeouts, I suggest introducing a constant, such as 
100ms, as the minimum timeout for an attempt to connect to one next node in the ring.



was (Author: vladsz83):
The problem is that connectionRecoveryTimeout can be spent entirely on a single 
next node. If two nodes fail at the same time, the preceding nodes may become 
segmented one by one. 

I suggest slicing connectionRecoveryTimeout so that several next nodes can be 
tried in an attempt to reconnect to the ring. We should take `servers/2 + 1` as 
the maximum reasonable number of nodes to try to reconnect to. If we cannot 
connect to half of the ring, this can be considered a major network malfunction 
and therefore segmentation.
To avoid overly small per-node timeouts, I suggest introducing a constant, such as 
100ms, as the minimum timeout for an attempt to connect to one next node in the ring.




