[jira] [Comment Edited] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially
[ https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199980#comment-17199980 ]

Vladimir Steshin edited comment on IGNITE-13465 at 9/22/20, 7:43 PM:

The problem is that connectionRecoveryTimeout can be spent entirely on a single next node. If two consecutive nodes fail at the same time, the preceding nodes may become segmented one by one. I suggest slicing connectionRecoveryTimeout so that several next nodes can be tried in an attempt to reconnect to the ring. To avoid per-node timeouts that are too small, we should introduce a constant, such as 100 ms, as the minimum timeout for an attempt to connect to any one next node in the ring.

was (Author: vladsz83):
The problem is that connectionRecoveryTimeout can be spent entirely on a single next node. If two nodes fail at the same time, the preceding nodes may become segmented one by one. I suggest slicing connectionRecoveryTimeout so that several next nodes can be tried in an attempt to reconnect to the ring. To avoid per-node timeouts that are too small, I suggest introducing a constant, such as 100 ms, as the minimum timeout for an attempt to connect to any one next node in the ring.

> Ignite cluster falls apart if two nodes segmented sequentially
>
> Key: IGNITE-13465
> URL: https://issues.apache.org/jira/browse/IGNITE-13465
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.9
> Reporter: Aleksey Plekhanov
> Assignee: Vladimir Steshin
> Priority: Blocker
> Fix For: 2.9
>
> Attachments: GridSequentionNodesFailureTest.java
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> After ticket IGNITE-13134, sequential segmentation of nodes leads to segmentation
> of other nodes in the cluster.
> Reproducer attached.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
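The timeout slicing proposed in the comment can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not Ignite's implementation: it assumes the recovery budget is split evenly across the next nodes to try, raised to the suggested 100 ms floor, and never larger than the whole budget. The names `RecoveryTimeoutSlicing`, `MIN_PER_NODE_TIMEOUT_MS`, `perNodeTimeout`, and `plan` are hypothetical and are not part of Ignite's `TcpDiscoverySpi`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: slicing a connection recovery budget across next nodes in the ring.
class RecoveryTimeoutSlicing {
    /** Hypothetical floor for one connect attempt (the "like 100ms" from the comment). */
    static final long MIN_PER_NODE_TIMEOUT_MS = 100;

    /**
     * Even share of the recovery budget for one next node, raised to the floor
     * but never larger than the whole budget.
     */
    static long perNodeTimeout(long connectionRecoveryTimeoutMs, int nextNodes) {
        if (connectionRecoveryTimeoutMs <= 0 || nextNodes <= 0)
            throw new IllegalArgumentException("budget and node count must be positive");
        long evenShare = connectionRecoveryTimeoutMs / nextNodes;
        return Math.min(connectionRecoveryTimeoutMs, Math.max(MIN_PER_NODE_TIMEOUT_MS, evenShare));
    }

    /**
     * Per-attempt timeouts, one per next node that fits into the budget,
     * in ring order starting from the immediate next node.
     */
    static List<Long> plan(long connectionRecoveryTimeoutMs, int nextNodes) {
        long slice = perNodeTimeout(connectionRecoveryTimeoutMs, nextNodes);
        // slice <= budget, so at least one attempt always fits.
        long fitting = connectionRecoveryTimeoutMs / slice;
        int attempts = (int) Math.min(nextNodes, fitting);
        List<Long> timeouts = new ArrayList<>();
        for (int i = 0; i < attempts; i++)
            timeouts.add(slice);
        return timeouts;
    }

    public static void main(String[] args) {
        // 10 s budget, 4 next nodes: an even 2500 ms each.
        System.out.println(plan(10_000, 4)); // [2500, 2500, 2500, 2500]
        // 300 ms budget, 5 next nodes: the 100 ms floor means only 3 nodes are tried.
        System.out.println(plan(300, 5)); // [100, 100, 100]
    }
}
```

With the floor in place, a small budget reaches fewer next nodes rather than giving each node a uselessly short connect timeout.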
[jira] [Comment Edited] (IGNITE-13465) Ignite cluster falls apart if two nodes segmented sequentially
[ https://issues.apache.org/jira/browse/IGNITE-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199980#comment-17199980 ]

Vladimir Steshin edited comment on IGNITE-13465 at 9/22/20, 11:24 AM:

The problem is that connectionRecoveryTimeout can be spent entirely on a single next node. If two nodes fail at the same time, the preceding nodes may become segmented one by one. I suggest slicing connectionRecoveryTimeout so that several next nodes can be tried in an attempt to reconnect to the ring. To avoid per-node timeouts that are too small, I suggest introducing a constant, such as 100 ms, as the minimum timeout for an attempt to connect to any one next node in the ring.

was (Author: vladsz83):
The problem is that connectionRecoveryTimeout can be spent entirely on a single next node. If two nodes fail at the same time, the preceding nodes may become segmented one by one. I suggest slicing connectionRecoveryTimeout so that several next nodes can be tried in an attempt to reconnect to the ring. We should consider `servers/2 + 1` as the maximum reasonable number of next nodes to try reconnecting to. If we cannot connect to half of the ring, this can be considered a major network malfunction and treated as segmentation. To avoid per-node timeouts that are too small, I suggest introducing a constant, such as 100 ms, as the minimum timeout for an attempt to connect to any one next node in the ring.
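The earlier revision of this comment (the "was" text, later dropped) also proposed capping the search at `servers/2 + 1` next nodes and treating failure to reach any of them as segmentation. A minimal Java sketch of that rule follows. It assumes `servers` counts every server node in the ring including the local one, so at most `servers - 1` next nodes exist, and it models a connect attempt as an `IntPredicate` over the ring offset. `HalfRingReconnect`, `maxTargets`, and `findReconnectTarget` are hypothetical names, not Ignite code.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of the "servers/2 + 1" reconnect cap from the earlier comment revision.
class HalfRingReconnect {
    /** Number of next nodes to try: servers/2 + 1, bounded by the nodes that actually exist. */
    static int maxTargets(int servers) {
        // Assumption: 'servers' includes the local node, so only servers - 1 next nodes exist.
        return Math.min(servers - 1, servers / 2 + 1);
    }

    /**
     * Walks the ring from the immediate next node (offset 1) and returns the offset of the
     * first node that accepts a connection, or -1 if none within the cap does
     * (which the proposal would treat as segmentation of the local node).
     */
    static int findReconnectTarget(int servers, IntPredicate connects) {
        int cap = maxTargets(servers);
        for (int offset = 1; offset <= cap; offset++) {
            if (connects.test(offset))
                return offset;
        }
        return -1;
    }

    public static void main(String[] args) {
        // 10 servers, the next two nodes are down: reconnect succeeds at offset 3.
        System.out.println(findReconnectTarget(10, off -> off > 2)); // 3
        // 10 servers, the first 6 next nodes are unreachable: the cap is 6, so give up.
        System.out.println(findReconnectTarget(10, off -> off > 6)); // -1
    }
}
```

In the ticket's scenario of two consecutive failed nodes, a 10-node ring under this rule would reconnect past them instead of segmenting the predecessor.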