[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Koptilin updated IGNITE-12133:
-----------------------------------------
    Description: 
Currently, partition exchange leverages a ring. This means that communication
is O(n) in the number of nodes. It also means that if a non-coordinator node
hangs, it can take much longer to successfully resolve the topology.

Instead, why not use something like a skip-list in which the coordinator is
first? The coordinator notifies the first node at each level of the skip-list.
Each node then notifies all of its "near-neighbours" in the skip-list, where
node B is a near-neighbour of node A if max-level(B) <= max-level(A), and B is
the first node at its level when traversing from A in the direction of B,
skipping over any node C with max-level(C) > max-level(A).

1
1 . . . 3
1 . . . 3 . . . 5
1 . 2 . 3 . 4 . 5 . 6

In the above, 1 would notify 2 and 3, and 3 would notify 4 and 5; likewise
2 -> 4, 4 -> 6, and 5 -> 6.
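
To make the near-neighbour rule concrete, here is a minimal Java sketch
(illustrative only, not Ignite code; it assumes the ordered node list and every
node's max-level are known cluster-wide, e.g. derived from a hash of the node
id). It reproduces the fan-out above:

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of the near-neighbour rule (illustrative, not Ignite code). */
public class NearNeighbours {
    /**
     * Indices of the near-neighbours of the node at {@code from}, scanning in
     * the {@code step} direction (+1 forward, -1 backward). B qualifies iff
     * max-level(B) <= max-level(A) and B's max-level exceeds that of every
     * non-skipped node between A and B, i.e. B is the first node at its level.
     */
    static List<Integer> nearNeighbours(int[] maxLevel, int from, int step) {
        List<Integer> res = new ArrayList<>();
        int highest = 0; // highest max-level seen among non-skipped nodes so far

        for (int i = from + step; i >= 0 && i < maxLevel.length; i += step) {
            if (maxLevel[i] > maxLevel[from])
                continue; // skip over taller nodes C with max-level(C) > max-level(A)

            if (maxLevel[i] > highest) {
                res.add(i); // first node at its level -> near-neighbour
                highest = maxLevel[i];

                if (highest == maxLevel[from])
                    break; // no later node can be first at a new level
            }
        }
        return res;
    }

    public static void main(String[] args) {
        // Max-levels from the diagram: node 1 is in 4 lists, node 3 in 3,
        // node 5 in 2, and nodes 2, 4 and 6 in 1.
        int[] maxLevel = {4, 1, 3, 1, 2, 1};

        for (int i = 0; i < maxLevel.length; i++) {
            List<Integer> ids = new ArrayList<>();
            for (int j : nearNeighbours(maxLevel, i, +1))
                ids.add(j + 1); // convert 0-based index to 1-based node id

            System.out.println("node " + (i + 1) + " -> " + ids);
        }
        // Prints: 1 -> [2, 3], 2 -> [4], 3 -> [4, 5], 4 -> [6], 5 -> [6], 6 -> []
    }
}
{code}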

One can achieve better redundancy by having each node traverse in both
directions, and having the coordinator also notify the last node in the list at
each level. This way, in the above example, if 2 and 3 were both down, 4 would
still be notified by 5 and 6 (in the backwards direction).
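
The backwards pass falls out of the same sketch by scanning with step = -1.
With the levels above, node 4 (index 3) is still reached even when 2 and 3 are
down:

{code:java}
int[] maxLevel = {4, 1, 3, 1, 2, 1};

// Backward near-neighbours (0-based indices): node 4 is reached from behind.
NearNeighbours.nearNeighbours(maxLevel, 4, -1); // [3] - node 5 notifies node 4
NearNeighbours.nearNeighbours(maxLevel, 5, -1); // [3] - node 6 notifies node 4
{code}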


The idea is that each individual node has only O(log n) nodes to notify: its
fan-out is bounded by its own max-level, which is O(log n) under the usual
geometric level assignment, so the overall time is reduced. Additionally, we
can tolerate at least one node failure, and if one includes the option of
processing backwards, two consecutive node failures can be handled as well.
With this kind of approach, the coordinator can treat any node it didn't
receive a message from as not connected and update the topology accordingly
(disconnecting any nodes it didn't get a notification from). There are some
edge cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2
disconnected nodes: the connected node would be wrongly ejected from the
topology), but these should generally be too rare to need explicit handling.
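
On the coordinator's side this could reduce to simple bookkeeping; a
hypothetical sketch (names are illustrative, not the Ignite API):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical coordinator bookkeeping, not the Ignite API: after the exchange
// timeout, any node that never sent a notification is treated as not connected.
public class EjectSilentNodes {
    public static void main(String[] args) {
        Set<String> expected = new HashSet<>(Set.of("n2", "n3", "n4", "n5", "n6"));
        Set<String> heardFrom = Set.of("n2", "n4", "n5", "n6"); // n3 stayed silent

        expected.removeAll(heardFrom); // whoever is left missed the exchange
        System.out.println("eject from topology: " + expected); // [n3]
    }
}
{code}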

> O(log n) partition exchange
> ---------------------------
>
>                 Key: IGNITE-12133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12133
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Moti Nisenson-Ken
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
