[ https://issues.apache.org/jira/browse/IGNITE-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103757#comment-17103757 ]
Alexey Scherbakov edited comment on IGNITE-12617 at 5/11/20, 8:25 AM:
----------------------------------------------------------------------

[~avinogradov] Your solution has two drawbacks:
1. Double latch waiting if replicated caches are in the topology.
2. It degrades to a no-op if backups are spread across grid nodes (the default behavior with rendezvous affinity).

I would like to propose an algorithm that should provide the same latency decrease as the best case of your approach, without the obvious drawbacks.

Assume a baseline node has failed.
1. As soon as all nodes have received the node-failed event, the coordinator sends a custom discovery CollectMaxCountersMessage to collect the max update counter for all affected partitions (all primaries of the failed node). No new transaction can reserve a counter at this point.
2. Each node that has a newly assigned primary partition waits for countersReadyFuture before finishing the exchange future.
3. All nodes having no such partitions finish the exchange future immediately.
4. On receiving the CollectMaxCountersMessage, the coordinator prepares a CountersReadyMessage and sends it directly to all affected nodes from step 2.
5. On receiving the CountersReadyMessage, each affected node applies the max counter to the partition and finishes the exchange future.

The algorithm also benefits from "cells". It can be further improved by integrating max counter collection into node-failure processing at the discovery layer.
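For illustration, a minimal Java sketch of the flow above. This is not Ignite code: the message names (CollectMaxCountersMessage, CountersReadyMessage) and countersReadyFuture come from the comment, while the class, fields and helper methods are assumptions.

{code:java}
import java.util.*;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the proposed counter-collection flow; not real Ignite code. */
public class PmeFreeSwitchSketch {
    /** Max update counters reported by one node for the affected partitions it owns (step 1). */
    record CollectMaxCountersMessage(UUID nodeId, Map<Integer, Long> maxCountersByPartition) {}

    /** Counters the coordinator sends directly to the affected nodes (step 4). */
    record CountersReadyMessage(Map<Integer, Long> maxCountersByPartition) {}

    /** Coordinator side: merge per-node reports into the max counter per affected partition. */
    static CountersReadyMessage prepareCountersReady(Collection<CollectMaxCountersMessage> reports) {
        Map<Integer, Long> max = new HashMap<>();

        for (CollectMaxCountersMessage msg : reports)
            msg.maxCountersByPartition().forEach((part, cntr) -> max.merge(part, cntr, Math::max));

        return new CountersReadyMessage(max);
    }

    /** Node side: the exchange future completes only after counters are ready (steps 2, 3, 5). */
    static CompletableFuture<Void> finishExchange(boolean hasNewlyAssignedPrimary,
                                                  CompletableFuture<CountersReadyMessage> countersReadyFuture,
                                                  Map<Integer, Long> localCounters) {
        if (!hasNewlyAssignedPrimary)
            return CompletableFuture.completedFuture(null);            // Step 3: finish immediately.

        return countersReadyFuture.thenAccept(msg ->                   // Step 2: wait for counters.
            msg.maxCountersByPartition().forEach((part, cntr) ->       // Step 5: apply max counter.
                localCounters.merge(part, cntr, Math::max)));
    }
}
{code}

The point of the sketch is only that nodes without newly assigned primaries never block, while affected nodes block solely on the counters future rather than on a cluster-wide latch.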
> PME-free switch should wait for recovery only at affected nodes.
> -----------------------------------------------------------------
>
>                 Key: IGNITE-12617
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12617
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Anton Vinogradov
>            Assignee: Anton Vinogradov
>            Priority: Major
>              Labels: iep-45
>             Fix For: 2.9
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since IGNITE-9913, new-topology operations are allowed immediately after the
> cluster-wide recovery has finished.
> But is there any reason to wait for cluster-wide recovery if only one node
> has failed? In this case, we should recover only the failed node's backups.
> Unfortunately, {{RendezvousAffinityFunction}} tends to spread a node's backup
> partitions across the whole cluster, so in that case we obviously have to
> wait for cluster-wide recovery on the switch.
> But what if only a few nodes hold the backups for every primary?
> If nodes are combined into virtual cells where, for each partition, the
> backups are located in the same cell as the primaries, it is possible to
> finish the switch outside the affected cell before tx recovery finishes.
> This optimization allows us to start and even finish new operations outside
> the failed cell without waiting for the cluster-wide switch to finish
> (broken-cell recovery).
> In other words, the switch (when left/fail + baseline + rebalanced) will have
> little effect on the latency of operations not related to the failed cell.
> Specifically:
> - We should wait for tx recovery before finishing the switch only on the
>   broken cell.
> - We should wait for replicated caches' tx recovery everywhere, since every
>   node is a backup of the failed one.
> - Upcoming operations related to the broken cell (including all
>   replicated-cache operations) will require the cluster-wide switch to finish
>   before being processed.
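As an illustration of the cell-based rule above, here is a hypothetical Java predicate (not Ignite API; all names are assumed) that decides whether a node has to wait for tx recovery before finishing the switch:

{code:java}
import java.util.Set;

/** Hypothetical sketch of the cell-aware waiting rule; names are illustrative only. */
public class CellAwareSwitch {
    /**
     * A node must wait for tx recovery before finishing the switch only if it is affected:
     * it keeps backups of partitions that were primary on the failed node (same "cell"),
     * or it hosts replicated caches (every node is then a backup of the failed one).
     */
    static boolean mustWaitForTxRecovery(Set<Integer> failedNodePrimaries,
                                         Set<Integer> localBackupPartitions,
                                         boolean hostsReplicatedCaches) {
        if (hostsReplicatedCaches)
            return true;

        // Inside the broken cell: this node backs up at least one of the failed primaries.
        return localBackupPartitions.stream().anyMatch(failedNodePrimaries::contains);
    }
}
{code}

Under this rule, nodes outside the broken cell that host no replicated caches finish the switch immediately.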