This is an automated email from the ASF dual-hosted git repository. agoncharuk pushed a commit to branch ignite-14647 in repository https://gitbox.apache.org/repos/asf/ignite-3.git
commit c9fa5ec4dfd62ac9f0c7caa6350c4bcd5051f2e8
Author: Alexey Goncharuk <alexey.goncha...@gmail.com>
AuthorDate: Sat Apr 24 13:11:38 2021 +0300

    IGNITE-14647 Describe Raft-based rebalance process
---
 modules/affinity/README.md | 86 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/modules/affinity/README.md b/modules/affinity/README.md
new file mode 100644
index 0000000..2c2db7c
--- /dev/null
+++ b/modules/affinity/README.md
@@ -0,0 +1,86 @@

# Partitioning approach

## Hash-based partitioning

## Range-based partitioning

# Data migration (rebalance)
There is a significant difference between the rebalance approach in Ignite 2.x and the one in Ignite 3.x.

Ignite 2.x implements rebalance with updates being applied to the storage concurrently with the data migration
process. This results in a complex interaction between the rebalance process and the data update protocol (the need
to compare key-value versions during data migration, different entry processor application paths depending on whether
rebalance is active, an uncertain partition state during recovery, etc.).

Ignite 3.x relies on the common replication infrastructure for data replication between nodes, so rebalance should
be handled by means of the replication protocols themselves.

## Raft
The Raft consensus protocol has no concept of rebalance. Instead, it relies on two underlying mechanisms to catch
offline nodes up to speed and to bootstrap new Raft group members: the Raft log and snapshots. These mechanisms handle
both the delta scenario (a local node has a recent enough local state, so it can be brought up to speed by sending only
the recent Raft log commands) and the full scenario (the Raft group does not have a sufficient Raft log to catch the
node up, so a full state machine snapshot must be sent to the local node). The choice between snapshot- and log-based
catch-up is based on Raft log availability; however, this logic can be adjusted to a more sophisticated heuristic. The
underlying state machine only needs to provide the snapshot functionality, which differs for in-memory and persistent
tables.

### In-memory tables
In-memory tables do not store partitions in isolated memory regions. Instead, the partition data is written to a shared
memory pool in order to provide efficient memory utilization (otherwise, a memory chunk assigned to a partition would
remain assigned to it and could not be reused by other partitions). This makes it impossible to create partition memory
snapshots on the physical level, so a snapshot must be maintained on a per-tuple basis.

At any moment in time, at most one in-memory partition snapshot can be maintained.

#### Alternative 1
To create an in-memory snapshot, we use an MVCC-like approach with a copy-on-write technique. The partition tree is
extended to keep two versions of a tuple for the same key: the most recent version and the snapshot version. A snapshot
tuple contains the snapshot ID in addition to the regular tuple data. Snapshot tuples are only visible to the snapshot
iterator and must be filtered out from regular data access paths.

When a snapshot of an in-memory partition is requested, the partition state machine checks that there is no other
active snapshot and assigns a new snapshot ID to be used for copy-on-write tuples. When the snapshot iterator traverses
the tree, it attempts to read both the up-to-date and the snapshot version of each key. If a snapshot version of the
key with the current snapshot ID exists, the iterator must use it; otherwise, the iterator must use the up-to-date
version of the tuple.

Each partition state machine update checks whether a snapshot is being maintained. If there is no active snapshot, the
update operation should clean up an old snapshot tuple version, if any, and perform the regular tuple update. If there
is an active snapshot, the update operation must first clean up an old snapshot tuple version, if any. Then, if a
snapshot tuple version with the current snapshot ID does not exist, the update operation copies the current tuple value
to the snapshot version and then completes the update (it does not copy the current value if a relevant snapshot
version already exists).

When the snapshot is no longer needed, an asynchronous process can clean up the snapshot versions in the partition.

This approach induces no memory overhead when no snapshot is maintained, but may require up to 2x the partition size
under heavy load (because, in the worst case, the whole partition may be copied to the snapshot versions). See the
first sketch below for an illustration of this update path.

#### Alternative 2
To create an in-memory snapshot, the snapshot data is written to a separate in-memory buffer. The buffer is populated
from the state machine update thread, either by the update operations themselves or by a snapshot advance mini-task
that is submitted to the state machine update thread as needed.

To maintain the snapshot, the state machine keeps a snapshot iterator boundary key. If a key being updated is smaller
than or equal to the boundary key, no additional action is needed because the snapshot iterator has already processed
this key. If a key being updated is larger than the boundary key, the old version of the key is eagerly put into the
snapshot buffer and the key is marked with the snapshot ID (so that the key is skipped during further iteration). The
snapshot advance mini-task iterates over the next batch of keys starting from the boundary key and puts into the
snapshot buffer only the keys that are not yet marked with the snapshot ID. See the second sketch below.

This approach has memory requirements similar to the first alternative, but does not require modifying the storage tree
to store multiple versions of the same key. It also allows transparent offloading of the snapshot buffer to disk, which
can reduce memory requirements, and it is simpler to implement because the code is essentially single-threaded and only
requires synchronization for the in-memory buffer. The downside is that the snapshot advance tasks will increase the
tail latency of state machine update operations.
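To make Alternative 1 concrete, here is a minimal sketch of the copy-on-write update and snapshot read paths. This is
an illustration only, not Ignite code: all class and member names are hypothetical, a `String`-keyed sorted map stands
in for the partition tree, both tuple versions live in one entry, and values are plain byte arrays.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sketch of Alternative 1; not the actual Ignite partition tree. */
class CowPartitionSketch {
    /** Per-key versions: the up-to-date value plus an optional copy-on-write snapshot copy. */
    private static class Versions {
        byte[] current;
        byte[] snapshotValue; // copy-on-write version, null if none
        long snapshotId = -1; // ID of the snapshot the copy belongs to
    }

    private final ConcurrentSkipListMap<String, Versions> tree = new ConcurrentSkipListMap<>();

    private long activeSnapshotId = -1; // at most one snapshot at a time
    private long snapshotIdGen;

    /** Runs on the state machine update thread. */
    void update(String key, byte[] value) {
        Versions v = tree.computeIfAbsent(key, k -> new Versions());

        if (activeSnapshotId == -1) {
            // No active snapshot: clean up a stale snapshot copy, if any, and update in place.
            v.snapshotValue = null;
            v.snapshotId = -1;
        } else if (v.snapshotId != activeSnapshotId) {
            // Active snapshot with no copy for this key yet: preserve the pre-update value.
            v.snapshotValue = v.current;
            v.snapshotId = activeSnapshotId;
        }
        // If a copy with the current snapshot ID already exists, just overwrite the value.
        v.current = value;
    }

    long beginSnapshot() {
        if (activeSnapshotId != -1)
            throw new IllegalStateException("Another snapshot is already active");
        return activeSnapshotId = ++snapshotIdGen;
    }

    /** Snapshot iterator: prefers the snapshot copy, falls back to the up-to-date value. */
    Iterator<Map.Entry<String, byte[]>> snapshotIterator(long snapId) {
        return tree.entrySet().stream()
            .<Map.Entry<String, byte[]>>map(e -> new SimpleImmutableEntry<>(e.getKey(),
                e.getValue().snapshotId == snapId ? e.getValue().snapshotValue
                                                  : e.getValue().current))
            .filter(e -> e.getValue() != null) // keys created after the snapshot began are skipped
            .iterator();
    }
}
```

Note how the worst-case 2x memory bound falls out of this scheme: if every key is updated while the snapshot is
active, every entry ends up holding two values.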
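A similar sketch of Alternative 2's bookkeeping follows: the boundary key, the eager copy on update, and the advance
mini-task. Again, every name is hypothetical, single-threaded state machine execution is assumed, and a sorted map
stands in for the partition tree.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayDeque;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Queue;
import java.util.TreeMap;

/** Hypothetical sketch of Alternative 2; all methods run on the state machine update thread. */
class BoundaryKeySnapshotSketch {
    private static class Entry {
        byte[] value;
        long snapshotMark = -1; // snapshot ID this key was already copied (or skipped) for
    }

    private final NavigableMap<String, Entry> partition = new TreeMap<>();

    /** Snapshot buffer; in a real system it could be offloaded to disk. */
    private final Queue<Map.Entry<String, byte[]>> snapshotBuffer = new ArrayDeque<>();

    private long snapshotId = -1; // active snapshot, -1 if none
    private String boundaryKey;   // keys <= boundary were already emitted by the iterator

    void beginSnapshot(long id) {
        snapshotId = id;
        boundaryKey = ""; // sorts below any real key in this sketch
    }

    void update(String key, byte[] value) {
        Entry e = partition.computeIfAbsent(key, k -> new Entry());

        // Keys at or below the boundary are already in the snapshot; keys already marked
        // with the current snapshot ID were copied eagerly by an earlier update.
        if (snapshotId != -1 && key.compareTo(boundaryKey) > 0 && e.snapshotMark != snapshotId) {
            if (e.value != null) // a freshly inserted key has no pre-snapshot version
                snapshotBuffer.add(new SimpleImmutableEntry<>(key, e.value));
            e.snapshotMark = snapshotId; // skip this key during further iteration
        }
        e.value = value;
    }

    /** Advance mini-task: copy the next batch of unmarked keys past the boundary. */
    void advanceSnapshot(int batchSize) {
        int processed = 0;
        for (Map.Entry<String, Entry> e : partition.tailMap(boundaryKey, false).entrySet()) {
            if (processed++ == batchSize)
                break;
            if (e.getValue().snapshotMark != snapshotId)
                snapshotBuffer.add(new SimpleImmutableEntry<>(e.getKey(), e.getValue().value));
            boundaryKey = e.getKey();
        }
    }
}
```

The tail latency concern from the text maps to `advanceSnapshot()`: it runs on the update thread, so its batch size
bounds the pause it adds to state machine updates.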
### Persistent tables
For persistent tables, we exploit the existing ability of Ignite native persistence to create partition snapshots, and
we use the partition snapshot file as a Raft snapshot.

## Rebalance scheduling
Snapshot transfer between nodes should be run through a separate scheduling layer so that nodes can dynamically adjust
to memory, CPU, IO, and network consumption. A certain quota of each resource should be allocated to the rebalance
process; the scheduling layer should suspend (and possibly cancel) a rebalance session when the quota is exceeded, and
resume or reschedule the session when resources are freed. A sketch of such a scheduling layer is given below.
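A minimal sketch of what such a quota-aware scheduling layer could look like. This is not an existing Ignite component:
the session interface, the single aggregated usage metric, and the periodic tick are all simplifying assumptions made
for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Supplier;

/** Hypothetical sketch of a quota-aware rebalance scheduling layer. */
class RebalanceScheduler {
    /** A snapshot transfer session that can be suspended and resumed. */
    interface RebalanceSession {
        void suspend();
        void resume();
    }

    private final Queue<RebalanceSession> active = new ArrayDeque<>();
    private final Queue<RebalanceSession> suspended = new ArrayDeque<>();

    /** Fraction of the resource budget allocated to rebalance. */
    private final double quota;

    /** Reports current rebalance resource usage as a fraction; a real node would
     * aggregate memory, CPU, IO, and network metrics here. */
    private final Supplier<Double> usage;

    RebalanceScheduler(double quota, Supplier<Double> usage) {
        this.quota = quota;
        this.usage = usage;
    }

    void submit(RebalanceSession session) {
        active.add(session);
        session.resume();
    }

    /** Invoked periodically: throttle one session when over quota, resume one when under.
     * A real implementation might also cancel and reschedule long-suspended sessions. */
    void onMetricsTick() {
        if (usage.get() > quota && !active.isEmpty()) {
            RebalanceSession s = active.poll();
            s.suspend();
            suspended.add(s);
        } else if (usage.get() < quota && !suspended.isEmpty()) {
            RebalanceSession s = suspended.poll();
            s.resume();
            active.add(s);
        }
    }
}
```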
\ No newline at end of file