This is an automated email from the ASF dual-hosted git repository.
agoncharuk pushed a commit to branch ignite-14647
in repository https://gitbox.apache.org/repos/asf/ignite-3.git
commit c9fa5ec4dfd62ac9f0c7caa6350c4bcd5051f2e8
Author: Alexey Goncharuk
AuthorDate: Sat Apr 24 13:11:38 2021 +0300
IGNITE-14647 Describe Raft-based rebalance process
---
modules/affinity/README.md | 86 ++
1 file changed, 86 insertions(+)
diff --git a/modules/affinity/README.md b/modules/affinity/README.md
new file mode 100644
index 000..2c2db7c
--- /dev/null
+++ b/modules/affinity/README.md
@@ -0,0 +1,86 @@
+# Partitioning approach
+
+## Hash-based partitioning
+
+## Range-based partitioning
+
+# Data migration (rebalance)
+There is a significant difference between the rebalance approach in Ignite 2.x and the rebalance approach in Ignite 3.x.
+
+Ignite 2.x implemented the rebalance process with updates being applied to the storage concurrently with the data
+migration process. This resulted in a complex interaction between the rebalance process and the data update protocol
+(the need to compare key-value versions during data migration, different entry processor application paths depending
+on whether rebalance is active, an uncertain partition state during recovery, etc.).
+
+Ignite 3.x relies on a common replication infrastructure for data replication between nodes, so rebalance should be
+handled by means of the replication protocols.
+
+## Raft
+The Raft consensus protocol does not have a concept of rebalance. Instead, it relies on two underlying mechanisms to
+catch offline nodes up to speed and to bootstrap new Raft group members: the Raft log and snapshots. These mechanisms
+handle both delta rebalance (when a local node has recent enough local state, so it can be brought up to speed by
+replaying only the latest Raft log commands) and full rebalance (when the Raft group no longer has sufficient Raft log
+to catch the node up, so a full state machine snapshot must be sent to the node). The choice between snapshot- and
+log-based catch-up is based on Raft log availability; however, this logic can be adjusted to a more sophisticated
+heuristic. The underlying state machine only needs to provide the snapshot functionality, which differs for in-memory
+and persistent tables.
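+
+The contract the replication layer needs from the state machine can be pictured with the following Java sketch. The
+interface, class, and method names below are illustrative assumptions for this document, not the actual Ignite 3 or
+Raft implementation API:
+
+```java
+import java.nio.file.Path;
+import java.util.concurrent.CompletableFuture;
+
+/** Hypothetical snapshot hooks a partition state machine could expose to the Raft layer. */
+interface PartitionSnapshotSupport {
+    /** Asynchronously writes a consistent snapshot of the partition state to the given directory. */
+    CompletableFuture<Void> createSnapshot(Path snapshotDir);
+
+    /** Replaces the local partition state with the contents of a received snapshot. */
+    void installSnapshot(Path snapshotDir);
+}
+
+/** Illustration of the default choice between delta (log) and full (snapshot) catch-up. */
+final class CatchUpPlanner {
+    /** A snapshot is needed when the follower's next required log entry is no longer retained by the group. */
+    static boolean useSnapshot(long followerAppliedIndex, long firstRetainedLogIndex) {
+        return followerAppliedIndex + 1 < firstRetainedLogIndex;
+    }
+}
+```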
+
+### In-memory tables
+In-memory tables do not store partitions in isolated memory regions. Instead, partition data is written to a shared
+memory pool to provide efficient memory utilization (otherwise, a memory chunk assigned to a partition would remain
+assigned to it and would not be eligible for reuse by other partitions). This makes it impossible to create partition
+memory snapshots at the physical level, so we need to maintain snapshots on a per-tuple basis.
+
+At any moment in time, at most one in-memory partition snapshot can be maintained.
+
+#### Alternative 1
+To create an in-memory snapshot, we use an MVCC-like approach with a copy-on-write technique. The partition tree is
+extended to keep two versions of a tuple for the same key: one is the most recent version, and the other is the
+snapshot version. The snapshot tuple contains a snapshot ID in addition to the regular tuple data. Snapshot tuples are
+only visible to the snapshot iterator and must be filtered out from regular data access paths.
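+
+As an illustration only (the field names and layout below are assumptions for this document, not the actual page
+memory format), the per-key state described above can be pictured as:
+
+```java
+/** Per-key entry keeping the up-to-date tuple plus an optional copy-on-write snapshot version. */
+final class VersionedEntry {
+    byte[] latest;        // The most recent tuple value, visible to regular reads and updates.
+    byte[] snapshotValue; // Copy-on-write value captured while a snapshot was active, or null if none.
+    long snapshotId;      // Snapshot ID the snapshotValue belongs to (meaningful only when snapshotValue != null).
+}
+```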
+
+When a snapshot for an in-memory partition is requested, the partition state machine checks that there is no other
+active snapshot and assigns a new snapshot ID that will be used for copy-on-write tuples. When the snapshot iterator
+traverses the tree, it attempts to read both the up-to-date and the snapshot version of a key. If a snapshot version
+of the key with the current snapshot ID exists, it must be used by the iterator; otherwise, the up-to-date version of
+the tuple must be used.
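+
+The iterator's read rule can be sketched as follows, building on the hypothetical `VersionedEntry` layout above:
+
+```java
+/** Read rule used by the snapshot iterator (sketch, not the actual implementation). */
+final class SnapshotReader {
+    private final long activeSnapshotId;
+
+    SnapshotReader(long activeSnapshotId) {
+        this.activeSnapshotId = activeSnapshotId;
+    }
+
+    byte[] readForSnapshot(VersionedEntry e) {
+        // The key was overwritten after the snapshot started: use the preserved copy.
+        if (e.snapshotValue != null && e.snapshotId == activeSnapshotId)
+            return e.snapshotValue;
+
+        // The key has not changed since the snapshot started: the up-to-date value is also the snapshot value.
+        return e.latest;
+    }
+}
+```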
+
+Each partition state machine update checks whether a snapshot is being maintained. If there is no active snapshot,
+the update operation cleans an old snapshot tuple version, if any, and performs the regular tuple update. If there is
+an active snapshot, the update operation first cleans an old snapshot tuple version (one tagged with a previous
+snapshot ID), if any. Then, if a snapshot tuple version with the current snapshot ID does not exist, the update
+operation copies the current tuple value to the snapshot version and then completes the update (it does not copy the
+current value if a relevant snapshot version already exists).
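+
+These update rules can be summarized in the following sketch, again using the hypothetical `VersionedEntry`; the real
+update path operates on the partition tree and page memory and is more involved:
+
+```java
+/** Sketch of a single-key update under an optional active snapshot. */
+final class PartitionUpdater {
+    /** ID of the snapshot currently being maintained, or -1 if there is none. */
+    private long activeSnapshotId = -1;
+
+    void update(VersionedEntry e, byte[] newValue) {
+        if (activeSnapshotId < 0) {
+            // No active snapshot: drop any stale snapshot version and apply the regular update.
+            e.snapshotValue = null;
+            e.latest = newValue;
+            return;
+        }
+
+        // Active snapshot: first drop a snapshot version left over from a previous snapshot.
+        if (e.snapshotValue != null && e.snapshotId != activeSnapshotId)
+            e.snapshotValue = null;
+
+        // Copy the current value aside once per snapshot, only if this key has no version for the current snapshot.
+        if (e.snapshotValue == null) {
+            e.snapshotValue = e.latest;
+            e.snapshotId = activeSnapshotId;
+        }
+
+        e.latest = newValue;
+    }
+}
+```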
+
+When the snapshot is no longer needed, an asynchronous process can clean up the snapshot versions from the partition.
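+
+A possible shape of such a cleanup pass, assuming the hypothetical `VersionedEntry` from above:
+
+```java
+import java.util.Collection;
+
+/** Asynchronous cleanup of snapshot versions after the snapshot has been released (sketch). */
+final class SnapshotCleaner {
+    static void cleanup(Collection<VersionedEntry> partitionEntries, long releasedSnapshotId) {
+        for (VersionedEntry e : partitionEntries) {
+            // The copy-on-write value is no longer reachable by any snapshot iterator, so it can be dropped.
+            if (e.snapshotValue != null && e.snapshotId == releasedSnapshotId)
+                e.snapshotValue = null;
+        }
+    }
+}
+```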
+
+This approach does not induce any memory overhead when no snapshot is maintained, but may require up to 2x of the
+partition size under heavy load (because the whole partition may be copied to the snapshot versions