[ignite-3] branch ignite-14647 created (now c9fa5ec)

2021-04-24 Thread agoncharuk
This is an automated email from the ASF dual-hosted git repository.

agoncharuk pushed a change to branch ignite-14647
in repository https://gitbox.apache.org/repos/asf/ignite-3.git.


  at c9fa5ec  IGNITE-14647 Describe Raft-based rebalance process

This branch includes the following new commits:

 new c9fa5ec  IGNITE-14647 Describe Raft-based rebalance process

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[ignite-3] 01/01: IGNITE-14647 Describe Raft-based rebalance process

2021-04-24 Thread agoncharuk
This is an automated email from the ASF dual-hosted git repository.

agoncharuk pushed a commit to branch ignite-14647
in repository https://gitbox.apache.org/repos/asf/ignite-3.git

commit c9fa5ec4dfd62ac9f0c7caa6350c4bcd5051f2e8
Author: Alexey Goncharuk 
AuthorDate: Sat Apr 24 13:11:38 2021 +0300

IGNITE-14647 Describe Raft-based rebalance process
---
 modules/affinity/README.md | 86 ++
 1 file changed, 86 insertions(+)

diff --git a/modules/affinity/README.md b/modules/affinity/README.md
new file mode 100644
index 000..2c2db7c
--- /dev/null
+++ b/modules/affinity/README.md
@@ -0,0 +1,86 @@
+# Partitioning approach
+
+## Hash-based partitioning
+
+## Range-based partitioning
+
+# Data migration (rebalance)
+There is a significant difference between the rebalance approach in Ignite 2.x and the rebalance approach in Ignite 3.x.
+
+Ignite 2.x implemented the rebalance process with updates being applied to the storage concurrently with the data
+migration process. This results in a complex interaction between the rebalance process and the data update protocol
+(the necessity to compare key-value versions during data migration, different entry processor application paths
+depending on whether rebalance is active or not, uncertain partition state during recovery, etc.).
+
+Ignite 3.x relies on a common replication infrastructure for data replication between nodes, so rebalance should be
+handled by means of the replication protocols.
+
+## Raft
+The Raft consensus protocol does not have a concept of rebalance. Instead, it relies on two underlying mechanisms to
+catch offline nodes up to speed and to bootstrap new Raft group members: the Raft log and snapshots. These mechanisms
+handle both the delta scenario (when a local node has a sufficiently recent local state, so it can be brought up to
+speed by sending only the recent Raft log commands) and the full scenario (when the Raft group no longer has enough
+Raft log to catch the node up, so a full state machine snapshot must be sent to the node). The choice between
+snapshot-based and log-based catch-up is based on Raft log availability; however, this logic can be adjusted to a more
+sophisticated heuristic. The underlying state machine only needs to provide the snapshot functionality, which differs
+for in-memory and persistent tables.
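For illustration only, the snapshot functionality the state machine needs to provide could be exposed to the replication layer through an interface along the lines of the hypothetical Java sketch below; the names and signatures are assumptions and are not part of this commit:

```java
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical snapshot hook a partition state machine could expose to the replication
 * layer. The Raft implementation decides when to take a snapshot and when to ship it to
 * a lagging node; the state machine only knows how to dump and restore its own data.
 */
interface PartitionSnapshotSupport {
    /** Asynchronously writes a consistent snapshot of the partition into the given directory. */
    CompletableFuture<Void> saveSnapshot(Path snapshotDir);

    /** Replaces the local partition state with a snapshot previously written to the given directory. */
    void restoreSnapshot(Path snapshotDir);
}
```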
+
+### In-memory tables
+In-memory tables do not store partitions in isolated memory regions. Instead, the partition data is written to a shared
+memory pool in order to provide efficient memory utilization (otherwise, a memory chunk assigned to a partition would
+remain assigned to it and could not be reused by other partitions). This makes it impossible to create partition memory
+snapshots on the physical level, so we need to maintain snapshots on a per-tuple basis.
+
+At any moment in time at most one in-memory partition snapshot can be maintained.
+
+#### Alternative 1
+To create an in-memory snapshot, we use an MVCC-like approach with a copy-on-write technique. The partition tree is
+extended to keep two versions of a tuple for the same key: one is the most recent version, and the other is the
+snapshot version. A snapshot tuple contains the snapshot ID in addition to the regular tuple data. Snapshot tuples are
+only visible to the snapshot iterator and must be filtered out from regular data access paths.
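A minimal Java sketch of the two-version layout described above; the VersionedEntry type and its field names are hypothetical and not part of this commit:

```java
import java.util.UUID;

/**
 * Hypothetical per-key entry in the partition tree: the live tuple plus an optional
 * copy-on-write snapshot version tagged with the snapshot it belongs to.
 */
class VersionedEntry {
    /** The most recent tuple value, visible to regular reads and writes. */
    byte[] liveTuple;

    /** Copy of the tuple taken for the active snapshot, or null if no copy has been needed yet. */
    byte[] snapshotTuple;

    /** ID of the snapshot that snapshotTuple was created for; copies from older snapshots are stale. */
    UUID snapshotId;
}
```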
+
+When a snapshot of an in-memory partition is requested, the partition state machine checks that there is no other
+active snapshot and assigns a new snapshot ID, which will be used for the copy-on-write tuples. When the snapshot
+iterator traverses the tree, it attempts to read both the up-to-date and the snapshot version of each key. If a
+snapshot version of the key with the current snapshot ID exists, it must be used by the iterator; otherwise, the
+up-to-date version of the tuple must be used.
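The version-selection rule for the snapshot iterator could be sketched as follows, assuming the hypothetical two-version layout from the previous sketch (names are assumptions, not part of this commit):

```java
import java.util.UUID;

class SnapshotIteratorSketch {
    /**
     * Selects the tuple version the snapshot iterator should return for one key:
     * the copy-on-write version if it belongs to the currently maintained snapshot,
     * otherwise the up-to-date version.
     */
    static byte[] versionForSnapshot(byte[] liveTuple,
                                     byte[] snapshotTuple,
                                     UUID tupleSnapshotId,
                                     UUID activeSnapshotId) {
        if (snapshotTuple != null && activeSnapshotId.equals(tupleSnapshotId))
            return snapshotTuple;

        return liveTuple;
    }
}
```

With this rule the iterator observes the partition as it was at the moment the snapshot ID was assigned, regardless of updates applied concurrently with the snapshot traversal.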
+
+Each partition state machine update checks whether a snapshot is currently being maintained. If there is no active
+snapshot, the update operation cleans up an old snapshot tuple version, if any, and performs the regular tuple update.
+If there is an active snapshot, the update operation must first clean up an old snapshot tuple version, if any. Then,
+if a snapshot tuple version with the current snapshot ID does not exist, the update operation copies the current tuple
+value to the snapshot version and then completes the update (it does not copy the current value if a relevant snapshot
+version already exists).
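A hedged sketch of this update path, using the same hypothetical two-version entry as above (not part of this commit):

```java
import java.util.Objects;
import java.util.UUID;

/** Mirrors the hypothetical VersionedEntry sketch above. */
class CowEntry {
    byte[] liveTuple;
    byte[] snapshotTuple;
    UUID snapshotId;
}

class PartitionUpdateSketch {
    /**
     * Applies an update to a single key following the rules above: drop a stale snapshot
     * copy, and if a snapshot is active and no copy for it exists yet, preserve the current
     * value before overwriting it. A null activeSnapshotId means no snapshot is maintained.
     */
    static void update(CowEntry entry, byte[] newTuple, UUID activeSnapshotId) {
        // Clean up a snapshot copy left over from a previously released snapshot.
        if (entry.snapshotTuple != null && !Objects.equals(entry.snapshotId, activeSnapshotId)) {
            entry.snapshotTuple = null;
            entry.snapshotId = null;
        }

        // Copy-on-write: preserve the pre-update value for the active snapshot, but only once per key.
        if (activeSnapshotId != null && entry.snapshotTuple == null) {
            entry.snapshotTuple = entry.liveTuple;
            entry.snapshotId = activeSnapshotId;
        }

        // Regular update of the up-to-date version.
        entry.liveTuple = newTuple;
    }
}
```

Because the copy happens at most once per key per snapshot, the extra memory is bounded by one additional tuple version per key, which is the source of the up-to-2x figure below.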
+
+When the snapshot is no longer needed, an asynchronous process can clean up the snapshot versions from the partition.
+
+This approach does not induce any memory overhead when no snapshot is maintained, but may require up to 2x the
+partition size under heavy load (because the whole partition may be copied to the snapshot versions).