This is an automated email from the ASF dual-hosted git repository.

agoncharuk pushed a commit to branch ignite-14393
in repository https://gitbox.apache.org/repos/asf/ignite-3.git
commit 5ba6651a0da7b4637fc901efe384796bf4b5a75e
Author: Alexey Goncharuk <alexey.goncha...@gmail.com>
AuthorDate: Wed Mar 24 20:36:38 2021 +0300

    IGNITE-14393 Components interactions workflow draft
---
 modules/runner/README.md | 145 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/modules/runner/README.md b/modules/runner/README.md
new file mode 100644
index 0000000..a182dd9
--- /dev/null
+++ b/modules/runner/README.md
@@ -0,0 +1,145 @@

# Ignite cluster & node lifecycle
This document describes the user-level and component-level cluster lifecycles and their mutual interaction.

## Node lifecycle
A node maintains its local state in a local persistent key-value storage named the vault. The data stored in the vault
is semantically divided into the following categories:
 * User-level local configuration properties (such as memory limits, network timeouts, etc). User-level configuration
 properties can be written both at runtime (not all properties can be applied at runtime, however; some of them require
 a full node restart) and when a node is shut down (in order to be able to change properties that prevent node startup
 for some reason).
 * System-level private properties (such as computed local statistics, node-local common paths, etc). System-level
 private properties are computed locally, based on the information available at the node (not based on metastorage
 watched values).
 * System-level distributed metastorage projected properties (such as paths to partition files, etc). System-level
 projected properties are associated with one or more metastorage properties and are computed based on the local node
 state and the metastorage property values. System-level projected property values are semantically bound to a
 particular revision of the dependee properties and must be recalculated when the dependees change (see
 [reliable watch processing](#reliable-watch-processing)).

The vault is created during the first node startup and is optionally populated with the parameters from the
configuration file passed to the ``ignite node start`` [command](TODO link to CLI readme). Only user-level properties
can be written via the provided file.

System-level properties are written to the storage during the first vault initialization by the node start process.
Projected properties are not initialized during the initial node startup because at this point the local node is not
yet aware of the distributed metastorage. The node remains in a 'zombie' state until it learns that there is an
initialized metastorage, either via the ``ignite cluster init`` [command](TODO link to CLI readme) during the initial
cluster initialization, or from the group membership service via gossip (implying that the group membership protocol
is working at this point).

### Node components startup
For testability purposes, we require that component dependencies are defined upfront and provided at construction
time. This additionally requires that component dependencies form no cycles. Therefore, the components form a directed
acyclic graph that is constructed in topological sort order with respect to the root.

Components are also created and initialized in an order consistent with a topological sort of the component graph.
This enforces several rules related to component interaction:
 * Since metastorage watches can only be added during component startup, the watch notification order is consistent
 with the component initialization order. That is, if a component `B` depends on a component `A`, then `A` receives
 watch notifications prior to `B`.
 * A dependent component can directly call an API method on a dependee component (because it can obtain the dependee
 reference during construction). Direct inverse calls are prohibited (this is enforced by only acquiring component
 references during component construction). Nevertheless, an inverse call can be implemented by means of listeners or
 callbacks: the dependent component installs a listener on a dependee, which can later be invoked (see the sketch
 below).

<!--
Change /svg/... to /uml/... here to view the image UML.
-->
![Example components dependency graph](http://www.plantuml.com/plantuml/svg/TO_Fpi8W4CJlF0N7xpkKn3zUJ3HDZCTwqNZVjXkjKZ1qqTUND6rbCLw0tVbDXiax0aU-rSAMDwn87f1Urjt5SCkrjAv69pTo9aRc35wJwCz8dqzwWGGTMGSN5D4xOXSJUwoks4819W1Ei2dYbnD_WbBZY7y6Hgy4YysoxLYBENeX0h_4eMYw_lrdPaltF2kHLQq6N-W1ZsO7Ml_zinhA1oHfh9kEqA1JrkoVQ2XOSZIrR_KR)

The diagram above shows an example component dependency graph and provides an order in which the components may be
initialized.
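Below is a minimal sketch of these two rules: constructor-injected dependees and listener-based inverse calls. The
`MetastorageManager`, `BaselineManager`, and `RevisionListener` types here are hypothetical stand-ins, not the actual
Ignite 3 components.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener used for inverse (dependee -> dependent) notifications. */
interface RevisionListener {
    void onRevisionApplied(long revision);
}

/** Dependee component: exposes an API and accepts listeners for inverse calls. */
class MetastorageManager {
    private final List<RevisionListener> listeners = new ArrayList<>();

    /** Dependent components register callbacks instead of being called directly. */
    void addListener(RevisionListener listener) {
        listeners.add(listener);
    }

    void notifyRevisionApplied(long revision) {
        listeners.forEach(l -> l.onRevisionApplied(revision));
    }
}

/** Dependent component: obtains the dependee reference at construction time only. */
class BaselineManager {
    private final MetastorageManager metastorage;

    BaselineManager(MetastorageManager metastorage) {
        this.metastorage = metastorage;
        // Inverse call path: the dependee later invokes this listener; it never holds
        // a direct reference to the dependent component.
        metastorage.addListener(this::onMetastorageRevision);
    }

    private void onMetastorageRevision(long revision) {
        // React to the dependee's notification.
    }
}

class ComponentsStartup {
    public static void main(String[] args) {
        // Components are constructed in an order consistent with a topological sort
        // of the dependency graph: dependees first, dependents afterwards.
        MetastorageManager metastorage = new MetastorageManager();
        BaselineManager baseline = new BaselineManager(metastorage);

        metastorage.notifyRevisionApplied(1L);
    }
}
```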
## Cluster lifecycle
For a cluster to become operational, the metastorage instance must be initialized first. The initialization command
chooses a set of nodes (normally, 3-5 nodes) to host the distributed metastorage Raft group. When a node receives the
initialization command, it either creates a bootstrapped Raft instance with the given members (if this is a metastorage
group node), or writes the metastorage group member IDs to the vault as a private system-level property.

After the metastorage is initialized, components start to receive and process watch events, updating the local state
according to the changes received from the watch.

## Reliable watch processing
All cluster state is written and maintained in the metastorage. Nodes may update some state in the metastorage, which
may require a recomputation of some other metastorage properties (for example, when the cluster baseline changes,
Ignite needs to recalculate the table affinity assignments). In other words, some properties in the metastorage depend
on each other, and we may need to reliably update one property in response to an update of another.

To facilitate this pattern, Ignite uses the metastorage ability to replay metastorage changes starting from a certain
revision, called a [watch](TODO link to metastorage watch). To process watch updates reliably, we associate a special
persistent value called the ``applied revision`` (stored in the vault) with each watch. We rely on the following
assumptions about reliable watch processing (illustrated by the sketch below):
 * Watch execution is idempotent (if the same watch event is processed twice, the second watch invocation has no
 effect on the system). This is usually enforced by conditional multi-update operations on the metastorage and by
 deterministic projected property calculations. The conditional multi-update should check that the revision of the key
 being updated matches the revision observed with the watch event upper bound.
 * All properties read inside the watch must be read with the upper bound equal to the watch event revision.
 * If watch processing initiates a metastorage update, the ``applied revision`` is propagated only after the
 metastorage confirms that the proposed change is committed (note that in this case it does not matter whether this
 particular multi-update succeeds or not: we know that the first multi-update will succeed, and all further updates
 are idempotent, so we only need to make sure that at least one multi-update is committed).
 * If watch processing initiates projected key writes to the vault, the keys must be written atomically together with
 the updated ``applied revision`` value.
 * If a watch initiates metastorage property updates, it should only be deployed on the metastorage group members to
 avoid massive identical updates being issued to the metastorage (TODO: should really be only the leader of the
 metastorage Raft group).

In case of a crash, each watch is restarted from the revision stored in the corresponding ``applied revision``
variable of the watch, and the unprocessed events are replayed.
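The following is a minimal sketch of a watch handler that follows these rules, assuming hypothetical
`MetastorageClient`, `VaultManager`, and `WatchEvent` facades (the real metastorage API may differ):

```java
import java.util.concurrent.CompletableFuture;

/** Sketch of reliable watch processing; all nested interfaces are hypothetical placeholders. */
class ReliableWatchHandler {
    /** Hypothetical metastorage facade. */
    interface MetastorageClient {
        /** Reads a key as of the given revision (upper bound). */
        byte[] get(String key, long upperBoundRevision);

        /** Conditional multi-update: applies the put only if the key is still at expectedRevision. */
        CompletableFuture<Boolean> putIfRevision(String key, long expectedRevision, String targetKey, byte[] value);
    }

    /** Hypothetical vault facade. */
    interface VaultManager {
        void put(String key, long value);
    }

    /** Hypothetical watch event. */
    interface WatchEvent {
        long revision();
    }

    private final MetastorageClient metastorage;
    private final VaultManager vault;

    ReliableWatchHandler(MetastorageClient metastorage, VaultManager vault) {
        this.metastorage = metastorage;
        this.vault = vault;
    }

    /** Invoked for every watch event; safe to replay after a crash. */
    CompletableFuture<Void> onWatchEvent(WatchEvent event) {
        long revision = event.revision();

        // All reads inside the watch use the event revision as the upper bound.
        byte[] dependee = metastorage.get("internal.some.key", revision);
        byte[] projected = recompute(dependee);

        // The write is conditional on the dependee key revision, so replaying the same
        // event is a no-op; the applied revision is persisted to the vault only after
        // the metastorage confirms the proposed change is committed (whether or not
        // this particular multi-update won the race).
        return metastorage
            .putIfRevision("internal.some.key", revision, "internal.derived.key", projected)
            .thenRun(() -> vault.put("watch.applied.revision", revision));
    }

    private byte[] recompute(byte[] dependee) {
        // Deterministic projected-property calculation.
        return dependee;
    }
}
```

Replaying the same event after a crash re-executes this method, but the conditional write makes the replay have no
additional effect.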
### Example: `CREATE TABLE` flow
We require that each Ignite table is assigned a globally unique ID (the ID must not repeat even after the table is
dropped, so we use a growing `long` counter to assign table IDs).

When a table is created, Ignite first checks that a table with the given name does not exist, chooses the next
available table ID ``idNext`` and attempts to create the following pair of key-value entries in the metastorage via a
conditional multi-update:

```
internal.tables.names.<name>=<idNext>
internal.tables.<idNext>=<name>
```

If the multi-update succeeds, Ignite considers the table created. If the multi-update fails, then either a table with
the same name was concurrently created (the operation fails in this case) or ``idNext`` was assigned to another table
with a different name (Ignite retries the operation in this case).
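A sketch of this create-table multi-update, assuming a hypothetical `Metastorage` facade with a conditional
`putAllIfAbsent` operation (names are illustrative, not the actual API):

```java
import java.util.Map;

/** Sketch of the CREATE TABLE flow described above; the Metastorage facade is hypothetical. */
class CreateTableFlow {
    /** Hypothetical metastorage facade with a conditional multi-update. */
    interface Metastorage {
        boolean exists(String key);

        long nextTableId();

        /** Atomically writes all entries iff none of the keys already exists. */
        boolean putAllIfAbsent(Map<String, String> entries);
    }

    private final Metastorage metastorage;

    CreateTableFlow(Metastorage metastorage) {
        this.metastorage = metastorage;
    }

    /** Returns the ID assigned to the newly created table. */
    long createTable(String name) {
        while (true) {
            String nameKey = "internal.tables.names." + name;

            // The table must not exist yet.
            if (metastorage.exists(nameKey))
                throw new IllegalStateException("Table already exists: " + name);

            long idNext = metastorage.nextTableId();

            Map<String, String> entries = Map.of(
                nameKey, Long.toString(idNext),
                "internal.tables." + idNext, name);

            // Conditional multi-update: succeeds only if both keys are still absent.
            if (metastorage.putAllIfAbsent(entries))
                return idNext; // The table is considered created.

            // Otherwise: either the name was taken concurrently (fails on the next
            // iteration) or idNext was grabbed by another table (retry with a fresh ID).
        }
    }
}
```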
In order to process affinity calculations and assignments, the affinity manager creates a reliable watch for the
following keys on the metastorage group members:

```
internal.tables.<ID>
internal.baseline
```

Whenever the watch is fired, the affinity manager checks which key was updated. If the watch is triggered for the
``internal.tables.<ID>`` key, it calculates a new affinity for the table with the given ``ID``. If the watch is
triggered for the ``internal.baseline`` key, the manager recalculates affinity for all tables existing at the watch
revision (this can be done using the metastorage ``range(keys, upperBound)`` method, providing the watch event
revision as the upper bound). The calculated affinity is written to the ``internal.tables.<ID>.affinity`` key.

> Note that ideally the watch should only be processed on the metastorage group leader, thus eliminating unnecessary
> network trips. Theoretically, we could have embedded this logic into the state machine, but this would enormously
> complicate cluster updates and put metastorage consistency at risk.

To handle partition assignments, the partition manager creates a reliable watch for the affinity assignment key on all
nodes:

```
internal.tables.<ID>.affinity
```

Whenever the watch is fired, the node checks whether there are new partitions assigned to the local node, and if there
are, the node bootstraps the corresponding Raft partition servers (i.e. allocates paths to Raft logs and storage
files). The allocation information is written to projected vault keys:

```
local.tables.<ID>.<PARTITION_ID>.logpath=/path/to/raft/log
local.tables.<ID>.<PARTITION_ID>.storagepath=/path/to/storage/file
```

Once the projected keys are synced to the vault, the partition manager can create the partition Raft servers
(initialize the log and storage, write the hard state, register message handlers, etc). Upon startup, the node checks
the existing projected keys (finishing the Raft log and storage initialization in case the node crashed in the middle
of it) and starts the Raft servers.
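As a closing illustration, the sketch below shows how the partition manager's affinity watch could derive the
projected keys for locally assigned partitions and write them atomically together with the watch's
``applied revision``. The `VaultManager` facade, the assignment representation, and the path layout are assumptions
for illustration only, not the actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the partition manager's affinity watch; the types used are hypothetical. */
class PartitionAssignmentWatch {
    /** Hypothetical vault facade with an atomic multi-put. */
    interface VaultManager {
        void putAll(Map<String, String> entries);
    }

    private final VaultManager vault;
    private final String localNodeId;

    PartitionAssignmentWatch(VaultManager vault, String localNodeId) {
        this.vault = vault;
        this.localNodeId = localNodeId;
    }

    /**
     * Fired for internal.tables.<ID>.affinity updates; assignments.get(p) holds the
     * IDs of the nodes owning partition p at the event revision.
     */
    void onAffinityUpdated(long tableId, List<List<String>> assignments, long eventRevision) {
        Map<String, String> projectedKeys = new HashMap<>();

        for (int partId = 0; partId < assignments.size(); partId++) {
            if (!assignments.get(partId).contains(localNodeId))
                continue; // The partition is not assigned to this node.

            // Allocate paths for the partition Raft server (illustrative layout).
            String prefix = "local.tables." + tableId + "." + partId;
            projectedKeys.put(prefix + ".logpath", "/path/to/raft/log/" + tableId + "/" + partId);
            projectedKeys.put(prefix + ".storagepath", "/path/to/storage/" + tableId + "/" + partId);
        }

        // The projected keys are written atomically together with the applied revision,
        // so a crash either preserves the whole step or none of it.
        projectedKeys.put("affinity.watch.applied.revision", Long.toString(eventRevision));
        vault.putAll(projectedKeys);
    }
}
```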