This is an automated email from the ASF dual-hosted git repository. agoncharuk pushed a commit to branch ignite-14647 in repository https://gitbox.apache.org/repos/asf/ignite-3.git
The following commit(s) were added to refs/heads/ignite-14647 by this push: new 37a1fa7 IGNITE-14647 Describe no-logging native persistence 37a1fa7 is described below commit 37a1fa71e1b8136137415e3e182d235760f57a70 Author: Alexey Goncharuk <alexey.goncha...@gmail.com> AuthorDate: Mon May 3 23:46:46 2021 +0300 IGNITE-14647 Describe no-logging native persistence --- modules/vault/README.md | 96 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/modules/vault/README.md b/modules/vault/README.md new file mode 100644 index 0000000..93d961f --- /dev/null +++ b/modules/vault/README.md @@ -0,0 +1,96 @@ +# Apache Ignite Native Persistence (nextgen) +This document describes the architecture of Apache Ignite 3.x native persistence and highlights key differencies +compared to the Apache Ignite 2.x native perisitence architecture. + +## Replication log as recovery journal +Apache Ignite 3.x uses log-based replication protocol (currently, Raft) to ensure data consistency on different nodes +in the cluster. This very same log is used as a logical WAL for local state machine recovery. This change allows for +several optimizations in the page memory persistence mechanics. + +This document describes the architecture of the replication log as well as the optimized version of the native +persistence. + +## Replication log +The Raft log represents a key-value storage for which key is always a contiguously growing unbounded integer number. +In practice, it is sufficient to limit the key by a 64-bit unsigned integer. The key range consists of two segments: +committed entries (the larger part of the log) cannot be changed and are immutable, and not-yet-committed entries that +can be truncated and be overwritten. The non-committed entries are rarely overwritten in practice. Additionally, the +Raft log can be compacted: entries with key smaller than lower bound are thrown away as they were made durable in the +state machine and are no longer needed. + +The replication log storage is based on WiscKey approach of log-structured storages. Multiple Raft group logs can be +stored in a single storage, however, it is not neccessary to have a single storage for all Raft groups. The storage and +WAL of Raft log are combined and use immutable append-only segments containing log indices and state machine commands. +Segment header may contain a dictionary to reduce the size of Raft group IDs written in the segment. Each log entry +contains the log index, Raft group ID and command payload (the payload itself also contains the log entry term which +is essential to Raft protocol). Schematically, the segment structure can be represented as follows: + +<pre> ++----------------+--------------------------------+--------------------------------+-----+ +| | Entry 1 | Entry 2 | ... | ++----------------+--------------------------------+--------------------------------+-----+ +| Segment header | Log index | Group ID | Payload | Log index | Group ID | Payload | ... | ++----------------+--------------------------------+--------------------------------+-----+ +</pre> + +New entries are always written to the end of the segment, even if it is an overwrite entry with the log index which +was already written to the same Raft group. The Raft log maintains an index of log entries for each Raft log group +which is represented as a simple array of entry offsets from the beginning of the file. The array is complemented with +the base index of the first entry in the segment. The in-memory index is flushed on disk (checkpointed) only when all +entries in the segment are committed. Schematically, the index structure may be represented as follows: + +<pre> ++----------------+----------------------------------------------+----------------------------------------------+-----+ +| | Group 1 Index | Group 2 Index | ... | ++----------------+----------------------------------------------+----------------------------------------------+-----+ +| Index header | Len | Index Base | Offset 1 | Offset 2 | ... | Len | Index Base | Offset 1 | Offset 2 | ... | ... | ++----------------+----------------------------------------------+----------------------------------------------+-----+ +</pre> + +Upon recovery, the Raft log component reads all log entries from non-checkpointed segments and repopulates the in-memory +index which will be later checkpointed. + +## No-logging native persistence +Given that the Raft log is available as a separate component, we can now see how the page memory-based native +perisitence uses the external log to avoid any WAL logging for recovery. + +Similar to Ignite 2.x native persistence, all data structures are implemented over logical pages. This include data +pages, B+Tree, reuse and free lists, etc. Each Raft log command represents a single atomic change to the state machine. +When a command is applied, it changes some pages in the state machine data structures, marking them dirtly. Unlike +Ignite 2.x, no physical or logical records are written to WAL when these changes are applied. As a result, after +applying a set of changes, certain pages will be marked dirty and need to be checkpointed. To allow for effecient +checkpointing, Ignite 3.x splits storage files into main, containing only contiguous range of pages, and auxiliary +files, containing dirty pages from consequtive checkpoints. Pages in auxiliary files are not necessarily contiguous +and may have gaps. The files are ordered by the order in which they were written, forming a structure similar to an +LSM-tree: when a page is read from disk, it is first looked up in the most recent auxiliary file, then in the second +auxiliary file, and so on, until the main storage file is reached. To allow for effecient search in auxiliary files, +a small reverse index is stored with each file that maps page IDs to offsets in the file. The number of auxiliary +files should be kept small enough so that the full inverse index is kept in-memory; however, if the merge process +does not keep up with the load, the indexes can be always evicted from memory and read on-demand. + +Schematically, the structure of the main and auxiliary checkpoint files may be represented as follows: + +Main storage file: +<pre> ++--------+--------+--------+-----+ +| Page 1 | Page 2 | Page 3 | ... | ++--------+--------+--------+-----+ +</pre> + +Auxiliary storage file: +<pre> ++-----------------------------------------------------------+-------------------------+ +| Offsets Index | Pages Data | ++-----------+--------------------+--------------------+-----+---------+---------+-----+ +| Index len | Page P1 ID, Offset | Page P2 ID, Offset | ... | Page P1 | Page P2 | ... | ++-----------+--------------------+--------------------+-----+---------+---------+-----+ +</pre> + +### Recovery +If a process crashes when an auxiliary file is not completely written, the file is discarded upon recovery and Raft +log commands are replayed again, forming another set of dirty pages that should be checkpointed. + +As the number of auxiliary files grow, they can be merged back to the main storage file in the same order they were +originally written. When an auxiliary file is fully merged to the main storage file, it can be safely removed provided +that there are no threads intending to read from that file. If a process crashes in the middle of the merge, the file +can be safely merged again (the original pages will not be used anyway). \ No newline at end of file