[ 
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17077:
--------------------------------------

    Assignee: Ivan Bessonov

> Implement checkpointIndex for PDS
> ---------------------------------
>
>                 Key: IGNITE-17077
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17077
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
> prerequisites.
> h2. General idea
> The idea doesn't seem complicated. There will be "setUpdateIndex" and 
> "getUpdateIndex" methods (names might be different).
>  * First one is invoked at the end of every write command, with the RAFT commit 
> index passed as a parameter. This is done right before releasing the 
> checkpoint read lock (or whatever name we come up with). More on 
> that later.
>  * Second one is invoked at the beginning of every write command to validate 
> that updates don't come out of order or with gaps. This is how we 
> guarantee that IndexMismatchException is thrown at the right time.
> So, the write command flow will look like this. All names here are completely 
> random.
>  
> {code:java}
> try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
>     long updateIndex = partition.getUpdateIndex();
>     long raftIndex = writeCommand.raftIndex();
>     if (raftIndex != updateIndex + 1) {
>         throw new IndexMismatchException(updateIndex);
>     }
>     partition.write(writeCommand.row());
>     for (Index index : table.indexes(partition)) {
>         index.index(writeCommand.row());
>     }
>     partition.setUpdateIndex(raftIndex);
> }{code}
>  
> Some nuances:
>  * Mismatch exception must be thrown before any data modifications. Storage 
> content must remain intact, otherwise we'll just break it.
>  * The case above is the simplest one - there's a single "atomic" storage update. 
> Generally speaking, we can't or sometimes don't want to work this way. 
> Examples of operations where such strict atomicity is not required:
>  ** Batch insert/update from the transaction.
>  ** Transaction commit might have a huge number of row ids; we could exhaust 
> memory while committing.
>  * If we split a write operation into several operations, we must externally 
> guarantee their idempotence. "setUpdateIndex" should be called at the end of the 
> last "atomic" operation, so that the command can be safely reapplied.
> h2. Implementation
> "set" method could write a value directly into partitions meta page. This 
> *will* work. But it's not quite optimal.
> The optimal solution is tightly coupled with the way checkpoints should work. This 
> may not be the right place to describe the issue, but I'll do it nonetheless. 
> It'll probably get split into another issue one day.
> There's a simple way to touch every meta page only once per checkpoint: we 
> just do it while holding the checkpoint write lock. This way the data is consistent. 
> But this solution is equally {*}bad{*}: it forces us to perform page 
> manipulation under the write lock, and flushing freelists there is burden enough 
> already. (NOTE: we should test the performance without the onheap cache; it'll 
> speed up the checkpoint start process, thus reducing latency spikes.)
> A better way is to not have meta pages in page memory at all. Maybe during 
> start-up, but that's it. It's common practice to have a pageSize equal to 
> 16 KB, while the effective payload of a partition meta page in Ignite 2.x is 
> just above 100 bytes, and I expect it to be even lower in Ignite 3.0. Having 
> a loaded page for every partition is just a waste of resources; all required 
> data can be stored on-heap.
> Then, let's rely on two simple facts:
>  * If meta page data is cached on-heap, no one needs to read it from 
> disk. I should also mention that it will be mostly immutable.
>  * We can write the partition meta page into every delta file even if the meta 
> has not changed. In actuality, an unchanged meta will be a very rare situation.
> Considering both of these facts, the checkpointer may unconditionally write the 
> meta page from heap to disk at the beginning of writing the delta file. This 
> makes it effectively a write-only page, which is basically what we need.
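> A minimal sketch of such a write-only, on-heap meta (the class name and field 
> layout are invented for illustration):
>  
> {code:java}
> import java.nio.ByteBuffer;
> 
> // Sketch: partition meta lives on-heap and is serialized into a page
> // buffer unconditionally at the start of every delta file, so the meta
> // page never has to occupy page memory between checkpoints.
> class OnHeapPartitionMeta {
>     volatile long lastAppliedIndex; // Updated under the consistency lock.
> 
>     // Invoked by the checkpointer before any data page of the delta file.
>     ByteBuffer toPage(int pageSize) {
>         ByteBuffer page = ByteBuffer.allocate(pageSize);
>         page.putLong(lastAppliedIndex);
>         page.rewind();
>         return page;
>     }
> }{code}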
> h2. Callbacks and RAFT snapshots
> I argue against scheduled RAFT snapshots. They would produce a lot of junk 
> checkpoints, because a checkpoint is a {*}global operation{*}. Imagine 
> RAFT triggering snapshots for 100 partitions in a row: this would result in 
> 100 minuscule checkpoints that no one needs. So, I'd say, we need two operations:
>  * partition.getCheckpointedUpdateIndex();
>  * partition.registerCheckpointedUpdateIndexListener(closure);
> Both of these methods can be used by RAFT to determine whether it needs to 
> truncate its log and to choose a specific commit index for truncation.
> In the case of the PDS checkpointer, the implementation of both methods is 
> trivial.
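> How the two operations might fit together - again a sketch, with all names 
> (CheckpointedIndexTracker, onCheckpointFinished) invented:
>  
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.function.LongConsumer;
> 
> // Sketch: the storage tracks the last checkpointed update index; RAFT
> // either polls it or registers a listener, and truncates its log up to
> // that commit index once a checkpoint has been fully persisted.
> class CheckpointedIndexTracker {
>     private long checkpointedIndex;
>     private final List<LongConsumer> listeners = new ArrayList<>();
> 
>     synchronized long getCheckpointedUpdateIndex() {
>         return checkpointedIndex;
>     }
> 
>     synchronized void registerCheckpointedUpdateIndexListener(LongConsumer closure) {
>         listeners.add(closure);
>     }
> 
>     // Invoked by the checkpointer once a checkpoint is fully persisted.
>     synchronized void onCheckpointFinished(long index) {
>         checkpointedIndex = index;
>         listeners.forEach(l -> l.accept(index)); // RAFT may truncate up to `index`.
>     }
> }{code}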



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
