[ https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov reassigned IGNITE-17077:
--------------------------------------

    Assignee: Ivan Bessonov

> Implement checkpointIndex for PDS
> ---------------------------------
>
> Key: IGNITE-17077
> URL: https://issues.apache.org/jira/browse/IGNITE-17077
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for
> prerequisites.
> h2. General idea
> The idea doesn't seem complicated. There will be "setUpdateIndex" and
> "getUpdateIndex" methods (the names might be different).
> * The first one is invoked at the end of every write command, with the RAFT
> commit index passed as a parameter. This is done right before releasing the
> checkpoint read lock (or whatever name we come up with). More on
> that later.
> * The second one is invoked at the beginning of every write command to
> validate that updates don't come out of order or with gaps. This is the way
> to guarantee that IndexMismatchException is thrown at the right time.
> So, the write command flow will look like this. All names here are
> arbitrary.
>
> {code:java}
> try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
>     long updateIndex = partition.getUpdateIndex();
>     long raftIndex = writeCommand.raftIndex();
>     // Validate the index before touching any data.
>     if (raftIndex != updateIndex + 1) {
>         throw new IndexMismatchException(updateIndex);
>     }
>     partition.write(writeCommand.row());
>     for (Index index : table.indexes(partition)) {
>         index.index(writeCommand.row());
>     }
>     // Record the applied index last, right before the lock is released.
>     partition.setUpdateIndex(raftIndex);
> }{code}
>
> Some nuances:
> * The mismatch exception must be thrown before any data modifications.
> Storage content must stay intact, otherwise we'll just break it.
> * The case above is the simplest one: there's a single "atomic" storage
> update. Generally speaking, we can't or sometimes don't want to work this way.
> Examples of operations where atomicity this strict is not required:
> ** Batch insert/update from a transaction.
> ** Transaction commit, which might involve a huge number of row ids; we could
> exhaust the memory while committing.
> * If we split a write operation into several operations, we should externally
> guarantee their idempotence. "setUpdateIndex" should be at the end of the
> last "atomic" operation, so that the last command can be safely reapplied.
> h2. Implementation
> The "set" method could write a value directly into the partition's meta page.
> This *will* work. But it's not quite optimal.
> The optimal solution is tightly coupled with the way checkpoint should work.
> This may not be the right place to describe the issue, but I do it nonetheless.
> It'll probably get split into another issue one day.
> There's a simple way to touch every meta page only once per checkpoint. We
> just do it while holding the checkpoint write lock. This way data is consistent.
> But this solution is equally {*}bad{*}: it forces us to perform page
> manipulations under the write lock. Flushing freelists is enough already. (NOTE:
> we should test the performance without onheap-cache, it'll speed up the
> checkpoint start process, thus reducing latency spikes)
> A better way to do this is not having meta pages in page memory whatsoever.
> Maybe during the start, but that's it. It's a common practice to have a
> pageSize equal to 16 KB. The effective payload of a partition meta page in
> Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite
> 3.0. Having a loaded page for every partition is just a waste of resources;
> all required data can be stored on-heap.
> Then, let's rely on two simple facts:
> * If meta page data is cached on-heap, no one would need to read it from
> disk. I should also mention that it will mostly be immutable.
> * We can write the partition meta page into every delta file even if meta has
> not changed. In actuality, this will be a very rare situation.
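A minimal sketch of the on-heap meta idea described above, assuming the meta consists of just the update index. The class name, the snapshot type, and the page layout (one long at offset 0) are hypothetical and are not part of any Ignite API. The holder keeps an immutable snapshot on-heap, and `toPage` shows how that snapshot could be turned into a page-sized buffer without ever loading a meta page into page memory.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical on-heap holder for partition meta; not an actual Ignite class. */
final class PartitionMetaHolder {
    /** Immutable snapshot; a real meta would carry more fields than the update index. */
    static final class Snapshot {
        final long updateIndex;

        Snapshot(long updateIndex) {
            this.updateIndex = updateIndex;
        }
    }

    /** Latest meta state, replaced as a whole on every update. */
    private final AtomicReference<Snapshot> current = new AtomicReference<>(new Snapshot(0L));

    long getUpdateIndex() {
        return current.get().updateIndex;
    }

    void setUpdateIndex(long updateIndex) {
        current.set(new Snapshot(updateIndex));
    }

    /**
     * Serializes the current snapshot into a fresh page-sized buffer.
     * The layout (a single long at offset 0) is an assumption made for this sketch.
     */
    ByteBuffer toPage(int pageSize) {
        Snapshot snapshot = current.get();
        ByteBuffer page = ByteBuffer.allocate(pageSize);
        page.putLong(0, snapshot.updateIndex);
        return page;
    }
}
```

Because each snapshot is immutable and swapped atomically, a reader that calls `toPage` sees one consistent meta state even while write commands keep calling `setUpdateIndex`.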
> Considering both of these facts, the checkpointer may unconditionally write
> the meta page from heap to disk at the beginning of writing the delta file.
> This page will become a write-only page, which is basically what we need.
> h2. Callbacks and RAFT snapshots
> I argue against scheduled RAFT snapshots. They will produce a lot of junk
> checkpoints. This is because checkpoint is a {*}global operation{*}. Imagine
> RAFT triggering snapshots for 100 partitions in a row. This will result in
> 100 minuscule checkpoints; no one needs that. So, I'd say, we need two
> operations:
> * partition.getCheckpointerUpdateIndex();
> * partition.registerCheckpointedUpdateIndexListener(closure);
> Both of these methods could be used by RAFT to determine whether it needs to
> truncate its log and to define a specific commit index for truncation.
> In the case of the PDS checkpointer, the implementation of both of these
> methods is trivial.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)