sreejasahithi commented on code in PR #281:
URL: https://github.com/apache/ozone-site/pull/281#discussion_r2720956404
##
docs/07-system-internals/07-features/08-om-bootstrapping-with-snapshots.md:
##
@@ -0,0 +1,173 @@
+---
+sidebar_label: OM Bootstrapping with Snapshots
+---
+
+# OM Bootstrapping with Snapshots
+
+## Problem Statement
+
+The current bootstrapping mechanism for OM is inconsistent when dealing with snapshotted OM RocksDBs.
+Bootstrapping occurs without any locking mechanism, and active transactions may still modify snapshot
+RocksDBs during the process. This can leave a corrupted RocksDB instance on the follower OM
+post-bootstrapping. To resolve this, the bootstrapping process must operate on a consistent system state.
+
+Jira Ticket: [HDDS-12090](https://issues.apache.org/jira/browse/HDDS-12090)
+
+## Background on Snapshots
+
+### Snapshot Operations
+
+When a snapshot is taken on an Ozone bucket, the following steps occur:
+
+1. A RocksDB checkpoint of the active `om.db` is created.
+2. Deleted entries are removed from the `deletedKeyTable` and `deletedDirTable` in the Active Object Store
+(AOS) RocksDB. This prevents the blocks from being purged without first checking for the key's presence in
+the correct snapshot in the snapshot chain.
+3. A new entry is added to the `snapshotInfoTable` in the AOS RocksDB.
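
The three steps above can be pictured with a simplified in-memory model. This is only a sketch: the class name, table shapes, and the `takeSnapshot` method are hypothetical, and the real implementation operates on RocksDB column families and checkpoints rather than Java maps.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SnapshotTakeSketch {
    // Simplified stand-ins for RocksDB column families (hypothetical shapes).
    final Map<String, String> deletedKeyTable = new HashMap<>();
    final Map<String, String> snapshotInfoTable = new LinkedHashMap<>();

    /** Sketch of snapshot creation: capture the state, move deleted entries
     *  out of the AOS scope so the purge service cannot drop blocks the
     *  snapshot still references, then record the snapshot. */
    Map<String, String> takeSnapshot(String snapshotId) {
        // 1. The real implementation creates a RocksDB checkpoint here; we
        //    model it as a copy of the deleted-key state at this instant.
        Map<String, String> checkpointView = new HashMap<>(deletedKeyTable);
        // 2. Clear deleted entries from the AOS table: they now belong to the
        //    snapshot's scope and must not be purged from the AOS side.
        deletedKeyTable.clear();
        // 3. Record the new snapshot in snapshotInfoTable.
        snapshotInfoTable.put(snapshotId, "SNAPSHOT_ACTIVE");
        return checkpointView;
    }
}
```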
+
+### Current Bootstrap Model
+
+The current model involves the follower OM initiating an HTTP request to the leader OM, which provides a
+consistent view of its state. Before bucket snapshots were introduced, this process relied solely on an AOS
+RocksDB checkpoint. However, with snapshots, multiple RocksDB instances (the AOS RocksDB plus snapshot
+RocksDBs) must be handled, complicating the process.
+
+#### Workflow
+
+- **Follower Initiation:**
+  - Sends an exclude list of files already copied in previous batches.
+- **Leader Actions:**
+  - Creates an AOS RocksDB checkpoint.
+  - Performs a directory walk through:
+    - The AOS RocksDB checkpoint directory.
+    - Snapshot RocksDB directories.
+    - The backup SST file directory (compaction backup directory).
+  - Identifies unique files to be copied in the next batch.
+  - Transfers files in batches, recreating hardlinks on the follower side as needed.
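
One way to picture the leader's directory walk with exclude-list handling is the stdlib-only sketch below (the class and method names are illustrative, not the actual Ozone code). It deduplicates by `BasicFileAttributes.fileKey()`, which corresponds to the inode on POSIX file systems, so hardlinked copies of the same SST file are planned for transfer only once.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BootstrapBatchPlanner {
    /** Returns the regular files under the given directories whose inode is
     *  neither in the follower's exclude list nor already seen in this walk. */
    static List<Path> planNextBatch(List<Path> dirs, Set<Object> excludedInodes)
            throws IOException {
        Map<Object, Path> unique = new LinkedHashMap<>();
        for (Path dir : dirs) {
            try (Stream<Path> walk = Files.walk(dir)) {
                List<Path> files = walk.filter(Files::isRegularFile)
                                       .collect(Collectors.toList());
                for (Path p : files) {
                    // fileKey() is the inode on POSIX; hardlinks share it.
                    Object inode = Files.readAttributes(p, BasicFileAttributes.class)
                                        .fileKey();
                    if (!excludedInodes.contains(inode)) {
                        unique.putIfAbsent(inode, p); // keep one path per inode
                    }
                }
            }
        }
        return new ArrayList<>(unique.values());
    }
}
```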
+
+#### Issues with the Current Model
+
+1. Active transactions during bootstrapping may modify snapshot RocksDBs, leading to inconsistencies.
+2. Partial data copies can occur when double-buffer flushes or other snapshot-related operations are in
+progress.
+3. Large snapshot data sizes (often in GBs) require multi-batch transfers, increasing the risk of data
+corruption.
+
+## Proposed Fixes
+
+### Locking the Snapshot Cache
+
+The snapshot cache is the class responsible for maintaining all RocksDB handles corresponding to snapshots.
+The snapshot cache closes these RocksDB handles from time to time when no thread in the system holds a
+reference to the RocksDB. Hence, any operation on a snapshot goes through the snapshot cache, incrementing
+the reference count of that snapshot. Implementing a lock on the snapshot cache would prevent any newer
+threads from requesting a snapshot RocksDB handle from the cache. Thus, any operation under this lock has a
+consistent view of the entire snapshot. The only downside is that this would block the double-buffer thread,
+so any operation performed under this lock has to be lightweight enough that it doesn't end up running for a
+long period of time. (P.S. With Sumit's implementation of the optimized Gatekeeping model and the removal of
+the double buffer from OM, this would only block snapshot operations, which should be fine since these
+operations are fired only by background threads.)
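
As a rough illustration of the cache-lock idea (the class and method names below are hypothetical, not the actual `SnapshotCache` API), a read-write lock can gate handle acquisition: normal operations take the read lock briefly to bump a snapshot's reference count, while bootstrap takes the write lock, blocking any new handle requests until it releases.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SnapshotCacheSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, AtomicInteger> refCounts = new ConcurrentHashMap<>();

    /** Hands out a snapshot handle; blocks while bootstrap holds the write lock. */
    public AutoCloseable acquire(String snapshotId) {
        lock.readLock().lock();
        try {
            AtomicInteger rc =
                refCounts.computeIfAbsent(snapshotId, k -> new AtomicInteger());
            rc.incrementAndGet();
            return rc::decrementAndGet; // closing the handle drops the reference
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Bootstrap takes the write lock so no new handle requests can start. */
    public AutoCloseable lockForBootstrap() {
        lock.writeLock().lock();
        return () -> lock.writeLock().unlock();
    }

    public int refCount(String snapshotId) {
        AtomicInteger rc = refCounts.get(snapshotId);
        return rc == null ? 0 : rc.get();
    }
}
```

Note that the write lock only stops new acquisitions; the reference counts are what let bootstrap additionally wait for already-issued handles to be released, matching the "wait for the cache to empty" step described below.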
+
+With the above lock in place, there is a way to get a consistent snapshot of the entire OM. Now let's dive
+into the various approaches to the overall bootstrap flow.
+
+### Approach 1 (Batching files over multiple tarballs)
+
+This approach builds on the current model by introducing size thresholds to manage locks and data transfers
+more efficiently.
+
+#### Workflow
+
+1. **Follower Initiation:**
+   - Sends an exclude list of previously copied files (identified by `inodeId`).
+2. **Leader Directory Walk:**
+   - Walks through AOS RocksDB, snapshot RocksDBs, and backup SST directories to identify files to transfer.
+   - Compares against the exclude list to avoid duplicate transfers.
+3. If the total size of files to be copied is more than `ozone.om.ratis.snapshot.lock.max.total.size.threshold`,
+then the files are sent directly over the stream as a tarball, where each file is named by its inodeId.
+4. If the total size of files to be copied is less than or equal to
+`ozone.om.ratis.snapshot.lock.max.total.size.threshold`, then the snapshot cache lock is taken after waiting
+for the snapshot cache to completely empty (No