Hi,

My understanding of the RocksDB state backend is as follows:

When using the RocksDB state backend, checkpoints are first taken locally on the TaskManager using RocksDB's backup feature, which takes snapshots that are consistent read-only views of the RocksDB database. Each checkpoint is written on the task manager node and then asynchronously copied to the remote HDFS location. When a checkpoint is committed, the corresponding records are deleted from RocksDB, allowing the RocksDB data folders to remain small, which in turn keeps each snapshot relatively small. If the task node goes away due to a failure, I assume the RocksDB database is restored from the checkpoints in remote HDFS. Since each checkpoint's state is relatively small, the time to restore the RocksDB database from HDFS on the new task node should also be relatively small.

My questions: if using really long windows (hours), and the state of a window grows very large over time, would the RocksDB database grow correspondingly large? Would replication to HDFS start causing performance bottlenecks? Also, would this require a constant cycle (at every checkpoint interval?) of reading from RocksDB, adding more window elements, and writing back to RocksDB? Outside of the read costs, is there a risk in having very long windows when you know they could collect a lot of elements?

Instead, is it safer to perform aggregations on top of aggregations, or to use your own custom remote store such as HBase to persist the larger per-record state, using windows only to store the keys into HBase? I mention HBase because its support for column qualifiers allows elements to be added to the same row key under multiple ordered qualifiers, and reads can be throttled in batches of qualifiers for better memory consumption. Is this approach used in practice? (Rough sketches of what I mean are below, after my signature.)

Thanks,
Sameer
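
P.S. For concreteness, here is a minimal sketch of how I understand the setup would be configured, with the RocksDB backend checkpointing to an HDFS path. The checkpoint path, interval, and job are placeholders, not values from a real job:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s; snapshots are taken locally from RocksDB
        // and copied asynchronously to the HDFS path below.
        env.enableCheckpointing(60_000);
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")); // hypothetical path

        // Trivial placeholder topology so the sketch runs end to end.
        env.fromElements(1, 2, 3).print();
        env.execute("rocksdb-checkpoint-sketch");
    }
}
```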
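And a rough sketch of the "aggregations on top of aggregations" alternative: using Flink's AggregateFunction so the window keeps only a small running accumulator per key in RocksDB instead of buffering every element. The Event type, its fields, and the window size are made up for illustration:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class IncrementalWindowSketch {

    // Hypothetical event type: a key plus a numeric value to aggregate.
    public static class Event {
        public String key;
        public long value;
    }

    // Incremental (count, sum) accumulator: the window state per key is
    // just this small array, not the full list of buffered elements.
    public static class AvgAggregate implements AggregateFunction<Event, long[], Double> {
        @Override
        public long[] createAccumulator() { return new long[] {0L, 0L}; } // {count, sum}

        @Override
        public long[] add(Event e, long[] acc) {
            acc[0] += 1;
            acc[1] += e.value;
            return acc;
        }

        @Override
        public Double getResult(long[] acc) { return (double) acc[1] / acc[0]; }

        @Override
        public long[] merge(long[] a, long[] b) {
            a[0] += b[0];
            a[1] += b[1];
            return a;
        }
    }

    // Wiring: even with a long (6 hour) window, state stays small because
    // the aggregate is applied eagerly as each element arrives.
    public static DataStream<Double> sixHourAverages(DataStream<Event> events) {
        return events
                .keyBy(e -> e.key)
                .window(TumblingEventTimeWindows.of(Time.hours(6)))
                .aggregate(new AvgAggregate());
    }
}
```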
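Finally, a sketch of the HBase alternative I was describing: elements are appended to a single row under ordered column qualifiers, and read back in bounded batches using the intra-row pagination knobs on Get. The table name, column family, row key scheme, and batch size are all hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWindowStoreSketch {

    private static final byte[] FAMILY = Bytes.toBytes("e"); // hypothetical column family

    // Append one window element under an ordered qualifier (e.g. a sequence
    // number or event timestamp), so all elements for a window key live in
    // one row, sorted by qualifier.
    public static void appendElement(Table table, String windowKey,
                                     long sequence, byte[] element) throws IOException {
        Put put = new Put(Bytes.toBytes(windowKey));
        put.addColumn(FAMILY, Bytes.toBytes(sequence), element);
        table.put(put);
    }

    // Read the row back in bounded batches of qualifiers, so even a very
    // large window never has to be materialized in memory at once.
    public static void readInBatches(Table table, String windowKey,
                                     int batchSize) throws IOException {
        int offset = 0;
        while (true) {
            Get get = new Get(Bytes.toBytes(windowKey));
            get.setMaxResultsPerColumnFamily(batchSize); // at most batchSize qualifiers
            get.setRowOffsetPerColumnFamily(offset);     // skip qualifiers already read
            Result result = table.get(get);
            if (result.isEmpty()) {
                break;
            }
            // ... process result.rawCells() for this batch ...
            offset += batchSize;
        }
    }

    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("window_elements"))) { // hypothetical table
            appendElement(table, "user-42#window-001", 1L, Bytes.toBytes("event-payload"));
            readInBatches(table, "user-42#window-001", 500);
        }
    }
}
```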
When using a RocksDB state backend, it the checkpoints are backed up locally (to the TaskManager) using the backup feature of RocksDB by taking snapshots from RocksDB which are consistent read-only views on the RockDB database. Each checkpoint is backed up on the task manager node and this checkpoint is asynchronously backed up to the remote HDFS location. When each checkpoint is committed, the records are deleted from RocksDB, allowing RocksDb data folders to remain small. This in turn allows each snapshot to be relatively small. If the Task node goes away due to failure, I assume the RocksDB database is restored from the checkpoints from the remote HDFS. Since each checkpoint state is relatively small, the restoration time from HDFS for the RocksDB database on the new task node is relatively small. The question is, if using really long windows (in hours) if the state of the window gets very large over time, would size of the RocksDB get larger? Would replication to HDFS start causing performance bottlenecks? Also would this need a constant (at checkpoint interval?), read from RocksDB, add more window elements and write to RocksDB. Outside of the read costs, is there a risk to having very long windows when you know you could collect a lot of elements in them. Instead is it safer to perform aggregations on top of aggregations or use your own custom remote store like HBase to persist larger state per record and use windows only to store the keys in HBase. I mention HBase because of its support for column qualifiers allow elements to be added to the same key in multiple ordered column qualifiers. Reading can also be throttled in batches of column qualifiers allowing for the better memory consumption. Is this approach used in practice? Thanks, Sameer