Concerning the size of the RocksDB snapshots: I am wondering whether RocksDB simply does not compact for a long time and therefore keeps a lot of stale data in the snapshot. That would especially be the case if you have a lot of changing values for the same set of keys.
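If that turns out to be the cause, one thing to try is making RocksDB compact more eagerly. Roughly along these lines - the OptionsFactory hook and the exact RocksDB option names depend on the Flink/RocksDB versions you are on, and the checkpoint path is made up, so take this as a sketch rather than the precise API:

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.CompactionStyle;
import org.rocksdb.DBOptions;

public class RocksDBCompactionTuning {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Example checkpoint URI only.
        RocksDBStateBackend backend =
            new RocksDBStateBackend("hdfs:///flink/checkpoints");

        // Start from the spinning-disk profile and tighten levelled
        // compaction so that stale versions of keys get merged away sooner.
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED);
        backend.setOptions(new OptionsFactory() {
            @Override
            public DBOptions createDBOptions(DBOptions currentOptions) {
                return currentOptions;
            }

            @Override
            public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
                return currentOptions
                    .setCompactionStyle(CompactionStyle.LEVEL)
                    .setMaxBytesForLevelBase(64 * 1024 * 1024)
                    .setTargetFileSizeBase(16 * 1024 * 1024);
            }
        });

        env.setStateBackend(backend);
        // ... build and execute the job as usual ...
    }
}

Whether smaller level/file size targets actually help depends on the workload; the point is just that the column family options are the place where compaction behaviour can be influenced.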
On Tue, Apr 12, 2016 at 6:41 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> Hi,
> I'm going to try and respond to each point:
>
> 1. This seems strange; could you give some background on parallelism,
> number of operators with state, and so on? Also, I'm assuming you are
> using the partitioned state abstraction, i.e. getState(), correct?
>
> 2. Your observations are pretty much correct. The reason why RocksDB is
> slower is that the FsStateBackend basically stores the state in a Java
> HashMap and writes the contents to HDFS when checkpointing. RocksDB
> stores data in on-disk files and goes to them for every state access
> (of course there are caches, but generally it is like this). I'm
> actually impressed that it is still this fast in comparison.
>
> 3. See 1. (I think, for now.)
>
> 4. The checkpointing time is the time from the JobManager deciding to
> start a checkpoint until all tasks have confirmed that checkpoint. I
> have seen this before, and I think it results from back pressure. The
> problem is that the checkpoint messages that we send through the
> topology are sitting at the sources because they are also back
> pressured by the slow processing of normal records. You should be able
> to see the actual checkpointing times (both synchronous and
> asynchronous) in the log files of the task managers; they should be
> much lower.
>
> I can go into details; I'm just writing this quickly before calling it
> a day. :-)
>
> Cheers,
> Aljoscha
>
> On Tue, 12 Apr 2016 at 18:21 Konstantin Knauf <
> konstantin.kn...@tngtech.com> wrote:
>
>> Hi everyone,
>>
>> my experience with the RocksDBStateBackend has left me a little bit
>> confused. Maybe you guys can confirm that my experience is the
>> expected behaviour ;):
>>
>> I have run a "performance test" twice, once with the FsStateBackend
>> and once with the RocksDBStateBackend for comparison. In this
>> particular test the state saved is generally not large (in a
>> production scenario it will be larger).
>>
>> These are my observations:
>>
>> 1. Minimal checkpoint size (no records) with RocksDB was 33 MB
>> compared to <<1 MB with the FsStateBackend.
>>
>> 2. Throughput dropped from 28k/s to 18k/s on a small cluster.
>>
>> 3. Checkpoint sizes as reported in the Dashboard were ca. 1 MB for the
>> FsStateBackend but >100 MB for the RocksDBStateBackend. I hope the
>> difference gets smaller for very large state. Can you confirm?
>>
>> 4. Checkpointing times as reported in the Dashboard were 26 seconds
>> for RocksDB during the test and <1 second for the FsStateBackend. Does
>> the reported time correspond to the synchronous + asynchronous parts
>> of the checkpointing in the case of RocksDB? Is there any way to tell
>> how long the synchronous part takes?
>>
>> From these first observations RocksDB does seem to bring a large
>> overhead for state < 1 GB, I guess? Is this expected?
>>
>> Cheers,
>>
>> Konstantin
>>
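PS: Regarding Aljoscha's point 1 - by "partitioned state abstraction" he means keyed state obtained via getState() on the RuntimeContext. A minimal sketch of what that looks like (names are made up, and the ValueStateDescriptor constructor differs a bit between Flink versions):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the counter lives in whatever state backend is
// configured (heap for FsStateBackend, on-disk files for RocksDB).
public class CountPerKey extends RichFlatMapFunction<String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // Older Flink versions also take a default value as a third argument.
        ValueStateDescriptor<Long> descriptor =
            new ValueStateDescriptor<>("count", Long.class);
        count = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = count.value();   // null until the first update for this key
        long updated = (current == null ? 0L : current) + 1L;
        count.update(updated);          // written to the state backend, checkpointed with it
        out.collect(updated);
    }
}

This is used after a keyBy(), e.g. stream.keyBy(s -> s).flatMap(new CountPerKey()); only then is the state scoped per key and handled by the configured backend.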