Re: Implications on incremental checkpoint duration based on RocksDBStateBackend, Number of Slots per TM

2020-06-17 Thread Yun Tang
Hi Sameer

If you only have one disk for one TM, 10 TMs could deploy at most 10 disks 
while 100TM could deploy at most 100 disks.
The sync checkpoint phase of RocksDB need to write disk and if you could 
distribute the write pressure over more disks, you could get better performance 
which is what you observed.

The synchronous checkpoint phase actually means the task can only execute 
checkpoint and cannot process elements at that time.
On the other hand, the asynchronous phase means the task upload checkpoint data 
asyncly and could still process elements at that time.
Moreover, flushing RocksDB in sync phase is executed in task main thread and 
one TM could have many task main threads.

Since the synchronous checkpoint phase is only triggered after barrier 
alignment finished, we cannot ensure all RocksDB instances would execute 
flushing at the same time.


Best
Yun Tang

From: Sameer W 
Sent: Thursday, June 18, 2020 3:34
To: user 
Subject: Implications on incremental checkpoint duration based on 
RocksDBStateBackend, Number of Slots per TM

Hi,

The number of RocksDB databases the Flink creates is equal to the number of 
operator states multiplied by the number of slots.

Assuming a parallelism of 100 for a job which is executed on 100 TM's with 1 
slot per TM as opposed to 10 TM's with 10 slots per TM I have noticed that the 
former configuration is more efficient for incremental checkpointing. In both 
cases the number of RocksDB databases is the same, except in the latter case 10 
times as many are created in one TM vs the former case.

Reading the 
link
 below, it says - "link uses this to figure out the state changes. To do this, 
Flink triggers a flush in RocksDB, forcing all memtables into sstables on disk, 
and hard-linked in a local temporary directory. This process is synchronous to 
the processing pipeline, and Flink performs all further steps asynchronously 
and does not block processing."

What does "Synchronous to the processing pipeline" mean? Does it mean that 
flushing to DB happens synchronously (serially) for all RocksDB databases in 
one TM? Is the flushing single threaded per TM

Thanks,
Sameer



Implications on incremental checkpoint duration based on RocksDBStateBackend, Number of Slots per TM

2020-06-17 Thread Sameer W
Hi,

The number of RocksDB databases the Flink creates is equal to the number of
operator states multiplied by the number of slots.

Assuming a parallelism of 100 for a job which is executed on 100 TM's with
1 slot per TM as opposed to 10 TM's with 10 slots per TM I have noticed
that the former configuration is more efficient for incremental
checkpointing. In both cases the number of RocksDB databases is the same,
except in the latter case 10 times as many are created in one TM vs the
former case.

Reading the link

below, it says - "link uses this to figure out the state changes. To do
this, Flink triggers a flush in RocksDB, forcing all memtables into
sstables on disk, and hard-linked in a local temporary directory. *This
process is synchronous to the processing pipeline*, and Flink performs all
further steps asynchronously and does not block processing."

What does "Synchronous to the processing pipeline" mean? Does it mean that
flushing to DB happens synchronously (serially) for all RocksDB databases
in one TM? Is the flushing single threaded per TM

Thanks,
Sameer