Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-05-04 Thread Tony Wei
> Part 1. Take a snapshot of RocksDB. (This maps to "Checkpoint Duration (sync)" on the checkpoint detail page.)
> Part 2. Loop over the records of the snapshot, along with some `if` check to

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-09 Thread Tony Wei
> ta is sent to S3 in the order of the key groups. (This maps to "Checkpoint Duration (async)".)
> So part 2 can be both CPU-costly and network-costly. If the CPU load is too high, then sending data will slow down, because they are in a si

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-09 Thread Stefan Richter
> slow down, because they are in a single loop. If CPU is the reason, this phenomenon will disappear if you use incremental checkpoints, because they do almost nothing but send data to S3. All in all, I think trying out incremental checkpoints is the best thing to do for now.
>
> Best Regards,
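The "single loop" argument above can be put into a rough cost model. This is only a sketch in Python under stated assumptions: `single_loop_duration_us` is a hypothetical helper, the per-record costs are made-up numbers, and the model simply assumes that the per-record CPU work and the send happen one after the other in the same thread, so their costs add up.

```python
def single_loop_duration_us(num_records, cpu_us_per_record, send_us_per_record):
    """Total time in microseconds for one thread that, per record,
    first does the CPU work and then sends the record.

    Because both steps share one loop, a slower CPU step directly
    delays every send that follows it.
    """
    return num_records * (cpu_us_per_record + send_us_per_record)

# Hypothetical numbers: one million records, 30 us to send each record.
normal_load = single_loop_duration_us(1_000_000, cpu_us_per_record=20, send_us_per_record=30)
high_cpu_load = single_loop_duration_us(1_000_000, cpu_us_per_record=80, send_us_per_record=30)

print(f"normal load:   {normal_load / 1_000_000:.0f} s")
print(f"high CPU load: {high_cpu_load / 1_000_000:.0f} s")
```

In this model the network cost per record is unchanged, yet the upload as a whole takes longer once the CPU step gets slower.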

Re: Fwd: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
> On 03/6/2018 14:45, Tony Wei <tony19920...@gmail.com> wrote:
> Sent to the wrong mailing list. Forward it to the correct one.
> ---------- Forwarded message ----------
> From: Tony Wei <tony19920...@gmail.com>
> Date: 2018-03-06 14:43

Fwd: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Sent to the wrong mailing list. Forward it to the correct one.
---------- Forwarded message ----------
From: Tony Wei <tony19920...@gmail.com>
Date: 2018-03-06 14:43 GMT+08:00
Subject: Re: checkpoint stuck with rocksdb statebackend and s3 filesystem
To: 周思华 <summerle...@163.com>, St

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread 周思华
Hi Tony, About your question: an average end-to-end checkpoint latency of less than 1.5 mins doesn't mean that a checkpoint won't time out. Indeed, that is determined by the max end-to-end latency (the slowest one); a checkpoint is truly completed only after all tasks' checkpoints have completed.
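A minimal sketch in Python of why the average can look healthy while the checkpoint still times out. The per-subtask latencies are made-up numbers and `checkpoint_outcome` is a hypothetical helper, not part of Flink; only the 10-minute timeout is taken from the configuration posted in this thread.

```python
from statistics import mean

def checkpoint_outcome(subtask_latencies_s, timeout_s):
    """Return (average, slowest, completed) for one checkpoint.

    A checkpoint completes only when every subtask has finished,
    so the slowest subtask decides whether the timeout is hit.
    """
    avg = mean(subtask_latencies_s)
    slowest = max(subtask_latencies_s)
    return avg, slowest, slowest <= timeout_s

# Hypothetical latencies in seconds: twenty fast subtasks and one stuck one.
latencies = [60] * 20 + [660]
avg, slowest, completed = checkpoint_outcome(latencies, timeout_s=600)
print(f"average={avg:.1f}s slowest={slowest}s completed={completed}")
```

With these numbers the average stays below 1.5 mins, but the single slow subtask exceeds the timeout, so the checkpoint as a whole does not complete.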

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Hi Sihua, Thanks for your suggestion. "Incremental checkpoint" is what I will try out next, and I know it will give better performance. However, it might not solve this issue completely because, as I said, the average end-to-end latency of checkpointing is currently less than 1.5 mins, and it is

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Hi Stefan, I see. That explains why the load on the machines went up. However, I think it is not the root cause of these consecutive checkpoint timeouts. As I said in my first mail, the checkpointing process usually took 1.5 mins to upload states, and this operator and kafka consumer are

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Stefan Richter
Hi, thanks for all the info. I had a look into the problem and opened https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack trace, you can see many checkpointing threads are running on your TM for checkpoints that have

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Hi Stefan, Here is my checkpointing configuration.
Checkpointing Mode: Exactly Once
Interval: 20m 0s
Timeout: 10m 0s
Minimum Pause Between Checkpoints: 0ms
Maximum Concurrent Checkpoints: 1
Persist Checkpoints Externally: Enabled (delete on cancellation)
Best Regards, Tony Wei 2018-03-05 21:30
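For reference, the settings above can be written down as plain data and sanity-checked. This is a sketch in Python, not Flink's API: `CheckpointSettings` and `gap_after_timeout_s` are hypothetical names, and the calculation is a simplified model that only looks at the periodic trigger, not at Flink's actual scheduler.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckpointSettings:
    interval_s: int       # time between periodic checkpoint triggers
    timeout_s: int        # time after which a pending checkpoint is given up
    min_pause_ms: int     # minimum pause between checkpoints
    max_concurrent: int   # maximum number of concurrent checkpoints

# Values copied from the configuration in this mail.
settings = CheckpointSettings(
    interval_s=20 * 60,
    timeout_s=10 * 60,
    min_pause_ms=0,
    max_concurrent=1,
)

def gap_after_timeout_s(s: CheckpointSettings) -> int:
    """Seconds between a checkpoint hitting its timeout and the next
    periodic trigger, assuming the next trigger fires exactly one
    interval after the previous one (simplified model)."""
    return s.interval_s - s.timeout_s

print(f"timeout shorter than interval: {settings.timeout_s < settings.interval_s}")
print(f"gap after a timeout: {gap_after_timeout_s(settings)} s")
```

Under this model, a checkpoint that hits the 10-minute timeout is given up 10 minutes before the next one is triggered.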

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Stefan Richter
Hi, quick question: what is your exact checkpointing configuration? In particular, what are your values for the maximum number of concurrent checkpoints and the minimum pause between two checkpoints? Best, Stefan
> On 05.03.2018 at 06:34, Tony Wei wrote:
>
> Hi