Sent to the wrong mailing list. Forwarding it to the correct one.

---------- Forwarded message ----------
From: Tony Wei <tony19920...@gmail.com>
Date: 2018-03-06 14:43 GMT+08:00
Subject: Re: checkpoint stuck with rocksdb statebackend and s3 filesystem
To: 周思华 <summerle...@163.com>, Stefan Richter <s.rich...@data-artisans.com>
Cc: "user-subscr...@flink.apache.org" <user-subscr...@flink.apache.org>
Hi Sihua,

Thanks a lot. I will try to find out the problem from the machines' environment. If you and Stefan have any new suggestions or thoughts, please advise me. Thank you!

Best Regards,
Tony Wei

2018-03-06 14:34 GMT+08:00 周思华 <summerle...@163.com>:

> Hi Tony,
>
> I think the two things you mentioned can both lead to a bad network. But from my side, I think it is more likely that an unstable network environment caused the poor network performance itself, because as far as I know I can't find a reason why Flink would slow down the network that much (and even if it does, the effect should not be that big).
>
> Maybe Stefan could tell more about that. ;)
>
> Best Regards,
> Sihua Zhou
>
> Sent from NetEase Mail Master
>
> On 03/6/2018 14:04, Tony Wei <tony19920...@gmail.com> wrote:
>
> Hi Sihua,
>
>> Hi Tony,
>>
>> About your question: an average end-to-end checkpoint latency of less than 1.5 mins doesn't mean that a checkpoint won't time out. Indeed, it is determined by the max end-to-end latency (the slowest one); a checkpoint is truly completed only after all tasks' checkpoints have completed.
>
> Sorry for my poor expression. What I meant is the average duration of "completed" checkpoints, so I guess there are some problems that make some subtasks of the checkpoint take very long, even more than 10 mins.
>
>> About the problem: after a second look at the info you provided, we can see from the checkpoint detail picture that there is one task which took 4m20s to transfer its snapshot (about 482M) to s3, and there are 4 other tasks that didn't complete the checkpoint yet. And from bad_tm_pic.png vs good_tm_pic.png, we can see that on the "bad tm" the network performance is far worse than on the "good tm" (-15 MB vs -50 MB). So I guess the network is a problem; sometimes it fails to send 500M of data to s3 within 10 minutes. (Maybe you need to check whether the network env is stable.)
>
> That is what I am concerned about. I can't determine whether a stuck checkpoint makes the network performance worse, or worse network performance makes the checkpoint get stuck. Although I provided one "bad machine" and one "good machine", that doesn't mean the bad machine is always bad and the good machine is always good. See the attachments. All of my TMs met this problem at least once from last weekend until now. Some machines recovered by themselves and some recovered after I restarted them.
>
> Best Regards,
> Tony Wei
>
> 2018-03-06 13:41 GMT+08:00 周思华 <summerle...@163.com>:
>
>> Hi Tony,
>>
>> About your question: an average end-to-end checkpoint latency of less than 1.5 mins doesn't mean that a checkpoint won't time out. Indeed, it is determined by the max end-to-end latency (the slowest one); a checkpoint is truly completed only after all tasks' checkpoints have completed.
>>
>> About the problem: after a second look at the info you provided, we can see from the checkpoint detail picture that there is one task which took 4m20s to transfer its snapshot (about 482M) to s3, and there are 4 other tasks that didn't complete the checkpoint yet. And from bad_tm_pic.png vs good_tm_pic.png, we can see that on the "bad tm" the network performance is far worse than on the "good tm" (-15 MB vs -50 MB). So I guess the network is a problem; sometimes it fails to send 500M of data to s3 within 10 minutes. (Maybe you need to check whether the network env is stable.)
>>
>> About the solution: I think incremental checkpoints can help you a lot, since each checkpoint will only send the new data, but you are right that if the incremental state size is bigger than 500M, it might cause the timeout problem again (because of the bad network performance).
>>
>> Best Regards,
>> Sihua Zhou
>>
>> Sent from NetEase Mail Master
>>
>> On 03/6/2018 13:02, Tony Wei <tony19920...@gmail.com> wrote:
>>
>> Hi Sihua,
>>
>> Thanks for your suggestion. "Incremental checkpoint" is what I will try out next, and I know it will give better performance. However, it might not solve this issue completely, because as I said, the average end-to-end latency of checkpointing is currently less than 1.5 mins, which is far from my timeout configuration. I believe "incremental checkpoint" will reduce the latency and make this issue occur seldom, but I can't promise it won't happen again if my state grows bigger in the future. Am I right?
>>
>> Best Regards,
>> Tony Wei
>>
>> 2018-03-06 10:55 GMT+08:00 周思华 <summerle...@163.com>:
>>
>>> Hi Tony,
>>>
>>> Sorry to jump in. One thing I want to point out is that, from the log you provided, it looks like you are using "full checkpoint". This means that the state data that needs to be checkpointed and transferred to s3 will grow over time, and even for the first checkpoint its performance is slower than an incremental checkpoint (because it needs to iterate over all the records in rocksdb using the RocksDBMergeIterator). Maybe you can try out "incremental checkpoint"; it could give you better performance.
>>>
>>> Best Regards,
>>> Sihua Zhou
>>>
>>> Sent from NetEase Mail Master
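(For anyone following along: the incremental checkpoint setting Sihua refers to is enabled on the RocksDB state backend itself. Below is only a minimal sketch with the Flink 1.4 Java API; the checkpoint URI is a placeholder rather than the actual bucket used in this thread, and the s3:// scheme assumes the flink-s3-fs-presto jar is available in Flink's lib/ directory.)

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder checkpoint location; the real bucket/path is not given in this thread.
        String checkpointUri = "s3://my-bucket/flink-checkpoints";

        // The second constructor argument turns on incremental checkpoints, so each
        // checkpoint uploads only the RocksDB files created since the previous one
        // instead of a full snapshot of the whole state.
        RocksDBStateBackend backend = new RocksDBStateBackend(checkpointUri, true);
        env.setStateBackend(backend);

        // ... define the job's sources/operators/sinks and call env.execute(...) as usual.
    }
}

This keeps the amount of data transferred per checkpoint roughly proportional to the state delta, which is also why Sihua's caveat still applies: if the state grows by more than the network can move within the timeout, the problem can reappear.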
>>>
>>> On 03/6/2018 10:34, Tony Wei <tony19920...@gmail.com> wrote:
>>>
>>> Hi Stefan,
>>>
>>> I see. That explains why the load on the machines grew. However, I think it is not the root cause that led to these consecutive checkpoint timeouts. As I said in my first mail, the checkpointing process usually took 1.5 mins to upload states, and this operator and the kafka consumer are the only two operators that have state in my pipeline. In the best case, I should never encounter a timeout problem caused only by lots of pending checkpointing threads that have already timed out. Am I right?
>>>
>>> Since the logging and stack trace were taken nearly 3 hours after the first checkpoint timeout, I'm afraid we can't actually find out the root cause of the first checkpoint timeout. Because we are preparing to take this pipeline to production, I was wondering if you could help me find out where the root cause lies: bad machines, or s3, or the flink-s3-presto package, or the flink checkpointing thread. It would be great if we can find it out from the information I provided, or a hypothesis based on your experience is welcome as well. The most important thing is that I have to decide whether I need to change my persistence filesystem or use another s3 filesystem package, because the last thing I want to see is the checkpoint timeout happening very often.
>>>
>>> Thank you very much for all your advice.
>>>
>>> Best Regards,
>>> Tony Wei
>>>
>>> 2018-03-06 1:07 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>>>
>>>> Hi,
>>>>
>>>> thanks for all the info. I had a look into the problem and opened https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack trace, you can see that many checkpointing threads are running on your TM for checkpoints that have already timed out, and I think this cascades and slows down everything. It seems the implementation of some features, like checkpoint timeouts and not failing tasks on checkpointing problems, overlooked that we also need to properly communicate the checkpoint cancellation to all tasks, which was not needed before.
>>>>
>>>> Best,
>>>> Stefan
>>>>
>>>> On 05.03.2018 at 14:42, Tony Wei <tony19920...@gmail.com> wrote:
>>>>
>>>> Hi Stefan,
>>>>
>>>> Here is my checkpointing configuration.
>>>>
>>>> Checkpointing Mode: Exactly Once
>>>> Interval: 20m 0s
>>>> Timeout: 10m 0s
>>>> Minimum Pause Between Checkpoints: 0ms
>>>> Maximum Concurrent Checkpoints: 1
>>>> Persist Checkpoints Externally: Enabled (delete on cancellation)
>>>>
>>>> Best Regards,
>>>> Tony Wei
>>>>
>>>> 2018-03-05 21:30 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> quick question: what is your exact checkpointing configuration? In particular, what is your value for the maximum parallel checkpoints and the minimum time interval to wait between two checkpoints?
>>>>>
>>>>> Best,
>>>>> Stefan
>>>>>
>>>>> On 05.03.2018 at 06:34, Tony Wei <tony19920...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Last weekend, my flink job's checkpoints started failing because of timeouts. I have no idea what happened, but I collected some information about my cluster and job. I hope someone can give me advice or hints about the problem I encountered.
>>>>>>
>>>>>> My cluster version is flink-release-1.4.0. The cluster has 10 TMs, each with 4 cores. These machines are ec2 spot instances. The job's parallelism is set to 32, using rocksdb as the state backend and s3 presto as the checkpoint filesystem.
>>>>>> The state's size is nearly 15gb and still grows day by day. Normally, it takes 1.5 mins to finish the whole checkpoint process. The timeout configuration is set to 10 mins.
>>>>>>
>>>>>> <chk_snapshot.png>
>>>>>>
>>>>>> As the picture shows, not every subtask of the checkpoint failed because of the timeout, but every machine has at some point failed for all of its subtasks during last weekend. Some machines recovered by themselves and some machines recovered after I restarted them.
>>>>>>
>>>>>> I recorded logs, stack traces and snapshots of the machines' status (CPU, IO, Network, etc.) for both a good and a bad machine. If there is a need for more information, please let me know. Thanks in advance.
>>>>>>
>>>>>> Best Regards,
>>>>>> Tony Wei
>>>>>> <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>
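For reference, the checkpoint settings Tony lists above correspond roughly to the following calls on the StreamExecutionEnvironment. This is only a sketch of how such a configuration is usually expressed in the Flink 1.4 DataStream API, not the actual job code from this thread.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettingsExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing Mode: Exactly Once, Interval: 20m 0s
        env.enableCheckpointing(20 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // Timeout: 10m 0s
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);

        // Minimum Pause Between Checkpoints: 0ms
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(0L);

        // Maximum Concurrent Checkpoints: 1
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // Persist Checkpoints Externally: Enabled (delete on cancellation)
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

        // ... define the job and call env.execute(...) as usual.
    }
}

With a maximum of one concurrent checkpoint and a 10-minute timeout, a checkpoint that stalls on a slow S3 upload blocks the next trigger until it either completes or times out, which is consistent with the behaviour described in the thread.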