Hi Zhijiang,
Thanks for your response.
I added the checkpointAlignmentTime metric. The data shows that the
checkpointDuration is about 150s, while the checkpointAlignmentTime is only about 4s.
There is a big gap between them.
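
To make the gap concrete, here is a rough sketch in Python of how the two numbers could relate for a single task. It is only a toy model, not Flink code: the class name, the per-channel barrier arrival times and the 1s snapshot time are all hypothetical, picked so that the totals match the figures above. It assumes alignment is counted from the first barrier a task receives to the last one, and that the snapshot starts only after the last barrier arrives.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskCheckpointTimeline:
    """Hypothetical per-task timeline, in seconds since the checkpoint was triggered."""
    barrier_arrivals: List[float]  # arrival time of the barrier on each input channel
    snapshot_seconds: float        # assumed time spent taking the state snapshot

    def alignment_seconds(self) -> float:
        # Alignment spans from the first barrier received to the last one.
        return max(self.barrier_arrivals) - min(self.barrier_arrivals)

    def duration_seconds(self) -> float:
        # The snapshot is triggered only after the last barrier has arrived.
        return max(self.barrier_arrivals) + self.snapshot_seconds

    def unexplained_seconds(self) -> float:
        # Time covered by neither alignment nor snapshot; in this model it is
        # the wait before the first barrier reaches the task.
        return (self.duration_seconds()
                - self.alignment_seconds()
                - self.snapshot_seconds)


# Made-up arrival times for a task with two input channels.
task = TaskCheckpointTimeline(barrier_arrivals=[145.0, 149.0], snapshot_seconds=1.0)
print(task.alignment_seconds())    # 4.0
print(task.duration_seconds())     # 150.0
print(task.unexplained_seconds())  # 145.0
```

Under these assumptions, most of the 150s would be spent before the first barrier even reaches the task, which alignment time alone would not show.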
Best
Henry
> On Oct 10, 2018, at 1:26 PM, Zhijiang(wangzhijiang999) <[email protected]>
> wrote:
>
> The checkpoint duration includes both barrier alignment and the state
> snapshot. Every task has to receive the barriers from all of its input
> channels before it triggers the state snapshot.
> I guess barrier alignment may be taking a long time in your case; this is
> especially critical under backpressure. You can check the
> "checkpointAlignmentTime" metric to confirm.
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From: 徐涛 <[email protected]>
> Sent: Wednesday, October 10, 2018 13:13
> To: user <[email protected]>
> Subject: Small checkpoint data takes too much time
>
> Hi
> I recently encountered a problem in production: checkpoints take too
> much time, although this doesn't affect job execution.
> I am using FsStateBackend with asynchronousSnapshots, writing the data to an
> HDFS checkpointDataUri. I print the metrics "lastCheckpointDuration" and
> "lastCheckpointSize". The "lastCheckpointSize" is about 80KB, but
> the "lastCheckpointDuration" is about 160s! Because the checkpoint data is
> small, I think it should not take that long. I do not know why, or which
> conditions may influence the checkpoint time. Has anyone encountered such a
> problem?
> Thanks a lot.
>
> Best
> Henry
>