Hi Stefan,

I see. That explains why the load on the machines went up. However, I don't
think it is the root cause that led to these consecutive checkpoint
timeouts. As I said in my first mail, the checkpointing process usually took
about 1.5 minutes to upload the states, and this operator and the kafka
consumer are the only two operators in my pipeline that have state. If
everything were healthy, I should never hit the timeout in the first place,
so a pile of pending checkpointing threads that have already timed out can
only be a consequence, not the original cause. Am I right?
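For reference, this is roughly how those checkpoint settings (the same ones
quoted further down in this thread) map onto Flink's CheckpointConfig API.
It is only a sketch with placeholder class and job names, but the interval,
timeout and concurrency values are the ones my job runs with:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSettingsSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Exactly-once checkpoints, triggered every 20 minutes.
            env.enableCheckpointing(20 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig conf = env.getCheckpointConfig();
            // A checkpoint that has not completed within 10 minutes expires.
            conf.setCheckpointTimeout(10 * 60 * 1000L);
            // No forced pause between checkpoints; at most one in flight.
            conf.setMinPauseBetweenCheckpoints(0L);
            conf.setMaxConcurrentCheckpoints(1);
            // Keep externalized checkpoints, deleted when the job is cancelled.
            conf.enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

            // Placeholder pipeline just so the sketch runs; the real job has
            // the kafka consumer and the stateful operator here.
            env.fromElements(1, 2, 3).print();
            env.execute("checkpoint-settings-sketch");
        }
    }

With a 20 minute interval, a 10 minute timeout and at most one concurrent
checkpoint, every checkpoint should be either complete or expired before the
next one is triggered, which is why the long queue of timed-out checkpoint
threads surprised me.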

Since those logs and the stack trace were taken nearly 3 hours after the
first checkpoint timeout, I'm afraid we can't really determine the root
cause of that first timeout from them. Because we are preparing to put this
pipeline into production, I was wondering if you could help me narrow down
where the root cause lies: bad machines, s3, the flink-s3-presto package, or
the Flink checkpointing threads. It would be great if we can figure it out
from the information I provided, and a hypothesis based on your experience
is welcome as well. The most important thing is that I have to decide
whether I need to change my persistence filesystem or use another s3
filesystem package, because frequent checkpoint timeouts are the last thing
I want to see.
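For completeness, the persistence side of the job currently looks roughly
like this (again just a sketch; the bucket path is a placeholder). As far as
I understand, the job code only refers to the s3:// URI, so switching
between flink-s3-fs-presto and flink-s3-fs-hadoop should mainly be a matter
of swapping the jar in Flink's lib/ directory rather than a code change:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend writing checkpoints to S3. The bucket path
            // is a placeholder. Which implementation serves "s3://" (presto or
            // hadoop) is decided by the filesystem jar in Flink's lib/
            // directory, not by this line.
            env.setStateBackend(
                    new RocksDBStateBackend("s3://my-bucket/flink/checkpoints"));

            // ... sources, the stateful operator and sinks as in the real job ...
        }
    }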

Thank you very much for all your advice.

Best Regards,
Tony Wei

2018-03-06 1:07 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:

> Hi,
>
> thanks for all the info. I had a look into the problem and opened
> https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your
> stack trace, you can see many checkpointing threads are running on your TM
> for checkpoints that have already timed out, and I think this cascades and
> slows down everything. It seems like the implementation of some features,
> like checkpoint timeouts and not failing tasks on checkpointing problems,
> overlooked that we also need to properly communicate the checkpoint
> cancellation to all tasks, which was not needed before.
>
> Best,
> Stefan
>
>
> On 05.03.2018, at 14:42, Tony Wei <tony19920...@gmail.com> wrote:
>
> Hi Stefan,
>
> Here is my checkpointing configuration.
>
> Checkpointing Mode: Exactly Once
> Interval: 20m 0s
> Timeout: 10m 0s
> Minimum Pause Between Checkpoints: 0ms
> Maximum Concurrent Checkpoints: 1
> Persist Checkpoints Externally: Enabled (delete on cancellation)
>
> Best Regards,
> Tony Wei
>
> 2018-03-05 21:30 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>
>> Hi,
>>
>> quick question: what is your exact checkpointing configuration? In
>> particular, what is your value for the maximum parallel checkpoints and the
>> minimum time interval to wait between two checkpoints?
>>
>> Best,
>> Stefan
>>
>> > On 05.03.2018, at 06:34, Tony Wei <tony19920...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > Last weekend, my flink job's checkpoints started failing because of
>> timeouts. I have no idea what happened, but I collected some information
>> about my cluster and job. I hope someone can give me advice or hints about
>> the problem that I encountered.
>> >
>> > My cluster version is flink-release-1.4.0. The cluster has 10 TMs, each
>> with 4 cores. These machines are ec2 spot instances. The job's parallelism
>> is set to 32, using rocksdb as the state backend and the presto s3
>> filesystem for checkpoints.
>> > The state size is nearly 15gb and still grows day by day. Normally, it
>> takes 1.5 minutes to finish the whole checkpoint process. The timeout
>> configuration is set to 10 minutes.
>> >
>> > <chk_snapshot.png>
>> >
>> > As the picture shows, not every checkpoint subtask failed because of the
>> timeout, but every machine at some point had all of its subtasks fail
>> during last weekend. Some machines recovered by themselves and some
>> recovered after I restarted them.
>> >
>> > I recorded logs, stack traces and snapshots of the machines' status (CPU,
>> IO, network, etc.) for both a good and a bad machine. If you need more
>> information, please let me know. Thanks in advance.
>> >
>> > Best Regards,
>> > Tony Wei.
>> > <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>
>>
>>
>
>
