Hi!

There are many possible causes of checkpoint timeouts. The most common is that the timed-out subtask is too busy to make progress on the checkpoint (e.g., insufficient compute resources or data skew); you can confirm this from the busy and backpressure metrics on the Flink web UI. Another common cause is overly frequent GC, which you can diagnose by setting JVM options to print GC logs and inspecting them.
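For the GC check, a minimal sketch of the relevant flink-conf.yaml entries (file paths and the timeout value are placeholders, adjust them to your setup):

```yaml
# flink-conf.yaml -- enable GC logging on TaskManagers
# (JDK 8 flags shown; on JDK 9+ use -Xlog:gc*:file=... instead)
env.java.opts.taskmanager: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/taskmanager-gc.log"

# Optional stopgap while investigating: raise the checkpoint timeout
# (default is 10 min); 30min here is just an example value
execution.checkpointing.timeout: 30min
```

Raising the timeout only buys time for the investigation; the root cause (backpressure, skew, or GC) still needs to be fixed.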

[email protected] <[email protected]> wrote on Thu, Nov 18, 2021 at 2:54 PM:

> After the Flink job has been running for a while, checkpoints keep failing. The details are as follows:
> ID: 295
> Status: FAILED
> Acknowledged: 30/50
> Trigger Time: 11:55:38
> Latest Acknowledgement: 11:55:39
> End to End Duration: 1h 0m 0s
> State Size: 205 KB
> Buffered During Alignment: 0 B
> Checkpoint Detail:
> Path: -
> Discarded: -
> Failure Message: Checkpoint expired before completing.
> Operators:
> Name | Acknowledged | Latest Acknowledgment | End to End Duration | State Size | Buffered During Alignment
> Source: dw-member | 6/10 (60%) | 11:55:39 | 1s | 7.08 KB | 0 B
> Source: wi-order | 6/10 (60%) | 11:55:39 | 1s | 7.11 KB | 0 B
> Source: dw-pay | 6/10 (60%) | 11:55:39 | 1s | 7.11 KB | 0 B
> RecordTransformOperator | 6/10 (60%) | 11:55:39 | 1s | 98.8 KB | 0 B
> RecordComputeOperator -> Sink: dw-record-data-sink | 6/10 (60%) | 11:55:39 | 1s | 85.1 KB | 0 B
> SubTasks:
> (summary) | End to End Duration | State Size | Checkpoint Duration (Sync) | Checkpoint Duration (Async) | Alignment Buffered | Alignment Duration
> Minimum | 1s | 14.2 KB | 7ms | 841ms | 0 B | 13ms
> Average | 1s | 14.2 KB | 94ms | 1s | 0 B | 13ms
> Maximum | 1s | 14.2 KB | 181ms | 1s | 0 B | 15ms
> ID | Acknowledgement Time | E2E Duration | State Size | Checkpoint Duration (Sync) | Checkpoint Duration (Async) | Align Buffered | Align Duration
> 1 | n/a
> 2 | 11:55:39 | 1s | 14.2 KB | 8ms | 1s | 0 B | 15ms
> 3 | n/a
> 4 | 11:55:39 | 1s | 14.2 KB | 181ms | 1s | 0 B | 13ms
> 5 | n/a
> 6 | 11:55:39 | 1s | 14.2 KB | 8ms | 1s | 0 B | 14ms
> 7 | 11:55:39 | 1s | 14.2 KB | 181ms | 961ms | 0 B | 13ms
> 8 | n/a
> 9 | 11:55:39 | 1s | 14.2 KB | 181ms | 841ms | 0 B | 13ms
> 10 | 11:55:39 | 1s | 14.2 KB | 7ms | 1s | 0 B | 14ms
>
>
> How should this kind of problem be troubleshot? Are there any good suggestions or best practices? Thanks!
>
