Hi! There are many possible reasons for a checkpoint timing out. The most common is that the node whose checkpoint timed out is too busy and blocks the checkpoint (for example, insufficient compute resources or skewed data); you can judge this from the busy and backpressure metrics in the Flink web UI. Another common cause is overly frequent GC, which you can observe by setting JVM parameters to print a GC log.
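Based on the detail you pasted below: the subtasks that did acknowledge finished in about one second, while subtasks 1, 3, 5 and 8 never acknowledged at all, which is consistent with a few parallel instances being much busier (or more skewed) than the rest, so the busy/backpressure view is the first thing I would check.

For the GC log, the JVM flags usually go into flink-conf.yaml under env.java.opts.taskmanager, e.g. -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log on Java 8, or -Xlog:gc*:file=/path/to/gc.log on Java 11+.

While you track down the root cause, you can also give slow checkpoints more headroom in the job itself. Here is a minimal sketch (not your job's actual code; the class name and all numbers are illustrative) of the CheckpointConfig settings that usually matter when checkpoints expire:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative only: class name, interval and timeout values are made up.
public class CheckpointTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 s with exactly-once semantics.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Give a transiently busy task more time before the checkpoint expires
        // ("Checkpoint expired before completing" in your failure message).
        checkpointConfig.setCheckpointTimeout(30 * 60 * 1000L);

        // Leave room between checkpoints so normal processing can catch up.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L);

        // Do not fail the whole job on the first expired checkpoint.
        checkpointConfig.setTolerableCheckpointFailureNumber(3);

        // If barrier alignment is blocked by backpressure, unaligned
        // checkpoints (Flink 1.11+) let barriers overtake buffered records.
        checkpointConfig.enableUnalignedCheckpoints();

        // Stand-in pipeline so the sketch runs on its own; replace with the
        // real sources (dw-member, wi-order, dw-pay) and sinks of your job.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-tuning-sketch");
    }
}

Note that unaligned checkpoints only help when the delay comes from barrier alignment under backpressure; whether that applies here depends on what the backpressure view shows.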
[email protected] <[email protected]> wrote on Thu, Nov 18, 2021 at 2:54 PM:

> After our Flink job has been running for a while, checkpoints keep failing. The information is as follows:
>
> ID: 295    Status: FAILED    Acknowledged: 30/50
> Trigger Time: 11:55:38    Latest Acknowledgement: 11:55:39
> End to End Duration: 1h 0m 0s    State Size: 205 KB    Buffered During Alignment: 0 B
>
> Checkpoint Detail:
> Path: -    Discarded: -
> Failure Message: Checkpoint expired before completing.
>
> Operators:
> Name                                                 Acknowledged  Latest Ack  E2E Duration  State Size  Buffered During Alignment
> Source: dw-member                                    6/10 (60%)    11:55:39    1s            7.08 KB     0 B
> Source: wi-order                                     6/10 (60%)    11:55:39    1s            7.11 KB     0 B
> Source: dw-pay                                       6/10 (60%)    11:55:39    1s            7.11 KB     0 B
> RecordTransformOperator                              6/10 (60%)    11:55:39    1s            98.8 KB     0 B
> RecordComputeOperator -> Sink: dw-record-data-sink   6/10 (60%)    11:55:39    1s            85.1 KB     0 B
>
> SubTasks (summary):
>           E2E Duration  State Size  Checkpoint (Sync)  Checkpoint (Async)  Alignment Buffered  Alignment Duration
> Minimum   1s            14.2 KB     7ms                841ms               0 B                 13ms
> Average   1s            14.2 KB     94ms               1s                  0 B                 13ms
> Maximum   1s            14.2 KB     181ms              1s                  0 B                 15ms
>
> SubTasks (detail):
> ID  Ack Time  E2E Duration  State Size  Checkpoint (Sync)  Checkpoint (Async)  Align Buffered  Align Duration
> 1   n/a
> 2   11:55:39  1s            14.2 KB     8ms                1s                  0 B             15ms
> 3   n/a
> 4   11:55:39  1s            14.2 KB     181ms              1s                  0 B             13ms
> 5   n/a
> 6   11:55:39  1s            14.2 KB     8ms                1s                  0 B             14ms
> 7   11:55:39  1s            14.2 KB     181ms              961ms               0 B             13ms
> 8   n/a
> 9   11:55:39  1s            14.2 KB     181ms              841ms               0 B             13ms
> 10  11:55:39  1s            14.2 KB     7ms                1s                  0 B             14ms
>
> How should this kind of problem be diagnosed? Are there any good suggestions or best practices? Thanks!
