After my Flink job has been running for a while, its checkpoints start failing every time. The details are as follows:

ID   Status  Acknowledged  Trigger Time  Latest Acknowledgement  End to End Duration  State Size  Buffered During Alignment
295  FAILED  30/50         11:55:38      11:55:39                1h 0m 0s             205 KB      0 B

Checkpoint Detail:
Path: -
Discarded: -
Failure Message: Checkpoint expired before completing.

Operators:
Name                                                Acknowledged  Latest Acknowledgment  End to End Duration  State Size  Buffered During Alignment
Source: dw-member                                   6/10 (60%)    11:55:39               1s                   7.08 KB     0 B
Source: wi-order                                    6/10 (60%)    11:55:39               1s                   7.11 KB     0 B
Source: dw-pay                                      6/10 (60%)    11:55:39               1s                   7.11 KB     0 B
RecordTransformOperator                             6/10 (60%)    11:55:39               1s                   98.8 KB     0 B
RecordComputeOperator -> Sink: dw-record-data-sink  6/10 (60%)    11:55:39               1s                   85.1 KB     0 B

SubTasks:
         End to End Duration  State Size  Checkpoint Duration (Sync)  Checkpoint Duration (Async)  Alignment Buffered  Alignment Duration
Minimum  1s                   14.2 KB     7ms                         841ms                        0 B                 13ms
Average  1s                   14.2 KB     94ms                        1s                           0 B                 13ms
Maximum  1s                   14.2 KB     181ms                       1s                           0 B                 15ms

ID  Acknowledgement Time  E2E Duration  State Size  Checkpoint Duration (Sync)  Checkpoint Duration (Async)  Align Buffered  Align Duration
1   n/a
2   11:55:39              1s            14.2 KB     8ms                         1s                           0 B             15ms
3   n/a
4   11:55:39              1s            14.2 KB     181ms                       1s                           0 B             13ms
5   n/a
6   11:55:39              1s            14.2 KB     8ms                         1s                           0 B             14ms
7   11:55:39              1s            14.2 KB     181ms                       961ms                        0 B             13ms
8   n/a
9   11:55:39              1s            14.2 KB     181ms                       841ms                        0 B             13ms
10  11:55:39              1s            14.2 KB     7ms                         1s                           0 B             14ms
How should I go about troubleshooting this kind of problem? Are there any recommendations or best practices? Thanks!
