After my Flink job has been running for a while, every checkpoint starts failing. The details are as follows:
| ID | Status | Acknowledged | Trigger Time | Latest Acknowledgement | End to End Duration | State Size | Buffered During Alignment |
|---|---|---|---|---|---|---|---|
| 295 | FAILED | 30/50 | 11:55:38 | 11:55:39 | 1h 0m 0s | 205 KB | 0 B |
Checkpoint Detail:
Path: -
Discarded: -
Failure Message: Checkpoint expired before completing.
Operators:

| Name | Acknowledged | Latest Acknowledgment | End to End Duration | State Size | Buffered During Alignment |
|---|---|---|---|---|---|
| Source: dw-member | 6/10 (60%) | 11:55:39 | 1s | 7.08 KB | 0 B |
| Source: wi-order | 6/10 (60%) | 11:55:39 | 1s | 7.11 KB | 0 B |
| Source: dw-pay | 6/10 (60%) | 11:55:39 | 1s | 7.11 KB | 0 B |
| RecordTransformOperator | 6/10 (60%) | 11:55:39 | 1s | 98.8 KB | 0 B |
| RecordComputeOperator -> Sink: dw-record-data-sink | 6/10 (60%) | 11:55:39 | 1s | 85.1 KB | 0 B |
SubTasks:

| | End to End Duration | State Size | Checkpoint Duration (Sync) | Checkpoint Duration (Async) | Alignment Buffered | Alignment Duration |
|---|---|---|---|---|---|---|
| Minimum | 1s | 14.2 KB | 7ms | 841ms | 0 B | 13ms |
| Average | 1s | 14.2 KB | 94ms | 1s | 0 B | 13ms |
| Maximum | 1s | 14.2 KB | 181ms | 1s | 0 B | 15ms |
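| ID | Acknowledgement Time | E2E Duration | State Size | Checkpoint Duration (Sync) | Checkpoint Duration (Async) | Align Buffered | Align Duration |
|---|---|---|---|---|---|---|---|
| 1 | n/a | | | | | | |
| 2 | 11:55:39 | 1s | 14.2 KB | 8ms | 1s | 0 B | 15ms |
| 3 | n/a | | | | | | |
| 4 | 11:55:39 | 1s | 14.2 KB | 181ms | 1s | 0 B | 13ms |
| 5 | n/a | | | | | | |
| 6 | 11:55:39 | 1s | 14.2 KB | 8ms | 1s | 0 B | 14ms |
| 7 | 11:55:39 | 1s | 14.2 KB | 181ms | 961ms | 0 B | 13ms |
| 8 | n/a | | | | | | |
| 9 | 11:55:39 | 1s | 14.2 KB | 181ms | 841ms | 0 B | 13ms |
| 10 | 11:55:39 | 1s | 14.2 KB | 7ms | 1s | 0 B | 14ms |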
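
To make the pattern in these numbers easier to see, here is a small Python sketch that tallies the figures above. The values are copied by hand from the Operators and SubTasks tables in this post; the variable names are only illustrative, and the script does not query Flink in any way. It simply checks that the per-operator counts add up to the 30/50 shown for checkpoint 295 and lists the subtasks that never acknowledged.

```python
# Acknowledged / total subtasks per operator, copied from the Operators table.
operator_acks = {
    "Source: dw-member": (6, 10),
    "Source: wi-order": (6, 10),
    "Source: dw-pay": (6, 10),
    "RecordTransformOperator": (6, 10),
    "RecordComputeOperator -> Sink: dw-record-data-sink": (6, 10),
}

# Acknowledgement time per subtask ID, copied from the SubTasks table.
# None stands for the rows shown as "n/a".
subtask_ack_time = {
    1: None, 2: "11:55:39", 3: None, 4: "11:55:39", 5: None,
    6: "11:55:39", 7: "11:55:39", 8: None, 9: "11:55:39", 10: "11:55:39",
}

# Sum the acknowledged and total subtask counts across all operators.
total_acked = sum(acked for acked, _ in operator_acks.values())
total_subtasks = sum(total for _, total in operator_acks.values())

# Collect the subtask IDs that have no acknowledgement time.
missing = sorted(i for i, t in subtask_ack_time.items() if t is None)

print(f"acknowledged: {total_acked}/{total_subtasks}")
print(f"subtasks without acknowledgement: {missing}")
```

So the subtasks that did acknowledge all finished in about 1s, while subtasks 1, 3, 5 and 8 in the SubTasks table never reported back before the checkpoint expired.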


How should I go about troubleshooting this kind of problem? Are there any good suggestions or best practices? Thanks!
