Flink 1.12.0: checkpoint fails every few hours

Frost Wong Wed, 17 Mar 2021 19:38:35 -0700

Hi everyone,

I am running a job in Flink on YARN mode, and every few hours it hits the following error:


2021-03-18 08:52:37,019 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 661818 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (562357 bytes in 4699 ms).
2021-03-18 08:52:37,637 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 661819 (type=CHECKPOINT) @ 1616028757520 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 08:52:42,956 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 661819 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (2233389 bytes in 4939 ms).
2021-03-18 08:52:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 09:12:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.
2021-03-18 09:12:43,615 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_231]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_231]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_231]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_231]
2021-03-18 09:12:43,618 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job csmonitor_comment_strategy (4fa72fc414f53e5ee062f9fbd5a2f4d5) switched from state RUNNING to RESTARTING.
2021-03-18 09:12:43,619 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map (43/256) (18dec1f23b95f741f5266594621971d5) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map (44/256) (3f2ec60b2f3042ceea6e1d660c78d3d7) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map (45/256) (66d411c2266ab025b69196dfec30d888) switched from RUNNING to CANCELING.
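After that, the job recovers on its own. I am using Unaligned Checkpoint with the RocksDB state backend, and there are no other error messages before or after this one. Judging from the checkpoint metrics, there is always one last subtask that never completes, and changing the parallelism did not fix the problem.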
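
One detail that can be read directly from the log is how long checkpoint 661820 was pending before it expired. The sketch below is only an illustration: it assumes the timestamp format shown in the log above, and the names `expired_checkpoints`, `TRIGGER` and `EXPIRED` are hypothetical helpers, not part of Flink. It pairs each "Triggering checkpoint" line with the matching "expired before completing" line and reports the gap.

```python
import re
from datetime import datetime

# Two lines copied from the JobManager log above, re-joined onto single lines.
LOG = """\
2021-03-18 08:52:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 09:12:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.
"""

# Assumed log layout: "yyyy-MM-dd HH:mm:ss,SSS" at the start of each entry.
TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
TRIGGER = re.compile(TS + r".*Triggering checkpoint (\d+) ")
EXPIRED = re.compile(TS + r".*Checkpoint (\d+) of job \S+ expired before completing")


def parse_ts(text):
    """Parse a log timestamp such as '2021-03-18 08:52:43,528'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def expired_checkpoints(log):
    """Return {checkpoint id: seconds between trigger and expiry}."""
    triggered = {}
    result = {}
    for line in log.splitlines():
        match = TRIGGER.search(line)
        if match:
            triggered[match.group(2)] = parse_ts(match.group(1))
            continue
        match = EXPIRED.search(line)
        if match and match.group(2) in triggered:
            delta = parse_ts(match.group(1)) - triggered[match.group(2)]
            result[match.group(2)] = delta.total_seconds()
    return result


if __name__ == "__main__":
    for cp_id, seconds in expired_checkpoints(LOG).items():
        print(f"checkpoint {cp_id} expired after {seconds / 60:.1f} minutes")
```

For the two lines above this reports a gap of exactly 20 minutes, which presumably corresponds to the checkpoint timeout configured for the job (an inference from the log, not something stated in the post). The checkpoints before it completed in about 5 seconds, so the failing checkpoint appears to stall rather than merely run slowly.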

Thanks, everyone!
