Hi! There are many possible reasons for a checkpoint failure. The most likely ones are:

* The JVM is busy with garbage collection while the checkpoint is being taken. This can be checked by looking at the GC logs of the task managers (see the sketch below for one way to enable GC logging).
* The state suddenly becomes quite large due to a specific data pattern. This can be checked by looking at the state size reported for the completed portion of that checkpoint.
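A minimal sketch for the first point, assuming a Java 8 JVM and a log path that is writable on your EMR nodes (both are assumptions on my side, adjust as needed):

    # flink-conf.yaml (or passed as a -yD dynamic property)
    # Assumed Java 8 GC flags and an assumed, writable log location:
    env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/flink/taskmanager-gc.log

With this in place, long GC pauses around the checkpoint timestamps should show up directly in the task manager GC log.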
You might also want to profile the CPU usage while the checkpoint is in progress.

Daniel Vol <vold...@gmail.com> wrote on Wed, Sep 1, 2021 at 7:08 PM:

> Hello,
>
> I see the following error in my jobmanager log (Flink on EMR).
> Checking the cluster logs I see:
>
> 2021-08-21 17:17:30,489 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1 (type=CHECKPOINT) @ 1629566250303 for job c513e9ebbea4ab72d80b1338896ca5c2.
> 2021-08-21 17:17:33,572 [jobmanager-future-thread-5] INFO com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream - close closed:false s3://***/_metadata
> 2021-08-21 17:17:33,800 [jobmanager-future-thread-5] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1 for job c513e9ebbea4ab72d80b1338896ca5c2 (737859873 bytes in 3496 ms).
> 2021-08-21 17:27:30,474 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 2 (type=CHECKPOINT) @ 1629566850302 for job c513e9ebbea4ab72d80b1338896ca5c2.
> 2021-08-21 17:27:46,012 [jobmanager-future-thread-3] INFO com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream - close closed:false s3://***/_metadata
> 2021-08-21 17:27:46,158 [jobmanager-future-thread-3] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 2 for job c513e9ebbea4ab72d80b1338896ca5c2 (1210889410 bytes in 15856 ms).
> 2021-08-21 17:37:30,468 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 3 (type=CHECKPOINT) @ 1629567450302 for job c513e9ebbea4ab72d80b1338896ca5c2.
> 2021-08-21 17:47:30,469 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 3 of job c513e9ebbea4ab72d80b1338896ca5c2 expired before completing.
> 2021-08-21 17:47:30,476 [flink-akka.actor.default-dispatcher-34] INFO org.apache.flink.runtime.jobmaster.JobMaster - Trying to recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:66)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1673)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1650)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:91)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1783)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2021-08-21 17:47:30,478 [flink-akka.actor.default-dispatcher-34] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job session-aggregation (c513e9ebbea4ab72d80b1338896ca5c2) switched from state RUNNING to RESTARTING.
>
> The configuration is:
>
> -yD "execution.checkpointing.timeout=10 min"\
> -yD "restart-strategy=failure-rate"\
> -yD "restart-strategy.failure-rate.max-failures-per-interval=70"\
> -yD "restart-strategy.failure-rate.delay=1 min"\
> -yD "restart-strategy.failure-rate.failure-rate-interval=60 min"\
>
> I am not sure whether https://issues.apache.org/jira/browse/FLINK-21215 is related, but it looks like it has already been solved.
>
> I know I can increase the checkpoint timeout, but the checkpoint size is relatively small and most of the time a checkpoint takes only a few seconds to complete, so 10 minutes should be more than enough. So the main question is: why was "Exceeded checkpoint tolerable failure threshold" triggered?
>
> Thanks!
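Regarding the last question in the quoted mail: the threshold behind that error message is the number of tolerable checkpoint failures, which defaults to 0 as far as I know, so a single expired checkpoint is already enough to fail and restart the job. A minimal sketch in the same -yD style as the configuration above, assuming Flink 1.11+ and a purely illustrative value of 3:

    -yD "execution.checkpointing.tolerable-failed-checkpoints=3"\

Raising the threshold only masks the symptom, though; the GC logs and the state size are still the places to look for why checkpoint 3 expired in the first place.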