Hi! There are many possible reasons for a checkpoint failure. The most likely ones are:

* The JVM is busy with garbage collection while the checkpoint is being taken. This can be checked by looking at the GC logs of the task managers (see the sketch below for one way to enable GC logging).
* The state suddenly becomes quite large due to a specific data pattern. This can be checked by looking at the state size reported for the completed portion of that checkpoint.
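A minimal sketch for the first point, assuming a Java 8 JVM and a log path that is writable on your EMR nodes (both are assumptions on my side, adjust as needed):

    # flink-conf.yaml (or passed as a -yD dynamic property)
    # Assumed Java 8 GC flags and an assumed, writable log location:
    env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/flink/taskmanager-gc.log

With this in place, long GC pauses around the checkpoint timestamps should show up directly in the task manager GC log.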
You might also want to profile the CPU usage while the checkpoint is in progress.

Daniel Vol <vold...@gmail.com> wrote on Wed, Sep 1, 2021 at 7:08 PM:

> Hello,
>
> I see the following error in my jobmanager log (Flink on EMR).
> Checking the cluster logs I see:
>
> 2021-08-21 17:17:30,489 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1 (type=CHECKPOINT) @ 1629566250303 for job c513e9ebbea4ab72d80b1338896ca5c2.
> 2021-08-21 17:17:33,572 [jobmanager-future-thread-5] INFO com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream - close closed:false s3://***/_metadata
> 2021-08-21 17:17:33,800 [jobmanager-future-thread-5] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1 for job c513e9ebbea4ab72d80b1338896ca5c2 (737859873 bytes in 3496 ms).
> 2021-08-21 17:27:30,474 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 2 (type=CHECKPOINT) @ 1629566850302 for job c513e9ebbea4ab72d80b1338896ca5c2.
> 2021-08-21 17:27:46,012 [jobmanager-future-thread-3] INFO com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream - close closed:false s3://***/_metadata
> 2021-08-21 17:27:46,158 [jobmanager-future-thread-3] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 2 for job c513e9ebbea4ab72d80b1338896ca5c2 (1210889410 bytes in 15856 ms).
> 2021-08-21 17:37:30,468 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 3 (type=CHECKPOINT) @ 1629567450302 for job c513e9ebbea4ab72d80b1338896ca5c2.
> 2021-08-21 17:47:30,469 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 3 of job c513e9ebbea4ab72d80b1338896ca5c2 expired before completing.
> 2021-08-21 17:47:30,476 [flink-akka.actor.default-dispatcher-34] INFO org.apache.flink.runtime.jobmaster.JobMaster - Trying to recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:66)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1673)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1650)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:91)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1783)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2021-08-21 17:47:30,478 [flink-akka.actor.default-dispatcher-34] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job session-aggregation (c513e9ebbea4ab72d80b1338896ca5c2) switched from state RUNNING to RESTARTING.
>
> The configuration is:
>
> -yD "execution.checkpointing.timeout=10 min"\
> -yD "restart-strategy=failure-rate"\
> -yD "restart-strategy.failure-rate.max-failures-per-interval=70"\
> -yD "restart-strategy.failure-rate.delay=1 min"\
> -yD "restart-strategy.failure-rate.failure-rate-interval=60 min"\
>
> I am not sure whether https://issues.apache.org/jira/browse/FLINK-21215 is related, but it looks like it has already been solved.
>
> I know I can increase the checkpoint timeout, but the checkpoint size is relatively small and most of the time a checkpoint takes only a few seconds to complete, so 10 minutes should be more than enough. So the main question is: why was "Exceeded checkpoint tolerable failure threshold" triggered?
>
> Thanks!
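Regarding the last question in the quoted mail: the threshold behind that error message is the number of tolerable checkpoint failures, which defaults to 0 as far as I know, so a single expired checkpoint is already enough to fail and restart the job. A minimal sketch in the same -yD style as the configuration above, assuming Flink 1.11+ and a purely illustrative value of 3:

    -yD "execution.checkpointing.tolerable-failed-checkpoints=3"\

Raising the threshold only masks the symptom, though; the GC logs and the state size are still the places to look for why checkpoint 3 expired in the first place.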