Using session cluster with three taskmanagers, cluster.evenly-spread-out-slots 
is set to true.  13 jobs running.  Average parallelism of each job is 4.
Flink version 1.11.2, Java 11.
Running on AWS EC2 instances with EFS for high-availability.storageDir.


We are seeing very high checkpoint times and experiencing timeouts.  The 
checkpoint timeout is the default 10 minutes.   This does not seem to be 
related to EFS limits/throttling .  We started experiencing these timeouts 
after upgrading from Flink 1.9.2/Java 8.  Are there any known issues which 
cause very high checkpoint times?

Also I noticed we did not set state.checkpoints.dir, I assume it is using 
high-availability.storageDir.  Is that correct?

For now we plan on setting

execution.checkpointing.timeout<https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#execution-checkpointing-timeout>:
 60 min

execution.checkpointing.tolerable-failed-checkpoints<https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#execution-checkpointing-tolerable-failed-checkpoints>:12

execution.checkpointing.unaligned<https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#execution-checkpointing-unaligned>
  true
and also explicitly set
state.checkpoints.dir

Reply via email to