Hi Edward,

    For the second issue, have you also set the state backend type? I'm asking
because, except for the default heap state backend, the other state backends
should throw an exception if state.checkpoints.dir is not set. Since the heap
state backend stores all snapshots in the JM's memory, the state cannot be
recovered after a JM failover, which makes it unsuitable for production usage.
Therefore, if you run in a production environment, it would be better to switch
to a state backend like RocksDB.
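
For example, a minimal flink-conf.yaml sketch (the checkpoint directory below
is only a placeholder; point it at durable storage reachable by all JMs/TMs):

    # switch from the default heap backend to RocksDB
    state.backend: rocksdb
    # placeholder path; use your own shared, durable storage
    state.checkpoints.dir: file:///mnt/efs/flink/checkpoints

The same can also be set per job via StreamExecutionEnvironment#setStateBackend.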

   For the checkpoint timeout, AFAIK there were no large changes to
checkpointing after 1.9.2. There can be different causes for checkpoint
timeouts; one possibility is back pressure, where some operator cannot process
its records in time and thus blocks the checkpoint barriers. I would suggest
checking for back pressure [1] first. If there is indeed back pressure, you
could try unaligned checkpoints, or resolve the back pressure itself by
increasing the parallelism of the slow operators.
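
If you try unaligned checkpoints, note that they were only introduced in 1.11,
so they are available in your version; a minimal flink-conf.yaml sketch:

    # allow checkpoint barriers to overtake in-flight records under back pressure
    execution.checkpointing.unaligned: true

The trade-off is that in-flight buffers are persisted as part of the
checkpoint, so checkpoints can become larger, but barriers are no longer
blocked behind back-pressured data.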

Best,
 Yun



[1] https://ci.apache.org/projects/flink/flink-docs-master/ops/monitoring/back_pressure.html



 ------------------ Original Mail ------------------
Sender: Colletta, Edward <edward.colle...@fmr.com>
Send Date: Mon Dec 21 17:50:15 2020
Recipients: user@flink.apache.org <user@flink.apache.org>
Subject: checkpointing seems to be throttled.

Using session cluster with three taskmanagers, cluster.evenly-spread-out-slots
is set to true.  13 jobs running.  Average parallelism of each job is 4.
Flink version 1.11.2, Java 11.
Running on AWS EC2 instances with EFS for high-availability.storageDir.
We are seeing very high checkpoint times and experiencing timeouts.  The
checkpoint timeout is the default 10 minutes.  This does not seem to be
related to EFS limits/throttling.  We started experiencing these timeouts
after upgrading from Flink 1.9.2/Java 8.  Are there any known issues which
cause very high checkpoint times?
Also, I noticed we did not set state.checkpoints.dir; I assume it is using
high-availability.storageDir.  Is that correct?
For now we plan on setting:

    execution.checkpointing.timeout: 60 min
    execution.checkpointing.tolerable-failed-checkpoints: 12
    execution.checkpointing.unaligned: true

and also explicitly setting state.checkpoints.dir.
