Hi,

I am experiencing an issue when a job manager tries to recover in an HA
setup. When the job manager starts again and tries to resume from the
last checkpoints, it gets killed (by Kubernetes, I guess), since I can see
the following in the logs while the jobs are being deployed:

INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

I am requesting enough memory for it, 3000Gi, and it is configured to use
2048Gb of memory. I have tried to increase the max perm size, but did not
see an improvement.

Any suggestions to help diagnose this?

I have the following:

Flink 1.6.0 (same with 1.5.1)
Azure AKS with Kubernetes 1.11
State management using RocksDB with checkpoints stored in Azure Data Lake

Thanks!

Bruno
