Hi Bruno,

in order to debug this problem, we would need a bit more information. In
particular, the logs of the cluster entrypoint and your K8s deployment
specification would be helpful. If you have memory limits specified, it
would also be useful to know their values.
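To be concrete, the relevant part of the deployment spec is the container
resources section; a minimal sketch (names and values purely illustrative):

```yaml
# Illustrative snippet of a K8s container spec for a Flink JobManager pod.
# If the container's memory usage exceeds limits.memory, the kubelet
# OOM-kills it and the pod's last termination reason shows "OOMKilled".
resources:
  requests:
    memory: "3000Mi"
  limits:
    memory: "3500Mi"
```

You can check whether an OOM kill happened by running
`kubectl describe pod <pod-name>` and looking at the container's last state.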

Cheers,
Till

On Sun, Aug 19, 2018 at 2:43 PM vino yang <yanghua1...@gmail.com> wrote:

> Hi Bruno,
>
> Pinging Till for you; he may be able to give you some useful information.
>
> Thanks, vino.
>
> Bruno Aranda <bara...@apache.org> wrote on Sun, Aug 19, 2018 at 6:57 AM:
>
>> Hi,
>>
>> I am experiencing an issue when a job manager is trying to recover using
>> an HA setup. When the job manager starts again and tries to resume from the
>> last checkpoints, it gets killed by Kubernetes (I guess), since I can see
>> the following in the logs while the jobs are deployed:
>>
>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>
>> I am requesting enough memory for it (3000Mi), and Flink is configured to
>> use 2048MB of heap. I have tried to increase the max perm size, but did not
>> see an improvement.
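>>
>> For completeness, the Flink memory configuration looks roughly like this
>> (a sketch; I believe jobmanager.heap.mb is the relevant option in Flink
>> 1.6, but please treat the exact key name as an assumption):
>>
>> ```yaml
>> # flink-conf.yaml (illustrative)
>> jobmanager.heap.mb: 2048
>> ```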
>>
>> Any suggestions to help diagnose this?
>>
>> I have the following:
>>
>> Flink 1.6.0 (same with 1.5.1)
>> Azure AKS with Kubernetes 1.11
>> State management using RocksDB with checkpoints stored in Azure Data Lake
>>
>> Thanks!
>>
>> Bruno
>>
>>
