Hi guys,

It looks suspicious that the TM pod termination is potentially delayed by the reconnect to a killed JM. I created an issue to investigate this: https://issues.apache.org/jira/browse/FLINK-15946

Let's continue the discussion there.
Best,
Andrey

On Wed, Feb 5, 2020 at 11:49 AM Yang Wang <danrtsey...@gmail.com> wrote:

> Maybe you need to check the kubelet logs to see why it gets stuck in the
> "Terminating" state for so long. Even if it needs to clean up the
> ephemeral storage, it should not take this much time.
>
> Best,
> Yang
>
> Li Peng <li.p...@doordash.com> wrote on Wed, Feb 5, 2020 at 10:42 AM:
>
>> My yml files follow most of the instructions here:
>> http://shzhangji.com/blog/2019/08/24/deploy-flink-job-cluster-on-kubernetes/
>>
>> What command did you use to delete the deployments? I use:
>> helm --tiller-namespace prod delete --purge my-deployment
>>
>> I noticed that for environments without much data (like staging), this
>> works flawlessly, but in production with a high volume of data, it gets
>> stuck in a loop. I suspect that the extra time needed to clean up the
>> task managers under high traffic delays the shutdown until after the job
>> manager terminates, and then the task manager gets stuck in a loop when
>> it detects that the job manager is dead.
>>
>> Thanks,
>> Li
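For anyone following along, Yang's suggestion could be carried out along these lines. This is a sketch only: the pod name, namespace, and the assumption that the node runs kubelet under systemd are placeholders to adapt to your cluster.

```shell
# Placeholder names: replace "prod" and <taskmanager-pod> with your own.

# See which pods are stuck in Terminating and for how long:
kubectl get pods -n prod -o wide

# Inspect the pod's events for clues about what is blocking deletion:
kubectl describe pod <taskmanager-pod> -n prod

# If the TM is looping on reconnect attempts to the dead JM,
# its own log should show repeated connection failures:
kubectl logs <taskmanager-pod> -n prod --tail=100

# On the node hosting the pod (systemd-based nodes), check the kubelet log:
journalctl -u kubelet --since "30 min ago" | grep <taskmanager-pod>
```

Comparing the timestamps of the JM pod's deletion with the first reconnect failure in the TM log should confirm or rule out the delayed-shutdown theory described above.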