Hi,

The JobManager (ResourceManager) is not the only side that acts on a heartbeat timeout. It kills a TaskManager whose heartbeat has timed out, and a TaskManager that finds it cannot connect to the JobManager (ResourceManager) will also exit by itself.

To debug the timeout, check the logs for the time period in which the heartbeat timeout occurred and see what happened then. I usually also look at what the GC situation was like at that time, since a long GC pause can keep a process from answering heartbeats in time.

Best,
Guowei
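P.S. To illustrate the two-sided timeout described above, here is a toy model in Python. It is only a sketch under stated assumptions, not Flink's actual implementation: the class `HeartbeatMonitor`, the function `simulate`, and the interval and timeout values are all hypothetical names and numbers chosen for the example. In a real cluster the corresponding settings are the `heartbeat.interval` and `heartbeat.timeout` configuration options.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks when a peer was last heard from (toy model, not Flink code)."""
    owner: str          # the side that runs this monitor
    timeout_ms: int     # how long the owner waits before giving up on the peer
    last_seen_ms: int = 0

    def on_heartbeat(self, now_ms: int) -> None:
        # Record the time of the most recent heartbeat from the peer.
        self.last_seen_ms = now_ms

    def timed_out(self, now_ms: int) -> bool:
        # The peer is considered lost once the silence exceeds the timeout.
        return now_ms - self.last_seen_ms > self.timeout_ms


def simulate(partition_at_ms, end_ms, interval_ms=10_000, timeout_ms=50_000):
    """Return {side: time_ms} for each side that declares its peer lost.

    Heartbeats flow in both directions until `partition_at_ms`, after which
    the (hypothetical) network partition drops all of them.
    """
    monitors = (
        HeartbeatMonitor("ResourceManager", timeout_ms),  # watches the TaskManager
        HeartbeatMonitor("TaskManager", timeout_ms),      # watches the ResourceManager
    )
    events = {}
    for now in range(0, end_ms + 1, interval_ms):
        for monitor in monitors:
            if now < partition_at_ms:
                monitor.on_heartbeat(now)
            if monitor.owner not in events and monitor.timed_out(now):
                events[monitor.owner] = now
    return events


if __name__ == "__main__":
    # Partition starts at 30 s; the last heartbeat either side saw was at 20 s.
    print(simulate(partition_at_ms=30_000, end_ms=120_000))
```

In this model both sides reach the timeout independently: the ResourceManager side would release the TaskManager, and the TaskManager side would shut itself down, without either needing to reach the other.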
On Thu, May 13, 2021 at 11:06 AM narasimha <swamy.haj...@gmail.com> wrote:

> Hi,
>
> Trying to understand how the JobManager kills a TaskManager that didn't
> respond to heartbeats after a certain time.
>
> For example:
>
> If the network connection b/w the JobManager and a TaskManager is lost for
> some reason, the JobManager will bring up another TaskManager after the
> heartbeat timeout.
> In such a case, how does the JobManager make sure all connections, e.g. to
> Kafka, from the lost TaskManager are cut and that the new one resumes from
> a certain consistent point?
>
> Also want to learn ways to debug what caused the timeout. Our job handles
> about 5k records/s, not a heavy-traffic job.
> --
> A.Narasimha Swamy