Re: How does JobManager terminate dangling task manager

2021-05-13 Thread Xintong Song
Hi narasimha,

Each TaskManager (TM) has two kinds of connections to the JobManager
process:
- A single connection to the ResourceManager (RM), which allows the RM to
  monitor slot availability and assign slots to Flink jobs.
- One connection to each JobMaster that slots of this TM are assigned to.

Upon a JobMaster-TM disconnection, all tasks running on the TM that belong
to the corresponding job fail immediately. Taking the Kafka source as an
example, that is the point at which the task stops consuming data from Kafka.
Upon an RM-TM disconnection, the TM kills itself if it cannot reconnect to
the RM within a certain time.
Since the JobMaster and the RM run in the same process, when one of the two
connections breaks, the other usually breaks as well. When it does not, an
RM-TM disconnection alone does not fail the running tasks until the
reconnection timeout expires.
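
The two reactions described above can be pictured with a small toy model.
This is only a sketch of the behaviour as stated in this mail, not Flink
code: the class ToyTaskManager, its method names, and the parameter
rm_reconnect_timeout_s are all made up for illustration, and the timeout
value is arbitrary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyTaskManager:
    """Toy model of a TM's reaction to lost connections (hypothetical, not Flink code)."""

    rm_reconnect_timeout_s: float  # hypothetical stand-in for the reconnection timeout
    tasks_by_job: Dict[str, List[str]] = field(default_factory=dict)
    alive: bool = True
    rm_disconnected_at: Optional[float] = None

    def on_jobmaster_disconnected(self, job_id: str) -> List[str]:
        # Tasks of that job fail immediately; tasks of other jobs are untouched.
        return self.tasks_by_job.pop(job_id, [])

    def on_rm_disconnected(self, now_s: float) -> None:
        # Running tasks are not failed; only the reconnection clock starts.
        if self.rm_disconnected_at is None:
            self.rm_disconnected_at = now_s

    def on_rm_reconnected(self) -> None:
        # A successful reconnection cancels the pending self-termination.
        self.rm_disconnected_at = None

    def tick(self, now_s: float) -> None:
        # The TM terminates itself once the reconnection timeout has elapsed.
        if self.rm_disconnected_at is None:
            return
        if now_s - self.rm_disconnected_at >= self.rm_reconnect_timeout_s:
            self.alive = False
            self.tasks_by_job.clear()


tm = ToyTaskManager(
    rm_reconnect_timeout_s=300.0,
    tasks_by_job={"job-a": ["kafka-source-0"], "job-b": ["map-1"]},
)
failed_tasks = tm.on_jobmaster_disconnected("job-a")  # only job-a's tasks fail
tm.on_rm_disconnected(now_s=0.0)
tm.tick(now_s=100.0)  # still within the timeout: TM alive, job-b keeps running
alive_before_timeout = tm.alive
tm.tick(now_s=300.0)  # timeout reached: TM gives up and exits
```

In this model the JobMaster disconnection affects only the tasks of that
one job, while the RM disconnection affects nothing until the timeout is
reached, which matches the distinction drawn above.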

As for failover consistency, that is guaranteed by the checkpointing
mechanism. The new task does not resume from the exact position where the
old task stopped. Instead, it resumes from the last successful checkpoint.
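
To illustrate resuming from the last successful checkpoint rather than
from the failure position, here is a minimal toy sketch. It is not how
Flink or the Kafka connector is implemented: run_source, checkpoint_every
and fail_at are hypothetical names, and a checkpoint is reduced to "the
next offset to read".

```python
def run_source(records, start_offset, checkpoint_every, fail_at=None):
    """Read records from start_offset; return (emitted, last_checkpoint, failed).

    last_checkpoint is the next offset to read as of the most recent
    checkpoint. If fail_at is given, the task fails before reading it.
    """
    emitted = []
    last_checkpoint = start_offset
    for offset in range(start_offset, len(records)):
        if fail_at is not None and offset == fail_at:
            return emitted, last_checkpoint, True
        emitted.append(records[offset])
        # Take a checkpoint after every `checkpoint_every` records.
        if (offset + 1) % checkpoint_every == 0:
            last_checkpoint = offset + 1
    return emitted, last_checkpoint, False


records = list(range(10))

# The old task reads offsets 0..5 and fails at offset 6;
# its last checkpoint was taken after offset 3.
first_run, checkpoint, failed = run_source(records, 0, checkpoint_every=4, fail_at=6)

# The new task starts from the checkpointed offset, not from offset 6.
second_run, _, _ = run_source(records, checkpoint, checkpoint_every=4)
```

Here the old task had already read offsets 4 and 5 when it failed, and the
new task reads them again because it starts at the checkpointed offset 4.
Since operator state is restored from the same checkpoint, the re-read
records are not counted twice in that state.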

Thank you~

Xintong Song





Re: How does JobManager terminate dangling task manager

2021-05-13 Thread Guowei Ma
Hi,
In fact, not only will the JobManager (ResourceManager) kill a TaskManager
whose heartbeat has timed out; a TaskManager that finds it cannot connect to
the JobManager (ResourceManager) will also exit on its own.
You can check the logs for the period in which the heartbeat (HB) timeout
occurred to see what happened. I would usually also look at the GC activity
around that time.
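
As a rough sketch of that GC check: assuming the GC pauses have already
been extracted from the GC log into (start_s, duration_s) pairs, the
helper below picks out long pauses that overlap the period just before the
timeout. The function name, the input format, and all numbers are
assumptions for illustration; parsing a real GC log is not shown.

```python
def suspicious_pauses(pauses, timeout_at_s, window_s, min_pause_s):
    """Return long GC pauses overlapping the window before the timeout.

    pauses: list of (start_s, duration_s) tuples, already parsed (assumed format).
    timeout_at_s: time at which the heartbeat timeout was logged.
    window_s: how far back before the timeout to look.
    min_pause_s: ignore pauses shorter than this.
    """
    window_start = timeout_at_s - window_s
    return [
        (start, duration)
        for start, duration in pauses
        if duration >= min_pause_s
        and start <= timeout_at_s
        and start + duration >= window_start
    ]


# Made-up example data: one long pause shortly before a timeout at t=150s.
pauses = [(10.0, 0.2), (95.0, 48.0), (200.0, 0.1)]
hits = suspicious_pauses(pauses, timeout_at_s=150.0, window_s=60.0, min_pause_s=5.0)
```

A pause whose length is comparable to the heartbeat timeout and that falls
just before the timeout is a plausible cause, since the process cannot
answer heartbeats while it is paused.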
Best,
Guowei




How does JobManager terminate dangling task manager

2021-05-12 Thread narasimha
Hi,

I am trying to understand how the JobManager kills a TaskManager that has
not responded to heartbeats for a certain time.

For example:

If the network connection between the JobManager and a TaskManager is lost
for some reason, the JobManager will bring up another TaskManager after the
heartbeat timeout.
In such a case, how does the JobManager make sure that all connections from
the lost TaskManager, such as those to Kafka, are cut off, and that the new
one resumes from a consistent point?

I would also like to learn how to debug what caused the timeout. Our job
handles about 5k records/s, which is not heavy traffic.
-- 
A.Narasimha Swamy