Re: Task manager not able to rejoin job manager after network hiccup

Aljoscha Krettek Fri, 23 Feb 2018 05:42:30 -0800

@Till Is this the expected behaviour or do you suspect something could be going 
wrong?


> On 23. Feb 2018, at 08:59, jelmer <jkupe...@gmail.com> wrote:
> 
> We've observed on our Flink 1.4.0 setup that if the network connection
> between the task manager and the job manager is disrupted for any reason,
> the task manager is never able to reconnect.
> 
> You'll end up with messages like these being printed to the log repeatedly:
> 
> Trying to register at JobManager 
> akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 17, timeout: 30000 
> milliseconds)
> Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable 
> or has not been restarted. Keeping it quarantined.
> 
> Or alternatively:
> 
> Tried to associate with unreachable remote address 
> [akka.tcp://flink@jobmanager:6123]. Address is now gated for 5000 ms, all 
> messages to this address will be delivered to dead letters. Reason: [The 
> remote system has quarantined this system. No further associations to the 
> remote system are possible until this system is restarted.
> 
> But it never recovers until you restart either the job manager or the task
> manager.
> 
> I was able to reproduce this behaviour in two Docker containers. The setup
> is here:
> 
> https://github.com/jelmerk/flink-worker-not-rejoining
> 
> Has anyone else seen this problem?
