@Till Is this the expected behaviour or do you suspect something could be going wrong?
> On 23. Feb 2018, at 08:59, jelmer <jkupe...@gmail.com> wrote:
>
> We've observed on our Flink 1.4.0 setup that if, for some reason, the
> networking between the task manager and the job manager gets disrupted,
> the task manager is never able to reconnect.
>
> You'll end up with messages like this getting printed to the log repeatedly:
>
> Trying to register at JobManager
> akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 17, timeout: 30000
> milliseconds)
> Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable
> or has not been restarted. Keeping it quarantined.
>
> Or alternatively:
>
> Tried to associate with unreachable remote address
> [akka.tcp://flink@jobmanager:6123]. Address is now gated for 5000 ms, all
> messages to this address will be delivered to dead letters. Reason: [The
> remote system has quarantined this system. No further associations to the
> remote system are possible until this system is restarted.
>
> It never recovers until you restart either the job manager or the task
> manager.
>
> I was able to reproduce this behaviour with two Docker containers here:
>
> https://github.com/jelmerk/flink-worker-not-rejoining
>
> Has anyone else seen this problem?