I don't think it's entirely the same thing. It seems that, by design, once
a worker misses a heartbeat for whatever reason, be it a network hiccup or
a long stop-the-world garbage collection, it gets quarantined and will not
recover from that until it is restarted.
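
To make sure we mean the same thing, here is a toy model of the behaviour as
I understand it. It is only a sketch, not Akka's or Flink's actual code: the
class name, the method names and the timeout value are all made up for
illustration, and I am assuming the quarantine is tied to one incarnation
(address plus a per-start uid) of the remote process, which is why only a
restart clears it.

```python
class HeartbeatWatcher:
    """Toy failure detector (hypothetical; not the Akka or Flink API)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}       # (address, uid) -> time of last heartbeat
        self.quarantined = set()  # incarnations that are permanently rejected

    def heartbeat(self, address, uid, now):
        """Record a heartbeat; return False if this incarnation is quarantined."""
        peer = (address, uid)
        if peer in self.quarantined:
            return False  # no way back for this incarnation
        self.last_seen[peer] = now
        return True

    def check(self, now):
        """Quarantine every peer whose last heartbeat is older than the timeout."""
        for peer, seen in list(self.last_seen.items()):
            if now - seen > self.timeout_s:
                self.quarantined.add(peer)
                del self.last_seen[peer]


watcher = HeartbeatWatcher(timeout_s=10)
watcher.heartbeat("taskmanager-1", uid=1, now=0)
watcher.check(now=20)  # heartbeat missed, so the peer is quarantined
same_process = watcher.heartbeat("taskmanager-1", uid=1, now=21)  # rejected
restarted = watcher.heartbeat("taskmanager-1", uid=2, now=22)     # accepted
```

In this model the same process can send as many heartbeats as it likes after
the network recovers and still be rejected. Only a restart, which I model as
a new uid, lets it back in.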

That is what the post by Till in the thread you linked seems to indicate.

I assumed that a system like Flink would be able to recover from this, and
that if it does not, it's a bug.

Your problem seems to be that, for some reason, Flink misses the heartbeats
under heavy load.

I just simulated a missed heartbeat by blocking traffic to the job manager.




On 24 February 2018 at 15:57, ashish pok <ashish...@yahoo.com> wrote:

> We see the same in 1.4. I don't think we saw this in 1.3. I had
> started a thread a while back on this. Till asked for more details. I
> haven't had a chance to get back to him on this. If you can repro this
> easily, perhaps you can get to it faster. I will find the thread and resend.
>
> Thanks,
>
> -- Ashish
>
> On Fri, Feb 23, 2018 at 9:56 AM, jelmer
> <jkupe...@gmail.com> wrote:
> We found out there's a taskmanager.exit-on-fatal-akka-error property that
> will restart Flink in this situation, but it is not enabled by default and
> that feels like a rather blunt tool. I expect systems like this to be more
> resilient to this.
>
> On 23 February 2018 at 14:42, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> @Till Is this the expected behaviour or do you suspect something could be
> going wrong?
>
>
> On 23. Feb 2018, at 08:59, jelmer <jkupe...@gmail.com> wrote:
>
> We've observed on our Flink 1.4.0 setup that if, for some reason, the
> networking between the task manager and the job manager gets disrupted,
> the task manager is never able to reconnect.
>
> You'll end up with messages like this getting printed to the log repeatedly:
>
> Trying to register at JobManager
> akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 17, timeout:
> 30000 milliseconds)
> Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable
> or has not been restarted. Keeping it quarantined.
>
>
> Or alternatively:
>
>
> Tried to associate with unreachable remote address
> [akka.tcp://flink@jobmanager:6123]. Address is now gated for 5000 ms, all
> messages to this address will be delivered to dead letters. Reason: [The
> remote system has quarantined this system. No further associations to the
> remote system are possible until this system is restarted.]
>
>
> But it never recovers until you restart either the job manager or the task
> manager.
>
> I was able to successfully reproduce this behaviour in two Docker
> containers here:
>
> https://github.com/jelmerk/flink-worker-not-rejoining
>
> Has anyone else seen this problem?
>