Stephan Ewen created FLINK-3347:
-----------------------------------

             Summary: TaskManager ActorSystems need to restart themselves in case they notice quarantine
                 Key: FLINK-3347
                 URL: https://issues.apache.org/jira/browse/FLINK-3347
             Project: Flink
          Issue Type: Improvement
          Components: TaskManager
    Affects Versions: 0.10.1
            Reporter: Stephan Ewen
             Fix For: 1.0.0


There are cases where Akka quarantines remote actor systems. In that case, no 
further communication is possible with that actor system unless one of the two 
actor systems is restarted.

The result is that a TaskManager is up and available, but cannot register with 
the JobManager (Akka refuses the connection because of the quarantined state), 
making the TaskManager a useless process.

I suggest letting the TaskManager restart itself once it notices that either it 
has quarantined the JobManager, or the JobManager has quarantined it.

The quarantined state can be detected by listening to certain events in the 
actor system's event stream: 
http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
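
The detection-and-restart idea could be sketched roughly as follows. This is a 
dependency-free Java illustration, not Akka or Flink code: EventStream, 
RemotingEvent, Kind and QuarantineMonitor are hypothetical stand-ins, and the 
address string is made up. In a real TaskManager the monitor would presumably be 
an actor subscribed to the actor system's event stream for Akka's remoting 
lifecycle events (for example akka.remote.QuarantinedEvent); which events are 
available depends on the Akka version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: every name here is a hypothetical stand-in, not an Akka or Flink API.
final class QuarantineWatch {

    /** Stand-in for the remoting lifecycle events an actor system publishes. */
    enum Kind { WE_QUARANTINED_REMOTE, REMOTE_QUARANTINED_US, OTHER }

    static final class RemotingEvent {
        final Kind kind;
        final String remoteAddress;

        RemotingEvent(Kind kind, String remoteAddress) {
            this.kind = kind;
            this.remoteAddress = remoteAddress;
        }
    }

    /** Stand-in for the event stream: delivers each event to all subscribers. */
    static final class EventStream {
        private final List<Consumer<RemotingEvent>> subscribers = new ArrayList<>();

        void subscribe(Consumer<RemotingEvent> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(RemotingEvent event) {
            for (Consumer<RemotingEvent> s : subscribers) {
                s.accept(event);
            }
        }
    }

    /** Runs the restart action once when a quarantine involving the JobManager is seen. */
    static final class QuarantineMonitor implements Consumer<RemotingEvent> {
        private final String jobManagerAddress;
        private final Runnable restartAction;
        private boolean restartRequested = false;

        QuarantineMonitor(String jobManagerAddress, Runnable restartAction) {
            this.jobManagerAddress = jobManagerAddress;
            this.restartAction = restartAction;
        }

        @Override
        public void accept(RemotingEvent event) {
            boolean quarantine = event.kind == Kind.WE_QUARANTINED_REMOTE
                    || event.kind == Kind.REMOTE_QUARANTINED_US;
            // Ignore unrelated events, other remote systems, and repeated notifications.
            if (quarantine && !restartRequested
                    && jobManagerAddress.equals(event.remoteAddress)) {
                restartRequested = true;
                restartAction.run();
            }
        }

        boolean isRestartRequested() {
            return restartRequested;
        }
    }

    public static void main(String[] args) {
        String jobManager = "akka.tcp://flink@jobmanager:6123"; // illustrative address
        EventStream stream = new EventStream();
        stream.subscribe(new QuarantineMonitor(jobManager,
                () -> System.out.println("Quarantine detected, restarting actor system")));

        stream.publish(new RemotingEvent(Kind.OTHER, jobManager));
        stream.publish(new RemotingEvent(Kind.REMOTE_QUARANTINED_US, jobManager));
    }
}
```

The monitor reacts to both directions of quarantine and fires only once, so a 
burst of repeated events would not trigger several restarts. What the restart 
action actually does (restarting only the actor system or the whole process) is 
left open here.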



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
