[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329046#comment-14329046 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/368#issuecomment-75254417

    @tillrohrmann and @StephanEwen worked on some other reliability issues. Will the changes in this PR be subsumed by the upcoming changes? If not, we should merge this. :-)

JobManager restart does not notify the TaskManager
--------------------------------------------------

    Key: FLINK-1484
    URL: https://issues.apache.org/jira/browse/FLINK-1484
    Project: Flink
    Issue Type: Bug
    Reporter: Till Rohrmann
    Assignee: Till Rohrmann
    Fix For: 0.9

A JobManager can be restarted, for example after an uncaught exception. However, connected TaskManagers are not informed about the disconnection and continue sending messages to a JobManager whose state has been reset. TaskManagers should be informed about a possible restart and clean up their own state in such a case. Afterwards, they can try to reconnect to the restarted JobManager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329056#comment-14329056 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user tillrohrmann closed the pull request at:

    https://github.com/apache/flink/pull/368
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329055#comment-14329055 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/368#issuecomment-75256160

    This PR has been merged as part of PR #423
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327635#comment-14327635 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/423

    [FLINK-1484] [FLINK-1499] Adds explicit disconnect messages in case of an actor shutdown

    Introduces explicit disconnect messages which are sent from the JobManager to its TaskManagers, and from a TaskManager to its JobManager, in case of a graceful actor termination. These disconnect messages allow a faster recovery from failure, so that a clean state is reached quickly.

    Contains minor Scala cleanups.

    This PR is based on #419

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink taskManagerDisconnect

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/423.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #423

commit 8cc604d61d75370972146333c5a016b5fcdddc77
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-19T10:04:56Z

    [FLINK-1584] [runtime][tests] Fixes TaskManagerFailsITCase by replacing the TestingCluster with a ForkableFlinkMiniCluster

commit 21660683633df999b86a7240929e07b8935e17df
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-17T14:54:42Z

    [Flink-1484] [runtime] Adds explicit disconnect message for TaskManagers

commit b2ff739feb6915bb131d1aeac7ca772eb4f85cba
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-17T15:36:34Z

    [FLINK-1499] [runtime] TaskManager sends explicit disconnect message to JobManager in case of shutdown
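The two-way notification described in pull request #423 can be pictured with a small plain-Scala model. This is only a sketch under stated assumptions: it uses ordinary method calls instead of Akka actors, and the names `ShutdownNotice`, `JobManagerStopping`, `TaskManagerStopping`, `JobManagerSim` and `TaskManagerSim` are hypothetical and do not come from the Flink code base. It shows each side telling its peer about a graceful stop, so that the peer can drop the stale reference immediately rather than discovering the failure later.

```scala
import scala.collection.mutable

// Hypothetical message names, one per direction of the notification.
sealed trait ShutdownNotice
final case class JobManagerStopping(reason: String) extends ShutdownNotice
final case class TaskManagerStopping(taskManagerId: String) extends ShutdownNotice

// Simplified stand-in for a TaskManager (not Flink's implementation).
final class TaskManagerSim(val id: String) {
  var jobManager: Option[JobManagerSim] = None

  // A stopping JobManager tells us to forget it.
  def receive(msg: ShutdownNotice): Unit = msg match {
    case JobManagerStopping(_) => jobManager = None
    case _                     => ()
  }

  // Graceful stop: tell the JobManager first, then drop the reference.
  def postStop(): Unit = {
    jobManager.foreach(_.receive(TaskManagerStopping(id)))
    jobManager = None
  }
}

// Simplified stand-in for a JobManager (not Flink's implementation).
final class JobManagerSim {
  val registered = mutable.Map.empty[String, TaskManagerSim]

  def register(tm: TaskManagerSim): Unit = {
    registered(tm.id) = tm
    tm.jobManager = Some(this)
  }

  // A stopping TaskManager is removed from the registry.
  def receive(msg: ShutdownNotice): Unit = msg match {
    case TaskManagerStopping(id) => registered.remove(id); ()
    case _                       => ()
  }

  // Graceful stop: notify every registered TaskManager, then clear the registry.
  def postStop(): Unit = {
    registered.values.foreach(_.receive(JobManagerStopping("JobManager is stopping")))
    registered.clear()
  }
}
```

In this model, calling `postStop()` on either side leaves the other side without a dangling reference, which is the "clean state" the pull request aims to reach quickly.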
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327759#comment-14327759 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/423#issuecomment-75096035

    Rebased the PR on the pending PR #422
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311958#comment-14311958 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/368#issuecomment-73478074

    Looks good. Since this is a behavior change, can you file a ticket for this, Till?
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308834#comment-14308834 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/368#discussion_r24227786

    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -125,6 +126,10 @@ Actor with ActorLogMessages with ActorLogging {
       override def postStop(): Unit = {
         log.info(s"Stopping job manager ${self.path}.")

    +    // disconnect the registered task managers
    +    instanceManager.getAllRegisteredInstances.asScala.foreach{
    +      _.getTaskManager ! Disconnected("JobManager is stopping")}
    +
         for((e,_) <- currentJobs.values){
           e.fail(new Exception("The JobManager is shutting down."))
    --- End diff --

    Thanks Henry for spotting it. I corrected it.
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309033#comment-14309033 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/368#discussion_r24236048

    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/messages/TaskManagerMessages.scala ---
    @@ -129,6 +129,13 @@ object TaskManagerMessages {
        * @param cause reason for the external failure
        */
       case class FailTask(executionID: ExecutionAttemptID, cause: Throwable)
    +
    +  /**
    +   * Makes the TaskManager to disconnect from the registered JobManager
    --- End diff --

    You're right. Thanks, I changed it.
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308540#comment-14308540 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/flink/pull/368#discussion_r24220454

    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -125,6 +126,10 @@ Actor with ActorLogMessages with ActorLogging {
       override def postStop(): Unit = {
         log.info(s"Stopping job manager ${self.path}.")

    +    // disconnect the registered task managers
    +    instanceManager.getAllRegisteredInstances.asScala.foreach{
    +      _.getTaskManager ! Disconnected("JobManager is stopping")}
    +
         for((e,_) <- currentJobs.values){
           e.fail(new Exception("The JobManager is shutting down."))
    --- End diff --

    Since we are cleaning up messages, maybe remove "The" so it is consistent with other messages.
[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager
[ https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307890#comment-14307890 ]

ASF GitHub Bot commented on FLINK-1484:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/368

    [FLINK-1484] Adds explicit disconnect message for TaskManagers

    In case of a JobManager restart, which can be caused by an uncaught exception, all connected TaskManagers are notified by a ```Disconnected``` message. This message triggers the cleanup of the TaskManagers and makes them try to reconnect to the JobManager.

    This PR also includes a test case to verify the aforementioned behaviour.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink akkaSupervision

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/368.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #368

commit 081721a01b5070dccdc807bf3ecd14d5b9de1b24
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-05T19:58:59Z

    Adds explicit disconnect message to tell TaskManagers that the JobManager has failed
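The behaviour described in pull request #368 (a `Disconnected` message triggers cleanup on the TaskManager, which then tries to reconnect) might look roughly like the plain-Scala model below. It is a minimal sketch, not the code from the pull request: only the message name `Disconnected` appears in the thread, while `TaskManagerModel`, its methods, and the use of a string address are assumptions made for illustration, and no Akka actors are involved.

```scala
import scala.collection.mutable

// Hypothetical message; only the name comes from the pull request description.
final case class Disconnected(reason: String)

// Simplified, non-actor stand-in for a TaskManager (not Flink's implementation).
final class TaskManagerModel {
  private val runningTasks = mutable.Set.empty[String]
  private var jobManager: Option[String] = None

  def isConnected: Boolean = jobManager.isDefined
  def taskCount: Int = runningTasks.size

  // Registering (or re-registering) records the JobManager's address.
  def register(address: String): Unit = {
    jobManager = Some(address)
  }

  def startTask(id: String): Unit = {
    runningTasks += id
  }

  // On Disconnected, discard local task state and forget the JobManager,
  // so that a later register() starts from a clean state.
  def handle(msg: Disconnected): Unit = {
    runningTasks.clear()
    jobManager = None
  }
}

object DisconnectDemo {
  def main(args: Array[String]): Unit = {
    val tm = new TaskManagerModel
    tm.register("jobmanager-1")
    tm.startTask("task-a")
    tm.handle(Disconnected("JobManager is stopping"))
    println(s"connected=${tm.isConnected}, tasks=${tm.taskCount}")
    tm.register("jobmanager-1") // reconnect attempt after cleanup
    println(s"connected=${tm.isConnected}, tasks=${tm.taskCount}")
  }
}
```

The point of clearing local state before reconnecting is that the restarted JobManager no longer knows about the old tasks, so a TaskManager that kept them would report state the JobManager cannot interpret.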