[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329046#comment-14329046
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user uce commented on the pull request:

https://github.com/apache/flink/pull/368#issuecomment-75254417
  
@tillrohrmann and @StephanEwen worked on some other reliability issues. Will 
the changes in this PR be subsumed by the upcoming changes? If not, we should 
merge this. :-)


 JobManager restart does not notify the TaskManager
 --

 Key: FLINK-1484
 URL: https://issues.apache.org/jira/browse/FLINK-1484
 Project: Flink
  Issue Type: Bug
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 0.9


 The JobManager can be restarted, for example after an uncaught exception. 
 However, connected TaskManagers are not informed about the disconnection and 
 continue sending messages to a JobManager whose state has been reset. 
 TaskManagers should be informed about a possible restart and clean up their 
 own state in such a case. Afterwards, they can try to reconnect to the 
 restarted JobManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329056#comment-14329056
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user tillrohrmann closed the pull request at:

https://github.com/apache/flink/pull/368


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329055#comment-14329055
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/368#issuecomment-75256160
  
This PR has been merged as part of PR #423 


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327635#comment-14327635
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/423

[FLINK-1484] [FLINK-1499] Adds explicit disconnect messages in case of an 
actor shutdown

Introduces explicit disconnect messages that the JobManager sends to its 
TaskManagers, and that a TaskManager sends to its JobManager, when the actor 
terminates gracefully. These disconnect messages allow faster recovery from 
failures, so that a clean state is reached quickly.
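
To illustrate the TaskManager-to-JobManager direction (the FLINK-1499 commit), 
here is a minimal plain-Scala sketch. It is not Akka code and not the code in 
this PR: `DisconnectNotice`, `JobManagerSide` and `TaskManagerSide` are 
hypothetical names, and method calls stand in for actor messages. The 
assumption is that, without an explicit notice, the JobManager would only 
learn about the departure through some slower failure detection.

```scala
import scala.collection.mutable

// Hypothetical message; the real message names used in the PR are not shown in this thread.
final case class DisconnectNotice(senderId: String, reason: String)

// Hypothetical stand-in for the JobManager's view of registered TaskManagers.
final class JobManagerSide {
  val registered: mutable.Set[String] = mutable.Set.empty[String]

  def register(id: String): Unit = { registered += id }

  // On an explicit notice, forget the sender immediately.
  def receive(msg: DisconnectNotice): Unit = { registered -= msg.senderId }
}

// Hypothetical stand-in for a TaskManager that registers on construction.
final class TaskManagerSide(val id: String, jobManager: JobManagerSide) {
  jobManager.register(id)

  // Graceful shutdown: tell the JobManager before going away.
  def shutdown(): Unit =
    jobManager.receive(DisconnectNotice(id, "TaskManager is shutting down"))
}
```

After `shutdown()` the JobManager side no longer lists the TaskManager, so it 
can reach a clean state without waiting for anything else to time out.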

Contains minor Scala cleanups.

This PR is based on #419 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink taskManagerDisconnect

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/423.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #423


commit 8cc604d61d75370972146333c5a016b5fcdddc77
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-19T10:04:56Z

[FLINK-1584] [runtime][tests] Fixes TaskManagerFailsITCase by replacing the 
TestingCluster with a ForkableFlinkMiniCluster

commit 21660683633df999b86a7240929e07b8935e17df
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-17T14:54:42Z

[Flink-1484] [runtime] Adds explicit disconnect message for TaskManagers

commit b2ff739feb6915bb131d1aeac7ca772eb4f85cba
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-17T15:36:34Z

[FLINK-1499] [runtime] TaskManager sends explicit disconnect message to 
JobManager in case of shutdown




[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327759#comment-14327759
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/423#issuecomment-75096035
  
Rebased the PR on the pending PR #422 


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311958#comment-14311958
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/368#issuecomment-73478074
  
Looks good. Since this is a behavior change, can you file a ticket for 
this, Till?


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308834#comment-14308834
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/368#discussion_r24227786
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -125,6 +126,10 @@ Actor with ActorLogMessages with ActorLogging {
   override def postStop(): Unit = {
     log.info(s"Stopping job manager ${self.path}.")
 
+    // disconnect the registered task managers
+    instanceManager.getAllRegisteredInstances.asScala.foreach{
+      _.getTaskManager ! Disconnected("JobManager is stopping")}
+
     for((e,_) <- currentJobs.values){
       e.fail(new Exception("The JobManager is shutting down."))
--- End diff --

Thanks Henry for spotting it. I corrected it.


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309033#comment-14309033
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/368#discussion_r24236048
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/messages/TaskManagerMessages.scala
 ---
@@ -129,6 +129,13 @@ object TaskManagerMessages {
    * @param cause reason for the external failure
    */
   case class FailTask(executionID: ExecutionAttemptID, cause: Throwable)
+
+  /**
+   * Makes the TaskManager to disconnect from the registered JobManager
--- End diff --

You're right. Thanks, I changed it.


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308540#comment-14308540
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/368#discussion_r24220454
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
@@ -125,6 +126,10 @@ Actor with ActorLogMessages with ActorLogging {
   override def postStop(): Unit = {
     log.info(s"Stopping job manager ${self.path}.")
 
+    // disconnect the registered task managers
+    instanceManager.getAllRegisteredInstances.asScala.foreach{
+      _.getTaskManager ! Disconnected("JobManager is stopping")}
+
     for((e,_) <- currentJobs.values){
       e.fail(new Exception("The JobManager is shutting down."))
--- End diff --

Since we are cleaning up messages, maybe remove "The" so it is consistent 
with other messages.


[jira] [Commented] (FLINK-1484) JobManager restart does not notify the TaskManager

2015-02-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307890#comment-14307890
 ] 

ASF GitHub Bot commented on FLINK-1484:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/368

[FLINK-1484] Adds explicit disconnect message for TaskManagers

In case of a JobManager restart, which can be caused by an uncaught 
exception, all connected TaskManagers are notified by a ```Disconnected``` 
message. This message triggers the cleanup of the TaskManagers and makes them 
try to reconnect to the JobManager.
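
As a rough illustration of that flow, here is a self-contained plain-Scala 
model. It does not use Akka and is not the code in this PR: only the 
`Disconnected` message name comes from the description above, while 
`TaskManagerModel`, `JobManagerModel`, `DisconnectedDemo` and all of their 
fields are hypothetical. Direct method calls stand in for actor messages, and 
a counter stands in for scheduling a new registration attempt.

```scala
import scala.collection.mutable

// Message name from the PR description; the `reason` field is an assumption.
final case class Disconnected(reason: String)

// Hypothetical stand-in for a TaskManager; not Flink's actual class.
final class TaskManagerModel(val name: String) {
  val runningTasks: mutable.Set[String] = mutable.Set.empty[String]
  var connected: Boolean = false
  var reconnectAttempts: Int = 0

  def onRegistered(): Unit = { connected = true }

  def receive(msg: Disconnected): Unit = {
    runningTasks.clear()    // drop state that belonged to the old JobManager
    connected = false
    reconnectAttempts += 1  // stand-in for scheduling a new registration
  }
}

// Hypothetical stand-in for the JobManager.
final class JobManagerModel {
  private val registered = mutable.ListBuffer.empty[TaskManagerModel]

  def register(tm: TaskManagerModel): Unit = {
    registered += tm
    tm.onRegistered()
  }

  // Plays the role of postStop(): notify every registered TaskManager.
  def stop(): Unit = {
    registered.foreach(_.receive(Disconnected("JobManager is stopping")))
    registered.clear()
  }
}

object DisconnectedDemo {
  // Registers one TaskManager with one task, stops the JobManager, returns the TaskManager.
  def run(): TaskManagerModel = {
    val jm = new JobManagerModel
    val tm = new TaskManagerModel("tm-1")
    jm.register(tm)
    tm.runningTasks += "task-a"
    jm.stop()
    tm
  }

  def main(args: Array[String]): Unit = {
    val tm = run()
    println(s"connected=${tm.connected}, tasks=${tm.runningTasks.size}, attempts=${tm.reconnectAttempts}")
  }
}
```

After `stop()`, the modelled TaskManager holds no tasks, is marked as 
disconnected, and has recorded one reconnect attempt.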

This PR also includes a test case to verify the afore-mentioned behaviour.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink akkaSupervision

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/368.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #368


commit 081721a01b5070dccdc807bf3ecd14d5b9de1b24
Author: Till Rohrmann trohrm...@apache.org
Date:   2015-02-05T19:58:59Z

Adds explicit disconnect message to tell TaskManagers that the JobManager 
has failed



