[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later

2017-07-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107243#comment-16107243
 ] 

ASF GitHub Bot commented on FLINK-6213:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/3640


> When number of failed containers exceeds maximum failed containers and 
> application is stopped, the AM container will be released 10 minutes later 
> --
>
> Key: FLINK-6213
> URL: https://issues.apache.org/jira/browse/FLINK-6213
> Project: Flink
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Yelei Feng
>
> When the number of failed containers exceeds the maximum number of failed 
> containers and the application is stopped, the AM container is released only 
> 10 minutes later. I checked the YARN log and found that after invoking 
> {{unregisterApplicationMaster}}, the AM container is not released. After 10 
> minutes, the release is triggered by the RM ping check timeout.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later

2017-07-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107040#comment-16107040
 ] 

ASF GitHub Bot commented on FLINK-6213:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/3640#discussion_r130309444
  
--- Diff: flink-yarn/src/main/java/org/apache/flink/yarn/YarnFlinkResourceManager.java ---
@@ -300,6 +301,8 @@ protected void shutdownApplication(ApplicationStatus finalStatus, String optiona
         } catch (Throwable t) {
             LOG.error("Could not cleanly shut down the Node Manager Client", t);
         }
+
+        self().tell(decorateMessage(PoisonPill.getInstance()), self());
--- End diff --

I would directly call `getContext().system().stop(self())`, because that 
way we immediately stop processing any further messages.
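
The difference between the two approaches can be sketched with a toy mailbox. This is a minimal single-threaded model in plain Java, not Akka itself; `MailboxSketch`, `run`, and the message names are hypothetical and exist only for illustration. It assumes the usual Akka behavior: a poison pill is enqueued like any other message, so messages already waiting are still handled, while stopping the actor directly ends processing after the current message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of an actor mailbox. NOT Akka; every name here is hypothetical.
class MailboxSketch {
    // Stand-in for akka.actor.PoisonPill.
    static final Object POISON_PILL = new Object();

    /** Returns the messages the toy actor handled before it stopped. */
    static List<String> run(boolean stopImmediately) {
        Queue<Object> mailbox = new ArrayDeque<>();
        // Two messages are already queued behind the shutdown request.
        mailbox.add("shutdown");
        mailbox.add("late-1");
        mailbox.add("late-2");

        List<String> handled = new ArrayList<>();
        boolean stopped = false;
        while (!stopped && !mailbox.isEmpty()) {
            Object msg = mailbox.poll();
            if (msg == POISON_PILL) {
                stopped = true; // the pill takes effect only when it is dequeued
                continue;
            }
            handled.add((String) msg);
            if (msg.equals("shutdown")) {
                if (stopImmediately) {
                    stopped = true;           // models stop(self())
                } else {
                    mailbox.add(POISON_PILL); // models self().tell(PoisonPill, self())
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println("poison pill: " + run(false));
        System.out.println("direct stop: " + run(true));
    }
}
```

In this model the poison-pill variant still handles `late-1` and `late-2`, while the direct stop handles only `shutdown`.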




[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later

2017-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948213#comment-15948213
 ] 

ASF GitHub Bot commented on FLINK-6213:
---

Github user barcahead commented on the issue:

https://github.com/apache/flink/pull/3640
  
@StephanEwen Thanks for the review. I tested this in my environment, and 
`ProcessReaper` receives the `Terminated` message successfully. Because 
`PoisonPill` is not an instance of `RequiresLeaderSessionID`, it is 
unchanged after being decorated.
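
The reasoning above can be sketched as follows. This is a plain-Java model of the behavior described in the comment, not Flink's actual decorator; `DecorateSketch`, `NeedsLeaderSession`, `SessionEnvelope`, and `SessionScopedRequest` are hypothetical names. It assumes only what the comment states: a message is wrapped with the leader session ID when it implements the marker interface, and passes through untouched otherwise.

```java
import java.util.UUID;

// Toy model of session-ID decoration. NOT Flink code; all names are hypothetical.
class DecorateSketch {
    /** Hypothetical marker for messages that must carry the leader session ID. */
    interface NeedsLeaderSession {}

    /** Hypothetical envelope that pairs a message with the leader session ID. */
    static final class SessionEnvelope {
        final UUID leaderSessionId;
        final Object payload;

        SessionEnvelope(UUID leaderSessionId, Object payload) {
            this.leaderSessionId = leaderSessionId;
            this.payload = payload;
        }
    }

    /** A message that opts in to decoration. */
    static final class SessionScopedRequest implements NeedsLeaderSession {}

    /** Stand-in for akka.actor.PoisonPill, which does not implement the marker. */
    static final class PoisonPill {}

    /** Wraps only marked messages; everything else is returned as the same object. */
    static Object decorate(UUID leaderSessionId, Object message) {
        if (message instanceof NeedsLeaderSession) {
            return new SessionEnvelope(leaderSessionId, message);
        }
        return message;
    }

    public static void main(String[] args) {
        UUID session = UUID.randomUUID();
        Object pill = new PoisonPill();
        System.out.println("pill unchanged: " + (decorate(session, pill) == pill));
        System.out.println("request wrapped: "
                + (decorate(session, new SessionScopedRequest()) instanceof SessionEnvelope));
    }
}
```

Because the pill comes back as the very same object, the receiving actor would see an ordinary poison pill rather than an envelope it has to unwrap.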




[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later

2017-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946901#comment-15946901
 ] 

ASF GitHub Bot commented on FLINK-6213:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3640
  
Good idea to add the poison pill. But does it actually work when the poison 
pill is decorated?




[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later

2017-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946842#comment-15946842
 ] 

ASF GitHub Bot commented on FLINK-6213:
---

GitHub user barcahead opened a pull request:

https://github.com/apache/flink/pull/3640

[FLINK-6213] [yarn] terminate resource manager itself when shutting down 
application

When the number of failed containers exceeds the maximum number of failed 
containers, `YarnFlinkResourceManager` receives a `StopCluster` message and 
then invokes `shutdownApplication`. This method calls 
`amrmclient.unregisterApplicationMaster` to finish the application. However, 
the AM container is not released until 10 minutes later, when the RM ping 
check times out. 
I fixed this by having the resource manager terminate itself after 
unregistering the application master, so that the process exits and the 
container is released.
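
The sequence described above can be modeled roughly as below. This is a toy sketch in plain Java, not the patch itself; `ShutdownSketch`, `shutdown`, and the event strings are hypothetical. It assumes that the process exits only once a watcher is notified that the resource manager has stopped.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the shutdown sequence. NOT Flink code; all names are hypothetical.
class ShutdownSketch {
    /** Returns the ordered events of one shutdown, with or without self-termination. */
    static List<String> shutdown(boolean terminateSelf) {
        List<String> events = new ArrayList<>();
        // Stand-in for a watcher that exits the process once the actor terminates.
        Runnable watcher = () -> events.add("process exits");

        events.add("unregisterApplicationMaster");
        if (terminateSelf) {
            events.add("resource manager stopped");
            watcher.run(); // termination is what wakes the watcher
        }
        // Without self-termination nothing notifies the watcher, so the
        // process stays up until an external timeout cleans it up.
        return events;
    }

    public static void main(String[] args) {
        System.out.println("before fix: " + shutdown(false));
        System.out.println("after fix:  " + shutdown(true));
    }
}
```

Only the variant that terminates the resource manager reaches the `process exits` event, which matches the behavior the pull request aims for.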


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/barcahead/flink FLINK-6213

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/3640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3640


commit 1f4c91af090189d8a797a500701689b6639c4a85
Author: fengyelei 
Date:   2017-03-29T03:40:24Z

[FLINK-6213] [yarn] terminate resource manager itself when shutting down 
application



