[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later
[ https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107243#comment-16107243 ]

ASF GitHub Bot commented on FLINK-6213:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/3640

> When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later
> --------------------------------------------------------------------
>
>                 Key: FLINK-6213
>                 URL: https://issues.apache.org/jira/browse/FLINK-6213
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Yelei Feng
>
> When the number of failed containers exceeds the maximum number of failed containers and the application is stopped, the AM container is released only 10 minutes later. I checked the YARN log and found that, after {{unregisterApplicationMaster}} is invoked, the AM container is not released. After 10 minutes, the release is triggered by the RM ping check timeout.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later
[ https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107040#comment-16107040 ]

ASF GitHub Bot commented on FLINK-6213:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3640#discussion_r130309444

    --- Diff: flink-yarn/src/main/java/org/apache/flink/yarn/YarnFlinkResourceManager.java ---
    @@ -300,6 +301,8 @@ protected void shutdownApplication(ApplicationStatus finalStatus, String optiona
     		} catch (Throwable t) {
     			LOG.error("Could not cleanly shut down the Node Manager Client", t);
     		}
    +
    +		self().tell(decorateMessage(PoisonPill.getInstance()), self());
    --- End diff --

    I would call `getContext().system().stop(self())` directly, because that way we immediately stop processing any further messages.
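The difference between the two shutdown styles discussed in this review comment can be illustrated with a toy mailbox. This is a minimal sketch using only the JDK, not Akka code: `MailboxStopDemo`, `POISON_PILL` and the message names are hypothetical, and the model assumes that a direct stop takes effect as soon as the current message has been handled, which simplifies what a real actor system does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Toy single-threaded mailbox; a model of the ordering semantics only, not Akka code. */
final class MailboxStopDemo {

    /** Sentinel playing the role of a poison pill in this model (hypothetical). */
    static final Object POISON_PILL = new Object();

    /**
     * Drains the mailbox in FIFO order and returns the messages that were handled.
     * While handling "shutdown", the model either stops at once (stopImmediately)
     * or enqueues the pill behind whatever is already waiting.
     */
    static List<String> run(List<String> pending, boolean stopImmediately) {
        Queue<Object> mailbox = new ArrayDeque<>(pending);
        List<String> handled = new ArrayList<>();
        boolean stopped = false;
        while (!stopped && !mailbox.isEmpty()) {
            Object msg = mailbox.poll();
            if (msg == POISON_PILL) {
                stopped = true; // the pill is only seen after everything queued before it
            } else {
                handled.add((String) msg);
                if (msg.equals("shutdown")) {
                    if (stopImmediately) {
                        stopped = true; // remaining messages are never handled
                    } else {
                        mailbox.add(POISON_PILL);
                    }
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        List<String> pending = Arrays.asList("shutdown", "containerCompleted", "requestContainers");
        System.out.println("pill: " + run(pending, false)); // all three messages are handled
        System.out.println("stop: " + run(pending, true));  // only "shutdown" is handled
    }
}
```

In this model the pill variant still handles the two messages that were queued behind `shutdown`, while the direct stop handles none of them.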
[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later
[ https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948213#comment-15948213 ]

ASF GitHub Bot commented on FLINK-6213:
---------------------------------------

Github user barcahead commented on the issue:

    https://github.com/apache/flink/pull/3640

    @StephanEwen Thanks for the review. I tested this in my environment, and `ProcessReaper` receives the `Terminated` message successfully. Because the `PoisonPill` message is not an instance of `RequiresLeaderSessionID`, decorating it leaves it unchanged.
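The reasoning in this reply (decoration only wraps messages that carry the leader-session marker, so a pill passes through untouched) can be sketched as follows. This is an illustrative JDK-only model, not Flink's actual code: `LeaderSessionMarker`, `LeaderSessionEnvelope` and `StopClusterModel` are hypothetical stand-ins, and whether any particular Flink message carries the marker is an assumption made for the example.

```java
import java.util.UUID;

/** Simplified model of marker-based message decoration; all names are hypothetical. */
final class DecorateDemo {

    /** Stand-in for the marker interface that requests a leader session ID. */
    interface LeaderSessionMarker {}

    /** Stand-in envelope that pairs a message with the current leader session ID. */
    static final class LeaderSessionEnvelope {
        final UUID leaderSessionId;
        final Object message;

        LeaderSessionEnvelope(UUID leaderSessionId, Object message) {
            this.leaderSessionId = leaderSessionId;
            this.message = message;
        }
    }

    /** Example message that is assumed, for illustration only, to carry the marker. */
    static final class StopClusterModel implements LeaderSessionMarker {}

    /** Wraps marked messages in an envelope; returns every other message unchanged. */
    static Object decorateMessage(UUID leaderSessionId, Object message) {
        if (message instanceof LeaderSessionMarker) {
            return new LeaderSessionEnvelope(leaderSessionId, message);
        }
        return message;
    }

    public static void main(String[] args) {
        UUID session = UUID.randomUUID();
        Object pill = new Object(); // plays the role of the poison pill: no marker
        System.out.println(decorateMessage(session, pill) == pill); // true
        System.out.println(decorateMessage(session, new StopClusterModel())
                instanceof LeaderSessionEnvelope);                  // true
    }
}
```

Because the unmarked object comes back as the same reference, the receiving side still sees the pill itself rather than an envelope around it.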
[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later
[ https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946901#comment-15946901 ]

ASF GitHub Bot commented on FLINK-6213:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3640

    Good idea to add the poison pill. But does it actually work when the poison pill is decorated?
[jira] [Commented] (FLINK-6213) When number of failed containers exceeds maximum failed containers and application is stopped, the AM container will be released 10 minutes later
[ https://issues.apache.org/jira/browse/FLINK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946842#comment-15946842 ]

ASF GitHub Bot commented on FLINK-6213:
---------------------------------------

GitHub user barcahead opened a pull request:

    https://github.com/apache/flink/pull/3640

    [FLINK-6213] [yarn] terminate resource manager itself when shutting down application

    When the number of failed containers exceeds the maximum number of failed containers, `YarnFlinkResourceManager` receives the `StopCluster` message and then invokes `shutdownApplication`. That method calls `amrmclient.unregisterApplicationMaster` to finish the application, but the AM container is not released until 10 minutes later, when the RM ping check times out.

    I fix this issue by terminating the resource manager itself after unregistering the application master; the process then exits and the container is released.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/barcahead/flink FLINK-6213

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3640

commit 1f4c91af090189d8a797a500701689b6639c4a85
Author: fengyelei
Date:   2017-03-29T03:40:24Z

    [FLINK-6213] [yarn] terminate resource manager itself when shutting down application
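The mechanism described in the pull request (a watcher ends the process only once the resource manager has actually terminated) can be sketched with a plain `CompletableFuture`. This is a minimal JDK-only model under stated assumptions: `ReaperDemo` and `ResourceManagerModel` are hypothetical names, the future stands in for the actor's termination signal, and a flag stands in for the process exit so the example stays testable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of "exit the process once the watched component terminates". */
final class ReaperDemo {

    /** Stand-in for the resource manager; names and structure are hypothetical. */
    static final class ResourceManagerModel {
        /** Completed when the component has terminated. */
        final CompletableFuture<Void> terminated = new CompletableFuture<>();
        boolean unregistered = false;

        void shutdownApplication(boolean terminateSelf) {
            unregistered = true; // models unregistering the application master
            if (terminateSelf) {
                terminated.complete(null); // models the component stopping itself
            }
        }
    }

    /** Returns true if the exit hook ran after shutdownApplication was called. */
    static boolean exitsAfterShutdown(boolean terminateSelf) {
        ResourceManagerModel rm = new ResourceManagerModel();
        AtomicBoolean exited = new AtomicBoolean(false);
        // The "reaper": reacts to termination; a real one would end the process here.
        rm.terminated.thenRun(() -> exited.set(true));
        rm.shutdownApplication(terminateSelf);
        return exited.get();
    }

    public static void main(String[] args) {
        System.out.println("without self-termination: " + exitsAfterShutdown(false)); // false
        System.out.println("with self-termination:    " + exitsAfterShutdown(true));  // true
    }
}
```

In the first case the model unregisters but never signals termination, so nothing triggers the exit hook; this mirrors the situation in which the container lingers until an external timeout releases it.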