[
https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983892#comment-13983892
]
Wangda Tan commented on YARN-1885:
----------------------------------
Thanks for the review, Jian!
I agree with your #1, #2, and #4; they're very clear. I'll address them later.
For #3,
bq. There are two routes to notify the finished apps one from the app and one
from the register call. It’s good to have a single source of notification to
cover all possible race conditions. we may send an event to the app and let app
make decision to notify the node to cleanup the applications or not.
+1 for this (see the proposed solution below).
bq. Today RMAppAttempt is capturing all the ranNodes, probably we need to move
that to the RMApp also.
+1 for this too: RMAppAttempt captures ranNodes on its side, but RMApp is what
actually uses them. The set is also copied on every transition from one attempt
to the next:
{code}
public void transferStateFromPreviousAttempt(RMAppAttempt attempt) {
  this.justFinishedContainers = attempt.getJustFinishedContainers();
  this.ranNodes = attempt.getRanNodes();
}
{code}
To address both points, would you agree to add an "ADD_NODE" event? We can send
an ADD_NODE event to RMApp; handling it will not change RMApp's state. RMApp
will:
1) Save the node to ranNodes if its state has not yet reached finalStateStored;
2) Otherwise, send an application_cleanup event to the RMNode.
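The two-step handling above could look roughly like the sketch below. It is only an illustration, not patch code: AddNodeSketch, App, NodeNotifier, onAddNode and onFinalStateStored are hypothetical stand-ins rather than the real RMApp/RMNode classes, and the notify-all-recorded-nodes step when the final state is stored is my assumption about the existing cleanup path.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins only; none of these are the real YARN classes.
class AddNodeSketch {

    /** Stands in for delivering an application_cleanup event to a node. */
    interface NodeNotifier {
        void cleanupApplication(String nodeId, String appId);
    }

    /** Simplified app that owns ranNodes, as proposed for RMApp. */
    static final class App {
        private final String appId;
        private final NodeNotifier notifier;
        private final Set<String> ranNodes = new LinkedHashSet<>();
        private boolean finalStateStored = false;

        App(String appId, NodeNotifier notifier) {
            this.appId = appId;
            this.notifier = notifier;
        }

        /** Handles the proposed ADD_NODE event; never changes the app's state. */
        void onAddNode(String nodeId) {
            if (!finalStateStored) {
                // 1) Final state not stored yet: remember the node for later cleanup.
                ranNodes.add(nodeId);
            } else {
                // 2) App already finished: ask the node to clean up right away.
                notifier.cleanupApplication(nodeId, appId);
            }
        }

        /** Assumed existing path: once the final state is stored, notify recorded nodes. */
        void onFinalStateStored() {
            finalStateStored = true;
            for (String nodeId : ranNodes) {
                notifier.cleanupApplication(nodeId, appId);
            }
        }

        Set<String> getRanNodes() {
            return ranNodes;
        }
    }

    public static void main(String[] args) {
        List<String> cleaned = new ArrayList<>();
        App app = new App("app_1", (node, id) -> cleaned.add(node + ":" + id));
        app.onAddNode("node1");      // recorded, no cleanup yet
        app.onFinalStateStored();    // cleanup sent to node1
        app.onAddNode("node2");      // late node (e.g. registers after RM restart)
        System.out.println(cleaned); // prints [node1:app_1, node2:app_1]
    }
}
```

With this shape the app is the single source of cleanup notifications, so a node that shows up after the app has finished still gets cleaned up.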
Any thoughts? Thanks!
> RM may not send the finished signal to some nodes where the application ran
> after RM restarts
> ---------------------------------------------------------------------------------------------
>
> Key: YARN-1885
> URL: https://issues.apache.org/jira/browse/YARN-1885
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.0
> Reporter: Arpit Gupta
> Assignee: Wangda Tan
> Attachments: YARN-1885.patch, YARN-1885.patch
>
>
> During our HA testing we have seen cases where YARN application logs are not
> available through the CLI, but I can look at AM logs through the UI. The RM was
> also being restarted in the background while the application was running.