[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045959#comment-14045959 ]
Jason Lowe commented on YARN-1341:
----------------------------------
Agreed, it's not ideal to discuss handling state store errors for all NM
components in this JIRA. In general I'd prefer to discuss and address each
case in the corresponding JIRA, e.g., application state store errors
discussed and addressed in YARN-1354, container state store errors in
YARN-1337, etc. If we feel there's significant utility in committing a JIRA
before all the issues are addressed, then we can file one or more follow-up
JIRAs to track those outstanding issues. That's the normal process we follow
with other features/fixes as well.

So if we follow that process, then we're back to the discussion about how to
handle RM master keys that cannot be stored in the state store. The choices
we've discussed are:
1) Log an error, update the master key in memory, and continue
2) Log an error, _not_ update the master key in memory, and continue
3) Log an error and tear down the NM

I'd prefer 1), since that is the option that preserves the most work in all
scenarios I can think of, and I don't know of a scenario where 2) would handle
it better. However, I could be convinced given the right scenario. I'd really
rather avoid 3), since that seems like a severe way to "handle" the error and
guarantees work is lost.
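
To make the three choices concrete, here is a minimal sketch of how they
differ when the store call fails. It is not taken from any YARN-1341 patch:
the class, the KeyStore interface, the Policy enum and the integer key id are
all hypothetical stand-ins for illustration only.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; none of these names exist in the NM code base.
class MasterKeyErrorHandling {
    private static final Logger LOG =
        Logger.getLogger(MasterKeyErrorHandling.class.getName());

    /** The three choices listed above, in order. */
    enum Policy { UPDATE_AND_CONTINUE, SKIP_AND_CONTINUE, TEAR_DOWN }

    /** Stand-in for the NM state store; the write may fail. */
    interface KeyStore {
        void storeMasterKey(int keyId) throws IOException;
    }

    private final KeyStore store;
    private final Policy policy;
    private int currentKeyId;       // master key held in memory
    private boolean running = true; // false once the "NM" is torn down

    MasterKeyErrorHandling(KeyStore store, Policy policy, int initialKeyId) {
        this.store = store;
        this.policy = policy;
        this.currentKeyId = initialKeyId;
    }

    /** Called when a new master key arrives from the RM. */
    void onNewMasterKey(int newKeyId) {
        try {
            store.storeMasterKey(newKeyId);
        } catch (IOException e) {
            // All three choices log the error first.
            LOG.log(Level.SEVERE, "Unable to store master key " + newKeyId, e);
            if (policy == Policy.SKIP_AND_CONTINUE) {
                return;             // choice 2: keep the old key in memory
            }
            if (policy == Policy.TEAR_DOWN) {
                running = false;    // choice 3: stop the NM
                return;
            }
            // choice 1: fall through and update memory anyway
        }
        currentKeyId = newKeyId;
    }

    int currentKeyId() { return currentKeyId; }

    boolean isRunning() { return running; }
}
```

With a store that always throws, UPDATE_AND_CONTINUE leaves the new key in
memory and the node running, SKIP_AND_CONTINUE leaves the old key in place,
and TEAR_DOWN stops the node.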

Oh, there is one more handling scenario we briefly discussed, where we flag the
NM as "undesirable". When that occurs we don't shoot the containers that are
running, but we avoid adding new containers since the node is having issues
(i.e., a drain-decommission). I feel that would be a separate JIRA since it
needs YARN-914, and we'd still need to decide how to handle the error until the
decommission is complete (i.e., choice 1 or 2 above).
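
As a rough illustration of the "undesirable" idea only, the sketch below
keeps running containers alive while refusing new ones once the node is
flagged. It does not reflect the YARN-914 design; the DrainingNode class and
all of its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a drain-style flag; not the YARN-914 design.
class DrainingNode {
    private final Set<String> running = new HashSet<>();
    private boolean undesirable;

    /** Flag the node, e.g. after a state store error. */
    void markUndesirable() {
        undesirable = true;
    }

    /** Accept a new container only while the node is not flagged. */
    boolean startContainer(String containerId) {
        if (undesirable) {
            return false; // existing containers are untouched, new ones refused
        }
        return running.add(containerId);
    }

    /** Containers finish on their own; nothing is killed by the flag. */
    void containerFinished(String containerId) {
        running.remove(containerId);
    }

    int runningCount() {
        return running.size();
    }

    /** The drain is complete once the flagged node has no containers left. */
    boolean isDrained() {
        return undesirable && running.isEmpty();
    }
}
```

Until isDrained() returns true the node is still doing work, which is why a
store failure in that window would still need choice 1 or 2.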
> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>