[
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785306#comment-13785306
]
Hitesh Shah commented on YARN-867:
----------------------------------
[~xgong] [~bikassaha] [~vinodkv] This fix seems to be getting quite complex,
and failing containers on service event handling errors could introduce a
number of race conditions.
I propose the following:
- Catch Throwable wherever an aux service is invoked to handle container-related
events (app init, container start, container stop, app cleanup), and do not
fail the container if an exception is thrown.
- Add a simple check that matches the service metadata in the
ContainerLaunchContext against the services configured on the NM in question.
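A rough sketch of what the first bullet could look like, written as standalone
Java rather than against the actual NM code. The AuxService interface, the
dispatcher class, and the event names here are all hypothetical stand-ins, not
the real YARN types; the only point is that a Throwable from one service is
logged and swallowed so the dispatch loop keeps running.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AuxServiceIsolationSketch {

    /** Hypothetical stand-in for an aux service's container-event callback. */
    interface AuxService {
        void handle(String eventType, String containerId) throws Exception;
    }

    private final Map<String, AuxService> services = new LinkedHashMap<>();
    private int failures = 0;

    void register(String name, AuxService service) {
        services.put(name, service);
    }

    /**
     * Hands the event to every registered service. Anything a service throws
     * is caught and logged; it neither stops the loop nor fails the container.
     */
    void dispatch(String eventType, String containerId) {
        for (Map.Entry<String, AuxService> entry : services.entrySet()) {
            try {
                entry.getValue().handle(eventType, containerId);
            } catch (Throwable t) {
                failures++;
                System.err.println("Aux service " + entry.getKey()
                        + " failed on " + eventType + " for " + containerId
                        + ": " + t);
            }
        }
    }

    int failureCount() {
        return failures;
    }

    public static void main(String[] args) {
        AuxServiceIsolationSketch dispatcher = new AuxServiceIsolationSketch();
        // A misbehaving service that throws a non-IOException.
        dispatcher.register("bad", (type, id) -> {
            throw new IllegalStateException("bad data");
        });
        // A well-behaved service registered after it still gets the event.
        dispatcher.register("good", (type, id) ->
                System.out.println("good handled " + type + " for " + id));
        dispatcher.dispatch("app-init", "container_1");
        System.out.println("failures: " + dispatcher.failureCount());
    }
}
```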
With the above, at the very least, we can catch mis-configured NMs where the
shuffle service is not configured. This is much simpler, as it can be done as a
simple synchronous check when handling the startContainers RPC call. This could
be targeted to 2.1.2/2.2.0.
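A minimal sketch of that synchronous check, again as standalone Java. A plain
Map<String, ByteBuffer> stands in for the service data carried by the
ContainerLaunchContext, and the class, method names, service names, and choice
of exception are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ServiceDataCheckSketch {

    /** Returns the requested service names that are not configured on this NM. */
    static List<String> findUnconfiguredServices(
            Map<String, ByteBuffer> requestedServiceData,
            Set<String> configuredServices) {
        List<String> missing = new ArrayList<>();
        for (String name : requestedServiceData.keySet()) {
            if (!configuredServices.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    /**
     * Rejects the request up front, before any container state is created,
     * if it names a service this NM does not run.
     */
    static void validate(Map<String, ByteBuffer> requestedServiceData,
                         Set<String> configuredServices) {
        List<String> missing =
                findUnconfiguredServices(requestedServiceData, configuredServices);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "Aux services not configured on this NM: " + missing);
        }
    }

    public static void main(String[] args) {
        // Hypothetical service names; the NM here only runs "other_service".
        Set<String> configured = Set.of("other_service");
        Map<String, ByteBuffer> requested =
                Map.of("shuffle_service", ByteBuffer.allocate(0));
        try {
            validate(requested, configured);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because the check runs inline in the request path, the caller gets the error
directly in the RPC response and no asynchronous failure reporting is needed.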
As for the failing containers, I propose that we target 2.3.0 for feeding
container failures caused by service handling errors back to the AM. For the
2.3.0-targeted JIRA, I would prefer to widen the scope to include a design for
differentiating critical from non-critical services, so that the framework is
in place to decide which services' errors result in failed containers.
Comments?
> Isolation of failures in aux services
> --------------------------------------
>
> Key: YARN-867
> URL: https://issues.apache.org/jira/browse/YARN-867
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Hitesh Shah
> Assignee: Xuan Gong
> Priority: Critical
> Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch,
> YARN-867.4.patch, YARN-867.5.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a
> service. For example, sending data to the ShuffleService such that it results
> in any non-IOException will cause the NM's async dispatcher to exit as the
> service's INIT APP event is not handled properly.