[
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745794#comment-13745794
]
Xuan Gong commented on YARN-867:
--------------------------------
bq. Let's just handle the NM crash scenario here. And for informing the AM,
instead of adding more protocol changes, we can fail the container, setting a
proper diagnostic and maybe a custom exit-code.
I agree. I think we can solve this issue in a simpler way. Here is the
proposal:
If an auxService throws an exception, we still need to catch it. After that, we
can fail the related container by sending a containerExitEvent with
ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, along with a proper
diagnostic and a custom exit-code. Eventually, this container will transition
to the Completed state, and we can inform the RM through the node heartbeat.
The related RMContainer will then receive the diagnostic info and custom
exit-code, and will also go to the completed state. So, when the AM does its
heartbeat, it will get the list of completed containerStatus. After that, the
AM simply needs to check the exit code to find out whether any auxService
failed.
Attached is the sample code for this proposal.
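Separately from the attached patch, here is a minimal, self-contained sketch of
the flow described above. It does not use the real NodeManager classes: the
AuxService interface, the CompletedStatus class, the method names, and the
EXIT_AUX_SERVICE_FAILED value are all hypothetical stand-ins chosen for
illustration. It only shows the idea of catching the service's exception,
turning it into a failed container status with a diagnostic and a custom
exit-code, and letting the AM recognize it by exit code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class AuxServiceFailureSketch {

    // Hypothetical custom exit code; the actual value is not fixed in this comment.
    static final int EXIT_AUX_SERVICE_FAILED = -1001;

    // Simplified stand-in for an auxiliary service such as the ShuffleService.
    interface AuxService {
        void initApp(String containerId, byte[] serviceData);
    }

    // Simplified stand-in for the completed-container status the AM receives.
    static final class CompletedStatus {
        final String containerId;
        final int exitCode;
        final String diagnostics;

        CompletedStatus(String containerId, int exitCode, String diagnostics) {
            this.containerId = containerId;
            this.exitCode = exitCode;
            this.diagnostics = diagnostics;
        }
    }

    // NM side: catch the service's exception instead of letting it reach the
    // dispatcher, and report the container as failed with the custom exit code.
    static Optional<CompletedStatus> initAuxService(
            AuxService service, String containerId, byte[] serviceData) {
        try {
            service.initApp(containerId, serviceData);
            return Optional.empty(); // no failure; the container proceeds normally
        } catch (RuntimeException e) {
            return Optional.of(new CompletedStatus(
                    containerId, EXIT_AUX_SERVICE_FAILED, "Aux service failed: " + e));
        }
    }

    // AM side: scan the completed statuses and pick out aux service failures.
    static List<String> containersFailedByAuxService(List<CompletedStatus> completed) {
        List<String> failed = new ArrayList<>();
        for (CompletedStatus status : completed) {
            if (status.exitCode == EXIT_AUX_SERVICE_FAILED) {
                failed.add(status.containerId);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // A service that rejects empty data with a non-IOException.
        AuxService strict = (id, data) -> {
            if (data.length == 0) {
                throw new IllegalArgumentException("empty service data");
            }
        };
        List<CompletedStatus> completed = new ArrayList<>();
        initAuxService(strict, "container_1", new byte[] {1}).ifPresent(completed::add);
        initAuxService(strict, "container_2", new byte[0]).ifPresent(completed::add);
        System.out.println(containersFailedByAuxService(completed));
    }
}
```

In this sketch only the container that sent bad data is failed, and the caller
keeps running, which is the isolation this issue asks for.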
> Isolation of failures in aux services
> --------------------------------------
>
> Key: YARN-867
> URL: https://issues.apache.org/jira/browse/YARN-867
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Hitesh Shah
> Assignee: Xuan Gong
> Priority: Critical
> Attachments: YARN-867.1.sampleCode.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a
> service. For example, sending data to the ShuffleService such that it results
> in any non-IOException will cause the NM's async dispatcher to exit, as the
> service's INIT APP event is not handled properly.