[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785306#comment-13785306
 ] 

Hitesh Shah commented on YARN-867:
----------------------------------

[~xgong] [~bikassaha] [~vinodkv] This fix seems to be getting quite complex,
and failing containers on service event handling errors could introduce a lot
of different race conditions.

I propose the following:

   - Catch Throwable wherever an aux service is invoked to handle
container-related events (app init, container start, container stop, app
cleanup), and do not fail the container if an exception is thrown.
   - Add a simpler check that matches the service metadata in the
ContainerLaunchContext against the services configured on the NM in
question.
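
To illustrate the first item above, here is a minimal sketch of isolating
each aux service call. AuxServiceSketch and IsolatingAuxDispatcher are
hypothetical stand-ins invented for this example, not the NM's actual
classes, and the event names are passed as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an aux service; NOT the real YARN interface.
interface AuxServiceSketch {
    void handle(String eventType, String containerId);
}

// Hypothetical dispatcher that isolates failures of individual services.
class IsolatingAuxDispatcher {
    private final Map<String, AuxServiceSketch> services = new LinkedHashMap<>();
    private final List<String> errors = new ArrayList<>();

    void register(String name, AuxServiceSketch service) {
        services.put(name, service);
    }

    // Delivers the event to every registered service. A Throwable from one
    // service is recorded and swallowed, so it neither stops delivery to the
    // remaining services nor propagates to the caller.
    void dispatch(String eventType, String containerId) {
        for (Map.Entry<String, AuxServiceSketch> entry : services.entrySet()) {
            try {
                entry.getValue().handle(eventType, containerId);
            } catch (Throwable t) {
                // A real NM would log this; the sketch only records it.
                errors.add(entry.getKey() + ":" + eventType + ":" + t);
            }
        }
    }

    List<String> errors() {
        return errors;
    }
}
```

With this shape, a service that throws on an app init event leaves the
calling thread running and the container untouched.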

Using the above, at the very least, we can catch issues related to
misconfigured NMs where the shuffle service is not configured. This is much
simpler, as it could be done as a synchronous check when handling the
startContainers RPC call. This could be targeted to 2.1.2/2.2.0.
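
A rough sketch of such a synchronous check, assuming the service metadata
arrives as a map from service name to payload (as the ContainerLaunchContext
service data does). AuxServiceConfigCheck and the service name used in the
usage note are hypothetical, chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of the NM code base.
class AuxServiceConfigCheck {

    // Returns the requested service names that are not configured on this
    // node, in sorted order.
    static Set<String> missingServices(Map<String, ByteBuffer> serviceData,
                                       Set<String> configuredServices) {
        Set<String> missing = new TreeSet<>(serviceData.keySet());
        missing.removeAll(configuredServices);
        return missing;
    }

    // Rejects the request up front if any requested service is missing.
    static void validate(Map<String, ByteBuffer> serviceData,
                         Set<String> configuredServices) {
        Set<String> missing = missingServices(serviceData, configuredServices);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                "Aux service(s) not configured on this NM: " + missing);
        }
    }
}
```

For example, a launch request carrying data for a service named "shuffle"
would be rejected on a node whose configured set does not contain "shuffle",
before any container is started.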

As for the failing containers, I propose that we target fixing the feedback of
failed containers back to the AM on service handling errors in 2.3.0. For the
2.3.0-targeted JIRA, I would prefer to widen the scope to include a design for
differentiating critical vs non-critical services, so that the framework is in
place to determine which services' errors result in failed containers.
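
One possible shape for that framework, purely as a sketch: each service
declares a criticality level, and only errors from critical services fail the
container. ServiceCriticalityPolicy, its method names, and the choice to
treat undeclared services as non-critical are all assumptions made for this
example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical policy object; no such class exists in the NM today.
class ServiceCriticalityPolicy {
    enum Criticality { CRITICAL, NON_CRITICAL }

    private final Map<String, Criticality> levels = new HashMap<>();

    void declare(String serviceName, Criticality level) {
        levels.put(serviceName, level);
    }

    // Decides whether an error raised by the named service should fail the
    // container. Undeclared services default to NON_CRITICAL (an assumption).
    boolean shouldFailContainer(String serviceName) {
        Criticality level =
            levels.getOrDefault(serviceName, Criticality.NON_CRITICAL);
        return level == Criticality.CRITICAL;
    }
}
```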

Comments? 




> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
> YARN-867.4.patch, YARN-867.5.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a 
> service. For example, sending data to the ShuffleService such that it results 
> in any non-IOException will cause the NM's async dispatcher to exit, as the 
> service's INIT APP event is not handled properly. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)
