[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765884#comment-13765884
 ] 

Zhijie Shen commented on YARN-867:
----------------------------------

Sorry for posting the broken comment earlier.

Thinking about the problem again: essentially, an AuxiliaryService 
implementation may throw a RuntimeException (or some other Throwable) and kill 
the NM dispatcher thread. Wrapping the calls into the service with try/catch is 
basically enough to prevent the NM failure.
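
To make the try/catch idea concrete, here is a minimal sketch. It is not the 
NM code or any attached patch: AuxService, SafeAuxServices and handleInitApp 
are hypothetical stand-ins invented for illustration, and the real 
AuxiliaryService API has different signatures.

```java
import java.util.ArrayList;
import java.util.List;

public class AuxServiceIsolationSketch {

    /** Hypothetical stand-in for an auxiliary service; not the real YARN API. */
    interface AuxService {
        void initializeApplication(String appId);
    }

    /** Hypothetical dispatcher-side handler that isolates service failures. */
    static final class SafeAuxServices {
        private final List<AuxService> services = new ArrayList<>();
        private final List<Throwable> failures = new ArrayList<>();

        void register(AuxService service) {
            services.add(service);
        }

        /** Calls every service; a Throwable from one service never escapes. */
        void handleInitApp(String appId) {
            for (AuxService service : services) {
                try {
                    service.initializeApplication(appId);
                } catch (Throwable t) {
                    // Record the failure instead of letting it propagate and
                    // terminate the calling (dispatcher) thread.
                    failures.add(t);
                }
            }
        }

        List<Throwable> failures() {
            return failures;
        }
    }

    public static void main(String[] args) {
        SafeAuxServices aux = new SafeAuxServices();
        // A misbehaving service that throws an unchecked exception.
        aux.register(appId -> {
            throw new IllegalStateException("bad data for " + appId);
        });
        // A well-behaved service that just remembers the application id.
        List<String> initialized = new ArrayList<>();
        aux.register(initialized::add);

        aux.handleInitApp("app_1");
        System.out.println("failures=" + aux.failures().size()
                + ", initialized=" + initialized);
    }
}
```

Note that catching Throwable also swallows Errors such as OutOfMemoryError; 
whether that is acceptable here is a separate design question.
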
The next task is to handle the Throwable from the AuxiliaryService. In the 
previous discussion, the plan was to fail the container directly and let the AM 
know that the container failed due to AUXSERVICE_FAILED. For MR that may be 
okay, because MR jobs cannot run properly without the ShuffleHandler. However, 
should the NM always make the decision to fail the container? I'm concerned that:
1. The NM doesn't know what the AuxiliaryService does for the application or 
how important it is.
2. The NM doesn't know how critical the exception is, or whether it is 
transient or reproducible.
So what if the application can tolerate the AuxiliaryService failure? For 
example, if the AuxiliaryService just does some node-local monitoring work, the 
application can still complete without it. Therefore, I'm wondering whether we 
should leave the decision to the AM, since the application knows best how to 
handle the exception. The NM would just need to expose the AuxiliaryService 
failure to the application in some way. Thoughts?
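
A rough sketch of what "leave the decision to the AM" could look like. 
Everything here is hypothetical and invented for illustration (the 
AuxServiceFailure type, the FailureListener callback, the Decision enum and the 
service names "shuffle" and "node_monitor"); no such API exists in YARN, and 
the real NM-to-AM path is remote, not the in-process callback used here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AuxFailurePolicySketch {

    /** What the application wants done with the affected container. */
    enum Decision { CONTINUE, FAIL_CONTAINER }

    /** Hypothetical description of a failure that the NM exposes to the AM. */
    static final class AuxServiceFailure {
        final String serviceName;
        final String containerId;
        final Throwable cause;

        AuxServiceFailure(String serviceName, String containerId, Throwable cause) {
            this.serviceName = serviceName;
            this.containerId = containerId;
            this.cause = cause;
        }
    }

    /** Hypothetical AM-side hook: the application decides, not the NM. */
    interface FailureListener {
        Decision onAuxServiceFailure(AuxServiceFailure failure);
    }

    /** Example policy: fail the container only if a required service broke. */
    static FailureListener policyRequiring(Set<String> criticalServices) {
        return failure -> criticalServices.contains(failure.serviceName)
                ? Decision.FAIL_CONTAINER
                : Decision.CONTINUE;
    }

    public static void main(String[] args) {
        // An MR-like application that cannot run without its shuffle service.
        FailureListener mrLikePolicy =
                policyRequiring(new HashSet<>(Arrays.asList("shuffle")));

        AuxServiceFailure shuffleFailure = new AuxServiceFailure(
                "shuffle", "container_1", new RuntimeException("bad data"));
        AuxServiceFailure monitorFailure = new AuxServiceFailure(
                "node_monitor", "container_1", new RuntimeException("probe failed"));

        System.out.println(mrLikePolicy.onAuxServiceFailure(shuffleFailure));
        System.out.println(mrLikePolicy.onAuxServiceFailure(monitorFailure));
    }
}
```

With this split the NM only reports what failed and why; a policy like the one 
above fails the container for the shuffle-style service and keeps running when 
only the monitoring-style service breaks.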
                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
> YARN-867.4.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a 
> service. For example, sending data to the ShuffleService such that it results 
> in any non-IOException will cause the NM's async dispatcher to exit, because 
> the service's INIT APP event is not handled properly. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
