[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765878#comment-13765878 ]
Zhijie Shen commented on YARN-867:
----------------------------------

Think about the problem again. Essentially, the problem is that an AuxiliaryService implementation may throw a RuntimeException (or another Throwable) and kill the NM dispatcher thread. Wrapping the calling statements in try/catch can basically prevent the NM failure.

The next task is to handle the Throwable from the AuxiliaryService. In the previous thread, the plan was to fail the container directly and let the AM know that the container failed due to AUXSERVICE_FAILED. For MR this may be okay, because MR jobs cannot run properly without ShuffleHandler. However, should the NM always make the decision to fail the container? I'm concerned that:

1. NM doesn't know what the AuxiliaryService does for the application or how important it is.
2. NM doesn't know how critical the exception is, or whether it is transient or reproducible.

Therefore, if the application can toleran

> Isolation of failures in aux services
> -------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch,
>                      YARN-867.4.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a
> service. For example, sending data to the ShuffleService such that it results
> in any non-IOException will cause the NM's async dispatcher to exit as the
> service's INIT APP event is not handled properly.
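The try/catch wrapping described in the comment above could be sketched roughly as follows. This is only an illustration and not the attached patch: AuxServiceIsolationSketch, AuxService, AuxFailure and dispatchInitApp are hypothetical names invented for this example and are not the real Hadoop YARN classes. The sketch assumes that each auxiliary service is called in turn from a single dispatcher thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are the real Hadoop YARN classes.
final class AuxServiceIsolationSketch {

    /** Hypothetical stand-in for an auxiliary service callback. */
    interface AuxService {
        void initializeApplication(String appId);
    }

    /** Records which service failed for which application, and why. */
    static final class AuxFailure {
        final String serviceName;
        final String appId;
        final Throwable cause;

        AuxFailure(String serviceName, String appId, Throwable cause) {
            this.serviceName = serviceName;
            this.appId = appId;
            this.cause = cause;
        }
    }

    /**
     * Calls every service for the given application. Each call is wrapped in
     * try/catch so that a Throwable from one service neither escapes to the
     * calling (dispatcher) thread nor stops the remaining services from
     * running. Failures are returned so the caller can decide what to do.
     */
    static List<AuxFailure> dispatchInitApp(Map<String, AuxService> services, String appId) {
        List<AuxFailure> failures = new ArrayList<>();
        for (Map.Entry<String, AuxService> entry : services.entrySet()) {
            try {
                entry.getValue().initializeApplication(appId);
            } catch (Throwable t) {
                // Catching Throwable is deliberate in this sketch: it also
                // swallows Errors, which real code may prefer to rethrow.
                failures.add(new AuxFailure(entry.getKey(), appId, t));
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, AuxService> services = new LinkedHashMap<>();
        services.put("bad_service", appId -> {
            throw new IllegalStateException("bad data for " + appId);
        });
        services.put("good_service", appId -> System.out.println("initialized " + appId));

        List<AuxFailure> failures = dispatchInitApp(services, "application_0001");
        for (AuxFailure f : failures) {
            System.out.println(f.serviceName + " failed: " + f.cause.getMessage());
        }
    }
}
```

Returning the failures instead of acting on them inside the catch block keeps the isolation step separate from the policy question raised in the comment, namely whether the NM should fail the container or leave that decision to someone else.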