[jira] [Commented] (YARN-867) Isolation of failures in aux services

Zhijie Shen (JIRA) Fri, 13 Sep 2013 02:22:54 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765878#comment-13765878
 ]


Zhijie Shen commented on YARN-867:
----------------------------------

Thinking about the problem again: essentially, the problem is that an 
implementation of AuxiliaryService may throw a RuntimeException (or another 
Throwable) and kill the NM dispatcher thread. Wrapping the calling statements 
in try/catch basically prevents the NM failure.
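
A minimal sketch of that wrapping, in Java. The names AuxService, 
AuxServicesDispatcher and handleAppInit are hypothetical stand-ins made up for 
illustration; they are not the real NodeManager classes and not what the 
attached patches do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Demo entry point: one misbehaving service must not stop the dispatch loop.
class AuxServiceIsolationSketch {
    public static void main(String[] args) {
        AuxServicesDispatcher dispatcher = new AuxServicesDispatcher();
        dispatcher.register("bad", appId -> {
            throw new IllegalStateException("bad data for " + appId);
        });
        dispatcher.register("good", appId -> System.out.println("initialized " + appId));

        dispatcher.handleAppInit("app_1");
        System.out.println("dispatcher survived; failures=" + dispatcher.failures());
    }
}

// Hypothetical stand-in for an auxiliary service (assumption, not the YARN API).
interface AuxService {
    void initializeApplication(String appId);
}

// Hypothetical stand-in for the NM component that forwards events to services.
class AuxServicesDispatcher {
    private final Map<String, AuxService> services = new LinkedHashMap<>();
    private final List<String> failures = new ArrayList<>();

    void register(String name, AuxService service) {
        services.put(name, service);
    }

    void handleAppInit(String appId) {
        for (Map.Entry<String, AuxService> entry : services.entrySet()) {
            try {
                entry.getValue().initializeApplication(appId);
            } catch (Throwable t) {
                // Catch Throwable, not only IOException, so that a failing
                // service cannot terminate the calling thread. The failure is
                // only recorded here; what to do with it is a separate decision.
                failures.add(entry.getKey() + ":" + appId);
            }
        }
    }

    List<String> failures() {
        return failures;
    }
}
```

The catch block only records which service failed for which application, so 
the policy question (fail the container or not) stays separate from the 
isolation itself.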

The next task is to handle the throwable from AuxiliaryService. In the previous 
thread, what we planned to do was to fail the container directly and let the AM 
know that the container failed due to AUXSERVICE_FAILED. For MR, this may be 
okay, because without ShuffleHandler, MR jobs cannot run properly. However, 
should the NM always make the decision to fail the container? I'm concerned that:
1. The NM doesn't know what the AuxiliaryService does for the application or 
how important it is.
2. The NM doesn't know how critical the exception is, or whether it is 
transient or reproducible.
Therefore, if the application can toleran
                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
> YARN-867.4.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a 
> service. For example, sending data to the ShuffleService such that it results 
> in any non-IOException will cause the NM's async dispatcher to exit as the 
> service's INIT APP event is not handled properly. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
