[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765696#comment-13765696
 ] 

Xuan Gong commented on YARN-867:
--------------------------------

The new patch adds more transitions in ContainerState.EXITED_WITH_FAILURE and 
ContainerState.DONE. It still handles AuxServicesEventType.APPLICATION_INIT, 
and the resulting exceptions, at the container level.

I thought about moving AuxServicesEventType.APPLICATION_INIT into application, 
but I do not think we would get any benefit from it. The reasons are:
1. There are two new events: AuxServicesEventType.CONTAINER_INIT and 
AuxServicesEventType.CONTAINER_STOP. We need to handle them at the container 
level anyway.
2. Even if we move AuxServicesEventType.APPLICATION_INIT into application, we 
still have two options:
   a. Do not start any containers until all the AuxServices finish their 
APPLICATION_INIT. This would definitely simplify the problem: if any AuxService 
throws an exception during APPLICATION_INIT, we simply kill the application. 
But does it make sense to block all the containers?
   b. Let the AuxServices do APPLICATION_INIT while the containers start. In 
this case we end up with the same process as now, because when the container 
receives the CONTAINER_EXITED_WITH_FAILURE event we cannot guarantee which 
state the container is in: it may be in the KILLING state, the LOCALIZED 
state, etc. Any state is possible.
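
To illustrate option (b), here is a minimal sketch of a container state 
machine that accepts an aux-service failure from every state. It is not the 
code in the patch: the class ContainerSketch, the method onAuxServiceFailure, 
the reduced state list, and the choice of target states are all assumptions 
made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model; NOT the real NodeManager container class.
class ContainerSketch {
    // Reduced, assumed set of states for illustration only.
    enum State { NEW, LOCALIZING, LOCALIZED, RUNNING, KILLING, EXITED_WITH_FAILURE, DONE }

    // Target state when an aux-service failure arrives, keyed by current state.
    // Every state has an entry, so the event is never "invalid" for a state.
    private static final Map<State, State> ON_AUX_FAILURE = new EnumMap<>(State.class);
    static {
        for (State s : State.values()) {
            ON_AUX_FAILURE.put(s, State.EXITED_WITH_FAILURE);
        }
        // Assumption: a container that is already being killed or is finished
        // simply ignores the failure and stays where it is.
        ON_AUX_FAILURE.put(State.KILLING, State.KILLING);
        ON_AUX_FAILURE.put(State.DONE, State.DONE);
    }

    private State state;

    ContainerSketch(State initial) {
        this.state = initial;
    }

    State getState() {
        return state;
    }

    // Applies the failure event from whatever state the container is in.
    void onAuxServiceFailure() {
        state = ON_AUX_FAILURE.get(state);
    }
}
```

Because the table has an entry for every state, the failure can arrive at any 
point in the container's life cycle without hitting an unhandled transition.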

                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
> YARN-867.4.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a 
> service. For example, sending data to the ShuffleService such that it results 
> in any non-IOException will cause the NM's async dispatcher to exit, because 
> the service's INIT APP event is not handled properly. 
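
The failure mode in the issue description, and the isolation it asks for, can 
be sketched roughly as below. This is an illustration only, not the real 
AsyncDispatcher or AuxServices code: the class IsolatingDispatcherSketch, its 
methods, and the use of plain application ID strings as events are all 
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; NOT the real NodeManager dispatcher.
class IsolatingDispatcherSketch {
    private final List<String> failedApps = new ArrayList<>();

    // Delivers an "init app" event for each application ID. An unchecked
    // exception thrown by the handler is recorded against that application
    // instead of escaping the loop and stopping event delivery for everyone.
    int dispatchAll(List<String> appIds, Consumer<String> initAppHandler) {
        int delivered = 0;
        for (String appId : appIds) {
            try {
                initAppHandler.accept(appId);
                delivered++;
            } catch (RuntimeException e) {
                // Isolate the failure: only this application is affected.
                failedApps.add(appId);
            }
        }
        return delivered;
    }

    List<String> getFailedApps() {
        return failedApps;
    }
}
```

Without the try/catch, the first handler that throws would end the loop, which 
corresponds to the dispatcher exiting in the description above.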
