[
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765696#comment-13765696
]
Xuan Gong commented on YARN-867:
--------------------------------
The new patch adds more transitions in ContainerState.EXITED_WITH_FAILURE and
ContainerState.DONE. It still handles AuxServicesEventType.APPLICATION_INIT,
and the exceptions it may raise, at the container level.
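As a rough illustration of what handling exceptions at the container level
could look like, here is a minimal sketch. It is not code from the patch: the
AuxService interface, handleApplicationInit and failedContainers are
hypothetical stand-ins, and the real NodeManager classes differ. The idea is
only that an exception thrown by a service's init handler gets turned into a
container failure instead of propagating to the dispatcher thread.

```java
import java.util.ArrayList;
import java.util.List;

public class AuxFailureIsolationSketch {
    /** Hypothetical stand-in for an auxiliary service; not the real YARN interface. */
    interface AuxService {
        void initializeApplication(String containerId, byte[] data);
    }

    /** Hypothetical record of containers that should receive a failure event. */
    static final List<String> failedContainers = new ArrayList<>();

    /**
     * Runs the service's init handler. Any exception is caught here and
     * recorded as a container-level failure, so the caller (the dispatcher
     * thread in this sketch) never sees it.
     */
    static void handleApplicationInit(AuxService service, String containerId, byte[] data) {
        try {
            service.initializeApplication(containerId, data);
        } catch (Exception e) {
            // Assumption: in a real system this would send a failure event
            // to the container rather than append to a list.
            failedContainers.add(containerId);
        }
    }

    public static void main(String[] args) {
        // A service that rejects its input with an unchecked exception.
        AuxService badService = (id, data) -> {
            throw new IllegalStateException("bad data from " + id);
        };
        handleApplicationInit(badService, "container_1", new byte[0]);
        System.out.println("failed containers: " + failedContainers);
    }
}
```

With this shape, a misbehaving service only affects the container whose data it
was processing.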
I thought about moving the handling of AuxServicesEventType.APPLICATION_INIT to
the application level, but I do not think we would gain anything. The reasons
are:
1. There are two new events: AuxServicesEventType.CONTAINER_INIT and
AuxServicesEventType.CONTAINER_STOP. We need to handle them at the container
level.
2. Even if we move AuxServicesEventType.APPLICATION_INIT into the application,
we would have two options:
a. Do not start any containers until all the AuxServices finish their
APPLICATION_INIT. That would definitely simplify the problem: when
APPLICATION_INIT throws an exception in any AuxService, we simply kill the
application. But does it make sense to block all the containers?
b. Let the AuxServices do APPLICATION_INIT while the containers start at the
same time. In that case we end up with the same process as now, because when
the container receives the CONTAINER_EXITED_WITH_FAILURE event we cannot
guarantee which state the container is in. It may be in the KILLING state, the
LOCALIZED state, etc. Any state is possible.
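To make the "any state is possible" point concrete, here is a minimal sketch of
a transition function that accepts a failure event in every state. It is an
illustration only: the State enum is a simplified subset of container states,
onAuxServiceFailure is a hypothetical name, and the target states chosen here
are assumptions rather than the transitions in the patch.

```java
public class ContainerFailureTransitionsSketch {
    /** Simplified, hypothetical subset of container states. */
    enum State { NEW, LOCALIZING, LOCALIZED, RUNNING, KILLING, EXITED_WITH_FAILURE, DONE }

    /**
     * Returns the next state when a failure event arrives. Every state has a
     * defined result, so the event can never hit an invalid transition.
     */
    static State onAuxServiceFailure(State current) {
        switch (current) {
            case KILLING:
            case EXITED_WITH_FAILURE:
            case DONE:
                // Already stopping or finished: stay in the same state.
                return current;
            default:
                // Assumption: any still-active container starts being killed.
                return State.KILLING;
        }
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + " -> " + onAuxServiceFailure(s));
        }
    }
}
```

The point is that, under option b, the failure event needs an explicit
transition from the terminal states as well, which matches the states the patch
touches.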
> Isolation of failures in aux services
> --------------------------------------
>
> Key: YARN-867
> URL: https://issues.apache.org/jira/browse/YARN-867
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Hitesh Shah
> Assignee: Xuan Gong
> Priority: Critical
> Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch,
> YARN-867.4.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a
> service. For example, sending data to the ShuffleService such that it results
> in any non-IOException will cause the NM's async dispatcher to exit, as the
> service's INIT APP event is not handled properly.