[
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784745#comment-13784745
]
Bikas Saha commented on YARN-867:
---------------------------------
Why is this check needed?
{code}
+ private void handleAuxServiceFail(AuxServicesEvent event, Throwable th) {
+ if (event.getType() instanceof AuxServicesEventType) {
+ Container container = event.getContainer();
{code}
If the container has already failed, why do we need to change its state again?
{code}
+ .addTransition(ContainerState.LOCALIZATION_FAILED, ContainerState.EXITED_WITH_FAILURE,
+ ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+ new ExitedWithFailureTransition(false))
{code}
{code}
+ .addTransition(ContainerState.CONTAINER_CLEANEDUP_AFTER_KILL,
+ ContainerState.EXITED_WITH_FAILURE,
+ ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+ new ExitedWithFailureTransition(false))
{code}
Why is CONTAINER_EXITED_WITH_FAILURE not being handled while the container is
in the LOCALIZED or RUNNING state?
Why are extra events being ignored in addition to
ContainerEventType.CONTAINER_EXITED_WITH_FAILURE?
{code}
+ ContainerState.EXITED_WITH_FAILURE,
+ EnumSet.of(
+ ContainerEventType.KILL_CONTAINER,
+ ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+ ContainerEventType.RESOURCE_LOCALIZED,
+ ContainerEventType.RESOURCE_FAILED,
+ ContainerEventType.CONTAINER_LAUNCHED,
+ ContainerEventType.CONTAINER_EXITED_WITH_SUCCESS,
+ ContainerEventType.CONTAINER_KILLED_ON_REQUEST))
{code}
{code}
+ .addTransition(ContainerState.DONE, ContainerState.DONE,
+ EnumSet.of(
+ ContainerEventType.RESOURCE_LOCALIZED,
+ ContainerEventType.CONTAINER_LAUNCHED,
+ ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+ ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP,
+ ContainerEventType.CONTAINER_EXITED_WITH_SUCCESS,
+ ContainerEventType.CONTAINER_KILLED_ON_REQUEST))
{code}
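To illustrate the concern about widening the ignored-event sets, here is a minimal sketch of a state machine with a self-transition over an EnumSet. The class IgnoredEventsSketch, its trimmed-down State and Event enums, and the handle method are hypothetical stand-ins for illustration only; this is not the NodeManager's actual state machine code. The point is that every event added to the set is silently accepted, so an unexpected event that would otherwise be reported as invalid goes unnoticed.

```java
import java.util.EnumSet;

public class IgnoredEventsSketch {
    // Hypothetical, trimmed-down enums; the real ContainerState and
    // ContainerEventType have many more values.
    enum State { RUNNING, EXITED_WITH_FAILURE, DONE }
    enum Event {
        CONTAINER_EXITED_WITH_FAILURE,
        CONTAINER_LAUNCHED,
        RESOURCE_LOCALIZED,
        CONTAINER_RESOURCES_CLEANEDUP
    }

    // Events accepted in DONE with no state change (a self-transition).
    static final EnumSet<Event> IGNORED_IN_DONE = EnumSet.of(
        Event.CONTAINER_EXITED_WITH_FAILURE,
        Event.CONTAINER_LAUNCHED,
        Event.RESOURCE_LOCALIZED);

    // Returns the next state, or throws if no transition is registered.
    static State handle(State current, Event event) {
        switch (current) {
            case RUNNING:
                if (event == Event.CONTAINER_EXITED_WITH_FAILURE) {
                    return State.EXITED_WITH_FAILURE;
                }
                break;
            case EXITED_WITH_FAILURE:
                if (event == Event.CONTAINER_RESOURCES_CLEANEDUP) {
                    return State.DONE;
                }
                break;
            case DONE:
                if (IGNORED_IN_DONE.contains(event)) {
                    return State.DONE; // silently ignored
                }
                break;
        }
        // Anything not registered above is surfaced as an error.
        throw new IllegalStateException(
            "Invalid event " + event + " at " + current);
    }
}
```

In this sketch, a late CONTAINER_LAUNCHED in DONE is swallowed, while CONTAINER_RESOURCES_CLEANEDUP in DONE is still reported. Each extra entry in the set moves one more event from the second category to the first, which is why each one should be justified.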
Can you please check whether ExitedWithFailureTransition(true) needs to be
called in the places where the patch adds ExitedWithFailureTransition(false)?
Is cleanup required?
Do the new tests fail without the changes?
> Isolation of failures in aux services
> --------------------------------------
>
> Key: YARN-867
> URL: https://issues.apache.org/jira/browse/YARN-867
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Hitesh Shah
> Assignee: Xuan Gong
> Priority: Critical
> Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch,
> YARN-867.4.patch, YARN-867.5.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a
> service. For example, sending data to the ShuffleService such that it results
> in any non-IOException will cause the NM's async dispatcher to exit, as the
> service's INIT APP event is not handled properly.
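
The isolation described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions: AuxServiceIsolationSketch, its nested AuxService interface, and initApp are hypothetical names, not the real NodeManager classes, and the patch under review may handle the failure differently (for example, by failing the affected container). The sketch only shows the general idea of catching any Throwable at the call site so that one misbehaving service cannot take down the calling thread.

```java
import java.util.ArrayList;
import java.util.List;

public class AuxServiceIsolationSketch {
    /** Hypothetical stand-in for an auxiliary service; not the YARN API. */
    interface AuxService {
        void initializeApplication(String appId, byte[] data);
    }

    /**
     * Calls every service and records failures instead of propagating them.
     * Catching Throwable is deliberate here: the issue is about non-IOException
     * failures escaping to the dispatcher thread. Note that this also catches
     * Errors, which real code may prefer to rethrow.
     */
    static List<String> initApp(List<AuxService> services,
                                String appId, byte[] data) {
        List<String> failures = new ArrayList<>();
        for (AuxService service : services) {
            try {
                service.initializeApplication(appId, data);
            } catch (Throwable th) {
                // Record the failure; the loop (and the caller) keeps running.
                failures.add(th.getClass().getSimpleName());
            }
        }
        return failures;
    }
}
```

With this shape, a service that throws on bad input only adds an entry to the returned list; the remaining services are still called and the caller survives.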