[
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740009#comment-13740009
]
Xuan Gong commented on YARN-867:
--------------------------------
My proposal:
When an auxService fails, instead of simply throwing the exception to the
dispatcher, we will catch it and inform the AM.
Here is how it works:
We will use ContainerManagementProtocol. The AM periodically sends an
AuxiliaryServiceCheckRequest, with the ApplicationId as its parameter, to every
ContainerManager it knows about (the period can be set to 3s or 5s). Each
ContainerManager replies with whether any AuxiliaryService has failed for this
appId, along with the related diagnostics.
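A minimal sketch of one polling round on the AM side, to make the
request/response flow concrete. It is not the attached patch: the names
AuxServiceCheckResponse, AuxServiceChecker and AuxServicePoller are
hypothetical, and the interface only stands in for the real
ContainerManagementProtocol call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical response shape: a failure flag plus diagnostics text.
final class AuxServiceCheckResponse {
    final boolean failed;
    final String diagnostics;

    AuxServiceCheckResponse(boolean failed, String diagnostics) {
        this.failed = failed;
        this.diagnostics = diagnostics;
    }
}

// Hypothetical stand-in for the check call made over ContainerManagementProtocol.
interface AuxServiceChecker {
    AuxServiceCheckResponse checkAuxServices(String applicationId);
}

final class AuxServicePoller {
    // One polling round: ask every known ContainerManager about this appId
    // and collect the diagnostics of those that report a failure.
    static List<String> pollOnce(String applicationId,
                                 List<AuxServiceChecker> containerManagers) {
        List<String> failures = new ArrayList<>();
        for (AuxServiceChecker cm : containerManagers) {
            AuxServiceCheckResponse response = cm.checkAuxServices(applicationId);
            if (response.failed) {
                failures.add(response.diagnostics);
            }
        }
        return failures;
    }
}
```

In practice the AM would call pollOnce from a scheduled task at the chosen
period (3s or 5s) and react to any non-empty result.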
On the ContainerManagerImpl side, if any registered AuxService fails, instead
of simply throwing the exception to the dispatcher, we will catch it and save
the appId and exception message in an AuxServiceFailureMap. When a
ContainerManager receives an AuxiliaryServiceCheckRequest, it looks up the
appId in the AuxServiceFailureMap and reports in its response whether any
AuxService has failed for that appId.
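A rough sketch of the NM-side bookkeeping described above, assuming the map is
keyed by appId and stores one message per failure. The class layout and the
method names (runGuarded, hasFailure, diagnostics) are hypothetical, and
catching RuntimeException rather than Throwable is a choice made for this
sketch only.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

final class AuxServiceFailureMap {
    // appId -> messages of the aux service failures recorded for that app.
    private final Map<String, List<String>> failuresByApp = new ConcurrentHashMap<>();

    // Run an aux service call; on failure, record it instead of letting the
    // exception propagate to the dispatcher.
    void runGuarded(String appId, String serviceName, Runnable serviceCall) {
        try {
            serviceCall.run();
        } catch (RuntimeException e) {
            failuresByApp
                .computeIfAbsent(appId, k -> new CopyOnWriteArrayList<>())
                .add(serviceName + ": " + e.getMessage());
        }
    }

    // Used when answering an AuxiliaryServiceCheckRequest for this appId.
    boolean hasFailure(String appId) {
        return failuresByApp.containsKey(appId);
    }

    List<String> diagnostics(String appId) {
        return failuresByApp.getOrDefault(appId, Collections.emptyList());
    }
}
```

Entries would also need to be cleared when the application finishes; that is
left out here.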
Sample code for this proposal is attached.
> Isolation of failures in aux services
> --------------------------------------
>
> Key: YARN-867
> URL: https://issues.apache.org/jira/browse/YARN-867
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Hitesh Shah
> Assignee: Xuan Gong
> Priority: Critical
>
> Today, a malicious application can bring down the NM by sending bad data to a
> service. For example, sending data to the ShuffleService such that it results
> in any non-IOException will cause the NM's async dispatcher to exit, as the
> service's INIT APP event is not handled properly.