[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740009#comment-13740009
 ] 

Xuan Gong commented on YARN-867:
--------------------------------

My proposal:
When an auxService fails, instead of simply throwing the exception to the 
dispatcher, we will catch it and inform the AM. 

Here is how it works:

We will use ContainerManagementProtocol. The AM periodically sends an 
AuxiliaryServiceCheckRequest, with the ApplicationId as its parameter, to every 
ContainerManager it knows about (the period could be set to 3s or 5s). Each 
ContainerManager then sends back a response saying whether any AuxiliaryService 
has failed for that appId, along with the related diagnostics. 
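
A minimal sketch of the AM-side polling described above, in plain Java. The names AuxCheckResponse, AuxCheckTarget and AuxServicePoller are hypothetical stand-ins for the proposed request/response types and for the per-NM ContainerManagementProtocol proxy; none of them are existing YARN classes, and the appId is passed as a plain String for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical response: whether any aux service failed for the app, plus diagnostics.
final class AuxCheckResponse {
    final boolean anyFailed;
    final String diagnostics;

    AuxCheckResponse(boolean anyFailed, String diagnostics) {
        this.anyFailed = anyFailed;
        this.diagnostics = diagnostics;
    }
}

// Hypothetical stand-in for one ContainerManager reached over ContainerManagementProtocol.
interface AuxCheckTarget {
    AuxCheckResponse checkAuxiliaryServices(String applicationId);
}

final class AuxServicePoller {
    private final String applicationId;
    private final List<AuxCheckTarget> containerManagers;

    AuxServicePoller(String applicationId, List<AuxCheckTarget> containerManagers) {
        this.applicationId = applicationId;
        this.containerManagers = new ArrayList<>(containerManagers);
    }

    // One polling round: ask every known ContainerManager and collect failure diagnostics.
    List<String> pollOnce() {
        List<String> failures = new ArrayList<>();
        for (AuxCheckTarget cm : containerManagers) {
            AuxCheckResponse response = cm.checkAuxiliaryServices(applicationId);
            if (response.anyFailed) {
                failures.add(response.diagnostics);
            }
        }
        return failures;
    }

    // Repeat the round every periodSeconds (e.g. 3 or 5); the caller shuts the executor down.
    ScheduledExecutorService start(long periodSeconds, Consumer<List<String>> onFailures) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(() -> {
            List<String> failures = pollOnce();
            if (!failures.isEmpty()) {
                onFailures.accept(failures);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

In this sketch the AM would call start(3, ...) once and react to reported failures in the callback; how the AM should react is left open here, as it is in the proposal.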

On the ContainerManagerImpl side, if any of the registered AuxServices fails, 
instead of simply throwing the exception to the dispatcher, we will catch it 
and save the appId and exception message in an AuxServiceFailureMap. When a 
ContainerManager then receives an AuxiliaryServiceCheckRequest, it can look up 
the appId in the AuxServiceFailureMap and report in its response whether any 
AuxService has failed for that appId.
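
A rough sketch of the NM-side bookkeeping, again in plain Java. Only the name AuxServiceFailureMap comes from the proposal; its shape here (a concurrent map from appId to a list of messages) and the methods runGuarded, hasFailures and diagnostics are assumptions made for illustration, not code from the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Assumed shape: appId -> messages of the exceptions caught from aux services.
final class AuxServiceFailureMap {
    private final ConcurrentMap<String, List<String>> failuresByApp = new ConcurrentHashMap<>();

    // Run one aux service call; on failure record it instead of letting it reach the dispatcher.
    void runGuarded(String appId, String serviceName, Runnable serviceCall) {
        try {
            serviceCall.run();
        } catch (RuntimeException e) {
            failuresByApp
                .computeIfAbsent(appId, id -> new CopyOnWriteArrayList<>())
                .add(serviceName + ": " + e);
        }
    }

    // What the handler of an AuxiliaryServiceCheckRequest would consult.
    boolean hasFailures(String appId) {
        return failuresByApp.containsKey(appId);
    }

    // Diagnostics to put in the response; empty if nothing failed for this appId.
    List<String> diagnostics(String appId) {
        List<String> recorded = failuresByApp.get(appId);
        return recorded == null ? Collections.emptyList() : new ArrayList<>(recorded);
    }
}
```

The sketch only catches RuntimeException because a Runnable cannot throw checked exceptions; which exception types the real handler should catch, and when entries are removed from the map, are not settled by this example.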

Attached is sample code for this proposal.
                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>
> Today, a malicious application can bring down the NM by sending bad data to a 
> service. For example, sending data to the ShuffleService that results in any 
> non-IOException will cause the NM's async dispatcher to exit, because the 
> service's INIT APP event is not handled properly. 
