[ 
https://issues.apache.org/jira/browse/YARN-4466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067852#comment-15067852
 ] 

Naganarasimha G R commented on YARN-4466:
-----------------------------------------

Hi [~djp]
Was looking into the exception trace which caused the RM to go down in 
YARN-4452  issue
{code}
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.appAttemptRegistered(SystemMetricsPublisher.java:165)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1533)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1505)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:845)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:110)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:796)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
{code}
Here {{rmDispatcher-> ApplicationAttemptEventDispatcher-> ... -> transition in 
RMAppAttemptImpl -> SMP.appAttemptRegistered}} causes a NPE. In 
{{AsyncDispatcher.dispatch}} method on exception we check for 
{{exitOnDispatchException}} and and if true we call System.exit. 
So was not able to figure out an approach here to have a check such that RM 
doesn't fail. It was basically a bug.
So IMO its *not* always possible to cover all scenarios but what we could do is 
*expose interface in AsyncDispatcher to allow setting exitOnDispatchException 
variable and the dispatchers used in ATS can set it false* so that RM doesn't 
go down for exceptions related to dispatching ATS events. But apart from it 
even if introduce some mechanism we might miss to get some hidden bugs.
Apart from TimelinePublisher only other place  in RM i could see 
AsyncDispatcher used where we can neglect the failures/exceptions is 
{{RMApplicationHistoryWriter}} which is already deprecated ! 
Thoughts ?

> ResourceManager should tolerate unexpected exceptions to happen in 
> non-critical subsystem/services like SystemMetricsPublisher
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4466
>                 URL: https://issues.apache.org/jira/browse/YARN-4466
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Naganarasimha G R
>
> From my comment in 
> YARN-4452(https://issues.apache.org/jira/browse/YARN-4452?focusedCommentId=15059805&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15059805),
>  we should make RM more robust with ignore (but log) unexpected exception in 
> its non-critical subsystems/services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to