[
https://issues.apache.org/jira/browse/YARN-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Naganarasimha G R updated YARN-3260:
------------------------------------
Comment: was deleted
(was: Hi [~jlowe],
I had a look at the code, and the approaches I can think of are:
* Call ApplicationMasterService.registerAppAttempt(ApplicationAttemptId) in
RMAppAttemptImpl.AMLaunchedTransition instead of
RMAppAttemptImpl.AttemptStartedTransition, so that creating the ClientToAMToken
and registering with ApplicationMasterService happen in the same block. With
this we can throw InvalidApplicationMasterRequestException if the AM tries to
register with the AMS before RMAppAttemptImpl processes the RMAppAttempt
LAUNCHED event.
* Use a MultiThreadedDispatcher for processing App and AppAttempt events,
similar to the one in SystemMetricsPublisher.MultiThreadedDispatcher, with the
modification that instead of selecting the dispatcher with
{{(event.hashCode() & Integer.MAX_VALUE) % dispatchers.size()}} we select it
based on the applicationId. This can speed up the processing of App events ...
I was not able to see any other cleaner, direct fix for this issue, so I was
wondering whether we need to start looking at the reason why the cluster "was
running behind on processing AsyncDispatcher events". Were these events
getting delayed for any particular reason? )
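
To illustrate the second approach in the comment, here is a minimal sketch of picking a dispatcher by application id instead of by event hash. It is not YARN's MultiThreadedDispatcher: the class AppShardedDispatcher, the methods shardFor, dispatch and shutdownAndWait, and the use of a plain String for the application id are all hypothetical names chosen for this example. Only the masking expression is taken from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a multi-threaded dispatcher; not a YARN class.
class AppShardedDispatcher {
    // One single-threaded worker per shard, so events routed to the same
    // shard are handled in the order they were submitted.
    private final List<ExecutorService> workers = new ArrayList<>();

    AppShardedDispatcher(int shards) {
        for (int i = 0; i < shards; i++) {
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // Same masking as (event.hashCode() & Integer.MAX_VALUE) % size, but
    // keyed on the application id so all events of one app share a shard.
    static int shardFor(String applicationId, int shards) {
        return (applicationId.hashCode() & Integer.MAX_VALUE) % shards;
    }

    void dispatch(String applicationId, Runnable handler) {
        workers.get(shardFor(applicationId, workers.size())).execute(handler);
    }

    // Stops accepting events and waits for queued ones to finish.
    // Returns false if a worker did not finish within the timeout.
    boolean shutdownAndWait(long timeoutMillis) {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        try {
            for (ExecutorService w : workers) {
                if (!w.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Keying on the application id keeps per-application ordering while letting different applications proceed in parallel. Hashing each event separately would spread one application's events across threads and lose that ordering.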
> NPE if AM attempts to register before RM processes launch event
> ---------------------------------------------------------------
>
> Key: YARN-3260
> URL: https://issues.apache.org/jira/browse/YARN-3260
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Jason Lowe
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: YARN-3260.001.patch
>
>
> The RM on one of our clusters was running behind on processing
> AsyncDispatcher events, and this caused AMs to fail to register due to an
> NPE. The AM was launched and attempting to register before the
> RMAppAttemptImpl had processed the LAUNCHED event, and the client to AM token
> had not been generated yet. The NPE occurred because the
> ApplicationMasterService tried to encode the missing token.
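
The race in the description can be sketched in isolation: a registration arrives before the launch event has stored the token, and the service rejects it explicitly instead of dereferencing a missing value. This is a simplified model, not the attached patch or the real ApplicationMasterService code. RegistrationGuardSketch, AttemptNotLaunchedException, onAttemptLaunched and register are hypothetical names, and the token is modeled as a plain byte array.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of the register-before-launch race.
class RegistrationGuardSketch {
    // Hypothetical exception standing in for an "invalid request" error.
    static class AttemptNotLaunchedException extends RuntimeException {
        AttemptNotLaunchedException(String attemptId) {
            super("Attempt " + attemptId + " has not been launched yet");
        }
    }

    // Attempt id -> client-to-AM token, filled in when the launch event
    // is processed.
    private final Map<String, byte[]> clientToAmTokens = new ConcurrentHashMap<>();

    // Called when the (possibly delayed) launch event is finally handled.
    void onAttemptLaunched(String attemptId, byte[] token) {
        clientToAmTokens.put(attemptId, token.clone());
    }

    // Called when the AM registers. A missing token means the launch event
    // has not been processed, so fail with a clear error rather than an NPE.
    byte[] register(String attemptId) {
        byte[] token = clientToAmTokens.get(attemptId);
        if (token == null) {
            throw new AttemptNotLaunchedException(attemptId);
        }
        return token.clone();
    }
}
```

In this model an AM that registers too early gets a descriptive error it could retry on, and a later call succeeds once onAttemptLaunched has run.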