[
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033961#comment-15033961
]
Jason Lowe commented on YARN-4392:
----------------------------------
I agree that if we're going to resend the ATS events then the start time should
be consistent. This is already done with the audit logs. There's still
[~Naganarasimha]'s question of whether we should simply avoid sending the
events at all upon recovery. If we take that approach, I'm wondering whether
there are cases where we update the app state before we know for certain that
the ATS has received the event; skipping the resend would then leave the ATS
without it. Re-sending the events is therefore probably the safer approach,
but it does send a flood of events from the RM to the ATS upon recovery.
Anyway, if we proceed with a resend-event approach, I'm wondering if there's a
simpler way to handle it. Rather than updating the RMAppImpl constructor,
can't we simply wait until we recover to send the event? I find it odd that we
are telling the ATS that the app has started in the RMAppImpl constructor
rather than in the transition triggered by the START event. Moving the ATS app
start notification out of the constructor and instead to that start transition
allows us to construct an app and send it a recover event without triggering an
ATS event. Then we can let the app recover and either send the event with the
recovered startTime or avoid sending it during recovery. It would be our
choice. Then we don't need to update the constructor, leak even more app state
recovery logic into RMAppManager, etc.
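
To make the suggestion concrete, here is a minimal, self-contained sketch of the idea. It is not the actual RMAppImpl code: SketchApp, RecordingPublisher, handleStart and handleRecover are hypothetical stand-ins for the real state machine and ATS publisher, and the republish flag only illustrates that resending on recovery becomes a choice.

```java
import java.util.ArrayList;
import java.util.List;

public class AppStartSketch {
    /** Hypothetical stand-in for the ATS publisher; it only records start times. */
    static class RecordingPublisher {
        final List<Long> publishedStartTimes = new ArrayList<>();

        void appCreated(String appId, long startTime) {
            publishedStartTimes.add(startTime);
        }
    }

    /** Simplified stand-in for RMAppImpl (assumption: not the real class). */
    static class SketchApp {
        private final String appId;
        private final RecordingPublisher publisher;
        private long startTime;

        SketchApp(String appId, long constructionTime, RecordingPublisher publisher) {
            this.appId = appId;
            this.startTime = constructionTime;
            this.publisher = publisher;
            // Deliberately no ATS notification in the constructor.
        }

        /** Transition triggered by a START event: a fresh app announces itself here. */
        void handleStart() {
            publisher.appCreated(appId, startTime);
        }

        /** Transition triggered by a RECOVER event: restore state, then optionally resend. */
        void handleRecover(long recoveredStartTime, boolean republish) {
            this.startTime = recoveredStartTime;
            if (republish) {
                publisher.appCreated(appId, startTime);
            }
        }

        long getStartTime() {
            return startTime;
        }
    }

    public static void main(String[] args) {
        RecordingPublisher publisher = new RecordingPublisher();

        // Normal submission: the event is sent from the start transition.
        SketchApp fresh = new SketchApp("app_1", 1000L, publisher);
        fresh.handleStart();

        // After an RM restart the app is constructed again at a later time,
        // but the resent event carries the recovered start time.
        SketchApp recovered = new SketchApp("app_1", 5000L, publisher);
        recovered.handleRecover(1000L, true);

        System.out.println(publisher.publishedStartTimes); // prints [1000, 1000]
    }
}
```

Because the constructor no longer publishes anything, the recovered app never reports the restart time as its start time, and passing false for republish gives the "don't resend on recovery" variant without touching the constructor.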
> ApplicationCreatedEvent event time resets after RM restart/failover
> -------------------------------------------------------------------
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.8.0
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch
>
>
> {code}
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437453994768 is ahead of started time 1440308399674
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437454008244 is ahead of started time 1440308399676
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444305171 is ahead of started time 1440308399653
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444293115 is ahead of started time 1440308399647
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444379645 is ahead of started time 1440308399656
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444361234 is ahead of started time 1440308399655
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444342029 is ahead of started time 1440308399654
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444323447 is ahead of started time 1440308399654
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444430006 is ahead of started time 1440308399660
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444415698 is ahead of started time 1440308399659
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444419060 is ahead of started time 1440308399658
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished time 1437444393931 is ahead of started time 1440308399657
> {code}
> In the ATS logs, we periodically see a large number of 'stale alerts'
> messages.