[
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Naganarasimha G R updated YARN-4392:
------------------------------------
Attachment: YARN-4392.3.patch
bq. ATS events upon recovery in some scenarios if we don't re-send since ATS
event posting is async and state store updating are async. There's a race where
we could update the state store and crash before the ATS event is sent.
IMO it's a situation where we need to decide which is the greater of two evils
and take care of that one. I had an offline discussion with [~sunilg] and
[~vvasudev], and a few thoughts were:
* Processing of ATS events, even during recovery, happens in a separate thread,
so it is not blocking.
* Once we move off ATS1.0, storage will also not be a problem.
* The data also does not change when events are re-sent during recovery.
Considering all this, I am fine with the approach in the patch. I have also
uploaded a new patch that corrects a test case and adds a test case to validate
that events are sent for container created during both creation and recovery.
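To illustrate why re-sending on recovery should be harmless, here is a minimal sketch. It is not the YARN or ATS code: `Event`, `TimelineStore`, and `elapsed` are hypothetical stand-ins, and the store is assumed to treat an event with the same entity, type, and timestamp as the same record. The `started` and `finished` values come from the first WARN line in the issue description; `originalStart` is an invented value for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch; none of these types are the real YARN or ATS API. */
class RecoveryResendSketch {

  /** A timeline-style event identified by entity, type, and timestamp. */
  static final class Event {
    final String entityId;
    final String type;
    final long timestamp;

    Event(String entityId, String type, long timestamp) {
      this.entityId = entityId;
      this.type = type;
      this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Event)) {
        return false;
      }
      Event other = (Event) o;
      return entityId.equals(other.entityId)
          && type.equals(other.type)
          && timestamp == other.timestamp;
    }

    @Override
    public int hashCode() {
      return Objects.hash(entityId, type, timestamp);
    }
  }

  /** Assumed store behaviour: an identical event is kept only once. */
  static final class TimelineStore {
    private final Set<Event> events = new LinkedHashSet<>();

    void put(Event e) {
      events.add(e);
    }

    int size() {
      return events.size();
    }
  }

  /** Simplified stand-in for the elapsed-time check: -1 when finished < started. */
  static long elapsed(long started, long finished) {
    long elapsed = finished - started;
    return elapsed >= 0 ? elapsed : -1;
  }

  public static void main(String[] args) {
    long originalStart = 1437444200000L; // invented: start time kept across restart
    long restartTime = 1440308399647L;   // "started time" from the first WARN line
    long finished = 1437444293115L;      // "Finished time" from the first WARN line

    TimelineStore store = new TimelineStore();
    store.put(new Event("app_1", "CREATED", originalStart));
    // Re-send on recovery with the original timestamp: nothing changes.
    store.put(new Event("app_1", "CREATED", originalStart));
    System.out.println(store.size()); // prints 1

    // Re-send with a timestamp reset to the restart time: a second record.
    store.put(new Event("app_1", "CREATED", restartTime));
    System.out.println(store.size()); // prints 2

    System.out.println(elapsed(originalStart, finished)); // prints 93115
    System.out.println(elapsed(restartTime, finished));   // prints -1
  }
}
```

Under these assumptions, the re-sent event is a no-op as long as it carries the original creation time. A creation time reset to the restart time is what would put the finished time ahead of the started time.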
> ApplicationCreatedEvent event time resets after RM restart/failover
> -------------------------------------------------------------------
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.8.0
> Reporter: Xuan Gong
> Assignee: Naganarasimha G R
> Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch,
> YARN-4392.2.patch, YARN-4392.3.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) -
> Finished time 1437453994768 is ahead of started time 1440308399674
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437454008244 is ahead of started time 1440308399676
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444305171 is ahead of started time 1440308399653
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444293115 is ahead of started time 1440308399647
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444379645 is ahead of started time 1440308399656
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444361234 is ahead of started time 1440308399655
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444342029 is ahead of started time 1440308399654
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444323447 is ahead of started time 1440308399654
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444430006 is ahead of started time 1440308399660
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444415698 is ahead of started time 1440308399659
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444419060 is ahead of started time 1440308399658
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished
> time 1437444393931 is ahead of started time 1440308399657
> {code}
> From ATS logs, we would see a large amount of 'stale alerts' messages
> periodically