[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432723#comment-16432723 ]
Vrushali C commented on YARN-8130:
----------------------------------
Yes, I agree. We need a configurable delay, like the collectorLingerPeriod used in
PerNodeTimelineCollectorsAuxService#removeApplicationCollector.
We also need to check whether there are other places where the app id is
removed from a map.
Relevant JIRAs for collectorLingerPeriod: YARN-3995 and YARN-7835.
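
To make the linger idea concrete, here is a minimal sketch of delayed removal. It is not the actual NodeManager or collector code: the class LingeringAppRegistry, its methods, and the String-valued map are hypothetical stand-ins, and the linger period is a plain constructor argument rather than a real YARN configuration key. The point is only that removal of the app id is scheduled after a delay, so late container events can still find their client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-application client registry; illustration only, not Hadoop code. */
class LingeringAppRegistry implements AutoCloseable {
  // app id -> client handle (a String here, purely as a placeholder)
  private final Map<String, String> clientsByAppId = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long lingerPeriodMs;

  LingeringAppRegistry(long lingerPeriodMs) {
    this.lingerPeriodMs = lingerPeriodMs;
  }

  void register(String appId, String client) {
    clientsByAppId.put(appId, client);
  }

  /** Schedules removal after the linger period instead of removing immediately. */
  void removeLater(String appId) {
    scheduler.schedule(() -> clientsByAppId.remove(appId),
        lingerPeriodMs, TimeUnit.MILLISECONDS);
  }

  /** Returns true if a client for the app is still registered, i.e. the entity could be published. */
  boolean publish(String appId, String entityId) {
    return clientsByAppId.containsKey(appId);
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
```

With a non-zero linger period, a container-finished event that arrives shortly after the application finishes still finds the app id in the map. A real fix would also need to handle an app id being re-registered while a delayed removal is pending, which this sketch ignores.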
> Race condition when container events are published for KILLED applications
> --------------------------------------------------------------------------
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
> Issue Type: Bug
> Components: ATSv2
> Reporter: Charan Hebri
> Priority: Major
>
> There seems to be a race condition when an application is KILLED
> and the corresponding container event information is being published. For
> completed containers, a YARN_CONTAINER_FINISHED event is generated, but for
> some containers in a KILLED application this event is missing. Below is
> a NodeManager log snippet:
> {code:java}
> 2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO application.ApplicationImpl
> (ApplicationImpl.java:handle(632)) - Application
> application_1523259757659_0003 transitioned from
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been
> removed before the entity could be published for
> TimelineEntity[type='YARN_CONTAINER',
> id='container_1523259757659_0003_01_000002']
> 2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs
> for container container_1523259757659_0003_01_000001. Current good log dirs
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs
> for container container_1523259757659_0003_01_000002. Current good log dirs
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager
> (TimelineCollectorManager.java:remove(192)) - The collector service for
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application
> application_1523259757659_0003 while processing FINISH_APPS event. The
> ResourceManager allocated resources for this application to the NodeManager
> but no active containers were found to process
> {code}
> The container id specified in the log,
> *container_1523259757659_0003_01_000002*, is the one whose finished
> event is missing.