[
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070709#comment-15070709
]
Naganarasimha G R commented on YARN-3995:
-----------------------------------------
bq. If I recall, this window of opportunity is going to be quite small because
any non-AM container will be completed before the app can be finished (and the
AM container is completed).
This is true in most cases, unless the AM does not wait for the containers it
launched/requested to go down before it goes down itself.
I ran TestDistributedShell and cross-checked the logs for errors caused by a
missing collector, and didn't find any for the containers it launched.
However, TestDistributedShell launches only 2 containers; running it with more
containers may expose the impact.
bq. I suspect a simple linger might be sufficient, but do we see a case where
we might miss writes otherwise?
Yes, a simple linger should be sufficient. Shall I make the linger period
configurable, so that there is a fallback in case of any issues and, if
required, we can handle it in a better way in the future? Also, is launching
one thread per collector to close it fine? IMO a configurable linger period is
sufficient.
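To make the linger idea concrete, here is a rough sketch in plain Java of what
a delayed collector removal could look like. It is not taken from any patch:
the class LingeringCollectorRegistry, its method names and the lingerMillis
parameter are all hypothetical, and the collector is modelled as a bare
AutoCloseable rather than the real collector class. It also assumes a single
shared scheduler thread instead of one thread per collector, which is just one
possible answer to the threading question above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: keeps a per-app collector alive for a linger period
// after the AM container finishes, so late events can still be written.
public class LingeringCollectorRegistry {
    // appId -> collector (modelled as AutoCloseable; an assumption)
    private final Map<String, AutoCloseable> collectors = new ConcurrentHashMap<>();

    // One shared daemon thread handles all delayed closes.
    private final ScheduledExecutorService closer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "collector-linger");
            t.setDaemon(true);
            return t;
        });

    // Linger period; in practice this would come from configuration.
    private final long lingerMillis;

    public LingeringCollectorRegistry(long lingerMillis) {
        this.lingerMillis = lingerMillis;
    }

    public void put(String appId, AutoCloseable collector) {
        collectors.put(appId, collector);
    }

    public boolean contains(String appId) {
        return collectors.containsKey(appId);
    }

    // Called when the AM container finishes: the collector is not removed
    // immediately, only after the linger period has elapsed.
    public void removeAfterLinger(String appId) {
        closer.schedule(() -> {
            AutoCloseable collector = collectors.remove(appId);
            if (collector != null) {
                try {
                    collector.close();
                } catch (Exception e) {
                    // A real implementation would log this failure.
                }
            }
        }, lingerMillis, TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        closer.shutdownNow();
    }
}
```

With something like this, the AM-container-finished handler would call
removeAfterLinger(appId) instead of removing the collector directly, and
events arriving within the linger window would still find it.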
> Some of the NM events are not getting published due race condition when AM
> container finishes in NM
> ----------------------------------------------------------------------------------------------------
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, timelineserver
> Affects Versions: YARN-2928
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045: while testing with TestDistributedShell, it was
> found that a few of the container metrics events were failing because of a
> race condition. When the AM container finishes and removes the collector for
> the app, it is still possible that events published for the app by the
> current NM and other NMs are still in the pipeline,
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)