Naganarasimha G R commented on YARN-3995:

bq. If I recall, this window of opportunity is going to be quite small because 
any non-AM container will be completed before the app can be finished (and the 
AM container is completed).
This is true in most cases, unless the AM does not wait for the containers it 
launched/requested to go down before it goes down itself. 
I ran TestDistributedShell and cross-verified the logs for errors caused by the 
collector being absent, and didn't find any for the containers it launched. 
However, TestDistributedShell launches only 2 containers; running with more 
containers may reveal the impact.

bq. I suspect a simple linger might be sufficient, but do we see a case where 
we might miss writes otherwise?
Yes, a simple linger should be sufficient. Shall I make the linger period 
configurable, so that there is a fallback in case of any issues and, if required, 
we can handle it in a better way in the future? Also, is launching one thread per 
collector to close it fine? IMO a configurable linger period is sufficient.
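
To make the linger idea concrete, here is a rough sketch of what I have in mind. It is not the actual NM collector code: the class `LingeringCollectorRegistry`, the `Collector` interface, and the `lingerMs` parameter are all hypothetical names used only for illustration. It assumes a single shared scheduler thread could handle the delayed removals, which would avoid launching one thread per collector.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; not the real NM collector manager. */
public class LingeringCollectorRegistry {

  /** Stand-in for a per-application timeline collector (assumed interface). */
  public interface Collector {
    void close();
  }

  private final Map<String, Collector> collectors = new ConcurrentHashMap<>();

  // One shared daemon thread serves all delayed removals.
  private final ScheduledExecutorService reaper =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "collector-linger-reaper");
        t.setDaemon(true);
        return t;
      });

  // Linger period; assumed to come from a configuration property.
  private final long lingerMs;

  public LingeringCollectorRegistry(long lingerMs) {
    this.lingerMs = lingerMs;
  }

  public void put(String appId, Collector collector) {
    collectors.put(appId, collector);
  }

  public Collector get(String appId) {
    return collectors.get(appId);
  }

  /**
   * Called when the AM container finishes. The collector stays registered
   * for the linger period so that in-flight events can still be written,
   * and is removed and closed afterwards.
   */
  public void removeAfterLinger(String appId) {
    reaper.schedule(() -> {
      Collector c = collectors.remove(appId);
      if (c != null) {
        c.close();
      }
    }, lingerMs, TimeUnit.MILLISECONDS);
  }

  public void shutdown() {
    reaper.shutdownNow();
  }
}
```

With this shape, events arriving within `lingerMs` of the AM container finishing would still find the collector via `get`, and only later ones would be dropped. Whether a single scheduler thread is enough depends on how long `close()` takes, so that part is a judgment call.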

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> ----------------------------------------------------------------------------------------------------
>                 Key: YARN-3995
>                 URL: https://issues.apache.org/jira/browse/YARN-3995
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
> As discussed in YARN-3045: while testing with TestDistributedShell, it was found 
> that a few of the container metrics events were failing because of a race 
> condition. When the AM container finishes and removes the collector for the 
> app, it is still possible that events published for the app by 
> the current NM and other NMs are still in the pipeline, 

This message was sent by Atlassian JIRA
