Vrushali C commented on YARN-3995:

Hi [~Naganarasimha]

Thanks for the thoughts on the jira. I was wondering if the following is a 
feasible solution:

- Can the NM maintain a list/map of “zombie app ids” for the AMs/collectors 
that it is removing? That way, when metrics arrive at the NM from other NMs 
for one of those zombie app ids, it can check whether the app previously had a 
collector, in which case the metric/entity is most likely still valid, and 
then somehow write it to the backend, perhaps via a “common parent collector” 
process or something.

- We can have the NM periodically prune this zombie list: perhaps, say, a few 
days after app completion, remove the info for that app from the zombie app 
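
To make the idea a bit more concrete, here is a rough sketch of what such a 
zombie list with periodic pruning might look like. This is only an 
illustration: ZombieAppTracker and its methods are hypothetical names, not 
existing NM or timeline service classes, app ids are plain strings here for 
simplicity, and how the NM would actually hook this in is left open.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the YARN code base.
// Remembers apps whose collector was removed so that late metrics can be recognized.
class ZombieAppTracker {
  // app id -> time (ms) at which the app's collector was removed
  private final Map<String, Long> removedAt = new ConcurrentHashMap<>();
  private final long retentionMs;

  ZombieAppTracker(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  // Called when the NM removes the collector for an app.
  void collectorRemoved(String appId, long nowMs) {
    removedAt.put(appId, nowMs);
  }

  // True if the app previously had a collector here, so a late metric is likely valid.
  boolean isZombie(String appId) {
    return removedAt.containsKey(appId);
  }

  // Drops entries at least retentionMs old; returns how many were dropped.
  int prune(long nowMs) {
    int removed = 0;
    Iterator<Map.Entry<String, Long>> it = removedAt.entrySet().iterator();
    while (it.hasNext()) {
      if (nowMs - it.next().getValue() >= retentionMs) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }
}
```

The NM could call prune() from some periodic task with a retention of a few 
days. A metric for an app id that is not in the map would be handled as it is 
today.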

I am not too knowledgeable about the NM and so not sure if this is 

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> ----------------------------------------------------------------------------------------------------
>                 Key: YARN-3995
>                 URL: https://issues.apache.org/jira/browse/YARN-3995
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 

This message was sent by Atlassian JIRA