[ 
https://issues.apache.org/jira/browse/YARN-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357463#comment-16357463
 ] 

Haibo Chen commented on YARN-7835:
----------------------------------

Thanks for the patch [~rohithsharma]. A few comments:

1) The new code in initializeContainer() and stopContainer() uses 
synchronized blocks, so I assume it is meant to be thread safe. In 
initializeContainer(), if two threads see at the same time that the container 
set for a given application is missing, each would create its own singleton 
set, and one would overwrite the other. I think we could just 
synchronize on appIdToContainerId in both initializeContainer() and 
stopContainer(), as we do with collectors in TimelineCollectorManager.
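
A minimal sketch of what I mean in 1), not the patch itself. The class name 
ContainerTracker is hypothetical, plain String ids stand in for ApplicationId 
and ContainerId, and the boolean return value of stopContainer() is only there 
for illustration. The point is that the check-then-create and the 
remove-then-cleanup steps each run entirely under one lock on the map:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the aux service bookkeeping (assumption, not the
// real class). String ids replace ApplicationId / ContainerId.
class ContainerTracker {
  private final Map<String, Set<String>> appIdToContainerId = new HashMap<>();

  void initializeContainer(String appId, String containerId) {
    // Look up, create if absent, and add under a single lock, so two threads
    // cannot each create a set and overwrite one another.
    synchronized (appIdToContainerId) {
      appIdToContainerId
          .computeIfAbsent(appId, k -> new HashSet<>())
          .add(containerId);
    }
  }

  // Returns true only when the last container of the application was removed.
  boolean stopContainer(String appId, String containerId) {
    synchronized (appIdToContainerId) {
      Set<String> containers = appIdToContainerId.get(appId);
      if (containers == null) {
        return false;
      }
      containers.remove(containerId);
      if (containers.isEmpty()) {
        appIdToContainerId.remove(appId);
        return true;
      }
      return false;
    }
  }

  boolean hasApplication(String appId) {
    synchronized (appIdToContainerId) {
      return appIdToContainerId.containsKey(appId);
    }
  }
}
```

With this shape, stopping the 1st attempt's container while the 2nd attempt's 
container is registered leaves the application in the map.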

2) Not sure why we test 
`!auxService.hasApplication(appAttemptId.getApplicationId())` in a loop after 
the 1st attempt has stopped. IIUC, the application should never be removed while 
the 2nd attempt is still running, so we'd just sleep on every iteration until 
the for-loop counter runs out. Am I missing something?

3) For the second for-loop to wait for the application to be cleaned up, I 
think we could reuse GenericTestUtils.waitFor().
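
To illustrate 3), here is a rough sketch of the polling pattern. The helper 
below is a local stand-in written for this comment, not the Hadoop class; it 
only assumes that GenericTestUtils.waitFor() takes a boolean supplier, a poll 
interval and a timeout, and the exact signature should be checked against the 
Hadoop version in use. WaitForSketch is a hypothetical name.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class WaitForSketch {
  // Local stand-in for a polling helper (assumption: it mirrors the general
  // shape of GenericTestUtils.waitFor, it is not the real implementation).
  static void waitFor(Supplier<Boolean> check, long checkEveryMillis,
      long waitForMillis) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + waitForMillis;
    // Poll the condition until it holds or the deadline passes.
    while (!check.get()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(
            "Condition not met within " + waitForMillis + " ms");
      }
      Thread.sleep(checkEveryMillis);
    }
  }
}
```

In the test, the hand-written loop could then become a single call whose 
condition is `!auxService.hasApplication(appAttemptId.getApplicationId())`, 
which fails with a timeout instead of silently falling through when the 
counter runs out.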

> [Atsv2] Race condition in NM while publishing events if second attempt 
> launched on same node
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-7835
>                 URL: https://issues.apache.org/jira/browse/YARN-7835
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>            Priority: Critical
>         Attachments: YARN-7835.001.patch
>
>
> A race condition is observed: if the master container is killed for some 
> reason and relaunched on the same node, NMTimelinePublisher doesn't add a 
> timelineClient. But once the completed container for the 1st attempt arrives, 
> NMTimelinePublisher removes the timelineClient. 
> This causes all subsequent event publishing from different clients to fail 
> with the exception "Application is not found".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
