[
https://issues.apache.org/jira/browse/YARN-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357463#comment-16357463
]
Haibo Chen commented on YARN-7835:
----------------------------------
Thanks for the patch [~rohithsharma]. A few comments:
1) The new code in initializeContainer() and stopContainer() uses
synchronized blocks, so I assume it is meant to be thread safe. In
initializeContainer(), if two threads see at the same time that the container
set for a given application is missing, each would create its own singleton
set, and one would overwrite the other. I think we could just synchronize on
appIdToContainerId in both initializeContainer() and stopContainer(), as we
do with collectors in TimelineCollectorManager.
2) Not sure why we test
`!auxService.hasApplication(appAttemptId.getApplicationId())` in a loop after
the 1st attempt has stopped. IIUC, the application should never be removed
while the 2nd attempt is still running, so we'd just sleep on every iteration
until the for loop counter runs out. Am I missing something?
3) For the second for-loop, which waits for the application to be cleaned up,
I think we could reuse GenericTestUtils.waitFor().
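
A minimal sketch of points 1) and 3), under stated assumptions and not the
patch's actual code: AppContainerRegistry, its String ids, and the boolean
return values are hypothetical stand-ins for the aux service's bookkeeping.
The local waitFor() helper only imitates the idea of Hadoop's
GenericTestUtils.waitFor() so that the snippet compiles on its own; a real
test would call the Hadoop utility instead.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the per-application container bookkeeping. */
class AppContainerRegistry {
  // Every read and write of this map happens while holding its monitor.
  private final Map<String, Set<String>> appIdToContainerId = new HashMap<>();

  /** Registers a container; returns true if it is the first one for the app. */
  boolean initializeContainer(String appId, String containerId) {
    synchronized (appIdToContainerId) {
      Set<String> containers = appIdToContainerId.get(appId);
      boolean firstForApp = (containers == null);
      if (firstForApp) {
        // Check-then-create is atomic here, so no set can be overwritten.
        containers = new HashSet<>();
        appIdToContainerId.put(appId, containers);
      }
      containers.add(containerId);
      return firstForApp;
    }
  }

  /** Unregisters a container; returns true if the app was removed as a result. */
  boolean stopContainer(String appId, String containerId) {
    synchronized (appIdToContainerId) {
      Set<String> containers = appIdToContainerId.get(appId);
      if (containers == null) {
        return false;
      }
      containers.remove(containerId);
      if (containers.isEmpty()) {
        appIdToContainerId.remove(appId);
        return true;
      }
      return false;
    }
  }

  boolean hasApplication(String appId) {
    synchronized (appIdToContainerId) {
      return appIdToContainerId.containsKey(appId);
    }
  }

  /** Simplified polling helper in the spirit of GenericTestUtils.waitFor(). */
  static void waitFor(BooleanSupplier check, long checkEveryMillis,
      long waitForMillis) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + waitForMillis;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(
            "Condition not met within " + waitForMillis + " ms");
      }
      Thread.sleep(checkEveryMillis);
    }
  }
}
```

With this shape, stopping the 1st attempt's container while the 2nd attempt's
container is registered leaves hasApplication() true, and the cleanup wait in
the test reduces to a single call such as
waitFor(() -> !registry.hasApplication(appId), 100, 5000), where the interval
and timeout values are only illustrative.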
> [Atsv2] Race condition in NM while publishing events if second attempt
> launched on same node
> --------------------------------------------------------------------------------------------
>
> Key: YARN-7835
> URL: https://issues.apache.org/jira/browse/YARN-7835
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Priority: Critical
> Attachments: YARN-7835.001.patch
>
>
> A race condition is observed: if the master container is killed for some
> reason and relaunched on the same node, NMTimelinePublisher doesn't add a
> timelineClient. But once the completed container for the 1st attempt
> arrives, NMTimelinePublisher removes the timelineClient.
> As a result, all subsequent event publishing from different clients fails
> with the exception "Application is not found".