[
https://issues.apache.org/jira/browse/YARN-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372404#comment-16372404
]
Haibo Chen commented on YARN-7835:
----------------------------------
Thanks [~rohithsharma] for the updated patch! There are a few misspellings;
let's fix them: collectos -> collectors, applicatin -> application.
One issue I see with the new test (and all the other test methods regarding
stopContainer()) is that it is flaky, because it depends on how quickly the
executor inside the PerNodeTimelineCollectorsAuxService runs the deletion task
when an application is supposed to be removed. We have two threads: the thread
that runs the test code and calls auxService.stopContainer(), and the executor
thread that removes the application asynchronously. Consider the following code:
{code:java}
auxService.stopContainer(context);
// auxService should have the app's collector and need to remove only after
// a configured period
assertTrue("Applicatin not found in collectors.",
auxService.hasApplication(appAttemptId.getApplicationId()));
// 2nd attempt container removed, still collectos should hold applicatin id.
assertTrue("collector has removed application though 2nd attempt"
+ " is running this node",
waitFor(auxService, appAttemptId.getApplicationId(), 4, 500));
{code}
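If the executor thread is slow, both assertTrue() calls could pass before the
deletion task has run, even though the application would still be removed
afterwards. A passing test therefore does not prove the collector is retained.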
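To make the timing problem concrete, here is a minimal, self-contained sketch. It does not use the real PerNodeTimelineCollectorsAuxService; the class DelayedRemovalRace, the collectors set, and the latch that simulates a slow executor thread are all assumptions introduced for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the aux service; not the real YARN class.
class DelayedRemovalRace {

  /** Returns {presentBeforeExecutorRan, presentAfterExecutorRan}. */
  static boolean[] observe() throws Exception {
    Set<String> collectors = ConcurrentHashMap.newKeySet();
    collectors.add("app_1");

    ExecutorService executor = Executors.newSingleThreadExecutor();
    // The latch simulates a slow executor thread: the removal task cannot
    // proceed until the "test thread" has already made its check.
    CountDownLatch executorMayRun = new CountDownLatch(1);
    try {
      Future<?> removal = executor.submit(() -> {
        executorMayRun.await();
        collectors.remove("app_1");
        return null;
      });

      // The test thread checks immediately, like the first assertTrue().
      boolean presentBefore = collectors.contains("app_1");

      // Now let the executor thread catch up and wait for it to finish.
      executorMayRun.countDown();
      removal.get();
      boolean presentAfter = collectors.contains("app_1");

      return new boolean[] {presentBefore, presentAfter};
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    boolean[] seen = observe();
    System.out.println("check before executor ran: " + seen[0]); // true
    System.out.println("check after executor ran: " + seen[1]);  // false
  }
}
```

The early check sees the application even though it is removed moments later, which is the false positive described above.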
A more reliable way is to extract the asynchronous application-removal part
into a method that we can override in a test class. That is, in
PerNodeTimelineCollectorsAuxService.java, we'd have a method that takes the
container being stopped (the task needs both the container id and its app id):
{code:java}
protected Future<?> removeApplicationCollector(final ContainerId containerId) {
  final ApplicationId appId =
      containerId.getApplicationAttemptId().getApplicationId();
  return scheduler.schedule(new Runnable() {
    public void run() {
      synchronized (appIdToContainerId) {
        Set<ContainerId> masterContainers = appIdToContainerId.get(appId);
        if (masterContainers == null) {
          LOG.info("Stop container for " + containerId
              + " is called before initializing container.");
          return;
        }
        masterContainers.remove(containerId);
        if (masterContainers.isEmpty()) {
          // remove only if it is the last master container
          removeApplication(appId);
          appIdToContainerId.remove(appId);
        }
      }
    }
  }, collectorLingerPeriod, TimeUnit.MILLISECONDS);
}
{code}
In TestPerNodeTimelineCollectorsAuxService.java, we can then create a test
version of PerNodeTimelineCollectorsAuxService that does the application
removal synchronously by overriding the method as
{code:java}
@Override
protected Future<?> removeApplicationCollector(ContainerId containerId) {
  Future<?> future = super.removeApplicationCollector(containerId);
  try {
    future.get(); // block until the scheduled removal has run
  } catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException(e);
  }
  return future;
}
{code}
Even though this is more code, it makes auxService.hasApplication(appId) in our
test code a clear indication of whether the app collector has been removed or
not. We can then remove all the waitFor() calls.
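The override-and-block pattern can be tried in isolation with the following sketch. It is only an illustration under stated assumptions: LingeringRegistry, SynchronousRegistry, removeLater() and the 10 ms linger period are hypothetical names and values, not part of the patch or of YARN.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SynchronousRemovalDemo {

  /** Hypothetical stand-in for the aux service: removal runs after a linger period. */
  static class LingeringRegistry {
    private final Set<String> apps = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final long lingerMillis;

    LingeringRegistry(long lingerMillis) {
      this.lingerMillis = lingerMillis;
    }

    void add(String appId) {
      apps.add(appId);
    }

    boolean hasApplication(String appId) {
      return apps.contains(appId);
    }

    void stop(String appId) {
      removeLater(appId);
    }

    // The asynchronous part is isolated here so a test subclass can override it.
    protected Future<?> removeLater(String appId) {
      return scheduler.schedule(() -> {
        apps.remove(appId);
      }, lingerMillis, TimeUnit.MILLISECONDS);
    }

    void close() {
      scheduler.shutdownNow();
    }
  }

  /** Test version: waits for the scheduled removal before returning. */
  static class SynchronousRegistry extends LingeringRegistry {
    SynchronousRegistry(long lingerMillis) {
      super(lingerMillis);
    }

    @Override
    protected Future<?> removeLater(String appId) {
      Future<?> future = super.removeLater(appId);
      try {
        future.get(); // block until the removal task has run
      } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException(e);
      }
      return future;
    }
  }

  static boolean stillPresentAfterStop() {
    LingeringRegistry registry = new SynchronousRegistry(10);
    try {
      registry.add("app_1");
      registry.stop("app_1");
      // No polling needed: stop() returned only after the removal ran.
      return registry.hasApplication("app_1");
    } finally {
      registry.close();
    }
  }

  public static void main(String[] args) {
    System.out.println("present after stop: " + stillPresentAfterStop()); // false
  }
}
```

Because stop() returns only after the removal task has finished, a single hasApplication() check is deterministic and no waitFor()-style polling is required. The trade-off is that the test still waits out the linger period, so a short period should be configured in tests.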
> [Atsv2] Race condition in NM while publishing events if second attempt
> launched on same node
> --------------------------------------------------------------------------------------------
>
> Key: YARN-7835
> URL: https://issues.apache.org/jira/browse/YARN-7835
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Priority: Critical
> Attachments: YARN-7835.001.patch, YARN-7835.002.patch
>
>
> A race condition is observed: if the master container is killed for some
> reason and relaunched on the same node, NMTimelinePublisher doesn't add the
> timelineClient. But once the completed container for the 1st attempt arrives,
> NMTimelinePublisher removes the timelineClient.
> This causes all subsequent event publishing from different clients to fail
> with the exception "Application is not found".