[jira] [Commented] (YARN-7835) [Atsv2] Race condition in NM while publishing events if second attempt launched on same node

Haibo Chen (JIRA) Wed, 21 Feb 2018 20:49:56 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372404#comment-16372404
 ]


Haibo Chen commented on YARN-7835:
----------------------------------

Thanks [~rohithsharma] for the updated patch! There are a few misspellings; 
let's fix them: collectos -> collectors, applicatin -> application.

One issue I see with the new test (and all the other test methods regarding 
stopContainer()) is that it is flaky, because it depends on how quickly the 
executor inside PerNodeTimelineCollectorsAuxService runs the deletion task 
when an application is supposed to be removed. There are two threads: the 
thread that runs the test code and calls auxService.stopContainer(), and the 
executor thread that removes the application asynchronously. Consider the 
following code:
{code:java}
auxService.stopContainer(context);

// auxService should have the app's collector and need to remove only after
// a configured period
assertTrue("Applicatin not found in collectors.",
        auxService.hasApplication(appAttemptId.getApplicationId()));

// 2nd attempt container removed, still collectos should hold applicatin id.
assertTrue("collector has removed application though 2nd attempt"
            + " is running this node",
        waitFor(auxService, appAttemptId.getApplicationId(), 4, 500));

{code}
If the executor thread is slow, both assertTrue() calls could succeed even 
though the application would still be removed afterwards.
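
To illustrate the timing dependence, here is a minimal, self-contained sketch. 
It does not use the real PerNodeTimelineCollectorsAuxService; 
DelayedRemovalDemo, its methods, and the "app_1" id are hypothetical stand-ins 
that only mimic a removal scheduled on an executor thread after a linger 
period. A check made right after stopContainer() usually still sees the 
application, which says nothing about whether it will be removed later.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the aux service; not the real YARN class. */
class DelayedRemovalDemo {
  private final Set<String> apps = ConcurrentHashMap.newKeySet();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void addApplication(String appId) {
    apps.add(appId);
  }

  boolean hasApplication(String appId) {
    return apps.contains(appId);
  }

  /** Schedules the removal on the executor thread after lingerMs. */
  ScheduledFuture<?> stopContainer(String appId, long lingerMs) {
    return scheduler.schedule(() -> {
      apps.remove(appId);
    }, lingerMs, TimeUnit.MILLISECONDS);
  }

  /** Returns {present right after stopContainer(), present after the task ran}. */
  static boolean[] run(long lingerMs) {
    DelayedRemovalDemo service = new DelayedRemovalDemo();
    try {
      service.addApplication("app_1");
      ScheduledFuture<?> removal = service.stopContainer("app_1", lingerMs);
      // Racy check: usually true, because the removal task has not run yet.
      boolean presentRightAfterStop = service.hasApplication("app_1");
      // Wait for the executor thread to finish the removal.
      removal.get();
      boolean presentAfterTask = service.hasApplication("app_1");
      return new boolean[] {presentRightAfterStop, presentAfterTask};
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      service.scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    boolean[] result = run(200);
    System.out.println("present right after stop: " + result[0]);
    System.out.println("present after task ran:   " + result[1]);
  }
}
```

Only the second value is deterministic, since it is read after the scheduled 
task has completed; the first depends on thread scheduling, which is the 
source of the flakiness.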

A more reliable approach is to extract the asynchronous application-removal 
logic into a method that we can override in a test class. That is, in 
PerNodeTimelineCollectorsAuxService.java, we'd have a method
{code:java}
// Takes containerId because the removal task needs it; appId is derived from it.
protected Future<?> removeApplicationCollector(final ContainerId containerId) {
  final ApplicationId appId =
      containerId.getApplicationAttemptId().getApplicationId();
  return scheduler.schedule(new Runnable() {
        @Override
        public void run() {
          synchronized (appIdToContainerId) {
            Set<ContainerId> masterContainers = appIdToContainerId.get(appId);
            if (masterContainers == null) {
              LOG.info("Stop container for " + containerId
                  + " is called before initializing container.");
              return;
            }
            masterContainers.remove(containerId);
            if (masterContainers.isEmpty()) {
              // remove only if it is the last master container
              removeApplication(appId);
              appIdToContainerId.remove(appId);
            }
          }
        }
      }, collectorLingerPeriod, TimeUnit.MILLISECONDS);
}
{code}
In TestPerNodeTimelineCollectorsAuxService.java, we can then create a test 
version of PerNodeTimelineCollectorsAuxService that does the application 
removal synchronously by overriding the method as
{code:java}
@Override
protected Future<?> removeApplicationCollector(ContainerId containerId) {
  Future<?> future = super.removeApplicationCollector(containerId);
  try {
    future.get();  // block until the removal task has run
  } catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException(e);
  }
  return future;
}
{code}
Even though this is more code, it makes our test check 
auxService.hasApplication(appId) a clear indication of whether the app 
collector has been removed. We can then remove all the waitFor() calls.
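
The override approach described above can be sketched in isolation as 
follows. This is only an illustration under stated assumptions: 
LingeringService, SynchronousLingeringService, and SyncOverrideDemo are 
hypothetical names, string ids replace ApplicationId, and none of this is the 
actual YARN code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical service that removes an application after a linger period. */
class LingeringService {
  private final Set<String> apps = ConcurrentHashMap.newKeySet();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long lingerMs;

  LingeringService(long lingerMs) {
    this.lingerMs = lingerMs;
  }

  void addApplication(String appId) {
    apps.add(appId);
  }

  boolean hasApplication(String appId) {
    return apps.contains(appId);
  }

  void stopContainer(String appId) {
    removeApplicationCollector(appId);
  }

  /** Asynchronous removal, extracted so that a test subclass can override it. */
  protected Future<?> removeApplicationCollector(String appId) {
    return scheduler.schedule(() -> {
      apps.remove(appId);
    }, lingerMs, TimeUnit.MILLISECONDS);
  }

  void shutdown() {
    scheduler.shutdownNow();
  }
}

/** Test-only variant: blocks until the scheduled removal has completed. */
class SynchronousLingeringService extends LingeringService {
  SynchronousLingeringService(long lingerMs) {
    super(lingerMs);
  }

  @Override
  protected Future<?> removeApplicationCollector(String appId) {
    Future<?> future = super.removeApplicationCollector(appId);
    try {
      future.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
    return future;
  }
}

class SyncOverrideDemo {
  /** True if the app is gone as soon as stopContainer() returns. */
  static boolean removedAfterStop() {
    LingeringService service = new SynchronousLingeringService(50);
    try {
      service.addApplication("app_1");
      service.stopContainer("app_1");
      // No sleeping or polling: the override already waited for the task.
      return !service.hasApplication("app_1");
    } finally {
      service.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("removed after stop: " + removedAfterStop());
  }
}
```

One consequence of this design is that stopContainer() in the test variant 
blocks for the whole linger period, so a test would want to configure a short 
one.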

 

> [Atsv2] Race condition in NM while publishing events if second attempt 
> launched on same node
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-7835
>                 URL: https://issues.apache.org/jira/browse/YARN-7835
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>            Priority: Critical
>         Attachments: YARN-7835.001.patch, YARN-7835.002.patch
>
>
> A race condition is observed: if the master container is killed for some 
> reason and relaunched on the same node, NMTimelinePublisher doesn't add a 
> timelineClient. But once the completed container for the 1st attempt 
> arrives, NMTimelinePublisher removes the timelineClient. 
> This causes all subsequent event publishing from a different client to 
> fail with the exception "Application is not found".



