Jason Lowe commented on YARN-2902:

Thanks for updating the patch, Varun!

I'm not sure the logic in handleDownloadingRsrcsOnCleanup is desired.  IIUC 
when a container is killed then all resources that were still pending will 
receive a resource failed event.  Those resources will in turn send all 
containers that reference it a failed resource message, and that will 
effectively kill the container.  This could be bad if the killed container was 
localizing a resource for more than one container.  Today when the container is 
killed I believe another outstanding container will simply acquire the lock on 
the scheduled resource and proceed to re-attempt a localization on that 
resource.  If we do this logic then we'll kill all those other referencing 
containers which isn't correct.  We do need to cleanup downloading resource 
objects that have a zero refcount, but not ones that are still referenced by 
other containers.  Those other containers should be able to continue localizing 
the resource if necessary.

With the following code, don't we risk failing to put the resource on the 
scheduled event queue if something is thrown yet we will still hold the 
resource lock?  In the old code we would always put the resource on the 
scheduled queue before fielding any errors, and the scheduled queue processing 
would release the resource lock.
          if (nRsrc.tryAcquire()) {
            if (nRsrc.getState() == ResourceState.DOWNLOADING) {
              [...some code removed for brevity...]
              try {
                Path localizationPath = getPathForLocalization(next);
                synchronized (scheduled) {
                      new ScheduledResource(evt, localizationPath));
                return NodeManagerBuilderUtils.newResourceLocalizationSpec(next,
              } catch (IOException e) {
                LOG.error("local path for PRIVATE localization could not be " +
                    "found. Disks might have failed.", e);
              } catch (IllegalArgumentException e) {
                LOG.error("Inorrect path for PRIVATE localization."
                    + next.getResource().getFile(), e);
              } catch (URISyntaxException e) {
                    "Got exception in parsing URL of LocalResource:"
                        + next.getResource(), e);
            } else {
              // Need to release acquired lock

Actually do we even need all the code refactoring associated with the new 
ScheduledResource class?  I think we can simply use the local path already 
accessible via event.getResource().getLocalPath() in the cleanup code.  If 
that's not null then that means we sent the localizer that resource in the 
heartbeat response, and those would be the ones to target for cleanup.

> Killing a container that is localizing can orphan resources in the 
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, 
> YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.

This message was sent by Atlassian JIRA

Reply via email to