Varun Saxena commented on YARN-2902:

[~jlowe], kindly review.
Sorry, I could not upload the patch earlier due to bandwidth issues, but I think
it's still on track for 2.7.2.

Coming to the patch: it handles the deletion in the NM itself. When processing
the container cleanup event (after the container is killed), we transition the
downloading resource to FAILED.
After the localizer exits, the deletion is done in the finally block of
LocalizerRunner, as per the suggestion given above.
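
To illustrate the flow described above, here is a minimal sketch. It is not the
patch code: the class, enum, method names and paths are hypothetical, and the
deleter callback is only a stand-in for handing a path to the NM deletion
service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from the NM code base.
class LocalizerCleanupSketch {
    enum State { DOWNLOADING, LOCALIZED, FAILED }

    // Local path -> state, a stand-in for the NM's resource tracking.
    final Map<String, State> resources = new ConcurrentHashMap<>();
    // Stand-in for handing a path to the deletion service.
    private final Consumer<String> deleter;

    LocalizerCleanupSketch(Consumer<String> deleter) {
        this.deleter = deleter;
    }

    // Container cleanup event: anything still downloading becomes FAILED.
    void onContainerCleanup() {
        resources.replaceAll((path, s) -> s == State.DOWNLOADING ? State.FAILED : s);
    }

    // Runs the localizer; the finally block deletes FAILED resources
    // whether the localizer returned normally or threw.
    void runLocalizer(Runnable localizer) {
        try {
            localizer.run();
        } finally {
            for (Map.Entry<String, State> e : resources.entrySet()) {
                if (e.getValue() == State.FAILED) {
                    deleter.accept(e.getKey());
                    resources.remove(e.getKey());
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> deleted = new ArrayList<>();
        LocalizerCleanupSketch nm = new LocalizerCleanupSketch(deleted::add);
        nm.resources.put("/tmp/cache/a_tmp", State.DOWNLOADING);
        nm.resources.put("/tmp/cache/b", State.LOCALIZED);
        // The container is killed while the localizer is still running.
        nm.runLocalizer(nm::onContainerCleanup);
        System.out.println("deleted: " + deleted);
    }
}
```

In this sketch only the resource that was still DOWNLOADING at kill time is
passed to the deleter; the LOCALIZED one is left alone.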

There is one presumably rare scenario where this deletion won't work: NM
recovery is not enabled, and the deletion task has been scheduled but is still
sitting in the deletion service's executor queue because all 4 threads in the
deletion service's executor (NM delete threads) are occupied. If the NM goes
down before this task is taken up, the downloading resources won't be deleted.
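
The queued-task loss can be reproduced with a plain JDK executor. This is only
a sketch under assumptions: a fixed pool of 4 threads stands in for the NM
delete threads, shutdownNow() stands in for the NM going down, and the class
and method names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not the NM deletion service.
class QueuedDeletionLossSketch {
    // Returns how many tasks were still queued (never started) at shutdown.
    static int simulate(AtomicBoolean deleted) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CountDownLatch allBusy = new CountDownLatch(4);
        CountDownLatch release = new CountDownLatch(1);

        // Occupy all 4 threads with tasks that block until released.
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> {
                allBusy.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            allBusy.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // The deletion task has no free thread, so it waits in the queue.
        pool.execute(() -> deleted.set(true));

        // Stand-in for the NM going down: queued tasks are dropped.
        List<Runnable> neverRan = pool.shutdownNow();
        return neverRan.size();
    }

    public static void main(String[] args) {
        AtomicBoolean deleted = new AtomicBoolean(false);
        int lost = simulate(deleted);
        System.out.println("tasks lost: " + lost + ", deletion ran: " + deleted.get());
    }
}
```

Here the deletion task is the one task returned by shutdownNow() and it never
runs. Without persisted state (NM recovery), nothing would re-queue it after a
restart.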

If you want this handled, we can attempt deletion in the container localizer
too. I already have code for it (in earlier patches). But do we need to handle
this rare case? Let me know.
BTW, the patch does not apply cleanly on branch-2.7, so I will update that patch
once the trunk patch is OK to go in.

> Killing a container that is localizing can orphan resources in the 
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, 
> YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.
