Jason Lowe commented on YARN-2902:

bq. In container localizer, when processing HB DIE response, we send another 
localizer status to NM. Is it really required ? What do you think ?

I don't think this is required.  If the NM is telling the localizer to DIE then 
I don't think the NM cares after that point what the localizer is doing.  The 
NM is totally done with it at that point due to failure or lack of knowledge of 
that localizer.

bq. Or are you suggesting System exit ?
I was basically suggesting System.exit if we aren't convinced that the 
localizer can actually tear down in a timely manner.  For example, if the 
graceful shutdown could involve waiting for active transfers to complete 
because we can't reliably interrupt them, then yes I think System.exit is 
appropriate.  A good compromise would be to put a timeout on shutdown: if we 
can't shut down within so many seconds then have something (e.g., a watchdog 
thread if necessary) call System.exit to get out.  Otherwise the localizer 
could still be running and modifying the filesystem after the NM tries to 
clean up.
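
To make the timeout idea concrete, here is a rough sketch of what such a 
watchdog might look like.  ShutdownWatchdog, exitingJvm and shutdownComplete 
are hypothetical names used only for illustration; none of this is taken from 
the patch or from existing localizer code, and the timeout value would need 
to be chosen by whoever implements it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not part of YARN): forces the process out if a
 * graceful shutdown does not finish within a deadline.
 */
final class ShutdownWatchdog {
    private final long timeoutMillis;
    private final Runnable forceExit;
    private final CountDownLatch done = new CountDownLatch(1);

    /** forceExit is injectable so the behavior can be exercised without killing the JVM. */
    ShutdownWatchdog(long timeoutMillis, Runnable forceExit) {
        this.timeoutMillis = timeoutMillis;
        this.forceExit = forceExit;
    }

    /** Assumed production wiring: exit the JVM with a non-zero status on timeout. */
    static ShutdownWatchdog exitingJvm(long timeoutMillis) {
        return new ShutdownWatchdog(timeoutMillis, () -> System.exit(1));
    }

    /** Starts the watchdog thread; call this just before graceful shutdown begins. */
    Thread start() {
        Thread t = new Thread(() -> {
            try {
                // await() returns false when the timeout elapses before
                // shutdownComplete() has been called.
                if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    forceExit.run();
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and let the watchdog thread end.
                Thread.currentThread().interrupt();
            }
        }, "localizer-shutdown-watchdog");
        // Daemon thread, so the watchdog itself never keeps the JVM alive.
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Signals that graceful shutdown finished in time, so no forced exit is needed. */
    void shutdownComplete() {
        done.countDown();
    }
}
```

The localizer would call start() when it processes the DIE response and 
shutdownComplete() once its own cleanup has finished.  Note that System.exit 
still runs shutdown hooks, so if those could hang as well, 
Runtime.getRuntime().halt would be the harder stop.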

Worst-case scenario is this could still happen even with these fixes, but it 
should resolve the leaking issue for the vast majority of cases.  We can make 
it more bulletproof in a followup JIRA for 2.8 or later that actually has the 
NM tracking localizer pids and proactively killing them if they don't respond 
in a timely manner to commands.

bq. However we can also let localizer not do any cleanup at all and let NM 
delete paths.
I would still like the localizer to try to perform some cleanup if possible, as 
the NM doesn't track localizers in the state store.  Therefore if the NM 
restarts we may not clean up everything properly if the localizer doesn't do it 
on its own.

> Killing a container that is localizing can orphan resources in the 
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.
