Varun Saxena commented on YARN-2902:

bq. As far as properly handling DIE so we actually stop downloading and 
problems canceling active transfers, can't we just have the localizer forcibly 
tear down the JVM? If we're being told to DIE then I assume we really don't 
care about pending transfers completing and just want to get out. If the NM is 
going to clean up after the localizer anyway, seems like we can drastically 
simplify DIE handling and just exit the JVM. That seems like a change that's 
targeted enough to be appropriate for 2.7 instead of adding localizer kill 
support, etc.
In the container localizer, when processing a heartbeat (HB) DIE response, we 
send another localizer status to the NM. Is that really required? What do you 
think?
I think that as soon as we get DIE, we can follow the current code and cancel 
pending tasks, though without waiting for them to complete (as is done in the 
newly added code in the patch), and delete the paths reported in the last 
status. We can then stop the executors and simply return from the loop for a 
graceful shutdown.
Or are you suggesting a System.exit()?
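
To make the proposal concrete, here is a minimal sketch of that DIE handling, 
not the actual ContainerLocalizer code. The class name, the method name and 
its parameters are all hypothetical, and it only deletes plain files or empty 
directories rather than whole directory trees.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only; names do not come from the Hadoop code base.
class DieHandlingSketch {

  /**
   * Cancels pending downloads without waiting for them, deletes the paths
   * reported in the last status, then stops the executor.
   * Returns the number of paths actually deleted.
   */
  static int handleDie(ExecutorService downloadExecutor,
                       List<Future<Path>> pendingDownloads,
                       List<Path> pathsInLastStatus) {
    // Request cancellation (interrupting running tasks) but never block on them.
    for (Future<Path> pending : pendingDownloads) {
      pending.cancel(true);
    }
    int deleted = 0;
    for (Path path : pathsInLastStatus) {
      try {
        // deleteIfExists returns false for a missing path instead of throwing.
        if (Files.deleteIfExists(path)) {
          deleted++;
        }
      } catch (IOException e) {
        // Best effort: the NM-side delayed deletion is assumed to clean up later.
      }
    }
    // Stop the executor; the caller can now return from its heartbeat loop.
    downloadExecutor.shutdownNow();
    return deleted;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    List<Future<Path>> pending = new ArrayList<>();
    // Stand-in for a long-running download.
    Callable<Path> slowDownload = () -> { Thread.sleep(60_000); return null; };
    pending.add(executor.submit(slowDownload));
    Path localized = Files.createTempFile("resource", ".tmp");

    int deleted = handleDie(executor, pending, Collections.singletonList(localized));
    System.out.println("deleted=" + deleted
        + " cancelled=" + pending.get(0).isCancelled());
  }
}
```

Because nothing waits on the cancelled futures, the localizer can leave its 
loop immediately, and whatever it fails to delete is left for the NM.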

From the NM side, we can have a deletion task after some configured delay 
(same as right now). Unlike the code in the patch now, though, we will never 
cancel this deletion task.

This way the localizer should quit quickly and the NM can clean up.
I will also change the behavior of the executor on deletion, i.e. I will 
ignore missing paths by default. I won't add a flag.
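
As a rough illustration of the NM side, the sketch below schedules a delayed 
deletion that is never cancelled and silently skips missing paths. It is not 
the NM's deletion service: the class, the methods and the delay value are 
hypothetical, and it handles only plain files or empty directories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; names do not come from the Hadoop code base.
class DelayedCleanupSketch {

  /** Deletes each path, treating an already-missing path as success. */
  static int deleteIgnoringMissing(List<Path> paths) {
    int deleted = 0;
    for (Path path : paths) {
      try {
        // Returns false (no exception) when the path does not exist.
        if (Files.deleteIfExists(path)) {
          deleted++;
        }
      } catch (IOException e) {
        System.err.println("Could not delete " + path + ": " + e);
      }
    }
    return deleted;
  }

  /** Schedules the cleanup after the given delay; callers never cancel it. */
  static ScheduledFuture<Integer> scheduleCleanup(
      ScheduledExecutorService scheduler, List<Path> paths, long delayMillis) {
    Callable<Integer> task = () -> deleteIgnoringMissing(paths);
    return scheduler.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws Exception {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    Path leftover = Files.createTempFile("leftover", ".tmp");
    // The localizer may already have removed this one.
    Path alreadyGone = leftover.resolveSibling("already-gone-" + System.nanoTime());

    ScheduledFuture<Integer> result =
        scheduleCleanup(scheduler, Arrays.asList(leftover, alreadyGone), 100L);
    System.out.println("deleted=" + result.get());
    scheduler.shutdown();
  }
}
```

Since a missing path is not an error here, it does not matter whether the 
localizer or the NM gets to a given path first.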

> Killing a container that is localizing can orphan resources in the 
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.

This message was sent by Atlassian JIRA
