Jason Lowe commented on YARN-2902:

bq. For NM side, currently it does not kill localizers. We can track PID and 
kill it as discussed earlier if HB doesn't come for a configured period.
Yeah, I think long-term to make this a lot more stable and reliable we're going 
to need the ability for the NM to kill localizers explicitly rather than via 
request.  As you mentioned, the concern is that the localizer will not actually 
interrupt and stop the localization.  Having the NM forcibly kill the localizer 
means we can put less trust in the localizer to always get that right.  However, 
that's probably a lot of work and code churn, which makes it less palatable 
for a 2.7 inclusion.  Ideally we should target a minimal change for 2.7 that 
gets us past the main problems we're having today, and we can add more 
bulletproofing in follow-up JIRAs for subsequent releases.
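
For illustration only, the longer-term approach (track each localizer's PID and 
flag it once heartbeats stop for a configured period) might look roughly like 
the sketch below.  The class and method names are hypothetical and are not 
taken from the NM code or any attached patch; the caller would still have to 
do the actual kill of the returned PIDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: remembers when each localizer PID last heartbeated. */
final class LocalizerHeartbeatTracker {
    private final long timeoutMillis;
    private final Map<Long, Long> lastHeartbeatByPid = new ConcurrentHashMap<>();

    LocalizerHeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record a heartbeat from the localizer with the given PID. */
    void heartbeat(long pid, long nowMillis) {
        lastHeartbeatByPid.put(pid, nowMillis);
    }

    /** Stop tracking a localizer that has exited or been killed. */
    void remove(long pid) {
        lastHeartbeatByPid.remove(pid);
    }

    /** PIDs whose last heartbeat is older than the configured timeout. */
    List<Long> expired(long nowMillis) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Long> e : lastHeartbeatByPid.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Timestamps are passed in rather than read from the clock so the timeout logic 
can be exercised without sleeping.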

As for properly handling DIE so that we actually stop downloading, and the 
problems with canceling active transfers, can't we just have the localizer 
forcibly tear down the JVM?  If we're being told to DIE then I assume we really 
don't care about pending transfers completing and just want to get out.  If the 
NM is going to clean up after the localizer anyway, it seems like we can 
drastically simplify DIE handling and just exit the JVM.  That seems like a 
change that's targeted enough to be appropriate for 2.7, instead of adding 
localizer kill support, etc.
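
A minimal sketch of what I mean, assuming the localizer sees the NM's action 
as a simple LIVE/DIE value.  The names here are hypothetical stand-ins, not the 
actual ContainerLocalizer code, and the exit hook is injected only so the 
behavior can be tested without really killing the JVM:

```java
import java.util.function.IntConsumer;

/** Hypothetical sketch: on DIE, skip transfer cancellation and just exit. */
final class LocalizerDieSketch {
    /** Stand-in for the action the NM sends back to the localizer. */
    enum Action { LIVE, DIE }

    /** Arbitrary exit status chosen for this sketch. */
    static final int EXIT_CODE = 1;

    private final IntConsumer exiter;

    /** In a real localizer the exiter could be Runtime.getRuntime()::halt. */
    LocalizerDieSketch(IntConsumer exiter) {
        this.exiter = exiter;
    }

    /** Returns true if the localizer should keep running. */
    boolean handle(Action action) {
        if (action == Action.DIE) {
            // No attempt to interrupt or drain in-flight downloads: the
            // assumption is that the NM cleans up after the localizer.
            exiter.accept(EXIT_CODE);
            return false;
        }
        return true;
    }
}
```

With `Runtime.getRuntime()::halt` as the exiter, the JVM goes down without 
running shutdown hooks, so a stuck download cannot hold up the exit.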

> Killing a container that is localizing can orphan resources in the 
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.
