[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336724#comment-14336724 ]
Jason Lowe commented on YARN-2902:
----------------------------------
We still need to do this for APPLICATION resources. It is true that those
resources will be cleaned up when the application finishes, but that could be
hours or days later. As for PUBLIC resources, Sangjin confirmed earlier that
he has seen the orphaning occur with those as well, so it must be happening
somehow even for them. [~sjlee0], do you have any ideas on how PUBLIC
resources ended up hung in the DOWNLOADING state? I'm wondering if this is
specific to the shared cache setup or if there's a code path we're missing.

I don't think we should special-case the resource types to fix this. Again, I
think the cleanest approach is to make sure we send an event to the
LocalizedResource when a container localizer (or maybe just the container
itself) is killed, and let that state machine handle it appropriately (e.g.,
try to remove the _tmp file if the resource was in the DOWNLOADING state,
ignore the event if the resource is already localized, etc.).
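
To make the proposal concrete, here is a rough sketch of the idea, not the
actual NodeManager code. The class, the LOCALIZER_KILLED event name, the
choice of FAILED as the resulting state, and the tmpPath field are all
hypothetical and chosen only for illustration. The sketch also ignores the
case where several containers are waiting on the same download.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for LocalizedResource; not the real Hadoop class. */
final class LocalizedResourceSketch {
    enum State { INIT, DOWNLOADING, LOCALIZED, FAILED }
    enum Event { REQUEST, LOCALIZED, LOCALIZER_KILLED }

    private State state = State.INIT;
    private final Path tmpPath; // assumed location of the partial "_tmp" download

    LocalizedResourceSketch(Path tmpPath) {
        this.tmpPath = tmpPath;
    }

    State getState() {
        return state;
    }

    /** Let the resource's own state machine decide what a kill means. */
    void handle(Event event) {
        switch (state) {
            case INIT:
                if (event == Event.REQUEST) {
                    state = State.DOWNLOADING;
                }
                break;
            case DOWNLOADING:
                if (event == Event.LOCALIZED) {
                    state = State.LOCALIZED;
                } else if (event == Event.LOCALIZER_KILLED) {
                    // Download was interrupted: remove the partial file and
                    // leave DOWNLOADING so cleanup scans can see the resource.
                    try {
                        Files.deleteIfExists(tmpPath);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    state = State.FAILED; // assumption: any non-DOWNLOADING state would do
                }
                break;
            case LOCALIZED:
            case FAILED:
                // Already localized (or already failed): ignore the event.
                break;
        }
    }
}
```

The point of routing the kill through the state machine is that no caller
needs to know the resource type (PUBLIC, PRIVATE, or APPLICATION); the
resource itself decides whether there is anything to clean up.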
> Killing a container that is localizing can orphan resources in the
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: Jason Lowe
> Assignee: Varun Saxena
> Fix For: 2.7.0
>
> Attachments: YARN-2902.002.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed, then
> its resources are left in the DOWNLOADING state. If no other container comes
> along and requests these resources, they linger with no references but
> aren't cleaned up during normal cache cleanup scans, since a scan will
> never delete resources in the DOWNLOADING state even if their reference count
> is zero.
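
The orphaning described in the issue summary can be illustrated with a small
sketch. This is a simplified model under stated assumptions, not the real
cache cleanup code: the Resource class, its fields, and deletionCandidates()
are hypothetical names. It only shows why a DOWNLOADING entry with a zero
reference count is never picked up by a scan.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a cache cleanup scan; not the real NodeManager code. */
final class CacheScanSketch {
    enum State { DOWNLOADING, LOCALIZED }

    static final class Resource {
        final String name;
        final State state;
        final int refCount;

        Resource(String name, State state, int refCount) {
            this.name = name;
            this.state = state;
            this.refCount = refCount;
        }
    }

    /** Names of the resources a scan would consider deleting. */
    static List<String> deletionCandidates(List<Resource> cache) {
        List<String> candidates = new ArrayList<>();
        for (Resource r : cache) {
            // Unreferenced resources are eligible, but anything still marked
            // DOWNLOADING is skipped even when its reference count is zero.
            if (r.refCount == 0 && r.state != State.DOWNLOADING) {
                candidates.add(r.name);
            }
        }
        return candidates;
    }
}
```

In this model, a resource whose localizer was killed stays in DOWNLOADING
with refCount 0 indefinitely, so it never appears in the candidate list.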