[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597265#comment-14597265
 ] 

Varun Saxena commented on YARN-2902:
------------------------------------

Looking at the public localization code, I do not think public resources can be 
orphaned, because we do not stop localization for them midway on container 
cleanup.
It is difficult to ascertain from the logs, though, why localization was failing 
for public resources in the scenario mentioned above. From what little of the 
code I could look into, I could not find anything concrete that explains the 
failures. 

Anyway, the problem in scope for this JIRA, i.e. orphaning of resources, would 
not happen for PUBLIC resources IMHO. And I see no point in delaying this JIRA 
further in the hope of finding out what went wrong with public resources in the 
scenario above.

bq.  What's not clear to me is whether the trigger was the public localization 
timing out or the stopContainer request 
The reference count can become 0 if the container is killed while downloading.
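
To make the failure mode concrete, here is a minimal sketch of how a resource ends up orphaned, following the issue description: the cache scan never evicts anything in the DOWNLOADING state, even at reference count zero. The class, enum, and method names are hypothetical and are not the NodeManager's actual types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not the NodeManager's classes.
final class OrphanSketch {
    enum State { DOWNLOADING, LOCALIZED }

    static final class Resource {
        final String name;
        State state = State.DOWNLOADING;
        int refCount = 0;

        Resource(String name) {
            this.name = name;
        }
    }

    // Cache scan as described in the issue: a resource is evictable only if
    // nobody references it AND it is not in the DOWNLOADING state.
    static List<Resource> evictable(List<Resource> cache) {
        List<Resource> out = new ArrayList<>();
        for (Resource r : cache) {
            if (r.refCount == 0 && r.state != State.DOWNLOADING) {
                out.add(r);
            }
        }
        return out;
    }
}
```

If the only container referencing a resource is killed mid-download, its count drops to 0 while the state stays DOWNLOADING, so {{evictable}} never returns it.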

Coming to the patch, there are two approaches to handle this.
# Cleanup of downloading resources can be done by the localization service 
while doing container cleanup.
# On a heartbeat from the container localizer, if the localizer runner has 
already stopped, we can indicate to the localizer runner that it should do the 
cleanup for downloading resources.

The attached patch adopts approach 1.
Here, we wait for the container localizer to die before running deletion tasks. 
Also, a downloading resource can be either in its local directory or in the 
local directory suffixed by {{_tmp}}, so we try to delete both.
Moreover, a localization failed event is sent to all the containers that refer 
to a resource in the downloading state. 
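
The "try both" deletion step can be sketched roughly as follows. This is not the code from the patch: the class and method names are hypothetical, and it uses plain java.nio on the local filesystem rather than the NodeManager's deletion tasks. It assumes the localizer has already exited, so nothing is still writing to either path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the YARN-2902 patch.
final class DownloadingResourceCleanup {
    static final String TMP_SUFFIX = "_tmp";

    // A partially downloaded resource may live at its local path or at the
    // same path suffixed by _tmp, so both are candidates for deletion.
    // Assumes localPath has a file name component (i.e. is not a root).
    static List<Path> candidatePaths(Path localPath) {
        List<Path> paths = new ArrayList<>();
        paths.add(localPath);
        paths.add(localPath.resolveSibling(localPath.getFileName() + TMP_SUFFIX));
        return paths;
    }

    // Deletes whichever candidates exist and returns how many were removed.
    static int cleanup(Path localPath) throws IOException {
        int deleted = 0;
        for (Path p : candidatePaths(localPath)) {
            if (Files.exists(p)) {
                deleteRecursively(p);
                deleted++;
            }
        }
        return deleted;
    }

    // Walks the tree and deletes children before their parent directories.
    private static void deleteRecursively(Path root) throws IOException {
        List<Path> ordered;
        try (Stream<Path> walk = Files.walk(root)) {
            ordered = walk.sorted(Comparator.reverseOrder())
                          .collect(Collectors.toList());
        }
        for (Path p : ordered) {
            Files.delete(p);
        }
    }
}
```

Waiting for the localizer to die first matters here: deleting while the download is still in progress could race with the localizer recreating the files.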


> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
