[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state

Sangjin Lee (JIRA) Tue, 31 Mar 2015 10:50:07 -0700

    [ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388937#comment-14388937 ]


Sangjin Lee commented on YARN-2902:
-----------------------------------

Sorry it took me a while to get to this. Here is an excerpt of the log from 
when this happened:

{noformat}
2015-03-05 00:20:17,529 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1418357586203_2035414 transitioned from INITING to RUNNING
2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1418357586203_2035414_01_000486 transitioned from NEW to LOCALIZING
2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1418357586203_2035414
2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1418357586203_2035414
2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle
2015-03-05 00:20:17,532 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1418357586203_2035414

2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar transitioned from INIT to DOWNLOADING
2015-03-05 00:20:17,537 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar, 1425430654133, FILE, null }

2015-03-05 00:28:44,388 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=alice  IP=10.53.186.122        OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1418357586203_2035414 CONTAINERID=container_1418357586203_2035414_01_000486

2015-03-05 00:30:51,731 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar, 1425430654133, FILE, null },pending,[],1895876473278238,DOWNLOADING} with non-zero refcount
{noformat}

Around this time there were many public resources in the DOWNLOADING state that 
generated this error. I do think these public resources were truly in the 
middle of being downloaded (yes, this can happen). What's not clear to me is 
whether the trigger was the public localization timing out or the stopContainer 
request (the Stop Container Request entry in the log). Sorry, there is not much 
more information I can glean from the log. Let me know if you have more 
questions.
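
To gauge how many resources hit this error, and in which state, the log can be 
scanned mechanically. The following is only a rough sketch, not a tool used in 
this investigation: the function name and regex are hypothetical, and it assumes 
each log entry sits on a single line, as in the raw NodeManager log rather than 
a hard-wrapped email excerpt.

```python
import re
from collections import Counter

# Hypothetical pattern for the LocalResourcesTrackerImpl error quoted above.
# Assumption: one log entry per line (raw log, not an email-wrapped excerpt).
REFCOUNT_ERROR = re.compile(
    r"Attempt to remove resource: \{ \{ (?P<url>\S+),.*?,(?P<state>[A-Z]+)\} "
    r"with non-zero refcount"
)

def count_refcount_errors(lines):
    """Count 'non-zero refcount' removal errors, grouped by resource state."""
    states = Counter()
    for line in lines:
        match = REFCOUNT_ERROR.search(line)
        if match:
            states[match.group("state")] += 1
    return states

sample = (
    "2015-03-05 00:30:51,731 ERROR LocalResourcesTrackerImpl: "
    "Attempt to remove resource: { { hdfs://hadoop-cluster/sharedcache/foo.jar, "
    "1425430654133, FILE, null },pending,[],1895876473278238,DOWNLOADING} "
    "with non-zero refcount"
)
counts = count_refcount_errors([sample, "unrelated INFO line"])
```

The sample line is shortened for illustration; the class name and resource path 
in it are abbreviated and not copied verbatim from the log.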

FYI, we're running basically 2.4.0+.

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources, they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans, since a scan 
> will never delete resources in the DOWNLOADING state even if their reference 
> count is zero.
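
The behaviour described in the issue can be illustrated with a toy model. This 
is a minimal sketch under stated assumptions, not the NodeManager's actual code: 
the `Resource` class, the `cleanup_scan` function, and the file names are all 
hypothetical, and the eviction rule is a simplification of what the description 
says (a scan only removes fully localized, unreferenced resources).

```python
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    INIT = "INIT"
    DOWNLOADING = "DOWNLOADING"
    LOCALIZED = "LOCALIZED"

@dataclass
class Resource:
    path: str       # hypothetical resource name
    state: State
    refcount: int   # number of containers still referencing the resource

def cleanup_scan(cache):
    """Toy cleanup scan: evict only LOCALIZED resources with refcount 0."""
    kept = []
    for rsrc in cache:
        evictable = rsrc.state is State.LOCALIZED and rsrc.refcount == 0
        if not evictable:
            kept.append(rsrc)
    return kept

cache = [
    Resource("a.jar", State.LOCALIZED, 0),      # unused: evicted
    Resource("b.jar", State.LOCALIZED, 1),      # still referenced: kept
    Resource("foo.jar", State.DOWNLOADING, 0),  # orphaned by a killed container
]
remaining = cleanup_scan(cache)
```

In this model `foo.jar` survives every scan even though nothing references it, 
which is the leak the issue describes: unless another container requests the 
resource and drives it out of DOWNLOADING, no path ever makes it evictable.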



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
