Sangjin Lee commented on YARN-2902:

Sorry it took me a while to get to this. Here is an excerpt of the log when 
this happened:

2015-03-05 00:20:17,529 INFO 
 Application application_1418357586203_2035414 transitioned from INITING to 
2015-03-05 00:20:17,532 INFO 
Container container_1418357586203_2035414_01_000486 transitioned from NEW to 
2015-03-05 00:20:17,532 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
event CONTAINER_INIT for appId application_1418357586203_2035414
2015-03-05 00:20:17,532 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
event APPLICATION_INIT for appId application_1418357586203_2035414
2015-03-05 00:20:17,532 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
APPLICATION_INIT for service mapreduce_shuffle
2015-03-05 00:20:17,532 INFO org.apache.hadoop.mapred.ShuffleHandler: Added 
token for job_1418357586203_2035414

2015-03-05 00:20:17,532 INFO 
 transitioned from INIT to DOWNLOADING
2015-03-05 00:20:17,537 INFO 
 Downloading public rsrc:{ 
 1425430654133, FILE, null }

2015-03-05 00:28:44,388 INFO 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=alice  
IP=        OPERATION=Stop Container Request        
TARGET=ContainerManageImpl      RESULT=SUCCESS  

2015-03-05 00:30:51,731 ERROR 
 Attempt to remove resource: { { 
 1425430654133, FILE, null },pending,[],1895876473278238,DOWNLOADING} with 
non-zero refcount

Around this time, many public resources in the DOWNLOADING state generated 
this error. I believe these resources really were in the middle of being 
downloaded (yes, this can happen). What's not clear to me is whether the 
trigger was the public localization timing out or the stopContainer request 
(see the NMAuditLogger entry in the log above). Sorry, there is not much more 
information I can glean from the log. Let me know if you have more questions.

FYI, we're running basically 2.4.0+.

> Killing a container that is localizing can orphan resources in the 
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.
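
To make the failure mode in the description above concrete, here is a minimal 
Java sketch. It is a toy model, not the NodeManager's code: OrphanSketch, 
Resource, State and cleanupScan are hypothetical names invented for 
illustration, and the only behavior taken from the description is that the 
cleanup scan never evicts an entry in the DOWNLOADING state, even at refcount 
zero.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class OrphanSketch {
    // Hypothetical, reduced set of states; the real state machine has more.
    enum State { DOWNLOADING, LOCALIZED }

    // Hypothetical stand-in for a cached resource entry.
    static final class Resource {
        final String name;
        State state;
        int refCount;

        Resource(String name, State state, int refCount) {
            this.name = name;
            this.state = state;
            this.refCount = refCount;
        }
    }

    // Simplified cleanup scan: evicts unreferenced entries, but skips anything
    // still marked DOWNLOADING, as the issue description says. Returns the
    // number of entries removed.
    static int cleanupScan(List<Resource> cache) {
        int removed = 0;
        for (Iterator<Resource> it = cache.iterator(); it.hasNext();) {
            Resource r = it.next();
            if (r.refCount == 0 && r.state != State.DOWNLOADING) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<Resource> cache = new ArrayList<>();
        cache.add(new Resource("done.jar", State.LOCALIZED, 0));

        // One container is waiting on this download.
        Resource pending = new Resource("pending.jar", State.DOWNLOADING, 1);
        cache.add(pending);

        // The container is killed mid-localization: its reference is dropped,
        // but in this model nothing ever moves the entry out of DOWNLOADING.
        pending.refCount--;

        int removed = cleanupScan(cache);
        // The finished entry is evicted; the DOWNLOADING entry is orphaned.
        System.out.println("removed=" + removed + " remaining=" + cache.size());
    }
}
```

In this model the scan evicts done.jar but leaves pending.jar behind 
indefinitely, which is the orphaning the description reports. The sketch says 
nothing about how the attached patches address it.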
