[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388937#comment-14388937 ]
Sangjin Lee commented on YARN-2902:
-----------------------------------
Sorry it took me a while to get to this. Here is an excerpt of the log when
this happened:
{noformat}
2015-03-05 00:20:17,529 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
Application application_1418357586203_2035414 transitioned from INITING to
RUNNING
2015-03-05 00:20:17,532 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1418357586203_2035414_01_000486 transitioned from NEW to
LOCALIZING
2015-03-05 00:20:17,532 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
event CONTAINER_INIT for appId application_1418357586203_2035414
2015-03-05 00:20:17,532 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
event APPLICATION_INIT for appId application_1418357586203_2035414
2015-03-05 00:20:17,532 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
APPLICATION_INIT for service mapreduce_shuffle
2015-03-05 00:20:17,532 INFO org.apache.hadoop.mapred.ShuffleHandler: Added
token for job_1418357586203_2035414
2015-03-05 00:20:17,532 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
Resource
hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar
transitioned from INIT to DOWNLOADING
2015-03-05 00:20:17,537 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Downloading public rsrc:{
hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar,
1425430654133, FILE, null }
2015-03-05 00:28:44,388 INFO
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=alice
IP=10.53.186.122 OPERATION=Stop Container Request
TARGET=ContainerManageImpl RESULT=SUCCESS
APPID=application_1418357586203_2035414
CONTAINERID=container_1418357586203_2035414_01_000486
2015-03-05 00:30:51,731 ERROR
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl:
Attempt to remove resource: { {
hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar,
1425430654133, FILE, null },pending,[],1895876473278238,DOWNLOADING} with
non-zero refcount
{noformat}
Around this time, many public resources in the DOWNLOADING state generated
this same error. I do think these public resources were truly in the middle of
being downloaded (yes, this can happen). What's not clear to me is whether the
trigger was the public localization timing out or the stopContainer request
(the NMAuditLogger entry at 00:28:44 in the log above). Sorry, there is not
much more information I can glean from the log. Let me know if you have more
questions.
FYI, we're running basically 2.4.0+.
> Killing a container that is localizing can orphan resources in the
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
> Key: YARN-2902
> URL: https://issues.apache.org/jira/browse/YARN-2902
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: Jason Lowe
> Assignee: Varun Saxena
> Attachments: YARN-2902.002.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed, its
> resources are left in the DOWNLOADING state. If no other container comes
> along and requests these resources, they linger with a reference count of
> zero but are never cleaned up, because the normal cache cleanup scan never
> deletes resources in the DOWNLOADING state, even if their reference count
> is zero.
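
The orphaning described in the issue summary can be illustrated with a small,
self-contained model. This is only a sketch under stated assumptions, not the
NodeManager code: the names OrphanedResourceSketch, Tracker, Resource,
request, release, downloadFinished and cleanup are all hypothetical, and the
only rule taken from the description is that the cleanup scan skips anything
in the DOWNLOADING state, even at refcount zero.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a local resource cache; not Hadoop code.
public class OrphanedResourceSketch {
    enum State { INIT, DOWNLOADING, LOCALIZED }

    static final class Resource {
        final String path;
        State state = State.INIT;
        int refCount = 0;
        Resource(String path) { this.path = path; }
    }

    static final class Tracker {
        private final Map<String, Resource> cache = new LinkedHashMap<>();

        // A container asks for a resource: take a reference, start a download if needed.
        Resource request(String path) {
            Resource r = cache.computeIfAbsent(path, Resource::new);
            r.refCount++;
            if (r.state == State.INIT) {
                r.state = State.DOWNLOADING;
            }
            return r;
        }

        // A container is stopped or killed: drop its reference, leave the state alone.
        void release(String path) {
            Resource r = cache.get(path);
            if (r != null && r.refCount > 0) {
                r.refCount--;
            }
        }

        // The download completes (in the orphan scenario this is never called).
        void downloadFinished(String path) {
            Resource r = cache.get(path);
            if (r != null) {
                r.state = State.LOCALIZED;
            }
        }

        // Cleanup scan: evict only unreferenced resources that are NOT downloading.
        List<String> cleanup() {
            List<String> removed = new ArrayList<>();
            cache.values().removeIf(r -> {
                boolean evictable = r.refCount == 0 && r.state != State.DOWNLOADING;
                if (evictable) {
                    removed.add(r.path);
                }
                return evictable;
            });
            return removed;
        }

        boolean contains(String path) {
            return cache.containsKey(path);
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        tracker.request("foo.jar");   // container starts localizing
        tracker.release("foo.jar");   // container is killed mid-download
        System.out.println("removed by scan: " + tracker.cleanup());
        System.out.println("still cached: " + tracker.contains("foo.jar"));
    }
}
```

In this model the entry stays cached after the kill because nothing ever moves
it out of DOWNLOADING. It becomes evictable only if the download is later
marked finished or another container requests the resource again.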