[
https://issues.apache.org/jira/browse/YARN-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597493#comment-16597493
]
Jason Lowe commented on YARN-8703:
----------------------------------
Ah, sorry. I quoted the wrong log message for this scenario. Instead of
"localized without a location" it should be "but localized resource is
missing".
By "the resource bookkeeping is removed" I meant the code in
LocalResourcesTrackerImpl#removeResource where the entry is removed from the
{{localrsrc}} map. Just before the suspicious code you listed is this
line:
{code}
LocalizedResource rsrc = localrsrc.get(req);
{code}
which means {{rsrc}} will be null for any localizer event received after
removeResource is called. So when the LOCALIZED event arrives via the
localizer heartbeat just after a container is killed and corresponding
resources are removed, it will then do this check and log a warning without any
further processing of the event:
{code}
if (rsrc == null) {
  LOG.warn("Received " + event.getType() + " event for request " + req
      + " but localized resource is missing");
  return;
}
{code}
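To make the sequence concrete, here is a minimal, self-contained sketch of the race. It is not the NodeManager code: LeakSketchTracker, its string-valued map, the untrackedPaths list, and the example path are all hypothetical stand-ins, used only to show that a LOCALIZED event arriving after removeResource finds no entry, so its on-disk location ends up with nothing tracking it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the tracker (not the real class).
class LeakSketchTracker {
  // Plays the role of the localrsrc map: request key -> bookkeeping state.
  private final Map<String, String> localrsrc = new HashMap<>();
  // Locations reported as localized that no bookkeeping entry refers to.
  final List<String> untrackedPaths = new ArrayList<>();

  // A container asks for the resource; the download starts.
  void request(String req) {
    localrsrc.put(req, "DOWNLOADING");
  }

  // The container is killed while localizing; bookkeeping is dropped.
  void removeResource(String req) {
    localrsrc.remove(req);
  }

  // LOCALIZED event delivered by a later localizer heartbeat.
  // Returns true if the resource is tracked after the call.
  boolean handleLocalized(String req, String location) {
    String rsrc = localrsrc.get(req);
    if (rsrc == null) {
      // Warn-and-return behavior: nothing records or deletes the location.
      untrackedPaths.add(location);
      return false;
    }
    localrsrc.put(req, "LOCALIZED");
    return true;
  }
}

class LeakSketchDemo {
  public static void main(String[] args) {
    LeakSketchTracker tracker = new LeakSketchTracker();
    tracker.request("job.jar");
    tracker.removeResource("job.jar"); // kill arrives first
    boolean tracked =
        tracker.handleLocalized("job.jar", "/local/filecache/10/job.jar");
    System.out.println("tracked=" + tracked
        + " untracked=" + tracker.untrackedPaths);
  }
}
```

The ordering is what matters here: the same handleLocalized call is harmless if it arrives before removeResource, since the entry still exists and simply moves to the localized state.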
I haven't had a chance to dig into it to be sure, but I think the code should
schedule a deletion of the localized location when no corresponding
LocalizedResource can be found for a LOCALIZED event. Something like this:
{code}
if (rsrc == null) {
  LOG.warn("Received " + event.getType() + " event for request " + req
      + " but localized resource is missing");
  if (event.getType() == ResourceEventType.LOCALIZED) {
    // Nothing tracks this location any more, so schedule it for deletion.
    ResourceLocalizedEvent localizedEvent = (ResourceLocalizedEvent) event;
    FileDeletionTask deletionTask = new FileDeletionTask(delService,
        getUser(), getPathToDelete(localizedEvent.getLocation()), null);
    delService.delete(deletionTask);
  }
  // rsrc is null, so there is nothing further to process for this event.
  return;
}
{code}
Unfortunately the tracker doesn't know the deletion service, so we would need
to pass it to the constructor or find some other way to access the deletion
service when processing localizer heartbeats. If LocalResourcesTrackerImpl does
start tracking the deletion service as a class field then the DeletionService
parameter of the getPathForLocalization and removeResource methods becomes
redundant.
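
As a rough illustration of the constructor approach, the sketch below uses simplified stand-in types rather than the real NodeManager classes. SketchDeletionService, DeletionFieldSketchTracker, and their method signatures are assumptions made for the example. It only shows the shape of the change: once the deletion service is a field, the event handler can clean up an orphaned location and removeResource no longer needs a deletion service argument.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the deletion service.
interface SketchDeletionService {
  void delete(String user, String path);
}

// Hypothetical stand-in for the tracker with the service held as a field.
class DeletionFieldSketchTracker {
  private final String user;
  private final SketchDeletionService delService;
  // Request key -> localized location ("" while still downloading).
  private final Map<String, String> localrsrc = new ConcurrentHashMap<>();

  // The deletion service is supplied once, at construction time.
  DeletionFieldSketchTracker(String user, SketchDeletionService delService) {
    this.user = user;
    this.delService = delService;
  }

  void request(String req) {
    localrsrc.put(req, "");
  }

  // No deletion service parameter is needed; the field is used instead.
  boolean removeResource(String req) {
    String location = localrsrc.remove(req);
    if (location == null) {
      return false;
    }
    if (!location.isEmpty()) {
      delService.delete(user, location);
    }
    return true;
  }

  // LOCALIZED event from a localizer heartbeat.
  void handleLocalized(String req, String location) {
    if (!localrsrc.containsKey(req)) {
      // Bookkeeping is already gone: delete the orphaned location.
      delService.delete(user, location);
      return;
    }
    localrsrc.put(req, location);
  }
}

class DeletionFieldSketchDemo {
  public static void main(String[] args) {
    List<String> deleted = new ArrayList<>();
    DeletionFieldSketchTracker tracker = new DeletionFieldSketchTracker(
        "alice", (user, path) -> deleted.add(user + ":" + path));
    tracker.request("job.jar");
    tracker.removeResource("job.jar"); // container killed mid-download
    tracker.handleLocalized("job.jar", "/local/filecache/10/job.jar");
    System.out.println("scheduled for deletion: " + deleted);
  }
}
```

Passing the service through the constructor keeps the heartbeat path free of extra lookups, at the cost of touching every place that constructs a tracker.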
> Localized resource may leak on disk if container is killed while localizing
> ---------------------------------------------------------------------------
>
> Key: YARN-8703
> URL: https://issues.apache.org/jira/browse/YARN-8703
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Jason Lowe
> Priority: Major
>
> If a container is killed while localizing then it releases all of its
> resources. If the resource count goes to zero and it is in the DOWNLOADING
> state then the resource bookkeeping is removed in the resource tracker.
> Shortly afterwards the localizer could heartbeat in and report the successful
> localization of the resource that was just removed. When the
> LocalResourcesTrackerImpl receives the LOCALIZED event but does not find the
> corresponding LocalResource for the event then it simply logs a "localized
> without a location" warning. At that point I think the localized resource
> has been leaked on the disk since the NM has removed bookkeeping for the
> resource without removing it on disk.