[
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598299#comment-14598299
]
Varun Saxena commented on YARN-3793:
------------------------------------
[~kasha], I think I know what's happening.
When disks go bad (say, because a disk is full), there is a problem when
uploading container logs.
In {{AppLogAggregatorImpl#doContainerLogAggregation}} only good log directories
are considered for log aggregation. This leads to
{{AggregatedLogFormat#getPendingLogFilesToUploadForThisContainer}} returning no
log files to be uploaded.
The caller of {{doContainerLogAggregation}} is
{{AppLogAggregatorImpl#uploadLogsForContainers}}, which, as can be seen below,
calls {{DeletionService#delete}}. If {{uploadedFilePathsInThisCycle}} is
empty *(which it will be if disks are full)*, both the sub directory and the
base directories end up null. This explains the NPEs being thrown.
When these deletion tasks are stored in the state store, they are stored with
nulls as well, which can explain why it also happens on recovery.
{code}
boolean uploadedLogsInThisCycle = false;
for (ContainerId container : pendingContainerInThisCycle) {
  ContainerLogAggregator aggregator = null;
  if (containerLogAggregators.containsKey(container)) {
    aggregator = containerLogAggregators.get(container);
  } else {
    aggregator = new ContainerLogAggregator(container);
    containerLogAggregators.put(container, aggregator);
  }
  Set<Path> uploadedFilePathsInThisCycle =
      aggregator.doContainerLogAggregation(writer, appFinished);
  if (uploadedFilePathsInThisCycle.size() > 0) {
    uploadedLogsInThisCycle = true;
  }
  this.delService.delete(this.userUgi.getShortUserName(), null,
      uploadedFilePathsInThisCycle
          .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
  ......
}
{code}
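To make the path described above concrete, here is a minimal stand-alone sketch. It does not use the real Hadoop classes: {{DeletionTaskSketch}}, {{taskFor}} and the rule that an empty base-directory array is recorded as null are assumptions made for illustration, chosen to match the behaviour described in this comment. Paths are plain strings rather than {{Path}} objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in only; none of these names are real Hadoop classes.
class EmptyUploadDeletionSketch {

  /** Hypothetical deletion task with an optional sub directory and base directories. */
  static final class DeletionTaskSketch {
    final String subDir;
    final List<String> baseDirs;

    DeletionTaskSketch(String subDir, String... baseDirs) {
      this.subDir = subDir;
      // Assumption: an empty array of base directories is recorded as null.
      this.baseDirs = (baseDirs == null || baseDirs.length == 0)
          ? null : Arrays.asList(baseDirs);
    }

    /** Paths this task would hand to the executor for deletion. */
    List<String> targets() {
      if (baseDirs == null) {
        // Only the sub directory is left, and it may itself be null.
        return Collections.singletonList(subDir);
      }
      List<String> out = new ArrayList<>();
      for (String base : baseDirs) {
        out.add(subDir == null ? base : base + "/" + subDir);
      }
      return out;
    }
  }

  /** Same call shape as the snippet above: null sub directory, uploaded paths as base dirs. */
  static DeletionTaskSketch taskFor(Set<String> uploadedPaths) {
    return new DeletionTaskSketch(null, uploadedPaths.toArray(new String[0]));
  }

  public static void main(String[] args) {
    // Nothing was uploaded in this cycle, so the set of uploaded paths is empty.
    DeletionTaskSketch task = taskFor(new HashSet<String>());
    System.out.println("baseDirs = " + task.baseDirs + ", targets = " + task.targets());
  }
}
```

With an empty set, the only deletion target in this sketch is a null path, which is the situation the "Deleting absolute path : null" log line points to.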
Log aggregation should consider full disks as well; otherwise there will be
nothing to aggregate when disks are full. In any case, log aggregation would
lead to deletion of the local logs.
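As a rough sketch of that suggestion, the candidate directories for aggregation could be the good directories plus the full ones. The class and method names below are hypothetical, and directories are plain strings; how the NodeManager actually tracks good and full disks is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only; not the NodeManager's actual directory handling.
class LogDirSelectionSketch {

  /** Returns good log dirs followed by full log dirs, without duplicates. */
  static List<String> dirsForAggregation(List<String> goodDirs, List<String> fullDirs) {
    List<String> dirs = new ArrayList<>(goodDirs);
    for (String dir : fullDirs) {
      if (!dirs.contains(dir)) {
        dirs.add(dir);
      }
    }
    return dirs;
  }

  public static void main(String[] args) {
    // All disks are full: there are no good dirs, but the logs are still readable.
    List<String> good = Collections.emptyList();
    List<String> full = Arrays.asList("/data1/yarn/logs", "/data2/yarn/logs");
    System.out.println(dirsForAggregation(good, full));
  }
}
```

Reading from full directories is reasonable in this sketch because aggregation only needs to read the existing log files, and the subsequent deletion frees space on those disks.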
I verified the occurrence of this issue via
{{TestLogAggregationService#testLocalFileDeletionAfterUpload}} by making good
log directories return nothing.
> Several NPEs when deleting local files on NM recovery
> -----------------------------------------------------
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery.
> These seem to correspond to sub-directories that need to be deleted. I wonder
> if null pointers here mean incorrect tracking of these resources and a
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during
> execution of task in DeletionService
> java.lang.NullPointerException
> at
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}