Varun Saxena commented on YARN-3793:

[~kasha], I think I know what's happening.
When disks become bad (say, due to disk full), there is a problem when uploading 
container logs.

In {{AppLogAggregatorImpl#doContainerLogAggregation}}, only good log directories 
are considered for log aggregation. This causes 
{{AggregatedLogFormat#getPendingLogFilesToUploadForThisContainer}} to return no 
log files to upload.

The caller of {{doContainerLogAggregation}} is 
{{AppLogAggregatorImpl#uploadLogsForContainers}}, which, as can be seen below, 
calls {{DeletionService#delete}}. If {{uploadedFilePathsInThisCycle}} is 
empty *(which it will be if disks are full)*, both the sub directory and the 
base directories passed to delete are null. This explains the NPEs being thrown.
When these deletion tasks are persisted to the state store, the nulls are stored 
as well, which explains why the NPEs also occur on recovery.
{noformat}
      boolean uploadedLogsInThisCycle = false;
      for (ContainerId container : pendingContainerInThisCycle) {
        ContainerLogAggregator aggregator = null;
        if (containerLogAggregators.containsKey(container)) {
          aggregator = containerLogAggregators.get(container);
        } else {
          aggregator = new ContainerLogAggregator(container);
          containerLogAggregators.put(container, aggregator);
        }
        Set<Path> uploadedFilePathsInThisCycle =
            aggregator.doContainerLogAggregation(writer, appFinished);
        if (uploadedFilePathsInThisCycle.size() > 0) {
          uploadedLogsInThisCycle = true;
        }
        this.delService.delete(this.userUgi.getShortUserName(), null,
            uploadedFilePathsInThisCycle
                .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
      }
{noformat}
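The failure mode can be seen in isolation. Below is a minimal standalone sketch (hypothetical class and method names, not the actual NM classes) of why a null sub directory ends up as an NPE inside the deletion task, analogous to the {{FileContext.fixRelativePart}} frame in the stack trace:

```java
import java.io.File;
import java.util.Collections;
import java.util.Set;

public class NullDeletionDemo {
    // Hypothetical stand-in for DeletionService#delete: when nothing was
    // uploaded in the cycle, baseDirs is empty and subDir is null.
    static void delete(String subDir, Set<String> baseDirs) {
        if (baseDirs.isEmpty()) {
            try {
                // new File(null) throws NullPointerException, much like
                // FileContext.fixRelativePart(null) does in the NM logs.
                new File(subDir).getAbsolutePath();
            } catch (NullPointerException e) {
                System.out.println("NPE: " + e.getClass().getSimpleName());
            }
        }
    }

    public static void main(String[] args) {
        // Empty upload set in the cycle -> null path handed to the task.
        delete(null, Collections.<String>emptySet());
    }
}
```

Running this prints `NPE: NullPointerException`, matching the failure seen in {{DeletionService$FileDeletionTask.run}}.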

Log aggregation should consider full disks as well; otherwise there is 
nothing to aggregate when disks are full. In any case, log aggregation would 
lead to deletion of the local logs.
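The direction of the fix can be sketched as follows. This is a hypothetical illustration (the method and directory names are made up, not the actual {{LocalDirsHandlerService}} API): when selecting directories to scan for container logs, include full-but-readable disks alongside the good ones, so their logs still get aggregated and the local copies deleted:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LogDirSelection {
    // Hypothetical sketch: a full disk is still readable, so its log
    // directories should be scanned for aggregation even though they are
    // excluded from the "good" list used for writing.
    static List<String> dirsForAggregation(List<String> goodDirs,
                                           List<String> fullDirs) {
        List<String> dirs = new ArrayList<>(goodDirs);
        dirs.addAll(fullDirs); // include full disks for read-side aggregation
        return dirs;
    }

    public static void main(String[] args) {
        List<String> dirs = dirsForAggregation(
            Arrays.asList("/grid/0/nm-log"),
            Arrays.asList("/grid/1/nm-log")); // /grid/1 is full but readable
        System.out.println(dirs);
    }
}
```

With this selection, {{doContainerLogAggregation}} would find files to upload even when every disk is full, so {{uploadedFilePathsInThisCycle}} is non-empty and no null paths reach the deletion service.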

I verified the occurrence of this issue via 
{{TestLogAggregationService#testLocalFileDeletionAfterUpload}} by making the 
good log directories return no files.

> Several NPEs when deleting local files on NM recovery
> -----------------------------------------------------
>                 Key: YARN-3793
>                 URL: https://issues.apache.org/jira/browse/YARN-3793
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
>         at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}
