Varun Saxena commented on YARN-3793:

While the NPEs are a problem, a closer look at the code shows that there is a 
bigger problem here: *container logs can be lost* if a disk has gone bad 
(i.e. become 90% full).

When an application finishes, we upload its logs after aggregation by calling 
{{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn gets 
the eligible directories from {{LocalDirsHandlerService#getLogDirs}}, 
which returns nothing when the disk is full. So none of the container logs 
are aggregated and uploaded.
But on application finish, we also call 
{{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
application directory, which contains the container logs, because it calls 
{{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks 
as well.

So we are left with neither aggregated logs for the app nor the individual 
container logs for the app.
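
To make the sequence concrete, here is a minimal standalone sketch of the interaction. It is not the NodeManager code: {{LogLossSketch}}, {{LogDir}}, {{uploadLogs}}, {{cleanUp}} and the directory path are hypothetical names invented for illustration, and the two lookup methods only mimic the behaviour described above (one skips full disks, the other includes them).

```java
import java.util.ArrayList;
import java.util.List;

public class LogLossSketch {

    /** Hypothetical stand-in for one NM log directory on a local disk. */
    static final class LogDir {
        final String path;
        final boolean full;              // true once the disk is considered full
        boolean hasContainerLogs = true; // container logs are still on disk

        LogDir(String path, boolean full) {
            this.path = path;
            this.full = full;
        }
    }

    /** Mimics getLogDirs as described: only non-full directories are returned. */
    static List<LogDir> getLogDirs(List<LogDir> all) {
        List<LogDir> good = new ArrayList<>();
        for (LogDir d : all) {
            if (!d.full) {
                good.add(d);
            }
        }
        return good;
    }

    /** Mimics getLogDirsForCleanup as described: full directories are included. */
    static List<LogDir> getLogDirsForCleanup(List<LogDir> all) {
        return new ArrayList<>(all);
    }

    /** Models the upload step; returns how many directories had logs uploaded. */
    static int uploadLogs(List<LogDir> all) {
        int uploaded = 0;
        for (LogDir d : getLogDirs(all)) {
            if (d.hasContainerLogs) {
                uploaded++;
            }
        }
        return uploaded;
    }

    /** Models the post-aggregation cleanup; returns how many directories were wiped. */
    static int cleanUp(List<LogDir> all) {
        int deleted = 0;
        for (LogDir d : getLogDirsForCleanup(all)) {
            if (d.hasContainerLogs) {
                d.hasContainerLogs = false;
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<LogDir> dirs = new ArrayList<>();
        dirs.add(new LogDir("/hypothetical/disk1/logs", true)); // single, full disk

        int uploaded = uploadLogs(dirs); // full disk is skipped, nothing uploaded
        int deleted = cleanUp(dirs);     // full disk is included, logs deleted

        System.out.println("uploaded=" + uploaded + ", deleted=" + deleted);
    }
}
```

With a single full disk, the sketch uploads nothing and then deletes the only copy of the logs, which is the loss scenario described above.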

This sounds like a critical issue, if not a blocker. [~kasha], [~jlowe], can you have 
a look? I will upload a patch shortly.

> Several NPEs when deleting local files on NM recovery
> -----------------------------------------------------
>                 Key: YARN-3793
>                 URL: https://issues.apache.org/jira/browse/YARN-3793
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Varun Saxena
> When NM work-preserving restart is enabled, we see several NPEs on recovery. 
> These seem to correspond to sub-directories that need to be deleted. I wonder 
> if null pointers here mean incorrect tracking of these resources and a 
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
> execution of task in DeletionService
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
>         at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}

