[
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601865#comment-14601865
]
Varun Saxena commented on YARN-3793:
------------------------------------
While the NPEs are a problem, a closer look at the code shows that there is a
bigger problem here: *container logs can be lost* if a disk has gone
bad (become 90% full).
When an application finishes, we upload logs after aggregation by calling
{{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn looks up
the eligible directories by calling {{LocalDirsHandlerService#getLogDirs}},
which would return nothing if the disk is full. So none of the container logs
are aggregated and uploaded.
But on application finish, we also call
{{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the
application directory, which contains the container logs, because it calls
{{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks
as well.
So we are left with neither aggregated logs for the app nor the individual
container logs for the app.
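
To make the sequence above concrete, here is a minimal, self-contained sketch
of the mismatch. It is not the NodeManager code: {{LogLossSketch}},
{{DirsModel}} and {{finishApplication}} are hypothetical names, and the two
directory lookups are only modelled on the behaviour described above (upload
sees good dirs only, cleanup sees good and full dirs).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogLossSketch {

    /** Hypothetical stand-in for LocalDirsHandlerService; not the real class. */
    static class DirsModel {
        final List<String> goodDirs = new ArrayList<>();
        final List<String> fullDirs = new ArrayList<>();

        // Models getLogDirs as described above: full disks are filtered out.
        List<String> getLogDirs() {
            return new ArrayList<>(goodDirs);
        }

        // Models getLogDirsForCleanup as described above: full disks are included.
        List<String> getLogDirsForCleanup() {
            List<String> all = new ArrayList<>(goodDirs);
            all.addAll(fullDirs);
            return all;
        }
    }

    /** Returns {number of logs uploaded, number of logs left on disk}. */
    static int[] finishApplication(DirsModel dirs, Map<String, List<String>> logsByDir) {
        // Upload step: only looks in the dirs returned by getLogDirs().
        int uploaded = 0;
        for (String dir : dirs.getLogDirs()) {
            uploaded += logsByDir.getOrDefault(dir, new ArrayList<>()).size();
        }
        // Cleanup step: deletes from every dir returned by getLogDirsForCleanup().
        for (String dir : dirs.getLogDirsForCleanup()) {
            logsByDir.remove(dir);
        }
        int remaining = 0;
        for (List<String> logs : logsByDir.values()) {
            remaining += logs.size();
        }
        return new int[] {uploaded, remaining};
    }

    public static void main(String[] args) {
        DirsModel dirs = new DirsModel();
        dirs.fullDirs.add("/disk1/logs"); // the only log dir has become full

        Map<String, List<String>> logsByDir = new HashMap<>();
        logsByDir.put("/disk1/logs",
            new ArrayList<>(Arrays.asList("container_1/stdout", "container_1/stderr")));

        int[] result = finishApplication(dirs, logsByDir);
        // Prints "uploaded=0 remaining=0": nothing aggregated, nothing left locally.
        System.out.println("uploaded=" + result[0] + " remaining=" + result[1]);
    }
}
```

In this model the loss goes away if the upload step reads from the same set of
directories as the cleanup step, or if cleanup skips directories that were
never uploaded. Which of these fits the real code is a separate question.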
This sounds like a critical issue, if not a blocker. [~kasha], [~jlowe], can you
have a look? I will upload a patch shortly.
> Several NPEs when deleting local files on NM recovery
> -----------------------------------------------------
>
> Key: YARN-3793
> URL: https://issues.apache.org/jira/browse/YARN-3793
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: Karthik Kambatla
> Assignee: Varun Saxena
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery.
> These seem to correspond to sub-directories that need to be deleted. I wonder
> if null pointers here mean incorrect tracking of these resources and a
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : null
> 2015-05-18 07:06:10,224 ERROR org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during execution of task in DeletionService
> java.lang.NullPointerException
> at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
> at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
> at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)