[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602004#comment-14602004 ]
Jason Lowe commented on YARN-3850:
----------------------------------
After thinking about this, I wondered whether ShuffleHandler has a similar
issue, since it too looks for places to read files. It looks like it might not
be affected in the same way, since it doesn't use LocalDirsHandlerService and
just uses the underlying LocalDirAllocator. I don't think the latter will
auto-update its list of bad/good directories, since it doesn't appear to
refresh unless something tries to write through it or the conf is updated.

I think it could be problematic in that the ShuffleHandler will likely continue
to search disks that later go bad, or fail to search disks that were bad/full
on startup and later became good. If we start persisting bad/full disks across
NM restart, then it seems likely that a map task could deposit shuffle data on
a disk the ShuffleHandler will fail to search, because its view of the disks is
stale from startup. What do you think? If this is a problem it should be
addressed in a separate JIRA, but I'm trying to think of other places in the NM
where we would have a similar bug: searching only the good dirs for reading
rather than also checking the full disks.
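
To make the concern concrete, here is a toy model of the two failure modes
described above. It is a sketch only: it uses plain Java collections, no Hadoop
classes, and every name in it (DiskState, goodDirs, dirsForRead, canFind) is
made up for illustration rather than taken from the NM code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; none of these names are Hadoop APIs. */
final class StaleDirViewSketch {

  /** Hypothetical stand-in for a disk checker that tracks good and full dirs. */
  static final class DiskState {
    final Set<String> good = new LinkedHashSet<>();
    final Set<String> full = new LinkedHashSet<>();

    /** Dirs that are currently safe to write to. */
    List<String> goodDirs() {
      return new ArrayList<>(good);
    }

    /** Dirs worth searching for existing files: good dirs plus full dirs. */
    List<String> dirsForRead() {
      List<String> dirs = new ArrayList<>(good);
      dirs.addAll(full);
      return dirs;
    }

    void markGood(String dir) {
      if (full.remove(dir)) {
        good.add(dir);
      }
    }

    void markFull(String dir) {
      if (good.remove(dir)) {
        full.add(dir);
      }
    }
  }

  /** True if the file's directory is among the directories being searched. */
  static boolean canFind(List<String> searchDirs, Map<String, String> fileToDir,
      String file) {
    String dir = fileToDir.get(file);
    return dir != null && searchDirs.contains(dir);
  }

  /** Returns {stale view finds new file, good-only view finds old file, read view finds both}. */
  static boolean[] demo() {
    DiskState disks = new DiskState();
    disks.good.add("/d1");
    disks.full.add("/d2");

    // A reader that snapshots the good dirs once at startup and never refreshes.
    List<String> staleView = disks.goodDirs();

    Map<String, String> fileToDir = new HashMap<>();
    fileToDir.put("map_1.out", "/d1");   // written while /d1 was good

    disks.markGood("/d2");               // /d2 recovers after startup
    fileToDir.put("map_0.out", "/d2");   // a map task writes to the recovered disk
    disks.markFull("/d1");               // /d1 fills up later

    boolean staleFindsNew = canFind(staleView, fileToDir, "map_0.out");
    boolean goodOnlyFindsOld = canFind(disks.goodDirs(), fileToDir, "map_1.out");
    boolean readViewFindsBoth = canFind(disks.dirsForRead(), fileToDir, "map_0.out")
        && canFind(disks.dirsForRead(), fileToDir, "map_1.out");
    return new boolean[] {staleFindsNew, goodOnlyFindsOld, readViewFindsBoth};
  }

  public static void main(String[] args) {
    boolean[] r = demo();
    System.out.println("stale startup view finds new file: " + r[0]);
    System.out.println("good-only view finds file on full disk: " + r[1]);
    System.out.println("good+full read view finds both: " + r[2]);
  }
}
```

In this model the startup snapshot misses data on the disk that recovered, and
even a refreshed good-only list misses data on the disk that filled up; only a
current list of good plus full dirs finds both files.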
> Container logs can be lost if disk is full
> ------------------------------------------
>
> Key: YARN-3850
> URL: https://issues.apache.org/jira/browse/YARN-3850
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation
> Affects Versions: 2.7.0
> Reporter: Varun Saxena
> Assignee: Varun Saxena
> Priority: Blocker
>
> *Container logs* can be lost if a disk has become bad (become 90% full).
> When an application finishes, we upload logs after aggregation by calling
> {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn
> checks the eligible directories by calling
> {{LocalDirsHandlerService#getLogDirs}}, which in the case of a full disk would
> return nothing. So none of the container logs are aggregated and uploaded.
> But on application finish, we also call
> {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the
> application directory which contains the container logs, because it calls
> {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks
> as well.
> So we are left with neither the aggregated logs for the app nor the individual
> container logs for the app.
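
The sequence in the description above can be modelled with a small
self-contained sketch. This is not the NM code: the class and method names
(LogLossSketch, upload, cleanup, run) and the paths are hypothetical, and the
sketch only assumes what the description states, namely that upload looks at
good dirs while cleanup looks at good and full dirs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the upload/cleanup mismatch; no Hadoop classes are used. */
final class LogLossSketch {

  /** Collects the logs found under the given dirs, standing in for aggregation. */
  static List<String> upload(Map<String, List<String>> logsByDir, List<String> dirs) {
    List<String> uploaded = new ArrayList<>();
    for (String dir : dirs) {
      uploaded.addAll(logsByDir.getOrDefault(dir, Collections.<String>emptyList()));
    }
    return uploaded;
  }

  /** Deletes everything under the given dirs, standing in for post-aggregation cleanup. */
  static void cleanup(Map<String, List<String>> logsByDir, List<String> dirs) {
    for (String dir : dirs) {
      logsByDir.remove(dir);
    }
  }

  /**
   * Runs upload then cleanup for an app whose only log dir is full.
   * If readFullDirs is false, upload sees good dirs only (the reported behaviour);
   * if true, upload also searches full dirs. Returns the uploaded log names.
   */
  static List<String> run(boolean readFullDirs) {
    List<String> good = new ArrayList<>();          // no good dirs left
    List<String> full = Arrays.asList("/log1");     // the only log dir is full

    Map<String, List<String>> logsByDir = new HashMap<>();
    logsByDir.put("/log1", Arrays.asList("container_1/stdout", "container_1/stderr"));

    List<String> uploadDirs = new ArrayList<>(good);
    if (readFullDirs) {
      uploadDirs.addAll(full);
    }
    List<String> cleanupDirs = new ArrayList<>(good);
    cleanupDirs.addAll(full);                       // cleanup always covers full dirs

    List<String> uploaded = upload(logsByDir, uploadDirs);
    cleanup(logsByDir, cleanupDirs);                // local logs are gone either way
    return uploaded;
  }

  public static void main(String[] args) {
    System.out.println("good dirs only: uploaded " + run(false).size() + " logs");
    System.out.println("good + full dirs: uploaded " + run(true).size() + " logs");
  }
}
```

With upload restricted to good dirs, nothing is uploaded and cleanup still
removes the local copies, which is the loss described above; letting upload
also read the full dirs preserves both log files in the aggregated output.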