[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602004#comment-14602004
 ] 

Jason Lowe commented on YARN-3850:
----------------------------------

After thinking about this, I wondered whether ShuffleHandler has a similar 
issue, since it too is looking for places to read files.  It looks like it 
might not be affected in the same way, since it doesn't use 
LocalDirsHandlerService and just uses the underlying LocalDirAllocator.  I 
don't think the latter will auto-update its list of bad/good directories, since 
it doesn't appear to update unless something tries to write through it or the 
conf is updated.

I think that could be problematic: the ShuffleHandler will likely continue 
to search disks that later go bad, and fail to search disks that were bad/full 
on startup and later became good.  If we start persisting bad/full disks across 
NM restart, then it seems likely a map task could deposit shuffle data on a 
disk that the ShuffleHandler fails to search because of its stale view of the 
disks from startup.  What do you think?  This should be addressed in a separate 
JIRA if it is a problem, but I'm trying to think of other places in the NM where 
we could have a similar bug: searching only the good dirs for reading rather 
than also checking the full disks.
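
To make the read-path concern concrete, here is a minimal, self-contained 
sketch of searching only good dirs versus also checking full dirs.  The names 
(ReadDirSearch, findForRead, dirsForRead) are hypothetical and are not Hadoop 
APIs; the existence check is passed in as a predicate so the example does not 
need real disks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical illustration only; not the NM or ShuffleHandler code. */
final class ReadDirSearch {

    /** Returns the first candidate path that the predicate reports as existing. */
    static Optional<String> findForRead(List<String> dirs, String relPath,
                                        Predicate<String> exists) {
        for (String dir : dirs) {
            String candidate = dir + "/" + relPath;
            if (exists.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /** Dirs worth searching for reads: good dirs first, then full ones. */
    static List<String> dirsForRead(List<String> goodDirs, List<String> fullDirs) {
        List<String> all = new ArrayList<>(goodDirs);
        all.addAll(fullDirs);
        return all;
    }
}
```

With a file that lives only on a full disk, findForRead over the good dirs 
alone comes back empty, while the same search over dirsForRead(good, full) 
finds it.  A full disk is still readable, so excluding it only makes sense on 
the write path.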

> Container logs can be lost if disk is full
> ------------------------------------------
>
>                 Key: YARN-3850
>                 URL: https://issues.apache.org/jira/browse/YARN-3850
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation
>    Affects Versions: 2.7.0
>            Reporter: Varun Saxena
>            Assignee: Varun Saxena
>            Priority: Blocker
>
> *Container logs* can be lost if a disk has gone bad (become 90% full).
> When an application finishes, we upload logs after aggregation by calling 
> {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn 
> checks the eligible directories through a call to 
> {{LocalDirsHandlerService#getLogDirs}}, which in the case of a full disk would 
> return nothing. So none of the container logs are aggregated and uploaded.
> But on application finish, we also call 
> {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
> application directory which contains container logs. This is because it calls 
> {{LocalDirsHandlerService#getLogDirsForCleanup}} which returns the full disks 
> as well.
> So we are left with neither aggregated logs for the app nor the individual 
> container logs for the app.
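
The mismatch described in the quoted issue can be modeled with a small, 
self-contained sketch.  LogLossSketch and its methods are hypothetical 
stand-ins for the two directory views (one like getLogDirs, one like 
getLogDirsForCleanup); this is not the actual LocalDirsHandlerService code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the two directory views; not the real NM classes. */
final class LogLossSketch {
    final List<String> goodLogDirs = new ArrayList<>();
    final List<String> fullLogDirs = new ArrayList<>();

    /** Models a getLogDirs-style view: only dirs still considered good. */
    List<String> logDirs() {
        return new ArrayList<>(goodLogDirs);
    }

    /** Models a getLogDirsForCleanup-style view: good dirs plus full ones. */
    List<String> logDirsForCleanup() {
        List<String> all = new ArrayList<>(goodLogDirs);
        all.addAll(fullLogDirs);
        return all;
    }
}
```

If the only log dir is full, logDirs() is empty while logDirsForCleanup() 
still lists it, so an upload driven by the first view has nothing to 
aggregate and a cleanup driven by the second view deletes the logs anyway.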



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
