[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184991#comment-15184991 ]
Jason Lowe commented on YARN-4771:
----------------------------------
The problem occurs because removeVeryOldStoppedContainersFromCache removes
containers from the state store once they have been completed for at least
yarn.nodemanager.duration-to-track-stopped-containers milliseconds. Once
the container state is removed from the state store, there's nothing to recover
for that container when the NM restarts. With no information about that
container to recover, the log aggregation service doesn't know it needs to
aggregate the logs for that container, so the container is skipped during log
aggregation.
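
To make the sequence above concrete, here is a minimal Java sketch of the
failure mode. It is a toy model under stated assumptions, not the actual
NodeManager code: the class SkippedLogAggregationSketch, its stateStore map,
the 600000 ms value, and the choice to run the purge whenever a container
completes are all hypothetical and chosen only for illustration. Only the
name removeVeryOldStoppedContainersFromCache and the property name come from
the comment above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the failure mode; hypothetical, not the NodeManager's code. */
class SkippedLogAggregationSketch {
    // Stand-in for yarn.nodemanager.duration-to-track-stopped-containers.
    // The value is an arbitrary choice for this sketch.
    static final long TRACK_DURATION_MS = 600_000L;

    // Hypothetical state store: container id -> completion time in ms.
    final Map<String, Long> stateStore = new LinkedHashMap<>();

    /**
     * Records a completed container. Assumption: the purge of old entries
     * runs each time another container completes.
     */
    void containerCompleted(String id, long nowMs) {
        removeVeryOldStoppedContainersFromCache(nowMs);
        stateStore.put(id, nowMs);
    }

    /** Drops every container that completed at least TRACK_DURATION_MS ago. */
    void removeVeryOldStoppedContainersFromCache(long nowMs) {
        Iterator<Map.Entry<String, Long>> it = stateStore.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue() >= TRACK_DURATION_MS) {
                it.remove();
            }
        }
    }

    /**
     * Models recovery after a restart: only containers still present in the
     * state store are known, so only their logs can be aggregated.
     */
    List<String> recoverForLogAggregation() {
        return new ArrayList<>(stateStore.keySet());
    }

    public static void main(String[] args) {
        SkippedLogAggregationSketch nm = new SkippedLogAggregationSketch();
        nm.containerCompleted("container_A", 0L);
        // A later completion triggers the purge, which drops container_A.
        nm.containerCompleted("container_B", TRACK_DURATION_MS + 1);
        // Simulated restart: container_A is no longer recoverable.
        System.out.println(nm.recoverForLogAggregation()); // prints [container_B]
    }
}
```

In this model container_A is lost only because container_B completes after
the tracking window has elapsed. If nothing else completes before the
restart, the purge never runs and container_A is still recovered.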
> Some containers can be skipped during log aggregation after NM restart
> ----------------------------------------------------------------------
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.2
> Reporter: Jason Lowe
>
> A container can be skipped during log aggregation after a work-preserving
> nodemanager restart if the following events occur:
> # Container completes more than
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the
> restart
> # At least one other container completes after the above container and before
> the restart