[
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254095#comment-15254095
]
Junping Du commented on YARN-4325:
----------------------------------
We recently hit the same issue again in a cluster. After checking the logs, the
related code, and the state machine graph for ApplicationImpl (attached), we
found three issues that cause app state to leak in the NM state store:
1. APPLICATION_LOG_HANDLING_FAILED is not handled by removing the app from the
NM state store.
2. The APPLICATION_LOG_HANDLING_FAILED event is not sent when the aggregator's
doAppLogAggregation() hits an exception.
3. Only an application in *FINISHED* state that receives
APPLICATION_LOG_FINISHED has a transition that removes the app from the NM
state store. An application in any other state, such as
APPLICATION_RESOURCES_CLEANUP, ignores the event, so the app is never removed
from the NM state store, even after it finishes.
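The three points above can be illustrated with a minimal, self-contained sketch.
It is not the actual ApplicationImpl or NMStateStore code and not the patch
itself: the class, enum, and method names below are hypothetical stand-ins, and
the state machine is reduced to two states. The idea shown is only that the log
handling outcome (finished or failed) is recorded in any state, that a failing
aggregation always reports a failure event, and that the purge happens once
both conditions are met.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model; none of these types exist in Hadoop. */
final class AppStateSketch {
    enum State { RESOURCES_CLEANUP, FINISHED }
    enum Event { RESOURCES_CLEANED, LOG_HANDLING_FINISHED, LOG_HANDLING_FAILED }

    /** Stand-in for the NM state store: just a set of app IDs. */
    static final class StateStore {
        private final Set<String> apps = new HashSet<>();
        void storeApplication(String appId) { apps.add(appId); }
        void removeApplication(String appId) { apps.remove(appId); }
        boolean contains(String appId) { return apps.contains(appId); }
    }

    static final class App {
        private final String id;
        private final StateStore store;
        private State state = State.RESOURCES_CLEANUP;
        private boolean logHandlingDone = false;

        App(String id, StateStore store) {
            this.id = id;
            this.store = store;
            store.storeApplication(id);
        }

        void handle(Event event) {
            switch (event) {
                case RESOURCES_CLEANED:
                    state = State.FINISHED;
                    break;
                case LOG_HANDLING_FINISHED:
                case LOG_HANDLING_FAILED:
                    // Points 1 and 3: record the outcome in any state,
                    // whether log handling succeeded or failed.
                    logHandlingDone = true;
                    break;
            }
            // Purge as soon as both conditions hold, in whatever order
            // the events arrived.
            if (state == State.FINISHED && logHandlingDone) {
                store.removeApplication(id);
            }
        }
    }

    /** Point 2: an exception during aggregation still produces an event. */
    static void runLogAggregation(App app, Runnable aggregation) {
        try {
            aggregation.run();
            app.handle(Event.LOG_HANDLING_FINISHED);
        } catch (RuntimeException e) {
            app.handle(Event.LOG_HANDLING_FAILED);
        }
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();
        App app = new App("app_1", store);
        // Aggregation fails while the app is still cleaning up resources.
        runLogAggregation(app, () -> { throw new RuntimeException("boom"); });
        System.out.println(store.contains("app_1")); // true: not finished yet
        app.handle(Event.RESOURCES_CLEANED);
        System.out.println(store.contains("app_1")); // false: purged
    }
}
```

In this sketch the failure event arrives before the app reaches the finished
state, which is the ordering described in point 3, and the app is still purged.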
Will put up a patch soon to fix this issue.
> purge app state from NM state-store should be independent of log aggregation
> ----------------------------------------------------------------------------
>
> Key: YARN-4325
> URL: https://issues.apache.org/jira/browse/YARN-4325
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
>
> In a long-running cluster, we found tens of thousands of stale apps still
> being recovered during NM restart recovery. The cause is a misconfiguration
> of log aggregation: the end-of-log-aggregation events are never received,
> so stale apps are not purged properly. We should make sure the removal of
> app state is independent of the log aggregation life cycle.