[ 
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987722#comment-14987722
 ] 

Junping Du commented on YARN-4325:
----------------------------------

Hi [~vinodkv], we found that in a long-running cluster, NM recovery tries to 
recover tens of thousands of apps, most of which are old and stale. For now, 
the removal of app state from the NM state store is triggered only by 
ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED, which is created only 
by aggregation or non-aggregation log handling. 
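
To make the coupling concrete, here is a toy Java model of it. This is not 
NodeManager code: the class, its methods and the purgeOnLogHandlingOnly flag 
are all hypothetical, and the "state store" is just an in-memory set. It only 
sketches why an app whose log-handling-finished event never arrives stays in 
the store and shows up again at recovery.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model only; every name here is hypothetical, not a YARN API. */
public class AppStatePurgeSketch {
    /** Stand-in for the NM state store: app IDs that recovery would reload. */
    private final Set<String> storedApps = new LinkedHashSet<>();

    /** true models the coupling described above; false models a decoupled purge. */
    private final boolean purgeOnLogHandlingOnly;

    public AppStatePurgeSketch(boolean purgeOnLogHandlingOnly) {
        this.purgeOnLogHandlingOnly = purgeOnLogHandlingOnly;
    }

    /** Records the app in the store when it starts on this node. */
    public void appStarted(String appId) {
        storedApps.add(appId);
    }

    /**
     * Called when the app finishes on this node.
     * logHandlingFinishedArrives == false models a log handling failure
     * after which the "log handling finished" event is never delivered.
     */
    public void appFinished(String appId, boolean logHandlingFinishedArrives) {
        if (!purgeOnLogHandlingOnly || logHandlingFinishedArrives) {
            storedApps.remove(appId);
        }
        // Otherwise the entry stays behind and is reloaded on every restart.
    }

    /** Apps that a restart would try to recover. */
    public Set<String> appsToRecover() {
        return new LinkedHashSet<>(storedApps);
    }

    public static void main(String[] args) {
        AppStatePurgeSketch coupled = new AppStatePurgeSketch(true);
        coupled.appStarted("app_1");
        coupled.appStarted("app_2");
        coupled.appFinished("app_1", true);   // event delivered, state purged
        coupled.appFinished("app_2", false);  // event lost, state left behind
        System.out.println("coupled:   " + coupled.appsToRecover());

        AppStatePurgeSketch decoupled = new AppStatePurgeSketch(false);
        decoupled.appStarted("app_2");
        decoupled.appFinished("app_2", false); // purged regardless of log handling
        System.out.println("decoupled: " + decoupled.appsToRecover());
    }
}
```

In the coupled variant the app whose event was lost remains in the set 
returned by appsToRecover(); in the decoupled variant it does not.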

So I suspect the purge of app state could be affected by log 
aggregation exceptions, such as the permission issue below:
{noformat}
2015-10-13 01:58:40,277 WARN  logaggregation.LogAggregationService 
(LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log 
Dir [/app-logs] already exist, but with incorrect permissions. Expected: 
[rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple 
users.
2015-10-13 01:58:40,277 WARN  logaggregation.AppLogAggregatorImpl 
(AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. 
The log rolling mornitoring interval is disabled. The logs will be aggregated 
after this application is finished.
{noformat}

I am still debugging it, so please feel free to move it to a release after 2.7.2.

> purge app state from NM state-store should be independent of log aggregation
> ----------------------------------------------------------------------------
>
>                 Key: YARN-4325
>                 URL: https://issues.apache.org/jira/browse/YARN-4325
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> In a long-running cluster, we found tens of thousands of stale apps still 
> being recovered during NM restart recovery. The cause is a misconfiguration 
> of log aggregation: the end-of-log-aggregation events are never 
> received, so stale apps are not purged properly. We should make the 
> removal of app state independent of the log aggregation life cycle. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
