[ 
https://issues.apache.org/jira/browse/YARN-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438840#comment-13438840
 ] 

Jason Lowe commented on YARN-24:
--------------------------------

One thing we could consider is marking the node as UNHEALTHY if it encounters 
issues trying to create the initial app log directory or when it encounters 
issues trying to aggregate for a particular app.  That way we won't pile up 
more apps on a node that's already having issues trying to aggregate, and we're 
at least reporting on the cluster status page that the node needs someone to 
take a look at what's going on.

As for notifying an app that log aggregation isn't complete, I'm not sure how 
best to handle that.  Because log aggregation currently runs asynchronously 
from app execution, the app will often have exited before the aggregation 
completes, even when there isn't an issue accessing the aggregation filesystem.  
If we need to provide a way for apps to know for certain that all of their 
container logs have been aggregated, then we'd need log aggregation to support 
a notification service or, minimally, a way for AMs to query nodes to see 
whether aggregation for a container has completed.
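
To make the "query" option a bit more concrete, here is a minimal sketch of a 
per-node status registry.  Every name in it (AggregationStatusRegistry, Status, 
update, query, allCompleted) is hypothetical and exists only for illustration; 
none of it is an existing YARN API, and container IDs are plain strings here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-node registry of log aggregation status; not a YARN API. */
class AggregationStatusRegistry {
    enum Status { UNKNOWN, IN_PROGRESS, COMPLETED, FAILED }

    // Keyed by container ID (a plain string in this sketch).
    private final Map<String, Status> byContainer = new ConcurrentHashMap<>();

    /** Called by the aggregation side whenever a container's state changes. */
    void update(String containerId, Status status) {
        byContainer.put(containerId, status);
    }

    /** What an AM would ask: has this container's aggregation finished? */
    Status query(String containerId) {
        return byContainer.getOrDefault(containerId, Status.UNKNOWN);
    }

    /** True only if every listed container has reached COMPLETED. */
    boolean allCompleted(Iterable<String> containerIds) {
        for (String id : containerIds) {
            if (query(id) != Status.COMPLETED) {
                return false;
            }
        }
        return true;
    }
}
```

A notification service could be layered on top of the same state by invoking 
listeners from update(), but polling is the smaller change.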

Does it make sense to split this into two parts?  We can use this JIRA to have 
NMs become UNHEALTHY while they are having issues accessing the aggregation 
filesystem (and add retries in such cases), and file a separate JIRA to add the 
log aggregation status/notification feature.  The former would still be useful 
to have without the latter.
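
A rough sketch of the first part (retry, then report UNHEALTHY) might look like 
the following.  The class and method names (AggregationHealthTracker, 
FilesystemOp, attempt) and the fixed attempt count are assumptions made for 
illustration and do not reflect the actual NodeManager code; a real version 
would also back off between attempts and feed the state into the node's health 
report.

```java
import java.io.IOException;

/** Hypothetical stand-in for an operation against the aggregation filesystem. */
@FunctionalInterface
interface FilesystemOp {
    void run() throws IOException;
}

/** Hypothetical tracker; illustrates the proposal, not the NodeManager API. */
class AggregationHealthTracker {
    enum Health { HEALTHY, UNHEALTHY }

    private final int maxAttempts;
    private volatile Health health = Health.HEALTHY;

    AggregationHealthTracker(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /**
     * Runs op up to maxAttempts times.  Success marks the node HEALTHY again;
     * exhausting all attempts marks it UNHEALTHY instead of failing startup.
     */
    boolean attempt(FilesystemOp op) {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                op.run();
                health = Health.HEALTHY;
                return true;
            } catch (IOException e) {
                // Retry; a real implementation would log and sleep here.
            }
        }
        health = Health.UNHEALTHY;
        return false;
    }

    Health health() {
        return health;
    }
}
```

Because a later successful attempt flips the state back to HEALTHY, the node 
only stays out of rotation while the filesystem is actually unreachable.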
                
> Nodemanager fails to start if log aggregation enabled and namenode unavailable
> ------------------------------------------------------------------------------
>
>                 Key: YARN-24
>                 URL: https://issues.apache.org/jira/browse/YARN-24
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.1.0-alpha, 0.23.3
>            Reporter: Jason Lowe
>
> If log aggregation is enabled and the namenode is currently unavailable, the 
> nodemanager fails to start up.


        
