[ 
https://issues.apache.org/jira/browse/YARN-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584711#comment-13584711
 ] 

Sandy Ryza commented on YARN-24:
--------------------------------

I encountered this when trying to start an NM and a namenode at the same time.
The NM shut down because the namenode was in safe mode.  Having the NM die in
this way introduces a dependency on the order in which services are started.

Log aggregation is checked each time an app is run on a node, and the app is 
immediately killed if a log folder cannot be used for it.  Thus, simply
removing the NM's self-shutdown on startup doesn't introduce any correctness
issues.  The worst that could happen is that time is wasted scheduling more
containers on a node that is already known to have trouble reaching the namenode.

Attached a patch that removes the NM killing itself on startup.  At initApp 
time, if verifyAndCreateRemoteLogDir has not been successfully completed, it is 
called again, and the app is failed if it fails.  If initApp fails five 
consecutive times, the NM sets its status to unhealthy.
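
To make the retry behavior described above concrete, here is a minimal,
self-contained sketch.  It is not the code in YARN-24.patch: the class
RemoteLogDirGuard, the BooleanSupplier standing in for
verifyAndCreateRemoteLogDir, and the isHealthy accessor are all hypothetical
names chosen for illustration.  Clearing the unhealthy flag after a later
success is also an assumption, since the comment above does not say how the NM
recovers.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the initApp-time check; not the actual patch. */
class RemoteLogDirGuard {
    // Threshold taken from the comment: five consecutive initApp failures.
    static final int MAX_CONSECUTIVE_FAILURES = 5;

    // Stand-in for verifyAndCreateRemoteLogDir; returns true on success.
    private final BooleanSupplier verifyAndCreateRemoteLogDir;
    private boolean remoteLogDirReady = false;
    private int consecutiveFailures = 0;
    private boolean healthy = true;

    RemoteLogDirGuard(BooleanSupplier verifyAndCreateRemoteLogDir) {
        this.verifyAndCreateRemoteLogDir = verifyAndCreateRemoteLogDir;
    }

    /**
     * Called once per app.  Returns true if the app may proceed and
     * false if the app should be failed.
     */
    synchronized boolean initApp() {
        // Retry the check only if it has never completed successfully.
        if (!remoteLogDirReady) {
            remoteLogDirReady = verifyAndCreateRemoteLogDir.getAsBoolean();
        }
        if (remoteLogDirReady) {
            consecutiveFailures = 0;
            healthy = true; // assumption: a success clears the unhealthy state
            return true;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
            healthy = false; // the NM would report itself as unhealthy
        }
        return false;
    }

    synchronized boolean isHealthy() {
        return healthy;
    }
}
```

In this sketch the NM never exits on its own.  A namenode that is still in
safe mode only causes individual apps to fail, and the node is reported as
unhealthy once the failure count reaches the threshold.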

I agree that if an NM loses its ability to connect to the namenode after an app
has started, it would be good for the NM to report that it wasn't able to write
its logs, but I think that is a more difficult issue and does not need to be
tied to this change.
                
> Nodemanager fails to start if log aggregation enabled and namenode unavailable
> ------------------------------------------------------------------------------
>
>                 Key: YARN-24
>                 URL: https://issues.apache.org/jira/browse/YARN-24
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3, 2.0.0-alpha
>            Reporter: Jason Lowe
>         Attachments: YARN-24.patch
>
>
> If log aggregation is enabled and the namenode is currently unavailable, the 
> nodemanager fails to start up.

