[ 
https://issues.apache.org/jira/browse/YARN-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611077#comment-13611077
 ] 

Daryn Sharp commented on YARN-494:
----------------------------------

Yes, log aggregation is a YARN service provided by the NM.  The AM is not 
involved in the process; its logs get aggregated like any other container's.

When an app finishes, the RM informs all the NMs that participated in running 
containers for the app to begin log aggregation.  The RM no longer tracks the 
app state after that; it just assumes that the NM got the event and that 
aggregation will eventually finish.  If the event is lost or aggregation jams, 
the NMs leak objects.  We see OOMs on busy clusters.

The NM sends a "keepalive list" to the RM of apps still aggregating.  The 
keepalive extends the RM's token renewal which usually stops when the app 
finishes.  It does this because aggregation writes to HDFS, and aggregation 
will fail if the token is cancelled or expires.
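
To illustrate how the keepalive list could carry the hard stop this issue asks 
for, here is a minimal sketch of RM-side bookkeeping.  It is not existing YARN 
code: the class LingeringAppTracker, its methods, and the hardStopAfterMs 
interval are all hypothetical names, and app ids are plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical RM-side helper; not part of the YARN API.
class LingeringAppTracker {
    // Assumed configurable interval an app may linger after finishing.
    private final long hardStopAfterMs;
    // App id -> time (ms) at which the RM considered the app finished.
    private final Map<String, Long> finishedAt = new HashMap<>();

    LingeringAppTracker(long hardStopAfterMs) {
        this.hardStopAfterMs = hardStopAfterMs;
    }

    // Record the finish time once; later calls for the same app are ignored.
    void appFinished(String appId, long nowMs) {
        finishedAt.putIfAbsent(appId, nowMs);
    }

    // Process one NM keepalive list and return the apps that NM should hard stop.
    // Apps the RM has not marked finished are never returned.
    List<String> onKeepAlive(List<String> keepAliveApps, long nowMs) {
        List<String> hardStop = new ArrayList<>();
        for (String appId : keepAliveApps) {
            Long finished = finishedAt.get(appId);
            if (finished != null && nowMs - finished >= hardStopAfterMs) {
                hardStop.add(appId);
            }
        }
        return hardStop;
    }
}
```

Entries are deliberately kept after a hard stop is issued, since several NMs 
may report the same app; how and when the RM would forget an app is left out 
of this sketch.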
                
> RM should be able to hard stop a lingering app on a NM
> ------------------------------------------------------
>
>                 Key: YARN-494
>                 URL: https://issues.apache.org/jira/browse/YARN-494
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, resourcemanager
>    Affects Versions: 0.23.3, 3.0.0, 2.0.0-alpha
>            Reporter: Daryn Sharp
>
> It's possible for a NM to "leak" applications that the RM believes have 
> finished.  This currently tends to happen when a lingering app jams in log 
> aggregation or misses the notification to begin aggregation.
> Until aggregation completes, the NMs send app keepalive requests to the RM so 
> it continues renewing the app's tokens.  This could be extended to allow the RM 
> to send a hard stop to a NM for an app that has been running for a 
> configurable interval of time after the app has finished.
