Jason Lowe commented on YARN-4216:

If we're decommissioning a node then we're not doing a rolling upgrade of it.
Decomm of a node should kill all of the containers on the node, upload the
logs, then shut down the node.  That's not a rolling upgrade since we lose work.
It may be rolling in the sense that we can go through the nodes in a serial
fashion, but since work is being lost at each step it's significantly different
from the rolling upgrade with work-preserving restart.

What we're talking about here is reinsertion of a previously decomm'd node that
ends up running containers for an application that already had logs aggregated.
That is slightly different from the JIRA title, which implies work-preserving
restart.  Having the NM append the new logs would be a reasonable approach to
try to avoid log loss, although there's the problem of active readers for the
logs.  If we're appending then readers that come along to parse the logs can
find a partially written entry at the end.  We'd either have to live with that
possibility or have the NM copy the existing logs to the .tmp file, append the
new logs there, then atomically replace the previous logs with the new
version.  Not all filesystems support atomic replace, but HDFS can do it.
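
To make the copy/append/replace idea concrete, here is a rough sketch.  It uses
java.nio on a local filesystem as a stand-in, not the HDFS API and not the NM's
actual aggregation code; the class and method names are made up for
illustration.  On HDFS the last step would presumably be a rename with
overwrite instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class AppendViaTmp {

    // Adds newLogs to the aggregated file so that readers see either the
    // old complete file or the new complete file, never a partial append.
    static void appendAtomically(Path aggregated, byte[] newLogs) throws IOException {
        // e.g. localhost_38153 -> localhost_38153.tmp in the same directory
        Path tmp = aggregated.resolveSibling(aggregated.getFileName() + ".tmp");

        if (Files.exists(aggregated)) {
            // Step 1: copy the existing aggregated logs to the .tmp file.
            Files.copy(aggregated, tmp, StandardCopyOption.REPLACE_EXISTING);
        } else {
            // Nothing aggregated yet: start from an empty .tmp file.
            Files.deleteIfExists(tmp);
            Files.createFile(tmp);
        }

        // Step 2: append the new logs to the .tmp copy only.
        Files.write(tmp, newLogs, StandardOpenOption.WRITE, StandardOpenOption.APPEND);

        // Step 3: swap the .tmp file in for the previous logs.  Whether
        // ATOMIC_MOVE replaces an existing target is platform specific;
        // on POSIX filesystems it maps to rename(), which does replace it.
        Files.move(tmp, aggregated, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("agg-logs");
        Path aggregated = dir.resolve("localhost_38153");
        Files.write(aggregated, "logs from before the restart\n".getBytes(StandardCharsets.UTF_8));
        appendAtomically(aggregated, "logs from after the restart\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(Files.readAllBytes(aggregated), StandardCharsets.UTF_8));
    }
}
```

The cost is rewriting the whole aggregated file every time the node comes back,
which is the tradeoff against living with partially written logs.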

> Container logs not shown for newly assigned containers  after NM  recovery
> --------------------------------------------------------------------------
>                 Key: YARN-4216
>                 URL: https://issues.apache.org/jira/browse/YARN-4216
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, nodemanager
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit pi job with 20 maps
> # Once 5 maps have completed on NM1, stop the NM (yarn daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop, the file is named (localhost_38153)
> # On log aggregation after starting the NM, the newly assigned container logs get
> uploaded with the name (localhost_38153.tmp)
> In the history server the logs are not shown for new task attempts
