[ 
https://issues.apache.org/jira/browse/YARN-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703172#comment-13703172
 ] 

Devaraj K commented on YARN-592:
--------------------------------

Thanks Omkar for looking into the patch and trying to understand it.

This JIRA is trying to address the following two scenarios, in which the NM 
goes down while containers for an application are running, comes back up, and 
then launches containers for the same application again:

1. Graceful shutdown of the NM, followed by a restart 
2. NM crash (or abrupt kill), followed by a restart 


bq. are you assuming that after nm restarts application for which containers 
were running on that node manager will again get new container on the same node 
manager? at present NM doesn't remember the applications which were running on 
it across restart. Also RM doesn't inform NM about all the running applications 
in the cluster.
Yes. This JIRA mainly addresses the case where containers for the same 
application run on the NM both before and after the restart. This case is 
important because the NM receives the application-completed event and deletes 
all container logs for that application (including the logs of containers that 
ran before the crash). Those logs were never aggregated, so they will not be 
available in HDFS, as explained in the previous comment. If the NM doesn't 
receive the application-completed event from the RM, the logs will at least be 
available in the local logs dir.
 
bq. Now across NM restart applications might be still running or it might have 
just finished before restart. Do you want to upload the logs for both 
scenarios? at present we upload logs only when application finishes...
This patch tries to upload logs for applications that run both before and 
after the NM restart. If the application completes after the NM crash and 
before the NM starts again, the logs for the containers that ran on that node 
can at least be retrieved from the NM local logs dirs. 

If the NM is stopped gracefully, it currently uploads the logs for all running 
containers before going down, so we may not need to handle anything in this 
case.
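
To illustrate the crash case described above, here is a minimal sketch (not 
the code in YARN-592.patch) of how a restarted NM could find container log 
directories left on local disk by containers it no longer tracks, so they can 
be uploaded before the application's logs are deleted. It assumes the local 
layout <log-dir>/<application id>/<container id>/; the class name, method name 
and the set of tracked container IDs are hypothetical and are not NodeManager 
APIs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of the NodeManager.
final class LocalContainerLogScanner {

    /**
     * Returns the sorted names of container log directories under appLogDir
     * that are not in trackedContainerIds, i.e. directories left behind by
     * containers that ran before the NM restarted.
     *
     * Assumption: each container writes its logs to a sub-directory of the
     * application's local log directory, named after the container ID.
     */
    static List<String> findUntrackedContainerLogDirs(
            Path appLogDir, Set<String> trackedContainerIds) throws IOException {
        List<String> untracked = new ArrayList<>();
        if (!Files.isDirectory(appLogDir)) {
            return untracked; // nothing was written locally for this application
        }
        // Only consider entries whose names look like container IDs.
        try (DirectoryStream<Path> dirs =
                 Files.newDirectoryStream(appLogDir, "container_*")) {
            for (Path dir : dirs) {
                String containerId = dir.getFileName().toString();
                if (Files.isDirectory(dir)
                        && !trackedContainerIds.contains(containerId)) {
                    untracked.add(containerId);
                }
            }
        }
        Collections.sort(untracked);
        return untracked;
    }
}
```

On the application-completed event, the directories returned here would be 
added to the set of logs to aggregate before the local application log 
directory is removed.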

                
> Container logs lost for the application when NM gets restarted
> --------------------------------------------------------------
>
>                 Key: YARN-592
>                 URL: https://issues.apache.org/jira/browse/YARN-592
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.1-alpha, 2.0.3-alpha
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>            Priority: Critical
>         Attachments: YARN-592.patch
>
>
> While running a big job, if the NM goes down for some reason and comes 
> back, it does the log aggregation for the newly launched containers and 
> deletes all the container logs for the application. In this case we cannot 
> get the container logs, from HDFS or from local disk, for the containers 
> that were launched before the restart and completed.

