[
https://issues.apache.org/jira/browse/YARN-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703172#comment-13703172
]
Devaraj K commented on YARN-592:
--------------------------------
Thanks Omkar for looking into the patch and trying to understand it.
This JIRA tries to address the following two cases, in which the NM goes down
while containers are running for an application, comes back up, and then
launches containers for the same application:
1. Graceful shutdown of the NM, followed by a restart
2. NM crash (or abrupt kill), followed by a restart
bq. are you assuming that after nm restarts application for which containers
were running on that node manager will again get new container on the same node
manager? at present NM doesn't remember the applications which were running on
it across restart. Also RM doesn't inform NM about all the running applications
in the cluster.
Yes, this JIRA mainly addresses the case where containers of the same
application run on the NM both before and after the restart. This case is
important because the NM receives the application completed event and deletes
all container logs for that application (including the logs of containers that
ran before the crash), and those logs, which were not aggregated, will not be
available in HDFS, as explained in the previous comment. If the NM doesn't
receive the application completed event from the RM, the logs will at least
remain available in the local logs dir.
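To make the scenario concrete, here is a minimal toy model in Python. It is not NodeManager code: the class, its methods, and the dictionaries standing in for the local logs dir and HDFS are all hypothetical, and it only assumes the behaviour described above (the NM forgets its containers across a restart, and on the application completed event it aggregates the containers it knows about and then deletes the application's local logs).

```python
# Toy model only: this is NOT NodeManager code. Every class and method name
# here is hypothetical and exists just to illustrate the scenario.

class ToyNodeManager:
    def __init__(self, local_logs, hdfs):
        # app_id -> {container_id: log text}; stands in for the local logs dir
        self.local_logs = local_logs
        # app_id -> {container_id: log text}; stands in for aggregated logs
        self.hdfs = hdfs
        # app_id -> set of container ids; in memory only, lost on restart
        self.known_containers = {}

    def run_container(self, app_id, container_id):
        self.known_containers.setdefault(app_id, set()).add(container_id)
        self.local_logs.setdefault(app_id, {})[container_id] = f"log of {container_id}"

    def on_application_finished(self, app_id):
        # Aggregate only the containers this NM instance knows about ...
        for cid in self.known_containers.pop(app_id, set()):
            self.hdfs.setdefault(app_id, {})[cid] = self.local_logs[app_id][cid]
        # ... then delete all local logs for the application.
        self.local_logs.pop(app_id, None)


def simulate(same_app_after_restart):
    local_logs, hdfs = {}, {}
    nm = ToyNodeManager(local_logs, hdfs)
    nm.run_container("app_1", "c1")        # ran before the crash
    nm = ToyNodeManager(local_logs, hdfs)  # crash + restart: in-memory state is gone
    if same_app_after_restart:
        nm.run_container("app_1", "c2")
        nm.on_application_finished("app_1")
    return local_logs, hdfs
```

With `simulate(True)` the log of `c1` ends up in neither dictionary, while with `simulate(False)` it stays in the local one, which mirrors the two outcomes described above.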
bq. Now across NM restart applications might be still running or it might have
just finished before restart. Do you want to upload the logs for both
scenarios? at present we upload logs only when application finishes...
This patch tries to upload logs for applications that run both before and
after the NM restart. If the application completes after the NM crashes and
before the NM starts again, the logs of the containers that ran on that node
can at least be retrieved from the NM local log dirs.
If the NM is stopped properly, it currently uploads the logs of all running
containers before going down, so this case may not need any special handling.
> Container logs lost for the application when NM gets restarted
> --------------------------------------------------------------
>
> Key: YARN-592
> URL: https://issues.apache.org/jira/browse/YARN-592
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.0.1-alpha, 2.0.3-alpha
> Reporter: Devaraj K
> Assignee: Devaraj K
> Priority: Critical
> Attachments: YARN-592.patch
>
>
> While running a big job if the NM goes down due to some reason and comes
> back, it will do the log aggregation for the newly launched containers and
> deletes all the containers for the application. This case we don't get the
> container logs from HDFS or local for the containers which are launched
> before restart and completed.