[ https://issues.apache.org/jira/browse/YARN-5937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-5937:
------------------------------
    Description: 
stop-yarn.sh always gives the following output:

{code}
./sbin/stop-yarn.sh
Stopping resourcemanager
Stopping nodemanagers
<NM_HOST>: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
<NM_HOST>: ERROR: Unable to kill 18097
{code}

This happens because the resource manager is stopped before the node
managers. When the shutdown hook manager tries to gracefully stop the NM
services, the NM needs to unregister with the RM, and that call times out
because the NM cannot connect to the RM (already stopped). See the log below
(stop the RM, then run kill <nm_pid>):

{code}
16/11/28 08:26:43 ERROR nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
...
16/11/28 08:26:53 WARN util.ShutdownHookManager: ShutdownHook 'CompositeServiceShutdownHook' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
...
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:291)
...
16/11/28 08:27:13 ERROR util.ShutdownHookManager: ShutdownHookManger shutdown forcefully.
{code}

The shutdown hook has a default timeout of 10s, so if the RM is stopped
before the NMs, the NMs always take more than 10s to stop (in the Java code).
However, stop-yarn.sh only allows 5s, so the NM is always killed instead of
being stopped gracefully.
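
To make the timing concrete, here is a small sketch that computes the gaps
between the timestamps in the log above. The helper names to_seconds and
elapsed are made up for illustration and are not part of any Hadoop script.

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
  IFS=: read -r h m s <<EOF
$1
EOF
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the number of seconds between two timestamps on the same day.
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 08:26:43 08:26:53   # SIGTERM -> shutdown hook timeout, prints 10
elapsed 08:26:43 08:27:13   # SIGTERM -> forced shutdown, prints 30
```

The first gap matches the 10s hook timeout. The second shows the NM was still
running 30s after SIGTERM, well past the 5s window reported by stop-yarn.sh.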

It would make sense for this script to stop the NMs before the RM, so that the
NMs can shut down gracefully.
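
As a rough illustration of the proposed ordering only (not the attached
patch), the sketch below uses a hypothetical stop_daemon stub in place of the
real per-daemon stop logic, so the sequence is easy to see. Both stop_daemon
and stop_yarn are assumed names.

```shell
# Hypothetical stand-in for the real per-daemon stop logic; it only
# reports what would be stopped.
stop_daemon() {
  echo "Stopping $1"
}

# Stop the node managers first, while the resource manager is still
# reachable for the unregister call, then stop the resource manager.
stop_yarn() {
  stop_daemon nodemanagers
  stop_daemon resourcemanager
}

stop_yarn
```

With this order the RM is still up when each NM tries to unregister, so the
unregister call should not have to wait for a connection timeout.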

  was:
stop-yarn.sh always gives following output

{code}
./sbin/stop-yarn.sh
Stopping resourcemanager
Stopping nodemanagers
<NM_HOST>: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying 
to kill with kill -9
oracle1.fyre.ibm.com: ERROR: Unable to kill 18097
{code}

this was because resource manager is stopped before node managers, when the 
shutdown hook manager tries to gracefully stop NM services, NM needs to 
unregister with RM, and it gets timeout as NM could not connect to RM (already 
stopped). See log (stop RM then run kill <nm_pid>)

{code}
16/11/28 08:26:43 ERROR nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
...
16/11/28 08:26:53 WARN util.ShutdownHookManager: ShutdownHook 
'CompositeServiceShutdownHook' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
...
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:291)
...
16/11/28 08:27:13 ERROR util.ShutdownHookManager: ShutdownHookManger shutdown 
forcefully.
{code}

the shutdown hooker has a default of 10s timeout, so if RM is stopped before 
NMs, they always took more than 10s to stop (in java code). However 
stop-yarn.sh only gives 5s timeout, so NM is always killed instead of stopped.

It would make sense to stop NMs before RMs in this script, in a graceful way.


> stop-yarn.sh is not able to gracefully stop node managers
> ---------------------------------------------------------
>
>                 Key: YARN-5937
>                 URL: https://issues.apache.org/jira/browse/YARN-5937
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>         Attachments: YARN-5937.01.patch, nm_shutdown.log


