Tao Yang created YARN-7751:
------------------------------
Summary: Decommissioned NM leaves orphaned containers
Key: YARN-7751
URL: https://issues.apache.org/jira/browse/YARN-7751
Project: Hadoop YARN
Issue Type: Bug
Reporter: Tao Yang
Recently we found some orphaned containers running on a decommissioned NM in
our production cluster. The problem began with a PCIe error on this node: one
of the local directories became unwritable, so containers whose pid files were
located on it could not be cleaned up successfully. A few moments later, the
NM changed to the DECOMMISSIONED state and exited, leaving those containers
running.
Corresponding logs in NM:
{noformat}
2018-01-12 21:31:38,495 WARN [DiskHealthMonitor-Timer]
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory
/dump/2/nm-logs error, Directory is not writable: /dump/2/nm-logs, removing
from list of valid directories
2018-01-12 21:41:23,352 INFO [AsyncDispatcher event handler]
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Cleaning up container container_e37_1508697357114_216838_01_001812
2018-01-12 21:41:25,601 INFO [AsyncDispatcher event handler]
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Could not get pid for container_e37_1508697357114_216838_01_001812. Waited for
2000 ms.
{noformat}
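To illustrate the failure mode in the log above, here is a minimal sketch of a
pid lookup that polls a pid file and gives up after a timeout. It is not the
NodeManager's actual code: the class PidFileWait, the method waitForPid, and
the file name container.pid are hypothetical names chosen for illustration.
The point is only that when the pid file cannot be read, the cleanup path has
no pid to signal, so the container process can outlive the NM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PidFileWait {
    // Hypothetical helper, not the NodeManager implementation: poll a pid
    // file until it yields a non-empty pid or the timeout elapses.
    static String waitForPid(Path pidFile, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            try {
                String pid = new String(Files.readAllBytes(pidFile),
                        StandardCharsets.UTF_8).trim();
                if (!pid.isEmpty()) {
                    return pid;
                }
            } catch (IOException e) {
                // Missing or unreadable pid file: keep polling until the deadline.
            }
            if (System.currentTimeMillis() >= deadline) {
                return null; // caller has no pid, so it cannot kill the process
            }
            Thread.sleep(pollMs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a pid file that was never written (e.g. on a bad disk).
        Path missing = Files.createTempDirectory("nm-local")
                .resolve("container.pid");
        String pid = waitForPid(missing, 200, 50);
        if (pid == null) {
            System.out.println("Could not get pid; process cannot be signalled");
        }
    }
}
```

In this sketch a null return means the caller must either skip the kill or
find the process by some other means; skipping it is what would leave an
orphaned container behind.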