[
https://issues.apache.org/jira/browse/YARN-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Evan Tepsic updated YARN-8014:
------------------------------
Description:
A graceful shutdown and subsequent startup of a NodeManager process on YARN/HDFS
v2.8.2 successfully places the Node back into the RUNNING state. However, the
ResourceManager appears to also keep the Node in the SHUTDOWN state.
*Steps To Reproduce:*
1. SSH to host running NodeManager.
2. Switch-to UserID that NodeManager is running as (hadoop).
3. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh stop nodemanager
4. Wait for NodeManager process to terminate gracefully.
5. Confirm Node is in SHUTDOWN state via:
http://rb01rm01.local:8088/cluster/nodes
6. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh start nodemanager
7. Confirm Node is in RUNNING state via:
http://rb01rm01.local:8088/cluster/nodes
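The reproduction steps above can be sketched as a shell script. The hostnames and
paths are taken from this report; the curl check against the RM web UI is an
assumption (the report checks the page in a browser instead). The script only
prints the commands by default; set DO_IT=1 to actually execute them.

```shell
#!/usr/bin/env bash
# Dry-run wrapper: echo each command unless DO_IT=1 is set in the environment.
run() { if [ "${DO_IT:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

NM_HOST="rb0101.local"                          # host running the NodeManager
RM_URL="http://rb01rm01.local:8088/cluster/nodes"

# Steps 1-4: stop the NodeManager gracefully on its host.
run ssh hadoop@"$NM_HOST" /opt/hadoop/sbin/yarn-daemon.sh stop nodemanager
# Step 5: node should now appear as SHUTDOWN on the RM nodes page.
run curl -s "$RM_URL"
# Step 6: start the NodeManager again.
run ssh hadoop@"$NM_HOST" /opt/hadoop/sbin/yarn-daemon.sh start nodemanager
# Step 7: node should now appear as RUNNING (but the bug leaves a SHUTDOWN entry too).
run curl -s "$RM_URL"
```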
*Investigation:*
1. Review contents of ResourceManager + NodeManager log-files:
+ResourceManager log-file:+
2018-03-08 08:15:44,085 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node with
node id : rb0101.local:43892 has shutdown, hence unregistering the node.
2018-03-08 08:15:44,092 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating
Node rb0101.local:43892 as it is now SHUTDOWN
2018-03-08 08:15:44,092 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
rb0101.local:43892 Node Transitioned from RUNNING to SHUTDOWN
2018-03-08 08:15:44,093 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Removed node rb0101.local:43892 cluster capacity: <memory:110592, vCores:54>
2018-03-08 08:16:08,915 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
NodeManager from node rb0101.local(cmPort: 42627 httpPort: 8042) registered
with capability: <memory:12288, vCores:6>, assigned nodeId rb0101.local:42627
2018-03-08 08:16:08,916 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
rb0101.local:42627 Node Transitioned from NEW to RUNNING
2018-03-08 08:16:08,916 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Added node rb0101.local:42627 cluster capacity: <memory:122880, vCores:60>
2018-03-08 08:16:34,826 WARN org.apache.hadoop.ipc.Server: Large response size
2976014 for call Call#428958 Retry#0
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
192.168.1.100:44034
+NodeManager log-file:+
2018-03-08 08:00:14,500 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 10720046250, Total Deleted: 0, Public
Deleted: 0, Private Deleted: 0
2018-03-08 08:10:14,498 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 10720046250, Total Deleted: 0, Public
Deleted: 0, Private Deleted: 0
2018-03-08 08:15:44,048 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15:
SIGTERM
2018-03-08 08:15:44,101 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Successfully
Unregistered the Node rb0101.local:43892 with ResourceManager.
2018-03-08 08:15:44,114 INFO org.mortbay.log: Stopped
[email protected]:8042
2018-03-08 08:15:44,226 INFO org.apache.hadoop.ipc.Server: Stopping server on
43892
2018-03-08 08:15:44,232 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
listener on 43892
2018-03-08 08:15:44,237 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
Responder
2018-03-08 08:15:44,239 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
waiting for pending aggregation during exit
2018-03-08 08:15:44,242 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
is interrupted. Exiting.
2018-03-08 08:15:44,284 INFO org.apache.hadoop.ipc.Server: Stopping server on
8040
2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
listener on 8040
2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
Responder
2018-03-08 08:15:44,287 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Public cache exiting
2018-03-08 08:15:44,289 WARN
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl:
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is
interrupted. Exiting.
2018-03-08 08:15:44,294 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics
system...
2018-03-08 08:15:44,295 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system
stopped.
2018-03-08 08:15:44,296 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system
shutdown complete.
2018-03-08 08:15:44,297 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at rb0101.local/192.168.1.101
************************************************************/
2018-03-08 08:16:01,905 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: user = hadoop
STARTUP_MSG: host = rb0101.local/192.168.1.101
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.2
STARTUP_MSG: classpath = blahblahblah (truncated for size-purposes)
STARTUP_MSG: build = Unknown -r Unknown; compiled by 'root' on
2017-09-14T18:22Z
STARTUP_MSG: java = 1.8.0_144
************************************************************/
2018-03-08 08:16:01,918 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: registered UNIX signal
handlers for [TERM, HUP, INT]
2018-03-08 08:16:03,202 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: Node Manager health
check script is not available or doesn't have execute permission, so not
starting the node health script runner.
2018-03-08 08:16:03,321 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
2018-03-08 08:16:03,322 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServicesEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices
2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher
2018-03-08 08:16:03,347 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.ContainerManagerEventType for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
2018-03-08 08:16:03,348 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.NodeManagerEventType for class
org.apache.hadoop.yarn.server.nodemanager.NodeManager
2018-03-08 08:16:03,402 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
loaded properties from hadoop-metrics2.properties
2018-03-08 08:16:03,484 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot
period at 10 second(s).
2018-03-08 08:16:03,484 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system
started
2018-03-08 08:16:03,561 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: Using
ResourceCalculatorPlugin :
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@4b8729ff
2018-03-08 08:16:03,564 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.event.LogHandlerEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
2018-03-08 08:16:03,565 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadService
2018-03-08 08:16:03,565 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
AMRMProxyService is disabled
2018-03-08 08:16:03,566 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
per directory file limit = 8192
2018-03-08 08:16:03,621 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
usercache path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569
2018-03-08 08:16:03,667 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user1
2018-03-08 08:16:03,667 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user2
2018-03-08 08:16:03,668 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user3
2018-03-08 08:16:03,681 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user4
2018-03-08 08:16:03,739 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizerEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
2018-03-08 08:16:03,793 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding
auxiliary service mapreduce_shuffle, "mapreduce_shuffle"
2018-03-08 08:16:03,826 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Using ResourceCalculatorPlugin :
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@1187c9e8
2018-03-08 08:16:03,826 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Using ResourceCalculatorProcessTree : null
2018-03-08 08:16:03,827 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Physical memory check enabled: true
2018-03-08 08:16:03,827 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Virtual memory check enabled: true
2018-03-08 08:16:03,832 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
ContainersMonitor enabled: true
2018-03-08 08:16:03,841 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager
resources: memory set to 12288MB.
2018-03-08 08:16:03,841 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager
resources: vcores set to 6.
2018-03-08 08:16:03,846 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized
nodemanager with : physical-memory=12288 virtual-memory=25805 virtual-cores=6
2018-03-08 08:16:03,850 INFO org.apache.hadoop.util.JvmPauseMonitor: Starting
JVM pause monitor
2018-03-08 08:16:03,908 INFO org.apache.hadoop.ipc.CallQueueManager: Using
callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 2000
scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
2018-03-08 08:16:03,932 INFO org.apache.hadoop.ipc.Server: Starting Socket
Reader #1 for port 42627
2018-03-08 08:16:04,153 INFO
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding
protocol org.apache.hadoop.yarn.api.ContainerManagementProtocolPB to the server
2018-03-08 08:16:04,153 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Blocking new container-requests as container manager rpc server is still
starting.
2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 42627: starting
2018-03-08 08:16:04,166 INFO
org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
Updating node address : rb0101.local:42627
2018-03-08 08:16:04,183 INFO org.apache.hadoop.ipc.CallQueueManager: Using
callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500
scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
2018-03-08 08:16:04,184 INFO org.apache.hadoop.ipc.Server: Starting Socket
Reader #1 for port 8040
2018-03-08 08:16:04,191 INFO
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding
protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
to the server
2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 8040: starting
2018-03-08 08:16:04,192 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Localizer started on port 8040
2018-03-08 08:16:04,312 INFO org.apache.hadoop.mapred.IndexCache: IndexCache
created with max memory = 10485760
2018-03-08 08:16:04,330 INFO org.apache.hadoop.mapred.ShuffleHandler:
mapreduce_shuffle listening on port 13562
2018-03-08 08:16:04,337 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
ContainerManager started at rb0101.local/192.168.1.101:42627
2018-03-08 08:16:04,337 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
ContainerManager bound to 0.0.0.0/0.0.0.0:0
2018-03-08 08:16:04,340 INFO
org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer: Instantiating
NMWebApp at 0.0.0.0:8042
2018-03-08 08:16:04,427 INFO org.mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2018-03-08 08:16:04,436 INFO
org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable
to initialize FileSignerSecretProvider, falling back to use random secrets.
2018-03-08 08:16:04,442 INFO org.apache.hadoop.http.HttpRequestLog: Http
request log for http.requests.nodemanager is not defined
2018-03-08 08:16:04,450 INFO org.apache.hadoop.http.HttpServer2: Added global
filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2018-03-08 08:16:04,461 INFO org.apache.hadoop.http.HttpServer2: Added filter
static_user_filter
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
context node
2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added filter
static_user_filter
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
context logs
2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added filter
static_user_filter
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
context static
2018-03-08 08:16:04,462 INFO
org.apache.hadoop.security.HttpCrossOriginFilterInitializer: CORS filter not
enabled. Please set hadoop.http.cross-origin.enabled to 'true' to enable it
2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path
spec: /node/*
2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path
spec: /ws/*
2018-03-08 08:16:04,843 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered
webapp guice modules
2018-03-08 08:16:04,846 INFO org.apache.hadoop.http.HttpServer2: Jetty bound
to port 8042
2018-03-08 08:16:04,846 INFO org.mortbay.log: jetty-6.1.26
2018-03-08 08:16:04,877 INFO org.mortbay.log: Extract
jar:file:/opt/hadoop-2.8.2/share/hadoop/yarn/hadoop-yarn-common-2.8.2.jar!/webapps/node
to /tmp/Jetty_0_0_0_0_8042_node____19tj0x/webapp
2018-03-08 08:16:08,355 INFO org.mortbay.log: Started
[email protected]:8042
2018-03-08 08:16:08,356 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app
node started at 8042
2018-03-08 08:16:08,473 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node ID
assigned is : rb0101.local:42627
2018-03-08 08:16:08,498 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting
to ResourceManager at rb01rm01.local/192.168.1.100:8031
2018-03-08 08:16:08,613 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0
NM container statuses: []
2018-03-08 08:16:08,621 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering
with RM using containers :[]
2018-03-08 08:16:08,934 INFO
org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
Rolling master-key for container-tokens, got key with id -2086472604
2018-03-08 08:16:08,938 INFO
org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM:
Rolling master-key for container-tokens, got key with id -426187560
2018-03-08 08:16:08,939 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered
with ResourceManager as rb0101.local:42627 with total resource of
<memory:12288, vCores:6>
2018-03-08 08:16:08,939 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying
ContainerManager to unblock new container-requests
2018-03-08 08:26:04,174 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
Deleted: 0
2018-03-08 08:36:04,170 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
Deleted: 0
2018-03-08 08:46:04,170 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
Deleted: 0
2. Listing all of YARN's Nodes shows the node returned to the RUNNING state.
However, the same listing shows the node in 2 states, RUNNING and SHUTDOWN:
[hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -all
18/03/08 09:20:33 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/192.168.1.100:8032
18/03/08 09:20:34 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/192.168.1.100:10200
Total Nodes:11
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
rb0106.local:44160 RUNNING rb0106.local:8042 0
rb0105.local:32832 RUNNING rb0105.local:8042 0
rb0101.local:42627 RUNNING rb0101.local:8042 0
rb0108.local:38209 RUNNING rb0108.local:8042 0
rb0107.local:34306 RUNNING rb0107.local:8042 0
rb0102.local:43063 RUNNING rb0102.local:8042 0
rb0103.local:42374 RUNNING rb0103.local:8042 0
rb0109.local:37455 RUNNING rb0109.local:8042 0
rb0110.local:36690 RUNNING rb0110.local:8042 0
rb0104.local:33268 RUNNING rb0104.local:8042 0
rb0101.local:43892 SHUTDOWN rb0101.local:8042 0
[hadoop@rb01rm01 logs]$
[hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states RUNNING
18/03/08 09:20:55 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/192.168.1.100:8032
18/03/08 09:20:56 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/192.168.1.100:10200
Total Nodes:10
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
rb0106.local:44160 RUNNING rb0106.local:8042 0
rb0105.local:32832 RUNNING rb0105.local:8042 0
rb0101.local:42627 RUNNING rb0101.local:8042 0
rb0108.local:38209 RUNNING rb0108.local:8042 0
rb0107.local:34306 RUNNING rb0107.local:8042 0
rb0102.local:43063 RUNNING rb0102.local:8042 0
rb0103.local:42374 RUNNING rb0103.local:8042 0
rb0109.local:37455 RUNNING rb0109.local:8042 0
rb0110.local:36690 RUNNING rb0110.local:8042 0
rb0104.local:33268 RUNNING rb0104.local:8042 0
[hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
18/03/08 09:21:01 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/192.168.1.100:8032
18/03/08 09:21:01 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/192.168.1.100:10200
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
[hadoop@rb01rm01 logs]$
3. However, the ResourceManager does not list Node rb0101.local as SHUTDOWN
when specifically requesting the list of Nodes in the SHUTDOWN state:
[hadoop@rb01rm01 bin]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
18/03/08 08:28:23 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/v.x.y.z:8032
18/03/08 08:28:24 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/v.x.y.z:10200
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
[hadoop@rb01rm01 bin]$
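Note that the restarted NodeManager registered with a new ephemeral port
(rb0101.local:43892 before the restart, rb0101.local:42627 after), so the same
host appears under two distinct node IDs. A quick way to spot such stale
duplicate entries in the `yarn node -list -all` output (a sketch; the awk
parsing assumes the column layout shown above, with Node-Id in the first column):

```shell
# Count how many Node-Id entries each host appears under; more than one
# suggests a stale SHUTDOWN record left behind by a restart on a new port.
# The sample lines below are abbreviated from the listing above; in practice,
# pipe the output of `/opt/hadoop/bin/yarn node -list -all` in instead.
sample='rb0106.local:44160 RUNNING rb0106.local:8042 0
rb0101.local:42627 RUNNING rb0101.local:8042 0
rb0101.local:43892 SHUTDOWN rb0101.local:8042 0'

echo "$sample" | awk '{ split($1, a, ":"); seen[a[1]]++ }
  END { for (h in seen) if (seen[h] > 1) print h " has " seen[h] " entries" }'
# prints: rb0101.local has 2 entries
```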
was:
A graceful shutdown & then startup of a NodeManager process using YARN/HDFS
v2.8.2 seems to successfully place the Node back into RUNNING state. However,
ResouceManager appears to keep the Node also in SHUTDOWN state.
*Steps To Reproduce:*
1. SSH to host running NodeManager.
2. Switch-to UserID that NodeManager is running as (hadoop).
3. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh stop nodemanager
4. Wait for NodeManager process to terminate gracefully.
5. Confirm Node is in SHUTDOWN state via:
http://rb01rm01.local:8088/cluster/nodes
6. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh stop nodemanager
7. Confirm Node is in RUNNING state via:
http://rb01rm01.local:8088/cluster/nodes
*Investigation:*
1. Review contents of ResourceManager + NodeManager log-files:
+ResourceManager log-file:+
2018-03-08 08:15:44,085 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node with
node id : rb0101.local:43892 has shutdown, hence unregistering the node.
2018-03-08 08:15:44,092 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating
Node rb0101.local:43892 as it is now SHUTDOWN
2018-03-08 08:15:44,092 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
rb0101.local:43892 Node Transitioned from RUNNING to SHUTDOWN
2018-03-08 08:15:44,093 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Removed node rb0101.local:43892 cluster capacity: <memory:110592, vCores:54>
2018-03-08 08:16:08,915 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
NodeManager from node rb0101.local(cmPort: 42627 httpPort: 8042) registered
with capability: <memory:12288, vCores:6>, assigned nodeId rb0101.local:42627
2018-03-08 08:16:08,916 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
rb0101.local:42627 Node Transitioned from NEW to RUNNING
2018-03-08 08:16:08,916 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Added node rb0101.local:42627 cluster capacity: <memory:122880, vCores:60>
2018-03-08 08:16:34,826 WARN org.apache.hadoop.ipc.Server: Large response size
2976014 for call Call#428958 Retry#0
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
192.168.1.100:44034
+NodeManager log-file:+
2018-03-08 08:00:14,500 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 10720046250, Total Deleted: 0, Public
Deleted: 0, Private Deleted: 0
2018-03-08 08:10:14,498 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 10720046250, Total Deleted: 0, Public
Deleted: 0, Private Deleted: 0
2018-03-08 08:15:44,048 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15:
SIGTERM
2018-03-08 08:15:44,101 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Successfully
Unregistered the Node rb0101.local:43892 with ResourceManager.
2018-03-08 08:15:44,114 INFO org.mortbay.log: Stopped
[email protected]:8042
2018-03-08 08:15:44,226 INFO org.apache.hadoop.ipc.Server: Stopping server on
43892
2018-03-08 08:15:44,232 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
listener on 43892
2018-03-08 08:15:44,237 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
Responder
2018-03-08 08:15:44,239 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
org.apache.hadoop.yarn.server.nodemanager.containermanager.logag
gregation.LogAggregationService waiting for pending aggregation during exit
2018-03-08 08:15:44,242 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.Cont
ainersMonitorImpl is interrupted. Exiting.
2018-03-08 08:15:44,284 INFO org.apache.hadoop.ipc.Server: Stopping server on
8040
2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
listener on 8040
2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server
Responder
2018-03-08 08:15:44,287 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Public cache exiting
2018-03-08 08:15:44,289 WARN
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl:
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is
interrupted. Exiting.
2018-03-08 08:15:44,294 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Stopping NodeManager metrics system...
2018-03-08 08:15:44,295 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
NodeManager metrics system stopped.
2018-03-08 08:15:44,296 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
NodeManager metrics system shutdown complete.
2018-03-08 08:15:44,297 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at rb0101.local/192.168.1.101
************************************************************/
2018-03-08 08:16:01,905 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: user = hadoop
STARTUP_MSG: host = rb0101.local/192.168.1.101
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.2
STARTUP_MSG: classpath = blahblahblah (truncated for size-purposes)
STARTUP_MSG: build = Unknown -r Unknown; compiled by 'root' on 2017-09-14T18:22Z
STARTUP_MSG: java = 1.8.0_144
************************************************************/
2018-03-08 08:16:01,918 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: registered UNIX signal
handlers for [TERM, HUP, INT]
2018-03-08 08:16:03,202 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeManager: Node Manager health
check script is not available or doesn't have execute permission, so not
starting the
node health script runner.
2018-03-08 08:16:03,321 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
2018-03-08 08:16:03,322 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType
for c
lass
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServicesEventType
for class org.apa
che.hadoop.yarn.server.nodemanager.containermanager.AuxServices
2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorEventType
for
class
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEventType
f
or class
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher
2018-03-08 08:16:03,347 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.ContainerManagerEventType for class
org.apache.hadoop.y
arn.server.nodemanager.containermanager.ContainerManagerImpl
2018-03-08 08:16:03,348 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.NodeManagerEventType for class
org.apache.hadoop.yarn.s
erver.nodemanager.NodeManager
2018-03-08 08:16:03,402 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
loaded properties from hadoop-metrics2.properties
2018-03-08 08:16:03,484 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Scheduled Metric snapshot period at 10 second(s).
2018-03-08 08:16:03,484 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
NodeManager metrics system started
2018-03-08 08:16:03,561 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: Using
ResourceCalculatorPlugin :
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@4b8729f
f
2018-03-08 08:16:03,564 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.event.LogHandlerEventType
f
or class
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
2018-03-08 08:16:03,565 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploa
dEventType for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadService
2018-03-08 08:16:03,565 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
AMRMProxyService is disabled
2018-03-08 08:16:03,566 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
per directory file limit = 8192
2018-03-08 08:16:03,621 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
usercache path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569
2018-03-08 08:16:03,667 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user1
2018-03-08 08:16:03,667 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user2
2018-03-08 08:16:03,668 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user3
2018-03-08 08:16:03,681 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user4
2018-03-08 08:16:03,739 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
Registering class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizerEventType
for class
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
2018-03-08 08:16:03,793 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding
auxiliary service mapreduce_shuffle, "mapreduce_shuffle"
2018-03-08 08:16:03,826 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Using ResourceCalculatorPlugin :
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@1187c9e8
2018-03-08 08:16:03,826 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Using ResourceCalculatorProcessTree : null
2018-03-08 08:16:03,827 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Physical memory check enabled: true
2018-03-08 08:16:03,827 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Virtual memory check enabled: true
2018-03-08 08:16:03,832 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
ContainersMonitor enabled: true
2018-03-08 08:16:03,841 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager
resources: memory set to 12288MB.
2018-03-08 08:16:03,841 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager
resources: vcores set to 6.
2018-03-08 08:16:03,846 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized
nodemanager with : physical-memory=12288 virtual-memory=25805 virtual-cores=6
2018-03-08 08:16:03,850 INFO org.apache.hadoop.util.JvmPauseMonitor: Starting
JVM pause monitor
2018-03-08 08:16:03,908 INFO org.apache.hadoop.ipc.CallQueueManager: Using
callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 2000
scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
2018-03-08 08:16:03,932 INFO org.apache.hadoop.ipc.Server: Starting Socket
Reader #1 for port 42627
2018-03-08 08:16:04,153 INFO
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding
protocol org.apache.hadoop.yarn.api.ContainerManagementProtocolPB to the server
2018-03-08 08:16:04,153 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Blocking new container-requests as container manager rpc server is still
starting.
2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 42627: starting
2018-03-08 08:16:04,166 INFO
org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
Updating node address : rb0101.local:42627
2018-03-08 08:16:04,183 INFO org.apache.hadoop.ipc.CallQueueManager: Using
callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500
scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
2018-03-08 08:16:04,184 INFO org.apache.hadoop.ipc.Server: Starting Socket
Reader #1 for port 8040
2018-03-08 08:16:04,191 INFO
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding
protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
to the server
2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 8040: starting
2018-03-08 08:16:04,192 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Localizer started on port 8040
2018-03-08 08:16:04,312 INFO org.apache.hadoop.mapred.IndexCache: IndexCache
created with max memory = 10485760
2018-03-08 08:16:04,330 INFO org.apache.hadoop.mapred.ShuffleHandler:
mapreduce_shuffle listening on port 13562
2018-03-08 08:16:04,337 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
ContainerManager started at rb0101.local/192.168.1.101:42627
2018-03-08 08:16:04,337 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
ContainerManager bound to 0.0.0.0/0.0.0.0:0
2018-03-08 08:16:04,340 INFO
org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer: Instantiating
NMWebApp at 0.0.0.0:8042
2018-03-08 08:16:04,427 INFO org.mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2018-03-08 08:16:04,436 INFO
org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable
to initialize FileSignerSecretProvider, falling back to use random secrets.
2018-03-08 08:16:04,442 INFO org.apache.hadoop.http.HttpRequestLog: Http
request log for http.requests.nodemanager is not defined
2018-03-08 08:16:04,450 INFO org.apache.hadoop.http.HttpServer2: Added global
filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2018-03-08 08:16:04,461 INFO org.apache.hadoop.http.HttpServer2: Added filter
static_user_filter
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
context node
2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added filter
static_user_filter
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
context logs
2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added filter
static_user_filter
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
context static
2018-03-08 08:16:04,462 INFO
org.apache.hadoop.security.HttpCrossOriginFilterInitializer: CORS filter not
enabled. Please set hadoop.http.cross-origin.enabled to 'true' to enable it
2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path
spec: /node/*
2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path
spec: /ws/*
2018-03-08 08:16:04,843 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered
webapp guice modules
2018-03-08 08:16:04,846 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to
port 8042
2018-03-08 08:16:04,846 INFO org.mortbay.log: jetty-6.1.26
2018-03-08 08:16:04,877 INFO org.mortbay.log: Extract
jar:file:/opt/hadoop-2.8.2/share/hadoop/yarn/hadoop-yarn-common-2.8.2.jar!/webapps/node
to /tmp/Jetty_0_0_0_0_8042_node____19tj0x/webapp
2018-03-08 08:16:08,355 INFO org.mortbay.log: Started
[email protected]:8042
2018-03-08 08:16:08,356 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app
node started at 8042
2018-03-08 08:16:08,473 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node ID
assigned is : rb0101.local:42627
2018-03-08 08:16:08,498 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting
to ResourceManager at rb01rm01.local/192.168.1.100:8031
2018-03-08 08:16:08,613 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0
NM container statuses: []
2018-03-08 08:16:08,621 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering
with RM using containers :[]
2018-03-08 08:16:08,934 INFO
org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
Rolling master-key for container-tokens, got key with id -2086472604
2018-03-08 08:16:08,938 INFO
org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM:
Rolling master-key for container-tokens, got key with id -426187560
2018-03-08 08:16:08,939 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered
with ResourceManager as rb0101.local:42627 with total resource of
<memory:12288, vCores:6>
2018-03-08 08:16:08,939 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying
ContainerManager to unblock new container-requests
2018-03-08 08:26:04,174 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
Deleted: 0
2018-03-08 08:36:04,170 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
Deleted: 0
2018-03-08 08:46:04,170 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
Deleted: 0
2. Listing all of YARN's nodes confirms the node was returned to the RUNNING
state. However, the listing reports host rb0101.local in two states at once:
RUNNING (node ID rb0101.local:42627) and SHUTDOWN (node ID rb0101.local:43892):
[hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -all
18/03/08 09:20:33 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/192.168.1.100:8032
18/03/08 09:20:34 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/192.168.1.100:10200
Total Nodes:11
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
rb0106.local:44160 RUNNING rb0106.local:8042 0
rb0105.local:32832 RUNNING rb0105.local:8042 0
rb0101.local:42627 RUNNING rb0101.local:8042 0
rb0108.local:38209 RUNNING rb0108.local:8042 0
rb0107.local:34306 RUNNING rb0107.local:8042 0
rb0102.local:43063 RUNNING rb0102.local:8042 0
rb0103.local:42374 RUNNING rb0103.local:8042 0
rb0109.local:37455 RUNNING rb0109.local:8042 0
rb0110.local:36690 RUNNING rb0110.local:8042 0
rb0104.local:33268 RUNNING rb0104.local:8042 0
rb0101.local:43892 SHUTDOWN rb0101.local:8042 0
[hadoop@rb01rm01 logs]$
[hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states RUNNING
18/03/08 09:20:55 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/192.168.1.100:8032
18/03/08 09:20:56 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/192.168.1.100:10200
Total Nodes:10
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
rb0106.local:44160 RUNNING rb0106.local:8042 0
rb0105.local:32832 RUNNING rb0105.local:8042 0
rb0101.local:42627 RUNNING rb0101.local:8042 0
rb0108.local:38209 RUNNING rb0108.local:8042 0
rb0107.local:34306 RUNNING rb0107.local:8042 0
rb0102.local:43063 RUNNING rb0102.local:8042 0
rb0103.local:42374 RUNNING rb0103.local:8042 0
rb0109.local:37455 RUNNING rb0109.local:8042 0
rb0110.local:36690 RUNNING rb0110.local:8042 0
rb0104.local:33268 RUNNING rb0104.local:8042 0
[hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
18/03/08 09:21:01 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/192.168.1.100:8032
18/03/08 09:21:01 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/192.168.1.100:10200
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
[hadoop@rb01rm01 logs]$
3. However, ResourceManager does not list node rb0101.local as SHUTDOWN when a
list of nodes in the SHUTDOWN state is specifically requested:
[hadoop@rb01rm01 bin]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
18/03/08 08:28:23 INFO client.RMProxy: Connecting to ResourceManager at
rb01rm01.local/v.x.y.z:8032
18/03/08 08:28:24 INFO client.AHSProxy: Connecting to Application History
server at rb01rm01.local/v.x.y.z:10200
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
[hadoop@rb01rm01 bin]$
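The dual listing above can also be detected mechanically. Below is a minimal
sketch (the parser and the find_duplicate_hosts helper are illustrative, not
part of YARN) that scans `yarn node -list -all` output for a host registered
under more than one node ID, which is exactly how the stale SHUTDOWN entry for
rb0101.local shows up:

```python
# Hypothetical helper: parse `yarn node -list -all` output and report
# hosts that appear under more than one node ID (e.g. a stale SHUTDOWN
# entry left behind after a NodeManager restart on a new ephemeral port).
def find_duplicate_hosts(listing: str) -> dict:
    """Map hostname -> list of (node_id, state) for hosts with >1 entry."""
    seen = {}
    for line in listing.splitlines():
        parts = line.split()
        # Data rows look like: "rb0101.local:42627 RUNNING rb0101.local:8042 0"
        if len(parts) == 4 and ":" in parts[0] and parts[3].isdigit():
            host = parts[0].rsplit(":", 1)[0]
            seen.setdefault(host, []).append((parts[0], parts[1]))
    return {h: e for h, e in seen.items() if len(e) > 1}

sample = """\
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
rb0106.local:44160 RUNNING rb0106.local:8042 0
rb0101.local:42627 RUNNING rb0101.local:8042 0
rb0101.local:43892 SHUTDOWN rb0101.local:8042 0
"""
print(find_duplicate_hosts(sample))
# {'rb0101.local': [('rb0101.local:42627', 'RUNNING'), ('rb0101.local:43892', 'SHUTDOWN')]}
```

The same cross-check could be run against the ResourceManager REST endpoint
(/ws/v1/cluster/nodes), which returns the node list as JSON.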
> YARN ResourceManager Lists A NodeManager As RUNNING & SHUTDOWN Simultaneously
> -----------------------------------------------------------------------------
>
> Key: YARN-8014
> URL: https://issues.apache.org/jira/browse/YARN-8014
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.2
> Reporter: Evan Tepsic
> Priority: Minor
>
> A graceful shutdown & then startup of a NodeManager process using YARN/HDFS
> v2.8.2 seems to successfully place the Node back into RUNNING state. However,
> ResouceManager appears to keep the Node also in SHUTDOWN state.
>
> *Steps To Reproduce:*
> 1. SSH to host running NodeManager.
> 2. Switch-to UserID that NodeManager is running as (hadoop).
> 3. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh stop nodemanager
> 4. Wait for NodeManager process to terminate gracefully.
> 5. Confirm Node is in SHUTDOWN state via:
> [http://rb01rm01.local:8088/cluster/nodes]
> 6. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh stop nodemanager
> 7. Confirm Node is in RUNNING state via:
> [http://rb01rm01.local:8088/cluster/nodes]
>
> *Investigation:*
> 1. Review contents of ResourceManager + NodeManager log-files:
> +ResourceManager log-[file:+|file:///+]
> 2018-03-08 08:15:44,085 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node
> with node id : rb0101.local:43892 has shutdown, hence unregistering the node.
> 2018-03-08 08:15:44,092 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating
> Node rb0101.local:43892 as it is now SHUTDOWN
> 2018-03-08 08:15:44,092 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
> rb0101.local:43892 Node Transitioned from RUNNING to SHUTDOWN
> 2018-03-08 08:15:44,093 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Removed node rb0101.local:43892 cluster capacity: <memory:110592, vCores:54>
> 2018-03-08 08:16:08,915 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
> NodeManager from node rb0101.local(cmPort: 42627 httpPort: 8042) registered
> with capability: <memory:12288, vCores:6>, assigned nodeId rb0101.local:42627
> 2018-03-08 08:16:08,916 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
> rb0101.local:42627 Node Transitioned from NEW to RUNNING
> 2018-03-08 08:16:08,916 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Added node rb0101.local:42627 cluster capacity: <memory:122880, vCores:60>
> 2018-03-08 08:16:34,826 WARN org.apache.hadoop.ipc.Server: Large response
> size 2976014 for call Call#428958 Retry#0
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
> 192.168.1.100:44034
>
> +NodeManager log-[file:+|file:///+]
> 2018-03-08 08:00:14,500 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Cache Size Before Clean: 10720046250, Total Deleted: 0, Public
> Deleted: 0, Private Deleted: 0
> 2018-03-08 08:10:14,498 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Cache Size Before Clean: 10720046250, Total Deleted: 0, Public
> Deleted: 0, Private Deleted: 0
> 2018-03-08 08:15:44,048 ERROR
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15:
> SIGTERM
> 2018-03-08 08:15:44,101 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Successfully
> Unregistered the Node rb0101.local:43892 with ResourceManager.
> 2018-03-08 08:15:44,114 INFO org.mortbay.log: Stopped
> [email protected]:8042
> 2018-03-08 08:15:44,226 INFO org.apache.hadoop.ipc.Server: Stopping server
> on 43892
> 2018-03-08 08:15:44,232 INFO org.apache.hadoop.ipc.Server: Stopping IPC
> Server listener on 43892
> 2018-03-08 08:15:44,237 INFO org.apache.hadoop.ipc.Server: Stopping IPC
> Server Responder
> 2018-03-08 08:15:44,239 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logag
> gregation.LogAggregationService waiting for pending aggregation during exit
> 2018-03-08 08:15:44,242 WARN
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.Cont
> ainersMonitorImpl is interrupted. Exiting.
> 2018-03-08 08:15:44,284 INFO org.apache.hadoop.ipc.Server: Stopping server
> on 8040
> 2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC
> Server listener on 8040
> 2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC
> Server Responder
> 2018-03-08 08:15:44,287 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Public cache exiting
> 2018-03-08 08:15:44,289 WARN
> org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl:
> org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is
> interrupted. Exiting.
> 2018-03-08 08:15:44,294 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager
> metrics system...
> 2018-03-08 08:15:44,295 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system
> stopped.
> 2018-03-08 08:15:44,296 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system
> shutdown complete.
> 2018-03-08 08:15:44,297 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at rb0101.local/192.168.1.101
> ************************************************************/
> 2018-03-08 08:16:01,905 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NodeManager
> STARTUP_MSG: user = hadoop
> STARTUP_MSG: host = rb0101.local/192.168.1.101
> STARTUP_MSG: args = []
> STARTUP_MSG: version = 2.8.2
> STARTUP_MSG: classpath = blahblahblah (truncated for size-purposes)
> STARTUP_MSG: build = Unknown -r Unknown; compiled by 'root' on
> 2017-09-14T18:22Z
> STARTUP_MSG: java = 1.8.0_144
> ************************************************************/
> 2018-03-08 08:16:01,918 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: registered UNIX signal
> handlers for [TERM, HUP, INT]
> 2018-03-08 08:16:03,202 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: Node Manager health
> check script is not available or doesn't have execute permission, so not
> starting the
> node health script runner.
> 2018-03-08 08:16:03,321 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType
> for class
>
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
> 2018-03-08 08:16:03,322 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType
> for c
> lass
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
> 2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType
> for class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
> 2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServicesEventType
> for class org.apa
> che.hadoop.yarn.server.nodemanager.containermanager.AuxServices
> 2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorEventType
> for
> class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> 2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEventType
> f
> or class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher
> 2018-03-08 08:16:03,347 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.ContainerManagerEventType for class
> org.apache.hadoop.y
> arn.server.nodemanager.containermanager.ContainerManagerImpl
> 2018-03-08 08:16:03,348 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.NodeManagerEventType for class
> org.apache.hadoop.yarn.s
> erver.nodemanager.NodeManager
> 2018-03-08 08:16:03,402 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
> loaded properties from hadoop-metrics2.properties
> 2018-03-08 08:16:03,484 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot
> period at 10 second(s).
> 2018-03-08 08:16:03,484 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system
> started
> 2018-03-08 08:16:03,561 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: Using
> ResourceCalculatorPlugin :
> org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@4b8729f
> f
> 2018-03-08 08:16:03,564 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.event.LogHandlerEventType
> f
> or class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
> 2018-03-08 08:16:03,565 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploa
> dEventType for class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadService
> 2018-03-08 08:16:03,565 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> AMRMProxyService is disabled
> 2018-03-08 08:16:03,566 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> per directory file limit = 8192
> 2018-03-08 08:16:03,621 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> usercache path :
> [file:/space/hadoop/tmp/nm-local-dir/usercache_|file:///space/hadoop/tmp/nm-local-dir/usercache_]
> DEL_1520518563569
> 2018-03-08 08:16:03,667 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> path :
> [file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user1|file:///space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user1]
> 2018-03-08 08:16:03,667 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> path :
> [file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user2|file:///space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user2]
> 2018-03-08 08:16:03,668 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> path :
> [file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user3|file:///space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user3]
> 2018-03-08 08:16:03,681 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> path :
> [file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user4|file:///space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user4]
> 2018-03-08 08:16:03,739 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Registering class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizerEventType
> for class
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
> 2018-03-08 08:16:03,793 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices:
> Adding auxiliary service mapreduce_shuffle, "mapreduce_shuffle"
> 2018-03-08 08:16:03,826 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Using ResourceCalculatorPlugin :
> org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@1187c9e8
> 2018-03-08 08:16:03,826 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Using ResourceCalculatorProcessTree : null
> 2018-03-08 08:16:03,827 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Physical memory check enabled: true
> 2018-03-08 08:16:03,827 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Virtual memory check enabled: true
> 2018-03-08 08:16:03,832 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> ContainersMonitor enabled: true
> 2018-03-08 08:16:03,841 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager
> resources: memory set to 12288MB.
> 2018-03-08 08:16:03,841 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager
> resources: vcores set to 6.
> 2018-03-08 08:16:03,846 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized
> nodemanager with : physical-memory=12288 virtual-memory=25805 virtual-cores=6
> 2018-03-08 08:16:03,850 INFO org.apache.hadoop.util.JvmPauseMonitor:
> Starting JVM pause monitor
> 2018-03-08 08:16:03,908 INFO org.apache.hadoop.ipc.CallQueueManager: Using
> callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 2000
> scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
> 2018-03-08 08:16:03,932 INFO org.apache.hadoop.ipc.Server: Starting Socket
> Reader #1 for port 42627
> 2018-03-08 08:16:04,153 INFO
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding
> protocol org.apache.hadoop.yarn.api.ContainerManagementProtocolPB to the
> server
> 2018-03-08 08:16:04,153 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> Blocking new container-requests as container manager rpc server is still
> starting.
> 2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server
> Responder: starting
> 2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server
> listener on 42627: starting
> 2018-03-08 08:16:04,166 INFO
> org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
> Updating node address : rb0101.local:42627
> 2018-03-08 08:16:04,183 INFO org.apache.hadoop.ipc.CallQueueManager: Using
> callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500
> scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
> 2018-03-08 08:16:04,184 INFO org.apache.hadoop.ipc.Server: Starting Socket
> Reader #1 for port 8040
> 2018-03-08 08:16:04,191 INFO
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding
> protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> to the server
> 2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server
> Responder: starting
> 2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server
> listener on 8040: starting
> 2018-03-08 08:16:04,192 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Localizer started on port 8040
> 2018-03-08 08:16:04,312 INFO org.apache.hadoop.mapred.IndexCache: IndexCache
> created with max memory = 10485760
> 2018-03-08 08:16:04,330 INFO org.apache.hadoop.mapred.ShuffleHandler:
> mapreduce_shuffle listening on port 13562
> 2018-03-08 08:16:04,337 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> ContainerManager started at rb0101.local/192.168.1.101:42627
> 2018-03-08 08:16:04,337 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> ContainerManager bound to 0.0.0.0/0.0.0.0:0
> 2018-03-08 08:16:04,340 INFO
> org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer: Instantiating
> NMWebApp at 0.0.0.0:8042
> 2018-03-08 08:16:04,427 INFO org.mortbay.log: Logging to
> org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
> org.mortbay.log.Slf4jLog
> 2018-03-08 08:16:04,436 INFO
> org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable
> to initialize FileSignerSecretProvider, falling back to use random secrets.
> 2018-03-08 08:16:04,442 INFO org.apache.hadoop.http.HttpRequestLog: Http
> request log for http.requests.nodemanager is not defined
> 2018-03-08 08:16:04,450 INFO org.apache.hadoop.http.HttpServer2: Added
> global filter 'safety'
> (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 2018-03-08 08:16:04,461 INFO org.apache.hadoop.http.HttpServer2: Added
> filter static_user_filter
> (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
> context node
> 2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added
> filter static_user_filter
> (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
> context logs
> 2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added
> filter static_user_filter
> (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
> context static
> 2018-03-08 08:16:04,462 INFO
> org.apache.hadoop.security.HttpCrossOriginFilterInitializer: CORS filter not
> enabled. Please set hadoop.http.cross-origin.enabled to 'true' to enable it
> 2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path
> spec: /node/*
> 2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path
> spec: /ws/*
> 2018-03-08 08:16:04,843 INFO org.apache.hadoop.yarn.webapp.WebApps:
> Registered webapp guice modules
> 2018-03-08 08:16:04,846 INFO org.apache.hadoop.http.HttpServer2: Jetty bound
> to port 8042
> 2018-03-08 08:16:04,846 INFO org.mortbay.log: jetty-6.1.26
> 2018-03-08 08:16:04,877 INFO org.mortbay.log: Extract
> jar:[file:/opt/hadoop-2.8.2/share/hadoop/yarn/hadoop-yarn-common-2.8.2.jar!/webapps/node|file:///opt/hadoop-2.8.2/share/hadoop/yarn/hadoop-yarn-common-2.8.2.jar!/webapps/node]
> to /tmp/Jetty_0_0_0_0_8042_node____19tj0x/webapp
> 2018-03-08 08:16:08,355 INFO org.mortbay.log: Started
> [email protected]:8042
> 2018-03-08 08:16:08,356 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app
> node started at 8042
> 2018-03-08 08:16:08,473 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node ID
> assigned is : rb0101.local:42627
> 2018-03-08 08:16:08,498 INFO org.apache.hadoop.yarn.client.RMProxy:
> Connecting to ResourceManager at rb01rm01.local/192.168.1.100:8031
> 2018-03-08 08:16:08,613 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out
> 0 NM container statuses: []
> 2018-03-08 08:16:08,621 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering
> with RM using containers :[]
> 2018-03-08 08:16:08,934 INFO
> org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
> Rolling master-key for container-tokens, got key with id -2086472604
> 2018-03-08 08:16:08,938 INFO
> org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM:
> Rolling master-key for container-tokens, got key with id -426187560
> 2018-03-08 08:16:08,939 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered
> with ResourceManager as rb0101.local:42627 with total resource of
> <memory:12288, vCores:6>
> 2018-03-08 08:16:08,939 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying
> ContainerManager to unblock new container-requests
> 2018-03-08 08:26:04,174 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
> Deleted: 0
> 2018-03-08 08:36:04,170 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
> Deleted: 0
> 2018-03-08 08:46:04,170 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private
> Deleted: 0
> 2. Listing all of YARN's nodes shows that the node has returned to the
> RUNNING state; however, the same host now appears under two node IDs in two
> states, RUNNING and SHUTDOWN:
> [hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -all
> 18/03/08 09:20:33 INFO client.RMProxy: Connecting to ResourceManager at
> rb01rm01.local/192.168.1.100:8032
> 18/03/08 09:20:34 INFO client.AHSProxy: Connecting to Application History
> server at rb01rm01.local/192.168.1.100:10200
> Total Nodes:11
> Node-Id Node-State Node-Http-Address Number-of-Running-Containers
> rb0106.local:44160 RUNNING rb0106.local:8042 0
> rb0105.local:32832 RUNNING rb0105.local:8042 0
> rb0101.local:42627 RUNNING rb0101.local:8042 0
> rb0108.local:38209 RUNNING rb0108.local:8042 0
> rb0107.local:34306 RUNNING rb0107.local:8042 0
> rb0102.local:43063 RUNNING rb0102.local:8042 0
> rb0103.local:42374 RUNNING rb0103.local:8042 0
> rb0109.local:37455 RUNNING rb0109.local:8042 0
> rb0110.local:36690 RUNNING rb0110.local:8042 0
> rb0104.local:33268 RUNNING rb0104.local:8042 0
> rb0101.local:43892 SHUTDOWN rb0101.local:8042 0
> [hadoop@rb01rm01 logs]$
> [hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states RUNNING
> 18/03/08 09:20:55 INFO client.RMProxy: Connecting to ResourceManager at
> rb01rm01.local/192.168.1.100:8032
> 18/03/08 09:20:56 INFO client.AHSProxy: Connecting to Application History
> server at rb01rm01.local/192.168.1.100:10200
> Total Nodes:10
> Node-Id Node-State Node-Http-Address Number-of-Running-Containers
> rb0106.local:44160 RUNNING rb0106.local:8042 0
> rb0105.local:32832 RUNNING rb0105.local:8042 0
> rb0101.local:42627 RUNNING rb0101.local:8042 0
> rb0108.local:38209 RUNNING rb0108.local:8042 0
> rb0107.local:34306 RUNNING rb0107.local:8042 0
> rb0102.local:43063 RUNNING rb0102.local:8042 0
> rb0103.local:42374 RUNNING rb0103.local:8042 0
> rb0109.local:37455 RUNNING rb0109.local:8042 0
> rb0110.local:36690 RUNNING rb0110.local:8042 0
> rb0104.local:33268 RUNNING rb0104.local:8042 0
> [hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
> 18/03/08 09:21:01 INFO client.RMProxy: Connecting to ResourceManager at
> rb01rm01.local/192.168.1.100:8032
> 18/03/08 09:21:01 INFO client.AHSProxy: Connecting to Application History
> server at rb01rm01.local/192.168.1.100:10200
> Total Nodes:0
> Node-Id Node-State Node-Http-Address Number-of-Running-Containers
> [hadoop@rb01rm01 logs]$
> 3. The ResourceManager, however, does not list node rb0101.local as SHUTDOWN
> when specifically asked for the list of nodes in the SHUTDOWN state:
> [hadoop@rb01rm01 bin]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
> 18/03/08 08:28:23 INFO client.RMProxy: Connecting to ResourceManager at
> rb01rm01.local/v.x.y.z:8032
> 18/03/08 08:28:24 INFO client.AHSProxy: Connecting to Application History
> server at rb01rm01.local/v.x.y.z:10200
> Total Nodes:0
> Node-Id Node-State Node-Http-Address Number-of-Running-Containers
> [hadoop@rb01rm01 bin]$
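Note that the two entries for rb0101.local differ only in port (43892 before the restart, 42627 after): the node ID embeds the NodeManager's ephemeral RPC port, so after a restart the NM registers under a new ID while the old ID apparently lingers in the RM's node list. As a rough illustration of how such stale entries can be spotted in `yarn node -list -all` output, here is a small helper sketch (not part of Hadoop; `find_duplicate_hosts` and the sample rows are hypothetical):

```python
from collections import defaultdict

def find_duplicate_hosts(listing):
    """Group `yarn node -list -all` rows by hostname and return hosts
    that appear under more than one node ID (i.e. more than one port)."""
    by_host = defaultdict(list)
    for line in listing.strip().splitlines():
        parts = line.split()
        # Skip the header row and client INFO lines: data rows have exactly
        # four columns and a host:port in the first column.
        if len(parts) != 4 or ":" not in parts[0]:
            continue
        node_id, state = parts[0], parts[1]
        host = node_id.rsplit(":", 1)[0]
        by_host[host].append((node_id, state))
    return {h: entries for h, entries in by_host.items() if len(entries) > 1}

# Abbreviated sample mirroring the listing above.
sample = """\
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
rb0101.local:42627 RUNNING rb0101.local:8042 0
rb0104.local:33268 RUNNING rb0104.local:8042 0
rb0101.local:43892 SHUTDOWN rb0101.local:8042 0
"""
print(find_duplicate_hosts(sample))
# {'rb0101.local': [('rb0101.local:42627', 'RUNNING'),
#                   ('rb0101.local:43892', 'SHUTDOWN')]}
```

On the listing above this would flag only rb0101.local, matching the duplicate RUNNING/SHUTDOWN pair the report describes.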