Hi John,

Would you mind filing a JIRA with more details? The RM going down just because a host was not resolvable or DNS timed out is something that should be addressed.
thanks
-- Hitesh

On Mar 13, 2014, at 2:29 PM, John Lilley wrote:

> Never mind… we figured out its DNS entry was going missing.
> john
>
> From: John Lilley [mailto:[email protected]]
> Sent: Thursday, March 13, 2014 2:52 PM
> To: [email protected]
> Subject: ResourceManager shutting down
>
> We have this erratic behavior where every so often the RM will shut down with
> an UnknownHostException. The odd thing is, the host it complains about has
> been in use for days at that point without problem. Any ideas?
> Thanks,
> John
>
>
> 2014-03-13 14:38:14,746 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(578)) - application_1394204725813_0220 State change from ACCEPTED to RUNNING
> 2014-03-13 14:38:15,794 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(449)) - Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: java.net.UnknownHostException: skitzo.office.datalever.com
>     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
>     at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.createContainerToken(LeafQueue.java:1297)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1345)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1211)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1170)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:871)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:645)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:690)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.UnknownHostException: skitzo.office.datalever.com
>     ... 15 more
> 2014-03-13 14:38:15,794 INFO resourcemanager.ResourceManager (ResourceManager.java:run(453)) - Exiting, bbye..
> 2014-03-13 14:38:15,911 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped [email protected]:8088
> 2014-03-13 14:38:16,013 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(557)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
> 2014-03-13 14:38:16,013 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics system...
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system shutdown complete.
> 2014-03-13 14:38:16,015 WARN amlauncher.ApplicationMasterLauncher (ApplicationMasterLauncher.java:run(98)) - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
> 2014-03-13 14:38:16,015 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8141
> 2014-03-13 14:38:16,017 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8050
> … and so on, it shuts down
>
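
For whoever picks this up in a JIRA, here is a rough, JDK-only sketch of the failure mode in the trace above and one possible way to contain it. It is not Hadoop code: the class TokenServiceSketch, both methods, and port 8041 are hypothetical, and buildTokenService here is only a stand-in that mimics what the trace suggests (an IllegalArgumentException wrapping an UnknownHostException when the address is unresolved). The idea is that the caller skips the one unresolvable node instead of letting the exception escape and stop the whole process.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Optional;

public class TokenServiceSketch {

    // Hypothetical stand-in for the call that fails in the trace (not Hadoop's code):
    // refuses an unresolved address by wrapping UnknownHostException.
    public static String buildTokenService(InetSocketAddress addr) {
        if (addr.isUnresolved()) {
            throw new IllegalArgumentException(
                new UnknownHostException(addr.getHostName()));
        }
        return addr.getAddress().getHostAddress() + ":" + addr.getPort();
    }

    // Hypothetical guard: treat an unresolvable host as a per-node problem
    // and report "no token service" rather than propagating the failure.
    public static Optional<String> tryBuildTokenService(InetSocketAddress addr) {
        try {
            return Optional.of(buildTokenService(addr));
        } catch (IllegalArgumentException e) {
            if (e.getCause() instanceof UnknownHostException) {
                System.err.println("Skipping node, host not resolvable: "
                    + e.getCause().getMessage());
                return Optional.empty();
            }
            throw e; // unrelated argument errors still surface
        }
    }

    public static void main(String[] args) {
        // Loopback needs no DNS lookup, so this address is always resolved.
        InetSocketAddress ok =
            new InetSocketAddress(InetAddress.getLoopbackAddress(), 8041);
        // createUnresolved performs no lookup; it simulates the missing DNS entry.
        InetSocketAddress bad =
            InetSocketAddress.createUnresolved("skitzo.office.datalever.com", 8041);

        System.out.println("resolved node has service:   "
            + tryBuildTokenService(ok).isPresent());
        System.out.println("unresolved node has service: "
            + tryBuildTokenService(bad).isPresent());
    }
}
```

Whether skipping the node, retrying the lookup, or something else is the right behaviour inside the scheduler is a design question for the JIRA; the sketch only shows that the failure can be contained at the point where the token service is built.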
