Hi Hitesh,
Yes, it is an issue. It is handled in
https://issues.apache.org/jira/i#browse/YARN-713, which fixes the DNS issue.
The fix is available in hadoop-2.4 (unreleased).
Thanks & Regards
Rohith Sharma K S
-----Original Message-----
From: Hitesh Shah [mailto:[email protected]]
Sent: 14 March 2014 09:03
To: [email protected]
Subject: Re: ResourceManager shutting down
Hi John
Would you mind filing a JIRA with more details? The RM going down just because
a host was not resolvable or DNS timed out is something that should be
addressed.
thanks
-- Hitesh
On Mar 13, 2014, at 2:29 PM, John Lilley wrote:
> Never mind... we figured out its DNS entry was going missing.
> john
>
> From: John Lilley [mailto:[email protected]]
> Sent: Thursday, March 13, 2014 2:52 PM
> To: [email protected]
> Subject: ResourceManager shutting down
>
> We have this erratic behavior where every so often the RM will shut down with
> an UnknownHostException. The odd thing is, the host it complains about has
> been in use for days at that point without problem. Any ideas?
> Thanks,
> John
>
>
> 2014-03-13 14:38:14,746 INFO rmapp.RMAppImpl
> (RMAppImpl.java:handle(578)) - application_1394204725813_0220 State
> change from ACCEPTED to RUNNING
> 2014-03-13 14:38:15,794 FATAL resourcemanager.ResourceManager
> (ResourceManager.java:run(449)) - Error in handling event type
> NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: java.net.UnknownHostException:
> skitzo.office.datalever.com
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
> at
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
> at
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.createContainerToken(LeafQueue.java:1297)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1345)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1211)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1170)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:871)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:645)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:690)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.UnknownHostException: skitzo.office.datalever.com
> ... 15 more
> 2014-03-13 14:38:15,794 INFO resourcemanager.ResourceManager
> (ResourceManager.java:run(453)) - Exiting, bbye..
> 2014-03-13 14:38:15,911 INFO mortbay.log (Slf4jLog.java:info(67)) -
> Stopped [email protected]:8088
> 2014-03-13 14:38:16,013 ERROR
> delegation.AbstractDelegationTokenSecretManager
> (AbstractDelegationTokenSecretManager.java:run(557)) -
> InterruptedExcpetion recieved for ExpiredTokenRemover thread
> java.lang.InterruptedException: sleep interrupted
> 2014-03-13 14:38:16,013 INFO impl.MetricsSystemImpl
> (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics
> system...
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl
> (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl
> (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system
> shutdown complete.
> 2014-03-13 14:38:16,015 WARN amlauncher.ApplicationMasterLauncher
> (ApplicationMasterLauncher.java:run(98)) -
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread
> interrupted. Returning.
> 2014-03-13 14:38:16,015 INFO ipc.Server (Server.java:stop(2442)) -
> Stopping server on 8141
> 2014-03-13 14:38:16,017 INFO ipc.Server (Server.java:stop(2442)) -
> Stopping server on 8050 ... and so on, it shuts down
>
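
A note on the trace above: the UnknownHostException is wrapped in an IllegalArgumentException and escapes the scheduler's event handler, which is what takes the whole RM down. The sketch below shows one way a per-node lookup failure could be contained so that only the affected node is skipped. It is not the YARN-713 change and uses no Hadoop classes; NodeUpdateSketch, Resolver, buildServiceName and handleNodeUpdates are all hypothetical names invented for illustration.

```java
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NodeUpdateSketch {

    /** Hypothetical stand-in for a hostname lookup that may fail. */
    public interface Resolver {
        String resolve(String host) throws UnknownHostException;
    }

    /**
     * Mirrors the wrapping seen in the trace: the checked
     * UnknownHostException is rethrown as IllegalArgumentException.
     */
    public static String buildServiceName(String host, Resolver resolver) {
        try {
            return resolver.resolve(host);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /**
     * Processes every host and returns the ones that were skipped because
     * they could not be resolved. Any other failure is still propagated.
     */
    public static List<String> handleNodeUpdates(List<String> hosts, Resolver resolver) {
        List<String> skipped = new ArrayList<>();
        for (String host : hosts) {
            try {
                buildServiceName(host, resolver);
            } catch (IllegalArgumentException e) {
                if (e.getCause() instanceof UnknownHostException) {
                    skipped.add(host); // contain the failure to this node
                } else {
                    throw e;
                }
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Assumed test resolver: one host "loses" its DNS entry.
        Resolver flaky = host -> {
            if (host.equals("skitzo.office.datalever.com")) {
                throw new UnknownHostException(host);
            }
            return "10.0.0.1";
        };
        List<String> skipped = handleNodeUpdates(
                Arrays.asList("nodeA", "skitzo.office.datalever.com", "nodeB"), flaky);
        System.out.println("skipped: " + skipped);
    }
}
```

Whether skipping the node is the right policy (as opposed to retrying or marking it unhealthy) is a design choice this sketch does not settle.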