[jira] [Commented] (YARN-196) Nodemanager if started before starting Resource manager is getting shutdown.But if both RM and NM are started and then after if RM is going down,NM is retrying for the RM.

Hitesh Shah (JIRA) Fri, 15 Feb 2013 12:05:14 -0800

    [ https://issues.apache.org/jira/browse/YARN-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13579475#comment-13579475 ]


Hitesh Shah commented on YARN-196:
----------------------------------

Fix trailing space issues in the patch. There also seem to be tabs in the patch.

Some comments on the test code: 
  - indentation issues in MyNodeStatusUpdater4
  - change code layout to put MyNodeStatusUpdater4 after MyNodeStatusUpdater3
  - fix MyNodeStatusUpdater4 to make waitIntervalMS a final field that is 
passed in via the ctor. 
     - likewise for waitForEver 
  - MyNodeStatusUpdater4::getRMClient does not use waitIntervalMS
  - could reuse the waitIntervalMS var (and reset it to 20 seconds later) so as 
not to rely on 30*1000 in the first check. 
  - also, please look at reducing the time values so that the test run does not 
take too long. The current unit test takes over a minute to complete.

Use "resourcemanager.connect.wait.ms" for NM_RESOURCEMANAGER_WAIT_MS to be 
clearer.
Likewise for NM_RESOURCEMANAGER_RETRY_INTERVAL_MS, use 
"resourcemanager.connect.retry_interval.ms". 
Add both of these new variables to yarn-default.xml with a clear description.
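
For illustration only, the two keys could be wired up roughly as below. This is a 
minimal sketch, not the patch itself: it uses java.util.Properties in place of the 
Hadoop Configuration class, the "yarn.nodemanager." prefix is an assumption about 
where the keys would live, and the class name and getLong helper are hypothetical. 
Defaults are supplied by the caller because the review does not state them.

```java
import java.util.Properties;

// Hypothetical sketch class; not part of the YARN-196 patch.
final class RmConnectConfigSketch {
    // Assumption: the new keys are placed under the NodeManager prefix.
    static final String NM_PREFIX = "yarn.nodemanager.";

    // Key suffixes are the ones proposed in the review comment.
    static final String NM_RESOURCEMANAGER_WAIT_MS =
        NM_PREFIX + "resourcemanager.connect.wait.ms";
    static final String NM_RESOURCEMANAGER_RETRY_INTERVAL_MS =
        NM_PREFIX + "resourcemanager.connect.retry_interval.ms";

    private RmConnectConfigSketch() {
    }

    // Returns the configured value in milliseconds, or defaultValue if unset.
    static long getLong(Properties conf, String key, long defaultValue) {
        String raw = conf.getProperty(key);
        return raw == null ? defaultValue : Long.parseLong(raw.trim());
    }
}
```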

+        LOG.info("Connecting to ResourceManager at " + this.rmAddress);
   - should also log the current attempt counter.

+        RegisterNodeManagerRequest request = 
recordFactory.newRecordInstance(RegisterNodeManagerRequest.class);
+        request.setHttpPort(this.httpPort);
+        request.setResource(this.totalResource);
+        request.setNodeId(this.nodeId);
   - this object does not need to be created for each loop. Could be created 
once outside. 
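
The two points above (log the attempt counter, build the request once) can be 
sketched as follows. This is a simplified stand-alone illustration, not the code 
from the patch: Request and RmClient are hypothetical stand-ins for 
RegisterNodeManagerRequest and the ResourceTracker client, the node id and port 
values are placeholders, and System.out replaces the real LOG.

```java
// Hypothetical sketch class; not part of the YARN-196 patch.
final class RmConnectRetrySketch {

    /** Stand-in for the registration request; built once and reused. */
    static final class Request {
        final String nodeId;
        final int httpPort;

        Request(String nodeId, int httpPort) {
            this.nodeId = nodeId;
            this.httpPort = httpPort;
        }
    }

    /** Stand-in for the RM client; throws while the RM is unreachable. */
    interface RmClient {
        void register(Request request) throws Exception;
    }

    private RmConnectRetrySketch() {
    }

    /**
     * Retries registration until it succeeds or waitMs would be exceeded.
     * Returns the number of attempts made; throws IllegalStateException
     * if the RM could not be reached in time.
     */
    static int registerWithRetry(RmClient client, String rmAddress,
                                 long waitMs, long retryIntervalMs) {
        // Created once, outside the loop, and reused for every attempt.
        Request request = new Request("node-1:0", 8042);
        long deadline = System.currentTimeMillis() + waitMs;
        int attempt = 0;
        while (true) {
            attempt++;
            // Include the attempt counter in the log line.
            System.out.println("Connecting to ResourceManager at " + rmAddress
                + ", attempt #" + attempt);
            try {
                client.register(request);
                return attempt;
            } catch (Exception e) {
                // Give up if another sleep would take us past the deadline.
                if (System.currentTimeMillis() + retryIntervalMs > deadline) {
                    throw new IllegalStateException(
                        "Could not reach ResourceManager at " + rmAddress
                        + " after " + attempt + " attempts", e);
                }
            }
            try {
                Thread.sleep(retryIntervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(
                    "Interrupted while waiting to retry", ie);
            }
        }
    }
}
```

Passing waitMs and retryIntervalMs as parameters also lets a test use small 
values, which helps keep the test run short.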

Also, we should check what the NM does if the RM dies at an intermediate 
point, i.e., registration succeeded but at some later point the RM died. Does 
the NM keep retrying at this point, or does it die after a certain time interval 
or after all containers finish? That bit could be looked at in a separate jira 
if you find any issues in that aspect.

                
> Nodemanager if started before starting Resource manager is getting 
> shutdown.But if both RM and NM are started and then after if RM is going 
> down,NM is retrying for the RM.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-196
>                 URL: https://issues.apache.org/jira/browse/YARN-196
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.0.0-alpha
>            Reporter: Ramgopal N
>            Assignee: Xuan Gong
>         Attachments: MAPREDUCE-3676.patch, YARN-196.1.patch, 
> YARN-196.2.patch, YARN-196.3.patch
>
>
> If NM is started before starting the RM ,NM is shutting down with the 
> following error
> {code}
> ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting 
> services org.apache.hadoop.yarn.server.nodemanager.NodeManager
> org.apache.avro.AvroRuntimeException: 
> java.lang.reflect.UndeclaredThrowableException
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149)
>       at 
> org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242)
> Caused by: java.lang.reflect.UndeclaredThrowableException
>       at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145)
>       ... 3 more
> Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: 
> Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on 
> connection exception: java.net.ConnectException: Connection refused; For more 
> details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>       at 
> org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131)
>       at $Proxy23.registerNodeManager(Unknown Source)
>       at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
>       ... 5 more
> Caused by: java.net.ConnectException: Call From 
> HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused
>       at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1141)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1100)
>       at 
> org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128)
>       ... 7 more
> Caused by: java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
>       at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1117)
>       ... 9 more
> 2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: 
> AsyncDispatcher thread interrupted
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
>       at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
>       at java.lang.Thread.run(Thread.java:619)
> 2012-01-16 15:04:13,337 INFO org.apache.hadoop.yarn.service.AbstractService: 
> Service:Dispatcher is stopped.
> 2012-01-16 15:04:13,392 INFO org.mortbay.log: Stopped 
> [email protected]:9999
> 2012-01-16 15:04:13,493 INFO org.apache.hadoop.yarn.service.AbstractService: 
> Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
> 2012-01-16 15:04:13,493 INFO org.apache.hadoop.ipc.Server: Stopping server on 
> 24290
> 2012-01-16 15:04:13,494 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server listener on 24290
> 2012-01-16 15:04:13,495 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server Responder
> 2012-01-16 15:04:13,496 INFO org.apache.hadoop.yarn.service.AbstractService: 
> Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
>  is stopped.
> 2012-01-16 15:04:13,496 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: 
> AsyncDispatcher thread interrupted
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
>       at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
>       at java.lang.Thread.run(Thread.java:619)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
