[
https://issues.apache.org/jira/browse/YARN-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13574908#comment-13574908
]
Hitesh Shah commented on YARN-196:
----------------------------------
Xuan, some comments:
- testNMShutdownForRegistrationFailure tests for an explicit command from the
RM telling the NM to shut down.
- failing within 10 seconds seems too quick. The RPC layer internally retries
for a certain time period. At the NM layer, we should probably define a total
wait time, say 15 minutes, and retry every 30 seconds or so within that
period.
- also someone should be able to set the time period to -1 to disable the
upper bound and retry forever if needed.
- use the same conventions as used elsewhere when naming variables -
rm_Retry_interval_ms does not conform to the standards defined in the class.
- 'LOG.debug("Fail to connect to RM");' - change to error level and log the
exception stack trace unless it is already being caught and printed elsewhere.
It would also help to log how many retries were attempted before giving up.
- in the start() function, there is an AvroRuntimeException being thrown - we
should replace that with YarnException or an appropriate runtime exception.
- isRMStarted var is not needed - a simple break in the loop if the
registration is done should suffice.
- please remove the space in "rm_Retry_Count --;"
- the debug log message at the end of the loop should be logged at WARN
level. Also, please rephrase it for clarity - something along the lines
of "Retrying connection to RM, current no. of failed attempts ..."
- the current patch seems to be catching all exceptions. This will cause a
problem in the case where the RM explicitly asks the NM to shut down - maybe it
makes sense to move the retry logic into the registerWithRM function?
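
To make the suggestions above concrete, here is a rough sketch of the retry
loop I have in mind. It is not the patch itself: the class, the Registrar
interface, ShutdownRequestedException and the parameter names are all
hypothetical stand-ins, and logging goes to System.err only to keep the
example self-contained. It assumes connection failures surface as IOException
and that an explicit shutdown command surfaces as a separate unchecked
exception, which is therefore not retried.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

final class RegistrationRetrySketch {

  /** Hypothetical stand-in for the NM's registerWithRM() call. */
  interface Registrar {
    void register() throws IOException;
  }

  /** Hypothetical marker for an explicit shutdown command from the RM. */
  static final class ShutdownRequestedException extends RuntimeException {
    ShutdownRequestedException(String message) {
      super(message);
    }
  }

  /**
   * Tries to register until it succeeds or the total wait time is used up.
   * A total wait time of -1 disables the upper bound and retries forever.
   * Returns the number of attempts made, including the successful one.
   */
  static int registerWithRetries(Registrar registrar,
                                 long rmConnectMaxWaitMs,
                                 long rmRetryIntervalMs)
      throws IOException, InterruptedException {
    boolean retryForever = (rmConnectMaxWaitMs == -1);
    long deadlineNanos =
        System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(rmConnectMaxWaitMs);
    int failedAttempts = 0;

    while (true) {
      try {
        registrar.register();
        // Registration done: leaving the loop here replaces an isRMStarted flag.
        return failedAttempts + 1;
      } catch (IOException e) {
        // Only connection-style failures are caught. ShutdownRequestedException
        // is unchecked and propagates, so an explicit shutdown is not retried.
        failedAttempts++;
        if (!retryForever && System.nanoTime() - deadlineNanos >= 0) {
          // A real implementation would log this at ERROR level with the stack trace.
          throw new IOException("Failed to connect to RM after "
              + failedAttempts + " attempts", e);
        }
        System.err.println("WARN Retrying connection to RM, current no. of"
            + " failed attempts: " + failedAttempts);
        Thread.sleep(rmRetryIntervalMs);
      }
    }
  }
}
```

With values like the ones suggested above, the call would look something like
registerWithRetries(registrar, 15 * 60 * 1000L, 30 * 1000L), and passing -1 as
the total wait time keeps retrying until the RM comes up.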
> Nodemanager if started before starting Resource manager is getting
> shutdown.But if both RM and NM are started and then after if RM is going
> down,NM is retrying for the RM.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-196
> URL: https://issues.apache.org/jira/browse/YARN-196
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.0.0, 2.0.0-alpha
> Reporter: Ramgopal N
> Assignee: Xuan Gong
> Attachments: MAPREDUCE-3676.patch, YARN-196.1.patch, YARN-196.2.patch
>
>
> If the NM is started before the RM, the NM shuts down with the
> following error:
> {code}
> ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting
> services org.apache.hadoop.yarn.server.nodemanager.NodeManager
> org.apache.avro.AvroRuntimeException:
> java.lang.reflect.UndeclaredThrowableException
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149)
> at
> org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242)
> Caused by: java.lang.reflect.UndeclaredThrowableException
> at
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145)
> ... 3 more
> Caused by: com.google.protobuf.ServiceException: java.net.ConnectException:
> Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on
> connection exception: java.net.ConnectException: Connection refused; For more
> details see: http://wiki.apache.org/hadoop/ConnectionRefused
> at
> org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131)
> at $Proxy23.registerNodeManager(Unknown Source)
> at
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
> ... 5 more
> Caused by: java.net.ConnectException: Call From
> HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection
> exception: java.net.ConnectException: Connection refused; For more details
> see: http://wiki.apache.org/hadoop/ConnectionRefused
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857)
> at org.apache.hadoop.ipc.Client.call(Client.java:1141)
> at org.apache.hadoop.ipc.Client.call(Client.java:1100)
> at
> org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128)
> ... 7 more
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659)
> at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
> at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
> at org.apache.hadoop.ipc.Client.call(Client.java:1117)
> ... 9 more
> 2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher:
> AsyncDispatcher thread interrupted
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
> at java.lang.Thread.run(Thread.java:619)
> 2012-01-16 15:04:13,337 INFO org.apache.hadoop.yarn.service.AbstractService:
> Service:Dispatcher is stopped.
> 2012-01-16 15:04:13,392 INFO org.mortbay.log: Stopped
> [email protected]:9999
> 2012-01-16 15:04:13,493 INFO org.apache.hadoop.yarn.service.AbstractService:
> Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
> 2012-01-16 15:04:13,493 INFO org.apache.hadoop.ipc.Server: Stopping server on
> 24290
> 2012-01-16 15:04:13,494 INFO org.apache.hadoop.ipc.Server: Stopping IPC
> Server listener on 24290
> 2012-01-16 15:04:13,495 INFO org.apache.hadoop.ipc.Server: Stopping IPC
> Server Responder
> 2012-01-16 15:04:13,496 INFO org.apache.hadoop.yarn.service.AbstractService:
> Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
> is stopped.
> 2012-01-16 15:04:13,496 WARN org.apache.hadoop.yarn.event.AsyncDispatcher:
> AsyncDispatcher thread interrupted
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
> at java.lang.Thread.run(Thread.java:619)
> {code}