Hi,
I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
running on Amazon EC2. The master is a High-Memory Extra Large Instance
(m2.xlarge) running NameNode, JobTracker, HMaster and ZooKeeper. The slaves
are Extra Large Instances (m1.xlarge) running DataNode, TaskTracker,
RegionServer and ZooKeeper.
From time to time, every two or three days, one of the RegionServer
processes goes down, but the other processes (DataNode, TaskTracker,
ZooKeeper) continue normally.
Reading the logs:
The connection with ZooKeeper timed out:
---------------------------
11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
---------------------------
And the Handlers start to fail:
---------------------------
11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
---------------------------
And finally the server throws a YouAreDeadException :( :
---------------------------
11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy6.regionServerReport(Unknown Source)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
        ... 2 more
11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
---------------------------
Then I restart the RegionServer and everything goes back to normal.
Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see anything
abnormal in the same time window.
I think it was caused by the loss of the connection to ZooKeeper. Is it
advisable to run ZooKeeper on the same machines as the other processes?
If the RegionServer loses its connection to ZooKeeper, is there a way (a
configuration, perhaps) for it to rejoin the cluster instead of just dying?
Any idea what is causing this, or how to prevent it from happening?
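To make the question more concrete, this is the kind of change I was
considering in hbase-site.xml. It is only a sketch: I'm assuming
zookeeper.session.timeout is the right HBase-side knob here, that the
second property only applies if HBase manages the ZooKeeper servers
(otherwise maxSessionTimeout would go in zoo.cfg directly), and the
120000 value is just a guess above the ~60s gap seen in the logs:
---------------------------
<!-- hbase-site.xml (sketch, illustrative values only) -->

<!-- How long the RegionServer's ZooKeeper session can stay silent before
     it expires; raising it above the observed ~60s gap might avoid the
     session expiry and the resulting abort. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>

<!-- Only relevant when HBase manages the ZooKeeper quorum: the servers cap
     sessions at maxSessionTimeout, so the server-side limit also has to be
     raised or the longer client timeout is negotiated back down. -->
<property>
  <name>hbase.zookeeper.property.maxSessionTimeout</name>
  <value>120000</value>
</property>
---------------------------
Does something along these lines make sense, or is it the wrong direction
entirely?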
Any help is appreciated.
Best Regards,
--
*Leonardo Gamas*
Software Engineer
+557134943514
+557581347440
[email protected]
www.jusbrasil.com.br