I will investigate this, thanks for the response.
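For reference, this is roughly what I plan to try first, following your
suggestions. The GC-logging flags are the standard Sun JVM options; the
hbase-env.sh variable and the log path are my assumptions about our CDH3
layout, not something I have verified yet:

---------------------------
# hbase-env.sh: enable GC logging for the RegionServer JVM
# (HBASE_REGIONSERVER_OPTS and the log path are assumptions for our setup)
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"

# kernel: keep the box from swapping the JVM heap out
sysctl -w vm.swappiness=0                      # takes effect immediately
echo "vm.swappiness=0" >> /etc/sysctl.conf     # persists across reboots
---------------------------

With timestamps in the GC log I should be able to line up any long pause
with the 61-second gap ZooKeeper reported before touching the collector
flags themselves.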
2012/1/3 Sandy Pratt <[email protected]>

> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> closing socket connection and attempting reconnect
>
> It looks like the process has been unresponsive for some time, so ZK has
> terminated the session. Did you experience a long GC pause right before
> this? If you don't have GC logging enabled for the RS, you can sometimes
> tell by noticing a gap in the timestamps of the log statements leading up
> to the crash.
>
> If it turns out to be GC, you might want to look at your kernel swappiness
> setting (set it to 0) and your JVM params.
>
> Sandy
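On my other question below (whether there is a configuration that lets the
RS re-join instead of dying): judging by the YouAreDeadException, the master
is already treating the RS as dead by the time the session expires, so I
suspect the useful knob is how long the session can survive a pause rather
than any re-join setting. A sketch of what I intend to review; the value is
a guess, not a tested recommendation:

---------------------------
<!-- hbase-site.xml: ZooKeeper session timeout requested by the
     RegionServer, in ms. My understanding (please correct me) is that the
     granted timeout is negotiated with ZooKeeper and capped by the ZK
     server's maxSessionTimeout (20 * tickTime unless overridden in
     zoo.cfg), so that side may need raising as well. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
---------------------------

A longer timeout only hides pauses up to that length, of course, so I will
still chase the GC side first.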
> > -----Original Message-----
> > From: Leonardo Gamas [mailto:[email protected]]
> > Sent: Thursday, December 29, 2011 07:44
> > To: [email protected]
> > Subject: RegionServer dying every two or three days
> >
> > Hi,
> >
> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
> > running on Amazon EC2. The master is a High-Memory Extra Large Instance
> > (m2.xlarge) with NameNode, JobTracker, HMaster and ZooKeeper. The
> > slaves are Extra Large Instances (m1.xlarge) running DataNode, TaskTracker,
> > RegionServer and ZooKeeper.
> >
> > From time to time, every two or three days, one of the RegionServer
> > processes goes down, but the other processes (DataNode, TaskTracker,
> > ZooKeeper) continue normally.
> >
> > Reading the logs:
> >
> > The connection with ZooKeeper timed out:
> >
> > ---------------------------
> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing
> > socket connection and attempting reconnect
> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
> > not heard from server in 61205ms for sessionid 0x346c561a55953e, closing
> > socket connection and attempting reconnect
> > ---------------------------
> >
> > And the Handlers start to fail:
> >
> > ---------------------------
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> > xx.xx.xx.xx:xxxx: output error
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
> > caught: java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> > xx.xx.xx.xx:xxxx: output error
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
> > caught: java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > ---------------------------
> >
> > And finally the server throws a YouAreDeadException :( :
> >
> > ---------------------------
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
> > connection
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
> > connection
> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > org.apache.hadoop.hbase.YouAreDeadException:
> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > dead server
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> >         at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > dead server
> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >         at $Proxy6.regionServerReport(Unknown Source)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> >         ... 2 more
> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> > requests=66, regions=206, stores=2078, storefiles=970,
> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> > blockCacheSize=705907552, blockCacheFree=150412064,
> > blockCacheCount=10648, blockCacheHitCount=79578618,
> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> > blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > ---------------------------
> >
> > Then I restart the RegionServer and everything is back to normal.
> > Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see any
> > abnormality in the same time window.
> >
> > I think it was caused by the loss of the connection to ZooKeeper. Is it
> > advisable to run ZooKeeper on the same machines?
> >
> > If the RegionServer loses its connection to ZooKeeper, is there a way (a
> > configuration, perhaps) for it to re-join the cluster instead of dying?
> >
> > Any idea what is causing this, or how to prevent it from happening?
> >
> > Any help is appreciated.
> >
> > Best Regards,
> >
> > --
> > *Leonardo Gamas*
> > Software Engineer
> > +557134943514
> > +557581347440
> > [email protected]
> > www.jusbrasil.com.br

--
*Leonardo Gamas*
Software Engineer/Chaos Monkey Engineer
T (71) 3494-3514
C (75) 8134-7440
[email protected]
www.jusbrasil.com.br
