Sandy,

It happened again:
-----
12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 46571ms for sessionid 0x334a8c09b080033, closing socket connection and attempting reconnect
12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 46513ms for sessionid 0x334a8c09b080034, closing socket connection and attempting reconnect
-----

Comparing with the lines in the GC log, I found these lines around the time the client session expired:

-----
14248.678: [GC 14248.679: [ParNew: 105291K->457K(118016K), 0.0221210 secs] 954217K->849457K(1705776K), 0.0222870 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
14249.959: [GC 14249.960: [ParNew: 105417K->392K(118016K), 0.0087140 secs] 954417K->849428K(1705776K), 0.0089260 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01, real=36.96 secs]
14296.604: [GC 14296.604: [ParNew: 105369K->523K(118016K), 0.0119440 secs] 954434K->849708K(1705776K), 0.0121650 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
-----

The third line took 36.96 seconds of real time; can this be what is causing the problem?

I also notice some major compactions happening, even though I disabled them with:

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>

Reading the code a little, it seems that even when major compactions are disabled, a compaction that ends up selecting all of a store's files is still treated as a major compaction. Is that right?

The machines don't have swap, so the swappiness parameter doesn't seem to apply here.

Any other suggestions? Thanks.

2012/1/4 Leonardo Gamas <[email protected]>

> I will investigate this, thanks for the response.
>
>
> 2012/1/3 Sandy Pratt <[email protected]>
>
>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
>>
>> It looks like the process has been unresponsive for some time, so ZK has terminated the session. Did you experience a long GC pause right before this? If you don't have GC logging enabled for the RS, you can sometimes tell by noticing a gap in the timestamps of the log statements leading up to the crash.
>>
>> If it turns out to be GC, you might want to look at your kernel swappiness setting (set it to 0) and your JVM params.
>>
>> Sandy
>>
>>
>> > -----Original Message-----
>> > From: Leonardo Gamas [mailto:[email protected]]
>> > Sent: Thursday, December 29, 2011 07:44
>> > To: [email protected]
>> > Subject: RegionServer dying every two or three days
>> >
>> > Hi,
>> >
>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves), running on Amazon EC2. The master is a High-Memory Extra Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running DataNode, TaskTracker, RegionServer and Zookeeper.
>> >
>> > From time to time, every two or three days, one of the RegionServer processes goes down, but the other processes (DataNode, TaskTracker, Zookeeper) continue normally.
>> >
>> > Reading the logs:
>> >
>> > The connection with Zookeeper timed out:
>> >
>> > ---------------------------
>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
>> > ---------------------------
>> >
>> > And the Handlers start to fail:
>> >
>> > ---------------------------
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>> >
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>> > ---------------------------
>> >
>> > And finally the server throws a YouAreDeadException :( :
>> >
>> > ---------------------------
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server
>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
>> > Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
>> >         at java.lang.Thread.run(Thread.java:662)
>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>> >
>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>> >         at $Proxy6.regionServerReport(Unknown Source)
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>> >         ... 2 more
>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>> > ---------------------------
>> >
>> > Then I restart the RegionServer and everything is back to normal. Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any abnormality in the same time window.
>> > I think it was caused by the loss of connection to Zookeeper. Is it advisable to run Zookeeper on the same machines?
>> > If the RegionServer loses its connection to Zookeeper, is there a way (a configuration, perhaps) for it to re-join the cluster instead of just dying?
>> >
>> > Any idea what is causing this, or how to prevent it from happening?
>> >
>> > Any help is appreciated.
>> >
>> > Best Regards,
>> >
>> > --
>> >
>> > *Leonardo Gamas*
>> > Software Engineer
>> > +557134943514
>> > +557581347440
>> > [email protected]
>> > www.jusbrasil.com.br
>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> [email protected]
> www.jusbrasil.com.br
>
>

--

*Leonardo Gamas*
Software Engineer/Chaos Monkey Engineer
T (71) 3494-3514
C (75) 8134-7440
[email protected]
www.jusbrasil.com.br
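Regarding Sandy's point above about GC logging and JVM params: a minimal hbase-env.sh sketch, assuming the Sun/Oracle JDK 6 HotSpot JVM typical of CDH3-era clusters. The log path and the CMS occupancy value are illustrative assumptions, not settings taken from this cluster.

-----
# hbase-env.sh -- illustrative sketch only; verify each flag against your JVM version.

# Timestamped GC log for the HBase daemons (path is an arbitrary example):
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"

# Concurrent collector settings commonly recommended for region servers of this
# era; the 70% initiating occupancy is an example value, not a tuned one.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
-----

With timestamps in the GC log it is straightforward to line a long pause up against the ZooKeeper session warnings, as was done with the 36.96-second collection above.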
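On the question in the quoted message about whether a RegionServer can be configured to survive a lost ZooKeeper connection rather than die: once the session actually expires the abort is intentional, so the usual knob is the session timeout itself. A sketch for hbase-site.xml, assuming the zookeeper.session.timeout property behaves as in the 0.90.x/CDH3 line; the 120000 ms value is only an example, and the effective timeout is also capped by what the ZooKeeper ensemble allows (maxSessionTimeout, derived from tickTime).

-----
<!-- hbase-site.xml: example value only; the ZK servers must permit it. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
-----

Raising the timeout only buys headroom for pauses like the one in the GC log; it also delays how quickly the master notices a genuinely dead server.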
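On the major-compaction question: a paraphrased sketch, with hypothetical class and method names rather than the actual HBase source, of the behaviour described in the message above, i.e. a compaction whose file selection happens to cover every store file is treated as major even when hbase.hregion.majorcompaction is set to 0.

-----
import java.util.List;

// Hypothetical illustration only -- not code from org.apache.hadoop.hbase.
final class CompactionSelectionSketch {
    // Setting hbase.hregion.majorcompaction to 0 disables the time-based
    // trigger, but a selection that includes every store file is still
    // promoted to a major compaction (so deletes and expired cells get purged).
    static boolean isMajor(List<String> selectedFiles, List<String> allStoreFiles, boolean forced) {
        return forced || selectedFiles.size() == allStoreFiles.size();
    }
}
-----

If that reading is correct, the occasional major compactions seen despite the 0 setting would be expected rather than a misconfiguration.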
