St.Ack,

I haven't made any attempt at GC tuning yet; I will read the perf section as suggested. I'm currently using Nagios + JMX to monitor the cluster, but it's used for alerting only. The perfdata isn't being stored, so it's of limited use right now; I was thinking of using TSDB to store it. Are there any known cases of integration?
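Just to make the TSDB idea concrete, something like the sketch below is what I had in mind: a small poller that reads one RegionServer metric over JMX and writes it to the TSD using the telnet-style put command. It's only a rough, untested sketch; the hostnames, the JMX port and the MBean/attribute names are placeholders that would have to match whatever our jmxremote setup actually exposes.

---------------------------
// Rough sketch: poll one JMX attribute from a RegionServer and push it to
// OpenTSDB over the plain-text "put" protocol on port 4242.
// Hostnames, the JMX port (10102) and the MBean/attribute names are
// placeholders; check the real names with jconsole first.
import java.io.PrintWriter;
import java.net.Socket;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxToTsdb {
    public static void main(String[] args) throws Exception {
        String rsHost = "ip-xx-xx-xx-xx.ec2.internal";   // RegionServer to poll
        String tsdHost = "tsdb.internal";                // OpenTSDB TSD host

        // Standard JMX-over-RMI URL; assumes jmxremote is enabled on port 10102.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + rsHost + ":10102/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();

        // MBean and attribute names vary by HBase version; these are guesses.
        ObjectName rsStats =
            new ObjectName("hadoop:service=RegionServer,name=RegionServerStatistics");
        Number requests = (Number) mbs.getAttribute(rsStats, "requests");

        // OpenTSDB accepts lines of the form:
        //   put <metric> <epoch-seconds> <value> <tag>=<value> ...
        Socket tsd = new Socket(tsdHost, 4242);
        PrintWriter out = new PrintWriter(tsd.getOutputStream(), true);
        long now = System.currentTimeMillis() / 1000L;
        out.printf("put hbase.regionserver.requests %d %s host=%s%n",
                   now, requests, rsHost);

        out.close();
        tsd.close();
        jmxc.close();
    }
}
---------------------------

If there's an existing collector that already does this for HBase, I'd rather reuse it, hence the question.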
---

Sandy,

Yes, my timeout is 30 seconds:

<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value>
</property>

For our application that is a tolerable time to wait if a RegionServer goes offline. My heap is 4GB and my JVM params are:

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
-XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

I will try the -XX:+UseParallelOldGC param and post my feedback here.

---

Ramkrishna,

It seems GC is the root of all evil in this case.

----

Thank you all for the answers. I will try out the valuable advice given here and post my results.

Leo Gamas.

2012/1/5 Ramkrishna S Vasudevan <[email protected]> > Recently we faced a similar problem and it was due to GC config. Pls check > your GC. > > Regards > Ram > > -----Original Message----- > From: [email protected] [mailto:[email protected]] On Behalf Of Stack > Sent: Thursday, January 05, 2012 2:50 AM > To: [email protected] > Subject: Re: RegionServer dying every two or three days > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas > <[email protected]> wrote: > > The third line took 36.96 seconds to execute, can this be causing this > > problem? > > > > Probably. Have you made any attempt at GC tuning? > > > > Reading the code a little it seems that, even if it's disabled, if all > > files are target in a compaction, it's considered a major compaction. Is > it > > right? > > > > That is right. They get 'upgraded' from minor to major. > > This should be fine though. What you are avoiding setting major > compactions to 0 is all regions being major compacted on a period, a > heavy weight effective rewrite of all your data (unless already major > compacted). It looks like you have this disabled which is good until > you've wrestled your cluster into submission. > > > > The machines don't have swap, so the swappiness parameter don't seem to > > apply here. Any other suggestion? > > > > See the perf section of the hbase manual. It has our current list. > > Are you monitoring your cluster w/ ganglia or tsdb? > > > St.Ack > > > Thanks. > > > > 2012/1/4 Leonardo Gamas <[email protected]> > > > >> I will investigate this, thanks for the response. > >> > >> > >> 2012/1/3 Sandy Pratt <[email protected]> > >> > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, > >>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, > >>> closing socket connection and attempting reconnect > >>> > >>> It looks like the process has been unresponsive for some time, so ZK > has > >>> terminated the session. Did you experience a long GC pause right > before > >>> this? If you don't have GC logging enabled for the RS, you can > sometimes > >>> tell by noticing a gap in the timestamps of the log statements leading > up > >>> to the crash. > >>> > >>> If it turns out to be GC, you might want to look at your kernel > >>> swappiness setting (set it to 0) and your JVM params. > >>> > >>> Sandy > >>> > >>> > -----Original Message----- > >>> > From: Leonardo Gamas [mailto:[email protected]] > >>> > Sent: Thursday, December 29, 2011 07:44 > >>> > To: [email protected] > >>> > Subject: RegionServer dying every two or three days > >>> > > >>> > Hi, > >>> > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 > >>> Slaves), > >>> > running on Amazon EC2.
The master is a High-Memory Extra Large > Instance > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The > >>> > slaves are Extra Large Instances (m1.xlarge) running Datanode, > >>> TaskTracker, > >>> > RegionServer and Zookeeper. > >>> > > >>> > From time to time, every two or three days, one of the RegionServers > >>> > processes goes down, but the other processes (DataNode, TaskTracker, > >>> > Zookeeper) continue normally. > >>> > > >>> > Reading the logs: > >>> > > >>> > The connection with Zookeeper timed out: > >>> > > >>> > --------------------------- > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed > out, > >>> have > >>> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, > >>> closing > >>> > socket connection and attempting reconnect > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed > out, > >>> have > >>> > not heard from server in 61205ms for sessionid 0x346c561a55953e, > closing > >>> > socket connection and attempting reconnect > >>> > --------------------------- > >>> > > >>> > And the Handlers start to fail: > >>> > > >>> > --------------------------- > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from > >>> > xx.xx.xx.xx:xxxx: output error > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on > 60020 > >>> > caught: java.nio.channels.ClosedChannelException > >>> > at > >>> > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13 > >>> > 3) > >>> > at > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java: > >>> > 1341) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB > >>> > aseServer.java:727) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe > >>> > rver.java:792) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1 > >>> > 083) > >>> > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from > >>> > xx.xx.xx.xx:xxxx: output error > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on > 60020 > >>> > caught: java.nio.channels.ClosedChannelException > >>> > at > >>> > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13 > >>> > 3) > >>> > at > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java: > >>> > 1341) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB > >>> > aseServer.java:727) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe > >>> > rver.java:792) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1 > >>> > 083) > >>> > --------------------------- > >>> > > >>> > And finally the server throws a YouAreDeadException :( : > >>> > > >>> > --------------------------- > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket > connection > >>> to > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181 > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, > initiating > >>> session > >>> > 11/12/29 
00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to > >>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing > socket > >>> > connection > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket > connection > >>> to > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181 > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, > initiating > >>> session > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to > >>> > ZooKeeper service, session 0x346c561a55953e has expired, closing > socket > >>> > connection > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region > >>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, > >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): > >>> > Unhandled > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT > >>> > rejected; currently processing > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server > >>> > org.apache.hadoop.hbase.YouAreDeadException: > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 > as > >>> > dead server > >>> > at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > >>> > Method) > >>> > at > >>> > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructor > >>> > AccessorImpl.java:39) > >>> > at > >>> > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCon > >>> > structorAccessorImpl.java:27) > >>> > at > >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513) > >>> > at > >>> > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce > >>> > ption.java:95) > >>> > at > >>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote > >>> > Exception.java:79) > >>> > at > >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep > >>> > ort(HRegionServer.java:735) > >>> > at > >>> > > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j > >>> > ava:596) > >>> > at java.lang.Thread.run(Thread.java:662) > >>> > Caused by: org.apache.hadoop.ipc.RemoteException: > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 > as > >>> > dead server > >>> > at > >>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana > >>> > ger.java:204) > >>> > at > >>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serv > >>> > erManager.java:262) > >>> > at > >>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav > >>> > a:669) > >>> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown > Source) > >>> > at > >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces > >>> > sorImpl.java:25) > >>> > at java.lang.reflect.Method.invoke(Method.java:597) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1 > >>> > 039) > >>> > > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257 > >>> > ) > >>> > at $Proxy6.regionServerReport(Unknown Source) > >>> > at > >>> > 
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep > >>> > ort(HRegionServer.java:729) > >>> > ... 2 more > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: > >>> > requests=66, regions=206, stores=2078, storefiles=970, > >>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, > >>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083, > >>> > blockCacheSize=705907552, blockCacheFree=150412064, > >>> > blockCacheCount=10648, blockCacheHitCount=79578618, > >>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, > >>> > blockCacheHitRatio=96, > >>> > blockCacheHitCachingRatio=98 > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT > >>> > rejected; currently processing > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020 > >>> > --------------------------- > >>> > > >>> > Then i restart the RegionServer and everything is back to normal. > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't see any > >>> > abnormality in the same time window. > >>> > I think it was caused by the lost of connection to zookeeper. Is it > >>> advisable to > >>> > run zookeeper in the same machines? > >>> > if the RegionServer lost it's connection to Zookeeper, there's a way > (a > >>> > configuration perhaps) to re-join the cluster, and not only die? > >>> > > >>> > Any idea what is causing this?? Or to prevent it from happening? > >>> > > >>> > Any help is appreciated. > >>> > > >>> > Best Regards, > >>> > > >>> > -- > >>> > > >>> > *Leonardo Gamas* > >>> > Software Engineer > >>> > +557134943514 > >>> > +557581347440 > >>> > [email protected] > >>> > www.jusbrasil.com.br > >>> > >> > >> > >> > >> -- > >> > >> *Leonardo Gamas* > >> Software Engineer/Chaos Monkey Engineer > >> T (71) 3494-3514 > >> C (75) 8134-7440 > >> [email protected] > >> www.jusbrasil.com.br > >> > >> > > > > > > -- > > > > *Leonardo Gamas* > > Software Engineer/Chaos Monkey Engineer > > T (71) 3494-3514 > > C (75) 8134-7440 > > [email protected] > > www.jusbrasil.com.br > > -- *Leonardo Gamas* Software Engineer +557134943514 +557581347440 [email protected] www.jusbrasil.com.br
