I checked the CPU utilization graphs provided by Amazon (they're not very accurate, since the sampling interval is about 5 minutes) and don't see any abnormality. I will set up TSDB alongside Nagios to have a more reliable source of performance data.
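In the meantime, here is roughly what I plan to check on each slave before the next crash. It's only a sketch: the vmstat interval and the 10-second awk threshold are arbitrary choices of mine, and the GC log path is the one from my JVM flags quoted below.

$ swapon -s                # empty output = no swap devices configured
$ free -m | grep -i swap   # should report 0 total swap
$ vmstat 5                 # watch the "st" (steal) column; on EC2, heavy CPU
                           # steal could explain wall time far exceeding CPU
                           # time during a GC pause
$ grep 'real=' /usr/lib/hbase/logs/hbase-regionserver-gc.log \
    | awk -F'real=' '$2+0 > 10'   # list GC events whose wall time exceeds 10s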
The machines don't have swap space: if I run swapon -s to display the swap usage summary, it returns an empty list. I will drop -XX:+DoEscapeAnalysis and -XX:+AggressiveOpts in my next tests (a sketch of the flag set I plan to try is at the bottom of this mail). I hadn't paid much attention to the value of the new-size param. Thanks again for the help!!

2012/1/5 Sandy Pratt <[email protected]>

> That size heap doesn't seem like it should cause a 36 second GC (a minor
> GC even, if I remember your logs correctly), so I tend to think that other
> things are probably going on.
>
> This line here:
>
> 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
> 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
> real=36.96 secs]
>
> is really mysterious to me. It seems to indicate that the process was
> blocked for almost 37 seconds during a minor collection. Note the CPU
> times are very low but the wall time is very high. If it was actually
> doing GC work, I'd expect to see user time higher than real time, as it is
> in other parallel collections (see your log snippet). Were you really so
> CPU starved that it took 37 seconds to get in 50ms of work? I can't make
> sense of that. I'm trying to think of something that would block you for
> that long while all your threads are stopped for GC, other than being in
> swap, but I can't come up with anything. You're certain you're not in swap?
>
> Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while you
> troubleshoot?
>
> Why is your new size so small? This generally means that relatively more
> objects are being tenured than would be with a larger new size. This could
> make collections of the old gen worse (GC time is said to be proportional
> to the number of live objects in the generation, and CMS does indeed cause
> STW pauses). A typical new to tenured ratio might be 1:3. Were the new
> gen GCs taking too long? This is probably orthogonal to your immediate
> issue, though.
>
> -----Original Message-----
> From: Leonardo Gamas [mailto:[email protected]]
> Sent: Thursday, January 05, 2012 5:33 AM
> To: [email protected]
> Subject: Re: RegionServer dying every two or three days
>
> St.Ack,
>
> I haven't made any attempt at GC tuning yet.
> I will read the perf section as suggested.
> I'm currently using Nagios + JMX to monitor the cluster, but it's used for
> alerting only; the perfdata is not being stored, so it's kind of useless
> right now. I was thinking of using TSDB to store it; any known cases of
> integration?
> ---
>
> Sandy,
>
> Yes, my timeout is 30 seconds:
>
> <property>
>   <name>zookeeper.session.timeout</name>
>   <value>30000</value>
> </property>
>
> For our application that's a tolerable time to wait in case a RegionServer
> goes offline.
>
> My heap is 4GB and my JVM params are:
>
> -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
>
> I will try the -XX:+UseParallelOldGC param and post my feedback here.
> ---
>
> Ramkrishna,
>
> It seems GC is the root of all evil in this case.
> ----
>
> Thank you all for the answers. I will try out the valuable advice given
> here and post my results.
>
> Leo Gamas.
>
> 2012/1/5 Ramkrishna S Vasudevan <[email protected]>
>
> > Recently we faced a similar problem and it was due to GC config. Please
> > check your GC.
> >
> > Regards
> > Ram
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> > Sent: Thursday, January 05, 2012 2:50 AM
> > To: [email protected]
> > Subject: Re: RegionServer dying every two or three days
> >
> > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas <[email protected]> wrote:
> > > The third line took 36.96 seconds to execute; can this be causing
> > > this problem?
> >
> > Probably. Have you made any attempt at GC tuning?
> >
> > > Reading the code a little, it seems that, even if it's disabled, if
> > > all files are targeted in a compaction, it's considered a major
> > > compaction. Is it right?
> >
> > That is right. They get 'upgraded' from minor to major.
> >
> > This should be fine though. What you are avoiding by setting major
> > compactions to 0 is all regions being major compacted on a period, a
> > heavyweight effective rewrite of all your data (unless already major
> > compacted). It looks like you have this disabled, which is good until
> > you've wrestled your cluster into submission.
> >
> > > The machines don't have swap, so the swappiness parameter doesn't seem
> > > to apply here. Any other suggestion?
> >
> > See the perf section of the hbase manual. It has our current list.
> >
> > Are you monitoring your cluster w/ ganglia or tsdb?
> >
> > St.Ack
> >
> > > Thanks.
> > >
> > > 2012/1/4 Leonardo Gamas <[email protected]>
> > >
> > >> I will investigate this, thanks for the response.
> > >>
> > >> 2012/1/3 Sandy Pratt <[email protected]>
> > >>
> > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> > >>> out, have not heard from server in 61103ms for sessionid
> > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> > >>> reconnect
> > >>>
> > >>> It looks like the process has been unresponsive for some time, so
> > >>> ZK has terminated the session. Did you experience a long GC pause
> > >>> right before this? If you don't have GC logging enabled for the RS,
> > >>> you can sometimes tell by noticing a gap in the timestamps of the
> > >>> log statements leading up to the crash.
> > >>>
> > >>> If it turns out to be GC, you might want to look at your kernel
> > >>> swappiness setting (set it to 0) and your JVM params.
> > >>>
> > >>> Sandy
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Leonardo Gamas [mailto:[email protected]]
> > >>> > Sent: Thursday, December 29, 2011 07:44
> > >>> > To: [email protected]
> > >>> > Subject: RegionServer dying every two or three days
> > >>> >
> > >>> > Hi,
> > >>> >
> > >>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> > >>> > Slaves), running on Amazon EC2. The master is a High-Memory Extra
> > >>> > Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and
> > >>> > Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running
> > >>> > DataNode, TaskTracker, RegionServer and Zookeeper.
> > >>> >
> > >>> > From time to time, every two or three days, one of the
> > >>> > RegionServer processes goes down, but the other processes
> > >>> > (DataNode, TaskTracker, Zookeeper) continue normally.
> > >>> >
> > >>> > Reading the logs:
> > >>> >
> > >>> > The connection with Zookeeper timed out:
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> > >>> > have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> > >>> > closing socket connection and attempting reconnect
> > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> > >>> > have not heard from server in 61205ms for sessionid 0x346c561a55953e,
> > >>> > closing socket connection and attempting reconnect
> > >>> > ---------------------------
> > >>> >
> > >>> > And the Handlers start to fail:
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> > >>> > xx.xx.xx.xx:xxxx: output error
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
> > >>> > caught: java.nio.channels.ClosedChannelException
> > >>> >     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > >>> >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > >>> >
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> > >>> > xx.xx.xx.xx:xxxx: output error
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
> > >>> > caught: java.nio.channels.ClosedChannelException
> > >>> >     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > >>> >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > >>> > ---------------------------
> > >>> >
> > >>> > And finally the server throws a YouAreDeadException :( :
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established
> > >>> > to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > >>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
> > >>> > connection
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established
> > >>> > to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > >>> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
> > >>> > connection
> > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server
> > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> > >>> > Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
> > >>> > Server REPORT rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > >>> > as dead server
> > >>> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > >>> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > >>> >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > >>> >     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > >>> >     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > >>> >     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > >>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > >>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > >>> >     at java.lang.Thread.run(Thread.java:662)
> > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > >>> > as dead server
> > >>> >     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> > >>> >     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> > >>> >     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> > >>> >     at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > >>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>> >     at java.lang.reflect.Method.invoke(Method.java:597)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> > >>> >
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > >>> >     at $Proxy6.regionServerReport(Unknown Source)
> > >>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > >>> >     ... 2 more
> > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> > >>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> > >>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> > >>> > blockCacheSize=705907552, blockCacheFree=150412064,
> > >>> > blockCacheCount=10648, blockCacheHitCount=79578618,
> > >>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> > >>> > blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > >>> > rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > >>> > ---------------------------
> > >>> >
> > >>> > Then I restart the RegionServer and everything goes back to normal.
> > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
> > >>> > abnormality in the same time window.
> > >>> > I think it was caused by the loss of the connection to Zookeeper. Is it
> > >>> > advisable to run Zookeeper on the same machines?
> > >>> > If the RegionServer loses its connection to Zookeeper, is there a way
> > >>> > (a configuration perhaps) for it to re-join the cluster instead of just dying?
> > >>> >
> > >>> > Any idea what is causing this? Or how to prevent it from happening?
> > >>> >
> > >>> > Any help is appreciated.
> > >>> >
> > >>> > Best Regards,
> > >>> >
> > >>> > --
> > >>> > *Leonardo Gamas*
> > >>> > Software Engineer
> > >>> > +557134943514
> > >>> > +557581347440
> > >>> > [email protected]
> > >>> > www.jusbrasil.com.br
> > >>
> > >> --
> > >> *Leonardo Gamas*
> > >> Software Engineer/Chaos Monkey Engineer
> > >> T (71) 3494-3514  C (75) 8134-7440
> > >> [email protected]
> > >> www.jusbrasil.com.br
> > >
> > > --
> > > *Leonardo Gamas*
> > > Software Engineer/Chaos Monkey Engineer
> > > T (71) 3494-3514  C (75) 8134-7440
> > > [email protected]
> > > www.jusbrasil.com.br
>
> --
> *Leonardo Gamas*
> Software Engineer
> +557134943514
> +557581347440
> [email protected]
> www.jusbrasil.com.br

--
*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514  C +55 (75) 8134-7440
[email protected]
www.jusbrasil.com.br
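P.S. For reference, this is the flag set I intend to try next. It is only a
sketch, not something anyone recommended verbatim in this thread: my current
flags minus -XX:+DoEscapeAnalysis and -XX:+AggressiveOpts, with the new gen
raised from 128m to 512m (a conservative guess of mine, moving toward the
larger new-to-tenured ratio Sandy suggested):

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=512m -XX:MaxNewSize=512m
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log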
