I checked the CPU utilization graphs provided by Amazon (they're not very accurate, since the sampling interval is about 5 minutes) and don't see any abnormality. I will set up TSDB alongside Nagios to have a more reliable source of performance data.
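In the meantime, here is roughly what I plan to check on each slave before the next crash. It's only a sketch: the vmstat interval and the 10-second awk threshold are arbitrary choices of mine, and the GC log path is the one from my JVM flags quoted below.

$ swapon -s                # empty output = no swap devices configured
$ free -m | grep -i swap   # should report 0 total swap
$ vmstat 5                 # watch the "st" (steal) column; on EC2, heavy CPU
                           # steal could explain wall time far exceeding CPU
                           # time during a GC pause
$ grep 'real=' /usr/lib/hbase/logs/hbase-regionserver-gc.log \
    | awk -F'real=' '$2+0 > 10'   # list GC events whose wall time exceeds 10s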
The machines don't have swap space: if I run swapon -s to display the swap usage summary, it returns an empty list. I will drop -XX:+DoEscapeAnalysis and -XX:+AggressiveOpts in my next tests (a sketch of the flag set I plan to try is at the bottom of this mail). I hadn't paid much attention to the value of the new-size param. Thanks again for the help!!

2012/1/5 Sandy Pratt <[email protected]>

> That size heap doesn't seem like it should cause a 36 second GC (a minor
> GC even, if I remember your logs correctly), so I tend to think that other
> things are probably going on.
>
> This line here:
>
> 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
> 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
> real=36.96 secs]
>
> is really mysterious to me. It seems to indicate that the process was
> blocked for almost 37 seconds during a minor collection. Note the CPU
> times are very low but the wall time is very high. If it was actually
> doing GC work, I'd expect to see user time higher than real time, as it is
> in other parallel collections (see your log snippet). Were you really so
> CPU starved that it took 37 seconds to get in 50ms of work? I can't make
> sense of that. I'm trying to think of something that would block you for
> that long while all your threads are stopped for GC, other than being in
> swap, but I can't come up with anything. You're certain you're not in swap?
>
> Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while you
> troubleshoot?
>
> Why is your new size so small? This generally means that relatively more
> objects are being tenured than would be with a larger new size. This could
> make collections of the old gen worse (GC time is said to be proportional
> to the number of live objects in the generation, and CMS does indeed cause
> STW pauses). A typical new to tenured ratio might be 1:3. Were the new
> gen GCs taking too long? This is probably orthogonal to your immediate
> issue, though.
>
> -----Original Message-----
> From: Leonardo Gamas [mailto:[email protected]]
> Sent: Thursday, January 05, 2012 5:33 AM
> To: [email protected]
> Subject: Re: RegionServer dying every two or three days
>
> St.Ack,
>
> I haven't made any attempt at GC tuning yet.
> I will read the perf section as suggested.
> I'm currently using Nagios + JMX to monitor the cluster, but it's used for
> alerting only; the perfdata is not being stored, so it's kind of useless
> right now. I was thinking of using TSDB to store it; any known cases of
> integration?
> ---
>
> Sandy,
>
> Yes, my timeout is 30 seconds:
>
> <property>
>   <name>zookeeper.session.timeout</name>
>   <value>30000</value>
> </property>
>
> For our application that's a tolerable time to wait in case a RegionServer
> goes offline.
>
> My heap is 4GB and my JVM params are:
>
> -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
>
> I will try the -XX:+UseParallelOldGC param and post my feedback here.
> ---
>
> Ramkrishna,
>
> It seems GC is the root of all evil in this case.
> ----
>
> Thank you all for the answers. I will try out the valuable advice given
> here and post my results.
>
> Leo Gamas.
>
> 2012/1/5 Ramkrishna S Vasudevan <[email protected]>
>
> > Recently we faced a similar problem and it was due to GC config. Please
> > check your GC.
> >
> > Regards
> > Ram
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> > Sent: Thursday, January 05, 2012 2:50 AM
> > To: [email protected]
> > Subject: Re: RegionServer dying every two or three days
> >
> > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas <[email protected]> wrote:
> > > The third line took 36.96 seconds to execute; can this be causing
> > > this problem?
> >
> > Probably. Have you made any attempt at GC tuning?
> >
> > > Reading the code a little, it seems that, even if it's disabled, if
> > > all files are targeted in a compaction, it's considered a major
> > > compaction. Is it right?
> >
> > That is right. They get 'upgraded' from minor to major.
> >
> > This should be fine though. What you are avoiding by setting major
> > compactions to 0 is all regions being major compacted on a period, a
> > heavyweight effective rewrite of all your data (unless already major
> > compacted). It looks like you have this disabled, which is good until
> > you've wrestled your cluster into submission.
> >
> > > The machines don't have swap, so the swappiness parameter doesn't seem
> > > to apply here. Any other suggestion?
> >
> > See the perf section of the hbase manual. It has our current list.
> >
> > Are you monitoring your cluster w/ ganglia or tsdb?
> >
> > St.Ack
> >
> > > Thanks.
> > >
> > > 2012/1/4 Leonardo Gamas <[email protected]>
> > >
> > >> I will investigate this, thanks for the response.
> > >>
> > >> 2012/1/3 Sandy Pratt <[email protected]>
> > >>
> > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> > >>> out, have not heard from server in 61103ms for sessionid
> > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> > >>> reconnect
> > >>>
> > >>> It looks like the process has been unresponsive for some time, so
> > >>> ZK has terminated the session. Did you experience a long GC pause
> > >>> right before this? If you don't have GC logging enabled for the RS,
> > >>> you can sometimes tell by noticing a gap in the timestamps of the
> > >>> log statements leading up to the crash.
> > >>>
> > >>> If it turns out to be GC, you might want to look at your kernel
> > >>> swappiness setting (set it to 0) and your JVM params.
> > >>>
> > >>> Sandy
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Leonardo Gamas [mailto:[email protected]]
> > >>> > Sent: Thursday, December 29, 2011 07:44
> > >>> > To: [email protected]
> > >>> > Subject: RegionServer dying every two or three days
> > >>> >
> > >>> > Hi,
> > >>> >
> > >>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> > >>> > Slaves), running on Amazon EC2. The master is a High-Memory Extra
> > >>> > Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and
> > >>> > Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running
> > >>> > DataNode, TaskTracker, RegionServer and Zookeeper.
> > >>> >
> > >>> > From time to time, every two or three days, one of the
> > >>> > RegionServer processes goes down, but the other processes
> > >>> > (DataNode, TaskTracker, Zookeeper) continue normally.
> > >>> >
> > >>> > Reading the logs:
> > >>> >
> > >>> > The connection with Zookeeper timed out:
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> > >>> > have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> > >>> > closing socket connection and attempting reconnect
> > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> > >>> > have not heard from server in 61205ms for sessionid 0x346c561a55953e,
> > >>> > closing socket connection and attempting reconnect
> > >>> > ---------------------------
> > >>> >
> > >>> > And the Handlers start to fail:
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> > >>> > xx.xx.xx.xx:xxxx: output error
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
> > >>> > caught: java.nio.channels.ClosedChannelException
> > >>> >     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > >>> >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > >>> >
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> > >>> > xx.xx.xx.xx:xxxx: output error
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
> > >>> > caught: java.nio.channels.ClosedChannelException
> > >>> >     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > >>> >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > >>> > ---------------------------
> > >>> >
> > >>> > And finally the server throws a YouAreDeadException :( :
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established
> > >>> > to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > >>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
> > >>> > connection
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established
> > >>> > to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > >>> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
> > >>> > connection
> > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server
> > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> > >>> > Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
> > >>> > Server REPORT rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > >>> > as dead server
> > >>> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > >>> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > >>> >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > >>> >     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > >>> >     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > >>> >     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > >>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > >>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > >>> >     at java.lang.Thread.run(Thread.java:662)
> > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > >>> > as dead server
> > >>> >     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> > >>> >     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> > >>> >     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> > >>> >     at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > >>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>> >     at java.lang.reflect.Method.invoke(Method.java:597)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> > >>> >
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > >>> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > >>> >     at $Proxy6.regionServerReport(Unknown Source)
> > >>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > >>> >     ... 2 more
> > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> > >>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> > >>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> > >>> > blockCacheSize=705907552, blockCacheFree=150412064,
> > >>> > blockCacheCount=10648, blockCacheHitCount=79578618,
> > >>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> > >>> > blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > >>> > rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > >>> > ---------------------------
> > >>> >
> > >>> > Then I restart the RegionServer and everything goes back to normal.
> > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
> > >>> > abnormality in the same time window.
> > >>> > I think it was caused by the loss of the connection to Zookeeper. Is it
> > >>> > advisable to run Zookeeper on the same machines?
> > >>> > If the RegionServer loses its connection to Zookeeper, is there a way
> > >>> > (a configuration perhaps) for it to re-join the cluster instead of just dying?
> > >>> >
> > >>> > Any idea what is causing this? Or how to prevent it from happening?
> > >>> >
> > >>> > Any help is appreciated.
> > >>> >
> > >>> > Best Regards,
> > >>> >
> > >>> > --
> > >>> > *Leonardo Gamas*
> > >>> > Software Engineer
> > >>> > +557134943514
> > >>> > +557581347440
> > >>> > [email protected]
> > >>> > www.jusbrasil.com.br
> > >>
> > >> --
> > >> *Leonardo Gamas*
> > >> Software Engineer/Chaos Monkey Engineer
> > >> T (71) 3494-3514  C (75) 8134-7440
> > >> [email protected]
> > >> www.jusbrasil.com.br
> > >
> > > --
> > > *Leonardo Gamas*
> > > Software Engineer/Chaos Monkey Engineer
> > > T (71) 3494-3514  C (75) 8134-7440
> > > [email protected]
> > > www.jusbrasil.com.br
>
> --
> *Leonardo Gamas*
> Software Engineer
> +557134943514
> +557581347440
> [email protected]
> www.jusbrasil.com.br

--
*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514  C +55 (75) 8134-7440
[email protected]
www.jusbrasil.com.br
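P.S. For reference, this is the flag set I intend to try next. It is only a
sketch, not something anyone recommended verbatim in this thread: my current
flags minus -XX:+DoEscapeAnalysis and -XX:+AggressiveOpts, with the new gen
raised from 128m to 512m (a conservative guess of mine, moving toward the
larger new-to-tenured ratio Sandy suggested):

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=512m -XX:MaxNewSize=512m
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log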
