St.Ack,

I haven't made any attempt at GC tuning yet; I will read the perf section as suggested. I'm currently using Nagios + JMX to monitor the cluster, but it's used for alerting only. The perfdata isn't being stored, so it's of limited use right now; I was thinking of using TSDB to store it. Are there any known cases of integration?
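Just to make the TSDB idea concrete, something like the sketch below is what I had in mind: a small poller that reads one RegionServer metric over JMX and writes it to the TSD using the telnet-style put command. It's only a rough, untested sketch; the hostnames, the JMX port and the MBean/attribute names are placeholders that would have to match whatever our jmxremote setup actually exposes.

---------------------------
// Rough sketch: poll one JMX attribute from a RegionServer and push it to
// OpenTSDB over the plain-text "put" protocol on port 4242.
// Hostnames, the JMX port (10102) and the MBean/attribute names are
// placeholders; check the real names with jconsole first.
import java.io.PrintWriter;
import java.net.Socket;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxToTsdb {
    public static void main(String[] args) throws Exception {
        String rsHost = "ip-xx-xx-xx-xx.ec2.internal";   // RegionServer to poll
        String tsdHost = "tsdb.internal";                // OpenTSDB TSD host

        // Standard JMX-over-RMI URL; assumes jmxremote is enabled on port 10102.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + rsHost + ":10102/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();

        // MBean and attribute names vary by HBase version; these are guesses.
        ObjectName rsStats =
            new ObjectName("hadoop:service=RegionServer,name=RegionServerStatistics");
        Number requests = (Number) mbs.getAttribute(rsStats, "requests");

        // OpenTSDB accepts lines of the form:
        //   put <metric> <epoch-seconds> <value> <tag>=<value> ...
        Socket tsd = new Socket(tsdHost, 4242);
        PrintWriter out = new PrintWriter(tsd.getOutputStream(), true);
        long now = System.currentTimeMillis() / 1000L;
        out.printf("put hbase.regionserver.requests %d %s host=%s%n",
                   now, requests, rsHost);

        out.close();
        tsd.close();
        jmxc.close();
    }
}
---------------------------

If there's an existing collector that already does this for HBase, I'd rather reuse it, hence the question.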
---

Sandy,

Yes, my timeout is 30 seconds:

<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value>
</property>

For our application that is a tolerable time to wait if a RegionServer goes offline. My heap is 4GB and my JVM params are:

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
-XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

I will try the -XX:+UseParallelOldGC param and post my feedback here.

---

Ramkrishna,

It seems GC is the root of all evil in this case.

----

Thank you all for the answers. I will try out the valuable advice given here and post my results.

Leo Gamas.

2012/1/5 Ramkrishna S Vasudevan <[email protected]> > Recently we faced a similar problem and it was due to GC config. Pls check > your GC. > > Regards > Ram > > -----Original Message----- > From: [email protected] [mailto:[email protected]] On Behalf Of Stack > Sent: Thursday, January 05, 2012 2:50 AM > To: [email protected] > Subject: Re: RegionServer dying every two or three days > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas > <[email protected]> wrote: > > The third line took 36.96 seconds to execute, can this be causing this > > problem? > > > > Probably. Have you made any attempt at GC tuning? > > > > Reading the code a little it seems that, even if it's disabled, if all > > files are target in a compaction, it's considered a major compaction. Is > it > > right? > > > > That is right. They get 'upgraded' from minor to major. > > This should be fine though. What you are avoiding setting major > compactions to 0 is all regions being major compacted on a period, a > heavy weight effective rewrite of all your data (unless already major > compacted). It looks like you have this disabled which is good until > you've wrestled your cluster into submission. > > > > The machines don't have swap, so the swappiness parameter don't seem to > > apply here. Any other suggestion? > > > > See the perf section of the hbase manual. It has our current list. > > Are you monitoring your cluster w/ ganglia or tsdb? > > > St.Ack > > > Thanks. > > > > 2012/1/4 Leonardo Gamas <[email protected]> > > > >> I will investigate this, thanks for the response. > >> > >> > >> 2012/1/3 Sandy Pratt <[email protected]> > >> > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, > >>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, > >>> closing socket connection and attempting reconnect > >>> > >>> It looks like the process has been unresponsive for some time, so ZK > has > >>> terminated the session. Did you experience a long GC pause right > before > >>> this? If you don't have GC logging enabled for the RS, you can > sometimes > >>> tell by noticing a gap in the timestamps of the log statements leading > up > >>> to the crash. > >>> > >>> If it turns out to be GC, you might want to look at your kernel > >>> swappiness setting (set it to 0) and your JVM params. > >>> > >>> Sandy > >>> > >>> > -----Original Message----- > >>> > From: Leonardo Gamas [mailto:[email protected]] > >>> > Sent: Thursday, December 29, 2011 07:44 > >>> > To: [email protected] > >>> > Subject: RegionServer dying every two or three days > >>> > > >>> > Hi, > >>> > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 > >>> Slaves), > >>> > running on Amazon EC2.
The master is a High-Memory Extra Large > Instance > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The > >>> > slaves are Extra Large Instances (m1.xlarge) running Datanode, > >>> TaskTracker, > >>> > RegionServer and Zookeeper. > >>> > > >>> > From time to time, every two or three days, one of the RegionServers > >>> > processes goes down, but the other processes (DataNode, TaskTracker, > >>> > Zookeeper) continue normally. > >>> > > >>> > Reading the logs: > >>> > > >>> > The connection with Zookeeper timed out: > >>> > > >>> > --------------------------- > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed > out, > >>> have > >>> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, > >>> closing > >>> > socket connection and attempting reconnect > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed > out, > >>> have > >>> > not heard from server in 61205ms for sessionid 0x346c561a55953e, > closing > >>> > socket connection and attempting reconnect > >>> > --------------------------- > >>> > > >>> > And the Handlers start to fail: > >>> > > >>> > --------------------------- > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from > >>> > xx.xx.xx.xx:xxxx: output error > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on > 60020 > >>> > caught: java.nio.channels.ClosedChannelException > >>> > at > >>> > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13 > >>> > 3) > >>> > at > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java: > >>> > 1341) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB > >>> > aseServer.java:727) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe > >>> > rver.java:792) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1 > >>> > 083) > >>> > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from > >>> > xx.xx.xx.xx:xxxx: output error > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on > 60020 > >>> > caught: java.nio.channels.ClosedChannelException > >>> > at > >>> > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13 > >>> > 3) > >>> > at > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java: > >>> > 1341) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB > >>> > aseServer.java:727) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe > >>> > rver.java:792) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1 > >>> > 083) > >>> > --------------------------- > >>> > > >>> > And finally the server throws a YouAreDeadException :( : > >>> > > >>> > --------------------------- > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket > connection > >>> to > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181 > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, > initiating > >>> session > >>> > 11/12/29 
00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to > >>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing > socket > >>> > connection > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket > connection > >>> to > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181 > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, > initiating > >>> session > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to > >>> > ZooKeeper service, session 0x346c561a55953e has expired, closing > socket > >>> > connection > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region > >>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, > >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): > >>> > Unhandled > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT > >>> > rejected; currently processing > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server > >>> > org.apache.hadoop.hbase.YouAreDeadException: > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 > as > >>> > dead server > >>> > at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > >>> > Method) > >>> > at > >>> > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructor > >>> > AccessorImpl.java:39) > >>> > at > >>> > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCon > >>> > structorAccessorImpl.java:27) > >>> > at > >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513) > >>> > at > >>> > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce > >>> > ption.java:95) > >>> > at > >>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote > >>> > Exception.java:79) > >>> > at > >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep > >>> > ort(HRegionServer.java:735) > >>> > at > >>> > > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j > >>> > ava:596) > >>> > at java.lang.Thread.run(Thread.java:662) > >>> > Caused by: org.apache.hadoop.ipc.RemoteException: > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 > as > >>> > dead server > >>> > at > >>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana > >>> > ger.java:204) > >>> > at > >>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serv > >>> > erManager.java:262) > >>> > at > >>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav > >>> > a:669) > >>> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown > Source) > >>> > at > >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces > >>> > sorImpl.java:25) > >>> > at java.lang.reflect.Method.invoke(Method.java:597) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) > >>> > at > >>> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1 > >>> > 039) > >>> > > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771) > >>> > at > >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257 > >>> > ) > >>> > at $Proxy6.regionServerReport(Unknown Source) > >>> > at > >>> > 
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep > >>> > ort(HRegionServer.java:729) > >>> > ... 2 more > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: > >>> > requests=66, regions=206, stores=2078, storefiles=970, > >>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, > >>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083, > >>> > blockCacheSize=705907552, blockCacheFree=150412064, > >>> > blockCacheCount=10648, blockCacheHitCount=79578618, > >>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, > >>> > blockCacheHitRatio=96, > >>> > blockCacheHitCachingRatio=98 > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT > >>> > rejected; currently processing > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020 > >>> > --------------------------- > >>> > > >>> > Then i restart the RegionServer and everything is back to normal. > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't see any > >>> > abnormality in the same time window. > >>> > I think it was caused by the lost of connection to zookeeper. Is it > >>> advisable to > >>> > run zookeeper in the same machines? > >>> > if the RegionServer lost it's connection to Zookeeper, there's a way > (a > >>> > configuration perhaps) to re-join the cluster, and not only die? > >>> > > >>> > Any idea what is causing this?? Or to prevent it from happening? > >>> > > >>> > Any help is appreciated. > >>> > > >>> > Best Regards, > >>> > > >>> > -- > >>> > > >>> > *Leonardo Gamas* > >>> > Software Engineer > >>> > +557134943514 > >>> > +557581347440 > >>> > [email protected] > >>> > www.jusbrasil.com.br > >>> > >> > >> > >> > >> -- > >> > >> *Leonardo Gamas* > >> Software Engineer/Chaos Monkey Engineer > >> T (71) 3494-3514 > >> C (75) 8134-7440 > >> [email protected] > >> www.jusbrasil.com.br > >> > >> > > > > > > -- > > > > *Leonardo Gamas* > > Software Engineer/Chaos Monkey Engineer > > T (71) 3494-3514 > > C (75) 8134-7440 > > [email protected] > > www.jusbrasil.com.br > > -- *Leonardo Gamas* Software Engineer +557134943514 +557581347440 [email protected] www.jusbrasil.com.br
