On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas <[email protected]> wrote:
> The third line took 36.96 seconds to execute. Can this be causing this
> problem?
>
Probably. Have you made any attempt at GC tuning?

> Reading the code a little, it seems that, even if it's disabled, if all
> files are targeted in a compaction, it's considered a major compaction.
> Is it right?
>

That is right. They get 'upgraded' from minor to major. This should be
fine though. What you are avoiding by setting major compactions to 0 is
all regions being major compacted on a schedule, a heavyweight, effective
rewrite of all your data (unless it is already major compacted). It looks
like you have this disabled, which is good until you've wrestled your
cluster into submission.

> The machines don't have swap, so the swappiness parameter doesn't seem
> to apply here. Any other suggestion?
>

See the perf section of the hbase manual. It has our current list. (A
couple of illustrative config sketches are appended below the quoted
thread.) Are you monitoring your cluster w/ ganglia or tsdb?

St.Ack

> Thanks.
>
> 2012/1/4 Leonardo Gamas <[email protected]>
>
>> I will investigate this, thanks for the response.
>>
>>
>> 2012/1/3 Sandy Pratt <[email protected]>
>>
>>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> closing socket connection and attempting reconnect
>>>
>>> It looks like the process has been unresponsive for some time, so ZK has
>>> terminated the session. Did you experience a long GC pause right before
>>> this? If you don't have GC logging enabled for the RS, you can sometimes
>>> tell by noticing a gap in the timestamps of the log statements leading up
>>> to the crash.
>>>
>>> If it turns out to be GC, you might want to look at your kernel
>>> swappiness setting (set it to 0) and your JVM params.
>>>
>>> Sandy
>>>
>>>
>>> > -----Original Message-----
>>> > From: Leonardo Gamas [mailto:[email protected]]
>>> > Sent: Thursday, December 29, 2011 07:44
>>> > To: [email protected]
>>> > Subject: RegionServer dying every two or three days
>>> >
>>> > Hi,
>>> >
>>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
>>> > Slaves), running on Amazon EC2. The master is a High-Memory Extra
>>> > Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and
>>> > Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running
>>> > DataNode, TaskTracker, RegionServer and Zookeeper.
>>> >
>>> > From time to time, every two or three days, one of the RegionServer
>>> > processes goes down, but the other processes (DataNode, TaskTracker,
>>> > Zookeeper) continue normally.
>>> >
>>> > Reading the logs:
>>> >
>>> > The connection with Zookeeper timed out:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> > have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> > closing socket connection and attempting reconnect
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> > have not heard from server in 61205ms for sessionid 0x346c561a55953e,
>>> > closing socket connection and attempting reconnect
>>> > ---------------------------
>>> >
>>> > And the Handlers start to fail:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>>> >
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>>> > ---------------------------
>>> >
>>> > And finally the server throws a YouAreDeadException :( :
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> > to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>>> > session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing
>>> > socket connection
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> > to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>>> > session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x346c561a55953e has expired, closing
>>> > socket connection
>>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
>>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
>>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
>>> > Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
>>> > Server REPORT rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > org.apache.hadoop.hbase.YouAreDeadException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
>>> > as dead server
>>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>> >   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>> >   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>> >   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>>> >   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
>>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
>>> >   at java.lang.Thread.run(Thread.java:662)
>>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
>>> > as dead server
>>> >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
>>> >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
>>> >   at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
>>> >   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >   at java.lang.reflect.Method.invoke(Method.java:597)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>> >
>>> >   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>> >   at $Proxy6.regionServerReport(Unknown Source)
>>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>>> >   ... 2 more
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
>>> > requests=66, regions=206, stores=2078, storefiles=970,
>>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
>>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
>>> > blockCacheSize=705907552, blockCacheFree=150412064,
>>> > blockCacheCount=10648, blockCacheHitCount=79578618,
>>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
>>> > blockCacheHitRatio=96, blockCacheHitCachingRatio=98
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
>>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> > rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>>> > ---------------------------
>>> >
>>> > Then I restart the RegionServer and everything is back to normal.
>>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
>>> > abnormality in the same time window.
>>> > I think it was caused by the loss of connection to Zookeeper. Is it
>>> > advisable to run Zookeeper on the same machines?
>>> > If the RegionServer loses its connection to Zookeeper, is there a way
>>> > (a configuration perhaps) for it to re-join the cluster instead of
>>> > just dying?
>>> >
>>> > Any idea what is causing this? Or how to prevent it from happening?
>>> >
>>> > Any help is appreciated.
>>> >
>>> > Best Regards,
>>> >
>>> > --
>>> >
>>> > *Leonardo Gamas*
>>> > Software Engineer
>>> > +557134943514
>>> > +557581347440
>>> > [email protected]
>>> > www.jusbrasil.com.br
>>>
>>
>>
>>
>> --
>>
>> *Leonardo Gamas*
>> Software Engineer/Chaos Monkey Engineer
>> T (71) 3494-3514
>> C (75) 8134-7440
>> [email protected]
>> www.jusbrasil.com.br
>>
>>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> [email protected]
> www.jusbrasil.com.br
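
For concreteness, here is a rough sketch of the kind of GC logging and CMS
flags meant by "GC tuning" above, as they might look in hbase-env.sh. The
heap and new-gen sizes, occupancy fraction and log path are illustrative
guesses for a ~4G region server heap, not settings taken from this thread;
adjust them to your environment.

---------------------------
# hbase-env.sh (sketch; sizes and paths are assumptions)
# Turn on GC logging so long pauses can be correlated with the ZK session
# timeouts above, and use ParNew/CMS to keep stop-the-world pauses short.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms4g -Xmx4g -Xmn256m \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
---------------------------

With GC logs in place, a long stop-the-world pause (a full GC or a
"concurrent mode failure") lining up with the 61103ms gap above would
confirm GC as the cause of the ZK session expiry.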

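And since "major compactions set to 0" came up above: the disabled state
usually corresponds to something like the following in hbase-site.xml (a
sketch; 0 means no time-based major compactions, while a compaction that
happens to pick up every store file is still promoted to major, as noted
above).

---------------------------
<!-- hbase-site.xml (sketch): disable periodic major compactions -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <!-- 0 = never run time-based major compactions; trigger them manually
       instead, e.g. during quiet hours -->
  <value>0</value>
</property>
---------------------------

You can still kick one off by hand from the hbase shell, e.g.
major_compact 'mytable' ('mytable' being a placeholder table name).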