I will investigate this, thanks for the response.
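For reference, this is roughly what I plan to try first, following your
suggestions. The GC-logging flags are the standard Sun JVM options; the
hbase-env.sh variable and the log path are my assumptions about our CDH3
layout, not something I have verified yet:

---------------------------
# hbase-env.sh: enable GC logging for the RegionServer JVM
# (HBASE_REGIONSERVER_OPTS and the log path are assumptions for our setup)
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"

# kernel: keep the box from swapping the JVM heap out
sysctl -w vm.swappiness=0                      # takes effect immediately
echo "vm.swappiness=0" >> /etc/sysctl.conf     # persists across reboots
---------------------------

With timestamps in the GC log I should be able to line up any long pause
with the 61-second gap ZooKeeper reported before touching the collector
flags themselves.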
2012/1/3 Sandy Pratt <[email protected]>

> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> closing socket connection and attempting reconnect
>
> It looks like the process has been unresponsive for some time, so ZK has
> terminated the session. Did you experience a long GC pause right before
> this? If you don't have GC logging enabled for the RS, you can sometimes
> tell by noticing a gap in the timestamps of the log statements leading up
> to the crash.
>
> If it turns out to be GC, you might want to look at your kernel swappiness
> setting (set it to 0) and your JVM params.
>
> Sandy
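On my other question below (whether there is a configuration that lets the
RS re-join instead of dying): judging by the YouAreDeadException, the master
is already treating the RS as dead by the time the session expires, so I
suspect the useful knob is how long the session can survive a pause rather
than any re-join setting. A sketch of what I intend to review; the value is
a guess, not a tested recommendation:

---------------------------
<!-- hbase-site.xml: ZooKeeper session timeout requested by the
     RegionServer, in ms. My understanding (please correct me) is that the
     granted timeout is negotiated with ZooKeeper and capped by the ZK
     server's maxSessionTimeout (20 * tickTime unless overridden in
     zoo.cfg), so that side may need raising as well. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
---------------------------

A longer timeout only hides pauses up to that length, of course, so I will
still chase the GC side first.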
> > -----Original Message-----
> > From: Leonardo Gamas [mailto:[email protected]]
> > Sent: Thursday, December 29, 2011 07:44
> > To: [email protected]
> > Subject: RegionServer dying every two or three days
> >
> > Hi,
> >
> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
> > running on Amazon EC2. The master is a High-Memory Extra Large Instance
> > (m2.xlarge) with NameNode, JobTracker, HMaster and ZooKeeper. The
> > slaves are Extra Large Instances (m1.xlarge) running DataNode, TaskTracker,
> > RegionServer and ZooKeeper.
> >
> > From time to time, every two or three days, one of the RegionServer
> > processes goes down, but the other processes (DataNode, TaskTracker,
> > ZooKeeper) continue normally.
> >
> > Reading the logs:
> >
> > The connection with ZooKeeper timed out:
> >
> > ---------------------------
> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing
> > socket connection and attempting reconnect
> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
> > not heard from server in 61205ms for sessionid 0x346c561a55953e, closing
> > socket connection and attempting reconnect
> > ---------------------------
> >
> > And the Handlers start to fail:
> >
> > ---------------------------
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> > xx.xx.xx.xx:xxxx: output error
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
> > caught: java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> >
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> > xx.xx.xx.xx:xxxx: output error
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
> > caught: java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > ---------------------------
> >
> > And finally the server throws a YouAreDeadException :( :
> >
> > ---------------------------
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
> > connection
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
> > connection
> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > org.apache.hadoop.hbase.YouAreDeadException:
> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > dead server
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> >         at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > dead server
> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >         at $Proxy6.regionServerReport(Unknown Source)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> >         ... 2 more
> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> > requests=66, regions=206, stores=2078, storefiles=970,
> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> > blockCacheSize=705907552, blockCacheFree=150412064,
> > blockCacheCount=10648, blockCacheHitCount=79578618,
> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> > blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > ---------------------------
> >
> > Then I restart the RegionServer and everything is back to normal.
> > Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see any
> > abnormality in the same time window.
> >
> > I think it was caused by the loss of the connection to ZooKeeper. Is it
> > advisable to run ZooKeeper on the same machines?
> >
> > If the RegionServer loses its connection to ZooKeeper, is there a way (a
> > configuration, perhaps) for it to re-join the cluster instead of dying?
> >
> > Any idea what is causing this, or how to prevent it from happening?
> >
> > Any help is appreciated.
> >
> > Best Regards,
> >
> > --
> > *Leonardo Gamas*
> > Software Engineer
> > +557134943514
> > +557581347440
> > [email protected]
> > www.jusbrasil.com.br

--
*Leonardo Gamas*
Software Engineer/Chaos Monkey Engineer
T (71) 3494-3514
C (75) 8134-7440
[email protected]
www.jusbrasil.com.br
