11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not 
heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket 
connection and attempting reconnect

It looks like the process has been unresponsive for some time, so ZK has 
terminated the session.  Did you experience a long GC pause right before this?  
If you don't have GC logging enabled for the RS, you can sometimes tell by 
noticing a gap in the timestamps of the log statements leading up to the crash.
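
If it's not on already, something along these lines in hbase-env.sh will turn GC logging on for the RS (a sketch only; the log path below is just an example, point it wherever you keep your logs):

    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hbase/regionserver-gc.log"

Then look for "Full GC" entries, or any single pause on the order of the ZK timeout (~60s here), right before 00:01:01.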

If it turns out to be GC, you might want to look at your kernel swappiness 
setting (set it to 0) and your JVM params.
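
For example (just a sketch; the sysctl knob and the CMS flags below are the usual starting points, exact values depend on your heap and workload):

    # kernel side: keep the RS heap from getting swapped out
    sysctl -w vm.swappiness=0
    echo "vm.swappiness = 0" >> /etc/sysctl.conf   # persist across reboots

    # JVM side, again in hbase-env.sh: CMS keeps stop-the-world pauses much
    # shorter than the default collector on a heap the size of yours (~4 GB)
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
      -XX:CMSInitiatingOccupancyFraction=70"

If the long pauses stop showing up in the GC log, the session expirations should stop too.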

Sandy


> -----Original Message-----
> From: Leonardo Gamas [mailto:[email protected]]
> Sent: Thursday, December 29, 2011 07:44
> To: [email protected]
> Subject: RegionServer dying every two or three days
> 
> Hi,
> 
> I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
> running on Amazon EC2. The master is a High-Memory Extra Large Instance
> (m2.xlarge) with the NameNode, JobTracker, HMaster and ZooKeeper. The
> slaves are Extra Large Instances (m1.xlarge) running the DataNode, TaskTracker,
> RegionServer and ZooKeeper.
> 
> From time to time, every two or three days, one of the RegionServer
> processes goes down, but the other processes (DataNode, TaskTracker,
> ZooKeeper) continue normally.
> 
> Reading the logs:
> 
> The connection to ZooKeeper timed out:
> 
> ---------------------------
> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing
> socket connection and attempting reconnect
> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in 61205ms for sessionid 0x346c561a55953e, closing
> socket connection and attempting reconnect
> ---------------------------
> 
> And the Handlers start to fail:
> 
> ---------------------------
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> xx.xx.xx.xx:xxxx: output error
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
> caught: java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> 
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> xx.xx.xx.xx:xxxx: output error
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
> caught: java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> ---------------------------
> 
> And finally the server aborts with a YouAreDeadException :(
> 
> ---------------------------
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating 
> session
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
> connection
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating 
> session
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> ZooKeeper service, session 0x346c561a55953e has expired, closing socket
> connection
> 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
> server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> Unhandled
> exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> rejected; currently processing
> ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> dead server
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.regionServerReport(Unknown Source)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>         ... 2 more
> 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> requests=66, regions=206, stores=2078, storefiles=970,
> storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> blockCacheSize=705907552, blockCacheFree=150412064,
> blockCacheCount=10648, blockCacheHitCount=79578618,
> blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> blockCacheHitRatio=96,
> blockCacheHitCachingRatio=98
> 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> rejected; currently processing
> ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> ---------------------------
> 
> Then I restart the RegionServer and everything goes back to normal.
> Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see any
> abnormality in the same time window.
> I think it was caused by the loss of the connection to ZooKeeper. Is it
> advisable to run ZooKeeper on the same machines?
> If the RegionServer loses its connection to ZooKeeper, is there a way (a
> configuration, perhaps) for it to re-join the cluster instead of just dying?
>
> Any idea what is causing this, or how to prevent it from happening?
> 
> Any help is appreciated.
> 
> Best Regards,
> 
> --
> 
> *Leonardo Gamas*
> Software Engineer
> +557134943514
> +557581347440
> [email protected]
> www.jusbrasil.com.br
