On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas <[email protected]> wrote:
> The third line took 36.96 seconds to execute. Can this be causing this
> problem?
>
Probably. Have you made any attempt at GC tuning?

> Reading the code a little, it seems that, even if it's disabled, if all
> files are targeted in a compaction, it's considered a major compaction.
> Is it right?
>

That is right. They get 'upgraded' from minor to major. This should be
fine though. What you are avoiding by setting major compactions to 0 is
all regions being major compacted on a schedule, a heavyweight, effective
rewrite of all your data (unless it is already major compacted). It looks
like you have this disabled, which is good until you've wrestled your
cluster into submission.

> The machines don't have swap, so the swappiness parameter doesn't seem
> to apply here. Any other suggestion?
>

See the perf section of the hbase manual. It has our current list. (A
couple of illustrative config sketches are appended below the quoted
thread.) Are you monitoring your cluster w/ ganglia or tsdb?

St.Ack

> Thanks.
>
> 2012/1/4 Leonardo Gamas <[email protected]>
>
>> I will investigate this, thanks for the response.
>>
>>
>> 2012/1/3 Sandy Pratt <[email protected]>
>>
>>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> closing socket connection and attempting reconnect
>>>
>>> It looks like the process has been unresponsive for some time, so ZK has
>>> terminated the session. Did you experience a long GC pause right before
>>> this? If you don't have GC logging enabled for the RS, you can sometimes
>>> tell by noticing a gap in the timestamps of the log statements leading up
>>> to the crash.
>>>
>>> If it turns out to be GC, you might want to look at your kernel
>>> swappiness setting (set it to 0) and your JVM params.
>>>
>>> Sandy
>>>
>>>
>>> > -----Original Message-----
>>> > From: Leonardo Gamas [mailto:[email protected]]
>>> > Sent: Thursday, December 29, 2011 07:44
>>> > To: [email protected]
>>> > Subject: RegionServer dying every two or three days
>>> >
>>> > Hi,
>>> >
>>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
>>> > Slaves), running on Amazon EC2. The master is a High-Memory Extra
>>> > Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and
>>> > Zookeeper. The slaves are Extra Large Instances (m1.xlarge) running
>>> > DataNode, TaskTracker, RegionServer and Zookeeper.
>>> >
>>> > From time to time, every two or three days, one of the RegionServer
>>> > processes goes down, but the other processes (DataNode, TaskTracker,
>>> > Zookeeper) continue normally.
>>> >
>>> > Reading the logs:
>>> >
>>> > The connection with Zookeeper timed out:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> > have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> > closing socket connection and attempting reconnect
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> > have not heard from server in 61205ms for sessionid 0x346c561a55953e,
>>> > closing socket connection and attempting reconnect
>>> > ---------------------------
>>> >
>>> > And the Handlers start to fail:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>>> >
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>>> > ---------------------------
>>> >
>>> > And finally the server throws a YouAreDeadException :( :
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> > to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>>> > session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing
>>> > socket connection
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> > to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>>> > session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x346c561a55953e has expired, closing
>>> > socket connection
>>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
>>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
>>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
>>> > Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
>>> > Server REPORT rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > org.apache.hadoop.hbase.YouAreDeadException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
>>> > as dead server
>>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>> >   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>> >   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>> >   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>>> >   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
>>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
>>> >   at java.lang.Thread.run(Thread.java:662)
>>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
>>> > as dead server
>>> >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
>>> >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
>>> >   at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
>>> >   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >   at java.lang.reflect.Method.invoke(Method.java:597)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>> >
>>> >   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>> >   at $Proxy6.regionServerReport(Unknown Source)
>>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>>> >   ... 2 more
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
>>> > requests=66, regions=206, stores=2078, storefiles=970,
>>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
>>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
>>> > blockCacheSize=705907552, blockCacheFree=150412064,
>>> > blockCacheCount=10648, blockCacheHitCount=79578618,
>>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
>>> > blockCacheHitRatio=96, blockCacheHitCachingRatio=98
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
>>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> > rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>>> > ---------------------------
>>> >
>>> > Then I restart the RegionServer and everything is back to normal.
>>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
>>> > abnormality in the same time window.
>>> > I think it was caused by the loss of connection to Zookeeper. Is it
>>> > advisable to run Zookeeper on the same machines?
>>> > If the RegionServer loses its connection to Zookeeper, is there a way
>>> > (a configuration perhaps) for it to re-join the cluster instead of
>>> > just dying?
>>> >
>>> > Any idea what is causing this? Or how to prevent it from happening?
>>> >
>>> > Any help is appreciated.
>>> >
>>> > Best Regards,
>>> >
>>> > --
>>> >
>>> > *Leonardo Gamas*
>>> > Software Engineer
>>> > +557134943514
>>> > +557581347440
>>> > [email protected]
>>> > www.jusbrasil.com.br
>>>
>>
>>
>>
>> --
>>
>> *Leonardo Gamas*
>> Software Engineer/Chaos Monkey Engineer
>> T (71) 3494-3514
>> C (75) 8134-7440
>> [email protected]
>> www.jusbrasil.com.br
>>
>>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> [email protected]
> www.jusbrasil.com.br
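
For concreteness, here is a rough sketch of the kind of GC logging and CMS
flags meant by "GC tuning" above, as they might look in hbase-env.sh. The
heap and new-gen sizes, occupancy fraction and log path are illustrative
guesses for a ~4G region server heap, not settings taken from this thread;
adjust them to your environment.

---------------------------
# hbase-env.sh (sketch; sizes and paths are assumptions)
# Turn on GC logging so long pauses can be correlated with the ZK session
# timeouts above, and use ParNew/CMS to keep stop-the-world pauses short.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms4g -Xmx4g -Xmn256m \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
---------------------------

With GC logs in place, a long stop-the-world pause (a full GC or a
"concurrent mode failure") lining up with the 61103ms gap above would
confirm GC as the cause of the ZK session expiry.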

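And since "major compactions set to 0" came up above: the disabled state
usually corresponds to something like the following in hbase-site.xml (a
sketch; 0 means no time-based major compactions, while a compaction that
happens to pick up every store file is still promoted to major, as noted
above).

---------------------------
<!-- hbase-site.xml (sketch): disable periodic major compactions -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <!-- 0 = never run time-based major compactions; trigger them manually
       instead, e.g. during quiet hours -->
  <value>0</value>
</property>
---------------------------

You can still kick one off by hand from the hbase shell, e.g.
major_compact 'mytable' ('mytable' being a placeholder table name).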