In zoo.cfg I have not set this value explicitly. My zoo.cfg looks like:

tickTime=2000
initLimit=10
syncLimit=5

We use a common ZooKeeper cluster for 2 of our HBase clusters. I'll try increasing this value in zoo.cfg. However, is it possible to set this value per cluster? I thought this property in hbase-site.xml takes care of that: zookeeper.session.timeout

On Wed, Jun 5, 2013 at 1:49 PM, Kevin O'dell wrote:

> Ameya,
>
> What does your zoo.cfg say for your timeout value?
>
>
> On Wed, Jun 5, 2013 at 4:47 PM, Ameya Kantikar wrote:
>
> > Hi,
> >
> > We have heavy map reduce write jobs running against our cluster. Every once
> > in a while, we see a region server going down.
> >
> > We are on : 0.94.2-cdh4.2.0, r
> >
> > We have done some tuning for heavy map reduce jobs, and have increased
> > scanner timeouts, lease timeouts, have also tuned memstore as follows:
> >
> > hbase.hregion.memstore.block.multiplier: 4
> > hbase.hregion.memstore.flush.size: 134217728
> > hbase.hstore.blockingStoreFiles: 100
> >
> > So now, we are still facing issues. Looking at the logs, it looks like it is
> > due to a ZooKeeper timeout. We have tuned ZooKeeper settings as follows in
> > hbase-site.xml:
> >
> > zookeeper.session.timeout: 300000
> > hbase.zookeeper.property.tickTime: 6000
> >
> >
> > The actual log looks like:
> >
> >
> > 2013-06-05 11:46:40,405 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":13468,"call":"next(6723331143689528698, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.73.65:41721","starttimems":1370432786933,"queuetimems":1,"class":"HRegionServer","responsesize":39611416,"method":"next"}
> >
> > 2013-06-05 11:46:54,988 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
> >
> > 2013-06-05 11:48:03,017 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-53741567-10.20.73.56-1351630463427:blk_9026156240355850298_8775246
> > java.io.EOFException: Premature EOF: no length prefix available
> >     at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
> >     at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)
> >
> > 2013-06-05 11:48:03,020 WARN org.apache.hadoop.hbase.util.Sleeper: *We slept 48686ms instead of 3000ms*, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >
> > 2013-06-05 11:48:03,094 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server smartdeals-hbase14-snc1.snc1,60020,1370373396890: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing smartdeals-hbase14-snc1.snc1,60020,1370373396890 as dead server
> >
> > (Not sure why it says 3000ms when we have the timeout at 300000ms)
> >
> > We have done some GC tuning as well. Wondering what I can tune to keep the
> > RS from going down? Any ideas?
> > This is a batch-heavy cluster, and we care less about read latency. We can
> > increase RAM a bit more, but not much (the RS already has 20GB of memory).
> >
> > Thanks in advance.
> >
> > Ameya
> >
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
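
For reference, a rough sketch of why the zoo.cfg values matter here. It assumes ZooKeeper's documented defaults, where the server bounds the session timeout a client requests to between 2 x tickTime and 20 x tickTime unless minSessionTimeout / maxSessionTimeout are set explicitly, and it assumes the external ensemble's zoo.cfg (not hbase.zookeeper.property.tickTime in hbase-site.xml) is what applies. The function name is made up for illustration; it is not an HBase or ZooKeeper API.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms,
                               min_session_ms=None, max_session_ms=None):
    """Clamp a requested ZooKeeper session timeout to the server's bounds.

    Hypothetical helper for illustration only. Assumes the documented
    defaults: min = 2 * tickTime, max = 20 * tickTime, unless overridden.
    """
    lower = 2 * tick_time_ms if min_session_ms is None else min_session_ms
    upper = 20 * tick_time_ms if max_session_ms is None else max_session_ms
    return max(lower, min(requested_ms, upper))


# Values from this thread: hbase-site.xml asks for 300000 ms,
# while the shared ensemble's zoo.cfg has tickTime=2000.
effective_ms = negotiated_session_timeout(300000, 2000)
print(effective_ms)  # 40000

# The pause reported by the Sleeper warning in the log above.
pause_ms = 48686
print(pause_ms > effective_ms)  # True: long enough to expire the session

# If the ensemble itself used tickTime=6000, the cap would be higher.
print(negotiated_session_timeout(300000, 6000))  # 120000
```

If those assumptions hold, the 300000 ms requested in hbase-site.xml would be capped at 40000 ms by the shared ensemble, and the 48686 ms pause would be enough to expire the session. Raising maxSessionTimeout (or tickTime) in zoo.cfg changes the cap for every client of that ensemble, so it would apply to both HBase clusters.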
