bq. I just don't find this "hbase.zookeeper.property.tickTime" anywhere in the code base.

Neither do I. Mind filing a JIRA to correct this in troubleshooting.xml?

bq. increase tickTime in zoo.cfg?

For a shared ZooKeeper quorum, that is what should be done.
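If you go that route, something along these lines in the shared quorum's zoo.cfg should let clients negotiate a session of up to 5 minutes. This is only a sketch: the numbers are illustrative, and it assumes ZooKeeper 3.3.0 or later, where maxSessionTimeout is available (by default the cap is 20 * tickTime):

# Illustrative values only -- adjust for your deployment.
tickTime=6000
initLimit=10
syncLimit=5
# Raise the cap on what clients may negotiate (default is 20 * tickTime).
maxSessionTimeout=300000

Keep in mind these are server-side settings, so they apply to every HBase cluster that shares the quorum.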
On Wed, Jun 5, 2013 at 5:45 PM, Ameya Kantikar <[email protected]> wrote:

> One more thing. I just don't find this "hbase.zookeeper.property.tickTime"
> anywhere in the code base.
> Also, I could not find a ZooKeeper API that takes tickTime from the client.
>
> http://zookeeper.apache.org/doc/r3.3.3/api/org/apache/zookeeper/ZooKeeper.html
> It takes a sessionTimeout value, but not tickTime.
>
> Is this even relevant anymore? hbase.zookeeper.property.tickTime?
>
> So what's the solution, increase tickTime in zoo.cfg? (and not
> hbase.zookeeper.property.tickTime in hbase-site.xml?)
>
> Ameya
>
> On Wed, Jun 5, 2013 at 3:18 PM, Ameya Kantikar <[email protected]> wrote:
>
> > Which tickTime is honored?
> >
> > The one in zoo.cfg, or hbase.zookeeper.property.tickTime in hbase-site.xml?
> >
> > My understanding now is that whichever tickTime is honored, the session
> > timeout cannot be more than 20 times that value.
> >
> > I think this is what's happening on my cluster:
> >
> > My hbase.zookeeper.property.tickTime value is 6000 ms. However, my timeout
> > value is 300000 ms, which is outside of 20 times the tickTime. Hence
> > ZooKeeper uses its syncLimit of 5 to generate 6000*5 = 30000 as the
> > timeout value for my RS sessions.
> >
> > I'll try increasing the hbase.zookeeper.property.tickTime value in
> > hbase-site.xml and will monitor my cluster over the next few days.
> >
> > Thanks Kevin & Ted for your help.
> >
> > Ameya
> >
> > On Wed, Jun 5, 2013 at 2:45 PM, Ted Yu <[email protected]> wrote:
> >
> > > bq. I thought this property in hbase-site.xml takes care of that:
> > > zookeeper.session.timeout
> > >
> > > From http://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions :
> > >
> > > The client sends a requested timeout, the server responds with the
> > > timeout that it can give the client. The current implementation requires
> > > that the timeout be a minimum of 2 times the tickTime (as set in the
> > > server configuration) and a maximum of 20 times the tickTime. The
> > > ZooKeeper client API allows access to the negotiated timeout.
> > >
> > > The above means the shared ZooKeeper quorum may return a timeout value
> > > different from that of zookeeper.session.timeout.
> > >
> > > Cheers
> > >
> > > On Wed, Jun 5, 2013 at 2:34 PM, Ameya Kantikar <[email protected]> wrote:
> > >
> > > > In zoo.cfg I have not set up this value explicitly. My zoo.cfg looks
> > > > like:
> > > >
> > > > tickTime=2000
> > > > initLimit=10
> > > > syncLimit=5
> > > >
> > > > We use a common ZooKeeper cluster for 2 of our HBase clusters. I'll
> > > > try increasing this value from zoo.cfg.
> > > > However, is it possible to set this value per cluster?
> > > > I thought this property in hbase-site.xml takes care of that:
> > > > zookeeper.session.timeout
> > > >
> > > > On Wed, Jun 5, 2013 at 1:49 PM, Kevin O'dell <[email protected]> wrote:
> > > >
> > > > > Ameya,
> > > > >
> > > > > What does your zoo.cfg say for your timeout value?
> > > > >
> > > > > On Wed, Jun 5, 2013 at 4:47 PM, Ameya Kantikar <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We have heavy MapReduce write jobs running against our cluster.
> > > > > > Every once in a while, we see a region server going down.
> > > > > >
> > > > > > We are on: 0.94.2-cdh4.2.0, r
> > > > > >
> > > > > > We have done some tuning for heavy MapReduce jobs, and have
> > > > > > increased scanner timeouts and lease timeouts, and have also tuned
> > > > > > the memstore as follows:
> > > > > >
> > > > > > hbase.hregion.memstore.block.multiplier: 4
> > > > > > hbase.hregion.memstore.flush.size: 134217728
> > > > > > hbase.hstore.blockingStoreFiles: 100
> > > > > >
> > > > > > So now, we are still facing issues. Looking at the logs, it looks
> > > > > > like it is due to a ZooKeeper timeout. We have tuned the ZooKeeper
> > > > > > settings as follows in hbase-site.xml:
> > > > > >
> > > > > > zookeeper.session.timeout: 300000
> > > > > > hbase.zookeeper.property.tickTime: 6000
> > > > > >
> > > > > > The actual log looks like:
> > > > > >
> > > > > > 2013-06-05 11:46:40,405 WARN org.apache.hadoop.ipc.HBaseServer:
> > > > > > (responseTooSlow):
> > > > > > {"processingtimems":13468,"call":"next(6723331143689528698, 1000), rpc
> > > > > > version=1, client version=29, methodsFingerPrint=54742778","client":"
> > > > > > 10.20.73.65:41721","starttimems":1370432786933,"queuetimems":1,"class":"HRegionServer","responsesize":39611416,"method":"next"}
> > > > > >
> > > > > > 2013-06-05 11:46:54,988 INFO org.apache.hadoop.io.compress.CodecPool:
> > > > > > Got brand-new decompressor [.snappy]
> > > > > >
> > > > > > 2013-06-05 11:48:03,017 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > > DFSOutputStream ResponseProcessor exception for block
> > > > > > BP-53741567-10.20.73.56-1351630463427:blk_9026156240355850298_8775246
> > > > > > java.io.EOFException: Premature EOF: no length prefix available
> > > > > >     at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
> > > > > >     at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
> > > > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)
> > > > > >
> > > > > > 2013-06-05 11:48:03,020 WARN org.apache.hadoop.hbase.util.Sleeper:
> > > > > > *We slept 48686ms instead of 3000ms*, this is likely due to a long
> > > > > > garbage collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > >
> > > > > > 2013-06-05 11:48:03,094 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > > > > > server smartdeals-hbase14-snc1.snc1,60020,1370373396890: Unhandled
> > > > > > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > > > > > rejected; currently processing
> > > > > > smartdeals-hbase14-snc1.snc1,60020,1370373396890 as dead server
> > > > > >
> > > > > > (Not sure why it says 3000ms when we have the timeout at 300000ms)
> > > > > >
> > > > > > We have done some GC tuning as well. Wondering what I can tune to
> > > > > > keep the RS from going down? Any ideas?
> > > > > > This is a batch-heavy cluster, and we care less about read latency.
> > > > > > We can increase RAM a bit more, but not much (the RS already has
> > > > > > 20GB of memory).
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Ameya
> > > > >
> > > > > --
> > > > > Kevin O'Dell
> > > > > Systems Engineer, Cloudera
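P.S. Since the quorum decides the final value, it may be worth verifying what it actually grants. Below is a minimal, untested sketch using only the plain ZooKeeper client API (nothing HBase-specific); the class name, connect string, and requested timeout are placeholders, not taken from your setup. It asks for the same 300000 ms and prints the negotiated result:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class NegotiatedTimeoutCheck {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    // Placeholder connect string -- substitute your shared quorum here.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 300000,
        new Watcher() {
          @Override
          public void process(WatchedEvent event) {
            if (event.getState() == Event.KeeperState.SyncConnected) {
              connected.countDown();
            }
          }
        });
    connected.await();
    // getSessionTimeout() reports the value negotiated with the server,
    // which the quorum caps at maxSessionTimeout (20 * tickTime by default).
    System.out.println("Requested 300000 ms, negotiated "
        + zk.getSessionTimeout() + " ms");
    zk.close();
  }
}

If the printed value comes back well below 300000 ms, the quorum-side cap is what is cutting your RS sessions short, regardless of zookeeper.session.timeout in hbase-site.xml.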
