Awesome. Thanks Himanshu.
On Wed, Apr 17, 2013 at 10:48 PM, Himanshu Vashishtha <[email protected]> wrote:

> Hello Ameya,
>
> Sorry to hear that.
>
> You have two options:
>
> 1) Apply the HBASE-8099 patch to your version
>    (https://issues.apache.org/jira/browse/HBASE-8099). The patch is
>    simple, so it should be easy to do, OR
> 2) Turn off the zk.multi feature (see hbase-default.xml). (You can refer
>    to the CDH4.2.0 docs for that.)
>
> This fix (HBASE-8099) will be in CDH4.2.1, though.
>
> Please ask the list if you have any more questions.
>
> Thanks,
> Himanshu
>
> On Wed, Apr 17, 2013 at 10:38 PM, Ameya Kantikar <[email protected]> wrote:
>
> > I am running HBase 0.94.2 from Cloudera CDH4.2 (10-machine cluster).
> >
> > Under heavy write load, and when replication is on, all my region
> > servers are going down. I checked with the Cloudera version; it has the
> > HBASE-2611 bug patched in the version I am using, so I am not sure
> > what's going on. Here is the stack:
> >
> > 2013-04-18 01:47:33,423 INFO
> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> > Atomically moving relevance-hbase5-snc1.snc1,60020,1366247910200's hlogs
> > to my queue
> >
> > 2013-04-18 01:47:33,424 DEBUG
> > org.apache.hadoop.hbase.replication.ReplicationZookeeper: The multi list
> > size is: 1
> >
> > 2013-04-18 01:47:33,425 WARN
> > org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception
> > in copyQueuesFromRSUsingMulti:
> >
> > org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty
> >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
> >     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:925)
> >     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:901)
> >     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:538)
> >     at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1457)
> >     at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:705)
> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:585)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >     at java.lang.Thread.run(Thread.java:662)
> >
> > Followed by:
> >
> > 2013-04-18 01:47:36,043 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > server relevance-hbase2-snc1.snc1,60020,1366247745434: Writing
> > replication status
> >
> > I checked by turning replication off, and everything seems fine. I can
> > reproduce this bug almost every time I run my write-heavy job.
> >
> > Here is the complete log:
> >
> > http://pastebin.com/da0m475T
> >
> > Any ideas?
> >
> > Ameya
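For anyone else hitting this: option 2 above (turning off the zk.multi feature) means overriding the ZooKeeper multi-update setting in your configuration. A minimal sketch, assuming the property name is `hbase.zookeeper.useMulti` as in stock HBase 0.94 (check your hbase-default.xml / the CDH4.2.0 docs to confirm the exact name for your build):

```xml
<!-- hbase-site.xml: disable ZooKeeper multi-update so replication
     queue failover falls back to the non-multi code path, avoiding
     the NotEmptyException in copyQueuesFromRSUsingMulti above. -->
<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>false</value>
</property>
```

A rolling restart of the region servers is needed for the change to take effect.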
