Shortly before BadVersion occurred, I saw:

2016-01-27 05:52:36,048 INFO [main-EventThread] replication.ReplicationTrackerZKImpl: /hbase/rs/r12s8.sjc.aristanetworks.com,9104,1453785783387 znode expired, triggering replicatorRemoved event
2016-01-27 05:52:36,051 INFO [main-EventThread] replication.ReplicationTrackerZKImpl: /hbase/rs/r12s3.sjc.aristanetworks.com,9104,1453785739822 znode expired, triggering replicatorRemoved event
2016-01-27 05:52:40,028 INFO [main-EventThread] replication.ReplicationTrackerZKImpl: /hbase/rs/r12s7.sjc.aristanetworks.com,9104,1453785743694 znode expired, triggering replicatorRemoved event
Can you check your zookeeper quorum to see if there was some problem?

Thanks

On Tue, Jan 26, 2016 at 10:02 PM, tsuna <[email protected]> wrote:
> Hi,
> after a planned power outage one of our HBase clusters isn’t coming back
> up healthy. The master shows the 16 region servers but zero regions. All
> the RegionServers are experiencing the same problem, which is that they’re
> getting a BadVersion error from ZooKeeper. This was with HBase 1.1.2 and
> I just upgraded all the nodes to 1.1.3 to see if this would make a
> difference, but it didn’t.
>
> 2016-01-27 05:54:02,402 WARN [RS_LOG_REPLAY_OPS-r12s4:9104-0] coordination.ZkSplitLogWorkerCoordination: BADVERSION failed to assert ownership for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:429)
>   at org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination.attemptToOwnTask(ZkSplitLogWorkerCoordination.java:370)
>   at org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$1.progress(ZkSplitLogWorkerCoordination.java:304)
>   at org.apache.hadoop.hbase.util.FSHDFSUtils.checkIfCancelled(FSHDFSUtils.java:329)
>   at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:244)
>   at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:162)
>   at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:761)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
>   at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
>   at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 2016-01-27 05:54:02,404 WARN [RS_LOG_REPLAY_OPS-r12s4:9104-0] coordination.ZkSplitLogWorkerCoordination: Failed to heartbeat the task /hbase/splitWAL/WALs%2Fr12s16.sjc.aristanetworks.com%2C9104%2C1452811286456-splitting%2Fr12s16.sjc.aristanetworks.com%252C9104%252C1452811286456.default.1453728374800
>
> I’m attaching the full log of the RS this was extracted from, which I just
> restarted on 1.1.3, in case that’s of any help.
>
> I’ve never seen this before and after a bit of digging, I’m not really
> going anywhere. Any ideas / suggestions?
>
> --
> Benoit "tsuna" Sigoure
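As a quick sanity check of the quorum, one option is to send ZooKeeper's "stat" four-letter-word admin command to each ensemble member's client port and confirm that every node answers and reports a sane mode (leader/follower). Below is a minimal sketch of that idea; the hostnames are placeholders and the default client port 2181 is assumed, so substitute your actual ensemble:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;

    public class ZkQuorumCheck {
        public static void main(String[] args) {
            // Placeholder ensemble members -- substitute the real quorum hosts.
            String[] hosts = {"zk1.example.com", "zk2.example.com", "zk3.example.com"};
            for (String host : hosts) {
                try (Socket sock = new Socket(host, 2181)) {          // default ZK client port (assumed)
                    // "stat" is a ZooKeeper four-letter admin command; the server
                    // replies with its version, mode, latency and client counts,
                    // then closes the connection.
                    sock.getOutputStream().write("stat".getBytes());
                    sock.getOutputStream().flush();
                    BufferedReader in =
                        new BufferedReader(new InputStreamReader(sock.getInputStream()));
                    for (String line; (line = in.readLine()) != null; ) {
                        System.out.println(host + ": " + line);
                    }
                } catch (IOException e) {
                    System.out.println(host + ": UNREACHABLE (" + e.getMessage() + ")");
                }
            }
        }
    }

A member that refuses the connection, hangs, or reports an unexpected mode would point at the kind of quorum problem suggested above. The same check can be done from a shell with "echo stat | nc <host> 2181".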
