Can you do an "ls" rather than a "get" here, and give the output?

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
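
In case it helps when reading the znode names in this thread, here is a rough sketch of how they could be picked apart. It assumes (based only on the names quoted below, not on the HBase source) that a queue znode is named with the peer id first, optionally followed by "-"-separated region server names of the form host,port,startcode, and that the peer id itself contains no "-". The function name parse_queue_name is hypothetical and not part of HBase.

```python
import re

# One "host,port,startcode" server name, ending at a "-" separator or at the
# end of the string. Hostnames here contain "-" themselves, so the pattern
# anchors on the two numeric fields instead of splitting on "-".
SERVER = re.compile(r"([^,]+?),(\d+),(\d+)(?:-|$)")


def parse_queue_name(queue):
    """Split a queue znode name into (peer_id, list of server dicts).

    Assumption: everything before the first "-" is the peer id.
    """
    peer_id, _, rest = queue.partition("-")
    servers = [
        {
            "host": m.group(1),
            "port": int(m.group(2)),
            "startcode": int(m.group(3)),
        }
        for m in SERVER.finditer(rest)
    ]
    return peer_id, servers


if __name__ == "__main__":
    # Queue name taken from the NoNodeException path quoted in this thread.
    name = ("1-va-p-hbase-01-c,60020,1369042378287"
            "-va-p-hbase-02-c,60020,1369042377731")
    peer, servers = parse_queue_name(name)
    print(peer, [s["host"] for s in servers])
```

For the queue name in the stack trace this yields peer id "1" with the two server names va-p-hbase-01-c and va-p-hbase-02-c, while a plain queue such as "1" yields no server names at all.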

On Wed, May 22, 2013 at 1:53 PM, [email protected] <[email protected]> wrote:

> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>
> cZxid = 0x60281c1de
> ctime = Wed May 22 15:11:17 EDT 2013
> mZxid = 0x60281c1de
> mtime = Wed May 22 15:11:17 EDT 2013
> pZxid = 0x60281c1de
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 0
> numChildren = 0
>
> On Wed, May 22, 2013 at 11:49 PM, Ted Yu <[email protected]> wrote:
>
> > What does this command show you ?
> >
> > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >
> > Cheers
> >
> > On Wed, May 22, 2013 at 1:46 PM, [email protected] <[email protected]> wrote:
> >
> > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > > [1]
> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> > > []
> > >
> > > I'm on hbase-0.94.2-cdh4.2.1
> > >
> > > Thanks
> > >
> > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <[email protected]> wrote:
> > >
> > > > Also what version of HBase are you running ?
> > > >
> > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <[email protected]> wrote:
> > > >
> > > > > Basically,
> > > > >
> > > > > You had va-p-hbase-02 crash - that caused all the replication
> > > > > related data in zookeeper to be moved to va-p-hbase-01 and have
> > > > > it take over for replicating 02's logs. Now each region server
> > > > > also maintains an in-memory state of what's in ZK. It seems like
> > > > > when you start up 01, it's trying to replicate the 02 logs
> > > > > underneath but it's failing to because that data is not in ZK.
> > > > > This is somewhat weird...
> > > > >
> > > > > Can you open the zookeeper shell and do
> > > > >
> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> > > > >
> > > > > And give the output ?
> > > > >
> > > > > On Wed, May 22, 2013 at 1:27 PM, [email protected] <[email protected]> wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> This is bad ... and happened twice: I had my replication-slave
> > > > >> cluster offlined. I performed quite a massive Merge operation
> > > > >> on it and after a couple of hours it had finished and I
> > > > >> returned it back online. At the same time, the
> > > > >> replication-master RS machines crashed (see first crash
> > > > >> http://pastebin.com/1msNZ2tH) with the first exception being:
> > > > >>
> > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> > > > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> > > > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > >>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
> > > > >>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
> > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
> > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
> > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
> > > > >>     at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
> > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
> > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
> > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
> > > > >>
> > > > >> Before restarting the crashed RS's, I have applied a
> > > > >> 'stop_replication' cmd. Then fired up the RS's again. They've
> > > > >> started o.k. but once I've hit 'start_replication' they have
> > > > >> crashed once again. The second crash log
> > > > >> http://pastebin.com/8Nb5epJJ has the same initial exception
> > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
> > > > >> KeeperErrorCode = NoNode). I've started the crashed region
> > > > >> servers again without replication and currently all is well,
> > > > >> but I need to start replication asap.
> > > > >>
> > > > >> Does anyone have an idea what's going on and how can I solve it ?
> > > > >>
> > > > >> Thanks,
> > > > >> Amit
