Basically, ls /hbase/rs and what do you see for va-p-02-d?
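In case it is easier to script than the zkcli shell, here is a rough sketch of that check with the plain ZooKeeper Java client. The quorum address is a placeholder and /hbase is the default zookeeper.znode.parent, so adjust both for your cluster; the startcode parsing is only an illustration.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ListLiveRegionServers {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder quorum address; point this at your own ensemble.
        ZooKeeper zk = new ZooKeeper("va-p-zookeeper-01-c:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(30, TimeUnit.SECONDS);
        try {
            // Each child of /hbase/rs is an ephemeral node named
            // host,port,startcode for a currently live region server.
            List<String> servers = zk.getChildren("/hbase/rs", false);
            for (String server : servers) {
                System.out.println(server);
                // The trailing startcode is the timestamp to compare with the
                // one embedded in the replication queue path for 02-d.
                if (server.startsWith("va-p-hbase-02-d")) {
                    String startcode = server.substring(server.lastIndexOf(',') + 1);
                    System.out.println("  startcode for 02-d: " + startcode);
                }
            }
        } finally {
            zk.close();
        }
    }
}

It only needs the ZooKeeper client jar (and its logging dependency) on the classpath.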
On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <[email protected]> wrote:

Can you do ls /hbase/rs and see what you get for 02-d - instead of looking in /replication/, could you look in /hbase/replication/rs - I want to see if the timestamps are matching or not?

Varun


On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <[email protected]> wrote:

I see - so it looks okay - there's just a lot of deep nesting in there. If you look into these nodes by doing ls, you should see a bunch of WALs which still need to be replicated...

Varun


On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <[email protected]> wrote:

2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719

01->[01->02->02]->01

Looks like a bunch of cascading failures causing this deep nesting...


On Wed, May 22, 2013 at 2:09 PM, Amit Mor <[email protected]> wrote:

Empty return:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]


On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <[email protected]> wrote:

Do an "ls", not a get, here and give the output?

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1


On Wed, May 22, 2013 at 1:53 PM, [email protected] <[email protected]> wrote:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1

cZxid = 0x60281c1de
ctime = Wed May 22 15:11:17 EDT 2013
mZxid = 0x60281c1de
mtime = Wed May 22 15:11:17 EDT 2013
pZxid = 0x60281c1de
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0


On Wed, May 22, 2013 at 11:49 PM, Ted Yu <[email protected]> wrote:

What does this command show you?

get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1

Cheers


On Wed, May 22, 2013 at 1:46 PM, [email protected] <[email protected]> wrote:

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
[1]
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]

I'm on hbase-0.94.2-cdh4.2.1

Thanks


On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <[email protected]> wrote:

Also, what version of HBase are you running?
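To make the deep nesting described earlier in the thread easier to see, here is a rough sketch that walks the whole /hbase/replication/rs subtree with the ZooKeeper Java client and prints it as an indented tree, so cascaded queue names and the WALs still waiting to be shipped show up directly. The quorum address is a placeholder and /hbase is the default zookeeper.znode.parent.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DumpReplicationQueues {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder quorum address; point this at your own ensemble.
        ZooKeeper zk = new ZooKeeper("va-p-zookeeper-01-c:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(30, TimeUnit.SECONDS);
        try {
            // One child per region server; under each server its queues,
            // and under each queue the WALs still waiting to be replicated.
            dump(zk, "/hbase/replication/rs", 0);
        } finally {
            zk.close();
        }
    }

    private static void dump(ZooKeeper zk, String path, int depth)
            throws KeeperException, InterruptedException {
        for (String child : zk.getChildren(path, false)) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  "); // indent by depth so cascaded queues stand out
            }
            System.out.println(line.append(child));
            dump(zk, path + "/" + child, depth + 1);
        }
    }
}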
On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <[email protected]> wrote:

Basically,

You had va-p-hbase-02 crash - that caused all the replication-related data in ZooKeeper to be moved to va-p-hbase-01, which took over replicating 02's logs. Now, each region server also maintains an in-memory view of what's in ZK. It seems like when you start up 01, it's trying to replicate the 02 logs underneath, but it's failing because that data is not in ZK. This is somewhat weird...

Can you open the ZooKeeper shell and do

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379

and give the output?
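If it is easier to script than the shell, here is a read-only sketch of the same check with the ZooKeeper Java client: it tests whether that queue znode exists and lists whatever WAL entries are still under it. The quorum address is a placeholder, /hbase is the default zookeeper.znode.parent, and the queue path is the one from this thread.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CheckReplicationQueue {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder quorum address; the queue path is the one from this thread.
        ZooKeeper zk = new ZooKeeper("va-p-zookeeper-01-c:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(30, TimeUnit.SECONDS);
        String queue = "/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1";
        try {
            Stat stat = zk.exists(queue, false);
            if (stat == null) {
                // If the queue, or a WAL znode under it, is missing, the attempt
                // to record the log position fails with NoNode, as in the
                // report quoted below.
                System.out.println("queue znode is missing: " + queue);
                return;
            }
            System.out.println("queue exists, children = " + stat.getNumChildren());
            // Each child is a WAL name whose shipped-up-to position is
            // tracked in that child's data.
            for (String wal : zk.getChildren(queue, false)) {
                System.out.println("  " + wal);
            }
        } finally {
            zk.close();
        }
    }
}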
On Wed, May 22, 2013 at 1:27 PM, [email protected] <[email protected]> wrote:

Hi,

This is bad ... and it happened twice: I had my replication-slave cluster offlined. I performed quite a massive Merge operation on it, and after a couple of hours it had finished and I brought it back online. At the same time, the replication-master RS machines crashed (see the first crash http://pastebin.com/1msNZ2tH) with the first exception being:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
    at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)

Before restarting the crashed RSs, I applied a 'stop_replication' command, then fired up the RSs again. They started OK, but once I hit 'start_replication' they crashed once again. The second crash log http://pastebin.com/8Nb5epJJ has the same initial exception (org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode). I've started the crashed region servers again without replication and currently all is well, but I need to start replication ASAP.

Does anyone have an idea what's going on and how I can solve it?

Thanks,
Amit
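For the post-mortem, here is a small read-only sketch that checks whether the exact znode named in that NoNodeException is still around, and if so prints its modification time and data length. The quorum address is a placeholder; the path is copied verbatim from the exception above.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CheckWalZnode {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder quorum address; adjust for your ensemble.
        ZooKeeper zk = new ZooKeeper("va-p-zookeeper-01-c:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(30, TimeUnit.SECONDS);
        // Path copied verbatim from the NoNodeException in the first crash log.
        String walZnode = "/hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/"
                + "1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/"
                + "va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719";
        try {
            Stat stat = zk.exists(walZnode, false);
            if (stat == null) {
                // Matches what the region server ran into: the znode it tried to
                // update with the new log position is simply not there anymore.
                System.out.println("znode is gone: " + walZnode);
                return;
            }
            byte[] data = zk.getData(walZnode, false, stat);
            System.out.println("znode present, mtime = " + stat.getMtime()
                    + ", dataLength = " + (data == null ? 0 : data.length));
        } finally {
            zk.close();
        }
    }
}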
