No, the servers came out fine only because, right after the crash (of the RS's; the masters were still running), I immediately pulled the brakes with stop_replication. Then I started the RS's and they came back fine (not replicating). Once I hit 'start_replication' again, they crashed for the second time. Eventually I deleted the heavily nested replication znodes and 'start_replication' succeeded. I didn't patch 8207 because I'm on CDH with the Cloudera Manager parcels setup, and I'm still trying to figure out how to replace their jars with mine in a clean and non-intrusive way.
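For readers following along, here is a minimal sketch of why hyphenated host names make the nested replication queue znode names from this thread awkward to take apart. It is an illustration only, not HBase code and not the HBASE-8207 patch: the helper `parse_queue_znode` is hypothetical, and it assumes the name has the shape `<peer id>-<server>-<server>...`, that each server is `host,port,startcode`, and that the peer id contains no hyphen.

```python
import re

# Hypothetical helper, not HBase code: matches one "host,port,startcode"
# server name, optionally followed by the "-" separating it from the next.
_SERVER = re.compile(r"(.+?,\d+,\d+)(?:-|$)")


def parse_queue_znode(name):
    """Split '<peer>-<server>-<server>...' into (peer_id, [servers])."""
    # Assumption: the peer id itself contains no hyphen.
    peer_id, _, rest = name.partition("-")
    servers = _SERVER.findall(rest)
    return peer_id, servers


# Queue znode name taken from the thread (hosts contain hyphens).
queue = ("1-va-p-hbase-02-e,60020,1369042377129"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475")

# Naive approach: splitting on every "-" shreds the host names.
naive = queue.split("-")

peer_id, servers = parse_queue_znode(queue)
print(len(naive))   # number of fragments from the naive split
print(peer_id)      # peer id
for s in servers:   # one line per server in the nested queue
    print(s)
```

Splitting on every hyphen cuts each host name into fragments, while anchoring on the `,port,startcode` suffix keeps each server name whole. Whether the real code path fails in exactly this way on 0.94.2-cdh4.2.1 is not something this sketch establishes.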
On Thu, May 23, 2013 at 10:33 AM, Varun Sharma <va...@pinterest.com> wrote:

> Actually, it seems like something else was wrong here - the servers came up just fine on trying again - so could not really reproduce the issue.
>
> Amit: Did you try patching 8207 ?
>
> Varun
>
> On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha <hv.cs...@gmail.com> wrote:
>
>> That sounds like a bug for sure. Could you create a jira with logs/znode dump/steps to reproduce it?
>>
>> Thanks,
>> himanshu
>>
>> On Wed, May 22, 2013 at 5:01 PM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>> It seems I can reproduce this - I did a few rolling restarts and got screwed with NoNode exceptions - I am running 0.94.7 which has the fix but my nodes don't contain hyphens - nodes are no longer coming back up...
>>>
>>> Thanks
>>> Varun
>>>
>>> On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha <hv.cs...@gmail.com> wrote:
>>>
>>>> I'd suggest to please patch the code with 8207; cdh4.2.1 doesn't have it.
>>>>
>>>> With hyphens in the name, ReplicationSource gets confused and tried to set data in a znode which doesn't exist.
>>>>
>>>> Thanks,
>>>> Himanshu
>>>>
>>>> On Wed, May 22, 2013 at 2:42 PM, Amit Mor <amit.mor.m...@gmail.com> wrote:
>>>>
>>>>> yes, indeed - hyphens are part of the host name (annoying legacy stuff in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was backported by Cloudera into their flavor of 0.94.2, but the mysterious occurrence of the percent sign in zkcli (ls
>>>>> /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
>>>>> might be a sign for such problem. How deep should my rmr in zkcli (an example would be most welcomed :) be ? I have no serious problem running copyTable with a time period corresponding to the outage and then to start the sync back again. One question though, how did it cause a crash ?
>>>>>
>>>>> On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>
>>>>>> I believe there were cascading failures which got these deep nodes containing still to be replicated WAL(s) - I suspect there is either some parsing bug or something which is causing the replication source to not work - also which version are you using - does it have https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens in our paths. One way to get back up is to delete these nodes but then you lose data in these WAL(s)...
>>>>>>
>>>>>> On Wed, May 22, 2013 at 2:22 PM, Amit Mor <amit.mor.m...@gmail.com> wrote:
>>>>>>
>>>>>>> va-p-hbase-02-d,60020,1369249862401
>>>>>>>
>>>>>>> On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>>>
>>>>>>>> Basically
>>>>>>>>
>>>>>>>> ls /hbase/rs and what do you see for va-p-02-d ?
>>>>>>>>
>>>>>>>> On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>>>>
>>>>>>>>> Can you do ls /hbase/rs and see what you get for 02-d - instead of looking in /replication/, could you look in /hbase/replication/rs - I want to see if the timestamps are matching or not ?
>>>>>>>>>
>>>>>>>>> Varun
>>>>>>>>>
>>>>>>>>> On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>>>>>
>>>>>>>>>> I see - so looks okay - there's just a lot of deep nesting in there - if you look into these you nodes by doing ls - you should see a bunch of WAL(s) which still need to be replicated...
>>>>>>>>>>
>>>>>>>>>> Varun
>>>>>>>>>>
>>>>>>>>>> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> 2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for
>>>>>>>>>>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>>>>>>>>>>>
>>>>>>>>>>> 01->[01->02->02]->01
>>>>>>>>>>>
>>>>>>>>>>> Looks like a bunch of cascading failures causing this deep nesting...
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.m...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> empty return:
>>>>>>>>>>>>
>>>>>>>>>>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>>>>>>>>>> []
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do an "ls" not a get here and give the output ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cZxid = 0x60281c1de
>>>>>>>>>>>>>> ctime = Wed May 22 15:11:17 EDT 2013
>>>>>>>>>>>>>> mZxid = 0x60281c1de
>>>>>>>>>>>>>> mtime = Wed May 22 15:11:17 EDT 2013
>>>>>>>>>>>>>> pZxid = 0x60281c1de
>>>>>>>>>>>>>> cversion = 0
>>>>>>>>>>>>>> dataVersion = 0
>>>>>>>>>>>>>> aclVersion = 0
>>>>>>>>>>>>>> ephemeralOwner = 0x0
>>>>>>>>>>>>>> dataLength = 0
>>>>>>>>>>>>>> numChildren = 0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What does this command show you ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>>>>>>>>>>>>>>> []
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm on hbase-0.94.2-cdh4.2.1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also what version of HBase are you running ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <va...@pinterest.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Basically,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You had va-p-hbase-02 crash - that caused all the replication related data in zookeeper to be moved to va-p-hbase-01 and have it take over for replicating 02's logs. Now each region server also maintains an in-memory state of whats in ZK, it seems like when you start up 01, its trying to replicate the 02 logs underneath but its failing to because that data is not in ZK. This is somewhat weird...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can you open the zookeepeer shell and do
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And give the output ?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is bad ... and happened twice: I had my replication-slave cluster offlined. I performed quite a massive Merge operation on it and after a couple of hours it had finished and I returned it back online. At the same time, the replication-master RS machines crashed (see first crash http://pastebin.com/1msNZ2tH) with the first exception being:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>>>>>>>>>>>>>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>>>>>>>>>>>>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>>>>>>>>>>>>>>>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>>>>>>>>>>>>>>>>>>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Before restarting the crashed RS's, I have applied a 'stop_replication' cmd. Then fired up the RS's again. They've started o.k. but once I've hit 'start_replication' they have crashed once again. The second crash log http://pastebin.com/8Nb5epJJ has the same initial exception (org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode). I've started the crash region servers again without replication and currently all is well, but I need to start replication asap.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Does anyone have an idea what's going on and how can I solve it ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Amit