Thanks, Jean-Daniel. The logs don't show anything abnormal (not even warnings). How soon do you think the region servers should join?

I am guessing the sequence should be something along these lines: ZooKeeper needs to time out the old master's session first (2 mins or so), then the hot spare should win the next master election (we should see that happening if we tail its log, right?), and then the rest of the crowd should join in on a schedule that seems to be governed by the hbase.regionserver.msginterval property, if I read the code correctly. So all in all, something like 3 minutes should guarantee that everybody has found the new master one way or another, right? If not, we have a problem, right?
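For what it's worth, here is a rough sketch of how I'd eyeball that window from the client configuration. This is only a back-of-the-envelope check, and the fallback defaults below are my assumptions -- verify them against the hbase-default.xml that ships with your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FailoverWindow {
        public static void main(String[] args) {
            // Loads hbase-site.xml / hbase-default.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();

            // ZK must expire the dead master's session before the hot spare
            // can win the election (the fallback default here is my assumption).
            int zkSessionMs = conf.getInt("zookeeper.session.timeout", 180000);

            // Interval at which region servers report to the (new) master
            // (fallback default is likewise an assumption).
            int msgIntervalMs = conf.getInt("hbase.regionserver.msginterval", 3000);

            // Crude worst case: session expiry plus one reporting cycle;
            // the election itself is usually fast by comparison.
            long worstCaseMs = (long) zkSessionMs + msgIntervalMs;
            System.out.println("zookeeper.session.timeout:      " + zkSessionMs + " ms");
            System.out.println("hbase.regionserver.msginterval: " + msgIntervalMs + " ms");
            System.out.println("Rough failover window: ~" + (worstCaseMs / 1000)
                    + " s plus election time");
        }
    }

If the region servers still have not checked in well past that window (as in Sean's case), the timing theory goes out the window and something else must be blocking the check-in.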
Thanks.

-Dmitriy

On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Maybe there is something else in there; it would be useful to see logs
> from the region servers when you are shutting down master1 and
> bringing up master2.
>
> About "I have no failover for a critical component of my
> infrastructure": so is the Namenode, and for the moment you can't do
> much about it. What's usually recommended is to put both the master
> and the NN together on a more reliable machine. And the master ain't
> that critical; almost everything works without it.
>
> J-D
>
> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>> So I updated one of my clusters from CDHb1 to u0 with no issues (in the
>> upgrade). HBase failed over to its "backup" master server just fine
>> in the older version. Since u0 is 0.90.1+15.18, I had hoped the fix for
>> the failover issue would be in it. However, I'm having the same issue:
>> master1 fails or I shut it down, and master2 waits for the RSes to check
>> in forever. Restarting the services for master2 and all the RSes does
>> nothing until I start up master1. So, essentially, I have no failover
>> for a critical component of my infrastructure. Needless to say, I'm
>> exceptionally frustrated. Any ideas for a fix or workaround would be
>> greatly appreciated.
>>
>> Regards,
>>
>> Sean
>>
>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>> Upgrade to CDH3u0, which as far as I can tell has it:
>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>
>>> J-D
>>>
>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>> Looks like my issue. We're using 0.90.1-CDH3B4. Looks like an
>>>> upgrade is in order. Can you suggest a workaround?
>>>>
>>>> thx,
>>>>
>>>> Sean
>>>>
>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545,
>>>>> which was fixed in 0.90.2. Which version are you testing?
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm testing failing over from one master to another by stopping
>>>>>> master1 (master2 is always running). Master2's web i/f kicks in and
>>>>>> I can run zk_dump, but the region servers never show up. Logs on
>>>>>> master2 show the entries below, repeated:
>>>>>>
>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>> Waiting on regionserver(s) to checkin
>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>> Waiting on regionserver(s) to checkin
>>>>>>
>>>>>> Obviously the RSes are not checking in. Not sure why.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> thx,
>>>>>>
>>>>>> --
>>>>>> Sean Barden
>>>>>> [email protected]
>>>>>
>>>>
>>>> --
>>>> Sean Barden
>>>> [email protected]
>>>
>>
>> --
>> Sean Barden
>> [email protected]
>
