OK, the problem seems to be multi-NIC hosting on the masters. The HBase master starts up and listens on the canonical hostname, which points to the wrong NIC. I am not sure why it is set up that way, so I am not changing it, but I am struggling to override it at the moment as nothing seems to work (master.dns.interface=eth2, master.dns.server=ip2 ... tried all possible combinations). It probably has something to do with reverse lookup, so I added an entry to the hosts files, to no avail so far. I will have to talk to our admins to see why we can't switch the canonical hostname to the IP that all the nodes are supposed to use.
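One rough way to see what name resolution returns on a master box is sketched below. It is Python standard library only and is not what HBase itself runs: the JVM may resolve names differently, and the function name describe_host is made up for illustration. It reports the canonical hostname, the address that name resolves back to, and the reverse lookup for each IP passed in.

```python
import socket


def describe_host(ips):
    """Collect forward and reverse lookups relevant to a multi-NIC host.

    `ips` is a list of address strings to reverse-resolve, for example the
    address bound to each NIC. Failed lookups are recorded as None.
    """
    # Fully qualified name the OS resolver reports for this machine.
    canonical = socket.getfqdn()

    # Address the canonical name resolves back to. On a multi-NIC host this
    # shows which interface the name actually points at.
    try:
        forward = socket.gethostbyname(canonical)
    except (OSError, UnicodeError):
        forward = None

    # Reverse lookup for each address of interest.
    reverse = {}
    for ip in ips:
        try:
            reverse[ip] = socket.gethostbyaddr(ip)[0]
        except (OSError, UnicodeError):
            reverse[ip] = None

    return {"canonical": canonical, "forward": forward, "reverse": reverse}
```

Calling describe_host with the address assigned to eth2 and comparing its reverse entry against the canonical name should show whether the hosts file entry is being picked up at all.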
thanks.
-d

On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Thanks, Jean-Daniel.
>
> Logs don't show anything abnormal (not even warnings). How soon you
> think the region servers should join?
>
> I am guessing the sequence should be something along the lines --
> zookeeper needs to timeout old master session first (2 mins or so ) ,
> then hot spare should wean next master election (we probably should
> see that happening if we can tail its log, right?)
> and then the rest of the crowd should join in something like what
> seems to be governed by hbase.regionserver.msginterval property , if i
> read the code correctly?
>
> So all -in -all probably something like 3 minutes should warrant
> everybody has found the new master one way or another , right? if not,
> we have a problem, right?
>
> Thanks.
> -Dmitriy
>
> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
> <[email protected]> wrote:
>> Maybe there is something else in there, would be useful to see logs
>> from the region servers when you are shutting down master 1 and
>> bringing up master2.
>>
>> About "I have no failover for a critical component of my
>> infrastructure.", so is the Namenode, and for the moment you can't do
>> much about it. What's usually recommended is to put both the master
>> and the NN together on a more reliable machine. And the master ain't
>> that critical, almost everything works without it.
>>
>> J-D
>>
>> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>>> So I updated one of my clusters from CDHb1 to u0 with no issues(in the
>>> upgrade). Hbase failed over to it's "backup" master server just find
>>> in the older version. As 0.90.1+15.18, I had hoped the fix would be
>>> in u0 for the failover issue. However, I'm having the same issue.
>>> master1 fails or I shut it down, master2 waits for RS'es to check in
>>> forever. Restarting the services for master2 and all RS's does
>>> nothing until I start up master1. So, essentially, I have no failover
>>> for a critical component of my infrastructure. Needless to say I'm
>>> exceptionally frustrated. Any ideas to a fix or workaround would be
>>> greatly appreciated.
>>>
>>> Regards,
>>>
>>> Sean
>>>
>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]>
>>> wrote:
>>>> Upgrade to CDH3u0 which as far as I can tell has it:
>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>
>>>> J-D
>>>>
>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>>> Looks like my issue. We're using 0.90.1-CDH3B4 . Looks like an
>>>>> upgrade is in order. Can you suggest a workaround?
>>>>>
>>>>> thx,
>>>>>
>>>>> Sean
>>>>>
>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <[email protected]>
>>>>> wrote:
>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545
>>>>>> which was fix in 0.90.2, which version are you testing?
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm testing failing over from one master to another by stopping
>>>>>>> master1(master2 is always running). Master2 web i/f kicks in and I can
>>>>>>> zk_dump but the region servers never show up. Logs on master2 show
>>>>>>> repeated
>>>>>>> entries below:
>>>>>>>
>>>>>>> 2011-05-05 09:10:05,938 INFO
>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>> 2011-05-05 09:10:07,440 INFO
>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>
>>>>>>> Obviously the RS are not checking in. Not sure why.
>>>>>>>
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> thx,
>>>>>>>
>>>>>>> --
>>>>>>> Sean Barden
>>>>>>> [email protected]
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sean Barden
>>>>> [email protected]
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Sean Barden
>>> [email protected]
>>>
>>
>
