OK, I think the issue is largely solved. Thanks for your help, guys. -d
On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
> OK, the problem seems to be multi-NIC hosting on the masters. The
> HBase master starts up and uses the canonical hostname to listen on,
> which points to the wrong NIC. I am not sure why, so I am not changing
> that, but I am struggling to override it at the moment as nothing
> seems to work (master.dns.interface=eth2, master.dns.server=ip2 ...
> tried all possible combinations). It probably has something to do with
> reverse lookup, so I added an entry to the hosts files, to no avail so
> far. I will have to talk to our admins to see why we can't switch the
> canonical hostname to the IP that all the nodes are supposed to use.
>
> thanks.
> -d
>
> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> Thanks, Jean-Daniel.
>>
>> Logs don't show anything abnormal (not even warnings). How soon do
>> you think the region servers should join?
>>
>> I am guessing the sequence should be something along these lines:
>> ZooKeeper needs to time out the old master's session first (2 mins or
>> so), then the hot spare should win the next master election (we
>> should see that happening if we tail its log, right?), and then the
>> rest of the crowd should join in, at something like the interval
>> governed by the hbase.regionserver.msginterval property, if I read
>> the code correctly?
>>
>> So, all in all, something like 3 minutes should warrant that
>> everybody has found the new master one way or another, right? If
>> not, we have a problem, right?
>>
>> Thanks.
>> -Dmitriy
>>
>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
>> <[email protected]> wrote:
>>> Maybe there is something else in there; it would be useful to see
>>> logs from the region servers when you are shutting down master1 and
>>> bringing up master2.
>>>
>>> About "I have no failover for a critical component of my
>>> infrastructure": so is the Namenode, and for the moment you can't
>>> do much about it. What's usually recommended is to put both the
>>> master and the NN together on a more reliable machine. And the
>>> master ain't that critical; almost everything works without it.
>>>
>>> J-D
>>>
>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>>>> So I updated one of my clusters from CDHb1 to u0 with no issues
>>>> (in the upgrade). HBase failed over to its "backup" master server
>>>> just fine in the older version. Since u0 is 0.90.1+15.18, I had
>>>> hoped the fix for the failover issue would be in it. However, I'm
>>>> having the same issue: master1 fails or I shut it down, and
>>>> master2 waits forever for the RSes to check in. Restarting the
>>>> services for master2 and all the RSes does nothing until I start
>>>> up master1. So, essentially, I have no failover for a critical
>>>> component of my infrastructure. Needless to say, I'm exceptionally
>>>> frustrated. Any ideas for a fix or workaround would be greatly
>>>> appreciated.
>>>>
>>>> Regards,
>>>>
>>>> Sean
>>>>
>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]>
>>>> wrote:
>>>>> Upgrade to CDH3u0, which as far as I can tell has it:
>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>>>> Looks like my issue. We're using 0.90.1-CDH3B4. Looks like an
>>>>>> upgrade is in order. Can you suggest a workaround?
>>>>>>
>>>>>> thx,
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans
>>>>>> <[email protected]> wrote:
>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545,
>>>>>>> which was fixed in 0.90.2. Which version are you testing?
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm testing failing over from one master to another by
>>>>>>>> stopping master1 (master2 is always running). Master2's web
>>>>>>>> i/f kicks in and I can zk_dump, but the region servers never
>>>>>>>> show up. Logs on master2 show the repeated entries below:
>>>>>>>>
>>>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>
>>>>>>>> Obviously the RSes are not checking in. Not sure why.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> thx,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sean Barden
>>>>>>>> [email protected]
>>>>>>
>>>>>> --
>>>>>> Sean Barden
>>>>>> [email protected]
>>>>
>>>> --
>>>> Sean Barden
>>>> [email protected]
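For anyone hitting the same multi-NIC binding problem: the property
names Dmitriy tried are missing the hbase. prefix. In 0.90.x the keys
HBase reads are hbase.master.dns.interface and
hbase.master.dns.nameserver (with hbase.regionserver.dns.* as the
region server equivalents). A minimal hbase-site.xml sketch, where
eth2 comes from the thread and the nameserver address is a
hypothetical placeholder:

  <property>
    <!-- NIC whose address the master should resolve and announce;
         eth2 is the interface named in the thread -->
    <name>hbase.master.dns.interface</name>
    <value>eth2</value>
  </property>
  <property>
    <!-- DNS server to consult for the reverse lookup;
         10.0.0.2 is a hypothetical address, not from the thread -->
    <name>hbase.master.dns.nameserver</name>
    <value>10.0.0.2</value>
  </property>

The hostname still comes from a reverse lookup of that interface's
address, so an /etc/hosts workaround has to map that specific IP back
to the name the other nodes use. To confirm what the master actually
bound to, something like netstat -tlnp | grep 60000 (60000 being the
default master RPC port in 0.90) is a quick check.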
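On the failover-timing question: the two intervals discussed in the
thread map to the zookeeper.session.timeout and
hbase.regionserver.msginterval settings. A rough sketch of the
corresponding hbase-site.xml entries; the values shown are
illustrative assumptions for discussion, not defaults or
recommendations:

  <property>
    <!-- ms before ZooKeeper expires the dead master's session;
         failover cannot begin sooner than this. Set too low, a long
         GC pause can expire a healthy server's session. -->
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
  <property>
    <!-- ms between each region server's reports to the master -->
    <name>hbase.regionserver.msginterval</name>
    <value>3000</value>
  </property>

The back-of-the-envelope bound in the thread is then roughly the
session timeout, plus the election, plus one msginterval, which is
why the ~3-minute estimate (with a ~2-minute session timeout) is
plausible.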
