I have seen this issue too, but mine was caused by garbage data, and I deleted it. Linux supports multiple host entries for a machine in /etc/hosts. I would like to know what you will do.
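For the multi-NIC problem Dmitriy describes below, here is roughly what I would try. This is only a sketch: the hostname and addresses are made up, but the property names are the ones in hbase-default.xml for 0.90 (note they are hbase.master.dns.interface and hbase.master.dns.nameserver, not master.dns.interface/master.dns.server).

First, in /etc/hosts on the master, map the name the cluster should use to the address of the NIC the region servers can actually reach:

  # hypothetical address and name -- substitute your own
  10.1.2.10   hbase-master.internal   hbase-master

Then, in hbase-site.xml, tell the master which interface and nameserver to use when resolving its own name:

  <property>
    <name>hbase.master.dns.interface</name>
    <value>eth2</value>
  </property>
  <property>
    <!-- hypothetical nameserver address -->
    <name>hbase.master.dns.nameserver</name>
    <value>10.1.2.1</value>
  </property>

The same pair of settings exists on the region server side as hbase.regionserver.dns.interface and hbase.regionserver.dns.nameserver.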
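On the timing question in the thread: if I read the defaults correctly, two settings bound how long failover takes. zookeeper.session.timeout is how long ZooKeeper waits before expiring the dead master's session (180000 ms by default in 0.90, so the backup cannot win the election before that), and hbase.regionserver.msginterval is how often each region server reports to the master (3000 ms by default, as far as I can tell). A sketch with the stock values, just to show where they live:

  <property>
    <!-- failover cannot start before the old master's session expires -->
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
  <property>
    <!-- how often a region server checks in with the master -->
    <name>hbase.regionserver.msginterval</name>
    <value>3000</value>
  </property>

So with stock settings, "3 minutes or so" is about right as an upper bound for everybody to find the new master; if the region servers still have not checked in well after that, something else is wrong.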
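Also, a quick way to check which address the new master actually bound to and registered (assuming the stock master port from 0.90, 60000, and tools you likely already have):

  # does the canonical name resolve to the NIC you expect?
  getent hosts $(hostname -f)

  # which local address is the master process listening on?
  # (60000 is the default hbase.master.port in 0.90)
  netstat -tlnp | grep 60000

  # what master address do the region servers see in ZooKeeper?
  echo "zk_dump" | hbase shell

If the master address in ZooKeeper is on the wrong NIC, the region servers will keep trying to check in with a master they cannot reach, which would explain the repeated "Waiting on regionserver(s) to checkin" entries below.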
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: May 15, 2011 6:05
To: [email protected]
Subject: Re: Hbase Master Failover Issue

What did you do to solve it?
Thanks,
St.Ack

On Fri, May 13, 2011 at 6:17 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Ok, I think the issue is largely solved. Thanks for your help, guys.
>
> -d
>
> On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> OK, the problem seems to be multi-NIC hosting on the masters. The HBase
>> master starts up and listens on its canonical hostname, which points
>> to the wrong NIC. I am not sure why, so I am not changing that, but I am
>> struggling to override it at the moment, as nothing seems to work
>> (master.dns.interface=eth2, master.dns.server=ip2 ... tried all
>> possible combinations). It probably has something to do with reverse
>> lookup, so I added an entry to the hosts file, to no avail so far. I will
>> have to talk to our admins to see why we can't switch the canonical
>> hostname to the IP that all the nodes are supposed to use.
>>
>> thanks.
>> -d
>>
>> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> Thanks, Jean-Daniel.
>>>
>>> The logs don't show anything abnormal (not even warnings). How soon do
>>> you think the region servers should join?
>>>
>>> I am guessing the sequence should be something along these lines:
>>> ZooKeeper needs to time out the old master's session first (2 minutes
>>> or so), then the hot spare should win the next master election (we
>>> should see that happening if we tail its log, right?), and then the
>>> rest of the crowd should join in, at an interval that seems to be
>>> governed by the hbase.regionserver.msginterval property, if I read the
>>> code correctly.
>>>
>>> So, all in all, something like 3 minutes should guarantee that everybody
>>> has found the new master one way or another, right? If not, we have a
>>> problem, right?
>>>
>>> Thanks.
>>> -Dmitriy
>>>
>>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
>>> <[email protected]> wrote:
>>>> Maybe there is something else in there; it would be useful to see logs
>>>> from the region servers while you are shutting down master1 and
>>>> bringing up master2.
>>>>
>>>> About "I have no failover for a critical component of my
>>>> infrastructure": so is the Namenode, and for the moment you can't do
>>>> much about it. What's usually recommended is to put both the master
>>>> and the NN together on a more reliable machine. And the master isn't
>>>> that critical; almost everything works without it.
>>>>
>>>> J-D
>>>>
>>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>>>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in
>>>>> the upgrade itself). HBase failed over to its "backup" master server
>>>>> just fine in the older version. As u0 is 0.90.1+15.18, I had hoped the
>>>>> fix for the failover issue would be in it. However, I'm having the
>>>>> same issue: master1 fails or I shut it down, and master2 waits forever
>>>>> for the RSes to check in. Restarting the services for master2 and all
>>>>> the RSes does nothing until I start up master1. So, essentially, I
>>>>> have no failover for a critical component of my infrastructure.
>>>>> Needless to say, I'm exceptionally frustrated. Any ideas for a fix or
>>>>> workaround would be greatly appreciated.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Sean
>>>>>
>>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]>
>>>>> wrote:
>>>>>> Upgrade to CDH3u0, which as far as I can tell has it:
>>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>>>>> Looks like my issue. We're using 0.90.1-CDH3B4. Looks like an
>>>>>>> upgrade is in order. Can you suggest a workaround?
>>>>>>>
>>>>>>> thx,
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans
>>>>>>> <[email protected]> wrote:
>>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545,
>>>>>>>> which was fixed in 0.90.2. Which version are you testing?
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm testing failing over from one master to another by stopping
>>>>>>>>> master1 (master2 is always running). Master2's web interface kicks
>>>>>>>>> in and I can zk_dump, but the region servers never show up. Logs on
>>>>>>>>> master2 show the repeated entries below:
>>>>>>>>>
>>>>>>>>> 2011-05-05 09:10:05,938 INFO
>>>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>> 2011-05-05 09:10:07,440 INFO
>>>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>>
>>>>>>>>> Obviously the RSes are not checking in. Not sure why.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>>
>>>>>>>>> thx,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sean Barden
>>>>>>>>> [email protected]
>>>>>>>
>>>>>>> --
>>>>>>> Sean Barden
>>>>>>> [email protected]
>>>>>
>>>>> --
>>>>> Sean Barden
>>>>> [email protected]
