I have seen this issue too, but mine was caused by garbage data, and I deleted it. Linux supports multiple host entries for a machine in /etc/hosts. I would like to know what you will do.
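For the multi-NIC problem Dmitriy describes below, here is roughly what I would try. This is only a sketch: the hostname and addresses are made up, but the property names are the ones in hbase-default.xml for 0.90 (note they are hbase.master.dns.interface and hbase.master.dns.nameserver, not master.dns.interface/master.dns.server).

First, in /etc/hosts on the master, map the name the cluster should use to the address of the NIC the region servers can actually reach:

  # hypothetical address and name -- substitute your own
  10.1.2.10   hbase-master.internal   hbase-master

Then, in hbase-site.xml, tell the master which interface and nameserver to use when resolving its own name:

  <property>
    <name>hbase.master.dns.interface</name>
    <value>eth2</value>
  </property>
  <property>
    <!-- hypothetical nameserver address -->
    <name>hbase.master.dns.nameserver</name>
    <value>10.1.2.1</value>
  </property>

The same pair of settings exists on the region server side as hbase.regionserver.dns.interface and hbase.regionserver.dns.nameserver.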
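On the timing question in the thread: if I read the defaults correctly, two settings bound how long failover takes. zookeeper.session.timeout is how long ZooKeeper waits before expiring the dead master's session (180000 ms by default in 0.90, so the backup cannot win the election before that), and hbase.regionserver.msginterval is how often each region server reports to the master (3000 ms by default, as far as I can tell). A sketch with the stock values, just to show where they live:

  <property>
    <!-- failover cannot start before the old master's session expires -->
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
  <property>
    <!-- how often a region server checks in with the master -->
    <name>hbase.regionserver.msginterval</name>
    <value>3000</value>
  </property>

So with stock settings, "3 minutes or so" is about right as an upper bound for everybody to find the new master; if the region servers still have not checked in well after that, something else is wrong.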
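Also, a quick way to check which address the new master actually bound to and registered (assuming the stock master port from 0.90, 60000, and tools you likely already have):

  # does the canonical name resolve to the NIC you expect?
  getent hosts $(hostname -f)

  # which local address is the master process listening on?
  # (60000 is the default hbase.master.port in 0.90)
  netstat -tlnp | grep 60000

  # what master address do the region servers see in ZooKeeper?
  echo "zk_dump" | hbase shell

If the master address in ZooKeeper is on the wrong NIC, the region servers will keep trying to check in with a master they cannot reach, which would explain the repeated "Waiting on regionserver(s) to checkin" entries below.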
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: May 15, 2011 6:05
To: [email protected]
Subject: Re: Hbase Master Failover Issue

What did you do to solve it?
Thanks,
St.Ack

On Fri, May 13, 2011 at 6:17 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Ok, I think the issue is largely solved. Thanks for your help, guys.
>
> -d
>
> On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> OK, the problem seems to be multi-NIC hosting on the masters. The HBase
>> master starts up and listens on its canonical hostname, which points
>> to the wrong NIC. I am not sure why, so I am not changing that, but I am
>> struggling to override it at the moment, as nothing seems to work
>> (master.dns.interface=eth2, master.dns.server=ip2 ... tried all
>> possible combinations). It probably has something to do with reverse
>> lookup, so I added an entry to the hosts file, to no avail so far. I will
>> have to talk to our admins to see why we can't switch the canonical
>> hostname to the IP that all the nodes are supposed to use.
>>
>> thanks.
>> -d
>>
>> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> Thanks, Jean-Daniel.
>>>
>>> The logs don't show anything abnormal (not even warnings). How soon do
>>> you think the region servers should join?
>>>
>>> I am guessing the sequence should be something along these lines:
>>> ZooKeeper needs to time out the old master's session first (2 minutes
>>> or so), then the hot spare should win the next master election (we
>>> should see that happening if we tail its log, right?), and then the
>>> rest of the crowd should join in, at an interval that seems to be
>>> governed by the hbase.regionserver.msginterval property, if I read the
>>> code correctly.
>>>
>>> So, all in all, something like 3 minutes should guarantee that everybody
>>> has found the new master one way or another, right? If not, we have a
>>> problem, right?
>>>
>>> Thanks.
>>> -Dmitriy
>>>
>>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
>>> <[email protected]> wrote:
>>>> Maybe there is something else in there; it would be useful to see logs
>>>> from the region servers while you are shutting down master1 and
>>>> bringing up master2.
>>>>
>>>> About "I have no failover for a critical component of my
>>>> infrastructure": so is the Namenode, and for the moment you can't do
>>>> much about it. What's usually recommended is to put both the master
>>>> and the NN together on a more reliable machine. And the master isn't
>>>> that critical; almost everything works without it.
>>>>
>>>> J-D
>>>>
>>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>>>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in
>>>>> the upgrade itself). HBase failed over to its "backup" master server
>>>>> just fine in the older version. As u0 is 0.90.1+15.18, I had hoped the
>>>>> fix for the failover issue would be in it. However, I'm having the
>>>>> same issue: master1 fails or I shut it down, and master2 waits forever
>>>>> for the RSes to check in. Restarting the services for master2 and all
>>>>> the RSes does nothing until I start up master1. So, essentially, I
>>>>> have no failover for a critical component of my infrastructure.
>>>>> Needless to say, I'm exceptionally frustrated. Any ideas for a fix or
>>>>> workaround would be greatly appreciated.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Sean
>>>>>
>>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]>
>>>>> wrote:
>>>>>> Upgrade to CDH3u0, which as far as I can tell has it:
>>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>>>>> Looks like my issue. We're using 0.90.1-CDH3B4. Looks like an
>>>>>>> upgrade is in order. Can you suggest a workaround?
>>>>>>>
>>>>>>> thx,
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans
>>>>>>> <[email protected]> wrote:
>>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545,
>>>>>>>> which was fixed in 0.90.2. Which version are you testing?
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm testing failing over from one master to another by stopping
>>>>>>>>> master1 (master2 is always running). Master2's web interface kicks
>>>>>>>>> in and I can zk_dump, but the region servers never show up. Logs on
>>>>>>>>> master2 show the repeated entries below:
>>>>>>>>>
>>>>>>>>> 2011-05-05 09:10:05,938 INFO
>>>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>> 2011-05-05 09:10:07,440 INFO
>>>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>>
>>>>>>>>> Obviously the RSes are not checking in. Not sure why.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>>
>>>>>>>>> thx,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sean Barden
>>>>>>>>> [email protected]
>>>>>>>
>>>>>>> --
>>>>>>> Sean Barden
>>>>>>> [email protected]
>>>>>
>>>>> --
>>>>> Sean Barden
>>>>> [email protected]
