Dima and I work together. He's got a good amount of open-source experience on me, and I got pulled away to work on something else (MS-SQL issues, no less). He gets all the fun. :) Seriously, the issue wouldn't have been solved without him stepping up. thx Dima!
sean

On Mon, May 16, 2011 at 1:59 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Hey Dmitriy,
>
> Awesome you could figure it out. I wonder if there's something that could be done in HBase to help debug such problems... Suggestions?
>
> Also, just to make sure: this thread was started by Sean and it seems you stepped up for him... you are working together, right? At least that's what Rapportive tells me, but still trying to make sure we didn't forget someone else's problem.
>
> Good on you,
>
> J-D
>
> On Sun, May 15, 2011 at 12:50 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> The problem was the multi-NIC configuration at the master nodes.
>>
>> I saw that the process starts listening on the wrong NIC.
>>
>> I read the source code and saw that with default settings it would use whatever IP is reported by the canonical hostname, i.e. whatever is returned by something like
>>
>> ping `hostname`
>>
>> and our canonical hostname was, of course, resolving to the wrong NIC.
>>
>> I kind of did not want to edit /etc/hosts (I guessed our admins had a reason to point the hostname to that NIC), so I forcefully set 'eth0' as hbase.master.dns.interface (if I remember that property name correctly).
>>
>> It started listening on the address pointed to by eth0:0 instead of eth0's, which solved the problem anyway.
>>
>> (Funny thing though: I still couldn't make it listen on eth0's IP, only on eth0:0's, although both had reverse DNS. Apparently whatever native code is used lists both IPs for that interface and then the first one that has reverse DNS is used, so there's no way to force it to listen on the other ones.)
>>
>> Bottom line: with multi-NIC configurations, your hostname in /etc/hosts had better point to the IP you want it to listen on. If it's different, you cannot use the default configuration.
>>
>> -d
>>
>> On Sat, May 14, 2011 at 3:04 PM, Stack <[email protected]> wrote:
>>> What did you do to solve it?
>>> Thanks,
>>> St.Ack
>>>
>>> On Fri, May 13, 2011 at 6:17 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> Ok, I think the issue is largely solved. Thanks for your help, guys.
>>>>
>>>> -d
>>>>
>>>> On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>> OK, the problem seems to be multi-NIC hosting on the masters. The HBase master starts up and listens on the canonical hostname, which points to the wrong NIC. I am not sure why, so I am not changing that, but I am struggling to override it at the moment as nothing seems to work (master.dns.interface=eth2, master.dns.server=ip2 ... tried all possible combinations). It probably has something to do with reverse lookup, so I added an entry to the hosts file, to no avail so far. I will have to talk to our admins to see why we can't switch the canonical hostname to the IP that all the nodes are supposed to use.
>>>>>
>>>>> thanks.
>>>>> -d
>>>>>
>>>>> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>> Thanks, Jean-Daniel.
>>>>>>
>>>>>> Logs don't show anything abnormal (not even warnings). How soon do you think the region servers should join?
>>>>>>
>>>>>> I am guessing the sequence should be something along these lines: ZooKeeper needs to time out the old master's session first (2 mins or so), then the hot spare should win the next master election (we should probably see that happening if we tail its log, right?),
>>>>>> and then the rest of the crowd should join in, within something like the interval that seems to be governed by the hbase.regionserver.msginterval property, if I read the code correctly?
>>>>>>
>>>>>> So all in all, something like 3 minutes should probably guarantee that everybody has found the new master one way or another, right? If not, we have a problem, right?
>>>>>>
>>>>>> Thanks.
>>>>>> -Dmitriy
>>>>>>
>>>>>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>> Maybe there is something else in there; it would be useful to see logs from the region servers when you are shutting down master1 and bringing up master2.
>>>>>>>
>>>>>>> About "I have no failover for a critical component of my infrastructure.": so is the Namenode, and for the moment you can't do much about it. What's usually recommended is to put both the master and the NN together on a more reliable machine. And the master ain't that critical, almost everything works without it.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>>>>>>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in the upgrade). HBase failed over to its "backup" master server just fine in the older version. As u0 is 0.90.1+15.18, I had hoped the fix for the failover issue would be in it. However, I'm having the same issue: master1 fails or I shut it down, and master2 waits forever for the RSes to check in. Restarting the services for master2 and all RSes does nothing until I start up master1. So, essentially, I have no failover for a critical component of my infrastructure. Needless to say I'm exceptionally frustrated. Any ideas for a fix or workaround would be greatly appreciated.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Sean
>>>>>>>>
>>>>>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>> Upgrade to CDH3u0, which as far as I can tell has it:
>>>>>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>>>>>>>> Looks like my issue. We're using 0.90.1-CDH3B4. Looks like an upgrade is in order. Can you suggest a workaround?
>>>>>>>>>>
>>>>>>>>>> thx,
>>>>>>>>>>
>>>>>>>>>> Sean
>>>>>>>>>>
>>>>>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545, which was fixed in 0.90.2. Which version are you testing?
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm testing failing over from one master to another by stopping master1 (master2 is always running). Master2's web i/f kicks in and I can zk_dump, but the region servers never show up.
>>>>>>>>>>>> Logs on master2 show repeated entries below:
>>>>>>>>>>>>
>>>>>>>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
>>>>>>>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
>>>>>>>>>>>>
>>>>>>>>>>>> Obviously the RS are not checking in. Not sure why.
>>>>>>>>>>>>
>>>>>>>>>>>> Any ideas?
>>>>>>>>>>>>
>>>>>>>>>>>> thx,
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Sean Barden
>>>>>>>>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sean Barden
>>>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sean Barden
>>>>>>>> [email protected]

--
Sean Barden
[email protected]
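
For anyone debugging a similar multi-NIC setup, the sketch below is a minimal, JDK-only illustration of the mismatch Dmitriy describes: it prints the address the local canonical hostname resolves to (roughly what a default-configured master would pick up) next to every address configured on a given interface, so you can see before starting the cluster whether the hostname points at the NIC you expect. It is not HBase's actual resolver code, and the interface name "eth0" and the class name BindAddressCheck are only examples for this kind of setup.

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

/** Prints the canonical-hostname address next to every address on one interface. */
public class BindAddressCheck {
    public static void main(String[] args) throws Exception {
        // Interface name is just an example; pass your own as the first argument.
        String ifaceName = args.length > 0 ? args[0] : "eth0";

        // Roughly what ping `hostname` resolves to on this box.
        InetAddress canonical = InetAddress.getLocalHost();
        System.out.println("canonical hostname: " + canonical.getCanonicalHostName());
        System.out.println("resolves to       : " + canonical.getHostAddress());

        NetworkInterface iface = NetworkInterface.getByName(ifaceName);
        if (iface == null) {
            System.out.println("no such interface: " + ifaceName);
            return;
        }
        // Alias addresses (e.g. eth0:0) may show up here as extra entries.
        for (InetAddress addr : Collections.list(iface.getInetAddresses())) {
            // getCanonicalHostName() falls back to the bare IP string when
            // the address has no reverse DNS entry.
            String rdns = addr.getCanonicalHostName();
            boolean hasReverseDns = !rdns.equals(addr.getHostAddress());
            System.out.println(ifaceName + " has " + addr.getHostAddress()
                    + (hasReverseDns ? " (reverse DNS: " + rdns + ")" : " (no reverse DNS)"));
        }
    }
}

If the canonical address is not the one the cluster should use, the two workarounds from the thread apply: fix the hostname's entry in /etc/hosts, or point hbase.master.dns.interface (and, if needed, the corresponding nameserver property) at the right interface, keeping in mind Dmitriy's observation that the first address on the interface with a reverse DNS entry is the one that gets used.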
