Re: Hbase Master Failover Issue

Dmitriy Lyubimov Fri, 13 May 2011 17:33:14 -0700

OK, the problem seems to be multi-NIC hosting on the masters. The HBase
master starts up and listens on its canonical hostname, which points
to the wrong NIC. I am not sure why that is, so I am not changing it, but I am
struggling to override it at the moment, as nothing seems to work
(master.dns.interface=eth2, master.dns.server=ip2 ... tried all
possible combinations). It probably has something to do with reverse
lookup, so I added an entry to the hosts files, to no avail so far. I will have
to talk to our admins to see why we can't switch the canonical host
name to the IP that all the nodes are supposed to use.
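
For anyone who wants to check the same thing on their own masters, here is a
minimal sketch, assuming Python 3 is available on the host. The function name
canonical_name_report and the example address are hypothetical, and this goes
through the OS resolver from Python, which may not behave exactly like the
JVM lookup that HBase performs.

```python
import socket


def canonical_name_report(expected_ip, fqdn=None, forward=socket.gethostbyname_ex):
    """Report whether this host's canonical name resolves to expected_ip.

    expected_ip is the address on the NIC the cluster is supposed to use.
    fqdn and forward are injectable only to make the check easy to test.
    """
    # socket.getfqdn() asks the resolver for this host's fully qualified name.
    name = fqdn if fqdn is not None else socket.getfqdn()
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
        _, _, addresses = forward(name)
    except OSError:
        # Resolution failed; report an empty address list instead of raising.
        addresses = []
    return {
        "canonical_name": name,
        "addresses": addresses,
        "matches_expected": expected_ip in addresses,
    }
```

Calling canonical_name_report("10.0.0.2") on a master (10.0.0.2 being a
placeholder for the intended address) shows which name the host reports and
whether that name maps to the NIC the region servers can reach.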


thanks.
-d

On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Thanks, Jean-Daniel.
>
> Logs don't show anything abnormal (not even warnings). How soon do you
> think the region servers should join?
>
> I am guessing the sequence should be something along these lines:
> ZooKeeper needs to time out the old master's session first (2 mins or so),
> then the hot spare should win the next master election (we should probably
> see that happening if we tail its log, right?),
> and then the rest of the crowd should join at an interval that
> seems to be governed by the hbase.regionserver.msginterval property, if I
> read the code correctly?
>
> So all in all, something like 3 minutes should be enough for
> everybody to have found the new master one way or another, right? If not,
> we have a problem, right?
>
> Thanks.
> -Dmitriy
>
> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
> <[email protected]> wrote:
>> Maybe there is something else in there, would be useful to see logs
>> from the region servers when you are shutting down master 1 and
>> bringing up master2.
>>
>> About "I have no failover for a critical component of my
>> infrastructure.", so is the Namenode, and for the moment you can't do
>> much about it. What's usually recommended is to put both the master
>> and the NN together on a more reliable machine. And the master ain't
>> that critical, almost everything works without it.
>>
>> J-D
>>
>> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in the
>>> upgrade).  HBase failed over to its "backup" master server just fine
>>> in the older version.  As 0.90.1+15.18, I had hoped the fix would be
>>> in u0 for the failover issue.  However, I'm having the same issue.
>>> master1 fails or I shut it down, and master2 waits for RSes to check in
>>> forever.  Restarting the services for master2 and all RSes does
>>> nothing until I start up master1.  So, essentially, I have no failover
>>> for a critical component of my infrastructure.  Needless to say I'm
>>> exceptionally frustrated.  Any ideas to a fix or workaround would be
>>> greatly appreciated.
>>>
>>> Regards,
>>>
>>> Sean
>>>
>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>> Upgrade to CDH3u0, which as far as I can tell has the fix:
>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>
>>>> J-D
>>>>
>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>>> Looks like my issue.  We're using 0.90.1-CDH3B4 .  Looks like an
>>>>> upgrade is in order.  Can you suggest a workaround?
>>>>>
>>>>> thx,
>>>>>
>>>>> Sean
>>>>>
>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545
>>>>>> which was fixed in 0.90.2. Which version are you testing?
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm testing failing over from one master to another by stopping
>>>>>>> master1 (master2 is always running).  Master2 web i/f kicks in and I can
>>>>>>> zk_dump but the region servers never show up.  Logs on master2 show repeated
>>>>>>> entries below:
>>>>>>>
>>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
>>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
>>>>>>>
>>>>>>> Obviously the RS are not checking in.  Not sure why.
>>>>>>>
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> thx,
>>>>>>>
>>>>>>> --
>>>>>>> Sean Barden
>>>>>>> [email protected]
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sean Barden
>>>>> [email protected]
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Sean Barden
>>> [email protected]
>>>
>>
>
