Dima and I work together. He's got a good amount of open-source experience on me, and I got pulled away to work on something else (MS-SQL issues, no less). He gets all the fun. :) Seriously, the issue wouldn't have been solved without him stepping up. thx Dima!
sean

On Mon, May 16, 2011 at 1:59 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Hey Dmitriy,
>
> Awesome you could figure it out. I wonder if there's something that could be done in HBase to help debug such problems... Suggestions?
>
> Also, just to make sure: this thread was started by Sean and it seems you stepped up for him... you are working together, right? At least that's what Rapportive tells me, but still trying to make sure we didn't forget someone else's problem.
>
> Good on you,
>
> J-D
>
> On Sun, May 15, 2011 at 12:50 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> The problem was the multi-NIC configuration at the master nodes.
>>
>> I saw that the process starts listening on the wrong NIC.
>>
>> I read the source code and saw that with default settings it would use whatever IP is reported by the canonical hostname, i.e. whatever is returned by something like
>>
>> ping `hostname`
>>
>> and our canonical hostname was, of course, resolving to the wrong NIC.
>>
>> I kind of did not want to edit /etc/hosts (I guessed our admins had a reason to point the hostname to that NIC), so I forcefully set 'eth0' as hbase.master.dns.interface (if I remember that property name correctly).
>>
>> It started listening on the address pointed to by eth0:0 instead of eth0's, which solved the problem anyway.
>>
>> (Funny thing though: I still couldn't make it listen on eth0's IP, only on eth0:0's, although both had reverse DNS. Apparently whatever native code is used lists both IPs for that interface and then the first one that has reverse DNS is used, so there's no way to force it to listen on the other ones.)
>>
>> Bottom line: with multi-NIC configurations, your hostname in /etc/hosts had better point to the IP you want it to listen on. If it's different, you cannot use the default configuration.
>>
>> -d
>>
>> On Sat, May 14, 2011 at 3:04 PM, Stack <[email protected]> wrote:
>>> What did you do to solve it?
>>> Thanks,
>>> St.Ack
>>>
>>> On Fri, May 13, 2011 at 6:17 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> Ok, I think the issue is largely solved. Thanks for your help, guys.
>>>>
>>>> -d
>>>>
>>>> On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>> OK, the problem seems to be multi-NIC hosting on the masters. The HBase master starts up and listens on the canonical hostname, which points to the wrong NIC. I am not sure why, so I am not changing that, but I am struggling to override it at the moment as nothing seems to work (master.dns.interface=eth2, master.dns.server=ip2 ... tried all possible combinations). It probably has something to do with reverse lookup, so I added an entry to the hosts file, to no avail so far. I will have to talk to our admins to see why we can't switch the canonical hostname to the IP that all the nodes are supposed to use.
>>>>>
>>>>> thanks.
>>>>> -d
>>>>>
>>>>> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>> Thanks, Jean-Daniel.
>>>>>>
>>>>>> Logs don't show anything abnormal (not even warnings). How soon do you think the region servers should join?
>>>>>>
>>>>>> I am guessing the sequence should be something along these lines: ZooKeeper needs to time out the old master's session first (2 mins or so), then the hot spare should win the next master election (we should probably see that happening if we tail its log, right?),
>>>>>> and then the rest of the crowd should join in, within something like the interval that seems to be governed by the hbase.regionserver.msginterval property, if I read the code correctly?
>>>>>>
>>>>>> So all in all, something like 3 minutes should probably guarantee that everybody has found the new master one way or another, right? If not, we have a problem, right?
>>>>>>
>>>>>> Thanks.
>>>>>> -Dmitriy
>>>>>>
>>>>>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>> Maybe there is something else in there; it would be useful to see logs from the region servers when you are shutting down master1 and bringing up master2.
>>>>>>>
>>>>>>> About "I have no failover for a critical component of my infrastructure.": so is the Namenode, and for the moment you can't do much about it. What's usually recommended is to put both the master and the NN together on a more reliable machine. And the master ain't that critical, almost everything works without it.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <[email protected]> wrote:
>>>>>>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in the upgrade). HBase failed over to its "backup" master server just fine in the older version. As u0 is 0.90.1+15.18, I had hoped the fix for the failover issue would be in it. However, I'm having the same issue: master1 fails or I shut it down, and master2 waits forever for the RSes to check in. Restarting the services for master2 and all RSes does nothing until I start up master1. So, essentially, I have no failover for a critical component of my infrastructure. Needless to say I'm exceptionally frustrated. Any ideas for a fix or workaround would be greatly appreciated.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Sean
>>>>>>>>
>>>>>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>> Upgrade to CDH3u0, which as far as I can tell has it:
>>>>>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <[email protected]> wrote:
>>>>>>>>>> Looks like my issue. We're using 0.90.1-CDH3B4. Looks like an upgrade is in order. Can you suggest a workaround?
>>>>>>>>>>
>>>>>>>>>> thx,
>>>>>>>>>>
>>>>>>>>>> Sean
>>>>>>>>>>
>>>>>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545, which was fixed in 0.90.2. Which version are you testing?
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <[email protected]> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm testing failing over from one master to another by stopping master1 (master2 is always running). Master2's web i/f kicks in and I can zk_dump, but the region servers never show up.
>>>>>>>>>>>> Logs on master2 show repeated entries below:
>>>>>>>>>>>>
>>>>>>>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
>>>>>>>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
>>>>>>>>>>>>
>>>>>>>>>>>> Obviously the RS are not checking in. Not sure why.
>>>>>>>>>>>>
>>>>>>>>>>>> Any ideas?
>>>>>>>>>>>>
>>>>>>>>>>>> thx,
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Sean Barden
>>>>>>>>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sean Barden
>>>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sean Barden
>>>>>>>> [email protected]

--
Sean Barden
[email protected]
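
For anyone debugging a similar multi-NIC setup, the sketch below is a minimal, JDK-only illustration of the mismatch Dmitriy describes: it prints the address the local canonical hostname resolves to (roughly what a default-configured master would pick up) next to every address configured on a given interface, so you can see before starting the cluster whether the hostname points at the NIC you expect. It is not HBase's actual resolver code, and the interface name "eth0" and the class name BindAddressCheck are only examples for this kind of setup.

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

/** Prints the canonical-hostname address next to every address on one interface. */
public class BindAddressCheck {
    public static void main(String[] args) throws Exception {
        // Interface name is just an example; pass your own as the first argument.
        String ifaceName = args.length > 0 ? args[0] : "eth0";

        // Roughly what ping `hostname` resolves to on this box.
        InetAddress canonical = InetAddress.getLocalHost();
        System.out.println("canonical hostname: " + canonical.getCanonicalHostName());
        System.out.println("resolves to       : " + canonical.getHostAddress());

        NetworkInterface iface = NetworkInterface.getByName(ifaceName);
        if (iface == null) {
            System.out.println("no such interface: " + ifaceName);
            return;
        }
        // Alias addresses (e.g. eth0:0) may show up here as extra entries.
        for (InetAddress addr : Collections.list(iface.getInetAddresses())) {
            // getCanonicalHostName() falls back to the bare IP string when
            // the address has no reverse DNS entry.
            String rdns = addr.getCanonicalHostName();
            boolean hasReverseDns = !rdns.equals(addr.getHostAddress());
            System.out.println(ifaceName + " has " + addr.getHostAddress()
                    + (hasReverseDns ? " (reverse DNS: " + rdns + ")" : " (no reverse DNS)"));
        }
    }
}

If the canonical address is not the one the cluster should use, the two workarounds from the thread apply: fix the hostname's entry in /etc/hosts, or point hbase.master.dns.interface (and, if needed, the corresponding nameserver property) at the right interface, keeping in mind Dmitriy's observation that the first address on the interface with a reverse DNS entry is the one that gets used.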
