Jean,
The DNS mismatch was the cause. Good eyes!
Here is what I had to do:
1) Changed the hostnames to fully qualified ones on all machines
file : /etc/sysconfig/network
before: devperf-sn6
now : devperf-sn6.pcs.hds.com
2) Used fully qualified hostnames (FQHN) in 'hbase-site.xml'
before : devperf-sn6
now : devperf-sn6.pcs.hds.com
Even after a restart, ZooKeeper was still looking up the old
hostnames and erroring out.
3) I had some shorthand aliases in /etc/hosts (on the master node)
ip_address1 hmaster
ip_address2 rs1
I deleted these (and restarted the machine just to be sure)
4) Deleted the ZooKeeper dir on the ZK machines (this one was not very obvious!)
rm -rf /tmp/hbase-hadoop
Only then did things start working!
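For anyone hitting the same thing, here is a rough sketch (plain Python, my own helper name, nothing from HBase itself) of how the mismatch could be spotted mechanically. It assumes you have collected server names in the "host,port,startcode" form that appears in the logs, and it flags any port/startcode pair that shows up under more than one hostname. Two genuinely different servers sharing a port and a startcode would be a false positive, so treat the result as a hint only.

```python
from collections import defaultdict


def find_hostname_mismatches(server_names):
    """Return {(port, startcode): [hosts]} for entries seen under several hostnames.

    Each input string is expected to look like 'host,port,startcode'.
    """
    groups = defaultdict(set)
    for name in server_names:
        host, port, startcode = name.strip().split(",")
        groups[(port, startcode)].add(host)
    # Keep only the groups where the same server appears under different names.
    return {key: sorted(hosts) for key, hosts in groups.items() if len(hosts) > 1}


if __name__ == "__main__":
    # The two names from the master log quoted in this thread.
    names = [
        "devperf-sn10,60020,1307732557915",
        "devperf-sn10.pcs.hds.com,60020,1307732557915",
    ]
    for (port, startcode), hosts in find_hostname_mismatches(names).items():
        print(port, startcode, hosts)
```

An empty result only means no duplicates were seen in the names you fed in, not that DNS is set up correctly.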
I am happy to document this on the wiki somewhere if it might help others.
A) Are there any other best practices for keeping DNS / host lookups straight?
A2) Would it be safer if I used IP addresses? Or is reverse DNS
required even then?
B) I do miss the shorthand aliases in /etc/hosts. Is there a way to have
these aliases without interfering with HBase / ZooKeeper?
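Related to A), this is roughly the check I have in mind to run on every node: resolve a hostname to an IP, reverse-resolve that IP, and compare the two names. It is only a sketch using Python's standard socket module; check_host and its resolve/reverse parameters are my own illustrative names, and I am assuming that "forward and reverse lookups agree" is the property that matters here.

```python
import socket
import sys


def check_host(name, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ip, reverse_name, ok) for one hostname.

    ok is True when the reverse lookup of the resolved IP gives back the
    same name (case-insensitive). resolve/reverse can be swapped for testing.
    """
    ip = resolve(name)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (hostname, aliases, ips)
    return ip, reverse_name, reverse_name.lower() == name.lower()


if __name__ == "__main__":
    # Check the names given on the command line, or this machine's own FQDN.
    for host in sys.argv[1:] or [socket.getfqdn()]:
        try:
            ip, reverse_name, ok = check_host(host)
            print(host, ip, reverse_name, "OK" if ok else "MISMATCH")
        except OSError as err:  # covers failed forward or reverse lookups
            print(host, "lookup failed:", err)
```

Running it on each node with the full list of cluster hostnames should show quickly whether any machine sees a short name where the others see the fully qualified one.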
Thanks for your help!
http://sujee.net
On Fri, Jun 10, 2011 at 2:38 PM, Jean-Daniel Cryans <[email protected]> wrote:
> There's a DNS mismatch:
>
> devperf-sn10,60020,1307732557915
> devperf-sn10.pcs.hds.com,60020,1307732557915
>
> And 0.90 has a big regression with that (0.92 already has the fixes,
> but it's not released yet). Make sure your nodes all resolve the same
> hostnames per http://hbase.apache.org/book.html#dns
>
> BTW the clue comes from those kinda lines:
>
> 2011-06-10 12:03:50,975 INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: No HServerInfo
> found for devperf-sn10.pcs.hds.com,60020,1307732557915
>
> J-D
>
> On Fri, Jun 10, 2011 at 9:26 PM, Sujee Maniyam <[email protected]> wrote:
> > looks like this RS has the ROOT region. The shutdown was initiated by a
> > kill <pid> command by me.
> > any thing specific I should look for in logs / config?
> >
> > thanks
> > http://sujee.net
> >
> >
> > On Fri, Jun 10, 2011 at 2:09 PM, Stack <[email protected]> wrote:
> >
> >> That looks like we're waiting on the shutdown of the -ROOT- region?
> >> Is that so. Anything on why it won't go down earlier in the log?
> >> St.Ack
> >>
> >>
> >> On Fri, Jun 10, 2011 at 12:23 PM, Sujee Maniyam <[email protected]> wrote:
> >> > Hi all
> >> > I am running Hbase on a 6 node cluster. HBase comes up fine, I can
> >> > create a test table and put rows and scan. But I can't cleanly shut
> >> > it down. the stop-hbase command goes on for ever printing dots. And
> >> > I can see a couple of RegionServers are not terminating.
> >> >
> >> > here are the details:
> >> >
> >> > 5 RS , 1 Master
> >> > 3 zookeepers
> >> >
> >> > hbase : 0.90.1-cdh3u0, r (both hadoop & hbase are Cloudera cdh 3
> >> > distributions)
> >> > hadoop : 0.20.2-cdh3u0
> >> >
> >> > master-log : http://pastebin.com/tBvJDPHc
> >> > rserver log : http://pastebin.com/EsWYAuUk
> >> > hbase_site.xml : http://pastebin.com/sU7EM2QK
> >> >
> >> >
> >> > During the shutdown, I see this in the region server logs:
> >> >
> >> > 2011-06-10 12:03:55,940 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 70236052
> >> > 2011-06-10 12:03:58,942 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 70236052
> >> > ....
> >> >
> >> >
> >> > thanks very much for your help!
> >> > Sujee Maniyam
> >> > http://sujee.net
> >> >
> >>
> >
>