The problem with your scenario is that if the sites get partitioned, both sides will go down, because ZooKeeper needs a quorum -- a strict majority of the ensemble, i.e. floor(n/2) + 1 nodes -- and with two nodes neither side can form a majority on its own. As for Hadoop, by default it does not support multiple namenodes. The masters file actually lists secondary namenodes, which only checkpoint the NameNode's metadata; they are not standby masters. There is some work on making Hadoop more resistant to the NameNode SPOF, e.g. the AvatarNode. But to reiterate: by running two ZK nodes, you are reducing the reliability of the system -- everything goes down if either ZK node goes down.
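To make the quorum arithmetic concrete, here is a minimal sketch of a three-node ensemble's zoo.cfg; the hostnames and data directory are placeholders, not taken from this cluster. With three servers, any two form a majority, so the ensemble survives losing one node; with two servers, the quorum is also two (floor(2/2) + 1 = 2), so neither may fail:

  # zoo.cfg -- identical on all three ZK hosts (zk1-zk3 are hypothetical names)
  tickTime=2000
  initLimit=10
  syncLimit=5
  # placeholder path; each host also needs a myid file here containing its server number
  dataDir=/var/zookeeper
  clientPort=2181
  # server.N=host:quorumPort:leaderElectionPort
  server.1=zk1.example.com:2888:3888
  server.2=zk2.example.com:2888:3888
  server.3=zk3.example.com:2888:3888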
Even up to a cluster size of 12, I don't think you will need more than one ZK node for reliability. You may want to co-locate a ZK node on each rack for latency reasons, but you still want an odd number of them overall. Finally, I would also be a little leery about using VMs for HBase. The issue is disk performance more than anything else: your system is likely going to be I/O bound, and disk performance is a VM's weakness.

Dave

-----Original Message-----
From: Anthony Ikeda [mailto:[email protected]]
Sent: Thursday, June 03, 2010 3:53 PM
To: [email protected]
Subject: RE: Trying to get the region servers working....

David, we are currently prototyping an Active-Active site scenario, so for now we will be using 2 ZooKeeper servers (one on site A, the other on site B).

From what I understand, Hadoop also supports multiple masters? I'm only going on the premise that the config file is named ${HADOOP_HOME}/conf/masters (plural).

What our final setup will be is yet to be determined based on testing. I have a total of 12 VMs to play with; right now I'm starting with 4 and will then build up the cluster and see what kind of performance we get.

Anthony

-----Original Message-----
From: Buttler, David [mailto:[email protected]]
Sent: Friday, 4 June 2010 2:50 AM
To: [email protected]
Subject: RE: Trying to get the region servers working....

Just to be clear, you are not actually running exactly 2 ZK nodes, are you? I think one ZK node on your master is sufficient for a cluster of this size; if that node goes down, your entire cluster is gone in any case. And remember, you need to have an odd number of ZK nodes. Three nodes probably doesn't make sense either: if your cluster is large enough to need a ZK quorum, you probably want to be able to take one node offline for maintenance and still tolerate an additional failure, which argues for five.

Dave

-----Original Message-----
From: Anthony Ikeda [mailto:[email protected]]
Sent: Wednesday, June 02, 2010 5:38 PM
To: [email protected]
Subject: Trying to get the region servers working....

I've successfully got Hadoop installed and running:

Server1 (172.28.1.138) - master, namenode, jobtracker, tasktracker
Server2 (172.28.1.139) - slave, datanode
Server3 (172.28.2.136) - slave, datanode
Server4 (172.28.2.137) - slave, datanode

I'm now trying to get HBase up and running, with HBase managing ZooKeeper. My HBase setup is:

Server1 - master, zookeeper1
Server2 - slave, regionserver
Server3 - slave, regionserver, zookeeper2
Server4 - slave, regionserver

However, the region servers keep resolving the master server to 127.0.0.1:60000. This is the log entry (${HBASE_HOME}/logs/hbase-hbase-regionserver-SVRH127.log):

2010-06-03 09:57:52,394 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2010-06-03 09:57:52,432 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: None, path: null
2010-06-03 09:57:52,433 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
2010-06-03 09:57:52,485 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got 127.0.0.1:60000
2010-06-03 09:57:52,486 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 127.0.0.1:60000 that we are up
2010-06-03 09:58:52,914 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
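The "Read ZNode /hbase/master got 127.0.0.1:60000" line shows where the bad address comes from: the region servers are not mis-resolving anything themselves, they simply read back whatever the master registered in ZooKeeper. One quick way to inspect that znode directly -- a sketch assuming a standard ZooKeeper distribution's zkCli.sh is available somewhere (HBase itself only bundles the server):

  # connect to one quorum member
  bin/zkCli.sh -server 172.28.1.138:2181
  # then, at the zk prompt, read the master's registered address:
  get /hbase/master

If this prints 127.0.0.1:60000, the master itself registered the loopback address when it started, and the problem is on the master host rather than on the region servers.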
From what I can tell in the ZooKeeper logs, ZooKeeper has started successfully and is communicating.

${HBASE_HOME}/logs/hbase-hbase-zookeeper-SVRH124.log:

2010-06-03 10:05:47,286 INFO org.apache.zookeeper.server.ZooKeeperServer: Created server
2010-06-03 10:05:47,288 INFO org.apache.zookeeper.server.quorum.Follower: Following /172.28.2.136:2888
2010-06-03 10:05:47,290 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Sending new notification.
2010-06-03 10:05:47,321 INFO org.apache.zookeeper.server.quorum.Follower: Getting a snapshot from leader
2010-06-03 10:05:47,335 INFO org.apache.zookeeper.server.persistence.FileTxnSnapLog: Snapshotting: 200000000
2010-06-03 10:06:07,272 WARN org.apache.zookeeper.server.quorum.Follower: Got zxid 0x200000001 expected 0x1
Thu Jun 3 10:14:09 EST 2010 Stopping zookeeper
Thu Jun 3 10:14:09 EST 2010 Killing zookeeper
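The "Got zxid 0x200000001 expected 0x1" warning above, together with the snapshot-mismatch warnings in the second log below, usually indicates that a quorum member came back up with stale transaction data from an earlier run. If that is what happened here, a common remedy is to stop the HBase-managed ZooKeeper on the affected host and clear its data directory so it resyncs a fresh snapshot from the leader -- a sketch, assuming the dataDir visible in the snapshot path of the second log:

  # on the ZK host holding the stale data
  ${HBASE_HOME}/bin/hbase-daemon.sh stop zookeeper
  # removing version-2 discards the local snapshot and transaction log
  rm -rf /home/hbase/zkeeper/data/version-2
  ${HBASE_HOME}/bin/hbase-daemon.sh start zookeeper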
And the second ZooKeeper log, ${HBASE_HOME}/logs/hbase-hbase-zookeeper-SVRH127.log:

2010-06-03 10:05:48,008 INFO org.apache.zookeeper.server.ZooKeeperServer: Created server
2010-06-03 10:05:48,015 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Sending new notification.
2010-06-03 10:05:48,016 INFO org.apache.zookeeper.server.persistence.FileSnap: Reading snapshot /home/hbase/zkeeper/data/version-2/snapshot.0
2010-06-03 10:05:48,020 INFO org.apache.zookeeper.server.persistence.FileTxnSnapLog: Snapshotting: 10000000b
2010-06-03 10:05:48,041 INFO org.apache.zookeeper.server.quorum.FollowerHandler: Follower sid: 1 : info : org.apache.zookeeper.server.quorum.quorumpeer$quorumser...@6f878144
2010-06-03 10:05:48,041 WARN org.apache.zookeeper.server.quorum.FollowerHandler: Sending snapshot last zxid of peer is 0x10000000b zxid of leader is 0x200000000
2010-06-03 10:05:48,048 WARN org.apache.zookeeper.server.quorum.Leader: Commiting zxid 0x200000000 from /172.28.2.136:2888 not first!
2010-06-03 10:05:48,048 WARN org.apache.zookeeper.server.quorum.Leader: First is 0
2010-06-03 10:06:07,992 INFO org.apache.zookeeper.server.NIOServerCnxn: Connected to /172.28.1.138:23600 lastZxid 0
2010-06-03 10:06:07,992 INFO org.apache.zookeeper.server.NIOServerCnxn: Creating new session 0x228fb20c3760000
2010-06-03 10:06:08,010 INFO org.apache.zookeeper.server.NIOServerCnxn: Finished init of 0x228fb20c3760000 valid:true
2010-06-03 10:06:30,002 INFO org.apache.zookeeper.server.SessionTrackerImpl: Expiring session 0x128fb1975310000
2010-06-03 10:06:30,003 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x128fb1975310000
2010-06-03 10:06:30,004 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination request for id: 0x128fb1975310000
2010-06-03 10:06:30,004 INFO org.apache.zookeeper.server.SessionTrackerImpl: Expiring session 0x128fb1975310003
2010-06-03 10:06:30,004 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x128fb1975310003
2010-06-03 10:06:30,004 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination request for id: 0x128fb1975310003
2010-06-03 10:06:30,005 INFO org.apache.zookeeper.server.SessionTrackerImpl: Expiring session 0x128fb1975310001
2010-06-03 10:06:30,005 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x128fb1975310001
2010-06-03 10:06:30,005 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination request for id: 0x128fb1975310001
2010-06-03 10:06:30,005 INFO org.apache.zookeeper.server.SessionTrackerImpl: Expiring session 0x128fb1975310002
2010-06-03 10:06:30,005 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x128fb1975310002
2010-06-03 10:06:30,005 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination request for id: 0x128fb1975310002
2010-06-03 10:09:59,904 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination request for id: 0x228fb20c3760000
2010-06-03 10:09:59,906 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x228fb20c3760000 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.28.2.136:2181 remote=/172.28.1.138:23600]
Thu Jun 3 10:14:09 EST 2010 Stopping zookeeper
Thu Jun 3 10:14:09 EST 2010 Killing zookeeper

The hbase-site.xml on each server is configured as:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://172.28.1.138/hbase</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>172.28.1.138:60000</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>172.28.1.138,172.28.2.136</value>
  </property>
</configuration>

The ${HBASE_HOME}/conf/regionservers file is identical on all four servers (Server1 172.28.1.138, Server2 172.28.1.139, Server3 172.28.2.136, Server4 172.28.2.137):

172.28.2.136
172.28.2.137
172.28.1.139

Question: why can't the region servers contact the master? I've checked the /etc/hosts file, and there are two entries that resolve the server name (127.0.0.1 and 172.28.x.x), with 127.0.0.1 coming first. But I've been told not to change this, as it affects other functions of the server.
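That hosts-file ordering is the classic cause of the symptom above: when the master starts, it resolves the local hostname, the loopback entry wins, and 127.0.0.1:60000 gets written into the /hbase/master znode, which every region server then reads back. A sketch of the two layouts, with the master's hostname as a placeholder since it isn't shown above:

  # /etc/hosts as described -- the hostname appears on the loopback line first
  127.0.0.1     localhost master-host
  172.28.1.138  master-host

  # commonly suggested layout -- localhost keeps the loopback address,
  # while the hostname resolves only to the real interface
  127.0.0.1     localhost
  172.28.1.138  master-host

With the second layout, the hostname still resolves and localhost still works, but the master publishes its routable address instead of loopback.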
Anthony Ikeda
Java Analyst/Programmer, Cardlink Services Limited
