RE: Queries on Zookeeper failure and RegionServer restartup

Buttler, David Tue, 20 Sep 2011 10:16:05 -0700

Have you looked at this:
http://hbase.apache.org/book.html#zookeeper

Inline...

-----Original Message-----
From: Stuti Awasthi [mailto:[email protected]] 
Sent: Tuesday, September 20, 2011 9:32 AM
To: [email protected]
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Hi David,

Thanks for your response. I am not clear about a few things here:

1. Odd number of nodes in your zookeeper ensemble.
Why is it required? Can you please explain with an example? Does that mean that 
if I have 3 nodes running zookeeper and 1 of them fails, the cluster will still 
work, and that if 2 of the 3 fail, the cluster will be down?

Buttler> Yes, this is correct.

2. " you do realize that you have to have a majority of zookeeper nodes alive 
for zookeeper to work,"
Please explain this.

Buttler> Zookeeper needs a quorum of nodes.  The algorithm that zookeeper uses 
defines a quorum as a simple majority.  I.e. more than half.  If you have 4 
nodes, and 2 die, then you have only 2 nodes alive, which is exactly half, not 
"more than half".  Zookeeper will then assume that it can no longer function.  
Therefore, the advice in the book is to have an odd number of nodes so that you 
will never be in the case of having "exactly" half of your nodes working.
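
The majority rule described above can be checked with a few lines of arithmetic. 
The following is a minimal sketch, not Zookeeper code: the function names 
quorum_size, tolerated_failures and has_quorum are hypothetical and exist only 
for illustration. It assumes nothing beyond "a quorum is strictly more than 
half of the ensemble".

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of live nodes that is strictly more than half."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can fail while a majority is still alive."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, alive: int) -> bool:
    """True if the live nodes form a simple majority of the ensemble."""
    return alive >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Illustrative table for small ensembles (hypothetical sizes 1 to 5).
    for n in range(1, 6):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this rule a 2-node ensemble tolerates no failures, and a 4-node ensemble 
tolerates only 1, the same as a 3-node ensemble, which is why an even-sized 
ensemble adds a machine without adding fault tolerance.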

Thanks

-----Original Message-----
From: Buttler, David [mailto:[email protected]] 
Sent: Tuesday, September 20, 2011 9:08 PM
To: [email protected]
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Wait, you do realize that you have to have a majority of zookeeper nodes alive 
for zookeeper to work, right?  That means that you get lower reliability with 
two nodes than one node: if either node goes down, zookeeper will give up.  
This also implies that you need to have an odd number of nodes in your 
zookeeper ensemble.

Also, hbase requires synchronized time across the cluster.  You can't rely on 
the built-in clocks to keep time synchronized to a close enough delta over a 
reasonable period of time (e.g. after a month things will fall apart).  Luckily 
this is a solved problem: ntp
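
To make the time requirement concrete, here is a rough sketch of the kind of 
check that produces the ClockOutOfSyncException quoted later in this thread. It 
is not HBase's actual implementation: the names check_clock_skew and 
ClockOutOfSyncError are hypothetical, and the use of an absolute difference is 
an assumption. Only the 30000ms limit and the 352381ms difference come from the 
error message itself.

```python
MAX_ALLOWED_SKEW_MS = 30_000  # limit quoted in the error message in this thread


class ClockOutOfSyncError(Exception):
    """Hypothetical stand-in for the exception seen in the region server log."""


def check_clock_skew(master_time_ms: int, reported_time_ms: int,
                     max_skew_ms: int = MAX_ALLOWED_SKEW_MS) -> int:
    """Return the skew in ms, or raise if it exceeds the allowed maximum.

    Assumption: the skew is the absolute difference between the two clocks.
    """
    skew = abs(master_time_ms - reported_time_ms)
    if skew > max_skew_ms:
        raise ClockOutOfSyncError(
            f"Time difference of {skew}ms > max allowed of {max_skew_ms}ms")
    return skew


if __name__ == "__main__":
    master_now = 1_316_500_563_205  # arbitrary example timestamp in ms
    try:
        # A server whose clock is 352381ms off, as in the reported log.
        check_clock_skew(master_now, master_now + 352_381)
    except ClockOutOfSyncError as err:
        print(f"rejected: {err}")
```

With clocks kept in step by ntp the difference stays far below the limit, so a 
restarting region server is accepted.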

Dave

-----Original Message-----
From: Stuti Awasthi [mailto:[email protected]] 
Sent: Tuesday, September 20, 2011 4:40 AM
To: [email protected]
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Hi Ramkrishna,
Thanks for the reply. I set the system date and rechecked; the region servers 
are now starting.

Thanks
Stuti

-----Original Message-----
From: Ramkrishna S Vasudevan [mailto:[email protected]] 
Sent: Tuesday, September 20, 2011 1:56 PM
To: [email protected]
Subject: RE: Queries on Zookeeper failure and RegionServer restartup

Regarding the ClockOutOfSyncException, just check that all nodes in your 
cluster have the same time set. This problem occurs when there are time 
differences between them.

Best Regards
Ram

-----Original Message-----
From: Stuti Awasthi [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 1:28 PM
To: [email protected]
Subject: Queries on Zookeeper failure and RegionServer restartup

Hi all,

I have a 2-node cluster. I run a RegionServer and Zookeeper on both nodes, with 
the Master on one and a Backup Master on the other.

Here is what I did: I stopped Zookeeper on one node, and after that I was 
unable to access HBase.

ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to 
connect to ZooKeeper but the connection closes immediately. This could be a 
sign that the server has too many connections (30 is the default).
Consider inspecting your ZK server logs for that error and then make sure you 
are reusing HBaseConfiguration as often as you can. See HTable's javadoc for 
more information.

Queries :

1. If the cluster becomes inaccessible when one zookeeper node goes down, then 
why are we running multiple zookeeper nodes?

2. Is there some way to keep the cluster accessible when only one of the 
zookeeper nodes is working?

Another test:
If I stop the RegionServer and Master on one node, the Backup Master becomes 
the Master and I can access the HBase cluster. But when I try to restart the 
RegionServer on the node where I shut it down, it gives me the following 
error. How do I fix this?

2011-09-20 12:06:03,647 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=master,60020,1316500563205, load=(requests=0, regions=0, usedHeap=22, maxHeap=993): Unhandled exception: org.apache.hadoop.hbase.ClockOutOfSyncException: Server master,60020,1316500563205 has been rejected; Reported time is too far out of sync with master.  Time difference of 352381ms > max allowed of 30000ms
org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server master,60020,1316500563205 has been rejected; Reported time is too far out of sync with master.  Time difference of 352381ms > max allowed of 30000ms
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1515)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1479)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server master,60020,1316500563205 has been rejected; Reported time is too far out of sync with master.  Time difference of 352381ms > max allowed of 30000ms
        at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:181)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:129)
        at org.apache.hadoop.hbase.master.HMaster.regionServerStartup(HMaster.java:615)

Your inputs are required

Thanks
Stuti

