Hi JM, Thank you!
It is case sensitive indeed. Simply changing the 'Z' to 'z' brings back ALL RegionServers (and a 'Z' could bring them all down too). I spent a few hours on other areas and hadn't realized this 'Z' effect. Thanks again.

On 22 Nov 2012, at 8:39 AM, Jean-Marc Spaggiari wrote:

> I think the MAIN difference is the uppercase on the property... Seems
> that hbase-site.xml is case sensitive (which seems to be normal in
> Java and unix world).
>
> You might want to retry by putting back the uppercase to see if this
> was the issue.
>
> JM
>
> 2012/11/21, [email protected] <[email protected]>:
>> Hi
>>
>> I changed the order of ZooKeepers in the value of hbase.zookeeper.quorum,
>> from "m146,m145,m143" to "m143,m145,m146", set timeout from 60000 to 70000,
>> and commented out lzo property. It works now, here is the diff
>>
>> 1) $ diff hbase-site.xml hbase-site.xml.xxx
>> 41,44c41,43
>> <
>> < <property>
>> < <name>hbase.zookeeper.quorum</name>
>> < <value>m143,m145,m146</value>
>> ---
>> > <property>
>> > <name>hbase.ZooKeeper.quorum</name>
>> > <value>m146,m145,m143</value>
>> 49c48,55
>> < <value>70000</value>
>> ---
>> > <value>60000</value>
>> > </property>
>> >
>> > <!--
>> > /**
>> > <property>
>> > <name>hbase.regionserver.codecs</name>
>> > <value>lzo,gz</value>
>> 50a57,58
>> > **/
>> > -->
>>
>> Above is the only change today.
>>
>> 2) hbase log:
>> 2012-11-22 07:26:19,431 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> client connection, connectString=m145:2181,m143:2181,m146:2181
>> sessionTimeout=70000 watcher=regionserver:6$
>>
>> I don't know why but it works now. It seems that hbase somehow could not
>> read in hbase-site.xml correctly.
>>
>> Thanks
>>
>> On 22 Nov 2012, at 7:51 AM, Jean-Marc Spaggiari wrote:
>>
>>> Can you do JPS on your master and look at the logs too?
>>>
>>> Another thing, can you try with hbase.zookeeper.quorum instead of
>>> hbase.ZooKeeper.quorum?
>>>
>>> 2012/11/21, [email protected] <[email protected]>:
>>>> Hi,
>>>>
>>>> Here are my HBase configuration and test:
>>>>
>>>> 1) {$HBASE_HOME}hbase/conf/hbase-site.xml
>>>> <property>
>>>> <name>hbase.ZooKeeper.quorum</name>
>>>> <value>m146,m145,m143</value>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>zookeeper.session.timeout</name>
>>>> <value>60000</value>
>>>> </property>
>>>>
>>>> 2) {$HBASE_HOME}hbase/conf/hbase-env.sh
>>>> export HBASE_MANAGES_ZK=false
>>>>
>>>> 3) I used " {$ZK_HOME}/bin/zkCli.sh -server m145,m146,m143" to test the
>>>> connection, it worked
>>>> [zk: m145,m146,m143(CONNECTED) 0]
>>>>
>>>> 4) from the logs, I found that the connectString was odd, the RegionServer
>>>> did not use the setting of "hbase.ZooKeeper.quorum" in conf/hbase-site.xml,
>>>> it seemed that it always used the default and tried to connect
>>>> "localhost:2181" in the distributed cluster:
>>>>
>>>> 2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>>> client connection, connectString=localhost:2181 sessionTimeout=60000
>>>> watcher=regionserver:60020
>>>> ...
>>>> 2012-11-21 17:21:42,313 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server localhost/127.0.0.1:2181. Will not attempt to
>>>> authenticate using SASL (Unable to locate a login configura$
>>>> ...
>>>> 2012-11-21 17:21:42,316 WARN org.apache.zookeeper.ClientCnxn: Session 0x0
>>>> for server null, unexpected error, closing socket connection and attempting
>>>> reconnect java.net.ConnectException: Connection refused
>>>> ... (remark: it tried above 3 times, then had FATAL error as follows)
>>>>
>>>> 2012-11-21 17:21:57,846 ERROR
>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>>>> Received unexpected KeeperException, re-throwing exception
>>>> ...
>>>> 2012-11-21 17:21:57,847 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>>> ...
>>>>
>>>> Please help.
>>>>
>>>> Thanks
>>>>
>>>> On 22 Nov 2012, at 1:22 AM, Jean-Marc Spaggiari wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What do you have on your HBase configuration? Are you passing the name
>>>>> of the Quorum servers?
>>>>> $ cat conf/hbase-site.xml
>>>>> ......
>>>>> </property>
>>>>> <property>
>>>>> <name>hbase.zookeeper.quorum</name>
>>>>> <value>cube,latitude,node3</value>
>>>>> <description>Comma separated list of servers in the ZooKeeper Quorum.
>>>>> For example,
>>>>> "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
>>>>> By default this is set to localhost for local and pseudo-distributed modes
>>>>> of operation. For a fully-distributed setup, this should be set to a full
>>>>> list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
>>>>> this is the list of servers which we will start/stop ZooKeeper on.
>>>>> </description>
>>>>> </property>
>>>>> .....
>>>>>
>>>>> 2012/11/21, [email protected] <[email protected]>:
>>>>>> Hi,
>>>>>>
>>>>>> I have the following line in /etc/hosts in all servers, should I keep it
>>>>>> or comment it out or ...?
>>>>>>
>>>>>> 127.0.0.1 localhost
>>>>>>
>>>>>> Please help.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 21 Nov 2012, at 7:16 PM, [email protected] wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Please help!!
>>>>>>>
>>>>>>> HBase version: 0.94
>>>>>>> ZooKeeper: 3.4.4
>>>>>>>
>>>>>>> One of the regional servers stopped very quickly after HBASE is started:
>>>>>>>
>>>>>>> ### Check JPS after HBASE cluster was started, could find the
>>>>>>> HRegionServer process (*** there is no any ZooKeeper instance running in
>>>>>>> this server ***)
>>>>>>> $ jps
>>>>>>> 24767 Jps
>>>>>>> 18418 TaskTracker
>>>>>>> 24678 HRegionServer
>>>>>>> 18156 DataNode
>>>>>>>
>>>>>>> ### Wait a while and checked JPS again, HRegionServer process gone
>>>>>>> $ jps
>>>>>>> 18418 TaskTracker
>>>>>>> 24784 Jps
>>>>>>> 18156 DataNode
>>>>>>>
>>>>>>> ### Here is the setting in hbase-site.xml ( enabled
>>>>>>> hbase.cluster.distributed, set up 3 ZooKeepers, timeout= 60000)
>>>>>>> <property>
>>>>>>> <name>hbase.cluster.distributed</name>
>>>>>>> <value>true</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>> <name>hbase.ZooKeeper.quorum</name>
>>>>>>> <value>m146,m145,m143</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>> <name>zookeeper.session.timeout</name>
>>>>>>> <value>60000</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> ### hbase-env.sh also tells HBASE not to manage local instance of ZooKeeper
>>>>>>> export HBASE_MANAGES_ZK=false
>>>>>>>
>>>>>>> ###This server can connect to the 3 ZooKeepers,
>>>>>>> ./zkCli.sh -server m145,m146,m143 ==> [zk: m145,m146,m143(CONNECTED) 0]
>>>>>>>
>>>>>>> ### checked the hbase log file, found something odd, seemed that it tried
>>>>>>> to connect local ZooKeeper
>>>>>>> 2012-11-21 17:30:33,066 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>>>>>> client connection, connectString=localhost:2181 sessionTimeout=60000
>>>>>>> watcher=regionserver:60020
>>>>>>>
>>>>>>> 2012-11-21 17:31:33,254 WARN
>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>>>>>>> ZooKeeper exception:
>>>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>>>>
>>>>>>> 2012-11-21 17:31:33,254 INFO org.apache.hadoop.hbase.util.RetryCounter:
>>>>>>> Sleeping 2000ms before retry #1...
>>>>>>> 2012-11-21 17:32:33,262 INFO org.apache.zookeeper.ClientCnxn: Client
>>>>>>> session timed out, have not heard from server in 60010ms for sessionid
>>>>>>> 0x0, closing socket connection and attempting reconnect
>>>>>>>
>>>>>>> 2012-11-21 17:32:33,362 WARN
>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>>>>>>> ZooKeeper exception:
>>>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>>>>
>>>>>>> ......
>>>>>>>
>>>>>>> 2012-11-21 17:34:33,570 ERROR
>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists
>>>>>>> failed after 3 retries
>>>>>>> 2012-11-21 17:34:33,571 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>>>>>> regionserver:60020 Unable to set watcher on znode /hbase/master
>>>>>>> 2012-11-21 17:34:33,573 ERROR
>>>>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>>>>>>> Received unexpected KeeperException, re-throwing exception
>>>>>>> 2012-11-21 17:34:33,573 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>>>>>> ......
>>>>>>> 2012-11-21 17:34:33,576 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>>>>>> loaded coprocessors are: []
>>>>>>>
>>>>>>> 2012-11-21 17:34:36,580 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>>>>>> m144,60020,1353490232962: Initialization of RS failed. Hence aborting RS.
>>>>>>> java.io.IOException: Received the shutdown message while waiting.
>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:623)
>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:598)
>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:560)
>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:669)
>>>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>>>> 2012-11-21 17:34:36,581 FATAL
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>>>>>> loaded coprocessors are: []
>>>>>>>
>>>>>>> Please help!
>>>>>>> QUESTION: Is it a bug and I need to check something else?
>>>>>>>
>>>>>>> Thanks
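
Note for anyone who finds this thread with the same symptom (RegionServer trying localhost:2181 even though a quorum is configured): the cause here was the property name being written as hbase.ZooKeeper.quorum instead of hbase.zookeeper.quorum, so the setting was apparently never picked up. Below is a rough sketch of how one might scan an hbase-site.xml for that kind of case-only mismatch. It is plain Python and not part of HBase; KNOWN_KEYS, find_case_mismatches and SAMPLE are hypothetical names made up for this illustration, and the key list only contains the four properties mentioned in this thread, so it would need extending for a real setup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, incomplete list of expected property names (only the ones
# mentioned in this thread). Extend it for your own configuration.
KNOWN_KEYS = {
    "hbase.cluster.distributed",
    "hbase.zookeeper.quorum",
    "zookeeper.session.timeout",
    "hbase.regionserver.codecs",
}


def find_case_mismatches(xml_text, known_keys=KNOWN_KEYS):
    """Return (found, expected) pairs for names that match a known key
    only when upper/lower case is ignored."""
    by_lower = {key.lower(): key for key in known_keys}
    mismatches = []
    root = ET.fromstring(xml_text)
    for name_el in root.iter("name"):
        found = (name_el.text or "").strip()
        expected = by_lower.get(found.lower())
        if expected is not None and expected != found:
            mismatches.append((found, expected))
    return mismatches


# Sample input modelled on the configuration quoted in this thread.
SAMPLE = """<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.ZooKeeper.quorum</name>
    <value>m146,m145,m143</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for found, expected in find_case_mismatches(SAMPLE):
        print(f"{found!r} should probably be {expected!r}")
```

To check a real file, read its contents and pass the text to find_case_mismatches instead of SAMPLE. The check only catches names that differ from a listed key by case; a key that is misspelled in any other way will not be reported.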
