Re: A region server stopped (timeout after trying to connect local Zookeeper)

Jean-Marc Spaggiari Wed, 21 Nov 2012 15:52:14 -0800

Can you run jps on your master and look at its logs too?

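Another thing: can you try hbase.zookeeper.quorum (all lowercase) instead of
hbase.ZooKeeper.quorum? Property names are case-sensitive, so the mixed-case
name is most likely ignored, which would explain why the region server falls
back to the default connectString=localhost:2181 shown in your log.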
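
If you want to double-check the file for this kind of typo, here is a rough
sketch (not an HBase tool, just an illustration) that flags property names
matching a known key only when case is ignored. The names EXPECTED_KEYS and
find_miscased_keys are made up for this example, the key list only covers the
properties mentioned in this thread, and the <configuration> wrapper in SAMPLE
is assumed because the posted snippet shows only the <property> elements.

```python
import xml.etree.ElementTree as ET

# Keys mentioned in this thread (an assumption, not a complete list).
# Hadoop-style configuration keys are case-sensitive.
EXPECTED_KEYS = {
    "hbase.zookeeper.quorum",
    "hbase.cluster.distributed",
    "zookeeper.session.timeout",
}


def find_miscased_keys(xml_text, expected=EXPECTED_KEYS):
    """Return (found, expected) pairs for names that differ from a known key only by case."""
    by_lower = {key.lower(): key for key in expected}
    root = ET.fromstring(xml_text)
    problems = []
    for name in root.iter("name"):
        found = (name.text or "").strip()
        wanted = by_lower.get(found.lower())
        if wanted is not None and wanted != found:
            problems.append((found, wanted))
    return problems


# Sample built from the snippet posted in this thread; the root element is assumed.
SAMPLE = """<configuration>
  <property>
    <name>hbase.ZooKeeper.quorum</name>
    <value>m146,m145,m143</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for found, wanted in find_miscased_keys(SAMPLE):
        print(f"{found!r} should probably be {wanted!r}")
```

To check a real file, read conf/hbase-site.xml into a string and pass it to
find_miscased_keys in place of SAMPLE.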


2012/11/21, [email protected] <[email protected]>:
> Hi,
>
> Here are my HBase configuration and test:
>
> 1) {$HBASE_HOME}hbase/conf/hbase-site.xml
> <property>
> <name>hbase.ZooKeeper.quorum</name>
> <value>m146,m145,m143</value>
> </property>
>
> <property>
> <name>zookeeper.session.timeout</name>
> <value>60000</value>
> </property>
>
>
> 2) {$HBASE_HOME}hbase/conf/hbase-env.sh
> export HBASE_MANAGES_ZK=false
>
>
> 3) I used "{$ZK_HOME}/bin/zkCli.sh -server m145,m146,m143" to test the
> connection, and it worked:
> [zk: m145,m146,m143(CONNECTED) 0]
>
>
> 4) From the logs, I found that the connectString was odd: the RegionServer
> did not use the "hbase.ZooKeeper.quorum" setting in conf/hbase-site.xml.
> It seemed to always use the default and tried to connect to
> "localhost:2181" in the distributed cluster:
>
>       2012-11-21 17:21:42,299 INFO org.apache.zookeeper.ZooKeeper: Initiating
> client connection, connectString=localhost:2181 sessionTimeout=60000
> watcher=regionserver:60020
>       ...
>       2012-11-21 17:21:42,313 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server localhost/127.0.0.1:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configura$
>       ...
>       2012-11-21 17:21:42,316 WARN org.apache.zookeeper.ClientCnxn: Session
> 0x0 for server null, unexpected error, closing socket connection and
> attempting reconnect java.net.ConnectException: Connection refused
>       ...  (remark: it tried above 3 times, then had FATAL error as follows)
>
>       2012-11-21 17:21:57,846 ERROR
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
> Received unexpected KeeperException, re-throwing exception
>       ...
>       2012-11-21 17:21:57,847 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> ...
>
>
>
> Please help.
>
> Thanks
>
>
>
>
>
> On 22 Nov 2012, at 1:22 AM, Jean-Marc Spaggiari wrote:
>
>> Hi,
>>
>> What do you have on your HBase configuration? Are you passing the name
>> of the Quorum servers?
>> $ cat conf/hbase-site.xml
>> ......
>>  </property>
>>    <property>
>>      <name>hbase.zookeeper.quorum</name>
>>      <value>cube,latitude,node3</value>
>>      <description>Comma separated list of servers in the ZooKeeper Quorum.
>>      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
>>      By default this is set to localhost for local and pseudo-distributed
>>      modes of operation. For a fully-distributed setup, this should be set
>>      to a full list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set
>>      in hbase-env.sh this is the list of servers which we will start/stop
>>      ZooKeeper on.
>>      </description>
>>    </property>
>> .....
>>
>> 2012/11/21, [email protected] <[email protected]>:
>>> Hi,
>>>
>>>
>>> I have the following line in /etc/hosts on all servers. Should I keep it,
>>> comment it out, or ...?
>>>
>>> 127.0.0.1       localhost
>>>
>>> Please help.
>>>
>>> Thanks
>>>
>>>
>>>
>>> On 21 Nov 2012, at 7:16 PM, [email protected] wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> Please help!!
>>>>
>>>> HBase version: 0.94
>>>> ZooKeeper: 3.4.4
>>>>
>>>> One of the region servers stopped very soon after HBase was started:
>>>>
>>>> ### Checked jps after the HBase cluster was started; the HRegionServer
>>>> process was there (*** there is no ZooKeeper instance running on this
>>>> server ***)
>>>> $ jps
>>>> 24767 Jps
>>>> 18418 TaskTracker
>>>> 24678 HRegionServer
>>>> 18156 DataNode
>>>>
>>>> ### Waited a while and checked jps again; the HRegionServer process was gone
>>>> $ jps
>>>> 18418 TaskTracker
>>>> 24784 Jps
>>>> 18156 DataNode
>>>>
>>>>
>>>> ### Here are the settings in hbase-site.xml (enabled
>>>> hbase.cluster.distributed, set up 3 ZooKeepers, timeout=60000)
>>>> <property>
>>>> <name>hbase.cluster.distributed</name>
>>>> <value>true</value>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>hbase.ZooKeeper.quorum</name>
>>>> <value>m146,m145,m143</value>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>zookeeper.session.timeout</name>
>>>> <value>60000</value>
>>>> </property>
>>>>
>>>>
>>>> ### hbase-env.sh also tells HBase not to manage a local ZooKeeper instance
>>>> export HBASE_MANAGES_ZK=false
>>>>
>>>>
>>>> ### This server can connect to the 3 ZooKeepers:
>>>> ./zkCli.sh -server m145,m146,m143          ==>  [zk: m145,m146,m143(CONNECTED) 0]
>>>>
>>>>
>>>> ### Checked the HBase log file and found something odd: it seemed to be
>>>> trying to connect to a local ZooKeeper
>>>> 2012-11-21 17:30:33,066 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>>> client connection, connectString=localhost:2181 sessionTimeout=60000
>>>> watcher=regionserver:60020
>>>>
>>>> 2012-11-21 17:31:33,254 WARN
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>>>> ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>
>>>> 2012-11-21 17:31:33,254 INFO org.apache.hadoop.hbase.util.RetryCounter:
>>>> Sleeping 2000ms before retry #1...
>>>> 2012-11-21 17:32:33,262 INFO org.apache.zookeeper.ClientCnxn: Client
>>>> session timed out, have not heard from server in 60010ms for sessionid
>>>> 0x0, closing socket connection and attempting reconnect
>>>>
>>>> 2012-11-21 17:32:33,362 WARN
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
>>>> ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for /hbase/master
>>>>
>>>> ......
>>>>
>>>> 2012-11-21 17:34:33,570 ERROR
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists
>>>> failed after 3 retries
>>>> 2012-11-21 17:34:33,571 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>>> regionserver:60020 Unable to set watcher on znode /hbase/master
>>>> 2012-11-21 17:34:33,573 ERROR
>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020
>>>> Received unexpected KeeperException, re-throwing exception
>>>> 2012-11-21 17:34:33,573 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>>> ......
>>>> 2012-11-21 17:34:33,576 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>>> loaded coprocessors are: []
>>>>
>>>> 2012-11-21 17:34:36,580 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>>> m144,60020,1353490232962: Initialization of RS failed.  Hence aborting RS.
>>>> java.io.IOException: Received the shutdown message while waiting.
>>>>    at
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:623)
>>>>    at
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:598)
>>>>    at
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:560)
>>>>    at
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:669)
>>>>    at java.lang.Thread.run(Thread.java:662)
>>>> 2012-11-21 17:34:36,581 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>>> loaded coprocessors are: []
>>>>
>>>>
>>>> Please help!
>>>> QUESTION: Is it a bug, or do I need to check something else?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>
>
