If a connection loss is followed by a session expiration, then you can't
recover, because the region server will be forced offline.
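
To make that distinction concrete, here is a minimal sketch of "retry on
connection loss, give up on session expiration". It is not HBase's code and
not the HBASE-3065 patch: ConnectionLoss and SessionExpired are hypothetical
stand-ins for ZooKeeper's KeeperException.ConnectionLossException and
KeeperException.SessionExpiredException, and withRetries, maxAttempts and
backoffMs are names made up for illustration.

```java
import java.util.concurrent.Callable;

class ZkRetrySketch {
    /** Hypothetical stand-in for KeeperException.ConnectionLossException. */
    public static class ConnectionLoss extends Exception {}

    /** Hypothetical stand-in for KeeperException.SessionExpiredException. */
    public static class SessionExpired extends Exception {}

    /**
     * Runs op, retrying only when it fails with ConnectionLoss.
     * Any other exception, including SessionExpired, propagates at once,
     * because an expired session cannot be recovered by retrying.
     */
    public static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConnectionLoss last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLoss e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Linear backoff: wait a little longer after each failure.
                    Thread.sleep(backoffMs * attempt);
                }
            }
        }
        // Every attempt hit a connection loss: report the last one.
        throw last;
    }
}
```

In a real client the Callable would wrap the actual ZooKeeper read, and the
caller would treat a propagated session expiration as fatal for that process.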

In a small cluster, keep only 1 ZooKeeper on the master node/namenode,
and leave the other nodes for regionserver/datanode. Heavy IO can give
weird results when mixed with ZooKeeper, as it relies on fast disk
access, and region servers tend to drive disk usage really high
(depending on what you do with them).

You should never ever swap; if you do, please review your heap allocation.

J-D

2011/4/19 bijieshan <[email protected]>:
> Thanks J-D.
> I have learned that several things can lead to a
> ConnectionLossException, such as a full GC, heavy swapping, or IO waits.
> Regarding IO waits in particular, do you have any suggestions about the
> cluster layout? In my current environment, I run ZooKeeper, HDFS, and
> HBase on the same machine. Is there any problem with that?
>
> Regards,
> Jeason Bean
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> Sent: April 19, 2011 1:14
> To: [email protected]
> Subject: Re: Does it necessarily to handle the "Zookeeper.ConnectionLossException" 
> in ZKUtil.getDataAndWatch?
>
> Take a look at the zookeeper server log, it should give you a clue. If
> it says there's too many connections, then you're hitting a well known
> problem with HBase 0.90, just look for the other threads in this
> mailing list about that.
>
> J-D
>
> On Sat, Apr 16, 2011 at 3:01 AM, bijieshan <[email protected]> wrote:
>> Thanks to Jean-Daniel Cryans for the reply.
>> I have looked at HBASE-3065, and it is indeed the same problem.
>> Liyin Tang has proposed a fix for this issue: when a
>> ConnectionLossException happens, retry a few times to reconnect to the ZK
>> server.
>> That will probably reconnect successfully most of the time, but not always.
>> In my scenario:
>> 1. The ConnectionLossException happened.
>> 2. The HMaster process aborted because its session expired.
>> 3. When I restarted the HMaster process, the ConnectionLossException
>> happened again, so the initialization failed and the HMaster aborted again.
>>
>> My question is: under what conditions does the ConnectionLossException
>> happen? I know that network problems can cause it. Are there any other
>> possible causes?
>> Thanks!
>>
>> Jieshan Bean
>>
>> ===================================================================================================================
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
>> Sent: April 15, 2011 2:27
>> To: [email protected]
>> Subject: Re: Does it necessarily to handle the 
>> "Zookeeper.ConnectionLossException" in ZKUtil.getDataAndWatch?
>>
>> I guess we should, there's
>> https://issues.apache.org/jira/browse/HBASE-3065 that's open, but in
>> your case like I mentioned in your other email there seems to be
>> something weird in your environment.
>>
>> J-D
>>
>> On Thu, Apr 14, 2011 at 12:51 AM, bijieshan <[email protected]> wrote:
>>> Hi,
>>> The "KeeperException$ConnectionLossException" exception occurred while the
>>> cluster was running. As we know, it is a ZooKeeper "recoverable"
>>> exception (and it is already handled in
>>> ZooKeeperWatcher.ZooKeeperWatcher), and the suggestion is that we should
>>> retry for a while. Is that necessary?
>>>
>>> Here are the exception logs:
>>>
>>> 2011-03-21 13:26:53,135 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22e8e6ee15f0046 Unable to get data of znode /hbase/unassigned/59ba25120921011b7d9ed4025d30c105
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/59ba25120921011b7d9ed4025d30c105
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:932)
>>>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
>>>         at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:739)
>>>         at org.apache.hadoop.hbase.master.AssignmentManager.nodeDataChanged(AssignmentManager.java:525)
>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
>>> 2011-03-21 13:26:53,137 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x22e8e6ee15f0046 Received unexpected KeeperException, re-throwing exception
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/59ba25120921011b7d9ed4025d30c105
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:932)
>>>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
>>>         at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:739)
>>>         at org.apache.hadoop.hbase.master.AssignmentManager.nodeDataChanged(AssignmentManager.java:525)
>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
>>>
>>> Looking forward to your reply!
>>> Thank you.
>>>
>>> Regards,
>>> Jeason Bean
>>>
>>>
>>
>
