From what I see, nothing happened to zookeeper.

What happened:

1) The master wasn't able to scan the -ROOT- region because the
connection was reset (same with .META.)
2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner:
Scan ROOT region
java.io.IOException: Call to /10.251.158.224:60020 failed on local
exception: java.io.IOException: Connection reset by peer

2) The master's session with zookeeper timed out
2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x128c8b303040000 to sun.nio.ch.selectionkeyi...@744e022c
java.io.IOException: Session Expired
   at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
   at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

3) The master was kicked out of the cluster and tried to re-enter
2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master
lost its znode, trying to get a new one

4) The master was able to win the race to be the main master again
(easy, there's only 1 machine in your cluster)
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address
10.251.158.224:60000 to ZooKeeper

5) This master still isn't able to scan -ROOT-
2010-05-27 08:41:44,270 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey: <>}

So I see 2 main issues:

 - Your master's zookeeper session timed out. Why? Hard to tell with
those logs since it happened before what you pasted. Very slow IO?
Swapping + GC?
 - Your region server seems to have moved elsewhere, or something
weird like that. DNS blip? Can't tell from the logs.
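
To rule out the network side of the second issue, something like the
sketch below could be run from the master host. It is only an
illustration: can_connect is a name I made up, and it uses nothing but
the Python standard library, not any HBase API.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, can_connect("10.251.158.224", 60020) would probe the region
server address that shows up in the log. A False result only says the
port was unreachable at that moment, not why.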

> Shouldn't Zookeeper recovery nicely? How can I prevent such error from
> happening in the future?

Nothing happened to zookeeper. And since you have only 1 machine, even
if the ZK process did die for some reason, how could it even recover?
ZK reliability requires 3 or more machines; nothing can be guaranteed
with only 1 machine.
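
The reason is that ZK needs a strict majority of its servers to be up.
A rough sketch of the arithmetic (tolerated_failures is just an
illustrative name, not a ZK function):

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a strict majority stays up."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


for n in (1, 3, 5):
    print(n, "servers ->", tolerated_failures(n), "tolerated failures")
# 1 servers -> 0 tolerated failures
# 3 servers -> 1 tolerated failures
# 5 servers -> 2 tolerated failures
```

So with 1 machine there is no room for any failure at all.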

As for how to prevent this, we first need to understand the root cause
of the 2 issues I listed.

Also, not sure if you saw that, but the first minute in your log is
out of order. Very apparent with the first two lines.
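
If you want to check a whole log for this, a small script along these
lines could flag the lines whose timestamp goes backwards. It is a
sketch that assumes every entry starts with a timestamp in the same
"2010-05-27 08:40:44,259" form as the lines you pasted; out_of_order is
a made-up name.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a log entry, e.g. "2010-05-27 08:40:44,259".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def out_of_order(lines):
    """Yield (line_number, line) for entries older than the newest one seen so far."""
    latest = None
    for number, line in enumerate(lines, start=1):
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation line (stack trace, wrapped message)
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if latest is not None and stamp < latest:
            yield number, line.rstrip("\n")
        else:
            latest = stamp
```

Calling list(out_of_order(open("master.log"))) would list the offending
entries; "master.log" is a placeholder for whatever your log file is
called.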

J-D
