From what I see, nothing happened to ZooKeeper. What happened:

1) The master wasn't able to scan the -ROOT- region because the connection was refused (same with .META.)

2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan ROOT region
java.io.IOException: Call to /10.251.158.224:60020 failed on local exception: java.io.IOException: Connection reset by peer

2) The master's session with ZooKeeper timed out

2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x128c8b303040000 to sun.nio.ch.selectionkeyi...@744e022c
java.io.IOException: Session Expired
    at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

3) The master was kicked out of the cluster and tried to re-enter

2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master lost its znode, trying to get a new one

4) The master was able to win the race to be the main master again (easy, there's only 1 machine in your cluster)

org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address 10.251.158.224:60000 to ZooKeeper

5) This master still isn't able to scan -ROOT-

2010-05-27 08:41:44,270 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>}

So I see 2 main issues:

- Your master's ZooKeeper session timed out. Why? Hard to tell with those logs since it happened before what you pasted. Very slow IO? Swapping + GC?

- Your region server seems to have moved elsewhere, or something weird like that. DNS blip? Can't tell from the logs.

> Shouldn't Zookeeper recovery nicely? How can I prevent such error from
> happening in the future?

Nothing happened to ZooKeeper. And since you have only 1 machine, even if the ZK process did die for some reason, how could it even recover?
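
To put a number on the single-machine question, here is a rough sketch of the majority arithmetic a ZooKeeper ensemble depends on: it only keeps serving while a strict majority of its servers is up. The function name `tolerated_failures` is made up for this illustration and is not part of any ZooKeeper or HBase API.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be down while a strict majority is still up.

    Hypothetical helper for illustration only, not a ZooKeeper API.
    """
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    # A strict majority needs more than half of the servers alive,
    # so at most (n - 1) // 2 of them may fail.
    return (ensemble_size - 1) // 2


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} server(s): survives {tolerated_failures(n)} failure(s)")
```

With 1 server (or even 2) the answer is zero failures, so there is nothing left to recover with.
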
Reliability with ZK starts at 3 machines; nothing can be guaranteed with only 1 machine.

Now, on how to prevent this: we first need to understand the root cause of the 2 issues I listed.

Also, not sure if you saw it, but the first minute of your log is out of order. It's very apparent in the first two lines.

J-D
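
P.S. If you want to check the ordering yourself before pasting a log, something along these lines might help. It is only a sketch: it assumes every log entry starts with a timestamp in the format seen above (`2010-05-27 08:40:44,259`), skips lines without one (stack-trace lines), and the name `out_of_order` is invented for this example. The sample lines reuse messages from this thread, deliberately shuffled to show a hit; they are not your actual log order.

```python
import re
from datetime import datetime

# Leading timestamp of a log4j-style line, e.g. "2010-05-27 08:40:44,259 WARN ..."
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def out_of_order(lines):
    """Return (line_number, timestamp_text) for every line whose timestamp
    is earlier than the latest timestamp seen before it."""
    found = []
    latest = None
    for number, line in enumerate(lines, start=1):
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # stack-trace or continuation line, no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if latest is not None and stamp < latest:
            found.append((number, match.group(1)))
        else:
            latest = stamp
    return found


# Illustrative input only: messages from this thread in a shuffled order.
sample = [
    "2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception closing session",
    "2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan ROOT region",
    "java.io.IOException: Connection reset by peer",
    "2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master lost its znode, trying to get a new one",
]

if __name__ == "__main__":
    for number, stamp in out_of_order(sample):
        print(f"line {number} goes back in time: {stamp}")
```

To run it against a real file, pass `open("your-master.log")` (path is a placeholder) instead of `sample`.
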