> So I see 2 main issues:
> - Your master's zookeeper session timed out. Why? Hard to tell with
> those logs since it happened before what you pasted. Very slow IO?

I'm not swapping, and I doubt the zookeeper session timed out because of slow IO, since my applications aren't even close to stressing the hardware. I've already followed the instructions in http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 to avoid this kind of problem.

> Swapping + GC?
> - Your region server seemed to have moved elsewhere, or something
> weird like that. DNS blip? Can't tell from the logs.

Maybe a DNS blip. But how can I confirm it? Logs? I didn't move anything, and as soon as I restarted the cluster things got back on track.

The hbase master log just repeated the text below for the last 8 hours before the crash. The zookeeper and region server logs are clear of errors.

2010-05-27 08:28:15,118 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>}
2010-05-27 08:28:15,125 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-05-27 08:28:26,379 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020, regionname: .META.,,1, startKey: <>}
2010-05-27 08:28:26,787 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 73 row(s) of meta region {server: 10.251.158.224:60020, regionname: .META.,,1, startKey: <>} complete
2010-05-27 08:28:26,788 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-05-27 08:28:32,603 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 75.0
2010-05-27 08:29:15,123 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>}
2010-05-27 08:29:15,138 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-05-27 08:29:26,380 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020, regionname: .META.,,1, startKey: <>}
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 73 row(s) of meta region {server: 10.251.158.224:60020, regionname: .META.,,1, startKey: <>} complete
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-05-27 08:29:32,605 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 75.0

On Thu, May 27, 2010 at 12:49 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> From what I see, nothing happened to zookeeper.
>
> What happened:
>
> 1) The master wasn't able to scan the -ROOT- region because the
> connection was refused (same with .META.)
> 2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner:
> Scan ROOT region
> java.io.IOException: Call to /10.251.158.224:60020 failed on local
> exception: java.io.IOException: Connection reset by peer
>
> 2) The master's session with zookeeper timed out
> 2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x128c8b303040000 to sun.nio.ch.selectionkeyi...@744e022c
> java.io.IOException: Session Expired
> at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>
> 3) The master was kicked out of the cluster, tries to re-enter
> 2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master
> lost its znode, trying to get a new one
>
> 4) The master was able to win the race to be the main master again
> (easy, there's only 1 machine in your cluster)
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address
> 10.251.158.224:60000 to ZooKeeper
>
> 5) This master still isn't able to scan -ROOT-
> 2010-05-27 08:41:44,270 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server:
> 10.251.158.224:60020,
> regionname: -ROOT-,,0, startKey: <>}
>
> So I see 2 main issues:
>
> - Your master's zookeeper session timed out. Why? Hard to tell with
> those logs since it happened before what you pasted. Very slow IO?
> Swapping + GC?
> - Your region server seemed to have moved elsewhere, or something
> weird like that. DNS blip? Can't tell from the logs.
>
> > Shouldn't Zookeeper recover nicely? How can I prevent such error from
> > happening in the future?
>
> Nothing happened to zookeeper. And since you have only 1 machine, even
> if the ZK process did die for some reason, how could it even recover?
> Reliability with ZK is 3 machines and more, nothing can be guaranteed
> with only 1 machine.
>
> Now on how to prevent, we need to understand the root cause of the 2
> issues I listed.
>
> Also, not sure if you saw that, but the first minute in your log is
> out of order. Very apparent with the first two lines.
>
> J-D
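
On the out-of-order point above: a rough way to check a pasted log for it is to compare each line's leading timestamp with the latest one seen so far. The snippet below is only a sketch, not anything shipped with HBase. The function name out_of_order and the SAMPLE list are made up for illustration, and SAMPLE is three lines from the master log above, deliberately rearranged so that the second one is out of order. It assumes every log entry starts with a "yyyy-MM-dd HH:mm:ss,SSS" timestamp, as the lines in this thread do; anything else (stack traces, wrapped lines) is skipped.

```python
import re
from datetime import datetime

# Leading timestamp of lines such as "2010-05-27 08:28:15,118 INFO ..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\b")


def out_of_order(lines):
    """Return (line_number, timestamp) for every entry whose timestamp
    is earlier than the latest timestamp seen before it."""
    found = []
    latest = None
    for n, line in enumerate(lines, start=1):
        m = TS.match(line)
        if not m:
            continue  # stack-trace or wrapped continuation line
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if latest is not None and ts < latest:
            found.append((n, m.group(1)))
        else:
            latest = ts
    return found


# Illustration only: lines taken from the log in this thread, rearranged
# on purpose so that line 2 is older than line 1.
SAMPLE = [
    "2010-05-27 08:28:26,788 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned",
    "2010-05-27 08:28:15,118 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>}",
    "2010-05-27 08:28:32,603 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 75.0",
]

for line_number, stamp in out_of_order(SAMPLE):
    print(f"line {line_number} is out of order: {stamp}")
```

To run it against a real file, passing the open file object works the same way, e.g. out_of_order(open(path)) with path pointing at the master log. Entries it flags may just be pasting or log-rotation artifacts, so this only narrows down where to look.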