> So I see 2 main issues:

>  - Your master's zookeeper session timed out. Why? Hard to tell with
> those logs since it happened before what you pasted. Very slow IO?

I'm not swapping, and I doubt the ZooKeeper session timed out because of
slow IO, since my applications aren't even close to stressing the hardware.
I've already followed the instructions at
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 to avoid this kind of
problem.
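
For what it's worth, a rough way to double-check the swapping part is to read
SwapTotal and SwapFree from /proc/meminfo. The sketch below is only an
illustration: the function name swap_used_kb is made up for this mail, it is
Linux-only, and it reports swap use at the moment it runs, not at the time of
the session expiry.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) from /proc/meminfo-style text.

    Returns None when SwapTotal or SwapFree is missing from the text.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    if "SwapTotal" not in fields or "SwapFree" not in fields:
        return None
    return fields["SwapTotal"] - fields["SwapFree"]
```

On the HBase machine it could be called as
swap_used_kb(open("/proc/meminfo").read()). A value that stays at 0 over time
would support the "not swapping" claim, but to catch a transient spike it
would have to be sampled periodically.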

> Swapping + GC?
>  - Your region server seemed to have moved elsewhere, or something
> weird like that. DNS blip? Can't tell from the logs.

Maybe a DNS blip. But how can I confirm it? Logs? I didn't move anything and
as soon as I restarted the cluster things got back on track.
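
One thing that could at least be checked from now on is whether forward and
reverse lookups for the node agree. The sketch below is an assumption about
how such a check might look, not something from the HBase docs: check_dns is
a hypothetical helper, and the hostname passed to it would be whatever name
the machine at 10.251.158.224 is supposed to have. It cannot prove a blip
that already happened; it only helps if it is run periodically and its
output is logged.

```python
import socket


def check_dns(hostname, expected_ip,
              forward=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Compare forward and reverse lookups for one host.

    Returns a list of problem descriptions; an empty list means the two
    lookups agreed at the moment of the call. The resolver functions are
    parameters so they can be replaced in tests.
    """
    problems = []
    try:
        ip = forward(hostname)
    except OSError as e:
        return ["forward lookup of %s failed: %s" % (hostname, e)]
    if ip != expected_ip:
        problems.append("%s resolves to %s, expected %s"
                        % (hostname, ip, expected_ip))
    try:
        name, aliases, _addresses = reverse(expected_ip)
    except OSError as e:
        problems.append("reverse lookup of %s failed: %s" % (expected_ip, e))
        return problems
    if hostname not in [name] + list(aliases):
        problems.append("%s maps back to %s, expected %s"
                        % (expected_ip, name, hostname))
    return problems
```

Running this every minute or so from cron and appending any non-empty result
to a file, with a timestamp, would give something to compare against the
master log the next time the scans fail.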

The HBase master log just repeated the text below for the last 8 hours
before the crash. The ZooKeeper and region server logs are clear of
errors.

2010-05-27 08:28:15,118 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey: <>}
2010-05-27 08:28:15,125 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-05-27 08:28:26,379 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey: <>}
2010-05-27 08:28:26,787 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey: <>} complete
2010-05-27 08:28:26,788 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:28:32,603 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0
2010-05-27 08:29:15,123 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey: <>}
2010-05-27 08:29:15,138 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-05-27 08:29:26,380 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey: <>}
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey: <>} complete
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:29:32,605 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0
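
Since 8 hours of this is too much to read by eye, here is a rough sketch for
scanning the master log for timestamp anomalies: a long silence between
entries (which might hint at a GC pause or a stalled process, though it
proves neither) or entries that go backwards in time. The function names and
the 90-second threshold are assumptions for illustration; the threshold was
picked only because the scanner messages above arrive well under a minute
apart.

```python
import re
from datetime import datetime, timedelta

# Matches the leading "2010-05-27 08:28:15,118 " of a log entry.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")


def parse_ts(line):
    """Return the entry timestamp, or None for continuation lines."""
    m = TS.match(line)
    if not m:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(m.group(2)))


def find_anomalies(lines, max_gap_seconds=90):
    """Return (kind, previous_ts, ts) tuples for gaps and reversals."""
    anomalies = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # wrapped line or stack trace frame
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta < 0:
                anomalies.append(("out-of-order", prev, ts))
            elif delta > max_gap_seconds:
                anomalies.append(("gap", prev, ts))
        prev = ts
    return anomalies
```

It takes any iterable of lines, so an open log file can be passed in
directly. On the excerpt above it should report nothing, since the longest
silence there is about 43 seconds.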





On Thu, May 27, 2010 at 12:49 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> From what I see, nothing happened to zookeeper.
>
> What happened:
>
> 1) The master wasn't able to scan the -ROOT- region because the
> connection was reset (same with .META.)
> 2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner:
> Scan ROOT region
> java.io.IOException: Call to /10.251.158.224:60020 failed on local
> exception: java.io.IOException: Connection reset by peer
>
> 2) The master's session with ZooKeeper timed out
> 2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x128c8b303040000 to sun.nio.ch.selectionkeyi...@744e022c
> java.io.IOException: Session Expired
>   at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>   at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>
> 3) The master was kicked out of the cluster and tries to re-enter
> 2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master
> lost its znode, trying to get a new one
>
> 4) The master was able to win the race to be the main master again
> (easy, there's only 1 machine in your cluster)
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address
> 10.251.158.224:60000 to ZooKeeper
>
> 5) This master still isn't able to scan -ROOT-
> 2010-05-27 08:41:44,270 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server:
> 10.251.158.224:60020,
> regionname: -ROOT-,,0, startKey: <>}
>
> So I see 2 main issues:
>
>  - Your master's zookeeper session timed out. Why? Hard to tell with
> those logs since it happened before what you pasted. Very slow IO?
> Swapping + GC?
>  - Your region server seemed to have moved elsewhere, or something
> weird like that. DNS blip? Can't tell from the logs.
>
> > Shouldn't ZooKeeper recover nicely? How can I prevent such errors from
> > happening in the future?
>
> Nothing happened to zookeeper. And since you have only 1 machine, even
> if the ZK process did die for some reason, how could it even recover?
> Reliability with ZK is 3 machines and more, nothing can be guaranteed
> with only 1 machine.
>
> Now, as for how to prevent this, we need to understand the root cause
> of the 2 issues I listed.
>
> Also, not sure if you saw that, but the first minute in your log is
> out of order. Very apparent with the first two lines.
>
> J-D
>
