J-D,
Below is the closest I found in the regionserver log. There was no 'slept'
in either the master or zookeeper logs.

2010-07-21 14:36:18,664 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x129f24e134a002c to sun.nio.ch.selectionkeyi...@356f144c
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-21 14:36:18,665 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 250116ms, ten times longer than scheduled: 10000
2010-07-21 14:36:18,667 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 250229 milliseconds - retrying
2010-07-21 14:36:18,760 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
2010-07-21 14:36:18,761 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020

From http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9, I think my best
bet is to stop the client from swapping (in my case the client is a Map job)
and to increase the ZooKeeper timeout settings. I will try both after the
queued data load finishes.
Other suggestions are most welcome.
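
In case it helps anyone else hunting for these pauses, here is a minimal
sketch for pulling the Sleeper warnings out of a regionserver log. The
function name find_pauses is just something I made up, and the regex is based
only on the single message format shown in the excerpt above, so other HBase
versions may word it differently. It tolerates line wrapping so it also works
on text pasted from mail.

```python
import re

# Assumption: the warning looks like the one line in the excerpt above:
#   <timestamp> WARN <logger ending in Sleeper>: We slept <N>ms,
#   ten times longer than scheduled: <M>
# \s+ is used between tokens so wrapped (mail-pasted) text still matches.
SLEPT = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+WARN\s+\S*Sleeper:\s+"
    r"We\s+slept\s+(\d+)ms,\s+ten\s+times\s+longer\s+than\s+scheduled:\s+(\d+)"
)


def find_pauses(text):
    """Return a list of (timestamp, slept_ms, scheduled_ms) tuples."""
    return [
        (ts, int(slept), int(scheduled))
        for ts, slept, scheduled in SLEPT.findall(text)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_pauses.py /path/to/regionserver.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for ts, slept, scheduled in find_pauses(fh.read()):
            print(f"{ts} slept {slept} ms (scheduled {scheduled} ms)")
```

Comparing the timestamps it prints against the JVM GC log (if GC logging is
enabled) or against swap activity on the box should show which of the two is
behind the pause.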
On Wed, Jul 21, 2010 at 2:55 PM, Jean-Daniel Cryans <[email protected]> wrote:
> ZooKeeper is only a canary, telling the region server that it was
> somehow partitioned from the cluster for longer than the default
> timeout, usually because of GC pauses. You should see messages like
> "slept for x, longer than y" before what you pasted.
>
> J-D