So your Java process paused for 250116 ms; that's how long the process wasn't responding (a so-called "stop-the-world" pause).
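To make the "slept" warning concrete, here is a minimal sketch of how a sleeping thread can detect such a pause: it sleeps for a fixed period and compares the scheduled time against the wall-clock time that actually elapsed. If the whole JVM is frozen (GC, swapping, CPU starvation), the thread cannot wake up on time, so the gap shows up here. The class `PauseDetector` and its `isLongPause` helper are hypothetical and only loosely modeled on the log message; this is not HBase's actual `Sleeper` code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical pause detector, loosely modeled on the
// "We slept Xms, ten times longer than scheduled: Y" warning (not HBase's code).
public class PauseDetector {

    // True when the observed sleep is at least `factor` times the scheduled period.
    static boolean isLongPause(long scheduledMs, long observedMs, int factor) {
        return observedMs >= scheduledMs * factor;
    }

    public static void main(String[] args) throws InterruptedException {
        final int factor = 10;       // "ten times longer than scheduled"
        final long periodMs = 200;   // assumption: short period just for the demo

        // The numbers from the log: 250116 ms observed vs. 10000 ms scheduled.
        System.out.println(isLongPause(10000, 250116, factor)); // true

        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();   // monotonic clock, unaffected by wall-clock changes
            Thread.sleep(periodMs);
            long sleptMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            // A stop-the-world pause during the sleep inflates sleptMs far past periodMs.
            if (isLongPause(periodMs, sleptMs, factor)) {
                System.err.printf("Slept %d ms, at least %d times longer than scheduled: %d ms%n",
                        sleptMs, factor, periodMs);
            }
        }
    }
}
```

Note that 250116 ms is roughly 25 times the 10000 ms period, well past the 10x threshold, and also far longer than a typical ZooKeeper session timeout, which is why the session expired.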
You should:
- Make sure HBase isn't CPU-starved (how many MR tasks run on those machines? Did you leave some room for HBase?)
- Make sure there's no swapping. Also set swappiness to 0.
- Give more RAM to HBase.

The last resort is raising the timeout, but that would just mean you are overcommitting your machines.

J-D

On Wed, Jul 21, 2010 at 5:15 PM, Steve Kuo <kuosen...@gmail.com> wrote:
> J-D,
>
> Below is the closest I found in the regionserver log. There was no 'slept'
> in either the master or ZooKeeper logs.
>
> 2010-07-21 14:36:18,664 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x129f24e134a002c to sun.nio.ch.selectionkeyi...@356f144c
> java.io.IOException: TIMED OUT
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-07-21 14:36:18,665 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 250116ms, ten times longer than scheduled: 10000
> 2010-07-21 14:36:18,667 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 250229 milliseconds - retrying
> 2010-07-21 14:36:18,760 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
> 2010-07-21 14:36:18,761 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
>
> From http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9, I think my best
> bet is to stop client swapping, which in my case would be a Map job, and to
> increase the ZooKeeper timeout settings. I will try them once the queued
> data load has finished.
>
> Other suggestions are most welcome.
>
> On Wed, Jul 21, 2010 at 2:55 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> ZooKeeper is only a canary, telling the region server that it was
>> partitioned from the cluster for longer than the default timeout
>> somehow, usually because of GC pauses. You should see lines like
>> "slept for x, longer than y" before what you pasted.
>>
>> J-D