Re: Regionserver died due to problem connecting to HMaster?

Steve Kuo Wed, 21 Jul 2010 17:32:12 -0700

* Each node is a 4 CPU machine with a max of 3 mappers and 1 regionserver.
No reducers run when importing data into HBase.


* Each region server is allocated 4G of memory.  The full options are:
-Xmx4096m -server -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode -server -XX:+HeapDumpOnOutOfMemoryError
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
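
A minimal sketch for checking the swap advice in the quoted reply below, assuming a Linux node where /proc/sys/vm/swappiness and /proc/meminfo are readable. The function names are made up for illustration and are not part of HBase or Hadoop.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def swap_used_kb(meminfo_text):
    """Swap currently in use, in kB (0 if the swap fields are absent)."""
    info = parse_meminfo(meminfo_text)
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def swap_warnings(swappiness_text, meminfo_text):
    """Return human-readable warnings; an empty list means nothing to flag."""
    warnings = []
    swappiness = int(swappiness_text.strip())
    if swappiness != 0:
        warnings.append("vm.swappiness is %d, expected 0" % swappiness)
    used = swap_used_kb(meminfo_text)
    if used > 0:
        warnings.append("%d kB of swap in use" % used)
    return warnings


def check_local_host():
    """Hypothetical helper: run the checks against this machine (Linux only)."""
    with open("/proc/sys/vm/swappiness") as f:
        swappiness_text = f.read()
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    return swap_warnings(swappiness_text, meminfo_text)
```

Running check_local_host() on each regionserver node while the Map tasks are active would show whether the box is swapping under load. It says nothing about GC pauses themselves.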

On Wed, Jul 21, 2010 at 5:21 PM, Jean-Daniel Cryans <[email protected]> wrote:

> So your Java process paused for 250116ms; that's how long the process
> wasn't responding (aka a "stop-the-world" pause).
>
> You should:
>
>  - Make sure HBase isn't CPU starved (how many MR tasks on those
> machines? Did you leave some room for HBase?)
>  - Make sure there's no swap. Also set swappiness to 0
>  - Give more RAM to HBase
>
> Last resort is giving a higher timeout, but that would mean that you
> are overcommitting your machines.
>
> J-D
>
> On Wed, Jul 21, 2010 at 5:15 PM, Steve Kuo <[email protected]> wrote:
> > J-D,
> >
> > Below is the closest I found in the regionserver log.  There was no
> > 'slept' in either the master or zookeeper logs.
> >
> > 2010-07-21 14:36:18,664 WARN org.apache.zookeeper.ClientCnxn: Exception
> > closing session 0x129f24e134a002c to sun.nio.ch.selectionkeyi...@356f144c
> > java.io.IOException: TIMED OUT
> >        at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > 2010-07-21 14:36:18,665 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 250116ms, ten times longer than scheduled: 10000
> > 2010-07-21 14:36:18,667 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to
> > master for 250229 milliseconds - retrying
> > 2010-07-21 14:36:18,760 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
> > 2010-07-21 14:36:18,761 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> > server on 60020
> >
> > From http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9, I think my
> > best bet is to stop the client from swapping (in my case the client is a
> > Map job) and to increase the zookeeper timeout settings.  I will try both
> > after the queued data load finishes.
> >
> > Other suggestions are most welcome.
> >
> > On Wed, Jul 21, 2010 at 2:55 PM, Jean-Daniel Cryans <[email protected]>
> > wrote:
> >
> >> ZooKeeper is only a canary, telling the region server that it was
> >> partitioned from the cluster for longer than the default timeout,
> >> usually because of GC pauses. You should see messages like
> >> "slept for x, longer than y" before what you pasted.
> >>
> >> J-D
> >>
> >>
> >
>
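
For anyone chasing the same symptom, here is a rough sketch of scanning a regionserver log for the Sleeper warnings discussed above. It assumes the message wording matches the line pasted in this thread ("We slept Nms, ... scheduled: M") and that each log entry is on a single line; other HBase versions may word it differently. The function name is illustrative only.

```python
import re

# Pattern modelled on the Sleeper warning quoted in this thread.
SLEPT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN "
    r"org\.apache\.hadoop\.hbase\.util\.Sleeper: We slept (?P<slept>\d+)ms, "
    r".*?scheduled: (?P<scheduled>\d+)"
)


def find_long_sleeps(lines):
    """Return (timestamp, slept_ms, scheduled_ms) for each Sleeper warning."""
    hits = []
    for line in lines:
        m = SLEPT.match(line)
        if m:
            hits.append(
                (m.group("ts"), int(m.group("slept")), int(m.group("scheduled")))
            )
    return hits
```

Passing an open log file (an iterable of lines) to find_long_sleeps() gives the time and length of each pause, which can then be lined up against GC logs or swap activity on the same node.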
