Re: Regionserver died due to problem connecting to HMaster?

2010-07-23 Thread Steve Kuo
Here is an interesting anecdote. I had regionservers running on each node of an 8-node Hadoop cluster. Yesterday morning, I ran a series of MR jobs where the last MR job does batched inserts into a production MySQL server. All other MR jobs have 3 mappers and 3 reducers running on a node. The db job

Re: Regionserver died due to problem connecting to HMaster?

2010-07-21 Thread Steve Kuo
* Each node is a 4 CPU machine with max of 3 mappers and 1 regionserver. No reducer when importing data to hbase. * Each region server is allocated 4G of memory. The full options are: -Xmx4096m -server -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+H

Re: Regionserver died due to problem connecting to HMaster?

2010-07-21 Thread Jean-Daniel Cryans
So your Java process paused for 250116ms; that's how long the process wasn't responding (aka a "stop-the-world" pause). You should: - Make sure HBase isn't CPU starved (how many MR tasks run on those machines? Did you leave some room for HBase?) - Make sure there's no swap. Also set swappiness to 0 - Give mo

Re: Regionserver died due to problem connecting to HMaster?

2010-07-21 Thread Steve Kuo
J-D, Below is the closest I found in the regionserver log. There was no 'slept' in either the master or zookeeper logs. 2010-07-21 14:36:18,664 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x129f24e134a002c to sun.nio.ch.selectionkeyi...@356f144c java.io.IOException: TIMED OUT

Re: Regionserver died due to problem connecting to HMaster?

2010-07-21 Thread Jean-Daniel Cryans
ZooKeeper is only a canary, telling the region server that it was somehow partitioned from the cluster for longer than the default timeout, usually because of GC pauses. You should see lines like "slept for x, longer than y" before what you pasted. J-D On Wed, Jul 21, 2010 at 2:49 PM, Steve
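
The idea behind a "slept for x" style message can be sketched as follows. This is a minimal illustration in Python, not HBase's actual implementation: the function names `measure_oversleep` and `pause_exceeds_timeout` are hypothetical, and the 60-second session timeout in the usage note is an assumed value, not one taken from this thread. A thread asks to sleep for a fixed period and then checks how much longer than that it was actually away; a large overshoot suggests the whole process was stalled (for example by a GC pause or swapping).

```python
import time


def measure_oversleep(period_s, sleep=time.sleep, clock=time.monotonic):
    """Sleep for period_s seconds and return the overshoot in seconds.

    The overshoot is how much longer the call took than requested.
    sleep and clock are injectable so the logic can be exercised
    without actually waiting.
    """
    start = clock()
    sleep(period_s)
    elapsed = clock() - start
    return max(0.0, elapsed - period_s)


def pause_exceeds_timeout(oversleep_s, session_timeout_s):
    """Return True if the stall outlasted the (assumed) session timeout."""
    return oversleep_s > session_timeout_s
```

For example, with a pause of about 250 seconds (the 250116ms figure quoted in this thread) and an assumed session timeout of 60 seconds, `pause_exceeds_timeout(250.116, 60.0)` is true, which matches the picture of a session expiring while the process was unresponsive.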

Re: Regionserver died due to problem connecting to HMaster?

2010-07-21 Thread Steve Kuo
It's shaping up to be a ZooKeeper problem. The UI showed only 4 RSs running, but when I logged on to one of the nodes, I saw that one of the missing RSs was still running. This RS eventually hit the following exception and proceeded to shut down. I will search on all zookeeper related threa

Regionserver died due to problem connecting to HMaster?

2010-07-21 Thread Steve Kuo
I started an HBase cluster of 8 nodes, and two regionservers died before I even started any Map job writing data into it. There are several interesting exceptions, and I would really appreciate any help identifying the culprit and ways to fix it. BTW, I restarted these regionservers manually and the