Shen,

It's a design decision: historically we have preferred to let cluster
managers decide whether to restart a process that died right away, or
to investigate why it died first and then decide what to do.
You can easily write tools that will restart the region servers if
they die, but the fact that they die in the first place is the real
issue.
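
To give an idea of what such a tool could look like, here is a minimal
sketch in Python. It is not something that ships with HBase. It assumes
the stock bin/hbase-daemon.sh script is run from the HBase install
directory and that pgrep is available; watch(), is_running() and the
restart cap are names and choices made up for this illustration.

```python
import subprocess
import time
from typing import Callable, Optional

# Assumption: run from the HBase install directory, using the stock daemon script.
START_CMD = ["bin/hbase-daemon.sh", "start", "regionserver"]


def is_running(pattern: str = "HRegionServer") -> bool:
    """Return True if pgrep finds a process whose command line matches pattern."""
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def watch(check: Callable[[], bool] = is_running,
          restart: Callable[[], object] = lambda: subprocess.call(START_CMD),
          interval: float = 30.0,
          max_restarts: int = 3,
          max_checks: Optional[int] = None,
          sleep: Callable[[float], object] = time.sleep) -> int:
    """Poll check(); call restart() when it reports the process is gone.

    Stops after max_restarts restarts so that a server that keeps dying
    gets looked at by a human instead of being restarted forever.
    Returns the number of restarts performed.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if not check():
            if restarts >= max_restarts:
                break  # give up: repeated deaths need investigation
            restart()
            restarts += 1
        sleep(interval)
    return restarts
```

Calling watch() with the defaults polls every 30 seconds and gives up
after three restarts. The cap is deliberate: a blind restart loop would
hide exactly the kind of problem you are seeing.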

Looking at your logs, I cannot tell exactly why your region server
died (also, the master log you gave refers to the death of PC3 one
hour later, not PC4). I do see that the ZooKeeper server expired the
session almost a whole minute before the region server figured it out,
but the RS is really quiet... Is there anything else running on that
cluster that doesn't touch HBase but could affect it? Like MR
jobs that don't use HBase or something like that?
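
If it helps, a quick way to spot quiet stretches like this one is to
scan the log for large gaps between consecutive timestamps. The sketch
below is only an illustration: quiet_gaps() and the 60-second threshold
are made up here, and the sample lines are shortened from the region
server log you posted. A gap is not proof of a GC pause by itself (an
idle server logs very little), but it tells you where to look.

```python
import re
from datetime import datetime, timedelta

# Matches the leading timestamp of an HBase log line, e.g. "2011-01-06 13:11:03,849 ".
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")

# Shortened lines from the posted region server log.
SAMPLE = [
    "2011-01-06 12:21:15,773 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed",
    "2011-01-06 13:11:03,849 WARN org.apache.zookeeper.ClientCnxn: Exception closing session",
    "java.io.IOException: TIMED OUT",
    "2011-01-06 13:11:09,628 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP",
    "2011-01-06 13:11:31,491 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020",
]


def quiet_gaps(lines, threshold=timedelta(seconds=60)):
    """Return (previous, current, gap) for timestamped lines at least threshold apart."""
    gaps = []
    prev = None
    for line in lines:
        match = TS.match(line)
        if not match:
            continue  # continuation lines such as stack traces carry no timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


for start, end, gap in quiet_gaps(SAMPLE):
    print(f"nothing logged between {start} and {end} ({gap})")
```

On the sample lines this reports a single gap, between the log roll at
12:21:15 and the ZooKeeper timeout at 13:11:03.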

J-D

On Wed, Jan 5, 2011 at 11:44 PM, ChingShen <[email protected]> wrote:
> Hi all,
>
>    I have a problem where a long GC pause keeps the region server's local
> ZooKeeper client from sending heartbeats, so the session times out.
>  But I want to know why the HBase master sends a MSG_REGIONSERVER_STOP op to
> the region server to stop its services rather than reinitializing a new
> ZooKeeper client or restarting the region server?
>
>   There are 3 RS/DN/TT and 1 MS/NN/JT in my cluster (Hadoop-0.20.2, HBase
> 0.20.6), and vm.swappiness is set to zero.
>
> hbase-ites-master-clusterPC1.log
> 2011-01-06 13:10:57,003 INFO org.apache.hadoop.hbase.master.ServerManager:
> clusterPC4,60020,1294280765301 znode expired
> 2011-01-06 13:10:57,004 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Processing todo: ProcessServerShutdown of
> ites-clusterPC4,60020,1294280765301
> 2011-01-06 13:10:57,004 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of
> server clusterPC4,60020,1294280765301: logSplit: false, rootRescanned:
> false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2011-01-06 13:10:57,007 INFO org.apache.hadoop.hbase.regionserver.HLog:
> Splitting 1 hlog(s) in
> hdfs://clusterPC1:54001/hbase20_6/.logs/ites-clusterPC4,60020,1294280765301
> 2011-01-06 13:10:57,007 DEBUG org.apache.hadoop.hbase.regionserver.HLog:
> Splitting hlog 1 of 1:
> hdfs://clusterPC1:54001/hbase20_6/.logs/ites-clusterPC4,60020,1294280765301/hlog.dat.1294280765667,
> length=0
> .............
>
> hbase-ites-regionserver-clusterPC4.log:
> 2011-01-06 12:21:15,773 DEBUG
> org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms
> elapsed
> 2011-01-06 13:11:03,849 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x12d59208f560000 to sun.nio.ch.selectionkeyi...@402f0df1
> java.io.IOException: TIMED OUT
>         at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2011-01-06 13:11:09,628 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
> 2011-01-06 13:11:31,491 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> server on 60020
> ............
>
> Please see the attached files.
> Thanks.
>
> Shen
>
