I'm not sure if it's the exact same problem, but there is a similar issue on
Amazon EC2 if you are using their DNS and you have to Stop/Start an instance
running a regionserver. This just happened to me today.

One of our regionservers got into that infrequent "hung" instance mode where the
instance isn't quite dead but no TCP connections work.

The only way to fix it (assuming it's a pure EBS-backed instance) is to Stop the
instance (you usually have to Force Stop) and then Start it again.

That actually creates a new instance but with the same EBS disk volumes. The 
problem is that it has a new IP address and a new DNS name.

Even though I took the old name out of the regionservers file and put the new 
name in, the HBase master kept trying to access the old IP address. It picked 
up the new functioning node fine. The only way I could get it to stop trying to 
access the old name was to stop and start the HBase cluster.
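
In case it helps anyone chasing the same thing: a quick way to sanity-check
whether forward and reverse DNS agree for the new node name (which, per J-D's
point below, is what trips the master up) is a few lines of Java along these
lines. This is just a rough sketch, not anything HBase runs itself, and the
hostname is a placeholder:

    import java.net.InetAddress;

    public class DnsCheck {
        public static void main(String[] args) throws Exception {
            // Forward-resolve the name you put in the regionservers file,
            // then reverse-resolve the resulting IP and compare the two names.
            String host = args.length > 0 ? args[0] : "datanode03.example.com";
            InetAddress addr = InetAddress.getByName(host);       // forward lookup
            String reverse = addr.getCanonicalHostName();         // reverse lookup of the IP
            System.out.println("forward: " + host + " -> " + addr.getHostAddress());
            System.out.println("reverse: " + addr.getHostAddress() + " -> " + reverse);
            if (!reverse.equalsIgnoreCase(host)) {
                System.out.println("Mismatch: the master may see this server under a different name");
            }
        }
    }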

I am, though, still running HBase 0.20.3, so I've got that going for me...

(We're really close to moving to a modern version. I actually have people 
helping me now! :-)
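
Also, for anyone debugging the "master still thinks a dead server is alive"
case described below: it can help to list what ZooKeeper actually has
registered under /hbase/rs and compare it with the master's view. A rough
sketch using the plain ZooKeeper client follows; the quorum address is a
placeholder and it assumes the default /hbase root znode:

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooKeeper;

    public class ListRegionServers {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            // Connect to the ensemble; the watcher releases the latch once
            // the session is established.
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            // List the ephemeral regionserver znodes the ensemble knows about.
            List<String> servers = zk.getChildren("/hbase/rs", false);
            for (String server : servers) {
                System.out.println(server);   // e.g. datanode03,60020,1316162005809
            }
            zk.close();
        }
    }

If a server still shows up as alive in the master but is missing from that
list, that matches what is described below: ZooKeeper has dropped the node but
the master never processed the expiration.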


On Sep 16, 2011, at 10:33 AM, Jean-Daniel Cryans wrote:

> This happens often to users with a broken reverse DNS setup, look at
> the master log around when it was supposed to process the dead node
> and it should tell you that it doesn't know who that is (because the
> server name it sees is different from the one registered in the
> master).
> 
> One example from http://search-hadoop.com/m/CANUA1qRCkQ1
> 
> 2011-07-14 18:56:04,530 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [server-2.domain.net.,60020,1310680454144]
> 2011-07-14 18:56:04,530 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: No HServerInfo found for server-2.domain.net.,60020,1310680454144
> 
> You can see in their RS log:
> 
> 2011-07-14 18:56:03,423 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server at: server-2.domain.net,60020,1310680454144
> 
> server-2.domain.net.,60020 != server-2.domain.net,60020 (note the trailing dot).
> 
> J-D
> 
> On Fri, Sep 16, 2011 at 2:31 AM, Jorn Argelo - Ephorus
> <[email protected]> wrote:
>> Hi all,
>> 
>> 
>> 
>> I'm in the process of testing our small cluster running the CDH3U1
>> version of Hadoop / HBase. I'm seeing a problem where, when I stop a
>> regionserver (either cleanly or by killing it hard), the HBase master
>> does not detect that the regionserver is dead. If I do this to the
>> regionserver hosting the META region, the entire cluster becomes
>> unusable, because the HBase master never moves the META region to
>> another regionserver; it simply keeps trying to reconnect to the dead
>> regionserver forever. Here's a snapshot of the error in the HBase
>> master log (and for the record, datanode03 is the node that is dead):
>> 
>> 
>> 
>> 
>> 
>> 2011-09-16 11:22:12,514 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region ephorus_test,/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7.; plan=hri=ephorus_test,/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7., src=datanode01.dev.ephorus-labs.com,60020,1316078209570, dest=datanode03.dev.ephorus-labs.com,60020,1316162005809
>> 
>> 2011-09-16 11:22:12,514 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ephorus_test,/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7. to datanode03.dev.ephorus-labs.com,60020,1316162005809
>> 
>> 2011-09-16 11:22:12,514 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 05f13ffa2ec18aac9ffa6f79a23c12b2 from server datanode02.dev.ephorus-labs.com,60020,1316078218061 but region was in the state TestTable,0009796041,1316100506914.05f13ffa2ec18aac9ffa6f79a23c12b2. state=OPEN, ts=1316164932386 and not in expected PENDING_OPEN or OPENING states
>> 
>> 2011-09-16 11:22:12,514 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of ephorus_test,/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7. to serverName=datanode03.dev.ephorus-labs.com,60020,1316162005809, load=(requests=0, regions=8, usedHeap=42, maxHeap=4083), trying to assign elsewhere instead; retry=0
>> 
>> java.net.ConnectException: Connection refused
>>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>        at $Proxy6.openRegion(Unknown Source)
>>        at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:559)
>>        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:931)
>>        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:746)
>>        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:726)
>>        at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
>>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:662)
>> 
>> 
>> 
>> It may be worth mentioning that this behavior is the same regardless of
>> whether the cluster is idle or loaded. Apart from that (and some
>> infamous stop-the-world GC issues which I've still got to fix), the
>> cluster is running fine.
>> 
>> 
>> 
>> For reference: the zookeeper ensemble is properly terminating the
>> session as we can see here:
>> 
>> 
>> 
>> 2011-09-16 10:33:25,988 - INFO  [CommitProcessor:1:NIOServerCnxn@1580] - Established session 0x1324d1aa92a01bb with negotiated timeout 40000 for client /10.20.4.98:47238
>> 
>> 2011-09-16 10:33:29,180 - INFO  [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level KeeperException when processing sessionid:0x1324d1aa92a01bb type:create cxid:0xd zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/hbase/rs/datanode03,60020,1316162005809 Error:KeeperErrorCode = NodeExists for /hbase/rs/datanode03,60020,1316162005809
>> 
>> 2011-09-16 10:34:06,414 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x2324dad8d770170
>> 
>> 2011-09-16 10:34:06,430 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x1324d1aa92a01bb
>> 
>> 2011-09-16 10:34:06,438 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /10.20.4.98:47238 which had sessionid 0x1324d1aa92a01bb
>> 
>> 
>> 
>> I can also confirm from the zk_dump in the HBase master web UI that the
>> ZooKeeper ensemble no longer has the session active, yet the HBase
>> master does not detect this. The hbase shell still reports that all
>> servers are alive:
>> 
>> 
>> 
>> hbase(main):001:0> status
>> 
>> 3 servers, 0 dead, 96.3333 average load
>> 
>> 
>> 
>> Maybe I am missing something obvious, but I'm quite stumped on this. I
>> found a thread via Google where J-D suggested the session timeout, but
>> nothing happens even if I let it run overnight (12+ hours). You can
>> find the thread here:
>> http://apache-hbase.679495.n3.nabble.com/Can-master-detect-sudden-region-server-death-td1141384.html
>> 
>> 
>> 
>> The only way for the HBase master to detect that the regionserver is
>> dead is by restarting the HBase master ... which is frankly not really
>> what I want.
>> 
>> 
>> 
>> Any pointers would be greatly appreciated.
>> 
>> 
>> 
>> Thanks,
>> 
>> Jorn
>> 
>> 

__________________
Robert J Berger - CTO
Runa Inc.
+1 408-838-8896
http://blog.ibd.com


