Hey guys,
I had a similar problem a week ago and could not solve the problem in a clean 
way.
I restarted the master and region servers so everything could work again.

As Ram said, I think the best solution is to update HBase, which is not very 
easy but
still the best option we found.

Regards,
Pablo

-----Original Message-----
From: Ramkrishna.S.Vasudevan [mailto:[email protected]] 
Sent: terça-feira, 12 de junho de 2012 08:15
To: [email protected]
Subject: RE: Master stuck trying to rebalance regions to a zombie region 
servers.

Hi Pradeep,

Many changes have happened from the version that you are specifying upto the 
recent version.  So may be you can try out latest versions.

Regards
Ram

> -----Original Message-----
> From: Pradeep Gopaluni [mailto:[email protected]]
> Sent: Tuesday, June 12, 2012 4:35 PM
> To: [email protected]
> Subject: Master stuck trying to rebalance regions to a zombie region 
> servers.
> 
> Hi,
> 
> We are using HBASE 0.90.3 and Hadoop 0.20.205 for our cluster. Today, 
> we have observed that the HBASE master was stuck because of  zombie 
> regionservers that was responding to ping but was stuck.
> 
> The logs show that the master tried to re-assign its regions from an 
> overloaded (ServerA) and one of the regions to be rebalanced is 
> designated for the server in zombie state (SeverB). This re-assignment 
> started failing with SocketTimeoutException, and the master was stuck 
> waiting. This got resolved only when ServerB was taken out of network. 
> There were no master logs printed for close to 30 mins, between the 
> attempted reassignment, and the time ServerB removed from network.
> 
> The issue of infinite wait seems to be very similar to the hadoop RPC 
> issue
> - https://issues.apache.org/jira/browse/HADOOP-6889 reported and fixed 
> in 0.20.205.
> Did anyone face the same problem, was this fixed in later versions of 
> HBASE?
> 
> --
> Regards
> pradeep
> 
> *2012-06-07 06:59:32,693 HbaseMaster.domain.com:60000-BalancerChore
> INFO
> org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance 
> in 1ms. Moving 26 regions off of 24 overloaded servers onto 0 less 
> loaded servers .
> 2012-06-07 06:59:32,746 HbaseMaster.domain.com:60000-BalancerChore 
> INFO
> org.apache.hadoop.hbase.master.HMaster: balance
> hri=users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac3
> 3
> 8bf6d504ef.,
> src=ServerA.domain.com
> <http://servera.domain.com/>,60020,1338788540876,
> dest=ServerB.domain.com
> <http://serverb.domain.com/>,60020,1339042481685
> .
> .
> 2012-06-07 06:59:53,854 MASTER_CLOSE_REGION-
> HbaseMaster.domain.com:60000-3
> WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed 
> assignment of
> users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac338bf
> 6
> d504ef.
> to serverName=ServerB.domain.com
> <http://serverb.domain.com/>,60020,1339042481685,
> load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to 
> assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: 20000 millis timeout while waiting 
> for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/ 
> 10.193.190.107:60020]
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.
> j
> ava:213)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBas
> e
> Client.java:328)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:
> 883)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>         at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy7.openRegion(Unknown Source)
>         at
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerMana
> g
> er.java:559)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:934)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:749)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:729)
>         at
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(Clo
> s
> edRegionHandler.java:92)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:15
> 6
> )
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecu
> t
> or.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> j
> ava:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-06-07 07:01:31,884 908082002@qtp-1543904466-10 INFO
> org.apache.zookeeper.ZooKeeper: Initiating client connection, 
> connectString= 
> zaas3.domain.com:2181,zaas2.domain.com:2181,zaas1.domain.com:2181,
> zaas5.domain.com:2181,zaas4.domain.com:2181 sessionTimeout=300000 
> watcher=hconnection
> 2012-06-07 07:01:31,884 908082002@qtp-1543904466-10-SendThread() INFO
> org.apache.zookeeper.ClientCnxn: Opening socket connection to server
> zaas1.domain.com/10.193.190.117:2181
> 2012-06-07 07:01:31,885 908082002@qtp-1543904466-10-SendThread(
> zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to zaas1.domain.com/10.193.190.117:2181,
> initiating
> session
> 2012-06-07 07:01:31,887 908082002@qtp-1543904466-10-SendThread(
> zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server zaas1.domain.com/10.193.190.117:2181,
> sessionid = 0x133cce032e46274, negotiated timeout = 300000
> 2012-06-07 07:01:33,695 908082002@qtp-1543904466-10 INFO 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplement
> a
> tion:
> Closed zookeeper sessionid=0x133cce032e46274
> 2012-06-07 07:01:33,697 908082002@qtp-1543904466-10 INFO
> org.apache.zookeeper.ZooKeeper: Session: 0x133cce032e46274 closed
> 2012-06-07 07:01:33,697 908082002@qtp-1543904466-10-EventThread INFO
> org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore 
> INFO
> org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance 
> in 1ms. Moving 1 regions off of 1 overloaded servers onto 0 less 
> loaded servers
> 2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore 
> INFO
> org.apache.hadoop.hbase.master.HMaster: balance 
> hri=users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3
> a
> 51fb3006a5.,
> src=mdg357.domain.com,60020,1338788541085,
> dest=ServerB.domain.com<http://serverb.domain.com/>
> ,60020,1339042481685
> 2012-06-07 07:04:53,290 MASTER_CLOSE_REGION-
> HbaseMaster.domain.com:60000-0
> WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed 
> assignment of 
> users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3a51f
> b
> 3006a5.
> to serverName=ServerB.domain.com
> <http://serverb.domain.com/>,60020,1339042481685,
> load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to 
> assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: 20000 millis timeout while waiting 
> for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/ 
> 10.193.190.107:60020]
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.
> j
> ava:213)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBas
> e
> Client.java:328)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:
> 883)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>         at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy7.openRegion(Unknown Source)
>         at
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerMana
> g
> er.java:559)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:934)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:749)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:729)
>         at
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(Clo
> s
> edRegionHandler.java:92)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:15
> 6
> )
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecu
> t
> or.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> j
> ava:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-06-07 07:08:04,680 908082002@qtp-1543904466-10 INFO
> org.apache.zookeeper.ZooKeeper: Initiating client connection, 
> connectString= 
> zaas3.domain.com:2181,zaas2.domain.com:2181,zaas1.domain.com:2181,
> zaas5.domain.com:2181,zaas4.domain.com:2181 sessionTimeout=300000 
> watcher=hconnection
> 2012-06-07 07:08:04,681 908082002@qtp-1543904466-10-SendThread() INFO
> org.apache.zookeeper.ClientCnxn: Opening socket connection to server
> zaas5.domain.com/10.193.190.68:2181
> 2012-06-07 07:08:04,681 908082002@qtp-1543904466-10-SendThread(
> zaas5.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to zaas5.domain.com/10.193.190.68:2181,
> initiating
> session
> 2012-06-07 07:08:04,683 908082002@qtp-1543904466-10-SendThread(
> zaas5.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server zaas5.domain.com/10.193.190.68:2181,
> sessionid = 0x533cce12e1a4a58, negotiated timeout = 300000
> 2012-06-07 07:08:06,147 908082002@qtp-1543904466-10 INFO 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplement
> a
> tion:
> Closed zookeeper sessionid=0x533cce12e1a4a58
> 2012-06-07 07:08:06,149 908082002@qtp-1543904466-10 INFO
> org.apache.zookeeper.ZooKeeper: Session: 0x533cce12e1a4a58 closed
> 2012-06-07 07:08:06,149 908082002@qtp-1543904466-10-EventThread INFO
> org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2012-06-07 07:09:32,781 HbaseMaster.domain.com:60000-BalancerChore 
> INFO
> org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance 
> in 1ms. Moving 1 regions off of 1 overloaded servers onto 0 less 
> loaded servers
> 2012-06-07 07:09:32,781 HbaseMaster.domain.com:60000-BalancerChore 
> INFO
> org.apache.hadoop.hbase.master.HMaster: balance
> hri=users,VWsWSM9RoX+wNoekA5yXrw==,1327686874360.7e07caa524f0b5fc18691
> 9
> d402ab3d11.,
> src=mdg177.domain.com,60020,1338788540944,
> dest=ServerB.domain.com<http://serverb.domain.com/>
> ,60020,1339042481685
> 2012-06-07 07:09:54,410 MASTER_CLOSE_REGION-
> HbaseMaster.domain.com:60000-2
> WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed 
> assignment of
> users,VWsWSM9RoX+wNoekA5yXrw==,1327686874360.7e07caa524f0b5fc186919d40
> 2
> ab3d11.
> to serverName=ServerB.domain.com
> <http://serverb.domain.com/>,60020,1339042481685,
> load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to 
> assign elsewhere instead; retry=0 0.107:60020]
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.
> j
> ava:213)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBas
> e
> Client.java:328)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:
> 883)
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>         at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy7.openRegion(Unknown Source)
>         at
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerMana
> g
> er.java:559)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:934)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:749)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentMana
> g
> er.java:729)
>         at
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(Clo
> s
> edRegionHandler.java:92)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:15
> 6
> )
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecu
> t
> or.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> j
> ava:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-06-07 07:12:43,271 IPC Server handler 22 on 60000 INFO
> org.apache.hadoop.hbase.master.HMaster: Balance=false
> 2012-06-07 07:40:00,016 main-EventThread INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer 
> ephemeral node deleted, processing expiration 
> [ServerB.domain.com<http://serverb.domain.com/>
> ,60020,1339042481685]*
> --
> Regards
> pradeep

Reply via email to