Hi Pradeep, Many changes have happened from the version that you are specifying upto the recent version. So may be you can try out latest versions.
Regards Ram > -----Original Message----- > From: Pradeep Gopaluni [mailto:[email protected]] > Sent: Tuesday, June 12, 2012 4:35 PM > To: [email protected] > Subject: Master stuck trying to rebalance regions to a zombie region > servers. > > Hi, > > We are using HBASE 0.90.3 and Hadoop 0.20.205 for our cluster. Today, > we > have observed that the HBASE master was stuck because of zombie > regionservers that was responding to ping but was stuck. > > The logs show that the master tried to re-assign its regions from an > overloaded (ServerA) and one of the regions to be rebalanced is > designated > for the server in zombie state (SeverB). This re-assignment started > failing > with SocketTimeoutException, and the master was stuck waiting. This got > resolved only when ServerB was taken out of network. There were no > master > logs printed for close to 30 mins, between the attempted reassignment, > and > the time ServerB removed from network. > > The issue of infinite wait seems to be very similar to the hadoop RPC > issue > - https://issues.apache.org/jira/browse/HADOOP-6889 reported and fixed > in > 0.20.205. > Did anyone face the same problem, was this fixed in later versions of > HBASE? > > -- > Regards > pradeep > > *2012-06-07 06:59:32,693 HbaseMaster.domain.com:60000-BalancerChore > INFO > org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance > in > 1ms. Moving 26 regions off of 24 overloaded servers onto 0 less loaded > servers > . > 2012-06-07 06:59:32,746 HbaseMaster.domain.com:60000-BalancerChore INFO > org.apache.hadoop.hbase.master.HMaster: balance > hri=users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac33 > 8bf6d504ef., > src=ServerA.domain.com > <http://servera.domain.com/>,60020,1338788540876, > dest=ServerB.domain.com > <http://serverb.domain.com/>,60020,1339042481685 > . > . > 2012-06-07 06:59:53,854 MASTER_CLOSE_REGION- > HbaseMaster.domain.com:60000-3 > WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed > assignment of > users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac338bf6 > d504ef. > to serverName=ServerB.domain.com > <http://serverb.domain.com/>,60020,1339042481685, > load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to > assign > elsewhere instead; retry=0 > java.net.SocketTimeoutException: 20000 millis timeout while waiting for > channel to be ready for connect. ch : > java.nio.channels.SocketChannel[connection-pending remote=/ > 10.193.190.107:60020] > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.j > ava:213) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBase > Client.java:328) > at > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java: > 883) > at > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) > at $Proxy7.openRegion(Unknown Source) > at > org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManag > er.java:559) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:934) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:749) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:729) > at > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(Clos > edRegionHandler.java:92) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156 > ) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecut > or.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.j > ava:908) > at java.lang.Thread.run(Thread.java:662) > 2012-06-07 07:01:31,884 908082002@qtp-1543904466-10 INFO > org.apache.zookeeper.ZooKeeper: Initiating client connection, > connectString= > zaas3.domain.com:2181,zaas2.domain.com:2181,zaas1.domain.com:2181, > zaas5.domain.com:2181,zaas4.domain.com:2181 sessionTimeout=300000 > watcher=hconnection > 2012-06-07 07:01:31,884 908082002@qtp-1543904466-10-SendThread() INFO > org.apache.zookeeper.ClientCnxn: Opening socket connection to server > zaas1.domain.com/10.193.190.117:2181 > 2012-06-07 07:01:31,885 908082002@qtp-1543904466-10-SendThread( > zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to zaas1.domain.com/10.193.190.117:2181, > initiating > session > 2012-06-07 07:01:31,887 908082002@qtp-1543904466-10-SendThread( > zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server zaas1.domain.com/10.193.190.117:2181, > sessionid = 0x133cce032e46274, negotiated timeout = 300000 > 2012-06-07 07:01:33,695 908082002@qtp-1543904466-10 INFO > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementa > tion: > Closed zookeeper sessionid=0x133cce032e46274 > 2012-06-07 07:01:33,697 908082002@qtp-1543904466-10 INFO > org.apache.zookeeper.ZooKeeper: Session: 0x133cce032e46274 closed > 2012-06-07 07:01:33,697 908082002@qtp-1543904466-10-EventThread INFO > org.apache.zookeeper.ClientCnxn: EventThread shut down > 2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore INFO > org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance > in > 1ms. Moving 1 regions off of 1 overloaded servers onto 0 less loaded > servers > 2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore INFO > org.apache.hadoop.hbase.master.HMaster: balance > hri=users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3a > 51fb3006a5., > src=mdg357.domain.com,60020,1338788541085, > dest=ServerB.domain.com<http://serverb.domain.com/> > ,60020,1339042481685 > 2012-06-07 07:04:53,290 MASTER_CLOSE_REGION- > HbaseMaster.domain.com:60000-0 > WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed > assignment of > users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3a51fb > 3006a5. > to serverName=ServerB.domain.com > <http://serverb.domain.com/>,60020,1339042481685, > load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to > assign > elsewhere instead; retry=0 > java.net.SocketTimeoutException: 20000 millis timeout while waiting for > channel to be ready for connect. ch : > java.nio.channels.SocketChannel[connection-pending remote=/ > 10.193.190.107:60020] > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.j > ava:213) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBase > Client.java:328) > at > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java: > 883) > at > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) > at $Proxy7.openRegion(Unknown Source) > at > org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManag > er.java:559) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:934) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:749) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:729) > at > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(Clos > edRegionHandler.java:92) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156 > ) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecut > or.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.j > ava:908) > at java.lang.Thread.run(Thread.java:662) > 2012-06-07 07:08:04,680 908082002@qtp-1543904466-10 INFO > org.apache.zookeeper.ZooKeeper: Initiating client connection, > connectString= > zaas3.domain.com:2181,zaas2.domain.com:2181,zaas1.domain.com:2181, > zaas5.domain.com:2181,zaas4.domain.com:2181 sessionTimeout=300000 > watcher=hconnection > 2012-06-07 07:08:04,681 908082002@qtp-1543904466-10-SendThread() INFO > org.apache.zookeeper.ClientCnxn: Opening socket connection to server > zaas5.domain.com/10.193.190.68:2181 > 2012-06-07 07:08:04,681 908082002@qtp-1543904466-10-SendThread( > zaas5.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to zaas5.domain.com/10.193.190.68:2181, > initiating > session > 2012-06-07 07:08:04,683 908082002@qtp-1543904466-10-SendThread( > zaas5.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server zaas5.domain.com/10.193.190.68:2181, > sessionid = 0x533cce12e1a4a58, negotiated timeout = 300000 > 2012-06-07 07:08:06,147 908082002@qtp-1543904466-10 INFO > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementa > tion: > Closed zookeeper sessionid=0x533cce12e1a4a58 > 2012-06-07 07:08:06,149 908082002@qtp-1543904466-10 INFO > org.apache.zookeeper.ZooKeeper: Session: 0x533cce12e1a4a58 closed > 2012-06-07 07:08:06,149 908082002@qtp-1543904466-10-EventThread INFO > org.apache.zookeeper.ClientCnxn: EventThread shut down > 2012-06-07 07:09:32,781 HbaseMaster.domain.com:60000-BalancerChore INFO > org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance > in > 1ms. Moving 1 regions off of 1 overloaded servers onto 0 less loaded > servers > 2012-06-07 07:09:32,781 HbaseMaster.domain.com:60000-BalancerChore INFO > org.apache.hadoop.hbase.master.HMaster: balance > hri=users,VWsWSM9RoX+wNoekA5yXrw==,1327686874360.7e07caa524f0b5fc186919 > d402ab3d11., > src=mdg177.domain.com,60020,1338788540944, > dest=ServerB.domain.com<http://serverb.domain.com/> > ,60020,1339042481685 > 2012-06-07 07:09:54,410 MASTER_CLOSE_REGION- > HbaseMaster.domain.com:60000-2 > WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed > assignment of > users,VWsWSM9RoX+wNoekA5yXrw==,1327686874360.7e07caa524f0b5fc186919d402 > ab3d11. > to serverName=ServerB.domain.com > <http://serverb.domain.com/>,60020,1339042481685, > load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to > assign > elsewhere instead; retry=0 > 0.107:60020] > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.j > ava:213) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBase > Client.java:328) > at > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java: > 883) > at > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) > at $Proxy7.openRegion(Unknown Source) > at > org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManag > er.java:559) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:934) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:749) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManag > er.java:729) > at > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(Clos > edRegionHandler.java:92) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156 > ) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecut > or.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.j > ava:908) > at java.lang.Thread.run(Thread.java:662) > 2012-06-07 07:12:43,271 IPC Server handler 22 on 60000 INFO > org.apache.hadoop.hbase.master.HMaster: Balance=false > 2012-06-07 07:40:00,016 main-EventThread INFO > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer > ephemeral node deleted, processing expiration > [ServerB.domain.com<http://serverb.domain.com/> > ,60020,1339042481685]* > -- > Regards > pradeep
