Hi, We are using HBASE 0.90.3 and Hadoop 0.20.205 for our cluster. Today, we have observed that the HBASE master was stuck because of zombie regionservers that was responding to ping but was stuck.
The logs show that the master tried to re-assign its regions from an overloaded (ServerA) and one of the regions to be rebalanced is designated for the server in zombie state (SeverB). This re-assignment started failing with SocketTimeoutException, and the master was stuck waiting. This got resolved only when ServerB was taken out of network. There were no master logs printed for close to 30 mins, between the attempted reassignment, and the time ServerB removed from network. The issue of infinite wait seems to be very similar to the hadoop RPC issue - https://issues.apache.org/jira/browse/HADOOP-6889 reported and fixed in 0.20.205. Did anyone face the same problem, was this fixed in later versions of HBASE? -- Regards pradeep *2012-06-07 06:59:32,693 HbaseMaster.domain.com:60000-BalancerChore INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 1ms. Moving 26 regions off of 24 overloaded servers onto 0 less loaded servers . 2012-06-07 06:59:32,746 HbaseMaster.domain.com:60000-BalancerChore INFO org.apache.hadoop.hbase.master.HMaster: balance hri=users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac338bf6d504ef., src=ServerA.domain.com <http://servera.domain.com/>,60020,1338788540876, dest=ServerB.domain.com <http://serverb.domain.com/>,60020,1339042481685 . . 2012-06-07 06:59:53,854 MASTER_CLOSE_REGION-HbaseMaster.domain.com:60000-3 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac338bf6d504ef. to serverName=ServerB.domain.com <http://serverb.domain.com/>,60020,1339042481685, load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to assign elsewhere instead; retry=0 java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/ 10.193.190.107:60020] at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328) at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy7.openRegion(Unknown Source) at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:559) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:934) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:749) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:729) at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2012-06-07 07:01:31,884 908082002@qtp-1543904466-10 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString= zaas3.domain.com:2181,zaas2.domain.com:2181,zaas1.domain.com:2181, zaas5.domain.com:2181,zaas4.domain.com:2181 sessionTimeout=300000 watcher=hconnection 2012-06-07 07:01:31,884 908082002@qtp-1543904466-10-SendThread() INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zaas1.domain.com/10.193.190.117:2181 2012-06-07 07:01:31,885 908082002@qtp-1543904466-10-SendThread( zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zaas1.domain.com/10.193.190.117:2181, initiating session 2012-06-07 07:01:31,887 908082002@qtp-1543904466-10-SendThread( zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zaas1.domain.com/10.193.190.117:2181, sessionid = 0x133cce032e46274, negotiated timeout = 300000 2012-06-07 07:01:33,695 908082002@qtp-1543904466-10 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x133cce032e46274 2012-06-07 07:01:33,697 908082002@qtp-1543904466-10 INFO org.apache.zookeeper.ZooKeeper: Session: 0x133cce032e46274 closed 2012-06-07 07:01:33,697 908082002@qtp-1543904466-10-EventThread INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 1ms. Moving 1 regions off of 1 overloaded servers onto 0 less loaded servers 2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore INFO org.apache.hadoop.hbase.master.HMaster: balance hri=users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3a51fb3006a5., src=mdg357.domain.com,60020,1338788541085, dest=ServerB.domain.com<http://serverb.domain.com/> ,60020,1339042481685 2012-06-07 07:04:53,290 MASTER_CLOSE_REGION-HbaseMaster.domain.com:60000-0 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3a51fb3006a5. to serverName=ServerB.domain.com <http://serverb.domain.com/>,60020,1339042481685, load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to assign elsewhere instead; retry=0 java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/ 10.193.190.107:60020] at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328) at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy7.openRegion(Unknown Source) at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:559) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:934) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:749) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:729) at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2012-06-07 07:08:04,680 908082002@qtp-1543904466-10 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString= zaas3.domain.com:2181,zaas2.domain.com:2181,zaas1.domain.com:2181, zaas5.domain.com:2181,zaas4.domain.com:2181 sessionTimeout=300000 watcher=hconnection 2012-06-07 07:08:04,681 908082002@qtp-1543904466-10-SendThread() INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zaas5.domain.com/10.193.190.68:2181 2012-06-07 07:08:04,681 908082002@qtp-1543904466-10-SendThread( zaas5.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zaas5.domain.com/10.193.190.68:2181, initiating session 2012-06-07 07:08:04,683 908082002@qtp-1543904466-10-SendThread( zaas5.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zaas5.domain.com/10.193.190.68:2181, sessionid = 0x533cce12e1a4a58, negotiated timeout = 300000 2012-06-07 07:08:06,147 908082002@qtp-1543904466-10 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x533cce12e1a4a58 2012-06-07 07:08:06,149 908082002@qtp-1543904466-10 INFO org.apache.zookeeper.ZooKeeper: Session: 0x533cce12e1a4a58 closed 2012-06-07 07:08:06,149 908082002@qtp-1543904466-10-EventThread INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2012-06-07 07:09:32,781 HbaseMaster.domain.com:60000-BalancerChore INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 1ms. Moving 1 regions off of 1 overloaded servers onto 0 less loaded servers 2012-06-07 07:09:32,781 HbaseMaster.domain.com:60000-BalancerChore INFO org.apache.hadoop.hbase.master.HMaster: balance hri=users,VWsWSM9RoX+wNoekA5yXrw==,1327686874360.7e07caa524f0b5fc186919d402ab3d11., src=mdg177.domain.com,60020,1338788540944, dest=ServerB.domain.com<http://serverb.domain.com/> ,60020,1339042481685 2012-06-07 07:09:54,410 MASTER_CLOSE_REGION-HbaseMaster.domain.com:60000-2 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of users,VWsWSM9RoX+wNoekA5yXrw==,1327686874360.7e07caa524f0b5fc186919d402ab3d11. to serverName=ServerB.domain.com <http://serverb.domain.com/>,60020,1339042481685, load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to assign elsewhere instead; retry=0 0.107:60020] at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328) at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy7.openRegion(Unknown Source) at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:559) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:934) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:749) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:729) at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2012-06-07 07:12:43,271 IPC Server handler 22 on 60000 INFO org.apache.hadoop.hbase.master.HMaster: Balance=false 2012-06-07 07:40:00,016 main-EventThread INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [ServerB.domain.com<http://serverb.domain.com/> ,60020,1339042481685]* -- Regards pradeep
