[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-26 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262499#comment-13262499
 ] 

ramkrishna.s.vasudevan commented on HBASE-5816:
---

@Maryann
If you are planning to work on this pls go ahead. :)

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
 at 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13260240#comment-13260240
 ] 

stack commented on HBASE-5816:
--

@Ram I think the aim would be a simplification; one queue to assign from rather 
than from multiple.  Also as is, I think state a little distributed across 
multiple variables and maps.  We should coalesce if possible.  I think the 
Maryann suggestion of trying a double or triple concurrent assign in a unit 
test a good start.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-20 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258301#comment-13258301
 ] 

ramkrishna.s.vasudevan commented on HBASE-5816:
---

I thought more on this..  There are already some state management 
datastructures like RIT, regions and servers map. So adding more may again be 
redundant(my thought). and handling them should be clean.






 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at 
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-19 Thread Maryann Xue (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257298#comment-13257298
 ] 

Maryann Xue commented on HBASE-5816:


Yes, Zhihong, not in trunk now. It is only in 0.90 branch.

I think there are two different problems here.
1. The ServerShutdownHandler should check more strictly before calling assign() 
for regions planned to move to it.
2. Master should handle concurrent requests of assign() more properly. In the 
case i attached, the later assign() call by ServerShutdownHandler could 
actually be ignored, since AssignmentManager is already doing assignment job 
for this region. There could be situations when client code calls assign() from 
interface level and cause the same problem.

True, doing dead server check is not safe, for the private assign() method does 
a number of retry attempts in its own loop, and the server could be found dead 
just after the first attempt fails.

stack, i think maintaining a queue for assign() requests, whether from the 
balancer or the ServerShutdownHandler or client calls, can be a good solution. 
However, if the AssignmentManager is now good in its own logic among region 
states and regionsInTransition, it should be justified to assume that if assign 
gets a wrong state, it simply indicates there is an another assign that has 
just succeeded. So it should be ok for an invalid/later assign to return.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-19 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257312#comment-13257312
 ] 

ramkrishna.s.vasudevan commented on HBASE-5816:
---

I will try to dig in this more.  The code that Maryann says is in 0.90 only.  
Some subtle changes are there between 0.90 and 0.92+ version regarding 
assignments.
So its better we investigate the bug in 0.92 also seperately.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at 
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-19 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257479#comment-13257479
 ] 

ramkrishna.s.vasudevan commented on HBASE-5816:
---

@Stack
In our internal version of 0.90 which has changes as in 0.92 along with Timeout 
monitor related changes we did what ever Maryann did. It works.  But we had 
timeout monitor to rescue.
As you said its better to have some datastructure like a set or queue so that 
we need not allow any new assign to happen if something is present in the 
datastructure that we have added.
But we should be very careful as when to clear it as we have some retry 
attempts also made when assign fails.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-19 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257551#comment-13257551
 ] 

ramkrishna.s.vasudevan commented on HBASE-5816:
---

In 0.94 i tried to reproduce i was not able to get it.  But note that the 
HBASE-5396 fix is not there in it.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
 at 
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-19 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257729#comment-13257729
 ] 

stack commented on HBASE-5816:
--

@Maryann Agree on your 1., and 2. above.  Its possible to make a standalone 
AssignmentManager using mocks -- see TestAssignmentManager.  Maybe we should 
try some of your suppositions over in unit tests Maryann and find holes in AM 
by writing unit tests?

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at 
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-18 Thread Maryann Xue (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257194#comment-13257194
 ] 

Maryann Xue commented on HBASE-5816:


stack, your suggestion seems an ultimate solution to the current HMaster 
workflow. trunk has the same problem.
The case i attached were actually introduced by HBASE-5396, which tried to let 
ServerShutdownHandler assign the region in an earlier stage instead of waiting 
for the TimeoutMonitor to do the job. But the isRegionOnline test seems too 
weak here.
  for (HRegionInfo hri : regionsFromRegionPlansForServer) {
if (!this.services.getAssignmentManager().isRegionOnline(hri)) {
  this.services.getAssignmentManager().assign(hri, true);
  reassignedPlans++;
}
  }

However, i think any client call to HBaseAdmin.assign() that coincide at this 
point would cause the same problem. There is a lock guarding the private 
assign() method to deal with concurrent assigns, but the entire assign process 
is not atomic. It should be safe for the later thread just return or get an 
exception if the region has already been assigned by an earlier thread.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-18 Thread Maryann Xue (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257195#comment-13257195
 ] 

Maryann Xue commented on HBASE-5816:


stack, your suggestion seems an ultimate solution to the current HMaster 
workflow. trunk has the same problem.
The case i attached were actually introduced by HBASE-5396, which tried to let 
ServerShutdownHandler assign the region in an earlier stage instead of waiting 
for the TimeoutMonitor to do the job. But the isRegionOnline test seems too 
weak here.
  for (HRegionInfo hri : regionsFromRegionPlansForServer) {
if (!this.services.getAssignmentManager().isRegionOnline(hri)) {
  this.services.getAssignmentManager().assign(hri, true);
  reassignedPlans++;
}
  }

However, i think any client call to HBaseAdmin.assign() that coincide at this 
point would cause the same problem. There is a lock guarding the private 
assign() method to deal with concurrent assigns, but the entire assign process 
is not atomic. It should be safe for the later thread just return or get an 
exception if the region has already been assigned by an earlier thread.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-18 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257217#comment-13257217
 ] 

Zhihong Yu commented on HBASE-5816:
---

HBASE-5396 was only integrated to 0.90
So it shouldn't be the cause for problem in trunk.

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
 at 
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-18 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257252#comment-13257252
 ] 

stack commented on HBASE-5816:
--

Great stuff Maryann.  Where is the above bit of code from?  I don't find it in 
trunk (could be me).

bq. It should be safe for the later thread just return or get an exception if 
the region has already been assigned by an earlier thread.

What are you thinking?  When we go into the assign, we check if the region is 
in transition and unless its a force assign, just return?  Or would you do this 
earlier?  Maybe the balancer should be more deferential?  It could check if the 
regionserver its been asked move a region from is on the deadservers list.  
This would still be racy though.  Would doing the check in the assign method be 
enough?  (I've not looked at the code).

Thanks for the help on this stuff.


 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 

[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region

2012-04-18 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257255#comment-13257255
 ] 

stack commented on HBASE-5816:
--

Should we have the servershutdownhandler and the balancer feed a single queue 
that assignment manager pulls from?  If the region is already in the queue then 
we'd favor the purposed assignment (the balancers?) rather than the random one?

 Balancer and ServerShutdownHandler concurrently reassigning the same region
 ---

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the below case, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
 tried to assign this region (from the region plan) spontaneously.
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at