[jira] [Commented] (HBASE-5816) Two concurrent assign would cause master to abort with msg "Unexpected state trying to OFFLINE; "

2012-04-18 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256799#comment-13256799
 ] 

stack commented on HBASE-5816:
--

Thanks for filing the issue Maryann.  I think we need to address the root 
problem of two threads in the master both at the same time trying to assign the 
same region rather than do as is done here where we just stop the abort.  The 
patch as is will only move the problem down the line (we'll likely end up w/ a 
single region double assigned?).  Let me update the issue title.  This log 
snippet is a really good find.

> Two concurrent assign would cause master to abort with msg "Unexpected state 
> trying to OFFLINE; "
> -
>
> Key: HBASE-5816
> URL: https://issues.apache.org/jira/browse/HBASE-5816
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.6
>Reporter: Maryann Xue
> Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to 
> PENDING_OPEN, while the second assign follows immediately into "assign" and 
> fails the RegionState check in setOfflineInZooKeeper(). This causes the 
> master to abort.
> In the below case, the two concurrent assigns occurred when AM tried to 
> assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
> tried to assign this region (from the region plan) spontaneously.
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
> src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> (offlining)
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0) for region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
> node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
> (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=CLOSED, ts=1334612697672, 
> server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=PENDING_OPEN, ts=1334613179096, 
> server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
> timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
> while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
> remote=/10.239.47.87:60020]
> at 
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
> at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invok

[jira] [Commented] (HBASE-5816) Two concurrent assign would cause master to abort with msg "Unexpected state trying to OFFLINE; "

2012-04-18 Thread Uma Maheswara Rao G (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256663#comment-13256663
 ] 

Uma Maheswara Rao G commented on HBASE-5816:


Maryann, This looks to be an issue in trunk also right?

{code}
 if (!hijack && !state.isClosed() && !state.isOffline()) {
  String msg = "Unexpected state : " + state + " .. Cannot transit it to 
OFFLINE.";
  this.master.abort(msg, new IllegalStateException(msg));
  return -1;
}
{code}

Since region already assigned by previos call, state might have changed to 
inRegionTransition. If we just log an return now, i think it will just skip 
this assignment. I think it may be ok.




> Two concurrent assign would cause master to abort with msg "Unexpected state 
> trying to OFFLINE; "
> -
>
> Key: HBASE-5816
> URL: https://issues.apache.org/jira/browse/HBASE-5816
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.6
>Reporter: Maryann Xue
> Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to 
> PENDING_OPEN, while the second assign follows immediately into "assign" and 
> fails the RegionState check in setOfflineInZooKeeper(). This causes the 
> master to abort.
> In the below case, the two concurrent assigns occurred when AM tried to 
> assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
> tried to assign this region (from the region plan) spontaneously.
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
> src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> (offlining)
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0) for region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
> node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
> (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=CLOSED, ts=1334612697672, 
> server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=PENDING_OPEN, ts=1334613179096, 
> server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
> timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
> while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
> remote=/10.239.47.87:60020]
> at 
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
> at org.apache.hadoop.hbase.ipc

[jira] [Commented] (HBASE-5816) Two concurrent assign would cause master to abort with msg "Unexpected state trying to OFFLINE; "

2012-04-18 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256617#comment-13256617
 ] 

Hadoop QA commented on HBASE-5816:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12523068/HBASE-5816.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1565//console

This message is automatically generated.

> Two concurrent assign would cause master to abort with msg "Unexpected state 
> trying to OFFLINE; "
> -
>
> Key: HBASE-5816
> URL: https://issues.apache.org/jira/browse/HBASE-5816
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.6
>Reporter: Maryann Xue
> Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to 
> PENDING_OPEN, while the second assign follows immediately into "assign" and 
> fails the RegionState check in setOfflineInZooKeeper(). This causes the 
> master to abort.
> In the below case, the two concurrent assigns occurred when AM tried to 
> assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler 
> tried to assign this region (from the region plan) spontaneously.
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
> src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> (offlining)
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0) for region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
> node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
> (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=CLOSED, ts=1334612697672, 
> server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=PENDING_OPEN, ts=1334613179096, 
> server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
> timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
> while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.239.47.8