[jira] [Commented] (HBASE-5816) Two concurrent assign would cause master to abort with msg "Unexpected state trying to OFFLINE; "
    [ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256799#comment-13256799 ]

stack commented on HBASE-5816:
------------------------------

Thanks for filing the issue Maryann.

I think we need to address the root problem of two threads in the master both at the same time trying to assign the same region rather than do as is done here where we just stop the abort. The patch as is will only move the problem down the line (we'll likely end up w/ a single region double assigned?).

Let me update the issue title.

This log snippet is a really good find.

> Two concurrent assign would cause master to abort with msg "Unexpected state
> trying to OFFLINE; "
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5816
>                 URL: https://issues.apache.org/jira/browse/HBASE-5816
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.6
>            Reporter: Maryann Xue
>         Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to
> PENDING_OPEN, while the second assign follows immediately into "assign" and
> fails the RegionState check in setOfflineInZooKeeper(). This causes the
> master to abort.
> In the below case, the two concurrent assigns occurred when AM tried to
> assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler
> tried to assign this region (from the region plan) spontaneously.
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=hadoop05.sh.intel.com,60020,1334544902186,
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> (offlining)
> 2012-04-17 05:44:57,648 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to
> serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0,
> regions=0, usedHeap=0, maxHeap=0) for region
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned
> node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b
> (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> state=CLOSED, ts=1334612697672,
> server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:6-0x236b912e9b3000e Creating (or updating) unassigned node for
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.;
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=hadoop05.sh.intel.com,60020,1334544902186,
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to
> xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> state=PENDING_OPEN, ts=1334613179096,
> server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to
> serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0,
> regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket
> timeout exception: java.net.SocketTimeoutException: 12 millis timeout
> while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302
> remote=/10.239.47.87:60020]
> at
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
> at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invok
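
The race described in the issue is a check-then-act problem: two callers each look at the region state and then act on it. One way to address it at the root, rather than suppressing the abort, is to make the check and the transition a single atomic step so that only one caller can claim the region. The following is a minimal sketch in plain Java, not HBase code and not the fix adopted for this issue. The class `AssignGuardSketch`, the method `tryBeginAssign`, and the reduced `State` enum are hypothetical names introduced only for illustration, and treating an unknown region as assignable is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the master's in-memory region states; not HBase code. */
class AssignGuardSketch {
    enum State { OFFLINE, CLOSED, PENDING_OPEN }

    private final ConcurrentMap<String, State> states = new ConcurrentHashMap<>();

    void setState(String region, State s) { states.put(region, s); }

    State getState(String region) { return states.get(region); }

    /**
     * Atomically claims a region for assignment. Returns true for the single
     * caller that moved it from CLOSED or OFFLINE to PENDING_OPEN, and false
     * for any caller that finds it already in another state.
     * Assumption: a region with no recorded state is treated as assignable.
     */
    boolean tryBeginAssign(String region) {
        boolean[] won = {false};
        // ConcurrentHashMap.compute runs the check and the update atomically per key.
        states.compute(region, (name, current) -> {
            if (current == null || current == State.CLOSED || current == State.OFFLINE) {
                won[0] = true;
                return State.PENDING_OPEN;
            }
            return current; // another caller is already assigning; leave the state alone
        });
        return won[0];
    }
}
```

With this shape, a second concurrent assign sees `false` and can back off without touching ZooKeeper, instead of reaching the state check that triggers the abort.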
[jira] [Commented] (HBASE-5816) Two concurrent assign would cause master to abort with msg "Unexpected state trying to OFFLINE; "
    [ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256663#comment-13256663 ]

Uma Maheswara Rao G commented on HBASE-5816:
--------------------------------------------

Maryann, this looks to be an issue in trunk also, right?
{code}
if (!hijack && !state.isClosed() && !state.isOffline()) {
  String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
  this.master.abort(msg, new IllegalStateException(msg));
  return -1;
}
{code}
Since the region was already assigned by the previous call, the state might have changed to inRegionTransition. If we just log and return now, I think it will just skip this assignment. I think it may be ok.
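
The "log and return" suggestion can be sketched as follows. This is a simplified illustration in plain Java under stated assumptions, not the actual patch: `OfflineCheckSketch`, `checkBeforeOffline`, the reduced `State` enum, and the `warnings` list (standing in for a logger) are all hypothetical names. As noted elsewhere in this thread, skipping on its own does not stop two threads from trying to assign the same region; it only avoids the abort.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; the names below are illustrative and are not HBase APIs. */
class OfflineCheckSketch {
    enum State { OFFLINE, CLOSED, PENDING_OPEN }

    /** Collected warnings stand in for a real logger. */
    final List<String> warnings = new ArrayList<>();

    /**
     * Returns 0 when it is safe to force the region OFFLINE, or -1 when the
     * region is in some other state and this assign attempt should be skipped.
     */
    int checkBeforeOffline(State state, boolean hijack) {
        if (!hijack && state != State.CLOSED && state != State.OFFLINE) {
            warnings.add("Unexpected state : " + state + " .. Skipping this assign.");
            return -1; // log and return instead of aborting the master
        }
        return 0;
    }
}
```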
[jira] [Commented] (HBASE-5816) Two concurrent assign would cause master to abort with msg "Unexpected state trying to OFFLINE; "
    [ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256617#comment-13256617 ]

Hadoop QA commented on HBASE-5816:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12523068/HBASE-5816.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
                       Please justify why no new tests are needed for this patch.
                       Also please list what manual steps were performed to verify this patch.

    -1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1565//console

This message is automatically generated.