[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262499#comment-13262499 ]

ramkrishna.s.vasudevan commented on HBASE-5816:
---

@Maryann
If you are planning to work on this pls go ahead. :)

> Balancer and ServerShutdownHandler concurrently reassigning the same region
> ---
>
> Key: HBASE-5816
> URL: https://issues.apache.org/jira/browse/HBASE-5816
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.6
> Reporter: Maryann Xue
> Assignee: ramkrishna.s.vasudevan
> Priority: Critical
> Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to
> PENDING_OPEN, while the second assign follows immediately into "assign" and
> fails the RegionState check in setOfflineInZooKeeper(). This causes the
> master to abort.
> In the below case, the two concurrent assigns occurred when AM tried to
> assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler
> tried to assign this region (from the region plan) spontaneously.
>
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. (offlining)
> 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=CLOSED, ts=1334612697672, server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x236b912e9b3000e Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN, ts=1334613179096, server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 remote=/10.239.47.87:60020]
>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
>     at $Proxy7.openRegion(Unknown Source)
>     at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
>     at org.apach
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260240#comment-13260240 ]

stack commented on HBASE-5816:
--

@Ram I think the aim would be a simplification: one queue to assign from rather than several. Also, as is, I think state is a little distributed across multiple variables and maps. We should coalesce if possible. I think Maryann's suggestion of trying a double or triple concurrent assign in a unit test is a good start.
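To make the "one queue to assign from" idea concrete, here is a minimal sketch and not HBase code. The class `SingleQueueAssigner` and its methods are hypothetical names invented for illustration, and the region-to-server map stands in for the real assignment state. The assumption is that if every assign request goes through one single-threaded queue, a duplicate request for the same region can be detected and turned into a no-op instead of racing with the first one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical illustration only; this is NOT the HBase AssignmentManager API. */
class SingleQueueAssigner {
    // A single worker thread: every assign request is serialized through this queue.
    private final ExecutorService queue = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "assign-queue");
        t.setDaemon(true); // do not keep the JVM alive for this sketch
        return t;
    });

    // Only ever touched from the queue thread, so a plain HashMap is enough.
    private final Map<String, String> regionToServer = new HashMap<>();

    /** Runs a task on the queue thread and waits for its result. */
    private <T> T runOnQueue(Callable<T> task) {
        try {
            return queue.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Returns true if this call assigned the region, false if it was already assigned. */
    boolean assign(String region, String server) {
        return runOnQueue(() -> regionToServer.putIfAbsent(region, server) == null);
    }

    /** Returns the server the region is assigned to, or null if unassigned. */
    String serverOf(String region) {
        return runOnQueue(() -> regionToServer.get(region));
    }
}
```

With this shape, a balancer and a shutdown handler could both call `assign` for the same region; whichever request reaches the queue second simply gets `false` back. Whether that is acceptable behaviour for the real master is a separate question.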
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258301#comment-13258301 ]

ramkrishna.s.vasudevan commented on HBASE-5816:
---

I thought more about this. There are already some state-management data structures such as RIT (regions in transition), the regions map and the servers map, so adding more may again be redundant (my thought), and handling them should be clean.
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258055#comment-13258055 ]

Maryann Xue commented on HBASE-5816:

@stack, @ramkrishna We should be able to reproduce problem 2 on trunk in a unit test by initiating concurrent assign() calls from two threads.
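A rough sketch of what such a two-thread reproduction could look like, using a toy model rather than the real AssignmentManager. Everything here (`ToyAssignmentManager`, `ConcurrentAssignRepro`, the three-value `State` enum) is hypothetical and only mirrors the sequence in the issue description: the first assign leaves the region in PENDING_OPEN, and the second assign, following right behind it, fails the state check. A latch forces that ordering so the outcome is deterministic; a real test against TestAssignmentManager-style mocks would look different.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model only; names and states are assumptions, NOT the HBase API. */
class ToyAssignmentManager {
    enum State { OFFLINE, CLOSED, PENDING_OPEN }

    private final AtomicReference<State> regionState = new AtomicReference<>(State.CLOSED);
    private final AtomicBoolean aborted = new AtomicBoolean(false);

    /** Returns true if the region reached PENDING_OPEN, false if the state check failed. */
    boolean assign() {
        State seen = regionState.get();
        if (seen != State.CLOSED && seen != State.OFFLINE) {
            // Stand-in for the failed RegionState check that aborts the master.
            aborted.set(true);
            return false;
        }
        regionState.set(State.OFFLINE);
        regionState.set(State.PENDING_OPEN);
        return true;
    }

    boolean isAborted() { return aborted.get(); }
}

class ConcurrentAssignRepro {
    /** Calls assign() from two threads; the second starts as soon as the first returns. */
    static boolean[] run(ToyAssignmentManager am) {
        boolean[] results = new boolean[2];
        CountDownLatch firstDone = new CountDownLatch(1);

        Thread balancer = new Thread(() -> {
            results[0] = am.assign();
            firstDone.countDown();
        });
        Thread shutdownHandler = new Thread(() -> {
            try {
                firstDone.await();
                results[1] = am.assign();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        balancer.start();
        shutdownHandler.start();
        try {
            balancer.join();
            shutdownHandler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }
}
```

In the toy model the first call succeeds and the second trips the abort flag, which is the shape of failure the issue reports. Dropping the latch would make the interleaving non-deterministic, which is closer to production but harder to assert on.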
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257729#comment-13257729 ]

stack commented on HBASE-5816:
--

@Maryann Agree on your 1. and 2. above. It's possible to make a standalone AssignmentManager using mocks -- see TestAssignmentManager. Maybe we should try out some of your suppositions in unit tests, Maryann, and find holes in AM that way?
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257551#comment-13257551 ]

ramkrishna.s.vasudevan commented on HBASE-5816:
---

I tried to reproduce this in 0.94 but was not able to. Note, though, that the HBASE-5396 fix is not in it.
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257479#comment-13257479 ]

ramkrishna.s.vasudevan commented on HBASE-5816:
---

@Stack
In our internal version of 0.90, which has the 0.92 changes along with the TimeoutMonitor-related changes, we did the same thing Maryann did. It works, but we had the TimeoutMonitor to rescue us. As you said, it is better to have some data structure, like a set or a queue, so that no new assign is allowed for a region while that region is present in the data structure. But we should be very careful about when to clear it, because retry attempts are also made when an assign fails.
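The set idea in the comment above could look roughly like the following. This is only a minimal sketch, not HBase code: the class `InFlightAssignGuard`, its method names, and the use of a plain region-name string are all assumptions made for illustration. The point it tries to show is the "when to clear it" concern: the entry is removed only after the whole retry loop has finished, never between retries.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntPredicate;

/** Hypothetical guard, not part of HBase: at most one assign per region at a time. */
final class InFlightAssignGuard {
    // Names of regions that some thread is currently assigning.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /**
     * Runs {@code attempt} up to {@code maxRetries} times, unless another
     * caller is already assigning the same region.
     *
     * @return true if this caller ran the assignment and one attempt succeeded;
     *         false if the call was skipped or every attempt failed
     */
    boolean assignOnce(String regionName, int maxRetries, IntPredicate attempt) {
        if (!inFlight.add(regionName)) {
            return false; // someone else owns this region's assignment: skip
        }
        try {
            for (int retry = 0; retry < maxRetries; retry++) {
                if (attempt.test(retry)) {
                    return true;
                }
            }
            return false;
        } finally {
            // Cleared only here, after all retries, so a retry in progress
            // can never be overtaken by a second assign of the same region.
            inFlight.remove(regionName);
        }
    }

    boolean isInFlight(String regionName) {
        return inFlight.contains(regionName);
    }
}
```

A real version would probably need to tell "skipped" apart from "failed", for example with an enum result, since the two cases call for different handling by the caller.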
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257312#comment-13257312 ]

ramkrishna.s.vasudevan commented on HBASE-5816:
---

I will try to dig into this more. The code that Maryann refers to is in 0.90 only. There are some subtle differences between 0.90 and 0.92+ regarding assignments, so it is better that we investigate the bug in 0.92 separately as well.
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257298#comment-13257298 ]

Maryann Xue commented on HBASE-5816:
---

Yes, Zhihong, not in trunk now. It is only in the 0.90 branch.
I think there are two different problems here.
1. The ServerShutdownHandler should check more strictly before calling assign() for regions "planned" to move to the dead server.
2. Master should handle concurrent assign() requests more properly. In the case I attached, the later assign() call by ServerShutdownHandler could actually be ignored, since AssignmentManager is already doing the assignment for this region. There could be situations where client code calls assign() at the interface level and causes the same problem.
True, doing a dead-server check is not safe, because the private assign() method makes a number of retry attempts in its own loop, and the server could be found dead just after the first attempt fails.
stack, I think maintaining a queue for assign() requests, whether from the balancer, the ServerShutdownHandler, or client calls, can be a good solution. However, if the AssignmentManager's own logic across region states and regionsInTransition is sound, it should be justified to assume that when assign() sees a wrong state, it simply indicates that another assign has just succeeded. So it should be OK for an "invalid/later" assign to return.
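The last point in the comment above, that a later assign which finds an unexpected state can simply return, can be illustrated with an atomic compare-and-set on the region state. This is a sketch under stated assumptions, not the AssignmentManager implementation: `RegionStateTable`, its reduced three-value `State` enum, and the string region keys are hypothetical names invented here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical state table, not part of HBase. */
final class RegionStateTable {
    enum State { OFFLINE, PENDING_OPEN, OPEN }

    private final ConcurrentMap<String, State> states = new ConcurrentHashMap<>();

    void setOffline(String region) {
        states.put(region, State.OFFLINE);
    }

    /**
     * Atomically moves the region from OFFLINE to PENDING_OPEN.
     *
     * @return true if this caller made the transition; false if the region
     *         was not OFFLINE, which here is read as "another assign got
     *         there first", so the caller returns instead of aborting
     */
    boolean tryBeginAssign(String region) {
        return states.replace(region, State.OFFLINE, State.PENDING_OPEN);
    }

    State get(String region) {
        return states.get(region);
    }
}
```

With this shape, the check and the state update are one step, so two concurrent callers cannot both believe they own the assignment. Whether "wrong state" always means "another assign succeeded" is exactly the assumption the comment calls out.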
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257255#comment-13257255 ]

stack commented on HBASE-5816:
---

Should we have the ServerShutdownHandler and the balancer feed a single queue that the AssignmentManager pulls from? If the region is already in the queue, then we'd favor the purposed assignment (the balancer's?) rather than the random one?
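One way to picture the single-queue proposal above is a de-duplicating queue keyed by region, where a request that carries a balancer plan replaces a plan-less request for the same region. Everything in this sketch is an assumption for illustration (the `AssignQueue` class, the `Source` enum, and using `null` as "pick any server"); it is not how HBase queues assignments.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical single assignment queue, not part of HBase. */
final class AssignQueue {
    enum Source { BALANCER_PLAN, SERVER_SHUTDOWN }

    static final class Request {
        final String region;
        final Source source;
        final String destination; // null means "pick any live server"

        Request(String region, Source source, String destination) {
            this.region = region;
            this.source = source;
            this.destination = destination;
        }
    }

    // Insertion-ordered and keyed by region, so each region is queued at most once.
    private final Map<String, Request> pending = new LinkedHashMap<>();

    synchronized void offer(Request req) {
        Request existing = pending.get(req.region);
        if (existing == null) {
            pending.put(req.region, req);
        } else if (existing.source != Source.BALANCER_PLAN
                && req.source == Source.BALANCER_PLAN) {
            // Favor the planned assignment; the region keeps its queue position.
            pending.put(req.region, req);
        }
        // Otherwise the region is already queued: drop the duplicate.
    }

    /** Removes and returns the oldest request, or null if the queue is empty. */
    synchronized Request poll() {
        Iterator<Request> it = pending.values().iterator();
        if (!it.hasNext()) {
            return null;
        }
        Request next = it.next();
        it.remove();
        return next;
    }

    synchronized int size() {
        return pending.size();
    }
}
```

Because both producers go through `offer`, a region can never be handed to the consumer twice for the same outage, which is the property the proposal is after.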
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257252#comment-13257252 ]

stack commented on HBASE-5816:
---

Great stuff Maryann. Where is the above bit of code from? I don't find it in trunk (could be me).

bq. It should be safe for the later thread just return or get an exception if the region has already been assigned by an earlier thread.

What are you thinking? When we go into the assign, we check if the region is in transition and, unless it's a force assign, just return? Or would you do this earlier? Maybe the balancer should be more deferential? It could check whether the regionserver it's been asked to move a region from is on the deadservers list. This would still be racy though. Would doing the check in the assign method be enough? (I've not looked at the code.) Thanks for the help on this stuff.
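The check asked about above (return early when the region is already in transition, unless the assign is forced) might be sketched as follows. The class `TransitionCheck`, its map of region name to target server, and the `finished` callback are hypothetical and only stand in for whatever the AssignmentManager actually tracks.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-transition check, not part of HBase. */
final class TransitionCheck {
    // Region name -> server targeted by the assignment currently in progress.
    private final ConcurrentMap<String, String> regionsInTransition =
            new ConcurrentHashMap<>();

    /**
     * @return true if this call started the assignment, or took it over
     *         because it was forced; false if the region was already in
     *         transition and the call was ignored
     */
    boolean assign(String region, String server, boolean force) {
        if (force) {
            regionsInTransition.put(region, server); // forced: override the current owner
            return true;
        }
        // putIfAbsent makes "check and claim" one atomic step.
        return regionsInTransition.putIfAbsent(region, server) == null;
    }

    /** Called once the region has opened or the assignment was abandoned. */
    void finished(String region) {
        regionsInTransition.remove(region);
    }

    String target(String region) {
        return regionsInTransition.get(region);
    }
}
```

Doing the test inside the assign path closes the window between "check" and "act" that a separate dead-server check in the balancer would leave open. Note that in the snippet under discussion the ServerShutdownHandler calls `assign(hri, true)`, so a forced call would still need its own rule for when overriding is safe.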
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257217#comment-13257217 ]

Zhihong Yu commented on HBASE-5816:
---

HBASE-5396 was only integrated into 0.90, so it shouldn't be the cause of the problem in trunk.

> Balancer and ServerShutdownHandler concurrently reassigning the same region
> ---
>
> Key: HBASE-5816
> URL: https://issues.apache.org/jira/browse/HBASE-5816
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.6
> Reporter: Maryann Xue
> Priority: Critical
> Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to
> PENDING_OPEN, while the second assign follows immediately into "assign" and
> fails the RegionState check in setOfflineInZooKeeper(). This causes the
> master to abort.
> In the below case, the two concurrent assigns occurred when AM tried to
> assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler
> tried to assign this region (from the region plan) spontaneously.
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=hadoop05.sh.intel.com,60020,1334544902186,
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> (offlining)
> 2012-04-17 05:44:57,648 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to
> serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0,
> regions=0, usedHeap=0, maxHeap=0) for region
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned
> node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b
> (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> state=CLOSED, ts=1334612697672,
> server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:6-0x236b912e9b3000e Creating (or updating) unassigned node for
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.;
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
> src=hadoop05.sh.intel.com,60020,1334544902186,
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to
> xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> state=PENDING_OPEN, ts=1334613179096,
> server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to
> serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0,
> regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket
> timeout exception: java.net.SocketTimeoutException: 12 millis timeout
> while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302
> remote=/10.239.47.87:60020]
> at
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
> at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
> at $Proxy7.openRegion(Unknown Source)
> at
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
> at
> org.apache.hadoop.hbase.master.AssignmentManager.as
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257195#comment-13257195 ]

Maryann Xue commented on HBASE-5816:
---

stack, your suggestion seems to be the ultimate solution for the current HMaster workflow. Trunk has the same problem. The case I attached was actually introduced by HBASE-5396, which tried to let ServerShutdownHandler assign the region at an earlier stage instead of waiting for the TimeoutMonitor to do the job. But the isRegionOnline test seems too weak here:

    for (HRegionInfo hri : regionsFromRegionPlansForServer) {
      if (!this.services.getAssignmentManager().isRegionOnline(hri)) {
        this.services.getAssignmentManager().assign(hri, true);
        reassignedPlans++;
      }
    }

However, I think any client call to HBaseAdmin.assign() that coincides at this point would cause the same problem. There is a lock guarding the private assign() method to deal with concurrent assigns, but the entire assign process is not atomic. It should be safe for the later thread to just return or get an exception if the region has already been assigned by an earlier thread.