[jira] [Commented] (HBASE-17801) Assigning dead region causing FAILED_OPEN permanent RIT that needs manual resolve
    [ https://issues.apache.org/jira/browse/HBASE-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942704#comment-15942704 ]

Allan Yang commented on HBASE-17801:

Another thought: can we lock the tables of the regions to be processed by SSH using {{TableLockManager}}, as each TableEventHandler does in its prepare stage?

> Assigning dead region causing FAILED_OPEN permanent RIT that needs manual resolve
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-17801
>                 URL: https://issues.apache.org/jira/browse/HBASE-17801
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.1.2
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>            Priority: Critical
>
> In Apache HBase 1.x, there is an Assignment Manager bug when SSH and drop table happen at the same time. Here is the sequence:
> (1). The Region Server hosting the target region is dead; SSH (server shutdown handler) offlines all regions hosted by the RS:
> {noformat}
> 2017-02-20 20:39:25,022 ERROR org.apache.hadoop.hbase.master.MasterRpcServices: Region server rs01.foo.com,60020,1486760911253 reported a fatal error:
> ABORTING region server rs01.foo.com,60020,1486760911253: regionserver:60020-0x55a076071923f5f, quorum=zk01.foo.com:2181,zk02.foo.com:2181,zk3.foo.com:2181, baseZNode=/hbase regionserver:60020-0x1234567890abcdf received expired from ZooKeeper, aborting
> Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:613)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:524)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> 2017-02-20 20:42:43,775 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for rs01.foo.com,60020,1486760911253 before assignment; region count=999
> 2017-02-20 20:43:31,784 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {783a4814b862a6e23a3265a874c3048b state=OPEN, ts=1487568368296, server=rs01.foo.com,60020,1486760911253} to {783a4814b862a6e23a3265a874c3048b state=OFFLINE, ts=1487648611784, server=rs01.foo.com,60020,1486760911253}
> {noformat}
> (2). Now SSH goes through each region and checks whether it should be re-assigned (at this time, SSH does check whether a table is disabled/deleted). If a region needs to be re-assigned, it is put into a list. Since, at this time, the troubled region still belongs to an enabled table, it is in the list.
> {noformat}
> 2017-02-20 20:43:31,795 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 999 region(s) that rs01.foo.com,60020,1486760911253 was carrying (and 0 regions(s) that were opening on this server)
> {noformat}
> (3). Now disable table and delete table come in and also try to offline the region; since the region is already offline, delete table just removes the region from meta and from memory.
> {noformat}
> 2017-02-20 20:43:32,429 INFO org.apache.hadoop.hbase.master.HMaster: Client=b_kylin/null disable t1
> 2017-02-20 20:43:34,275 INFO org.apache.hadoop.hbase.zookeeper.ZKTableStateManager: Moving table t1 state from DISABLING to DISABLED
> 2017-02-20 20:43:34,276 INFO org.apache.hadoop.hbase.master.procedure.DisableTableProcedure: Disabled table, t1, is completed.
> 2017-02-20 20:43:35,624 INFO org.apache.hadoop.hbase.master.HMaster: Client=b_kylin/null delete t1
> 2017-02-20 20:43:36,011 INFO org.apache.hadoop.hbase.MetaTableAccessor: Deleted [{ENCODED => fbf9fda1381636aa5b3cd6e3fe0f6c1e, NAME => 't1,,1487568367030.fbf9fda1381636aa5b3cd6e3fe0f6c1e.', STARTKEY => '', ENDKEY => '\x00\x01'}, {ENCODED => 783a4814b862a6e23a3265a874c3048b, NAME => 't1,\x00\x01,1487568367030.783a4814b862a6e23a3265a874c3048b.', STARTKEY => '\x00\x01', ENDKEY => ''}]
> {noformat}
> (4). However, SSH calls Assignment Manager to reassign the dead region (note that the dead region is in the re-assign list SSH collected, and we don't re-check it):
> {noformat}
> 2017-02-20 20:43:52,725 WARN org.apache.hadoop.hbase.master.AssignmentManager: Assigning but not in region states: {ENCODED => 783a4814b862a6e23a3265a874c3048b, NAME => 't1,\x00\x01,1487568367030.783a4814b862a6e23a3265a874c3048b.', STARTKEY => '\x00\x01', ENDKEY => ''}
> {noformat}
> (5). On the region server where the dead region tries to land, the region cannot be opened because the table has been dropped; the dead region ends up in FAILED_OPEN, a permanent RIT state.
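
To make the ordering in steps (1) to (5) concrete, and to show roughly what the table-lock idea in this comment would change, here is a minimal toy model in Python. It is only a sketch: `ToyMaster`, `try_lock`, `delete_table`, `assign`, and `run_scenario` are hypothetical names invented for illustration, not HBase or `TableLockManager` APIs, and the events are replayed in a fixed order on one thread rather than run concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for the master's state; none of these names are HBase APIs."""
    meta: dict = field(default_factory=dict)         # table -> set of region ids
    table_locks: dict = field(default_factory=dict)  # table -> lock owner

    def try_lock(self, table, owner):
        """Take the table lock if it is free or already held by this owner."""
        return self.table_locks.setdefault(table, owner) == owner

    def unlock(self, table, owner):
        if self.table_locks.get(table) == owner:
            del self.table_locks[table]

    def delete_table(self, table, use_locks):
        """Step (3): disable + delete removes the table's regions from meta."""
        if use_locks and not self.try_lock(table, "delete"):
            return False  # would have to wait until SSH releases the lock
        self.meta.pop(table, None)
        if use_locks:
            self.unlock(table, "delete")
        return True

    def assign(self, table, region):
        """Step (5): opening a region of a dropped table fails."""
        exists = region in self.meta.get(table, set())
        return "OPEN" if exists else "FAILED_OPEN"


def run_scenario(use_locks):
    region = "783a4814b862a6e23a3265a874c3048b"
    master = ToyMaster(meta={"t1": {region}})
    # Step (2): SSH builds its re-assign list while t1 is still enabled.
    to_assign = [("t1", region)]
    if use_locks:
        master.try_lock("t1", "ssh")
    # Step (3): delete table arrives between collection and assignment.
    deleted = master.delete_table("t1", use_locks)
    # Step (4): SSH assigns from its list without re-checking.
    states = [master.assign(t, r) for t, r in to_assign]
    if use_locks:
        master.unlock("t1", "ssh")
    return deleted, states
```

In this model, `run_scenario(False)` lets the delete through and the stale assignment ends in `FAILED_OPEN`, while `run_scenario(True)` makes the delete wait and the region opens. Whether holding table locks for every table touched by a dead server is acceptable in a real cluster is a separate question that the sketch does not address.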
[jira] [Commented] (HBASE-17801) Assigning dead region causing FAILED_OPEN permanent RIT that needs manual resolve
    [ https://issues.apache.org/jira/browse/HBASE-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932120#comment-15932120 ]

Allan Yang commented on HBASE-17801:

{quote}
Is there a way for DeleteTableProcedure to notify ServerShutdownHandler that these regions are being offlined?
{quote}
I think that is hard, and it would likely end up with some race conditions.
Maybe the master can check whether the region's table is already disabled or deleted when assigning a region and, if so, give up retrying.
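
The suggestion in this comment, checking the table state again at assignment time and giving up if the table is disabled or deleted, could look roughly like the sketch below. It is a Python illustration under stated assumptions, not the AssignmentManager code: `TableState`, `assign_with_recheck`, the `table_states` dictionary, and the `open_region` callback are all hypothetical names.

```python
from enum import Enum


class TableState(Enum):
    ENABLED = "ENABLED"
    DISABLED = "DISABLED"
    DELETED = "DELETED"


def assign_with_recheck(region, table, table_states, open_region):
    """Assign a region only if its table is still enabled right now.

    table_states: dict of table name -> TableState (hypothetical lookup).
    open_region:  callable that attempts the open, returning True on success.
    """
    # A table missing from the lookup is treated as deleted.
    state = table_states.get(table, TableState.DELETED)
    if state is not TableState.ENABLED:
        # The table was disabled or dropped after the re-assign list was
        # built: give up instead of retrying, so no FAILED_OPEN is recorded.
        return "SKIPPED"
    return "OPEN" if open_region(region) else "FAILED_OPEN"
```

For example, with `states = {"t1": TableState.DISABLED}`, calling `assign_with_recheck("783a4814b862a6e23a3265a874c3048b", "t1", states, lambda r: False)` returns `"SKIPPED"` rather than attempting the open. Note that a check followed by an open still leaves a small window in which the table can be dropped, so this narrows the race rather than closing it.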
[jira] [Commented] (HBASE-17801) Assigning dead region causing FAILED_OPEN permanent RIT that needs manual resolve
    [ https://issues.apache.org/jira/browse/HBASE-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930742#comment-15930742 ]

Ted Yu commented on HBASE-17801:

Is there a way for DeleteTableProcedure to notify ServerShutdownHandler that these regions are being offlined?