[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ]

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 5:08 PM:
--

Subtasks look good. Back to the main issue.

{noformat}
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. I notice that in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging around in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it. [Edit: Removed some distraction.] It remains to be seen whether CatalogJanitor will clean the regions. Will update.

was (Author: apurtell):
Subtasks look good. Back to the main issue.

{noformat}
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. I notice that in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging around in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it. [Edit: Removed some distraction.] It remains to be seen whether CatalogJanitor will clean the regions. Probably. What is happening during the latest tests is that, because isRIT usually evaluates to true (it's an ingest test, after all) and HBASE-25840 is applied, all of the cleanup is deferred. Will update and see if region GC proceeds as expected.

> SPLIT state detritus
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.3
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this happened in the happy path.
> There were no errors accessing all loaded table data. The integration test writes a log to HDFS of every cell written to HBase and the verify phase uses that log to read each value and confirm it.
> That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO [main] test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO [main] test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO [main] test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO level log messages printed of the form _assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=TABLENAME_
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server
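The bookkeeping asymmetry described in the comment above (merge parents removed from {{regionStates}} outright, split parents left behind in SPLIT state with server = null) can be sketched as a minimal, self-contained simulation. None of this is HBase code; the class, map, and method names only mirror the ones discussed, and the real AssignmentManager internals are not shown here:

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch (NOT HBase code) of the regionStates bookkeeping gap:
// merge handling deletes the parents; split handling leaves the parent behind.
public class RegionStatesSketch {
  enum State { OPEN, SPLIT }

  static final class RegionState {
    State state = State.OPEN;
    String server = "rs1";
  }

  // Stand-in for AssignmentManager's in-memory regionStates map.
  static final Map<String, RegionState> regionStates = new HashMap<>();

  // Mirrors markRegionAsMerged as described: merge parents are removed right there.
  static void markRegionAsMerged(String child, String... parents) {
    for (String p : parents) {
      regionStates.remove(p);
    }
    regionStates.put(child, new RegionState());
  }

  // Mirrors markRegionAsSplit as described: the parent is flipped to SPLIT with
  // server = null but never removed, so it lingers until something cleans it up.
  // The proposed fix is a regionStates.remove(parent) here, matching the merge path.
  static void markRegionAsSplit(String parent, String childA, String childB) {
    RegionState ps = regionStates.get(parent);
    ps.state = State.SPLIT;
    ps.server = null; // the "server = null" detritus seen in the HbckChore counts
    regionStates.put(childA, new RegionState());
    regionStates.put(childB, new RegionState());
  }

  public static void main(String[] args) {
    regionStates.put("r1", new RegionState());
    markRegionAsSplit("r1", "r1a", "r1b");
    // The SPLIT parent is still counted, which is how the in-memory total
    // (184) comes to exceed the regionservers' reported total (133).
    System.out.println(regionStates.size());          // 3
    System.out.println(regionStates.get("r1").state); // SPLIT
  }
}
```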
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ]

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 5:06 PM:
--

Subtasks look good. Back to the main issue.

{noformat}
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. I notice that in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging around in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it. [Edit: Removed some distraction.] It remains to be seen whether CatalogJanitor will clean the regions. Probably. What is happening during the latest tests is that, because isRIT usually evaluates to true (it's an ingest test, after all) and HBASE-25840 is applied, all of the cleanup is deferred. Will update and see if region GC proceeds as expected.

was (Author: apurtell):
Subtasks look good. Back to the main issue.

{noformat}
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. I notice that in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging around in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it. If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better:

{noformat}
2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions
{noformat}

No more junk in {{regionStates}}, but those 9 split parents are found as orphan regions. There is a simple change to HbckChore that should accompany the other changes I have under test: the conditional for determining whether a region is an orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}}, so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. It remains to be seen whether CatalogJanitor will clean the regions. Probably. What is happening during the latest tests is that, because isRIT usually evaluates to true (it's an ingest test, after all) and HBASE-25840 is applied, all of the cleanup is deferred. Will update and see if region GC proceeds as expected.
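The proposed HbckChore conditional can be exercised in isolation. The following is a hypothetical harness, not the actual HbckChore code; only the shape of the conditional and the set names (splitParentRegions, mergedParentRegions) are taken from the comment above:

```java
import java.util.HashSet;
import java.util.Set;

// Standalone harness (NOT HBase code) for the proposed orphan check:
// a region found on the filesystem is an orphan only if it is unknown to the
// AssignmentManager AND is not a known split parent or merged parent.
public class OrphanCheckSketch {
  static boolean isOrphan(Object hri, String encodedRegionName,
      Set<String> splitParentRegions, Set<String> mergedParentRegions) {
    // The conditional proposed for HbckChore, verbatim in shape.
    return hri == null
        && !splitParentRegions.contains(encodedRegionName)
        && !mergedParentRegions.contains(encodedRegionName);
  }

  public static void main(String[] args) {
    Set<String> splitParents = new HashSet<>();
    splitParents.add("087fb2f7847c2fc0a0b85eb30a97036e");
    Set<String> mergedParents = new HashSet<>();

    // A split parent still on disk: previously flagged as orphan, now skipped.
    System.out.println(isOrphan(null, "087fb2f7847c2fc0a0b85eb30a97036e",
        splitParents, mergedParents)); // false
    // A region unknown everywhere is still reported as an orphan.
    System.out.println(isOrphan(null, "deadbeef",
        splitParents, mergedParents)); // true
  }
}
```

With this check in place, the 9 split parents in the run above would no longer be counted, and the reported orphan count would drop to 0 once the other changes land.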
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 1:08 AM: -- Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it. If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat} 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. 
There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Remains to be seen if CatalogJanitor will clean the regions. Probably. What is happening during latest tests is because isRIT evaluates to true usually (it's an ingest test after all) and HBASE-25840 is applied all of the cleanup is deferred. Will update and see if region GC proceeds as expected. was (Author: apurtell): Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it. 
If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat} 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Remains to be seen if CatalogJanitor will clean the regions. Probably. What is happening during latest tests is because isRIT evaluates to true usually (it's an
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 12:57 AM: --- Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it. If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat} 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. 
There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Remains to be seen if CatalogJanitor will clean the regions. Probably. What is happening during latest tests is because isRIT evaluates to true usually (it's an ingest test after all) and HBASE-25840 is applied all of the cleanup is deferred. Will wait until the end to see if region GC proceeds as expected. was (Author: apurtell): Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! 
If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat} 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Anyway, there is more to debug. Seems CatalogJanitor or the region GC procedures it submits are not completing their work. I have added
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:56 PM: --- Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat} 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. 
There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Anyway, there is more to debug. Seems CatalogJanitor or the region GC procedures it submits are not completing their work. I have added some debug logging to CatalogJanitor to investigate further. was (Author: apurtell): Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! 
If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Anyway, there is more to debug. CatalogJanitor or the region GC procedures it submits may not be completing their work. I have added some debug logging to CatalogJanitor to investigate further. > SPLIT state detritus >
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:55 PM: --- Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. 
There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName))}} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Anyway, there is more to debug. CatalogJanitor or the region GC procedures it submits may not be completing their work. I have added some debug logging to CatalogJanitor to investigate further. was (Author: apurtell): Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! 
If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{ if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName)) }} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Anyway, there is more to debug. CatalogJanitor or the region GC procedures it submits may not be completing their work. I have added some debug logging to CatalogJanitor to investigate further. > SPLIT state detritus
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:54 PM: --- Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}} but those 9 split parents are found as orphan regions. 
There is a simple change to HbckChore that should accompany the other changes I have under test: The conditional for determining if a region is orphan should become {{ if (hri == null && !splitParentRegions.contains(encodedRegionName) && !mergedParentRegions.contains(encodedRegionName)) }} so you can ignore the non-zero orphan count. With the complete change it would be reported as 0. Anyway, there is more to debug. CatalogJanitor or the region GC procedures it submits may not be completing their work. I have added some debug logging to CatalogJanitor to investigate further. was (Author: apurtell): Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that account for a post-split parent to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! 
If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}}, but those 9 split parents are found as orphan regions. I have added some debug logging to CatalogJanitor to investigate further. > SPLIT state detritus > > > Key: HBASE-25829 > URL: https://issues.apache.org/jira/browse/HBASE-25829 > Project: HBase > Issue Type: Bug >Affects Versions: 2.4.3 >Reporter: Andrew Kyle Purtell >Assignee: Andrew Kyle Purtell >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3 > > > Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this > happened in the happy path. > There were no
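The proposed HbckChore conditional can be sketched in isolation. The following is a hypothetical standalone illustration, not the actual HBase code: the set names mirror the conditional quoted in the comment above, and the encoded region names are made up.

```java
import java.util.Set;

// Standalone sketch of the proposed HbckChore orphan check: a region directory found
// on the filesystem counts as an orphan only if the AssignmentManager has no record
// of it (hri == null) AND it is not a known split or merged parent awaiting cleanup.
// Set names follow the conditional quoted in the comment; this is not HBase code.
public class OrphanCheckSketch {
  static boolean isOrphan(Object hri, String encodedRegionName,
      Set<String> splitParentRegions, Set<String> mergedParentRegions) {
    return hri == null
        && !splitParentRegions.contains(encodedRegionName)
        && !mergedParentRegions.contains(encodedRegionName);
  }

  public static void main(String[] args) {
    // Hypothetical encoded region names, for illustration only.
    Set<String> splitParents = Set.of("aaaa1111");
    Set<String> mergedParents = Set.of("bbbb2222");
    // A split parent still on disk is no longer reported as an orphan...
    System.out.println(isOrphan(null, "aaaa1111", splitParents, mergedParents));
    // ...while a region unknown everywhere still is.
    System.out.println(isOrphan(null, "cccc3333", splitParents, mergedParents));
  }
}
```

With this check in place, the 9 split parents found on the filesystem would no longer inflate the orphan count.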
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:50 PM: --- Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}}, but those 9 split parents are found as orphan regions. I have added some debug logging to CatalogJanitor to investigate further. 
was (Author: apurtell): Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}}, but those 9 split parents are found as orphan regions. I have added some debug logging to CatalogJanitor to investigate further. 
> SPLIT state detritus > > > Key: HBASE-25829 > URL: https://issues.apache.org/jira/browse/HBASE-25829 > Project: HBase > Issue Type: Bug >Affects Versions: 2.4.3 >Reporter: Andrew Kyle Purtell >Assignee: Andrew Kyle Purtell >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3 > > > Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this > happened in the happy path. > There were no errors accessing all loaded table data. The integration test > writes a log to HDFS of every cell written to HBase and the verify phase uses > that log to read each value and confirm it. That seems fine: > {noformat} > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify:
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:47 PM: --- Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}}, but those 9 split parents are found as orphan regions. I have added some debug logging to CatalogJanitor to investigate further. 
was (Author: apurtell): Subtasks look good. Back to the main issue. {noformat} 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager 2021-05-03 20:30:29,964 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 20:30:29,975 INFO [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions {noformat} The 51 extra regions are SPLIT parents, with server = null. I notice in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it, but does not! If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}}, things begin to look better: {noformat}2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager 2021-05-03 22:08:29,036 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions 2021-05-03 22:08:29,043 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions {noformat} No more junk in {{regionStates}}, but those 9 split parents are found as orphan regions. I have added some debug logging to CatalogJanitor to investigate further. 
> SPLIT state detritus > > > Key: HBASE-25829 > URL: https://issues.apache.org/jira/browse/HBASE-25829 > Project: HBase > Issue Type: Bug >Affects Versions: 2.4.3 >Reporter: Andrew Kyle Purtell >Assignee: Andrew Kyle Purtell >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3 > > > Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this > happened in the happy path. > There were no errors accessing all loaded table data. The integration test > writes a log to HDFS of every cell written to HBase and the verify phase uses > that log to read each value and confirm it. That seems fine: > {noformat} > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0 > {noformat} > However whenever the balancer runs there are a number
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337690#comment-17337690 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 11:41 PM: This addresses the issue of multiple split request transaction submissions. {code:java} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java index 107330d90b..48cc26086f 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java @@ -1119,7 +1119,13 @@ public class AssignmentManager { LOG.debug("Split request from " + serverName + ", parent=" + parent + " splitKey=" + Bytes.toStringBinary(splitKey)); } - master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey)); +if (regionStates.getRegionState(parent).isOpened() && + !regionStates.getRegionState(parent).isSplitting()) { + master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey)); +} else { + LOG.warn("Ignoring split request from " + serverName + +", parent=" + parent + " because parent is already splitting or not online"); +} // If the RS is < 2.0 throw an exception to abort the operation, we are handling the split if (master.getServerManager().getVersionNumber(serverName) < 0x020) { {code} Is this the complete fix, though? 
With this patch in place, now we just have in the master log: {noformat} 2021-04-30 23:22:14,971 WARN [RpcServer.priority.RWQ.Codel.write.handler=1,queue=0,port=8100] assignment.AssignmentManager: Ignoring split request from ip-172-31-63-83.us-west-2.compute.internal,8120,1619824775800, parent={ENCODED => df7aa0e0af5a2b757ad86f2cf051fcbb, NAME => 'IntegrationTestLoadCommonCrawl,,1619824793285.df7aa0e0af5a2b757ad86f2cf051fcbb.', STARTKEY => '', ENDKEY => ''} because parent is already splitting or not online {noformat} but the first submission of the split request is already in progress and completes just fine. was (Author: apurtell): This addresses the issue of multiple split request transaction submissions. {code:java} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java index 107330d90b..48cc26086f 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java @@ -1119,7 +1119,13 @@ public class AssignmentManager { LOG.debug("Split request from " + serverName + ", parent=" + parent + " splitKey=" + Bytes.toStringBinary(splitKey)); } - master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey)); +if (regionStates.getRegionState(parent).isOpened() && + !regionStates.getRegionState(parent).isSplitting()) { + master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey)); +} else { + LOG.warn("Ignoring split request from " + serverName + +", parent=" + parent + " because parent is already splitting or not online"); +} // If the RS is < 2.0 throw an exception to abort the operation, we are handling the split if (master.getServerManager().getVersionNumber(serverName) < 0x020) { {code} Is this the complete fix, though? 
Should the RS be submitting this report more than once? With this patch in place, now we just have in the master log: {noformat} 2021-04-30 23:22:14,971 WARN [RpcServer.priority.RWQ.Codel.write.handler=1,queue=0,port=8100] assignment.AssignmentManager: Ignoring split request from ip-172-31-63-83.us-west-2.compute.internal,8120,1619824775800, parent={ENCODED => df7aa0e0af5a2b757ad86f2cf051fcbb, NAME => 'IntegrationTestLoadCommonCrawl,,1619824793285.df7aa0e0af5a2b757ad86f2cf051fcbb.', STARTKEY => '', ENDKEY => ''} because parent is already splitting or not online {noformat} but the first submission of the split request is already in progress and completes just fine. > SPLIT state detritus > > > Key: HBASE-25829 > URL: https://issues.apache.org/jira/browse/HBASE-25829 > Project: HBase > Issue Type: Bug >Affects Versions: 2.4.3 >Reporter: Andrew Kyle Purtell >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3 > > > Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this > happened in
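The effect of the guard in the patch above can be sketched standalone. This is a hypothetical illustration, not the HBase implementation: for brevity it folds the OPEN to SPLITTING transition into the check itself (in the real code the submitted procedure performs that transition), and the enum, map, and region name are stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch: only the first split request for a parent region submits a
// procedure; once the parent has left OPEN, duplicate reports from the regionserver
// are ignored. The state enum and map are stand-ins, not the real RegionStates.
public class SplitGuardSketch {
  enum State { OPEN, SPLITTING, SPLIT, CLOSING }

  static final Map<String, State> regionStates = new HashMap<>();

  // Returns true iff a split procedure would be submitted for this request.
  static boolean handleSplitRequest(String parent) {
    if (regionStates.get(parent) == State.OPEN) {
      // submitProcedure(createSplitProcedure(...)) in the real code; the sketch
      // records the transition directly so the duplicate below is rejected.
      regionStates.put(parent, State.SPLITTING);
      return true;
    }
    return false; // parent already splitting or not online: ignore the duplicate
  }

  public static void main(String[] args) {
    regionStates.put("parent-1", State.OPEN); // hypothetical region name
    System.out.println(handleSplitRequest("parent-1")); // first request: submitted
    System.out.println(handleSplitRequest("parent-1")); // duplicate: ignored
  }
}
```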
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337654#comment-17337654 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 10:55 PM: {quote}In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled or isSplitParent. Both exclude some subset of regions that should not be candidates for consideration, but why not get to the point? {quote} The change is good on its own merits, but I don't believe the balancer is doing something it should not, so it won't address the root issue. I will follow up on the getAssignmentsForBalancer change in a subtask. Back to the split state handling. Perhaps we are trying to create a SplitTransitionProcedure for the split more than once. was (Author: apurtell): {quote}In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled or isSplitParent. Both exclude some subset of regions that should not be candidates for consideration, but why not get to the point? {quote} This helps, but it treats the symptom. I will follow up on this in a subtask. Back to the split state handling. Perhaps we are trying to create a SplitTransitionProcedure for the split more than once. > SPLIT state detritus > > > Key: HBASE-25829 > URL: https://issues.apache.org/jira/browse/HBASE-25829 > Project: HBase > Issue Type: Bug >Affects Versions: 2.4.3 >Reporter: Andrew Kyle Purtell >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3 > > > Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this > happened in the happy path. > There were no errors accessing all loaded table data. The integration test > writes a log to HDFS of every cell written to HBase and the verify phase uses > that log to read each value and confirm it. 
That seems fine: > {noformat} > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0 > {noformat} > However whenever the balancer runs there are a number of concerning INFO > level log messages printed of the form _assignment.RegionStates: Skipping, no > server for state=SPLIT, location=null, table=TABLENAME_ > For example: > {noformat} > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205 > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06 > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745 > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572 > 2021-04-30 02:02:09,286 INFO 
[master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8 > {noformat} > The HBCK chore notices them but does nothing: > "Loaded *80 regions* from in-memory state of AssignmentManager" > "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions" > "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions" > Yes, there are exactly 7 region state records of SPLIT state with > server=null. > {noformat} > 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1] > master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager > 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1] > master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 > orphan regions > 2021-04-30 02:02:09,306 INFO [master/ip-172-31-58-47:8100.Chore.1] >
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337654#comment-17337654 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 10:50 PM: {quote}In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled or isSplitParent. Both exclude some subset of regions that should not be candidates for consideration, but why not get to the point? {quote} This helps, but it treats the symptom. I will follow up on this in a subtask. Back to the split state handling. Perhaps we are trying to create a SplitTransitionProcedure for the split more than once. was (Author: apurtell): bq. In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled or isSplitParent. Both exclude some subset of regions that should not be candidates for consideration, but why not get to the point? This helps, but it treats the symptom. I will follow up on this in a subtask. Back to the split state handling. Why is this printed three times? 
{noformat} 2021-04-30 22:06:15,686 INFO [PEWorker-6] procedure.MasterProcedureScheduler: Took xlock for pid=93, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=2d64bc9357b5f780307611f43052e092, daughterA=817c9985e6a2ee7645d3a292758ec729, daughterB=8ea34c3e5bcfb4c7d81e4212d700b824 2021-04-30 22:06:15,686 INFO [PEWorker-6] procedure.MasterProcedureScheduler: Took xlock for pid=93, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=2d64bc9357b5f780307611f43052e092, daughterA=817c9985e6a2ee7645d3a292758ec729, daughterB=8ea34c3e5bcfb4c7d81e4212d700b824 2021-04-30 22:06:15,686 INFO [PEWorker-6] procedure.MasterProcedureScheduler: Took xlock for pid=93, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=2d64bc9357b5f780307611f43052e092, daughterA=817c9985e6a2ee7645d3a292758ec729, daughterB=8ea34c3e5bcfb4c7d81e4212d700b824 {noformat} because what follows is 1. A successful split. 2. A failed SplitTableRegionProcedure with assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: 2d64bc9357b5f780307611f43052e092 is not OPEN; state=CLOSING 3. Another failed SplitTableRegionProcedure with assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: 2d64bc9357b5f780307611f43052e092 is not OPEN; state=CLOSING So we ran the split three times? And only one succeeded. > SPLIT state detritus > > > Key: HBASE-25829 > URL: https://issues.apache.org/jira/browse/HBASE-25829 > Project: HBase > Issue Type: Bug >Affects Versions: 2.4.3 >Reporter: Andrew Kyle Purtell >Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3 > > > Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this > happened in the happy path. > There were no errors accessing all loaded table data. 
The integration test > writes a log to HDFS of every cell written to HBase and the verify phase uses > that log to read each value and confirm it. That seems fine: > {noformat} > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0 > 2021-04-30 02:16:33,316 INFO [main] > test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0 > {noformat} > However whenever the balancer runs there are a number of concerning INFO > level log messages printed of the form _assignment.RegionStates: Skipping, no > server for state=SPLIT, location=null, table=TABLENAME_ > For example: > {noformat} > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205 > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06 > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] > assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, > table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745 > 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] >
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337628#comment-17337628 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 9:25 PM: --- I'm going to try out a patch, see if it helps. In theory this both excludes regions in SPLITTING or SPLITTING_NEW from balancer attempts to mutate assignments and skips a bunch of other uninteresting (as in not actively assigned) states, reducing balancer workload and also avoiding unintended consequences. In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled or isSplitParent. Both exclude some subset of regions that should not be candidates for consideration, but why not get to the point? {code} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java index d3553f11a3..64abd4e3ac 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java @@ -553,23 +552,25 @@ public class RegionStates { * wants to iterate this exported list. We need to synchronize on regions * since all access to this.servers is under a lock on this.regions. * - * @return A clone of current assignments. + * @return A clone of current open or opening assignments. */ public Map>> getAssignmentsForBalancer( TableStateManager tableStateManager, List onlineServers) { final Map>> result = new HashMap<>(); for (RegionStateNode node : regionsMap.values()) { - if (isTableDisabled(tableStateManager, node.getTable())) { -continue; - } - if (node.getRegionInfo().isSplitParent()) { + // When balancing, we are only interested in OPEN or OPENING regions and expected + // to be online at that server until possibly the next balancer iteration or unless + // we decide to move it. 
Other states are not interesting as the region will either + // be closing, or splitting/merging, or will not be deployed. + if (!(node.isInState(State.OPEN)||node.isInState(State.OPENING))) { continue; } Map> tableResult = result.computeIfAbsent(node.getTable(), t -> new HashMap<>()); final ServerName serverName = node.getRegionLocation(); + // A region in ONLINE or OPENING state should have a location. if (serverName == null) { -LOG.info("Skipping, no server for " + node); +LOG.warn("Skipping, no server for " + node); continue; } List serverResult = {code} was (Author: apurtell): I'm going to try out a patch, see if it helps. In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled or isSplitParent. Both exclude some subset of regions that should not be candidates for consideration, but why not get to the point? {code} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java index d3553f11a3..64abd4e3ac 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java @@ -553,23 +552,25 @@ public class RegionStates { * wants to iterate this exported list. We need to synchronize on regions * since all access to this.servers is under a lock on this.regions. * - * @return A clone of current assignments. + * @return A clone of current open or opening assignments. 
*/ public Map>> getAssignmentsForBalancer( TableStateManager tableStateManager, List onlineServers) { final Map>> result = new HashMap<>(); for (RegionStateNode node : regionsMap.values()) { - if (isTableDisabled(tableStateManager, node.getTable())) { -continue; - } - if (node.getRegionInfo().isSplitParent()) { + // When balancing, we are only interested in OPEN or OPENING regions and expected + // to be online at that server until possibly the next balancer iteration or unless + // we decide to move it. Other states are not interesting as the region will either + // be closing, or splitting/merging, or will not be deployed. + if (!(node.isInState(State.OPEN)||node.isInState(State.OPENING))) { continue; } Map> tableResult = result.computeIfAbsent(node.getTable(), t -> new HashMap<>()); final ServerName serverName = node.getRegionLocation(); + // A region in ONLINE or OPENING state should have a location. if (serverName == null) { -LOG.info("Skipping, no server for " + node); +LOG.warn("Skipping,
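The intent of the patch above (enumerate only what the balancer can act on, rather than listing exclusions one by one) can be sketched standalone. The Node class and state enum below are stand-ins for RegionStateNode and RegionState.State, not the real HBase classes.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the getAssignmentsForBalancer filter: keep only regions in
// OPEN or OPENING state, the only ones the balancer can sensibly move. SPLIT
// parents, closing regions, etc. fall out without a dedicated exclusion check.
public class BalancerFilterSketch {
  enum State { OPEN, OPENING, CLOSING, SPLIT, SPLITTING }

  static class Node {
    final String name;
    final State state;
    Node(String name, State state) { this.name = name; this.state = state; }
  }

  static List<Node> candidatesForBalancer(List<Node> all) {
    List<Node> result = new ArrayList<>();
    for (Node n : all) {
      if (n.state == State.OPEN || n.state == State.OPENING) {
        result.add(n);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Node> all = List.of(
        new Node("region-a", State.OPEN),
        new Node("region-b", State.SPLIT),    // split parent: no longer enumerated
        new Node("region-c", State.OPENING));
    System.out.println(candidatesForBalancer(all).size());
  }
}
```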
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337591#comment-17337591 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 8:04 PM: --- This warning repeats frequently and involves the split parents that accumulate as bad state: {noformat} 2021-04-30 19:35:01,115 INFO [PEWorker-27] procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815, UNASSIGN 2021-04-30 19:35:01,116 INFO [PEWorker-27] assignment.RegionStateStore: pid=313 updating hbase:meta row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING, regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832 2021-04-30 19:35:01,116 INFO [PEWorker-27] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313, state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815, server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}] 2021-04-30 19:35:05,546 WARN [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291) at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194) at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.(SplitTableRegionProcedure.java:112) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122) at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019) at org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) {noformat} The "Skipping, no server for state=SPLIT" messages accumulate for the affected split parents. Not every split is affected. In the master logs I see balancer and split activity co-occurring most of the time. One may be interfering with the other. We seem to get into trouble only when balancer activity and split procedure log lines are interlaced in the master log. 
was (Author: apurtell): This warning repeats frequently and involves the split parents that accumulate as bad state: {noformat} 2021-04-30 19:35:01,115 INFO [PEWorker-27] procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815, UNASSIGN 2021-04-30 19:35:01,116 INFO [PEWorker-27] assignment.RegionStateStore: pid=313 updating hbase:meta row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING, regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832 2021-04-30 19:35:01,116 INFO [PEWorker-27] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313, state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815, server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}] 2021-04-30 19:35:05,546 WARN [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291) at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194) at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.(SplitTableRegionProcedure.java:112) at
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337591#comment-17337591 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 8:00 PM: --- This warning repeats frequently and involves the split parents that accumulate as bad state: {noformat} 2021-04-30 19:35:01,115 INFO [PEWorker-27] procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815, UNASSIGN 2021-04-30 19:35:01,116 INFO [PEWorker-27] assignment.RegionStateStore: pid=313 updating hbase:meta row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING, regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832 2021-04-30 19:35:01,116 INFO [PEWorker-27] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313, state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815, server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}] 2021-04-30 19:35:05,546 WARN [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291) at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194) at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.(SplitTableRegionProcedure.java:112) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122) at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019) at org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) {noformat} The "Skipping, no server for state=SPLIT" messages accumulate for the affected split parents. Not every split is affected. In the master logs I see balancer and split activity co-occurring most of the time. One may be interfering with the other. 
was (Author: apurtell): This warning happens at every split: {noformat} 2021-04-30 19:35:01,115 INFO [PEWorker-27] procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815, UNASSIGN 2021-04-30 19:35:01,116 INFO [PEWorker-27] assignment.RegionStateStore: pid=313 updating hbase:meta row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING, regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832 2021-04-30 19:35:01,116 INFO [PEWorker-27] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313, state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815, server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}] 2021-04-30 19:35:05,546 WARN [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291) at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194) at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.(SplitTableRegionProcedure.java:112) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122) at
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337586#comment-17337586 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 7:41 PM: --- I have a nonstandard balancer configuration. This might be relevant: hbase.balancer.period=6 hbase.master.balancer.stochastic.minCostNeedBalance=0.025 hbase.master.balancer.stochastic.regionCountCost=100 hbase.master.balancer.stochastic.localityCost=30 hbase.master.balancer.stochastic.moveCost=10 hbase.master.balancer.stochastic.readRequestCost=10 hbase.master.balancer.stochastic.writeRequestCost=15 I think there is a state desynchronization bug related to splitting. The weird log lines reported here are for the split parent after a split completes. Split procedure begins: {noformat} 2021-04-30 19:11:39,874 INFO [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658 2021-04-30 19:11:39,874 INFO [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658 2021-04-30 19:11:39,874 INFO [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658 {noformat} UNASSIGN parent region (22ab65a3e71c3ab7f94c1c47067ad54e), includes WARN for failed transition: {noformat} 2021-04-30 19:11:39,878 INFO [PEWorker-65] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=80, ppid=79, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN}] 2021-04-30 19:11:39,878 INFO [PEWorker-66] procedure.MasterProcedureScheduler: Took xlock for pid=80, ppid=79, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN 2021-04-30 19:11:39,879 INFO [PEWorker-66] assignment.RegionStateStore: pid=80 updating hbase:meta row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSING, regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 2021-04-30 19:11:40,032 INFO [PEWorker-66] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=81, ppid=80, state=RUNNABLE; CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e, server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}] 2021-04-30 19:11:40,303 WARN [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: 22ab65a3e71c3ab7f94c1c47067ad54e is not OPEN; state=CLOSING at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291) at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194) at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.(SplitTableRegionProcedure.java:112) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984) at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019) at org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) 2021-04-30 19:11:44,814 INFO [PEWorker-68] assignment.RegionStateStore: pid=80 updating hbase:meta
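For reference, balancer properties like the ones listed above are set in hbase-site.xml; a sketch, with values copied verbatim from the list above (the standard Hadoop-style property format is certain, the specific values are just the ones quoted in the comment):

{noformat}
<property>
  <name>hbase.balancer.period</name>
  <value>6</value>
</property>
<property>
  <name>hbase.master.balancer.stochastic.minCostNeedBalance</name>
  <value>0.025</value>
</property>
<property>
  <name>hbase.master.balancer.stochastic.regionCountCost</name>
  <value>100</value>
</property>
<!-- localityCost, moveCost, readRequestCost, writeRequestCost
     follow the same pattern with the values listed above -->
{noformat}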
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337586#comment-17337586 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 7:31 PM: --- This is what happens. I have a nonstandard balancer configuration. It is moving regions during a split. hbase.balancer.period=6 hbase.master.balancer.stochastic.minCostNeedBalance=0.025 hbase.master.balancer.stochastic.regionCountCost=100 hbase.master.balancer.stochastic.localityCost=30 hbase.master.balancer.stochastic.moveCost=10 hbase.master.balancer.stochastic.readRequestCost=10 hbase.master.balancer.stochastic.writeRequestCost=15 I think there is a state desynchronization bug in the master when the balancer decides to move regions around when a split is in progress. The weird log lines reported here are for the split parent after a split completes. Split procedure begins: {noformat} 2021-04-30 19:11:39,874 INFO [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658 2021-04-30 19:11:39,874 INFO [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658 2021-04-30 19:11:39,874 INFO [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658 {noformat} UNASSIGN parent region 
(22ab65a3e71c3ab7f94c1c47067ad54e), includes WARN for failed transition: {noformat} 2021-04-30 19:11:39,878 INFO [PEWorker-65] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=80, ppid=79, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN}] 2021-04-30 19:11:39,878 INFO [PEWorker-66] procedure.MasterProcedureScheduler: Took xlock for pid=80, ppid=79, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN 2021-04-30 19:11:39,879 INFO [PEWorker-66] assignment.RegionStateStore: pid=80 updating hbase:meta row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSING, regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 2021-04-30 19:11:40,032 INFO [PEWorker-66] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=81, ppid=80, state=RUNNABLE; CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e, server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}] 2021-04-30 19:11:40,303 WARN [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: 22ab65a3e71c3ab7f94c1c47067ad54e is not OPEN; state=CLOSING at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291) at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194) at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.(SplitTableRegionProcedure.java:112) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122) at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984) at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019) at org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) 2021-04-30
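The desynchronization hypothesis above can be illustrated with a toy check-then-act sketch (plain Python with hypothetical names, not HBase code): the split path validates that the region is OPEN, but if the balancer's UNASSIGN has already moved it to CLOSING by the time the regionserver's split report is processed, the guard fails, mirroring the DoNotRetryRegionException in the stack trace.

```python
from enum import Enum

class State(Enum):
    OPEN = "OPEN"
    CLOSING = "CLOSING"

class RegionStateError(Exception):
    pass

class Region:
    """Toy stand-in for a master-side region state node."""
    def __init__(self):
        self.state = State.OPEN

def check_online(region):
    # Analogous to the checkOnline() guard in the stack trace:
    # a split may only be started on an OPEN region.
    if region.state is not State.OPEN:
        raise RegionStateError("not OPEN; state=" + region.state.value)

def report_split(region):
    # The regionserver reports a split; the master validates first.
    check_online(region)
    return "split accepted"

region = Region()
# Interleaving: the balancer's UNASSIGN wins the race, so the region
# is already CLOSING when the split report arrives at the master.
region.state = State.CLOSING
try:
    outcome = report_split(region)
except RegionStateError as e:
    outcome = str(e)
print(outcome)  # the split report is rejected
```

The point of the sketch is only the interleaving: neither actor misbehaves in isolation, but the state observed by the split's precondition check is no longer the state it acts on.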
[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus
[ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337022#comment-17337022 ] Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 2:15 AM:
---
Also, I didn't see an HBCK2 option that does anything for this case. {{./bin/hbase hbck2 -j /path/to/jar fixMeta}} printed a message about kicking the chore. The chore didn't handle it. {{setRegionState}} doesn't seem right. Maybe a {{removeRegionState}}, if we had the option.

was (Author: apurtell):
Also, I didn't see an HBCK2 option that does anything for this case.

> SPLIT state detritus
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.3
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
> Seen after an integration test load with 'calm' monkey, so this happened in the happy path.
> There were no errors accessing all loaded table data; however, whenever the balancer runs there are a number of concerning INFO-level log messages printed of the form _assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=TABLENAME_
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with server=null.
> {noformat}
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 orphan regions
> 2021-04-30 02:02:09,306 INFO [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0 orphan regions
> {noformat}
> This repeats indefinitely.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)