[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-04 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 5:08 PM:
--

Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice that in AssignmentManager#markRegionAsMerged we remove the merge 
parents from {{regionStates}} right there, but in 
AssignmentManager#markRegionAsSplit we do not. We have code in various places 
that accounts for a post-split parent lingering in {{regionStates}} in SPLIT 
state. CatalogJanitor is supposed to clean it up.
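To make the asymmetry concrete, here is a minimal standalone sketch (not the 
actual HBase code; the map, enum, and method bodies are simplified 
assumptions based on the description above): on merge the parent entries are 
deleted from the in-memory map outright, while on split the parent is only 
marked SPLIT and left behind.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the AssignmentManager bookkeeping described above.
// "regionStates" here is just a map of encoded region name -> state; the
// real HBase classes are far richer. This only illustrates the asymmetry.
public class SplitMergeSketch {
    enum State { OPEN, SPLIT, MERGED }

    static Map<String, State> regionStates = new HashMap<>();

    // Mirrors markRegionAsMerged as described: merge parents are removed
    // from the in-memory state right there.
    static void markRegionAsMerged(String child, String parentA, String parentB) {
        regionStates.remove(parentA);
        regionStates.remove(parentB);
        regionStates.put(child, State.OPEN);
    }

    // Mirrors markRegionAsSplit as described: the parent is only marked
    // SPLIT and stays in the map until something else cleans it up.
    static void markRegionAsSplit(String parent, String daughterA, String daughterB) {
        regionStates.put(parent, State.SPLIT);   // detritus left behind
        regionStates.put(daughterA, State.OPEN);
        regionStates.put(daughterB, State.OPEN);
    }

    public static void main(String[] args) {
        regionStates.put("p1", State.OPEN);
        regionStates.put("p2", State.OPEN);
        regionStates.put("p3", State.OPEN);

        markRegionAsMerged("m1", "p1", "p2");
        markRegionAsSplit("p3", "d1", "d2");

        // Merge parents are gone; the split parent lingers in SPLIT state.
        System.out.println(regionStates.containsKey("p1"));   // false
        System.out.println(regionStates.get("p3"));           // SPLIT
    }
}
```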

[Edit: Removed some distraction.]

Remains to be seen if CatalogJanitor will clean the regions. Will update.


was (Author: apurtell):
Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice that in AssignmentManager#markRegionAsMerged we remove the merge 
parents from {{regionStates}} right there, but in 
AssignmentManager#markRegionAsSplit we do not. We have code in various places 
that accounts for a post-split parent lingering in {{regionStates}} in SPLIT 
state. CatalogJanitor is supposed to clean it up.

[Edit: Removed some distraction.]

Remains to be seen if CatalogJanitor will clean the regions. Probably. What is 
happening during the latest tests is that, because isRIT usually evaluates to 
true (it's an ingest test, after all) and HBASE-25840 is applied, all of the 
cleanup is deferred. Will update and see if region GC proceeds as expected.
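A toy model of the deferral behavior described above (hedged: this assumes, 
per the comment, that the HBASE-25840 change has the effect of skipping 
cleanup passes while regions are in transition; the method and field names 
are illustrative, not HBase's):

```java
// Toy model: a janitor pass is gated on "no regions in transition" (RIT).
// Under constant ingest, RIT is usually true, so split parents accumulate
// until a pass finally runs on a quiescent cluster.
public class JanitorGateSketch {
    static int cleanedCount = 0;

    static void janitorPass(boolean anyRegionInTransition, int splitParentsPending) {
        if (anyRegionInTransition) {
            return; // cleanup deferred; split parents keep accumulating
        }
        cleanedCount += splitParentsPending;
    }

    public static void main(String[] args) {
        // During an ingest test, isRIT is usually true: nothing gets cleaned.
        for (int i = 0; i < 10; i++) {
            janitorPass(true, 5);
        }
        System.out.println(cleanedCount);   // 0
        // Once the cluster quiesces, a pass can finally GC the parents.
        janitorPass(false, 51);
        System.out.println(cleanedCount);   // 51
    }
}
```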

> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO 
> level log messages printed of the form _assignment.RegionStates: Skipping, no 
> server for state=SPLIT, location=null, table=TABLENAME_ 
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-04 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 5:06 PM:
--

Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice that in AssignmentManager#markRegionAsMerged we remove the merge 
parents from {{regionStates}} right there, but in 
AssignmentManager#markRegionAsSplit we do not. We have code in various places 
that accounts for a post-split parent lingering in {{regionStates}} in SPLIT 
state. CatalogJanitor is supposed to clean it up.

[Edit: Removed some distraction.]

Remains to be seen if CatalogJanitor will clean the regions. Probably. What is 
happening during the latest tests is that, because isRIT usually evaluates to 
true (it's an ingest test, after all) and HBASE-25840 is applied, all of the 
cleanup is deferred. Will update and see if region GC proceeds as expected.


was (Author: apurtell):
Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice that in AssignmentManager#markRegionAsMerged we remove the merge 
parents from {{regionStates}} right there, but in 
AssignmentManager#markRegionAsSplit we do not. We have code in various places 
that accounts for a post-split parent lingering in {{regionStates}} in SPLIT 
state. CatalogJanitor is supposed to clean it up.

If I patch AssignmentManager#markRegionAsSplit to remove the parent from 
{{regionStates}} for splits, the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:
{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}
No more junk in {{regionStates}}, but those 9 split parents are now found as 
orphan regions. There is a simple change to HbckChore that should accompany 
the other changes I have under test: the conditional for determining whether a 
region is an orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}}, so the non-zero orphan 
count above can be ignored. With the complete change it would be reported as 0.
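The proposed orphan check can be sketched in isolation as follows (hedged: a 
standalone model, not the actual HbckChore; only the predicate itself comes 
from the comment above, and the example region names are made up):

```java
import java.util.Set;

// Standalone model of the proposed HbckChore orphan check. A region found on
// the filesystem is an orphan only if it is unknown to the master (hri is
// null) AND it is not a known split or merge parent awaiting cleanup.
public class OrphanCheckSketch {
    static boolean isOrphan(Object hri, String encodedRegionName,
                            Set<String> splitParentRegions,
                            Set<String> mergedParentRegions) {
        return hri == null
            && !splitParentRegions.contains(encodedRegionName)
            && !mergedParentRegions.contains(encodedRegionName);
    }

    public static void main(String[] args) {
        Set<String> splitParents = Set.of("087fb2f7847c2fc0a0b85eb30a97036e");
        Set<String> mergeParents = Set.of();

        // A split parent with no in-memory state is no longer counted.
        System.out.println(isOrphan(null, "087fb2f7847c2fc0a0b85eb30a97036e",
                                    splitParents, mergeParents));   // false
        // A region truly unknown to the master still is.
        System.out.println(isOrphan(null, "deadbeefdeadbeefdeadbeefdeadbeef",
                                    splitParents, mergeParents));   // true
    }
}
```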

Remains to be seen if CatalogJanitor will clean the regions. Probably. What is 
happening during the latest tests is that, because isRIT usually evaluates to 
true (it's an ingest test, after all) and HBASE-25840 is applied, all of the 
cleanup is deferred. Will update and see if region GC proceeds as expected.

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-03 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 1:08 AM:
--

Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean it.

If I patch AssignmentManager#markRegionAsSplit to remove the parent from 
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:
{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}
No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0.

Remains to be seen if CatalogJanitor will clean the regions. Probably. What is 
happening during latest tests is because isRIT evaluates to true usually (it's 
an ingest test after all) and HBASE-25840 is applied all of the cleanup is 
deferred. Will update and see if region GC proceeds as expected.


was (Author: apurtell):
Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean it.

If I patch AssignmentManager#markRegionAsSplit to remove the parent from 
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:
{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}
No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0.

Remains to be seen if CatalogJanitor will clean the regions. Probably. What is 
happening during latest tests is because isRIT evaluates to true usually (it's 
an 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-03 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 12:57 AM:
---

Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean it.

If I patch AssignmentManager#markRegionAsSplit to remove the parent from 
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:
{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}
No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0.

Remains to be seen if CatalogJanitor will clean the regions. Probably. What is 
happening during latest tests is because isRIT evaluates to true usually (it's 
an ingest test after all) and HBASE-25840 is applied all of the cleanup is 
deferred. Will wait until the end to see if region GC proceeds as expected.


was (Author: apurtell):
Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from 
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:
{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}
No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0.

Anyway, there is more to debug. Seems CatalogJanitor or the region GC 
procedures it submits are not completing their work. I have added 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-03 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:56 PM:
---

Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from 
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:
{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}
No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0.

Anyway, there is more to debug. Seems CatalogJanitor or the region GC 
procedures it submits are not completing their work. I have added some debug 
logging to CatalogJanitor to investigate further.


was (Author: apurtell):
Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0. 

Anyway, there is more to debug. CatalogJanitor or the region GC procedures it 
submits may not be completing their work. I have added some debug logging to 
CatalogJanitor to investigate further. 

> SPLIT state detritus
> 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-03 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:55 PM:
---

Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0. 

Anyway, there is more to debug. CatalogJanitor or the region GC procedures it 
submits may not be completing their work. I have added some debug logging to 
CatalogJanitor to investigate further. 


was (Author: apurtell):
Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{ if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName)) }} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0. 

Anyway, there is more to debug. CatalogJanitor or the region GC procedures it 
submits may not be completing their work. I have added some debug logging to 
CatalogJanitor to investigate further. 

> SPLIT state detritus

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-03 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:54 PM:
---

Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for spits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{ if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName)) }} so you can ignore the 
non-zero orphan count. With the complete change it would be reported as 0. 

Anyway, there is more to debug. CatalogJanitor or the region GC procedures it 
submits may not be completing their work. I have added some debug logging to 
CatalogJanitor to investigate further. 
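
The bookkeeping change under test can be sketched with a plain map standing in for {{regionStates}}. This is a simplified model, not the actual HBase types: markRegionAsMerged already drops the merge parents immediately, and the patch makes markRegionAsSplit do the same for the SPLIT parent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the AssignmentManager bookkeeping discussed
// above; a plain map models the regionStates structure.
class RegionStatesSketch {
  enum State { OPEN, SPLIT }

  // encodedRegionName -> state
  final Map<String, State> regionStates = new ConcurrentHashMap<>();

  void open(String region) {
    regionStates.put(region, State.OPEN);
  }

  // markRegionAsMerged removes the merge parents from regionStates right away
  void markRegionAsMerged(String merged, String parentA, String parentB) {
    regionStates.remove(parentA);
    regionStates.remove(parentB);
    regionStates.put(merged, State.OPEN);
  }

  // With the patch, markRegionAsSplit removes the parent the same way,
  // instead of leaving a SPLIT-state entry behind for CatalogJanitor
  void markRegionAsSplit(String parent, String daughterA, String daughterB) {
    regionStates.remove(parent);
    regionStates.put(daughterA, State.OPEN);
    regionStates.put(daughterB, State.OPEN);
  }
}
```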


was (Author: apurtell):
Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. 

I have added some debug logging to CatalogJanitor to investigate further. 

> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-03 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:50 PM:
---

Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. 

I have added some debug logging to CatalogJanitor to investigate further. 


was (Author: apurtell):
Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. 

I have added some debug logging to CatalogJanitor to investigate further. 

> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-05-03 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:47 PM:
---

Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:

{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. 

I have added some debug logging to CatalogJanitor to investigate further. 


was (Author: apurtell):
Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that account for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state.  CatalogJanitor is 
supposed to clean it, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from  
{{regionStates}}, things begin to look better:

{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan 
regions
{noformat}

No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. 

I have added some debug logging to CatalogJanitor to investigate further. 

> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-04-30 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337690#comment-17337690
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 11:41 PM:


This addresses the issue of multiple split request transaction submissions.

{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 107330d90b..48cc26086f 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1119,7 +1119,13 @@ public class AssignmentManager {
       LOG.debug("Split request from " + serverName +
         ", parent=" + parent + " splitKey=" + Bytes.toStringBinary(splitKey));
     }
-    master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    if (regionStates.getRegionState(parent).isOpened() &&
+      !regionStates.getRegionState(parent).isSplitting()) {
+      master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    } else {
+      LOG.warn("Ignoring split request from " + serverName +
+        ", parent=" + parent + " because parent is already splitting or not online");
+    }
 
     // If the RS is < 2.0 throw an exception to abort the operation, we are handling the split
     if (master.getServerManager().getVersionNumber(serverName) < 0x020) {
{code}

Is this the complete fix, though? 

With this patch in place, now we just have in the master log:

{noformat}
2021-04-30 23:22:14,971 WARN  
[RpcServer.priority.RWQ.Codel.write.handler=1,queue=0,port=8100]
assignment.AssignmentManager: Ignoring split request from 
ip-172-31-63-83.us-west-2.compute.internal,8120,1619824775800, parent={ENCODED 
=> df7aa0e0af5a2b757ad86f2cf051fcbb, NAME => 
'IntegrationTestLoadCommonCrawl,,1619824793285.df7aa0e0af5a2b757ad86f2cf051fcbb.',
 STARTKEY => '', ENDKEY => ''} because parent is already splitting or not online
{noformat}

but the first submission of the split request is already in progress and 
completes just fine. 
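
The effect of the guard in the patch above can be modeled minimally: only the first split request for an OPEN parent submits a procedure, and repeats arriving while the parent is SPLITTING (or otherwise not online) are logged and dropped. Names and types here are simplified stand-ins, not the real AssignmentManager API.

```java
// Toy model of the duplicate-split-request guard discussed above.
class SplitGuardSketch {
  enum State { OPEN, SPLITTING, CLOSING }

  State parentState = State.OPEN;
  int submitted = 0;

  // Returns true if a split procedure was actually submitted.
  boolean onSplitRequest() {
    if (parentState == State.OPEN) {
      submitted++;                    // submitProcedure(createSplitProcedure(...))
      parentState = State.SPLITTING;  // the procedure moves the parent to SPLITTING
      return true;
    }
    return false;                     // "Ignoring split request ..." warning path
  }
}
```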


was (Author: apurtell):
This addresses the issue of multiple split request transaction submissions.

{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 107330d90b..48cc26086f 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1119,7 +1119,13 @@ public class AssignmentManager {
       LOG.debug("Split request from " + serverName +
         ", parent=" + parent + " splitKey=" + Bytes.toStringBinary(splitKey));
     }
-    master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    if (regionStates.getRegionState(parent).isOpened() &&
+      !regionStates.getRegionState(parent).isSplitting()) {
+      master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    } else {
+      LOG.warn("Ignoring split request from " + serverName +
+        ", parent=" + parent + " because parent is already splitting or not online");
+    }
 
     // If the RS is < 2.0 throw an exception to abort the operation, we are handling the split
     if (master.getServerManager().getVersionNumber(serverName) < 0x020) {
{code}

Is this the complete fix, though? Should the RS be submitting this report more 
than once? 

With this patch in place, now we just have in the master log:

{noformat}
2021-04-30 23:22:14,971 WARN  
[RpcServer.priority.RWQ.Codel.write.handler=1,queue=0,port=8100]
assignment.AssignmentManager: Ignoring split request from 
ip-172-31-63-83.us-west-2.compute.internal,8120,1619824775800, parent={ENCODED 
=> df7aa0e0af5a2b757ad86f2cf051fcbb, NAME => 
'IntegrationTestLoadCommonCrawl,,1619824793285.df7aa0e0af5a2b757ad86f2cf051fcbb.',
 STARTKEY => '', ENDKEY => ''} because parent is already splitting or not online
{noformat}

but the first submission of the split request is already in progress and 
completes just fine. 

> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-04-30 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337654#comment-17337654
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 10:55 PM:


{quote}In RegionStates#getAssignmentsForBalancer, we have a check for 
isTableDisabled or isSplitParent. Both exclude some subset of regions that 
should not be candidates for consideration, but why not get to the point?
{quote}
The change is good on its own merits, but I don't believe the balancer is doing 
something it should not, so won't address the root issue. I will follow up on 
getAssignmentsForBalancer change in a subtask.

Back to the split state handling. Perhaps we are trying to create a 
SplitTransitionProcedure for the split more than once.


was (Author: apurtell):
{quote}In RegionStates#getAssignmentsForBalancer, we have a check for 
isTableDisabled or isSplitParent. Both exclude some subset of regions that 
should not be candidates for consideration, but why not get to the point?
{quote}
This helps, but it treats the symptom. I will follow up on this in a subtask.

Back to the split state handling. Perhaps we are trying to create a 
SplitTransitionProcedure for the split more than once.

> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO 
> level log messages printed of the form _assignment.RegionStates: Skipping, no 
> server for state=SPLIT, location=null, table=TABLENAME_ 
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with 
> server=null. 
> {noformat}
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 
> orphan regions
> 2021-04-30 02:02:09,306 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-04-30 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337654#comment-17337654
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 10:50 PM:


{quote}In RegionStates#getAssignmentsForBalancer, we have a check for 
isTableDisabled or isSplitParent. Both exclude some subset of regions that 
should not be candidates for consideration, but why not get to the point?
{quote}
This helps, but it treats the symptom. I will follow up on this in a subtask.

Back to the split state handling. Perhaps we are trying to create a 
SplitTransitionProcedure for the split more than once.


was (Author: apurtell):
bq. In RegionStates#getAssignmentsForBalancer, we have a check for 
isTableDisabled or isSplitParent. Both exclude some subset of regions that 
should not be candidates for consideration, but why not get to the point? 

This helps, but it treats the symptom. I will follow up on this in a subtask. 

Back to the split state handling. Why is this printed three times?
{noformat}
2021-04-30 22:06:15,686 INFO  [PEWorker-6] 
procedure.MasterProcedureScheduler: Took xlock for pid=93, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=2d64bc9357b5f780307611f43052e092, 
daughterA=817c9985e6a2ee7645d3a292758ec729, 
daughterB=8ea34c3e5bcfb4c7d81e4212d700b824
2021-04-30 22:06:15,686 INFO  [PEWorker-6] 
procedure.MasterProcedureScheduler: Took xlock for pid=93, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=2d64bc9357b5f780307611f43052e092, 
daughterA=817c9985e6a2ee7645d3a292758ec729, 
daughterB=8ea34c3e5bcfb4c7d81e4212d700b824
2021-04-30 22:06:15,686 INFO  [PEWorker-6] 
procedure.MasterProcedureScheduler: Took xlock for pid=93, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=2d64bc9357b5f780307611f43052e092, 
daughterA=817c9985e6a2ee7645d3a292758ec729, 
daughterB=8ea34c3e5bcfb4c7d81e4212d700b824
{noformat}

because what follows is 

1. A successful split.

2. A failed SplitTableRegionProcedure with assignment.AssignmentManager: Failed 
transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: 
2d64bc9357b5f780307611f43052e092 is not OPEN; state=CLOSING

3. Another failed  SplitTableRegionProcedure with assignment.AssignmentManager: 
Failed transition org.apache.hadoop.hbase.client.DoNotRetryRegionException: 
2d64bc9357b5f780307611f43052e092 is not OPEN; state=CLOSING

So we ran the split three times? And only one succeeded. 



> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO 
> level log messages printed of the form _assignment.RegionStates: Skipping, no 
> server for state=SPLIT, location=null, table=TABLENAME_ 
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-04-30 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337628#comment-17337628
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 9:25 PM:
---

I'm going to try out a patch, see if it helps. In theory this both excludes 
regions in SPLITTING or SPLITTING_NEW from balancer attempts to mutate 
assignments and skips a bunch of other uninteresting (as in not actively 
assigned) states, reducing balancer workload and also avoiding unintended 
consequences.

In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled 
or isSplitParent. Both exclude some subset of regions that should not be 
candidates for consideration, but why not get to the point? 

{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
index d3553f11a3..64abd4e3ac 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
@@ -553,23 +552,25 @@ public class RegionStates {
    * wants to iterate this exported list.  We need to synchronize on regions
    * since all access to this.servers is under a lock on this.regions.
    *
-   * @return A clone of current assignments.
+   * @return A clone of current open or opening assignments.
    */
   public Map<TableName, Map<ServerName, List<RegionInfo>>> getAssignmentsForBalancer(
       TableStateManager tableStateManager, List<ServerName> onlineServers) {
     final Map<TableName, Map<ServerName, List<RegionInfo>>> result = new HashMap<>();
     for (RegionStateNode node : regionsMap.values()) {
-      if (isTableDisabled(tableStateManager, node.getTable())) {
-        continue;
-      }
-      if (node.getRegionInfo().isSplitParent()) {
+      // When balancing, we are only interested in OPEN or OPENING regions and expected
+      // to be online at that server until possibly the next balancer iteration or unless
+      // we decide to move it. Other states are not interesting as the region will either
+      // be closing, or splitting/merging, or will not be deployed.
+      if (!(node.isInState(State.OPEN) || node.isInState(State.OPENING))) {
         continue;
       }
       Map<ServerName, List<RegionInfo>> tableResult =
         result.computeIfAbsent(node.getTable(), t -> new HashMap<>());
       final ServerName serverName = node.getRegionLocation();
+      // A region in ONLINE or OPENING state should have a location.
       if (serverName == null) {
-        LOG.info("Skipping, no server for " + node);
+        LOG.warn("Skipping, no server for " + node);
         continue;
       }
       List<RegionInfo> serverResult =
{code}
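
The filtering policy in the patch above can be sketched in isolation: rather than enumerating states to exclude (disabled tables, split parents, ...), keep only regions that are OPEN or OPENING and have a location. The {{Node}} record below is a toy model of RegionStateNode, not the real class; the actual method also groups results by table and server.

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy model of the balancer input filter discussed above.
class BalancerFilterSketch {
  enum State { OPEN, OPENING, CLOSING, SPLIT, SPLITTING }

  record Node(String region, State state, String server) {}

  static List<Node> assignmentsForBalancer(List<Node> nodes) {
    return nodes.stream()
        // Only OPEN or OPENING regions are balancer candidates.
        .filter(n -> n.state() == State.OPEN || n.state() == State.OPENING)
        // An OPEN/OPENING region should have a location; warn and skip otherwise.
        .filter(n -> n.server() != null)
        .collect(Collectors.toList());
  }
}
```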


was (Author: apurtell):
I'm going to try out a patch, see if it helps. 

In RegionStates#getAssignmentsForBalancer, we have a check for isTableDisabled 
or isSplitParent. Both exclude some subset of regions that should not be 
candidates for consideration, but why not get to the point? 

{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
index d3553f11a3..64abd4e3ac 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
@@ -553,23 +552,25 @@ public class RegionStates {
    * wants to iterate this exported list.  We need to synchronize on regions
    * since all access to this.servers is under a lock on this.regions.
    *
-   * @return A clone of current assignments.
+   * @return A clone of current open or opening assignments.
    */
   public Map<TableName, Map<ServerName, List<RegionInfo>>> getAssignmentsForBalancer(
       TableStateManager tableStateManager, List<ServerName> onlineServers) {
     final Map<TableName, Map<ServerName, List<RegionInfo>>> result = new HashMap<>();
     for (RegionStateNode node : regionsMap.values()) {
-      if (isTableDisabled(tableStateManager, node.getTable())) {
-        continue;
-      }
-      if (node.getRegionInfo().isSplitParent()) {
+      // When balancing, we are only interested in OPEN or OPENING regions and expected
+      // to be online at that server until possibly the next balancer iteration or unless
+      // we decide to move it. Other states are not interesting as the region will either
+      // be closing, or splitting/merging, or will not be deployed.
+      if (!(node.isInState(State.OPEN) || node.isInState(State.OPENING))) {
         continue;
       }
       Map<ServerName, List<RegionInfo>> tableResult =
         result.computeIfAbsent(node.getTable(), t -> new HashMap<>());
       final ServerName serverName = node.getRegionLocation();
+      // A region in ONLINE or OPENING state should have a location.
       if (serverName == null) {
-        LOG.info("Skipping, no server for " + node);
+        LOG.warn("Skipping, 

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-04-30 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337591#comment-17337591
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 8:04 PM:
---

This warning repeats frequently and involves the split parents that accumulate 
as bad state:

{noformat}
2021-04-30 19:35:01,115 INFO  [PEWorker-27] procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815, UNASSIGN
2021-04-30 19:35:01,116 INFO  [PEWorker-27] assignment.RegionStateStore: pid=313 updating hbase:meta row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING, regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832
2021-04-30 19:35:01,116 INFO  [PEWorker-27] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313, state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815, server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}]
2021-04-30 19:35:05,546 WARN  [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException: c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING
    at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
    at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
    at org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{noformat}

The "Skipping, no server for state=SPLIT" messages accumulate for the affected 
split parents. 

Not every split is affected.

In the master logs I see balancer and split activity co-occurring most of the 
time, so one may be interfering with the other. We seem to get into trouble only 
when balancer activity and split procedure log lines are interleaved in the 
master log.




[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-04-30 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337586#comment-17337586
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 7:41 PM:
---

I have a nonstandard balancer configuration. This might be relevant:

hbase.balancer.period=6
hbase.master.balancer.stochastic.minCostNeedBalance=0.025
hbase.master.balancer.stochastic.regionCountCost=100
hbase.master.balancer.stochastic.localityCost=30
hbase.master.balancer.stochastic.moveCost=10
hbase.master.balancer.stochastic.readRequestCost=10
hbase.master.balancer.stochastic.writeRequestCost=15
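
For context on how these knobs interact (this is my understanding of the stochastic balancer's gate, so treat the formula as an assumption rather than the exact HBase code): each cost function produces a value in [0,1], the values are blended using the configured multipliers as weights, and the balancer only bothers to generate a plan when the blended cost exceeds minCostNeedBalance.

```java
// Sketch of the weighted-cost gating as I understand the stochastic balancer.
// The normalization and gating shown here are an assumption, not HBase source.
public class CostGateSketch {
  // costs[i] in [0,1]; weights[i] are the *Cost multipliers from the config.
  static double blendedCost(double[] costs, double[] weights) {
    double num = 0, den = 0;
    for (int i = 0; i < costs.length; i++) {
      num += costs[i] * weights[i];
      den += weights[i];
    }
    return den == 0 ? 0 : num / den;
  }

  // The balancer runs only when the blended cost exceeds minCostNeedBalance,
  // e.g. the 0.025 configured above.
  static boolean needsBalance(double[] costs, double[] weights, double minCostNeedBalance) {
    return blendedCost(costs, weights) > minCostNeedBalance;
  }
}
```

With a low minCostNeedBalance like 0.025 and a heavy regionCountCost multiplier, even a small region-count skew trips the gate, which is consistent with the balancer running frequently alongside splits here.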

I think there is a state desynchronization bug related to splitting. The weird 
log lines reported here are for the split parent after a split completes. 

Split procedure begins:
{noformat}
2021-04-30 19:11:39,874 INFO  [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO  [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO  [PEWorker-65] procedure.MasterProcedureScheduler: Took xlock for pid=79, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, daughterA=2b607331c1f37f7efa21ae6c2281c89e, daughterB=4c53f22c3fe52af4a882a5f7936aa658
{noformat}

UNASSIGN parent region (22ab65a3e71c3ab7f94c1c47067ad54e), includes WARN for 
failed transition:
{noformat}
2021-04-30 19:11:39,878 INFO  [PEWorker-65] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=80, ppid=79, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN}]
2021-04-30 19:11:39,878 INFO  [PEWorker-66] procedure.MasterProcedureScheduler: Took xlock for pid=80, ppid=79, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN
2021-04-30 19:11:39,879 INFO  [PEWorker-66] assignment.RegionStateStore: pid=80 updating hbase:meta row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSING, regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:40,032 INFO  [PEWorker-66] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=81, ppid=80, state=RUNNABLE; CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e, server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:40,303 WARN  [RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100] assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException: 22ab65a3e71c3ab7f94c1c47067ad54e is not OPEN; state=CLOSING
    at org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
    at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
    at org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
2021-04-30 19:11:44,814 INFO  [PEWorker-68] assignment.RegionStateStore: pid=80 updating hbase:meta 


[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

2021-04-29 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337022#comment-17337022
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 2:15 AM:
---

Also, I didn't see an HBCK2 option that does anything for this case. 

{{./bin/hbase hbck2 -j /path/to/jar fixMeta}} printed a message about kicking 
the chore. The chore didn't handle it. 

{{setRegionState}} doesn't seem right. Maybe a {{removeRegionState}}, if we had 
that option. 
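
For illustration only: a hypothetical removeRegionState-style pass (no such HBCK2 command exists here, and the record/state types below are made up for the sketch) would essentially select SPLIT records with no location and drop them from the in-memory state:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch: illustrates what a removeRegionState-style cleanup
// would select. HBCK2 has no such command; RegionRecord/State are made up.
public class SplitDetritusSketch {
  enum State { OPEN, OPENING, CLOSING, SPLIT }

  static class RegionRecord {
    final String encodedName;
    final State state;
    final String server; // null = no location
    RegionRecord(String encodedName, State state, String server) {
      this.encodedName = encodedName;
      this.state = state;
      this.server = server;
    }
  }

  // Select SPLIT-state records with server == null, i.e. the detritus this
  // issue describes: split parents that CatalogJanitor should have cleaned.
  static List<String> detritus(Collection<RegionRecord> records) {
    List<String> out = new ArrayList<>();
    for (RegionRecord r : records) {
      if (r.state == State.SPLIT && r.server == null) {
        out.add(r.encodedName);
      }
    }
    return out;
  }
}
```

The same predicate (state=SPLIT, location=null) is exactly what the "Skipping, no server for state=SPLIT" log lines report, so a cleanup keyed on it would target only the stuck parents.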




> SPLIT state detritus
> 
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.3
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test load with 'calm' monkey, so this happened in 
> the happy path.
> There were no errors accessing all loaded table data, however whenever the 
> balancer runs there are a number of concerning INFO level log messages 
> printed of the form _assignment.RegionStates: Skipping, no server for 
> state=SPLIT, location=null, table=TABLENAME_ 
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with 
> server=null. 
> {noformat}
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 
> orphan regions
> 2021-04-30 02:02:09,306 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0 
> orphan regions
> {noformat}
> This repeats indefinitely. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)