[jira] [Created] (HBASE-27414) Search order for locations in HFileLink

2022-10-05 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27414:


 Summary: Search order for locations in HFileLink
 Key: HBASE-27414
 URL: https://issues.apache.org/jira/browse/HBASE-27414
 Project: HBase
  Issue Type: Improvement
  Components: Performance
Reporter: Huaxiang Sun


Found that the search order for locations follows the order in which these 
locations are added to the HFileLink object. 

 

setLocations(originPath, tempPath, mobPath, archivePath);

archivePath is the last location searched. In most cases, the hfile exists in 
archivePath, so we can move archivePath to the first parameter to avoid 
unnecessary NameNode queries.
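The saving can be sketched as an ordered probe in which every miss costs one NameNode existence check, so the most likely location should be probed first. This is an illustrative model, not the actual HFileLink code; the names are hypothetical:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

class LocationProbe {
    // Probe candidate locations in order; every miss before the hit costs one
    // NameNode existence check, so the most likely location should come first.
    static Optional<String> firstExisting(List<String> locations, Predicate<String> exists) {
        for (String loc : locations) {
            if (exists.test(loc)) {
                return Optional.of(loc);
            }
        }
        return Optional.empty();
    }
}
```

With the common case in archivePath, ordering it first means the probe usually succeeds on the first check instead of the fourth.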



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-26421) Use HFileLink file to replace entire file‘s reference when splitting

2022-10-05 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17613084#comment-17613084
 ] 

Huaxiang Sun edited comment on HBASE-26421 at 10/5/22 6:53 PM:
---

Forwarding comments from the review board here for better background on this jira.
h3. *[Apache9|https://github.com/Apache9]* commented [on Nov 8, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-963696081]
|What is the advantage here of using HFileLink?|

 *[sunhelly|https://github.com/sunhelly]* commented [on Nov 8, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-963870288]
|Thanks, [@Apache9|https://github.com/Apache9]. The main advantage here is 
that the first compaction after split can be a minor compaction instead of the 
old major compaction, because only reference files need to be compacted before 
the next split. In HBASE-26422, I described the compaction after splitting 
using HFileLink.
It also makes moving hfiles between regions of the same table possible. The 
compaction is then lightweight, since it need not read and rewrite the referred 
files.|

 *[Apache9|https://github.com/Apache9]* commented [on Nov 9, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-964161491]
|OK, and we do not need to use HalfStoreFileReader to read the HFile, right? 
Sounds like a good general improvement. I think this could be done on master and 
other active branches, not only on a feature branch.
Will take a look at the PR soon.|


was (Author: huaxiangsun):
Forwarding comments from the review board here for better background on this jira.

 
h3. *[Apache9|https://github.com/Apache9]* commented [on Nov 8, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-963696081]
|What is the advantage here of using HFileLink?|
 
h3. *[sunhelly|https://github.com/sunhelly]* commented [on Nov 8, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-963870288]
|Thanks, [@Apache9|https://github.com/Apache9]. The main advantage here is 
that the first compaction after split can be a minor compaction instead of the 
old major compaction, because only reference files need to be compacted before 
the next split. In HBASE-26422, I described the compaction after splitting 
using HFileLink.
It also makes moving hfiles between regions of the same table possible. The 
compaction is then lightweight, since it need not read and rewrite the referred 
files.|
 
h3. *[Apache9|https://github.com/Apache9]* commented [on Nov 9, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-964161491]
|OK, and we do not need to use HalfStoreFileReader to read the HFile, right? 
Sounds like a good general improvement. I think this could be done on master and 
other active branches, not only on a feature branch.
Will take a look at the PR soon.|

> Use HFileLink file to replace entire file‘s reference when splitting
> 
>
> Key: HBASE-26421
> URL: https://issues.apache.org/jira/browse/HBASE-26421
> Project: HBase
>  Issue Type: Improvement
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> When splitting store files, if a file should be owned by only one child 
> region, then there will be an entire file's reference in the child region. We 
> can use HFileLink files, just like those in snapshot tables, to replace the 
> reference files that refer to entire files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-26421) Use HFileLink file to replace entire file‘s reference when splitting

2022-10-05 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17613084#comment-17613084
 ] 

Huaxiang Sun commented on HBASE-26421:
--

Forwarding comments from the review board here for better background on this jira.

 
h3. *[Apache9|https://github.com/Apache9]* commented [on Nov 8, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-963696081]
|What is the advantage here of using HFileLink?|
 
h3. *[sunhelly|https://github.com/sunhelly]* commented [on Nov 8, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-963870288]
|Thanks, [@Apache9|https://github.com/Apache9]. The main advantage here is 
that the first compaction after split can be a minor compaction instead of the 
old major compaction, because only reference files need to be compacted before 
the next split. In HBASE-26422, I described the compaction after splitting 
using HFileLink.
It also makes moving hfiles between regions of the same table possible. The 
compaction is then lightweight, since it need not read and rewrite the referred 
files.|
 
h3. *[Apache9|https://github.com/Apache9]* commented [on Nov 9, 
2021|https://github.com/apache/hbase/pull/3825#issuecomment-964161491]
|OK, and we do not need to use HalfStoreFileReader to read the HFile, right? 
Sounds like a good general improvement. I think this could be done on master and 
other active branches, not only on a feature branch.
Will take a look at the PR soon.|

> Use HFileLink file to replace entire file‘s reference when splitting
> 
>
> Key: HBASE-26421
> URL: https://issues.apache.org/jira/browse/HBASE-26421
> Project: HBase
>  Issue Type: Improvement
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> When splitting store files, if a file should be owned by only one child 
> region, then there will be an entire file's reference in the child region. We 
> can use HFileLink files, just like those in snapshot tables, to replace the 
> reference files that refer to entire files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27366) split or merge removed region under snapshot

2022-09-13 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603641#comment-17603641
 ] 

Huaxiang Sun commented on HBASE-27366:
--

Thanks [~zhangduo]. Will take a closer look based on your input and will try to 
come up with a unit test case to reproduce the issue. Also corrected the 
version to 2.4.5.

> split or merge removed region under snapshot
> 
>
> Key: HBASE-27366
> URL: https://issues.apache.org/jira/browse/HBASE-27366
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Affects Versions: 2.4.5
>Reporter: Huaxiang Sun
>Priority: Major
>
> We ran into snapshot failures for one table with a large number of regions. 
> The event sequence is as follows:
>  
>  # The snapshot process lists all regions for the table.
>  # The normalizer kicks in to split some regions of the table under snapshot.
>  # The split finishes and major compaction finishes; the parent region is 
> moved to the archive.
>  # When the snapshot processes the parent region, it no longer exists and the 
> snapshot fails.
> Since the snapshot process acquires the table lock but no table lock is 
> acquired in the split or merge process, they collide with each other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27366) split or merge removed region under snapshot

2022-09-13 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27366:
-
Affects Version/s: 2.4.5
   (was: 2.4.10)

> split or merge removed region under snapshot
> 
>
> Key: HBASE-27366
> URL: https://issues.apache.org/jira/browse/HBASE-27366
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Affects Versions: 2.4.5
>Reporter: Huaxiang Sun
>Priority: Major
>
> We ran into snapshot failures for one table with a large number of regions. 
> The event sequence is as follows:
>  
>  # The snapshot process lists all regions for the table.
>  # The normalizer kicks in to split some regions of the table under snapshot.
>  # The split finishes and major compaction finishes; the parent region is 
> moved to the archive.
>  # When the snapshot processes the parent region, it no longer exists and the 
> snapshot fails.
> Since the snapshot process acquires the table lock but no table lock is 
> acquired in the split or merge process, they collide with each other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27366) split or merge removed region under snapshot

2022-09-12 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27366:


 Summary: split or merge removed region under snapshot
 Key: HBASE-27366
 URL: https://issues.apache.org/jira/browse/HBASE-27366
 Project: HBase
  Issue Type: Bug
  Components: snapshots
Affects Versions: 2.4.10
Reporter: Huaxiang Sun


We ran into snapshot failures for one table with a large number of regions. The 
event sequence is as follows:

 
 # The snapshot process lists all regions for the table.
 # The normalizer kicks in to split some regions of the table under snapshot.
 # The split finishes and major compaction finishes; the parent region is moved 
to the archive.
 # When the snapshot processes the parent region, it no longer exists and the 
snapshot fails.

Since the snapshot process acquires the table lock but no table lock is 
acquired in the split or merge process, they collide with each other.
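The race can be modeled with a single lock that both operations would need to hold. This is a minimal illustrative sketch under that assumption, not HBase's actual procedure-based locking; all names are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class TableLockModel {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Set<String> regions = new HashSet<>();

    TableLockModel(Set<String> initial) {
        regions.addAll(initial);
    }

    // Snapshot: lists regions, then reads each listed region. Holding the
    // table lock for the whole operation means no split can interleave.
    boolean snapshot() {
        tableLock.lock();
        try {
            Set<String> listed = new HashSet<>(regions);
            // ... time passes; if split() could run here (i.e. if it did not
            // take the table lock), a listed parent region might already be
            // archived by the time we read it, and the snapshot would fail ...
            return regions.containsAll(listed);
        } finally {
            tableLock.unlock();
        }
    }

    // Split: replaces a parent region with two daughters. Because this sketch
    // also takes the table lock, it cannot interleave with snapshot().
    void split(String parent, String daughterA, String daughterB) {
        tableLock.lock();
        try {
            regions.remove(parent); // parent moves to the archive
            regions.add(daughterA);
            regions.add(daughterB);
        } finally {
            tableLock.unlock();
        }
    }
}
```

The bug report describes the opposite situation: only the snapshot side takes the lock, so the split can run between the listing and the read.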



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27181) Change HBCK2's setRegionState() to use HBCK's setRegionStateInMeta()

2022-09-06 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601036#comment-17601036
 ] 

Huaxiang Sun commented on HBASE-27181:
--

Instead of the current implementation's writing to the meta table directly, this 
changes it to use HBCK's setRegionStateInMeta(). There are a couple of 
advantages; one of them is that HBCK's setRegionStateInMeta() sets the Master's 
in-memory region state in addition to the state in the meta table, which saves 
an active master switchover to bring the in-memory state into consistency with 
the region state in meta.
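The advantage can be illustrated with a toy model of the two copies of region state. The class and method names below are hypothetical stand-ins, not the real Master code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class RegionStateModel {
    // Two copies of region state: the meta table and the Master's in-memory view.
    final Map<String, String> meta = new HashMap<>();
    final Map<String, String> masterInMemory = new HashMap<>();

    // Old approach: write to the meta table directly. The in-memory copy goes
    // stale, and only a master switchover would reload it from meta.
    void writeMetaDirectly(String region, String state) {
        meta.put(region, state);
    }

    // setRegionStateInMeta-style approach: update both copies, so no
    // switchover is needed to make them consistent.
    void setRegionStateInMeta(String region, String state) {
        meta.put(region, state);
        masterInMemory.put(region, state);
    }

    boolean consistent(String region) {
        return Objects.equals(meta.get(region), masterInMemory.get(region));
    }
}
```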

> Change HBCK2's setRegionState() to use HBCK's setRegionStateInMeta()
> 
>
> Key: HBASE-27181
> URL: https://issues.apache.org/jira/browse/HBASE-27181
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck2
>Affects Versions: 2.4.13
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
>
> Replica region ids are not recognized by hbck2's setRegionState, as they do 
> not show up in meta. We ran into cases where the region state needed to be set 
> in meta for replica regions in order to fix an inconsistency. We ended up 
> writing the state manually into the meta table and doing a master failover to 
> sync the state from the meta table. 
>  
> hbck2's setRegionState needs to support replica region ids and handle them 
> nicely.
> Currently, setRegionState does not use 
> MasterRpcServices#setRegionStateInMeta. There is an issue in 
> setRegionStateInMeta with supporting replica regions. Once that is fixed and 
> setRegionState uses setRegionStateInMeta to set the region state, it will 
> support replica ids.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27181) Change HBCK2's setRegionState() to use HBCK's setRegionStateInMeta()

2022-09-06 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27181:
-
Summary: Change HBCK2's setRegionState() to use HBCK's 
setRegionStateInMeta()  (was: Replica region support in HBCK2 setRegionState 
option)

> Change HBCK2's setRegionState() to use HBCK's setRegionStateInMeta()
> 
>
> Key: HBASE-27181
> URL: https://issues.apache.org/jira/browse/HBASE-27181
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck2
>Affects Versions: 2.4.13
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
>
> Replica region ids are not recognized by hbck2's setRegionState, as they do 
> not show up in meta. We ran into cases where the region state needed to be set 
> in meta for replica regions in order to fix an inconsistency. We ended up 
> writing the state manually into the meta table and doing a master failover to 
> sync the state from the meta table. 
>  
> hbck2's setRegionState needs to support replica region ids and handle them 
> nicely.
> Currently, setRegionState does not use 
> MasterRpcServices#setRegionStateInMeta. There is an issue in 
> setRegionStateInMeta with supporting replica regions. Once that is fixed and 
> setRegionState uses setRegionStateInMeta to set the region state, it will 
> support replica ids.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27345) Add 2.4.14 to the downloads page

2022-08-29 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-27345.
--
Fix Version/s: 3.0.0-alpha-4
 Assignee: Huaxiang Sun
   Resolution: Fixed

> Add 2.4.14 to the downloads page
> 
>
> Key: HBASE-27345
> URL: https://issues.apache.org/jira/browse/HBASE-27345
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 2.4.14
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
> Fix For: 3.0.0-alpha-4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27345) Add 2.4.14 to the downloads page

2022-08-29 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27345:


 Summary: Add 2.4.14 to the downloads page
 Key: HBASE-27345
 URL: https://issues.apache.org/jira/browse/HBASE-27345
 Project: HBase
  Issue Type: Task
  Components: documentation
Affects Versions: 2.4.14
Reporter: Huaxiang Sun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27336) The region visualizer shows 'undefined' region server

2022-08-26 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27336:
-
Attachment: Screen Shot 2022-08-26 at 3.31.27 PM.png

> The region visualizer shows 'undefined' region server
> -
>
> Key: HBASE-27336
> URL: https://issues.apache.org/jira/browse/HBASE-27336
> Project: HBase
>  Issue Type: Bug
>  Components: master, UI
>Reporter: Duo Zhang
>Assignee: LiangJun He
>Priority: Major
> Attachments: Screen Shot 2022-08-26 at 3.31.27 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27246) RSGroupMappingScript#getRSGroup has thread safety problem

2022-08-24 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584328#comment-17584328
 ] 

Huaxiang Sun commented on HBASE-27246:
--

Moved to 2.4.15 as 2.4.14RC1 has been built.

> RSGroupMappingScript#getRSGroup has thread safety problem
> -
>
> Key: HBASE-27246
> URL: https://issues.apache.org/jira/browse/HBASE-27246
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Major
> Fix For: 2.6.0, 2.5.1, 3.0.0-alpha-4, 2.4.15
>
> Attachments: Test.java, result.png
>
>
> We are using version 1.4.12 and sometimes hit a problem in the table creation 
> phase. The master branch also has this problem. The error message is:
> {code:java}
> 2022-07-26 19:26:20.122 [http-nio-8078-exec-24,d2ad4b13b542b6fb] ERROR 
> HBaseServiceImpl - hbase create table: xxx: failed. 
> (HBaseServiceImpl.java:116)
> java.lang.RuntimeException: 
> org.apache.hadoop.hbase.constraint.ConstraintException: 
> org.apache.hadoop.hbase.constraint.ConstraintException: Default RSGroup 
> (default
> default) for this table's namespace does not exist.
> {code}
> The rsgroup here should be a single 'default', not two consecutive 'default's. 
> The code to get the RSGroup from a mapping script is:
> {code:java}
> String getRSGroup(String namespace, String tablename) {
>   if (rsgroupMappingScript == null) {
>     return null;
>   }
>   String[] exec = rsgroupMappingScript.getExecString();
>   exec[1] = namespace;
>   exec[2] = tablename;
>   try {
>     rsgroupMappingScript.execute();
>   } catch (IOException e) {
>     // This exception may happen, e.g. the process doesn't have permission to
>     // run this script.
>     LOG.error("{}, placing {} back to default rsgroup", e.getMessage(),
>       TableName.valueOf(namespace, tablename));
>     return RSGroupInfo.DEFAULT_GROUP;
>   }
>   return rsgroupMappingScript.getOutput().trim();
> }
> {code}
> Here the rsgroupMappingScript could be executed by multiple threads.
> To verify it is a multi-threading issue, I ran a piece of code locally and 
> found that the hadoop ShellCommandExecutor is not thread-safe (I ran the code 
> with hadoop 2.10.0 and 3.3.2), so we should make this method synchronized. 
> This issue is also present in the master branch.
> The test code is attached and my rsgroup mapping script is very simple:
> {code:bash}
> #!/bin/bash
> namespace=$1
> tablename=$2
> echo default
> {code}
> The reproduced screenshot is also attached.
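The hazard and the proposed synchronized fix can be sketched with a stand-in for the script executor's shared command array and output buffer (illustrative only; hadoop's ShellCommandExecutor is not reproduced here, and the names are hypothetical):

```java
class RSGroupMapper {
    // Shared mutable state, mirroring how the script executor reuses its
    // command array and output buffer across calls.
    private final String[] exec = new String[] { "/path/to/script", null, null };
    private final StringBuilder output = new StringBuilder();

    // synchronized guards the shared exec array and output buffer; without it,
    // two concurrent callers can interleave and one may see the other's output
    // appended to its own (e.g. "default\ndefault" instead of "default").
    synchronized String getRSGroup(String namespace, String tablename) {
        exec[1] = namespace;
        exec[2] = tablename;
        output.setLength(0);            // reset the shared buffer
        output.append("default\n");     // stand-in for executing the script
        return output.toString().trim();
    }
}
```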



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27246) RSGroupMappingScript#getRSGroup has thread safety problem

2022-08-24 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27246:
-
Fix Version/s: 2.4.15
   (was: 2.4.14)

> RSGroupMappingScript#getRSGroup has thread safety problem
> -
>
> Key: HBASE-27246
> URL: https://issues.apache.org/jira/browse/HBASE-27246
> Project: HBase
>  Issue Type: Bug
>  Components: rsgroup
>Reporter: Yutong Xiao
>Assignee: Yutong Xiao
>Priority: Major
> Fix For: 2.6.0, 2.5.1, 3.0.0-alpha-4, 2.4.15
>
> Attachments: Test.java, result.png
>
>
> We are using version 1.4.12 and sometimes hit a problem in the table creation 
> phase. The master branch also has this problem. The error message is:
> {code:java}
> 2022-07-26 19:26:20.122 [http-nio-8078-exec-24,d2ad4b13b542b6fb] ERROR 
> HBaseServiceImpl - hbase create table: xxx: failed. 
> (HBaseServiceImpl.java:116)
> java.lang.RuntimeException: 
> org.apache.hadoop.hbase.constraint.ConstraintException: 
> org.apache.hadoop.hbase.constraint.ConstraintException: Default RSGroup 
> (default
> default) for this table's namespace does not exist.
> {code}
> The rsgroup here should be a single 'default', not two consecutive 'default's. 
> The code to get the RSGroup from a mapping script is:
> {code:java}
> String getRSGroup(String namespace, String tablename) {
>   if (rsgroupMappingScript == null) {
>     return null;
>   }
>   String[] exec = rsgroupMappingScript.getExecString();
>   exec[1] = namespace;
>   exec[2] = tablename;
>   try {
>     rsgroupMappingScript.execute();
>   } catch (IOException e) {
>     // This exception may happen, e.g. the process doesn't have permission to
>     // run this script.
>     LOG.error("{}, placing {} back to default rsgroup", e.getMessage(),
>       TableName.valueOf(namespace, tablename));
>     return RSGroupInfo.DEFAULT_GROUP;
>   }
>   return rsgroupMappingScript.getOutput().trim();
> }
> {code}
> Here the rsgroupMappingScript could be executed by multiple threads.
> To verify it is a multi-threading issue, I ran a piece of code locally and 
> found that the hadoop ShellCommandExecutor is not thread-safe (I ran the code 
> with hadoop 2.10.0 and 3.3.2), so we should make this method synchronized. 
> This issue is also present in the master branch.
> The test code is attached and my rsgroup mapping script is very simple:
> {code:bash}
> #!/bin/bash
> namespace=$1
> tablename=$2
> echo default
> {code}
> The reproduced screenshot is also attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27301) Add Delete addFamilyVersion timestamp verify

2022-08-23 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583730#comment-17583730
 ] 

Huaxiang Sun commented on HBASE-27301:
--

I went through the Jira list again. What happened is that HBASE-27125 was fixed 
in 2.4.13 but its "fix version" was wrongly marked as 2.4.14, so it did not 
show up in 2.4.13's release notes. When I made the change, I marked it wrongly 
in this jira; corrected it and am doing a respin now.

 

> Add Delete addFamilyVersion  timestamp verify
> -
>
> Key: HBASE-27301
> URL: https://issues.apache.org/jira/browse/HBASE-27301
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client
>Reporter: zhengsicheng
>Assignee: zhengsicheng
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27301) Add Delete addFamilyVersion timestamp verify

2022-08-23 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27301:
-
Fix Version/s: 2.4.14
   (was: 2.4.13)

> Add Delete addFamilyVersion  timestamp verify
> -
>
> Key: HBASE-27301
> URL: https://issues.apache.org/jira/browse/HBASE-27301
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client
>Reporter: zhengsicheng
>Assignee: zhengsicheng
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27125) The batch size of cleaning expired mob files should have an upper bound

2022-08-23 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27125:
-
Fix Version/s: 2.4.13
   (was: 2.4.14)

> The batch size of cleaning expired mob files should have an upper bound
> ---
>
> Key: HBASE-27125
> URL: https://issues.apache.org/jira/browse/HBASE-27125
> Project: HBase
>  Issue Type: Improvement
>  Components: mob
>Affects Versions: 2.4.12
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.13
>
>
> Currently the cleaning logic for expired mob files adds all the deletable 
> files in one directory to a list in memory and then archives all the files in 
> the list. But when there are millions of files to delete, the list will be 
> huge and put great heap memory pressure on the master.
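One way to bound the memory, as the description suggests, is to drain the deletable files in fixed-size batches instead of one unbounded list. A minimal sketch, with hypothetical names:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchedCleaner {
    // Archive deletable files in batches of at most batchSize, so the
    // in-memory list never grows with the total number of files. Returns the
    // number of archive calls made.
    static int clean(Iterator<String> deletable, int batchSize, Consumer<List<String>> archive) {
        int batches = 0;
        List<String> batch = new ArrayList<>(batchSize);
        while (deletable.hasNext()) {
            batch.add(deletable.next());
            if (batch.size() == batchSize) {
                archive.accept(batch);
                batch = new ArrayList<>(batchSize); // the consumer may keep the old list
                batches++;
            }
        }
        if (!batch.isEmpty()) { // flush the final partial batch
            archive.accept(batch);
            batches++;
        }
        return batches;
    }
}
```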



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27125) The batch size of cleaning expired mob files should have an upper bound

2022-08-23 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583723#comment-17583723
 ] 

Huaxiang Sun commented on HBASE-27125:
--

This was released in 2.4.13, but due to a fix-version error it was marked as 
2.4.14, so it did not show up in the 2.4.13 release notes. Fixing it now.

> The batch size of cleaning expired mob files should have an upper bound
> ---
>
> Key: HBASE-27125
> URL: https://issues.apache.org/jira/browse/HBASE-27125
> Project: HBase
>  Issue Type: Improvement
>  Components: mob
>Affects Versions: 2.4.12
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.14
>
>
> Currently the cleaning logic for expired mob files adds all the deletable 
> files in one directory to a list in memory and then archives all the files in 
> the list. But when there are millions of files to delete, the list will be 
> huge and put great heap memory pressure on the master.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27129) Add a config that allows us to configure region-level storage policies

2022-08-23 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583722#comment-17583722
 ] 

Huaxiang Sun commented on HBASE-27129:
--

Remove 2.4.13 from the fixed version as it is only in 2.5+.

> Add a config that allows us to configure region-level storage policies
> --
>
> Key: HBASE-27129
> URL: https://issues.apache.org/jira/browse/HBASE-27129
> Project: HBase
>  Issue Type: New Feature
>  Components: regionserver
>Reporter: tianhang tang
>Assignee: tianhang tang
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-4
>
>
> {code:java}
> <property>
>   <name>hbase.hregion.block.storage.policy</name>
>   <value>HOT|ALL_SSD|...</value>
> </property>
> {code}
> With this config, we can set region-level storage policies.
>  
> We already have this config for the CF-level storage policy:
> {code:java}
> <property>
>   <name>hbase.hstore.block.storage.policy</name>
>   <value>ALL_SSD</value>
> </property>
> {code}
> But in addition to the CF directories, there are other paths under the region 
> path, such as .splits, recovered.edits, and .tmp.
> So I want to add a region-level config which covers the whole path; if you 
> have any other requirements for the cf directory, you can continue to use the 
> previous hbase.hstore.block.storage.policy to specify them separately.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27129) Add a config that allows us to configure region-level storage policies

2022-08-23 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27129:
-
Fix Version/s: (was: 2.4.14)

> Add a config that allows us to configure region-level storage policies
> --
>
> Key: HBASE-27129
> URL: https://issues.apache.org/jira/browse/HBASE-27129
> Project: HBase
>  Issue Type: New Feature
>  Components: regionserver
>Reporter: tianhang tang
>Assignee: tianhang tang
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-4
>
>
> {code:java}
> <property>
>   <name>hbase.hregion.block.storage.policy</name>
>   <value>HOT|ALL_SSD|...</value>
> </property>
> {code}
> With this config, we can set region-level storage policies.
>  
> We already have this config for the CF-level storage policy:
> {code:java}
> <property>
>   <name>hbase.hstore.block.storage.policy</name>
>   <value>ALL_SSD</value>
> </property>
> {code}
> But in addition to the CF directories, there are other paths under the region 
> path, such as .splits, recovered.edits, and .tmp.
> So I want to add a region-level config which covers the whole path; if you 
> have any other requirements for the cf directory, you can continue to use the 
> previous hbase.hstore.block.storage.policy to specify them separately.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27301) Add Delete addFamilyVersion timestamp verify

2022-08-23 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583704#comment-17583704
 ] 

Huaxiang Sun commented on HBASE-27301:
--

Sorry to have missed your message, [~zhangduo]. I may have marked the Jira 
wrongly when going through the Jira list last week; let me go through the Jira 
list again and correct it, thanks. 

> Add Delete addFamilyVersion  timestamp verify
> -
>
> Key: HBASE-27301
> URL: https://issues.apache.org/jira/browse/HBASE-27301
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client
>Reporter: zhengsicheng
>Assignee: zhengsicheng
>Priority: Minor
> Fix For: 2.5.0, 2.4.13, 3.0.0-alpha-4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27243) Fix TestQuotaThrottle after HBASE-27046

2022-08-18 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27243:
-
Fix Version/s: 2.4.15
   (was: 2.4.14)

> Fix TestQuotaThrottle after HBASE-27046
> ---
>
> Key: HBASE-27243
> URL: https://issues.apache.org/jira/browse/HBASE-27243
> Project: HBase
>  Issue Type: Test
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 2.6.0, 2.5.1, 3.0.0-alpha-4, 2.4.15
>
>
> TestQuotaThrottle breaks monotonic WAL numbering after HBASE-20746 because of 
> how it manipulates the EnvironmentEdge and was disabled by HBASE-27087. Fix 
> the test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27243) Fix TestQuotaThrottle after HBASE-27046

2022-08-18 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581505#comment-17581505
 ] 

Huaxiang Sun commented on HBASE-27243:
--

Moved out of 2.4.14.

> Fix TestQuotaThrottle after HBASE-27046
> ---
>
> Key: HBASE-27243
> URL: https://issues.apache.org/jira/browse/HBASE-27243
> Project: HBase
>  Issue Type: Test
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 2.6.0, 2.5.1, 3.0.0-alpha-4, 2.4.15
>
>
> TestQuotaThrottle breaks monotonic WAL numbering after HBASE-20746 because of 
> how it manipulates the EnvironmentEdge and was disabled by HBASE-27087. Fix 
> the test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27301) Add Delete addFamilyVersion timestamp verify

2022-08-18 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581504#comment-17581504
 ] 

Huaxiang Sun commented on HBASE-27301:
--

It is in 2.4.13 instead of 2.4.14.

> Add Delete addFamilyVersion  timestamp verify
> -
>
> Key: HBASE-27301
> URL: https://issues.apache.org/jira/browse/HBASE-27301
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client
>Reporter: zhengsicheng
>Assignee: zhengsicheng
>Priority: Minor
> Fix For: 2.5.0, 2.4.13, 3.0.0-alpha-4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27301) Add Delete addFamilyVersion timestamp verify

2022-08-18 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27301:
-
Fix Version/s: 2.4.13
   (was: 2.4.14)

> Add Delete addFamilyVersion  timestamp verify
> -
>
> Key: HBASE-27301
> URL: https://issues.apache.org/jira/browse/HBASE-27301
> Project: HBase
>  Issue Type: Sub-task
>  Components: Client
>Reporter: zhengsicheng
>Assignee: zhengsicheng
>Priority: Minor
> Fix For: 2.5.0, 2.4.13, 3.0.0-alpha-4
>
>






[jira] [Commented] (HBASE-26254) [branch-2.4] Flaky tests

2022-08-18 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581496#comment-17581496
 ] 

Huaxiang Sun commented on HBASE-26254:
--

Moved to 2.4.15.

> [branch-2.4] Flaky tests
> 
>
> Key: HBASE-26254
> URL: https://issues.apache.org/jira/browse/HBASE-26254
> Project: HBase
>  Issue Type: Task
>  Components: test
>Affects Versions: 2.4.6
>Reporter: Andrew Kyle Purtell
>Priority: Minor
> Fix For: 2.4.15
>
>
> Findings listed below. Address on/with subtasks.
> org.apache.hadoop.hbase.namequeues.TestSlowLogAccessor.testHigherSlowLogs 
> [~vjasani]
> {quote}
> {noformat}
> java.lang.AssertionError: Waiting timed out after [7,000] msec
> at org.junit.Assert.fail(Assert.java:89)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137)
> at 
> org.apache.hadoop.hbase.HBaseCommonTestingUtility.waitFor(HBaseCommonTestingUtility.java:253)
> at 
> org.apache.hadoop.hbase.namequeues.TestSlowLogAccessor.testHigherSlowLogs(TestSlowLogAccessor.java:211)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testSizeOfLogQueue
>  [~shahrs87]
> {quote}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<2>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testSizeOfLogQueue(TestBasicWALEntryStream.java:701)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.util.TestFromClientSide3WoUnsafe.testScanAfterDeletingSpecifiedRow
>  (committer no longer active)
> {quote}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.hbase.client.TestFromClientSide3.testScanAfterDeletingSpecifiedRow(TestFromClientSide3.java:206)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.replication.regionserver.TestReplicationSource.testReplicationSourceInitializingMetric
>  [~sandeep.pal]
> {quote}
> {noformat}
> java.lang.AssertionError: Waiting timed out after [1,000] msec
> at org.junit.Assert.fail(Assert.java:89)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137)
> at 
> org.apache.hadoop.hbase.replication.regionserver.TestReplicationSource.testReplicationSourceInitializingMetric(TestReplicationSource.java:583)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.security.access.TestCellACLWithMultipleVersions.testCellPermissionwithVersions
>  [~apurtell]
> {quote}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<2>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.hbase.security.access.SecureTestUtil.verifyAllowed(SecureTestUtil.java:181)
> at 
> org.apache.hadoop.hbase.security.access.TestCellACLWithMultipleVersions.testCellPermissionwithVersions(TestCellACLWithMultipleVersions.java:243)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.client.TestMasterAddressRefresher 





[jira] [Updated] (HBASE-26254) [branch-2.4] Flaky tests

2022-08-18 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26254:
-
Fix Version/s: 2.4.15
   (was: 2.4.14)

> [branch-2.4] Flaky tests
> 
>
> Key: HBASE-26254
> URL: https://issues.apache.org/jira/browse/HBASE-26254
> Project: HBase
>  Issue Type: Task
>  Components: test
>Affects Versions: 2.4.6
>Reporter: Andrew Kyle Purtell
>Priority: Minor
> Fix For: 2.4.15
>
>
> Findings listed below. Address on/with subtasks.
> org.apache.hadoop.hbase.namequeues.TestSlowLogAccessor.testHigherSlowLogs 
> [~vjasani]
> {quote}
> {noformat}
> java.lang.AssertionError: Waiting timed out after [7,000] msec
> at org.junit.Assert.fail(Assert.java:89)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137)
> at 
> org.apache.hadoop.hbase.HBaseCommonTestingUtility.waitFor(HBaseCommonTestingUtility.java:253)
> at 
> org.apache.hadoop.hbase.namequeues.TestSlowLogAccessor.testHigherSlowLogs(TestSlowLogAccessor.java:211)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testSizeOfLogQueue
>  [~shahrs87]
> {quote}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<2>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testSizeOfLogQueue(TestBasicWALEntryStream.java:701)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.util.TestFromClientSide3WoUnsafe.testScanAfterDeletingSpecifiedRow
>  (committer no longer active)
> {quote}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<0>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.hbase.client.TestFromClientSide3.testScanAfterDeletingSpecifiedRow(TestFromClientSide3.java:206)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.replication.regionserver.TestReplicationSource.testReplicationSourceInitializingMetric
>  [~sandeep.pal]
> {quote}
> {noformat}
> java.lang.AssertionError: Waiting timed out after [1,000] msec
> at org.junit.Assert.fail(Assert.java:89)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203)
> at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137)
> at 
> org.apache.hadoop.hbase.replication.regionserver.TestReplicationSource.testReplicationSourceInitializingMetric(TestReplicationSource.java:583)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.security.access.TestCellACLWithMultipleVersions.testCellPermissionwithVersions
>  [~apurtell]
> {quote}
> {noformat}
> java.lang.AssertionError: expected:<1> but was:<2>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.hbase.security.access.SecureTestUtil.verifyAllowed(SecureTestUtil.java:181)
> at 
> org.apache.hadoop.hbase.security.access.TestCellACLWithMultipleVersions.testCellPermissionwithVersions(TestCellACLWithMultipleVersions.java:243)
> {noformat}
> {quote}
> org.apache.hadoop.hbase.client.TestMasterAddressRefresher 





[jira] [Commented] (HBASE-27154) Backport missing MOB related changes to branch-2

2022-08-15 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579895#comment-17579895
 ] 

Huaxiang Sun commented on HBASE-27154:
--

#4617 is merged and backported to branch-2.5 as well.

> Backport missing MOB related changes to branch-2
> 
>
> Key: HBASE-27154
> URL: https://issues.apache.org/jira/browse/HBASE-27154
> Project: HBase
>  Issue Type: Bug
>  Components: mob
>Affects Versions: 2.6.0
>Reporter: Szabolcs Bukros
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 2.5.0
>
>
> While trying to backport https://issues.apache.org/jira/browse/HBASE-26969 to 
> branch-2, I have found that multiple major MOB-related changes are missing. 
> That change is required for file-based SFT correctness, so the changes it 
> depends on should be backported first. Also, any improvement to MOB stability 
> is welcome.
> The missing changes I have found so far:
> https://issues.apache.org/jira/browse/HBASE-22749
> https://issues.apache.org/jira/browse/HBASE-23723
> https://issues.apache.org/jira/browse/HBASE-24163
> There is also a docs change describing the new MOB functionality. But 
> considering that the book is always generated based on master I think it is 
> safe to skip backporting it.
> https://issues.apache.org/jira/browse/HBASE-23198
> I'm planning to backport these changes one by one until we reach a state 
> where HBASE-26969  can be backported too.





[jira] [Resolved] (HBASE-27296) Some Cell's implementation of toString() such as IndividualBytesFieldCell prints out value and tags which is too verbose

2022-08-15 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-27296.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-4
   2.4.14
   Resolution: Fixed

> Some Cell's implementation of toString() such as IndividualBytesFieldCell 
> prints out value and tags which is too verbose
> 
>
> Key: HBASE-27296
> URL: https://issues.apache.org/jira/browse/HBASE-27296
> Project: HBase
>  Issue Type: Improvement
>  Components: logging
>Affects Versions: 2.4.12
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> One user sees cells >10 MB logged in their client log when over the limit. 
> Checking the code, toString() behavior is not consistent; most implementations 
> do not include values and tags. Change toString() to exclude tags/value.
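A minimal sketch of the idea in this description (hypothetical class and method names, not the actual HBase patch): print only the cell coordinates and byte lengths, never the value or tag bytes, so oversized cells do not flood client logs.

```java
// Hypothetical sketch (assumed names, not the real HBase Cell API): a
// Cell-style toString() that reports value/tag sizes instead of contents.
public class CellToStringExample {
  static String cellToString(String row, String family, String qualifier,
      long timestamp, String type, int valueLen, int tagsLen) {
    // Only coordinates and lengths are included; value/tag bytes are excluded.
    return row + "/" + family + ":" + qualifier + "/" + timestamp + "/" + type
        + "/vlen=" + valueLen + "/tagslen=" + tagsLen;
  }

  public static void main(String[] args) {
    // A 10 MB value is reported by its length, not dumped into the log.
    System.out.println(cellToString("row1", "f", "q", 1L, "Put", 10485760, 0));
  }
}
```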





[jira] [Commented] (HBASE-26969) Eliminate MOB renames when SFT is enabled

2022-08-12 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579144#comment-17579144
 ] 

Huaxiang Sun commented on HBASE-26969:
--

We are planning to spin the 2.5.0 RC next Monday, [~bszabolcs], do you think it can 
be merged by then? Otherwise, it will be pushed out to 2.5.1.

> Eliminate MOB renames when SFT is enabled
> -
>
> Key: HBASE-26969
> URL: https://issues.apache.org/jira/browse/HBASE-26969
> Project: HBase
>  Issue Type: Sub-task
>  Components: mob
>Affects Versions: 2.5.0, 3.0.0-alpha-3
>Reporter: Szabolcs Bukros
>Assignee: Szabolcs Bukros
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-4
>
>
> MOB file compaction and flush still rely on renames even when SFT is 
> enabled.
> My proposed changes are:
>  * when requireWritingToTmpDirFirst is false during mob flush/compact, instead 
> of using the temp writer we should create a different writer, using a 
> StoreFileWriterCreationTracker, that writes directly to the mob store folder
>  * these StoreFileWriterCreationTrackers should be stored in the MobStore. 
> This requires us to extend MobStore with a createWriter and a finalizeWriter 
> method to handle this
>  * refactor MobFileCleanerChore to run on the RS instead of on the Master, to 
> allow access to the StoreFileWriterCreationTrackers, to make sure the 
> currently written files are not cleaned up
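The proposed flow can be sketched roughly as follows. All names here (createWriter, finalizeWriter, safeToClean) are assumptions for illustration only, not the actual HBase API: the point is that the writer's target path is registered in a creation tracker before writing directly to the store folder, so a cleaner chore on the same RS can skip in-flight files.

```java
// Rough sketch under assumed names (not the real HBase API): track files
// currently being written so the MOB cleaner does not delete them mid-write.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class MobWriterSketch {
  // Paths of files currently being written; consulted by the cleaner chore.
  private final Set<String> creationTracker = ConcurrentHashMap.newKeySet();

  String createWriter(boolean requireWritingToTmpDirFirst,
      String mobStoreDir, String tmpDir, String fileName) {
    // With SFT enabled (no tmp-dir-first requirement), write directly to the
    // mob store folder instead of writing to tmp and renaming afterwards.
    String target = (requireWritingToTmpDirFirst ? tmpDir : mobStoreDir) + "/" + fileName;
    creationTracker.add(target); // visible to the cleaner on this RS
    return target;
  }

  void finalizeWriter(String target) {
    creationTracker.remove(target); // file committed; cleaner may consider it
  }

  boolean safeToClean(String path) {
    return !creationTracker.contains(path);
  }

  public static void main(String[] args) {
    MobWriterSketch sketch = new MobWriterSketch();
    String target = sketch.createWriter(false, "/hbase/mobdir", "/hbase/.tmp", "file1");
    System.out.println("writing " + target + ", safeToClean=" + sketch.safeToClean(target));
    sketch.finalizeWriter(target);
    System.out.println("finalized, safeToClean=" + sketch.safeToClean(target));
  }
}
```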





[jira] [Updated] (HBASE-27296) Some Cell's implementation of toString() such as IndividualBytesFieldCell prints out value and tags which is too verbose

2022-08-11 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27296:
-
Description: 
One of users sees cells >10Mb are logged when over limit at their client log. 

Checked the code, toString() behavior is not consistent, mostly does not 
include values and tags. Change toString() to exclude tags/value.

  was:
One of users sees cells >10Mb are logged when over limit. 

Checked the code, toString() behavior is not consistent, mostly does not 
include values and tags. Change toString() to exclude tags/value.


> Some Cell's implementation of toString() such as IndividualBytesFieldCell 
> prints out value and tags which is too verbose
> 
>
> Key: HBASE-27296
> URL: https://issues.apache.org/jira/browse/HBASE-27296
> Project: HBase
>  Issue Type: Improvement
>  Components: logging
>Affects Versions: 2.4.12
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
>
> One user sees cells >10 MB logged in their client log when over the limit. 
> Checking the code, toString() behavior is not consistent; most implementations 
> do not include values and tags. Change toString() to exclude tags/value.





[jira] [Updated] (HBASE-27296) Some Cell's implementation of toString() such as IndividualBytesFieldCell prints out value and tags which is too verbose

2022-08-11 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27296:
-
Summary: Some Cell's implementation of toString() such as 
IndividualBytesFieldCell prints out value and tags which is too verbose  (was: 
IndividualBytesFieldCell#toString() prints out value and tags which is too 
verbose.)

> Some Cell's implementation of toString() such as IndividualBytesFieldCell 
> prints out value and tags which is too verbose
> 
>
> Key: HBASE-27296
> URL: https://issues.apache.org/jira/browse/HBASE-27296
> Project: HBase
>  Issue Type: Improvement
>  Components: logging
>Affects Versions: 2.4.12
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
>
> One user sees cells >10 MB logged when over the limit. 
> Checking the code, toString() behavior is not consistent; most implementations 
> do not include values and tags. Change toString() to exclude tags/value.





[jira] [Created] (HBASE-27296) IndividualBytesFieldCell#toString() prints out value and tags which is too verbose.

2022-08-11 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27296:


 Summary: IndividualBytesFieldCell#toString() prints out value and 
tags which is too verbose.
 Key: HBASE-27296
 URL: https://issues.apache.org/jira/browse/HBASE-27296
 Project: HBase
  Issue Type: Improvement
  Components: logging
Affects Versions: 2.4.12
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


One user sees cells >10 MB logged when over the limit. 

Checking the code, toString() behavior is not consistent; most implementations 
do not include values and tags. Change toString() to exclude tags/value.





[jira] [Updated] (HBASE-27251) Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo`

2022-08-01 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27251:
-
Release Note: 
Before upgrading an HBase cluster to 2.5.0, it is strongly recommended to upgrade 
the cluster to HBase 2.4.14 or later first. This guarantees a smooth rollback if 
one is needed; otherwise, the rollback may run into HBASE-27251.

  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

> Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo`
> ---
>
> Key: HBASE-27251
> URL: https://issues.apache.org/jira/browse/HBASE-27251
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.5.0
>Reporter: Nick Dimiduk
>Assignee: Huaxiang Sun
>Priority: Critical
> Fix For: 2.4.14
>
>
> I was doing some perf testing with builds of 2.5.0. I rolled back to 2.4.13 
> and the master won't start. Stack trace ends in,
> {noformat}
> java.io.FileNotFoundException: File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
> at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
> {noformat}
> When I examine the on-disk file system, I see,
> {noformat}
> nonroot@namenode-0:~$ hdfs dfs -ls /hbase/MasterData/data/master/store/
> Found 3 items
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.initialized
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 

[jira] [Commented] (HBASE-27251) Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo`

2022-08-01 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573967#comment-17573967
 ] 

Huaxiang Sun commented on HBASE-27251:
--

The addendum is merged back to branch-2.4, resolving it.

> Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo`
> ---
>
> Key: HBASE-27251
> URL: https://issues.apache.org/jira/browse/HBASE-27251
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.5.0
>Reporter: Nick Dimiduk
>Assignee: Huaxiang Sun
>Priority: Critical
> Fix For: 2.4.14
>
>
> I was doing some perf testing with builds of 2.5.0. I rolled back to 2.4.13 
> and the master won't start. Stack trace ends in,
> {noformat}
> java.io.FileNotFoundException: File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
> at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
> {noformat}
> When I examine the on-disk file system, I see,
> {noformat}
> nonroot@namenode-0:~$ hdfs dfs -ls /hbase/MasterData/data/master/store/
> Found 3 items
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.initialized
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.tabledesc
> drwxr-xr-x   - nonroot supergroup  0 2022-07-27 16:25 
> /hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682
> nonroot@namenode-0:~$ 

[jira] [Commented] (HBASE-27251) Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo`

2022-07-28 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572659#comment-17572659
 ] 

Huaxiang Sun commented on HBASE-27251:
--

Put up an addendum to add a warning log if there are non-region directories 
under the master store.
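The idea behind the addendum can be sketched as follows. This is a hypothetical illustration, not the actual patch: region directories under the master local store have 32-character hex-encoded names, while dot-prefixed entries such as `.initialized` and `.tabledesc` are metadata markers written by newer versions, so everything else is skipped with a warning instead of being opened as a region.

```java
// Hypothetical sketch (assumed names, not the actual HBase addendum): keep
// only 32-char hex region directory names; warn on and skip everything else.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MasterStoreScanSketch {
  static List<String> regionDirs(List<String> children) {
    List<String> regions = new ArrayList<>();
    for (String name : children) {
      if (name.matches("[0-9a-f]{32}")) {
        regions.add(name); // looks like an encoded region name
      } else {
        System.err.println("WARN: skipping non-region directory under master store: " + name);
      }
    }
    return regions;
  }

  public static void main(String[] args) {
    System.out.println(regionDirs(Arrays.asList(
        ".initialized", ".tabledesc", "1595e783b53d99cd5eef43b6debb2682")));
  }
}
```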

> Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo`
> ---
>
> Key: HBASE-27251
> URL: https://issues.apache.org/jira/browse/HBASE-27251
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.5.0
>Reporter: Nick Dimiduk
>Priority: Critical
> Fix For: 2.4.14
>
>
> I was doing some perf testing with builds of 2.5.0. I rolled back to 2.4.13 
> and the master won't start. Stack trace ends in,
> {noformat}
> java.io.FileNotFoundException: File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
> at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
> {noformat}
> When I examine the on-disk file system, I see,
> {noformat}
> nonroot@namenode-0:~$ hdfs dfs -ls /hbase/MasterData/data/master/store/
> Found 3 items
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.initialized
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.tabledesc
> drwxr-xr-x   - nonroot supergroup  0 2022-07-27 16:25 
> /hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682
> 

[jira] [Commented] (HBASE-27251) Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo`

2022-07-28 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572589#comment-17572589
 ] 

Huaxiang Sun commented on HBASE-27251:
--

Sounds good to me, [~apurtell]. Let me put an addendum for logging.

> Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo`
> ---
>
> Key: HBASE-27251
> URL: https://issues.apache.org/jira/browse/HBASE-27251
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.5.0
>Reporter: Nick Dimiduk
>Priority: Critical
> Fix For: 2.4.14
>
>
> I was doing some perf testing with builds of 2.5.0. I rolled back to 2.4.13 
> and the master won't start. Stack trace ends in,
> {noformat}
> java.io.FileNotFoundException: File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
> at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
> {noformat}
> When I examine the on-disk file system, I see,
> {noformat}
> nonroot@namenode-0:~$ hdfs dfs -ls /hbase/MasterData/data/master/store/
> Found 3 items
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.initialized
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.tabledesc
> drwxr-xr-x   - nonroot supergroup  0 2022-07-27 16:25 
> /hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682
> nonroot@namenode-0:~$ hdfs dfs -ls 
> 

[jira] [Comment Edited] (HBASE-27251) Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo`

2022-07-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572163#comment-17572163
 ] 

Huaxiang Sun edited comment on HBASE-27251 at 7/27/22 11:44 PM:


[~apurtell], let me create one backport Jira. Not sure if skipping these 
.tabledesc and .initialized directories is enough. I left one comment in 
HBASE-26640 to get Duo's input. Let me fix the "fix version" for HBASE-26640 as 
well, as it is marked as 2.6.0.


was (Author: huaxiangsun):
[~apurtell], let me create one backport Jira. Not sure if ignore these 
.tabledesc and .initialized directories are enough. I left one comment in 
HBASE-26640 to get Duo's input. Let me fix the "fix version" for HBASE-26640 as 
well as it is marked as 2.6.0.

> Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo`
> ---
>
> Key: HBASE-27251
> URL: https://issues.apache.org/jira/browse/HBASE-27251
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.5.0
>Reporter: Nick Dimiduk
>Priority: Critical
> Fix For: 2.4.14
>
>
> I was doing some perf testing with builds of 2.5.0. I rolled back to 2.4.13 
> and the master won't start. Stack trace ends in,
> {noformat}
> java.io.FileNotFoundException: File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
> at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
> {noformat}
> When I examine the on-disk 

[jira] [Updated] (HBASE-26640) Reimplement master local region initialization to better work with SFT

2022-07-27 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26640:
-
Fix Version/s: 2.5.0
   (was: 2.6.0)

> Reimplement master local region initialization to better work with SFT
> --
>
> Key: HBASE-26640
> URL: https://issues.apache.org/jira/browse/HBASE-26640
> Project: HBase
>  Issue Type: Sub-task
>  Components: master, RegionProcedureStore
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> It is not like a normal region, where we have a TableDescriptor, so it cannot 
> store an SFT implementation of its own. In the current implementation, if we 
> change the global SFT configuration, the SFT implementation of the master 
> local region changes with it, which can cause data loss.
> As a first step, we could hard-code it to use DefaultSFT. The region is small, 
> so this should not cause much performance impact. Later we can find a way to 
> manage its SFT implementation explicitly.
> == Update ==
> The initialization of the master local region depends on renaming, which does 
> not work well on OSS, so we should change that as well. The basic idea is to 
> touch a '.initialized' file to indicate that initialization is complete. We 
> need to consider how to migrate existing master local regions that do not 
> have this file.
> We could also store the TableDescriptor on the file system, so we can detect 
> an SFT change; if there is one, we should run the migration before actually 
> opening the master local region.
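The '.initialized' marker-file idea above can be illustrated with a small, self-contained analogue. This is plain `java.nio`, not the actual master-local-region code; the class and directory names are illustrative only. The point is that the marker is touched *last*, so a crash mid-initialization leaves a directory that is simply re-initialized on the next attempt, with no reliance on an atomic rename:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the marker-file pattern: build the layout first, then
// "commit" by touching a marker file as the final step.
public class MarkerFileInit {

  static final String MARKER = ".initialized";

  // The store directory counts as initialized only once the marker exists.
  static boolean isInitialized(Path storeDir) {
    return Files.exists(storeDir.resolve(MARKER));
  }

  // Idempotent: a second call is a no-op, and a crash before the marker
  // is written just means initialization runs again from scratch.
  static void initialize(Path storeDir) throws IOException {
    if (isInitialized(storeDir)) {
      return;
    }
    Files.createDirectories(storeDir.resolve("data"));
    Files.createFile(storeDir.resolve(MARKER)); // commit point
  }
}
```

The same shape also suggests the migration path mentioned above: an existing region directory without the marker can be detected and have the marker added after validation.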



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27251) Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo`

2022-07-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572163#comment-17572163
 ] 

Huaxiang Sun commented on HBASE-27251:
--

[~apurtell], let me create a backport Jira. I am not sure whether ignoring these 
.tabledesc and .initialized directories is enough; I left a comment in 
HBASE-26640 to get Duo's input. Let me also fix the "fix version" for 
HBASE-26640, as it is currently marked as 2.6.0.

> Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo`
> ---
>
> Key: HBASE-27251
> URL: https://issues.apache.org/jira/browse/HBASE-27251
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.5.0
>Reporter: Nick Dimiduk
>Priority: Critical
> Fix For: 2.4.14
>
>
> I was doing some perf testing with builds of 2.5.0. I rolled back to 2.4.13 
> and the master won't start. Stack trace ends in,
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
> at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
> {noformat}
> When I examine the on-disk file system, I see,
> {noformat}
> nonroot@namenode-0:~$ hdfs dfs -ls /hbase/MasterData/data/master/store/
> Found 3 items
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.initialized
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> 

[jira] [Commented] (HBASE-27251) Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: /hbase/MasterData/data/master/store/.initialized/.regioninfo`

2022-07-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572123#comment-17572123
 ] 

Huaxiang Sun commented on HBASE-27251:
--

We need a small patch in 2.4.x to filter out these non-region directories. 
Otherwise, the directories have to be removed manually before downgrading.
{code:java}
Path regionDir = fs.listStatus(tableDir, p -> !p.getName().startsWith(".")
  && RegionInfo.isEncodedRegionName(Bytes.toBytes(p.getName())))[0].getPath();
return HRegionFileSystem.loadRegionInfoFileContent(fs, regionDir);{code}
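The predicate in the snippet above (skip dot-prefixed entries such as `.initialized` and `.tabledesc`, keep only encoded-region-name directories) can be exercised with a local, self-contained analogue. This is only an illustration of the filter, not the HDFS `listStatus` call, and the encoded-name check is simplified here to a 32-character hex-string test:

```java
import java.util.List;
import java.util.stream.Collectors;

// Local analogue of the listStatus filter: drop hidden entries and keep
// only names that look like encoded region names (simplified hex check).
public class RegionDirFilter {

  static boolean looksLikeEncodedRegionName(String name) {
    return name.length() == 32
      && name.chars().allMatch(c -> Character.digit(c, 16) >= 0);
  }

  static List<String> regionDirs(List<String> entries) {
    return entries.stream()
      .filter(n -> !n.startsWith("."))              // hide .initialized, .tabledesc, ...
      .filter(RegionDirFilter::looksLikeEncodedRegionName)
      .collect(Collectors.toList());
  }
}
```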

> Rolling back from 2.5.0-SNAPSHOT to 2.4.13 fails due to `File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo`
> ---
>
> Key: HBASE-27251
> URL: https://issues.apache.org/jira/browse/HBASE-27251
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.5.0
>Reporter: Nick Dimiduk
>Priority: Major
>
> I was doing some perf testing with builds of 2.5.0. I rolled back to 2.4.13 
> and the master won't start. Stack trace ends in,
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hbase/MasterData/data/master/store/.initialized/.regioninfo
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2089)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:762)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
> at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
> {noformat}
> When I examine the on-disk file system, I see,
> {noformat}
> nonroot@namenode-0:~$ hdfs dfs -ls /hbase/MasterData/data/master/store/
> Found 3 items
> drwxr-xr-x   - nonroot supergroup  0 2022-07-19 17:37 
> /hbase/MasterData/data/master/store/.initialized
> 

[jira] [Commented] (HBASE-26640) Reimplement master local region initialization to better work with SFT

2022-07-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572122#comment-17572122
 ] 

Huaxiang Sun commented on HBASE-26640:
--

The change is now in 2.5.0-SNAPSHOT; do we need to change the release version 
to 2.5.0? [~zhangduo] 

Also, when downgrading from 2.5.0 to 2.4.x, do you see any incompatible change 
besides HBASE-27251?

Thanks.

 

> Reimplement master local region initialization to better work with SFT
> --
>
> Key: HBASE-26640
> URL: https://issues.apache.org/jira/browse/HBASE-26640
> Project: HBase
>  Issue Type: Sub-task
>  Components: master, RegionProcedureStore
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-3
>
>
> It is not like a normal region, where we have a TableDescriptor, so it cannot 
> store an SFT implementation of its own. In the current implementation, if we 
> change the global SFT configuration, the SFT implementation of the master 
> local region changes with it, which can cause data loss.
> As a first step, we could hard-code it to use DefaultSFT. The region is small, 
> so this should not cause much performance impact. Later we can find a way to 
> manage its SFT implementation explicitly.
> == Update ==
> The initialization of the master local region depends on renaming, which does 
> not work well on OSS, so we should change that as well. The basic idea is to 
> touch a '.initialized' file to indicate that initialization is complete. We 
> need to consider how to migrate existing master local regions that do not 
> have this file.
> We could also store the TableDescriptor on the file system, so we can detect 
> an SFT change; if there is one, we should run the migration before actually 
> opening the master local region.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27181) Replica region support in HBCK2 setRegionState option

2022-07-27 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27181:
-
Description: 
Replica region ids are not recognized by hbck2's setRegionState, as they do 
not show up in meta. We have run into cases where the region state in meta had 
to be set for replica regions in order to fix an inconsistency; we ended up 
writing the state into the meta table manually and doing a master failover to 
sync the state from the meta table.

 

hbck2's setRegionState needs to support replica region ids and handle them 
gracefully.

Currently, setRegionState does not use MasterRpcServices#setRegionStateInMeta, 
and setRegionStateInMeta itself has an issue supporting replica regions. Once 
that is fixed and setRegionState uses setRegionStateInMeta to set the region 
state, replica ids will be supported.

  was:
Replica region id is  not recognized by hbck2's setRegionState as it does not 
show up in meta. We run into cases that it needs to set region state in meta 
for replica regions in order to fix inconsistency. We ended up writing the 
state manually into meta table and did a master failover to sync state from 
meta table. 

 

hbck2's setRegionState needs to support replica region id and handles it nicely.


> Replica region support in HBCK2 setRegionState option
> -
>
> Key: HBASE-27181
> URL: https://issues.apache.org/jira/browse/HBASE-27181
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck2
>Affects Versions: 2.4.13
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
>
> Replica region ids are not recognized by hbck2's setRegionState, as they do 
> not show up in meta. We have run into cases where the region state in meta 
> had to be set for replica regions in order to fix an inconsistency; we ended 
> up writing the state into the meta table manually and doing a master failover 
> to sync the state from the meta table. 
>  
> hbck2's setRegionState needs to support replica region ids and handle them 
> gracefully.
> Currently, setRegionState does not use 
> MasterRpcServices#setRegionStateInMeta, and setRegionStateInMeta itself has 
> an issue supporting replica regions. Once that is fixed and setRegionState 
> uses setRegionStateInMeta to set the region state, replica ids will be 
> supported.
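As background for why replica support needs special handling: hbase:meta tracks replicas of a region under the same row, using suffixed column qualifiers (the primary uses the bare qualifier, replicas a suffixed variant). The sketch below illustrates that naming convention only; the zero-padded hex `%04X` suffix format is an assumption of this sketch, not confirmed by the thread, so treat it as illustrative rather than as the upstream implementation:

```java
// Illustrative sketch of the meta-column naming convention for region
// replicas. ASSUMPTION: replicas append "_" plus a zero-padded hex
// replica id to the base qualifier; the primary uses the bare name.
public class ReplicaQualifier {

  static String qualifier(String base, int replicaId) {
    if (replicaId == 0) {
      return base; // primary region: unsuffixed column, e.g. "state"
    }
    return base + "_" + String.format("%04X", replicaId);
  }
}
```

Under this convention a tool that only ever writes the bare qualifier would silently ignore replica ids, which matches the failure mode described in this issue.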



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27250) MasterRpcService#setRegionStateInMeta does not support replica region encodedNames or region names

2022-07-27 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27250:


 Summary: MasterRpcService#setRegionStateInMeta does not support 
replica region encodedNames or region names
 Key: HBASE-27250
 URL: https://issues.apache.org/jira/browse/HBASE-27250
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.13
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


MasterRpcServices#setRegionStateInMeta does not support replica region names; 
it assumes the primary region only. This makes HBCK2's setRegionState fail for 
replica regions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27181) Replica region support in HBCK2 setRegionState option

2022-07-26 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27181:
-
Priority: Minor  (was: Major)

> Replica region support in HBCK2 setRegionState option
> -
>
> Key: HBASE-27181
> URL: https://issues.apache.org/jira/browse/HBASE-27181
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck2
>Affects Versions: 2.4.13
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
>
> Replica region ids are not recognized by hbck2's setRegionState, as they do 
> not show up in meta. We have run into cases where the region state in meta 
> had to be set for replica regions in order to fix an inconsistency; we ended 
> up writing the state into the meta table manually and doing a master failover 
> to sync the state from the meta table. 
>  
> hbck2's setRegionState needs to support replica region ids and handle them 
> gracefully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27181) Replica region support in HBCK2 setRegionState option

2022-07-05 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27181:


 Summary: Replica region support in HBCK2 setRegionState option
 Key: HBASE-27181
 URL: https://issues.apache.org/jira/browse/HBASE-27181
 Project: HBase
  Issue Type: Improvement
  Components: hbck2
Affects Versions: 2.4.13
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


Replica region ids are not recognized by hbck2's setRegionState, as they do 
not show up in meta. We have run into cases where the region state in meta had 
to be set for replica regions in order to fix an inconsistency; we ended up 
writing the state into the meta table manually and doing a master failover to 
sync the state from the meta table.

 

hbck2's setRegionState needs to support replica region ids and handle them 
gracefully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27025) Change Hbase book's description for "74.7.3. Load Balancing META table load"

2022-06-24 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-27025.
--
Fix Version/s: 3.0.0-alpha-4
   Resolution: Fixed

Merged into the master branch.

> Change Hbase book's description for "74.7.3. Load Balancing META table load"
> 
>
> Key: HBASE-27025
> URL: https://issues.apache.org/jira/browse/HBASE-27025
> Project: HBase
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.4.12
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
> Fix For: 3.0.0-alpha-4
>
>
> HBASE-26618 involves the primary meta region in meta scans, so the 
> description in the HBase book is inaccurate. Update it accordingly.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HBASE-26981) The CPU usage of the regionserver node where the meta table is located is too high

2022-06-23 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26981:
-
Component/s: regionserver

> The CPU usage of the regionserver node where the meta table is located is too 
> high
> --
>
> Key: HBASE-26981
> URL: https://issues.apache.org/jira/browse/HBASE-26981
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.3.4
>Reporter: zhengsicheng
>Priority: Major
> Attachments: image-2022-04-27-20-24-33-252.png, 
> image-2022-04-27-20-45-09-227.png, image-2022-04-28-09-59-40-567.png, jstack
>
>
> When the read and write pressure is high, the CPU usage of the meta table 
> node is too high
> !image-2022-04-28-09-59-40-567.png!
>  
> !image-2022-04-27-20-24-33-252.png!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HBASE-26981) The CPU usage of the regionserver node where the meta table is located is too high

2022-06-23 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26981:
-
Component/s: (was: hbase-connectors)

> The CPU usage of the regionserver node where the meta table is located is too 
> high
> --
>
> Key: HBASE-26981
> URL: https://issues.apache.org/jira/browse/HBASE-26981
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.4
>Reporter: zhengsicheng
>Priority: Major
> Attachments: image-2022-04-27-20-24-33-252.png, 
> image-2022-04-27-20-45-09-227.png, image-2022-04-28-09-59-40-567.png, jstack
>
>
> When the read and write pressure is high, the CPU usage of the meta table 
> node is too high
> !image-2022-04-28-09-59-40-567.png!
>  
> !image-2022-04-27-20-24-33-252.png!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26981) The CPU usage of the regionserver node where the meta table is located is too high

2022-06-23 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558206#comment-17558206
 ] 

Huaxiang Sun commented on HBASE-26981:
--

[~zhengsicheng] We enabled LoadBalance mode on the HBase clients of some 
clusters and observed that meta scan requests are well balanced across the 
meta replica regions. More data will be available in the coming couple of 
months.

> The CPU usage of the regionserver node where the meta table is located is too 
> high
> --
>
> Key: HBASE-26981
> URL: https://issues.apache.org/jira/browse/HBASE-26981
> Project: HBase
>  Issue Type: Bug
>  Components: hbase-connectors
>Affects Versions: 2.3.4
>Reporter: zhengsicheng
>Priority: Major
> Attachments: image-2022-04-27-20-24-33-252.png, 
> image-2022-04-27-20-45-09-227.png, image-2022-04-28-09-59-40-567.png, jstack
>
>
> When the read and write pressure is high, the CPU usage of the meta table 
> node is too high
> !image-2022-04-28-09-59-40-567.png!
>  
> !image-2022-04-27-20-24-33-252.png!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26981) The CPU usage of the regionserver node where the meta table is located is too high

2022-06-21 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556942#comment-17556942
 ] 

Huaxiang Sun commented on HBASE-26981:
--

Meta scan is a CPU-heavy operation. We identified this issue in 2020 and 
developed the meta replica LoadBalance mode to spread meta scans across 
replica regions.

[https://hbase.apache.org/book.html#async.wal.replication.meta?] 

> The CPU usage of the regionserver node where the meta table is located is too 
> high
> --
>
> Key: HBASE-26981
> URL: https://issues.apache.org/jira/browse/HBASE-26981
> Project: HBase
>  Issue Type: Bug
>  Components: hbase-connectors
>Affects Versions: 2.3.4
>Reporter: zhengsicheng
>Priority: Major
> Attachments: image-2022-04-27-20-24-33-252.png, 
> image-2022-04-27-20-45-09-227.png, image-2022-04-28-09-59-40-567.png, jstack
>
>
> When the read and write pressure is high, the CPU usage of the meta table 
> node is too high
> !image-2022-04-28-09-59-40-567.png!
>  
> !image-2022-04-27-20-24-33-252.png!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26649) Support meta replica LoadBalance mode for RegionLocator#getAllRegionLocations()

2022-06-03 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-26649.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   2.4.13
 Release Note: When 'hbase.locator.meta.replicas.mode' is set to 
"LoadBalance" on the HBase client, RegionLocator#getAllRegionLocations() now 
load-balances across all meta replica regions. Please note that results from 
non-primary meta replica regions may contain stale data. 
   Resolution: Fixed
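For reference, the client-side setting named in the release note above is applied through the client's configuration. A minimal hbase-site.xml sketch follows; the property name comes from the release note, while the surrounding XML is just the standard Hadoop configuration wrapper:

```xml
<configuration>
  <!-- Enable load balancing of locator lookups across meta replica regions.
       Results from non-primary replicas may be stale (see release note). -->
  <property>
    <name>hbase.locator.meta.replicas.mode</name>
    <value>LoadBalance</value>
  </property>
</configuration>
```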

> Support meta replica LoadBalance mode for 
> RegionLocator#getAllRegionLocations()
> ---
>
> Key: HBASE-26649
> URL: https://issues.apache.org/jira/browse/HBASE-26649
> Project: HBase
>  Issue Type: Improvement
>  Components: meta replicas
>Affects Versions: 2.4.9
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
>
> When an HBase application restarts, its meta cache is empty. Normally it 
> fills the meta cache one region at a time by scanning the meta region, which 
> puts huge pressure on the region server hosting meta during an application 
> restart. 
> The application can prefetch all region locations by calling 
> RegionLocator#getAllRegionLocations(). Since meta replica LoadBalance mode 
> is supported in 2.4, it would be nice to load-balance 
> RegionLocator#getAllRegionLocations() across all meta replica regions so the 
> batch scan can spread over them.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26649) Support meta replica LoadBalance mode for RegionLocator#getAllRegionLocations()

2022-06-03 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17547358#comment-17547358
 ] 

Huaxiang Sun commented on HBASE-26649:
--

A subtask of HBASE-18070

> Support meta replica LoadBalance mode for 
> RegionLocator#getAllRegionLocations()
> ---
>
> Key: HBASE-26649
> URL: https://issues.apache.org/jira/browse/HBASE-26649
> Project: HBase
>  Issue Type: Improvement
>  Components: meta replicas
>Affects Versions: 2.4.9
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> When an HBase application restarts, its meta cache is empty. Normally it 
> fills the meta cache one region at a time by scanning the meta region, which 
> puts huge pressure on the region server hosting meta during an application 
> restart. 
> The application can prefetch all region locations by calling 
> RegionLocator#getAllRegionLocations(). Since meta replica LoadBalance mode 
> is supported in 2.4, it would be nice to load-balance 
> RegionLocator#getAllRegionLocations() across all meta replica regions so the 
> batch scan can spread over them.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HBASE-27087) TestQuotaThrottle times out in branch-2.4/2.5

2022-06-03 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-27087:
-
Summary: TestQuotaThrottle times out in branch-2.4/2.5  (was: 
TestQuotaThrottle times out in branch-2.5.)

> TestQuotaThrottle times out in branch-2.4/2.5
> -
>
> Key: HBASE-27087
> URL: https://issues.apache.org/jira/browse/HBASE-27087
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.5.0
>Reporter: Huaxiang Sun
>Priority: Major
>
> With branch-2.5, TestQuotaThrottle times out. Need to investigate.
>  
> h3. Error Message
> Failed after attempts=7, exceptions: 2022-06-03T11:26:33.418Z, 
> RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
> maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
> RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
> maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
> RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
> maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
> RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
> maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
> RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
> maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
> RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
> maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
> RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
> maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /hbase/master



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HBASE-27087) TestQuotaThrottle times out in branch-2.5.

2022-06-03 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27087:


 Summary: TestQuotaThrottle times out in branch-2.5.
 Key: HBASE-27087
 URL: https://issues.apache.org/jira/browse/HBASE-27087
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 2.5.0
Reporter: Huaxiang Sun


With branch-2.5, TestQuotaThrottle times out. Need to investigate.

 
h3. Error Message

Failed after attempts=7, exceptions: 2022-06-03T11:26:33.418Z, 
RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /hbase/master 2022-06-03T11:26:33.418Z, 
RpcRetryingCaller\{globalStartTime=2022-06-03T11:26:33.418Z, pause=250, 
maxAttempts=7}, org.apache.hadoop.hbase.MasterNotRunningException: 
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /hbase/master



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (HBASE-26962) Add mob info in web UI

2022-06-02 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545652#comment-17545652
 ] 

Huaxiang Sun edited comment on HBASE-26962 at 6/2/22 9:10 PM:
--

The commit caused a branch-2 build failure. Can you fix the build error and 
resubmit a patch? [~liangxs]  Thanks.


was (Author: huaxiangsun):
The commit caused branch-2 build failure. Can you fix the build error and 
resubmit a patch? Thanks.

> Add mob info in web UI
> --
>
> Key: HBASE-26962
> URL: https://issues.apache.org/jira/browse/HBASE-26962
> Project: HBase
>  Issue Type: Improvement
>  Components: UI
>Reporter: Xuesen Liang
>Assignee: Xuesen Liang
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-3
>
>
> Add mob store info in web UI.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HBASE-26962) Add mob info in web UI

2022-06-02 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun reopened HBASE-26962:
--

The commit caused a branch-2 build failure. Can you fix the build error and 
resubmit a patch? Thanks.

> Add mob info in web UI
> --
>
> Key: HBASE-26962
> URL: https://issues.apache.org/jira/browse/HBASE-26962
> Project: HBase
>  Issue Type: Improvement
>  Components: UI
>Reporter: Xuesen Liang
>Assignee: Xuesen Liang
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-3
>
>
> Add mob store info in web UI.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26962) Add mob info in web UI

2022-06-02 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545651#comment-17545651
 ] 

Huaxiang Sun commented on HBASE-26962:
--

It failed the checks with the following errors on branch-2; I am reverting the 
commit.
{code:java}
[ERROR] 
/Users/hsun/work/hbase-hs/hbase-1/hbase-server/target/generated-sources/java/org/apache/hadoop/hbase/generated/regionserver/region_jsp.java:[111,73]
 cannot find symbol
[ERROR]   symbol:   variable MOB_FILE_REFS
[ERROR]   location: class org.apache.hadoop.hbase.regionserver.HStoreFile
[ERROR] 
/Users/hsun/work/hbase-hs/hbase-1/hbase-server/target/generated-sources/java/org/apache/hadoop/hbase/generated/regionserver/region_jsp.java:[116,53]
 cannot find symbol
[ERROR]   symbol:   method deserializeMobFileRefs(byte[])
[ERROR]   location: class org.apache.hadoop.hbase.mob.MobUtils
[ERROR] -> [Help 1] {code}

> Add mob info in web UI
> --
>
> Key: HBASE-26962
> URL: https://issues.apache.org/jira/browse/HBASE-26962
> Project: HBase
>  Issue Type: Improvement
>  Components: UI
>Reporter: Xuesen Liang
>Assignee: Xuesen Liang
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-3
>
>
> Add mob store info in web UI.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26649) Support meta replica LoadBalance mode for RegionLocator#getAllRegionLocations()

2022-06-02 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545649#comment-17545649
 ] 

Huaxiang Sun commented on HBASE-26649:
--

{code:java}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) 
on project hbase-server: Compilation failure: Compilation failure: [ERROR] 
/home/jenkins/jenkins-home/workspace/HBase_Nightly_branch-2/component/hbase-server/target/generated-sources/java/org/apache/hadoop/hbase/generated/regionserver/region_jsp.java:[111,73]
 cannot find symbol [ERROR] symbol: variable MOB_FILE_REFS [ERROR] location: 
class org.apache.hadoop.hbase.regionserver.HStoreFile [ERROR] 
/home/jenkins/jenkins-home/workspace/HBase_Nightly_branch-2/component/hbase-server/target/generated-sources/java/org/apache/hadoop/hbase/generated/regionserver/region_jsp.java:[116,53]
 cannot find symbol [ERROR] symbol: method deserializeMobFileRefs(byte[]) 
[ERROR] location: class org.apache.hadoop.hbase.mob.MobUtils{code}

> Support meta replica LoadBalance mode for 
> RegionLocator#getAllRegionLocations()
> ---
>
> Key: HBASE-26649
> URL: https://issues.apache.org/jira/browse/HBASE-26649
> Project: HBase
>  Issue Type: Improvement
>  Components: meta replicas
>Affects Versions: 2.4.9
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> When an HBase application restarts, its meta cache is empty. Normally, it will 
> fill the meta cache one region at a time by scanning the meta region. This 
> puts huge pressure on the region server hosting meta during application 
> restart. 
> The client can prefetch all region locations by calling 
> RegionLocator#getAllRegionLocations(). Meta replica LoadBalance mode is 
> supported in 2.4; it would be nice to load balance 
> RegionLocator#getAllRegionLocations() across all meta replica regions so the 
> batch scan is spread among them.
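The intended behavior can be sketched as a plain round-robin pick over all meta replica ids, with the primary as replica id 0. This is a simplified model for illustration only; the real selection logic lives in the client's CatalogReplicaLoadBalanceSimpleSelector, and `MetaReplicaPicker` is a hypothetical name:

```java
// Simplified model: spread successive meta scans (e.g. the batch scans behind
// RegionLocator#getAllRegionLocations()) round-robin over all meta replica
// regions, including the primary (replica id 0).
class MetaReplicaPicker {
    // scanCount: how many scans have been issued so far;
    // numReplicas: total number of meta replicas (primary + read replicas).
    static int pick(int scanCount, int numReplicas) {
        return Math.floorMod(scanCount, numReplicas);
    }
}
```

With three meta replicas, consecutive scans would target replica ids 0, 1, 2, 0, ..., so no single meta region absorbs the whole prefetch burst during an application restart.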



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26649) Support meta replica LoadBalance mode for RegionLocator#getAllRegionLocations()

2022-06-02 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545648#comment-17545648
 ] 

Huaxiang Sun commented on HBASE-26649:
--

Seems like something is wrong; checking with the latest code.

> Support meta replica LoadBalance mode for 
> RegionLocator#getAllRegionLocations()
> ---
>
> Key: HBASE-26649
> URL: https://issues.apache.org/jira/browse/HBASE-26649
> Project: HBase
>  Issue Type: Improvement
>  Components: meta replicas
>Affects Versions: 2.4.9
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> When an HBase application restarts, its meta cache is empty. Normally, it will 
> fill the meta cache one region at a time by scanning the meta region. This 
> puts huge pressure on the region server hosting meta during application 
> restart. 
> The client can prefetch all region locations by calling 
> RegionLocator#getAllRegionLocations(). Meta replica LoadBalance mode is 
> supported in 2.4; it would be nice to load balance 
> RegionLocator#getAllRegionLocations() across all meta replica regions so the 
> batch scan is spread among them.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26233) The region replication framework should not be built upon the general replication framework

2022-05-12 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536279#comment-17536279
 ] 

Huaxiang Sun commented on HBASE-26233:
--

Sorry, [~zhangduo], catching up on the topic so late! I am starting to review the 
feature now.

 
{quote}And when discussing around HBASE-18070, I recall that we talked about 
only replicate the 'info' family. So have we already done this, i.e, only 
replicate 'info' family but not other families? Reading the section in ref guide

[http://hbase.apache.org/book.html#async.wal.replication.meta]

I haven't seen any related topics. So my question is do we still need to 
implement this feature?
{quote}
Yeah, since it only needs the info family for region locations, it really does 
not need to replicate other families. I can create a Jira and implement it during 
the review process if it is not already in place, thanks.

 

 

> The region replication framework should not be built upon the general 
> replication framework
> ---
>
> Key: HBASE-26233
> URL: https://issues.apache.org/jira/browse/HBASE-26233
> Project: HBase
>  Issue Type: Umbrella
>  Components: read replicas
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> At least at the source path, where we track the edits, we should not make 
> region replication rely on the general replication framework.
> The difficulty here for switching to a table based storage is that the WAL 
> system and replication system highly depend on each other. There will be a 
> cyclic dependency if we want to store replication peer and queue data in an 
> hbase table.
> And after HBASE-18070, even the meta wal provider will be integrated together 
> with the replication system, which makes things more difficult.
> But in general, for region replication, it is not a big deal to lose some 
> edits, as a flush can fix everything, which means we do not need such a heavy 
> tracking system as in the general replication framework.
> We should find a more lightweight way to do region replication.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HBASE-27025) Change Hbase book's description for "74.7.3. Load Balancing META table load"

2022-05-11 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-27025:


 Summary: Change Hbase book's description for "74.7.3. Load 
Balancing META table load"
 Key: HBASE-27025
 URL: https://issues.apache.org/jira/browse/HBASE-27025
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.4.12
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


HBASE-26618 involves the primary meta region in meta scans. The description in 
the HBase book is now inaccurate; update it accordingly.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26960) Another case for unnecessary replication suspending in RegionReplicationSink

2022-05-10 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534488#comment-17534488
 ] 

Huaxiang Sun commented on HBASE-26960:
--

Does this issue exist in branch-2? If so, can we backport to branch-2/2.5/2.4? 
Thanks.

> Another case for unnecessary replication suspending in RegionReplicationSink
> 
>
> Key: HBASE-26960
> URL: https://issues.apache.org/jira/browse/HBASE-26960
> Project: HBase
>  Issue Type: Bug
>  Components: read replicas
>Affects Versions: 3.0.0-alpha-2
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> Besides HBASE-26768, there is another case in which replication in 
> {{RegionReplicationSink}} would be suspended:
> For {{RegionReplicationSink}}, when there is a replication error, 
> {{RegionReplicationSink}} invokes {{MemStoreFlusher#requestFlush}} to request 
> a flush, and after receiving the {{FlushAction#START_FLUSH}} or 
> {{FlushAction#CANNOT_FLUSH}} flush marker, it would resume the replication. 
> But when {{MemStoreFlusher}} flushes, it invokes the following method, 
> {{HRegion.flushcache}}, with {{writeFlushRequestWalMarker}} set to false:
> {code:java}
>   public FlushResultImpl flushcache(List<byte[]> families,
>   boolean writeFlushRequestWalMarker, FlushLifeCycleTracker tracker) 
> throws IOException {
>  }
> {code}
> When {{writeFlushRequestWalMarker}} is set to false, {{HRegion.flushcache}} 
> does not write the {{FlushAction#CANNOT_FLUSH}} flush marker to the {{WAL}} when 
> the memstore is empty, just as the following 
> {{HRegion.writeFlushRequestMarkerToWAL}} illustrates:
> {code:java}
> private boolean writeFlushRequestMarkerToWAL(WAL wal, boolean 
> writeFlushWalMarker) {
> if (writeFlushWalMarker && wal != null && !writestate.readOnly) {
>   FlushDescriptor desc = 
> ProtobufUtil.toFlushDescriptor(FlushAction.CANNOT_FLUSH,
> getRegionInfo(), -1, new TreeMap<>(Bytes.BYTES_COMPARATOR));
>   try {
> WALUtil.writeFlushMarker(wal, this.getReplicationScope(), 
> getRegionInfo(), desc, true, mvcc,
>   regionReplicationSink.orElse(null));
> return true;
>   } catch (IOException e) {
> LOG.warn(getRegionInfo().getEncodedName() + " : " +
>   "Received exception while trying to write the flush request to 
> wal", e);
>   }
> }
> return false;
>   }
> {code}
> so when there is a replication error while the memstore is empty (e.g. when 
> replicating the {{FlushAction#START_FLUSH}} or {{FlushAction#COMMIT_FLUSH}} 
> marker), the replication may stay suspended until the next memstore flush, even 
> though there are later user writes and it could replicate normally.
> I simulated this problem in the PR. As for the {{writeFlushRequestWalMarker}} 
> parameter, it was introduced by HBASE-11580 and only determines whether or 
> not to write the {{FlushAction#CANNOT_FLUSH}} flush marker to the WAL when the 
> memstore is empty, so I think for simplicity we could always set it to true 
> for {{MemStoreFlusher}}.
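The gating condition quoted above can be reduced to a small boolean function, which shows why passing writeFlushWalMarker=false leaves the sink suspended when the memstore is empty. This is a simplified sketch of the condition under the comment's analysis, not the actual HRegion code; `FlushMarkerModel` is a hypothetical name:

```java
// Simplified model of the condition inside HRegion.writeFlushRequestMarkerToWAL:
// the FlushAction#CANNOT_FLUSH marker only reaches the WAL (and thus resumes a
// suspended RegionReplicationSink) when all three conditions hold.
class FlushMarkerModel {
    static boolean markerWritten(boolean writeFlushWalMarker, boolean walPresent,
                                 boolean readOnly) {
        return writeFlushWalMarker && walPresent && !readOnly;
    }
}
```

Since MemStoreFlusher passed writeFlushWalMarker=false, the marker is never written for an empty memstore, which is exactly the suspension described; always passing true removes that case.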



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26984) Chaos Monkey thread dies in ITBLL Chaos GracefulRollingRestartRsAction

2022-05-05 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-26984.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   Resolution: Fixed

> Chaos Monkey thread dies in ITBLL Chaos GracefulRollingRestartRsAction 
> ---
>
> Key: HBASE-26984
> URL: https://issues.apache.org/jira/browse/HBASE-26984
> Project: HBase
>  Issue Type: Bug
>  Components: integration tests
>Affects Versions: 2.4.11
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
While running the ITBLL chaos monkey in a k8s cluster, found that the chaos 
monkey thread died in GracefulRollingRestartRsAction. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HBASE-26984) Chaos Monkey thread dies in ITBLL Chaos GracefulRollingRestartRsAction

2022-04-27 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-26984:


 Summary: Chaos Monkey thread dies in ITBLL Chaos 
GracefulRollingRestartRsAction 
 Key: HBASE-26984
 URL: https://issues.apache.org/jira/browse/HBASE-26984
 Project: HBase
  Issue Type: Bug
  Components: integration tests
Affects Versions: 2.4.11
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


While running the ITBLL chaos monkey in a k8s cluster, found that the chaos 
monkey thread died in GracefulRollingRestartRsAction. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26618) Involving primary meta region in meta scan with CatalogReplicaLoadBalanceSimpleSelector

2022-04-11 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-26618.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   2.4.12
 Release Note: When META replica LoadBalance mode is enabled on the 
client side, clients will try to read from one META region first. If the META 
location came from a non-primary META region, then in case of errors the client 
will fall back to the primary META region.
   Resolution: Fixed

> Involving primary meta region in meta scan with 
> CatalogReplicaLoadBalanceSimpleSelector
> ---
>
> Key: HBASE-26618
> URL: https://issues.apache.org/jira/browse/HBASE-26618
> Project: HBase
>  Issue Type: Improvement
>  Components: meta replicas
>Affects Versions: 2.4.9
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.12
>
>
> In the current release with Meta replica LoadBalance mode, the primary meta 
> region does not serve meta scans (only the meta replica regions serve the 
> reads). When the result from a meta replica region is stale, the client goes to 
> the primary meta region for the up-to-date location. 
> From our experience, the primary meta region serves very little read traffic, 
> so it would be better to load balance read traffic across the primary meta 
> region as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26618) Involving primary meta region in meta scan with CatalogReplicaLoadBalanceSimpleSelector

2022-04-04 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26618:
-
Summary: Involving primary meta region in meta scan with 
CatalogReplicaLoadBalanceSimpleSelector  (was: Involving primary meta region in 
meta scan with Meta Replica Mode)

> Involving primary meta region in meta scan with 
> CatalogReplicaLoadBalanceSimpleSelector
> ---
>
> Key: HBASE-26618
> URL: https://issues.apache.org/jira/browse/HBASE-26618
> Project: HBase
>  Issue Type: Improvement
>  Components: meta replicas
>Affects Versions: 2.4.9
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Minor
>
> In the current release with Meta replica LoadBalance mode, the primary meta 
> region does not serve meta scans (only the meta replica regions serve the 
> reads). When the result from a meta replica region is stale, the client goes to 
> the primary meta region for the up-to-date location. 
> From our experience, the primary meta region serves very little read traffic, 
> so it would be better to load balance read traffic across the primary meta 
> region as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26908) Remove warnings from meta replicas feature references in the HBase book

2022-03-30 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514828#comment-17514828
 ] 

Huaxiang Sun commented on HBASE-26908:
--

We use the client-side MetaReplica LoadBalance mode in our production cluster. 
Since regions do not move frequently in our cluster, we do not turn on 
"async wal replication for meta". Just want to be clear, unless there are other 
users who have this feature turned on in their clusters.

 

The plan is to run ITBLL with this config turned on against the 2.5.0 candidate 
and 2.4.11; if that passes, we can remove the warning from the HBase book.

> Remove warnings from meta replicas feature references in the HBase book
> ---
>
> Key: HBASE-26908
> URL: https://issues.apache.org/jira/browse/HBASE-26908
> Project: HBase
>  Issue Type: Task
>  Components: documentation
>Reporter: Andor Molnar
>Assignee: Andor Molnar
>Priority: Major
>
> Meta replicas is a new feature in HBase 2.4 and mentioned in "Use with 
> caution" in the docs. Given that the feature and the related "async wal 
> replication for meta" is actively used in production already, I'd like to 
> remove these warnings from the docs.
> With this change, users will have more confidence in the feature.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26864) SplitTableRegionProcedure calls openParentRegions() at a wrong state during rollback.

2022-03-26 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512831#comment-17512831
 ] 

Huaxiang Sun commented on HBASE-26864:
--

Thanks [~apurtell].

> SplitTableRegionProcedure calls openParentRegions() at a wrong state during 
> rollback.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.12
>
>
> Changed the issue title and description for the scope of the work. 
> There is a bug in handling rollback in SplitTableRegionProcedure.
> [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L304]
> [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L385]
> {code:java}
> In the state machine:
>         case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
>           addChildProcedure(createUnassignProcedures(env));
>   // Comments from HX:
>           // createUnassignProcedures() can throw out IOException. If this 
> happens,
>           // it won't reach state SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS and 
> no parent region
>           // is closed, as all created UnassignProcedures are rolled back. If 
> it rolls back with
>           // state SPLIT_TABLE_REGION_CLOSE_PARENT_REGION, no need to call 
> openParentRegion(),
>           // otherwise, it will result in OpenRegionProcedure for an already 
> open region.
>           
> setNextState(SplitTableRegionState.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS);
>           break;
> In the rollback,
>         case SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS:
>           // Doing nothing, in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
>           // we will bring parent region online
>           break;
>         case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
>   // Comments from HX: 
>   // OpenParentRegion() should not be called here as explained above.
>           openParentRegion(env);
>           break; {code}
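The fix implied by the comments above can be modeled as: only reopen the parent if the procedure actually got past closing it. This is a simplified sketch of the rollback rule, not the actual procedure code; `SplitRollbackModel` and `needsReopen` are hypothetical names:

```java
// Simplified model of the rollback rule: the parent region is only really
// closed once the procedure has advanced to SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS,
// so openParentRegion() is only needed when rolling back from that state on.
class SplitRollbackModel {
    enum State {
        SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
        SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS
    }

    static boolean needsReopen(State rollbackState) {
        // Failing in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION itself means the
        // UnassignProcedures were rolled back and the parent was never closed,
        // so issuing an OpenRegionProcedure there would target an open region.
        return rollbackState == State.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS;
    }
}
```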



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26864) SplitTableRegionProcedure calls openParentRegions() at a wrong state during rollback.

2022-03-22 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26864:
-
Summary: SplitTableRegionProcedure calls openParentRegions() at a wrong 
state during rollback.  (was: SplitTableRegionProcedure, it calls 
openParentRegions() at a wrong state during rollback.)

> SplitTableRegionProcedure calls openParentRegions() at a wrong state during 
> rollback.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> Changed the issue title and description for the scope of the work. 
> There is a bug in handling rollback in SplitTableRegionProcedure.
> [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L304]
> [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L385]
> {code:java}
> In the state machine:
>         case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
>           addChildProcedure(createUnassignProcedures(env));
>   // Comments from HX:
>           // createUnassignProcedures() can throw out IOException. If this 
> happens,
>           // it won't reach state SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS and 
> no parent region
>           // is closed, as all created UnassignProcedures are rolled back. If 
> it rolls back with
>           // state SPLIT_TABLE_REGION_CLOSE_PARENT_REGION, no need to call 
> openParentRegion(),
>           // otherwise, it will result in OpenRegionProcedure for an already 
> open region.
>           
> setNextState(SplitTableRegionState.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS);
>           break;
> In the rollback,
>         case SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS:
>           // Doing nothing, in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
>           // we will bring parent region online
>           break;
>         case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
>   // Comments from HX: 
>   // OpenParentRegion() should not be called here as explained above.
>           openParentRegion(env);
>           break; {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26864) SplitTableRegionProcedure, it calls openParentRegions() at a wrong state during rollback.

2022-03-22 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26864:
-
Description: 
Changed the issue title and description for the scope of the work. 

There is a bug in handling rollback in SplitTableRegionProcedure.

[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L304]

[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L385]
{code:java}
In the state machine:


        case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
          addChildProcedure(createUnassignProcedures(env));
  // Comments from HX:
          // createUnassignProcedures() can throw out IOException. If this 
happens,
          // it won't reach state SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS and no 
parent region
          // is closed, as all created UnassignProcedures are rolled back. If it 
rolls back with
          // state SPLIT_TABLE_REGION_CLOSE_PARENT_REGION, no need to call 
openParentRegion(),
          // otherwise, it will result in OpenRegionProcedure for an already 
open region.
          
setNextState(SplitTableRegionState.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS);
          break;


In the rollback,


        case SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS:
          // Doing nothing, in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
          // we will bring parent region online
          break;
        case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
  // Comments from HX: 
  // OpenParentRegion() should not be called here as explained above.
          openParentRegion(env);
          break; {code}

  was:
Changed the issue title and description for the scope of the work. 

The reason 


> SplitTableRegionProcedure, it calls openParentRegions() at a wrong state 
> during rollback.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> Changed the issue title and description for the scope of the work. 
> There is a bug in handling rollback in SplitTableRegionProcedure.
> [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L304]
> [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L385]
> {code:java}
> In the state machine:
>         case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
>           addChildProcedure(createUnassignProcedures(env));
>   // Comments from HX:
>           // createUnassignProcedures() can throw out IOException. If this 
> happens,
>           // it won't reach state SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS and 
> no parent region
>           // is closed, as all created UnassignProcedures are rolled back. If 
> it rolls back with
>           // state SPLIT_TABLE_REGION_CLOSE_PARENT_REGION, no need to call 
> openParentRegion(),
>           // otherwise, it will result in OpenRegionProcedure for an already 
> open region.
>           
> setNextState(SplitTableRegionState.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS);
>           break;
> In the rollback,
>         case SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS:
>           // Doing nothing, in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
>           // we will bring parent region online
>           break;
>         case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
>   // Comments from HX: 
>   // OpenParentRegion() should not be called here as explained above.
>           openParentRegion(env);
>           break; {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26864) SplitTableRegionProcedure, it calls openParentRegions() at a wrong state during rollback.

2022-03-22 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26864:
-
Description: 
Changed the issue title and description for the scope of the work. 

The reason 

  was:
For some upgrading cases, we found that master issues RegionOpen for an already 
open region and Region Server simply logs 
{code:java}
2022-03-17 22:16:55,595 WARN 
org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received OPEN 
for 
foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4.
 which is already online {code}
and it does not ack or nack the master. This OpenRegionProcedure is stuck forever.

In this specific case, it needs to ack master that region is open. 

 

As for why it sent an OpenRegion request for an already open region, that will 
be followed up in another issue.


> SplitTableRegionProcedure, it calls openParentRegions() at a wrong state 
> during rollback.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> Changed the issue title and description for the scope of the work. 
> The reason 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26864) SplitTableRegionProcedure, it calls openParentRegions() at a wrong state during rollback.

2022-03-22 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26864:
-
Summary: SplitTableRegionProcedure, it calls openParentRegions() at a wrong 
state during rollback.  (was: Region Server does not send Ack back to master 
after receiving an OpenRegionReq for already opened regions, causing 
OpenRegionProcedure stay forever.)

> SplitTableRegionProcedure, it calls openParentRegions() at a wrong state 
> during rollback.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> For some upgrading cases, we found that master issues RegionOpen for an 
> already open region and Region Server simply logs 
> {code:java}
> 2022-03-17 22:16:55,595 WARN 
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received 
> OPEN for 
> foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4.
>  which is already online {code}
> and it does not ack or nack the master. This OpenRegionProcedure is stuck forever.
> In this specific case, it needs to ack master that region is open. 
>  
> As for why it sent an OpenRegion request for an already open region, that 
> will be followed up in another issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.

2022-03-21 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510172#comment-17510172
 ] 

Huaxiang Sun commented on HBASE-26864:
--

Thanks for explaining. I assumed that the report is associated with a procId, and 
the master would discard the report when there is no outstanding procedure.

For this specific case, there is a bug in handling rollback in 
SplitTableRegionProcedure; I am preparing a patch.

[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L304]

[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L385]
{code:java}
In the state machine:


        case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
          addChildProcedure(createUnassignProcedures(env));
  // Comments from HX:
          // createUnassignProcedures() can throw out IOException. If this 
happens,
          // it won't reach state SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS and no 
parent region
          // is closed, as all created UnassignProcedures are rolled back. If it 
rolls back with
          // state SPLIT_TABLE_REGION_CLOSE_PARENT_REGION, no need to call 
openParentRegion(),
          // otherwise, it will result in OpenRegionProcedure for an already 
open region.
          
setNextState(SplitTableRegionState.SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS);
          break;


In the rollback,


        case SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS:
          // Doing nothing, in SPLIT_TABLE_REGION_CLOSE_PARENT_REGION,
          // we will bring parent region online
          break;
        case SPLIT_TABLE_REGION_CLOSE_PARENT_REGION:
  // Comments from HX: 
  // OpenParentRegion() should not be called here as explained above.
          openParentRegion(env);
          break; {code}

> Region Server does not send Ack back to master after receiving an 
> OpenRegionReq for already opened regions, causing OpenRegionProcedure stay 
> forever.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> For some upgrading cases, we found that master issues RegionOpen for an 
> already open region and Region Server simply logs 
> {code:java}
> 2022-03-17 22:16:55,595 WARN 
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received 
> OPEN for 
> foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4.
>  which is already online {code}
> and it does not ack or nack the master. This OpenRegionProcedure is stuck forever.
> In this specific case, it needs to ack master that region is open. 
>  
> For the cause of why it sent an OpenRegion request for an already open 
> region, it will be followed by another issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.

2022-03-21 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510022#comment-17510022
 ] 

Huaxiang Sun commented on HBASE-26864:
--

[~zhangduo], can you elaborate more on the double-assign issue? I will provide 
more details about the root cause later today. As far as I can read from the 
log, it is not caused by the cases you described. The sequence I found is that 
the region is opened at RS A, and A acks back to master that the region is 
opened. During postOpenDeployTasks, RS A finds that it needs to split the 
region, so it sends a split request to master. Master starts the 
RegionSplitProcedure and later finds that a replica parent is still being 
opened. It rolls back the RegionSplitProcedure and, in the process, sends an 
OpenRegion request to RS A.

Even if it does not ack in this case, it still needs to clean up some state: 
the proc id is in submittedRegionProcedures and needs to be cleaned up.
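The behavior described above, and the cleanup the fix needs, can be sketched as a small self-contained simulation (hypothetical class, field, and method names; not the actual AssignRegionHandler code): when an OPEN arrives for an already online region, the handler should still report OPENED and remove the tracked procedure id instead of only logging a warning.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the HBASE-26864 behavior (hypothetical names).
public class OpenRegionSim {
  enum Report { NONE, OPENED }

  final Set<String> onlineRegions = new HashSet<>();
  final Map<String, Long> submittedRegionProcedures = new HashMap<>();
  Report lastReport = Report.NONE;

  // Handle an OpenRegion request carrying a procedure id from the master.
  void handleOpen(String region, long procId, boolean applyFix) {
    submittedRegionProcedures.put(region, procId);
    if (onlineRegions.contains(region)) {
      if (applyFix) {
        lastReport = Report.OPENED;               // ack the master anyway
        submittedRegionProcedures.remove(region); // clean up the tracked proc id
      }
      return; // buggy path: only a WARN log, so the master's procedure hangs
    }
    onlineRegions.add(region);
    lastReport = Report.OPENED;
    submittedRegionProcedures.remove(region);
  }

  // Open a region twice; report whether the master got an ack for the
  // duplicate OPEN and whether the proc-id bookkeeping was cleaned up.
  static boolean masterGetsAck(boolean applyFix) {
    OpenRegionSim rs = new OpenRegionSim();
    rs.handleOpen("r1", 1L, applyFix);  // first open succeeds normally
    rs.lastReport = Report.NONE;
    rs.handleOpen("r1", 2L, applyFix);  // duplicate OPEN for an online region
    return rs.lastReport == Report.OPENED && rs.submittedRegionProcedures.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println("buggy handler acks duplicate open: " + masterGetsAck(false));
    System.out.println("fixed handler acks duplicate open: " + masterGetsAck(true));
  }
}
```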

> Region Server does not send Ack back to master after receiving an 
> OpenRegionReq for already opened regions, causing OpenRegionProcedure stay 
> forever.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> For some upgrading cases, we found that the master issues RegionOpen for an 
> already open region and the Region Server simply logs 
> {code:java}
> 2022-03-17 22:16:55,595 WARN 
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received 
> OPEN for 
> foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4.
>  which is already online {code}
> and it does not ack or nack the master, so this OpenRegionProcedure is stuck forever.
> In this specific case, it needs to ack the master that the region is open. 
>  
> The root cause of why an OpenRegion request was sent for an already open 
> region will be covered in a follow-up issue.





[jira] [Updated] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stay forever.

2022-03-18 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26864:
-
Summary: Region Server does not send Ack back to master after receiving an 
OpenRegionReq for already opened regions, causing OpenRegionProcedure stay 
forever.  (was: Region Server does not send Ack back to master after receiving 
an OpenRegionReq for already opened regions, causing OpenRegionProcedure stuck 
forever.)

> Region Server does not send Ack back to master after receiving an 
> OpenRegionReq for already opened regions, causing OpenRegionProcedure stay 
> forever.
> -
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> For some upgrading cases, we found that the master issues RegionOpen for an 
> already open region and the Region Server simply logs 
> {code:java}
> 2022-03-17 22:16:55,595 WARN 
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received 
> OPEN for 
> foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4.
>  which is already online {code}
> and it does not ack or nack the master, so this OpenRegionProcedure is stuck forever.
> In this specific case, it needs to ack the master that the region is open. 
>  
> The root cause of why an OpenRegion request was sent for an already open 
> region will be covered in a follow-up issue.





[jira] [Updated] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for already opened regions, causing OpenRegionProcedure stuck forever.

2022-03-18 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26864:
-
Summary: Region Server does not send Ack back to master after receiving an 
OpenRegionReq for already opened regions, causing OpenRegionProcedure stuck 
forever.  (was: Region Server does not send Ack back to master after receiving 
an OpenRegionReq for open regions, causing OpenRegionProcedure stuck forever.)

> Region Server does not send Ack back to master after receiving an 
> OpenRegionReq for already opened regions, causing OpenRegionProcedure stuck 
> forever.
> --
>
> Key: HBASE-26864
> URL: https://issues.apache.org/jira/browse/HBASE-26864
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.4.10
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> For some upgrading cases, we found that the master issues RegionOpen for an 
> already open region and the Region Server simply logs 
> {code:java}
> 2022-03-17 22:16:55,595 WARN 
> org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received 
> OPEN for 
> foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4.
>  which is already online {code}
> and it does not ack or nack the master, so this OpenRegionProcedure is stuck forever.
> In this specific case, it needs to ack the master that the region is open. 
>  
> The root cause of why an OpenRegion request was sent for an already open 
> region will be covered in a follow-up issue.





[jira] [Created] (HBASE-26864) Region Server does not send Ack back to master after receiving an OpenRegionReq for open regions, causing OpenRegionProcedure stuck forever.

2022-03-18 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-26864:


 Summary: Region Server does not send Ack back to master after 
receiving an OpenRegionReq for open regions, causing OpenRegionProcedure stuck 
forever.
 Key: HBASE-26864
 URL: https://issues.apache.org/jira/browse/HBASE-26864
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 2.4.10
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


For some upgrading cases, we found that the master issues RegionOpen for an already 
open region and the Region Server simply logs 
{code:java}
2022-03-17 22:16:55,595 WARN 
org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler: Received OPEN 
for 
foo,b2875fcb-7bc0-4fa9-a980-e902faf7f151,1631771037620.def199cc7208615b783b285f582ddfa4.
 which is already online {code}
and it does not ack or nack the master, so this OpenRegionProcedure is stuck forever.

In this specific case, it needs to ack the master that the region is open. 

The root cause of why an OpenRegion request was sent for an already open region 
will be covered in a follow-up issue.





[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2022-02-07 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488302#comment-17488302
 ] 

Huaxiang Sun commented on HBASE-26590:
--

Sorry, I noticed that I committed it to the branch. In the fix-version field, I 
did not include any 2.3 releases.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: 2.5.0, 2.4.10
>
>
> One of our users complained about higher latency after their application 
> upgraded from the hbase-1.2 client (CDH-5.16.2) to the hbase-2.4.5 client with 
> meta replica Load Balance mode during app restart. I reproduced the regression 
> with a meta lookup test. 
> On my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in the meta region. I used one thread to do 1 million meta 
> lookups against the meta region server.
>  
> ||Version||Meta Replica Load Balance Enabled||Time||
> |2.4.5-with-fixed|Yes|336458ms|
> |2.4.5-with-fixed|No|333253ms|
> |2.4.5|Yes|469980ms|
> |2.4.5|No|470515ms|
> |*cdh-5.16.2*|*No*|*323412ms*|
>  





[jira] [Comment Edited] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483364#comment-17483364
 ] 

Huaxiang Sun edited comment on HBASE-26679 at 1/27/22, 6:51 PM:


Very nice finding and analysis! We ran into this issue before, and Stack has 
linked the jira, HBASE-26042, here.


was (Author: huaxiangsun):
Very nice finding and analysis! We ran into this issue before, and Stack has 
linked the jira, HBASE-26041, here.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider there are three datanodes: dn1, dn2, and dn3. We write some data 
> to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one 
> {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}.  If the ack 
> from dn1 arrives first and triggers Netty to invoke 
> {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, then in 
> {{FanOutOneBlockAsyncDFSOutput.completed}}, dn1's channel is removed from 
> {{Callback.unfinishedReplicas}}. 
> But dn2 and dn3 respond slowly. If dn1 is shut down or hits an exception 
> before dn2 and dn3 send their acks, {{FanOutOneBlockAsyncDFSOutput.failed}} is 
> triggered by Netty with dn1's channel, and because 
> {{Callback.unfinishedReplicas}} does not contain dn1's channel, the 
> {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method 
> (line 250 below), while at line 245 
> {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> 
> errorSupplier) {
> 234 if (state == State.BROKEN || state == State.CLOSED) {
> 235 return;
> 236  }
>  
> 244// disable further write, and fail all pending ack.
> 245state = State.BROKEN;
> 246Throwable error = errorSupplier.get();
> 247for (Iterator<Callback> iter = waitingAckQueue.iterator(); 
> iter.hasNext();) {
> 248  Callback c = iter.next();
> 249  // find the first sync request which we have not acked yet and fail 
> all the request after it.
> 250  if (!c.unfinishedReplicas.contains(channel.id())) {
> 251continue;
> 252  }
> 253  for (;;) {
> 254c.future.completeExceptionally(error);
> 255if (!iter.hasNext()) {
> 256  break;
> 257}
> 258c = iter.next();
> 259  }
> 260break;
> 261}
> 262   datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the above method, at line 262, dn1, dn2, and dn3 are all closed, 
> so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again by dn2 and 
> dn3. But at line 234 above, because 
> {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the 
> whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on 
> the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would be stuck 
> forever.
> When we roll the wal, we create a new {{FanOutOneBlockAsyncDFSOutput}} 
> and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we 
> write the wal header to {{FanOutOneBlockAsyncDFSOutput}} and wait for it to 
> complete. If we run into this situation, the roll would be stuck forever.
> I have simulated this case in the PR, and my fix is that even though 
> {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still 
> try to trigger {{Callback.future}}
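The stuck-flush scenario in the description can be reproduced with a small, self-contained simulation (hypothetical names and a deliberately simplified state machine; this is not the actual FanOutOneBlockAsyncDFSOutput code). With the buggy early return, the second failed() call is skipped and the flush future never completes; with the fix, pending callbacks are still failed:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Simplified simulation of the HBASE-26679 scenario (hypothetical names).
public class FanOutSim {
  enum State { STREAMING, BROKEN }

  static class Callback {
    final CompletableFuture<Void> future = new CompletableFuture<>();
    final Set<String> unfinishedReplicas = new HashSet<>();
    Callback(String... dns) { for (String d : dns) unfinishedReplicas.add(d); }
  }

  State state = State.STREAMING;
  final Queue<Callback> waitingAckQueue = new ArrayDeque<>();
  final boolean applyFix; // true = still fail callbacks even when already BROKEN

  FanOutSim(boolean applyFix) { this.applyFix = applyFix; }

  // A datanode acked: drop it from every callback's unfinished set.
  void completed(String dn) {
    for (Callback c : waitingAckQueue) c.unfinishedReplicas.remove(dn);
  }

  synchronized void failed(String dn, Throwable error) {
    if (state == State.BROKEN) {
      if (!applyFix) return;    // buggy behavior: whole method skipped
    }
    state = State.BROKEN;
    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
      Callback c = iter.next();
      // skip callbacks that no longer wait on this datanode's ack
      if (!c.unfinishedReplicas.contains(dn)) continue;
      for (;;) {
        c.future.completeExceptionally(error);
        if (!iter.hasNext()) break;
        c = iter.next();
      }
      break;
    }
  }

  // Replay the race: dn1 acks, dn1 dies, then closing dn2 re-triggers failed().
  static boolean flushCompletes(boolean applyFix) {
    FanOutSim out = new FanOutSim(applyFix);
    Callback cb = new Callback("dn1", "dn2", "dn3");
    out.waitingAckQueue.add(cb);
    out.completed("dn1");                         // dn1 acks first
    out.failed("dn1", new Exception("dn1 died")); // callback skipped, state BROKEN
    out.failed("dn2", new Exception("closed"));   // second failure from the close
    return cb.future.isDone();
  }

  public static void main(String[] args) {
    System.out.println("without fix, flush completes: " + flushCompletes(false));
    System.out.println("with fix, flush completes: " + flushCompletes(true));
  }
}
```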





[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483364#comment-17483364
 ] 

Huaxiang Sun commented on HBASE-26679:
--

Very nice finding and analysis! We ran into this issue before, and Stack has 
linked the jira, HBASE-26041, here.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider there are three datanodes: dn1, dn2, and dn3. We write some data 
> to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one 
> {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}.  If the ack 
> from dn1 arrives first and triggers Netty to invoke 
> {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, then in 
> {{FanOutOneBlockAsyncDFSOutput.completed}}, dn1's channel is removed from 
> {{Callback.unfinishedReplicas}}. 
> But dn2 and dn3 respond slowly. If dn1 is shut down or hits an exception 
> before dn2 and dn3 send their acks, {{FanOutOneBlockAsyncDFSOutput.failed}} is 
> triggered by Netty with dn1's channel, and because 
> {{Callback.unfinishedReplicas}} does not contain dn1's channel, the 
> {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method 
> (line 250 below), while at line 245 
> {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> 
> errorSupplier) {
> 234 if (state == State.BROKEN || state == State.CLOSED) {
> 235 return;
> 236  }
>  
> 244// disable further write, and fail all pending ack.
> 245state = State.BROKEN;
> 246Throwable error = errorSupplier.get();
> 247for (Iterator<Callback> iter = waitingAckQueue.iterator(); 
> iter.hasNext();) {
> 248  Callback c = iter.next();
> 249  // find the first sync request which we have not acked yet and fail 
> all the request after it.
> 250  if (!c.unfinishedReplicas.contains(channel.id())) {
> 251continue;
> 252  }
> 253  for (;;) {
> 254c.future.completeExceptionally(error);
> 255if (!iter.hasNext()) {
> 256  break;
> 257}
> 258c = iter.next();
> 259  }
> 260break;
> 261}
> 262   datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the above method, at line 262, dn1, dn2, and dn3 are all closed, 
> so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again by dn2 and 
> dn3. But at line 234 above, because 
> {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the 
> whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on 
> the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would be stuck 
> forever.
> When we roll the wal, we create a new {{FanOutOneBlockAsyncDFSOutput}} 
> and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we 
> write the wal header to {{FanOutOneBlockAsyncDFSOutput}} and wait for it to 
> complete. If we run into this situation, the roll would be stuck forever.
> I have simulated this case in the PR, and my fix is that even though 
> {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still 
> try to trigger {{Callback.future}}





[jira] [Commented] (HBASE-26047) [JDK17] Track JDK17 unit test failures

2022-01-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483359#comment-17483359
 ] 

Huaxiang Sun commented on HBASE-26047:
--

[~xytss123] , sorry to get back to you late. I have linked HBASE-26410 and 
HBASE-26477.

> [JDK17] Track JDK17 unit test failures
> --
>
> Key: HBASE-26047
> URL: https://issues.apache.org/jira/browse/HBASE-26047
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Wei-Chiu Chuang
>Assignee: Yutong Xiao
>Priority: Major
>
> As of now, there are still two failed unit tests after exporting JDK internal 
> modules and the modifier access hack.
> {noformat}
> [ERROR] Tests run: 7, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.217 
> s <<< FAILURE! - in org.apache.hadoop.hbase.io.TestHeapSize
> [ERROR] org.apache.hadoop.hbase.io.TestHeapSize.testSizes  Time elapsed: 
> 0.041 s  <<< FAILURE!
> java.lang.AssertionError: expected:<160> but was:<152>
> at 
> org.apache.hadoop.hbase.io.TestHeapSize.testSizes(TestHeapSize.java:335)
> [ERROR] org.apache.hadoop.hbase.io.TestHeapSize.testNativeSizes  Time 
> elapsed: 0.01 s  <<< FAILURE!
> java.lang.AssertionError: expected:<72> but was:<64>
> at 
> org.apache.hadoop.hbase.io.TestHeapSize.testNativeSizes(TestHeapSize.java:134)
> [INFO] Running org.apache.hadoop.hbase.io.Tes
> [ERROR] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.697 
> s <<< FAILURE! - in org.apache.hadoop.hbase.ipc.TestBufferChain
> [ERROR] org.apache.hadoop.hbase.ipc.TestBufferChain.testWithSpy  Time 
> elapsed: 0.537 s  <<< ERROR!
> java.lang.NullPointerException: Cannot enter synchronized block because 
> "this.closeLock" is null
> at 
> org.apache.hadoop.hbase.ipc.TestBufferChain.testWithSpy(TestBufferChain.java:119)
> {noformat}
> It appears that JDK17 makes the heap size estimates different from before. Not 
> sure why.
> TestBufferChain.testWithSpy  failure might be because of yet another 
> unexported module.





[jira] [Created] (HBASE-26649) Support meta replica LoadBalance mode for RegionLocator#getAllRegionLocations()

2022-01-06 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-26649:


 Summary: Support meta replica LoadBalance mode for 
RegionLocator#getAllRegionLocations()
 Key: HBASE-26649
 URL: https://issues.apache.org/jira/browse/HBASE-26649
 Project: HBase
  Issue Type: Improvement
  Components: meta replicas
Affects Versions: 2.4.9
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


When an HBase application restarts, its meta cache is empty. Normally, it will 
fill the meta cache one region at a time by scanning the meta region. This puts 
huge pressure on the region server hosting meta during application restart. 

The client can prefetch all region locations by calling 
RegionLocator#getAllRegionLocations(). Meta replica LoadBalance mode is 
supported in 2.4; it would be nice to load balance 
RegionLocator#getAllRegionLocations() across all meta replica regions so the 
batch scan can spread over them.
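The benefit of prefetching can be sketched with a toy client-side meta cache (hypothetical names; not the actual hbase-client code). Each region covers [startKey, endKey); a cold cache pays one meta round trip per region it touches, while a warmed cache pays a single batch round trip, analogous to RegionLocator#getAllRegionLocations():

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy sketch of client-side meta caching and prefetch (hypothetical names).
public class MetaCacheSim {
  static final class Loc {
    final String endKey, server;
    Loc(String endKey, String server) { this.endKey = endKey; this.server = server; }
  }

  private final NavigableMap<String, Loc> metaTable; // the "meta region" contents
  private final NavigableMap<String, Loc> cache = new TreeMap<>();
  int metaRoundTrips = 0;

  MetaCacheSim(NavigableMap<String, Loc> metaTable) { this.metaTable = metaTable; }

  // Locate the server for a row: use the cached floor entry if it covers the
  // row, otherwise do one single-row meta lookup and cache the result.
  String locate(String row) {
    Map.Entry<String, Loc> e = cache.floorEntry(row);
    if (e != null
        && (e.getValue().endKey.isEmpty() || row.compareTo(e.getValue().endKey) < 0)) {
      return e.getValue().server;           // cache hit
    }
    metaRoundTrips++;                       // cache miss: go to meta
    Map.Entry<String, Loc> m = metaTable.floorEntry(row);
    cache.put(m.getKey(), m.getValue());
    return m.getValue().server;
  }

  // Prefetch everything in one batch read, like getAllRegionLocations().
  void warmUp() {
    metaRoundTrips++;                       // one batch scan of meta
    cache.putAll(metaTable);
  }

  static int roundTripsFor(boolean prefetch) {
    NavigableMap<String, Loc> meta = new TreeMap<>();
    meta.put("", new Loc("g", "rs1"));      // region ["", "g") on rs1
    meta.put("g", new Loc("p", "rs2"));     // region ["g", "p") on rs2
    meta.put("p", new Loc("", "rs3"));      // region ["p", "")  on rs3
    MetaCacheSim c = new MetaCacheSim(meta);
    if (prefetch) c.warmUp();
    for (String row : new String[] { "apple", "grape", "pear", "banana", "quark" }) {
      c.locate(row);
    }
    return c.metaRoundTrips;
  }

  public static void main(String[] args) {
    System.out.println("round trips, cold cache: " + roundTripsFor(false)); // 3
    System.out.println("round trips, prefetched: " + roundTripsFor(true));  // 1
  }
}
```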





[jira] [Resolved] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2022-01-06 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-26590.
--
Fix Version/s: 2.5.0
   2.4.10
   Resolution: Fixed

Resolved it for now, will reopen if there is new finding.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: 2.5.0, 2.4.10
>
>
> One of our users complained about higher latency after their application 
> upgraded from the hbase-1.2 client (CDH-5.16.2) to the hbase-2.4.5 client with 
> meta replica Load Balance mode during app restart. I reproduced the regression 
> with a meta lookup test. 
> On my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in the meta region. I used one thread to do 1 million meta 
> lookups against the meta region server.
>  
> ||Version||Meta Replica Load Balance Enabled||Time||
> |2.4.5-with-fixed|Yes|336458ms|
> |2.4.5-with-fixed|No|333253ms|
> |2.4.5|Yes|469980ms|
> |2.4.5|No|470515ms|
> |*cdh-5.16.2*|*No*|*323412ms*|
>  





[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2022-01-05 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469506#comment-17469506
 ] 

Huaxiang Sun commented on HBASE-26590:
--

I modified my test case to exclude connection setup/teardown from the measured 
time. Here are the results for 1M random meta lookups. I added an option to use 
BlockingRpcClient for meta lookups, compared against the default NettyRpcClient.
||Version||Meta Replica Load Balance Enabled||BlockingRpcClient||Time(ms)||
|2.4.5-with-fixed|No|No|370814|
|2.4.5-with-fixed|No|Yes|358931|
|2.4.5-with-fixed|Yes|Yes|349485|
|2.4.5|No|No|516640|
|2.4.5|Yes|Yes|497509|
|cdh-5.16.2|No|No|371540|

 

When I did the Table.get() test, it was hard to draw a solid conclusion due to 
key distribution: most of the randomly created keys fall into the last region, 
which is cached. The BlockingRpcClient/NettyRpcClient difference is about 
3% (not 5 ~ 10% as initially reported), so it is not a very big concern here.

The difference here is not as big as what we observed at the production cluster. 
I am going to put up the patch and will work with the team to see if it helps.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained about higher latency after their application 
> upgraded from the hbase-1.2 client (CDH-5.16.2) to the hbase-2.4.5 client with 
> meta replica Load Balance mode during app restart. I reproduced the regression 
> with a meta lookup test. 
> On my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in the meta region. I used one thread to do 1 million meta 
> lookups against the meta region server.
>  
> ||Version||Meta Replica Load Balance Enabled||Time||
> |2.4.5-with-fixed|Yes|336458ms|
> |2.4.5-with-fixed|No|333253ms|
> |2.4.5|Yes|469980ms|
> |2.4.5|No|470515ms|
> |*cdh-5.16.2*|*No*|*323412ms*|
>  





[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-23 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464814#comment-17464814
 ] 

Huaxiang Sun commented on HBASE-26590:
--

I am modifying my test code to exclude the connection setup/teardown from the 
reported time (it should not have been there in the first place). I will report 
back when I have more testing results.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained about higher latency after their application 
> upgraded from the hbase-1.2 client (CDH-5.16.2) to the hbase-2.4.5 client with 
> meta replica Load Balance mode during app restart. I reproduced the regression 
> with a meta lookup test. 
> On my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in the meta region. I used one thread to do 1 million meta 
> lookups against the meta region server.
>  
> ||Version||Meta Replica Load Balance Enabled||Time||
> |2.4.5-with-fixed|Yes|336458ms|
> |2.4.5-with-fixed|No|333253ms|
> |2.4.5|Yes|469980ms|
> |2.4.5|No|470515ms|
> |*cdh-5.16.2*|*No*|*323412ms*|
>  





[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-22 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464271#comment-17464271
 ] 

Huaxiang Sun commented on HBASE-26590:
--

By the way, the hbase-1 client app and the hbase-2 client app are working 
against the same hbase-2.4.5 cluster, so the only difference is the client 
module.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained about higher latency after their application 
> upgraded from the hbase-1.2 client (CDH-5.16.2) to the hbase-2.4.5 client with 
> meta replica Load Balance mode during app restart. I reproduced the regression 
> with a meta lookup test. 
> On my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in the meta region. I used one thread to do 1 million meta 
> lookups against the meta region server.
>  
> ||Version||Meta Replica Load Balance Enabled||Time||
> |2.4.5-with-fixed|Yes|336458ms|
> |2.4.5-with-fixed|No|333253ms|
> |2.4.5|Yes|469980ms|
> |2.4.5|No|470515ms|
> |*cdh-5.16.2*|*No*|*323412ms*|
>  





[jira] [Created] (HBASE-26618) Involving primary meta region in meta scan with Meta Replica Mode

2021-12-22 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-26618:


 Summary: Involving primary meta region in meta scan with Meta 
Replica Mode
 Key: HBASE-26618
 URL: https://issues.apache.org/jira/browse/HBASE-26618
 Project: HBase
  Issue Type: Improvement
  Components: meta replicas
Affects Versions: 2.4.9
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


In the current release with Meta replica LoadBalance mode, the primary meta 
region does not serve the meta scan (only the meta replica regions serve the 
read). When the result from a meta replica region is stale, the client goes to 
the primary meta region for the up-to-date location. 

From our experience, the primary meta region serves very little read traffic, so 
it would be better to load balance read traffic across the primary meta region 
as well.
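The change suggested above, including the primary in the rotation, can be sketched with a toy replica chooser (hypothetical names; not the actual hbase-client selection logic). Excluding replica 0 spreads load over numReplicas - 1 targets; including it spreads load over all numReplicas:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy round-robin selector for the meta replica to scan (hypothetical names).
public class ReplicaChooser {
  private final AtomicLong counter = new AtomicLong();
  private final int numReplicas;
  private final boolean includePrimary;

  ReplicaChooser(int numReplicas, boolean includePrimary) {
    this.numReplicas = numReplicas;
    this.includePrimary = includePrimary;
  }

  // Returns the replica id to send the next meta scan to.
  int next() {
    long n = counter.getAndIncrement();
    if (includePrimary) {
      return (int) (n % numReplicas);           // 0..numReplicas-1, primary included
    }
    return 1 + (int) (n % (numReplicas - 1));   // 1..numReplicas-1, primary excluded
  }

  public static void main(String[] args) {
    ReplicaChooser c = new ReplicaChooser(3, true);
    for (int i = 0; i < 6; i++) System.out.print(c.next() + " "); // 0 1 2 0 1 2
    System.out.println();
  }
}
```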





[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-22 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464196#comment-17464196
 ] 

Huaxiang Sun commented on HBASE-26590:
--

Thanks [~zhangduo].

For master, I think 10 is fine as all results are cached to the meta cache, so 
they are not wasted.

For hbase-2, the extra 4 results are not cached, so that is a bit of a concern. 
The issue happened during the job restart, when ~700 hbase clients start at the 
same time with an empty meta cache, so there is a meta scan storm; there are 
~300k regions in the meta table. I am not sure at this moment that this is the 
main factor, as my testing results show far less impact than the one observed 
by the production job. 

Some background info:

The cluster is stable without region move. 

There is meta replica Load Balance mode enabled at the 2.4.5 client side. Meta 
Replica Region Server is fully synced with the primary region as the cluster is 
stable. During my test, meta scan going through meta replica region does not 
cause performance regression. 

At my testing cluster, I can reproduce a bit of regression with a RandomGet test 
with the 2.4.5 NettyRpcClient. After changing to BlockingRpcClient, this 
regression (5 ~ 10%) is gone. 

I will submit this minor improvement patch and will work with the production 
team again to see if there is any improvement with the patch and the new 
BlockingRpcClient config. 

If the meta replica region is out of sync with the primary region, there will 
be lots of stale region locations, resulting in NotServingRegionException, and 
the client will retry with the primary meta region. That would cause a serious 
latency issue, but it is not the case here. Anyway, I will keep an eye on it 
when we retry with the new 2.4.5 client.

 

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained about higher latency after their application 
> upgraded from the hbase-1.2 client (CDH-5.16.2) to the hbase-2.4.5 client with 
> meta replica Load Balance mode during app restart. I reproduced the regression 
> with a meta lookup test. 
> On my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in the meta region. I used one thread to do 1 million meta 
> lookups against the meta region server.
>  
> ||Version||Meta Replica Load Balance Enabled||Time||
> |2.4.5-with-fixed|Yes|336458ms|
> |2.4.5-with-fixed|No|333253ms|
> |2.4.5|Yes|469980ms|
> |2.4.5|No|470515ms|
> |*cdh-5.16.2*|*No*|*323412ms*|
>  





[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461122#comment-17461122
 ] 

Huaxiang Sun commented on HBASE-26590:
--

Update: I put the fix into a test which does a real Table#get(), and it still 
shows a performance regression, so this is not the only cause. Debugging.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461066#comment-17461066
 ] 

Huaxiang Sun edited comment on HBASE-26590 at 12/16/21, 10:12 PM:
--

It is hard to compare with the master branch, as it saves the extra fetched 
locations into the client's meta cache.


was (Author: huaxiangsun):
It is hard to compare with the master branch, as it saves the fetched locations 
into client's meta cache.
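To illustrate why caching the extra fetched locations changes the comparison, here is a toy model of a client-side region location cache (plain Python with illustrative names, not the HBase client API): once the rows returned by one meta scan are cached, lookups for rows covered by those regions are served locally, without another meta RPC.

```python
import bisect

class MetaCache:
    """Toy model of a client-side region location cache keyed by region
    start row. locate() finds the cached region covering a row, if any."""
    def __init__(self):
        self._starts = []  # sorted region start rows
        self._locs = {}    # start row -> (end row, server)

    def cache(self, start, end, server):
        i = bisect.bisect_left(self._starts, start)
        if i == len(self._starts) or self._starts[i] != start:
            self._starts.insert(i, start)
        self._locs[start] = (end, server)

    def locate(self, row):
        # floor lookup: closest cached region whose start row <= row
        i = bisect.bisect_right(self._starts, row) - 1
        if i < 0:
            return None
        end, server = self._locs[self._starts[i]]
        return server if row < end else None

cache = MetaCache()
# Caching all rows returned by one meta scan means subsequent lookups
# for nearby rows hit the cache instead of going back to meta.
for k in range(5):
    cache.cache(f"{k:03d}", f"{k + 1:03d}", f"rs{k}")
assert cache.locate("002") == "rs2"
assert cache.locate("999") is None  # not covered by any cached region
```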

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461066#comment-17461066
 ] 

Huaxiang Sun commented on HBASE-26590:
--

It is hard to compare with the master branch, as it saves the fetched locations 
into the client's meta cache.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26590:
-
Affects Version/s: (was: 3.0.0-alpha-1)

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 2.4.0, 2.5.0, 2.3.7, 2.6.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461031#comment-17461031
 ] 

Huaxiang Sun commented on HBASE-26590:
--

I debugged the code and found that this regression is caused by the following 
line. It now asks the region server to return 5 rows, which takes more time for 
the region server to process. This change was introduced in HBASE-20182; in 
most normal cases, the extra 4 rows returned are discarded. The proposed fix is 
to revert to the hbase-1 behavior, i.e., ask for 1 row in the meta scan. For 
the corner case fixed by HBASE-20182, the client will go back to the meta 
region server a couple more times to get the correct location.

 
{code:java}
diff --git a/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java b/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java
index 9145c55c0a..6039387b6e 100644
--- a/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java
+++ b/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java
@@ -888,7 +888,7 @@ class ConnectionImplementation implements ClusterConnection, Closeable {
     byte[] metaStopKey =
       RegionInfo.createRegionName(tableName, HConstants.EMPTY_START_ROW, "", false);
     Scan s = new Scan().withStartRow(metaStartKey).withStopRow(metaStopKey, true)
-      .addFamily(HConstants.CATALOG_FAMILY).setReversed(true).setCaching(5)
+      .addFamily(HConstants.CATALOG_FAMILY).setReversed(true).setCaching(1)
       .setReadType(ReadType.PREAD);
 
     switch (this.metaReplicaMode) { {code}
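The effect of the caching value can be sketched with a toy model (plain Python, not HBase code; meta_lookup and the key layout are illustrative assumptions): a reversed meta scan with caching=5 ships five catalog rows per lookup, of which the common case uses only the first.

```python
import bisect

def meta_lookup(start_keys, row_key, caching):
    """Toy model of a reversed meta scan: returns up to `caching` region
    start keys at or below the probe key, in reverse order."""
    # index of the closest region whose start key <= row_key
    i = bisect.bisect_right(start_keys, row_key) - 1
    lo = max(0, i - caching + 1)
    return start_keys[lo:i + 1][::-1]

# 160k region start keys, mirroring the test cluster in the description.
start_keys = [f"{k:06d}" for k in range(160000)]

rows5 = meta_lookup(start_keys, "000042", 5)  # hbase-2 behavior
rows1 = meta_lookup(start_keys, "000042", 1)  # proposed (hbase-1) behavior
assert rows1 == ["000042"]
# With caching=5, only the first row is needed in the common case;
# the other 4 rows are fetched, serialized, and then discarded.
assert rows5[0] == "000042" and len(rows5) == 5
```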
 

 

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 3.0.0-alpha-1, 2.3.7
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461032#comment-17461032
 ] 

Huaxiang Sun commented on HBASE-26590:
--

2.4.5-with-fixed is the release with the proposed fix. With that, meta lookup 
time is similar to cdh-5.16.2's.

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 3.0.0-alpha-1, 2.3.7
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26590:
-
Description: 
One of our users complained higher latency after application upgrades from 
hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
Balance mode during app restart. I reproduced the regression by a test for meta 
lookup. 

At my test cluster, there are 160k regions for the test table, so there are 
160k entries in meta region. Used one thread to do 1 million meta lookup 
against the meta region server.

 
||Version ||Meta Replica Load Balance Enabled||Time               ||
||2.4.5-with-fixed||Yes||336458ms||
||2.4.5-with-fixed||No||333253ms||
||2.4.5||Yes||469980ms||
||2.4.5||No||470515ms||
|      *cdh-5.16.2*|                                *No* |  *323412ms*|

 

  was:
One of our users complained higher latency after application upgrades from 
hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
Balance mode during app restart. I reproduced the regression by a test for meta 
lookup. 

At my test cluster, there are 160k regions for the test table, so there are 
160k entries in meta region. Used one thread to do 1 million meta lookup 
against the meta region server.

 


> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 3.0.0-alpha-1, 2.3.7
> Environment: ||Version ||Meta Replica Load Balance Enabled||Time      
>          ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun updated HBASE-26590:
-
Environment: (was: ||Version ||Meta Replica Load Balance Enabled||Time  
             ||
||2.4.5-with-fixed||Yes||336458ms||
||2.4.5-with-fixed||No||333253ms||
||2.4.5||Yes||469980ms||
||2.4.5||No||470515ms||
|      *cdh-5.16.2*|                                *No* |  *323412ms*|)

> Hbase-client Meta lookup performance regression between hbase-1 and hbase-2
> ---
>
> Key: HBASE-26590
> URL: https://issues.apache.org/jira/browse/HBASE-26590
> Project: HBase
>  Issue Type: Improvement
>  Components: meta
>Affects Versions: 3.0.0-alpha-1, 2.3.7
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
>
> One of our users complained higher latency after application upgrades from 
> hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
> Balance mode during app restart. I reproduced the regression by a test for 
> meta lookup. 
> At my test cluster, there are 160k regions for the test table, so there are 
> 160k entries in meta region. Used one thread to do 1 million meta lookup 
> against the meta region server.
>  
> ||Version ||Meta Replica Load Balance Enabled||Time               ||
> ||2.4.5-with-fixed||Yes||336458ms||
> ||2.4.5-with-fixed||No||333253ms||
> ||2.4.5||Yes||469980ms||
> ||2.4.5||No||470515ms||
> |      *cdh-5.16.2*|                                *No* |  *323412ms*|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26590) Hbase-client Meta lookup performance regression between hbase-1 and hbase-2

2021-12-16 Thread Huaxiang Sun (Jira)
Huaxiang Sun created HBASE-26590:


 Summary: Hbase-client Meta lookup performance regression between 
hbase-1 and hbase-2
 Key: HBASE-26590
 URL: https://issues.apache.org/jira/browse/HBASE-26590
 Project: HBase
  Issue Type: Improvement
  Components: meta
Affects Versions: 2.3.7, 3.0.0-alpha-1
 Environment: ||Version ||Meta Replica Load Balance Enabled||Time       
        ||
||2.4.5-with-fixed||Yes||336458ms||
||2.4.5-with-fixed||No||333253ms||
||2.4.5||Yes||469980ms||
||2.4.5||No||470515ms||
|      *cdh-5.16.2*|                                *No* |  *323412ms*|
Reporter: Huaxiang Sun
Assignee: Huaxiang Sun


One of our users complained higher latency after application upgrades from 
hbase-1.2 client (CDH-5.16.2) to hbase-2.4.5 client with meta replica Load 
Balance mode during app restart. I reproduced the regression by a test for meta 
lookup. 

At my test cluster, there are 160k regions for the test table, so there are 
160k entries in meta region. Used one thread to do 1 million meta lookup 
against the meta region server.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26304) Reflect out-of-band locality improvements in served requests

2021-11-08 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440687#comment-17440687
 ] 

Huaxiang Sun commented on HBASE-26304:
--

Thanks [~bbeaudreault]. I will try to look at the patch in the coming days and 
try it out on my test clusters as well.

> Reflect out-of-band locality improvements in served requests
> 
>
> Key: HBASE-26304
> URL: https://issues.apache.org/jira/browse/HBASE-26304
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> Once the LocalityHealer has improved locality of a StoreFile (by moving 
> blocks onto the correct host), the Reader's DFSInputStream and Region's 
> localityIndex metric must be refreshed. Without refreshing the 
> DFSInputStream, the improved locality will not improve latencies. In fact, 
> the DFSInputStream may try to fetch blocks that have moved, resulting in a 
> ReplicaNotFoundException. This is automatically retried, but the retry will 
> temporarily increase long tail latencies relative to configured backoff 
> strategy.
> In the original LocalityHealer design, I created a new 
> RefreshHDFSBlockDistribution RPC on the RegionServer. This RPC accepts a list 
> of region names and, for each region store, re-opens the underlying StoreFile 
> if the locality has changed. This implementation was complicated both in 
> integrating callbacks into the HDFS Dispatcher and in terms of safely 
> re-opening StoreFiles without impacting reads or caches. 
> In working to port the LocalityHealer to the Apache projects, I'm taking a 
> different approach:
>  * The part of the LocalityHealer that moves blocks will be an HDFS project 
> contribution
>  * As such, the DFSClient should be able to more gracefully recover from 
> block moves.
>  * Additionally, HBase has some caches of block locations for locality 
> reporting and the balancer. Those need to be kept up-to-date.
> The DFSClient improvements are covered in 
> https://issues.apache.org/jira/browse/HDFS-16261. As such, this issue becomes 
> about updating HBase's block location caches.
> I considered a few different approaches, but the most elegant one I could 
> come up with was to tie the HDFSBlockDistribution metrics directly to the 
> underlying DFSInputStream of each StoreFile's initialReader. That way, our 
> locality metrics are identically representing the block allocations that our 
> reads are going through. This also means that our locality metrics will 
> naturally adjust as the DFSInputStream adjusts to block moves.
> Once we have accurate locality metrics on the regionserver, the Balancer's 
> cache can easily be invalidated via our usual heartbeat methods. 
> RegionServers report to the HMaster periodically, which keeps a 
> ClusterMetrics method up to date. Right before each balancer invocation, the 
> balancer is updated with the latest ClusterMetrics. At this time, we compare 
> the old ClusterMetrics to the new, and invalidate the caches for any regions 
> whose locality has changed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26047) [JDK17] Track JDK17 unit test failures

2021-10-25 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434035#comment-17434035
 ] 

Huaxiang Sun commented on HBASE-26047:
--

[~xytss123], I linked HBASE-26392. 

> [JDK17] Track JDK17 unit test failures
> --
>
> Key: HBASE-26047
> URL: https://issues.apache.org/jira/browse/HBASE-26047
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Wei-Chiu Chuang
>Priority: Major
>
> As of now, there are still two failed unit tests after exporting JDK internal 
> modules and the modifier access hack.
> {noformat}
> [ERROR] Tests run: 7, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.217 
> s <<< FAILURE! - in org.apache.hadoop.hbase.io.TestHeapSize
> [ERROR] org.apache.hadoop.hbase.io.TestHeapSize.testSizes  Time elapsed: 
> 0.041 s  <<< FAILURE!
> java.lang.AssertionError: expected:<160> but was:<152>
> at 
> org.apache.hadoop.hbase.io.TestHeapSize.testSizes(TestHeapSize.java:335)
> [ERROR] org.apache.hadoop.hbase.io.TestHeapSize.testNativeSizes  Time 
> elapsed: 0.01 s  <<< FAILURE!
> java.lang.AssertionError: expected:<72> but was:<64>
> at 
> org.apache.hadoop.hbase.io.TestHeapSize.testNativeSizes(TestHeapSize.java:134)
> [INFO] Running org.apache.hadoop.hbase.io.Tes
> [ERROR] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.697 
> s <<< FAILURE! - in org.apache.hadoop.hbase.ipc.TestBufferChain
> [ERROR] org.apache.hadoop.hbase.ipc.TestBufferChain.testWithSpy  Time 
> elapsed: 0.537 s  <<< ERROR!
> java.lang.NullPointerException: Cannot enter synchronized block because 
> "this.closeLock" is null
> at 
> org.apache.hadoop.hbase.ipc.TestBufferChain.testWithSpy(TestBufferChain.java:119)
> {noformat}
> It appears that JDK17 makes the heap size estimate different than before. Not 
> sure why.
> TestBufferChain.testWithSpy  failure might be because of yet another 
> unexported module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26338) hbck2 setRegionState cannot set replica region state

2021-10-11 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-26338.
--
Fix Version/s: hbase-operator-tools-1.2.0
 Release Note: 
To set a replica region's state, the command needs the primary region's 
encoded region name and the replica id; the command will be "setRegionState 
, ".
   Resolution: Fixed

> hbck2 setRegionState cannot set replica region state
> 
>
> Key: HBASE-26338
> URL: https://issues.apache.org/jira/browse/HBASE-26338
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>Affects Versions: hbase-operator-tools-1.1.0
>Reporter: Huaxiang Sun
>Assignee: Huaxiang Sun
>Priority: Major
> Fix For: hbase-operator-tools-1.2.0
>
>
> Currently, there is no way to use hbck2 setRegionState to set a replica 
> region's state, which makes it hard to fix inconsistencies related to replica 
> regions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

