[jira] [Commented] (HBASE-27947) RegionServer OOM under load when TLS is enabled

2023-06-29 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738849#comment-17738849
 ] 

Bryan Beaudreault commented on HBASE-27947:
---

Sorry for all the updates, but I’ve had some success today with the 
non-blocking idea in my last comment. I did the simple thing and had handlers 
drop calls, when they pull from the queue, if the channel is not writable. I set 
the netty high watermark to 5mb and the low watermark to 512kb. This still 
resolved the OOMs, and handler usage is better. I’m going to do more testing 
tomorrow before I package it up for a PR.
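The watermark scheme described above can be sketched in plain Java. This is a minimal illustration, not HBase or Netty code: the class and method names below are invented for the sketch, and a real implementation would consult Netty's Channel.isWritable(), driven by the configured write-buffer watermarks, instead of tracking pending bytes by hand.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Netty-style watermark bookkeeping: the channel becomes unwritable once
 *  pending bytes exceed the high watermark and writable again only after
 *  draining below the low watermark. Illustrative names, not real APIs. */
final class WatermarkChannel {
  static final long HIGH = 5L * 1024 * 1024; // 5mb high watermark
  static final long LOW = 512L * 1024;       // 512kb low watermark
  private long pending;
  private boolean writable = true;

  void write(long bytes) {
    pending += bytes;
    if (pending > HIGH) writable = false;
  }

  void flushed(long bytes) {
    pending = Math.max(0, pending - bytes);
    if (pending < LOW) writable = true;
  }

  boolean isWritable() { return writable; }
}

public class BackpressureSketch {
  /** Handlers pull calls off the queue but drop them when the originating
   *  channel is not writable, bounding how much response data can buffer. */
  static int serve(WatermarkChannel ch, Queue<Long> callQueue) {
    int dropped = 0;
    while (!callQueue.isEmpty()) {
      long responseSize = callQueue.poll();
      if (ch.isWritable()) {
        ch.write(responseSize); // publish the response to the channel
      } else {
        dropped++;              // drop the call instead of buffering more
      }
    }
    return dropped;
  }

  public static void main(String[] args) {
    WatermarkChannel ch = new WatermarkChannel();
    Queue<Long> q = new ArrayDeque<>();
    for (int i = 0; i < 10; i++) {
      q.add(1024L * 1024); // ten 1 MB responses, nothing flushed yet
    }
    System.out.println("dropped=" + serve(ch, q));
  }
}
```

With a 5mb high watermark and no flushing, the sixth 1 MB write tips the channel over the watermark, so the remaining four calls are dropped rather than buffered.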

> RegionServer OOM under load when TLS is enabled
> ---
>
> Key: HBASE-27947
> URL: https://issues.apache.org/jira/browse/HBASE-27947
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.6.0
>Reporter: Bryan Beaudreault
>Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters. 
> This has mostly gone fine, except on one cluster. Most clusters, including 
> this one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This 
> cluster tends to get bursts of traffic, in which case it would typically jump 
> to 400-500mb. Again, this is sampled, so it could have been higher than that. 
> When we enabled SSL on this cluster, we started seeing bursts up to at least 
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs 
> and general chaos on the cluster.
>  
> We've gotten it under control a little bit by setting 
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and 
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}. 
> We've set netty's maxDirectMemory to be approximately equal to 
> ({{-XX:MaxDirectMemorySize}} - BucketCacheSize - ReservoirSize). Now we 
> are seeing netty's own OutOfDirectMemoryError, which still causes pain 
> for clients but at least insulates the other components of the regionserver.
>  
> We're still digging into exactly why this is happening. The cluster clearly 
> has a bad access pattern, but it doesn't seem like SSL should increase the 
> memory footprint by 5-10x like we're seeing.
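The budgeting described in the report is simple arithmetic: give Netty whatever direct memory remains after the bucket cache and the byte-buffer reservoir. A small sketch, with hypothetical sizes (the real values depend on the deployment):

```java
/** Back-of-the-envelope direct-memory budgeting as described above.
 *  All sizes below are hypothetical examples, not recommendations. */
public class DirectMemoryBudget {
  static long nettyMaxDirectMemory(long maxDirect, long bucketCache, long reservoir) {
    // netty gets what remains of -XX:MaxDirectMemorySize after the
    // bucket cache and the RPC byte-buffer reservoir are carved out
    return maxDirect - bucketCache - reservoir;
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // e.g. -XX:MaxDirectMemorySize=8g, 4g bucket cache, 1g reservoir
    long netty = nettyMaxDirectMemory(8 * gb, 4 * gb, 1 * gb);
    // candidate value for -Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory
    System.out.println("netty=" + (netty / gb) + "g");
  }
}
```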



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27947) RegionServer OOM under load when TLS is enabled

2023-06-29 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738847#comment-17738847
 ] 

Lijin Bin commented on HBASE-27947:
---

We encountered a similar problem: too many responses accumulated in the 
responseQueue and caused the regionserver to OOM. We fixed it by just closing 
the channel and releasing the heap references held by the responses.






[jira] [Commented] (HBASE-27948) Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean

2023-06-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738835#comment-17738835
 ] 

Hudson commented on HBASE-27948:


Results for branch branch-3
[build #11 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean
> ---
>
> Key: HBASE-27948
> URL: https://issues.apache.org/jira/browse/HBASE-27948
> Project: HBase
>  Issue Type: Improvement
>Reporter: Jing Yu
>Assignee: Jing Yu
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> Currently we only report the "memStoreSize" jmx metric in the sub=Memory bean. 
> There are "Memstore On-Heap Size" and "Memstore Off-Heap Size" in the RS UI. 
> It would be useful to report them in JMX.
> In addition, the "memStoreSize" metric under sub=Memory is 0 for some reason 
> (while that under sub=Server is not). Need to do some digging to see if it is 
> a bug.
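Once exposed, the proposed metrics would be readable like any other HBase JMX attribute. A sketch of a client-side read via the platform MBean server follows; the object name mirrors HBase's sub=Memory bean naming, and the two attribute names are assumptions about what the new metrics might be called, not confirmed names.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Sketch: read the (proposed) memstore size attributes from the
 *  sub=Memory bean. Attribute names are assumptions for illustration. */
public class MemstoreJmx {
  public static void main(String[] args) throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    ObjectName name =
        new ObjectName("Hadoop:service=HBase,name=RegionServer,sub=Memory");
    if (server.isRegistered(name)) {
      // hypothetical attribute names for the two proposed metrics
      System.out.println(server.getAttribute(name, "memStoreOnHeapSize"));
      System.out.println(server.getAttribute(name, "memStoreOffHeapSize"));
    } else {
      // running outside a regionserver JVM, the bean does not exist
      System.out.println("bean not registered in this JVM");
    }
  }
}
```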





[jira] [Commented] (HBASE-27951) Use ADMIN_QOS in MasterRpcServices for regionserver operational dependencies

2023-06-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738834#comment-17738834
 ] 

Hudson commented on HBASE-27951:


Results for branch branch-3
[build #11 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/11/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Use ADMIN_QOS in MasterRpcServices for regionserver operational dependencies
> 
>
> Key: HBASE-27951
> URL: https://issues.apache.org/jira/browse/HBASE-27951
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.10
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1, 4.0.0-alpha-1
>
>
> Analysis of a recent production incident is not yet complete, but an item of 
> note is an apparent deadlock. Imagine you are gracefully draining a 
> regionserver by way of a flurry of moveRegion requests. The handler for 
> moveRegion submits a TRSP and then waits on its future without a timeout. 
> Imagine that there are enough moveRegion requests to tie up the 
> normal-priority master RPC pool. Now imagine that all of those requests are 
> waiting on TRSPs pending on a regionserver that is concurrently bounced, or 
> maybe it fails. The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE 
> because the target regionserver terminated before responding to the close 
> requests, blocking the moveRegion requests, blocking the RPC handlers. The 
> regionserver restarts and tries to check in, but cannot report to the master 
> because there are no free normal-priority handlers to serve it. It seems 
> incorrect to have the regionserver operational dependencies 
> (regionServerStartup, regionServerReport, and reportFatalRSError) contending 
> with normal-priority requests.
> They should be made ADMIN_QOS priority to avoid this case. 
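The essence of the proposed fix is priority routing: the regionserver lifecycle RPCs get a dedicated admin-priority handler pool so they can never be starved by a flood of normal-priority requests. A minimal sketch of that routing decision follows; the priority constants mirror HBase's QoS levels, but the routing function itself is illustrative, not the actual MasterRpcServices code.

```java
/** Sketch: route regionserver lifecycle RPCs to the admin-priority pool
 *  so a flood of normal-priority calls (e.g. moveRegion) cannot starve
 *  them. Illustrative routing, not actual HBase code. */
public class QosRouting {
  static final int NORMAL_QOS = 0;  // shared pool, can fill up under load
  static final int ADMIN_QOS = 100; // dedicated pool, never starved

  /** Methods the master must always be able to service. */
  static int priorityFor(String method) {
    switch (method) {
      case "RegionServerStartup":
      case "RegionServerReport":
      case "ReportFatalRSError":
        return ADMIN_QOS;
      default:
        return NORMAL_QOS;
    }
  }

  public static void main(String[] args) {
    System.out.println("report=" + priorityFor("RegionServerReport"));
    System.out.println("move=" + priorityFor("MoveRegion"));
  }
}
```

In the deadlock described above, regionServerReport would then land on a free admin-priority handler even while every normal-priority handler is blocked on a moveRegion future.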





[jira] [Comment Edited] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-29 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738226#comment-17738226
 ] 

Viraj Jasani edited comment on HBASE-27955 at 6/29/23 9:31 PM:
---

That is correct, the NPE is a code bug in the custom replication endpoint. 
However, the point I am trying to make is: as soon as this NPE gets reported, 
RefreshPeerProcedure completes but is not rolled back (rollback is not 
supported), and the next step in the parent procedure, i.e. 
POST_PEER_MODIFICATION, stays stuck and never gets executed. The only clue I 
have is that the previous step of the procedure had the above NPE reported and 
it completed (the succ flag is set to false):
{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
    LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, error);
    this.succ = false;
  } else {
    LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, targetServer);
    this.succ = true;
  }
}
{code}
Thread dumps had nothing that could indicate why POST_PEER_MODIFICATION was 
stuck. No INFO logs from the POST_PEER_MODIFICATION step execution either.

Hence, if we could introduce rollback in RefreshPeerProcedure, it would at 
least let the procedure complete with a rollback rather than staying stuck at 
the next step (POST_PEER_MODIFICATION).


was (Author: vjasani):
That is correct, NPE is code bug in the custom replication endpoint, however 
the point i am trying to make is: as soon as this NPE gets reported, 
RefreshPeerProcedure gets completed but not rolled back (rollback is not 
supported). And the next step in the parent procedure i.e. 
POST_PEER_MODIFICATION would stay stuck and it doesn't even get executed. The 
only clue i have is that the previous step of the procedure had above NPE 
reported and it got completed (succ flag is modified to false)

 
{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, 
error);
this.succ = false;
  } else {
LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, 
targetServer);
this.succ = true;
  }
} {code}
 

 

Thread dumps had nothing reported that could indicate why 
POST_PEER_MODIFICATION was stuck.

 

If we could introduce rollback in RefreshPeerProcedure, that could help at 
least complete the procedure with rollback rather than letting it stay stuck at 
next step (POST_PEER_MODIFICATION).

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -
>
> Key: HBASE-27955
> URL: https://issues.apache.org/jira/browse/HBASE-27955
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.14
>Reporter: Viraj Jasani
>Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is either by restarting 
> active master or bypassing the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
> replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on 
> {host},{port},1687053857180 failed
> java.lang.NullPointerException via 
> {host},{port},1687053857180:java.lang.NullPointerException: 
>     at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89) <= replication endpoint failure example
>     at xyz(Abc.java:79)     <= replication endpoint failure example
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> 

[jira] [Commented] (HBASE-27951) Use ADMIN_QOS in MasterRpcServices for regionserver operational dependencies

2023-06-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738765#comment-17738765
 ] 

Hudson commented on HBASE-27951:


Results for branch branch-2.5
[build #375 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}







[jira] [Commented] (HBASE-27948) Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean

2023-06-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738766#comment-17738766
 ] 

Hudson commented on HBASE-27948:


Results for branch branch-2.5
[build #375 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/375/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}







[jira] [Commented] (HBASE-27948) Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean

2023-06-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738749#comment-17738749
 ] 

Hudson commented on HBASE-27948:


Results for branch master
[build #865 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}







[jira] [Commented] (HBASE-27951) Use ADMIN_QOS in MasterRpcServices for regionserver operational dependencies

2023-06-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738748#comment-17738748
 ] 

Hudson commented on HBASE-27951:


Results for branch master
[build #865 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/865/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}







[jira] [Created] (HBASE-27956) Support wall clock profiling in ProfilerServlet

2023-06-29 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-27956:
-

 Summary: Support wall clock profiling in ProfilerServlet
 Key: HBASE-27956
 URL: https://issues.apache.org/jira/browse/HBASE-27956
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


The async-profiler supports profiling wall-clock time, but our ProfilerServlet 
does not recognize that type. When an unrecognized type is passed, it defaults 
to cpu. We should add support for wall-clock profiling, which would be 
triggered via the string "wall".
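The change amounts to mapping the requested type string to an async-profiler event, keeping the default-to-cpu behavior for unknown types. A sketch follows; the event names match async-profiler's common events, but the mapping function itself is illustrative, not the actual ProfilerServlet code.

```java
/** Sketch: map a requested profiler type string to an async-profiler
 *  event, with "wall" newly recognized and unknown types defaulting to
 *  cpu (the current servlet behavior). Illustrative, not actual code. */
public class ProfilerEvent {
  static String eventFor(String requested) {
    if (requested == null) {
      return "cpu";
    }
    switch (requested) {
      case "cpu":
      case "alloc":
      case "lock":
      case "itimer":
        return requested; // already-recognized async-profiler events
      case "wall":
        return "wall";    // new: wall-clock profiling
      default:
        return "cpu";     // preserve the existing fallback
    }
  }

  public static void main(String[] args) {
    System.out.println("wall=" + eventFor("wall"));
    System.out.println("bogus=" + eventFor("bogus"));
  }
}
```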





[jira] [Commented] (HBASE-27947) RegionServer OOM under load when TLS is enabled

2023-06-29 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738710#comment-17738710
 ] 

Bryan Beaudreault commented on HBASE-27947:
---

I have been trying to think about how to solve this without blocking RPC 
handlers.

The problem with just relying on setAutoRead(false) is that it only pauses 
acceptance of new requests into the call queue. There will already be requests 
in progress by RPC handlers, and there could be even more requests queued in 
our call queue. Allowing them to publish to the channel can still result in OOM.

In terms of solving this without blocking RPC handlers, we might need to either 
clear or temporarily invalidate calls in the call queue originating from that 
channel. We could possibly achieve this by having the ServerCall retain a 
reference to the originating ServerRpcConnection. When a handler pulls a call 
from the queue, it checks whether that call's connection.channel is writable. 
If not, it could re-enqueue it, drop it, or maybe close the connection? Not 
sure yet; I'm open to any thoughts. Then the other question is what to do with 
calls that were already in progress when the channel became unwritable. Do we 
need a size-limited per-channel responseQueue?






[jira] [Resolved] (HBASE-27948) Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean

2023-06-29 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27948.
--
Fix Version/s: 2.6.0
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed






[GitHub] [hbase] Apache-HBase commented on pull request #5312: HBASE-27954 Eliminate duplicate code for getNonRootIndexedKey in HFil…

2023-06-29 Thread via GitHub


Apache-HBase commented on PR #5312:
URL: https://github.com/apache/hbase/pull/5312#issuecomment-1612491857

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |::|--:|:|:|
   | +0 :ok: |  reexec  |   0m 28s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  3s |  Unprocessed flag(s): 
--brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list 
--whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 49s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 46s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   4m 40s |  branch has no errors when 
building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 27s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 34s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 48s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 48s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   4m 36s |  patch has no errors when 
building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  the patch passed  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 219m 23s |  hbase-server in the patch passed.  
|
   |  |   | 241m 22s |   |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5312/2/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hbase/pull/5312 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux e23aeef39417 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 
08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / 9e8e43864c |
   | Default Java | Eclipse Adoptium-11.0.17+8 |
   |  Test Results | 
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5312/2/testReport/
 |
   | Max. process+thread count | 4218 (vs. ulimit of 3) |
   | modules | C: hbase-server U: hbase-server |
   | Console output | 
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5312/2/console 
|
   | versions | git=2.34.1 maven=3.8.6 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@hbase.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org