subject:"\[jira\] \[Commented\] \(HBASE\-22978\) Online slow response log"

[jira] [Commented] (HBASE-22978) Online slow response log

2020-10-16 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215500#comment-17215500
 ] 

Hudson commented on HBASE-22978:


Results for branch branch-1
[build #44 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44//General_Nightly_Build_Report/]


(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44//JDK8_Nightly_Build_Report_(Hadoop2)/]




(x) {color:red}-1 source release artifact{color}
-- See build output for details.


> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0-alpha-1, 2.3.0
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.3.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
>

[jira] [Commented] (HBASE-22978) Online slow response log

2020-03-05 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052326#comment-17052326
 ] 

Viraj Jasani commented on HBASE-22978:
--

{quote}Pushed on master and branch-2. Did not do branch-1 because this patch is 
not in branch-1 yet?
{quote}
Thanks [~stack] for the addendum. This is yet to be backported to branch-1, 
will do it with HBASE-23935 as I want to backport a couple of more subtask 
commits together in branch-1.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided.

[jira] [Commented] (HBASE-22978) Online slow response log

2020-03-04 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051156#comment-17051156
 ] 

Hudson commented on HBASE-22978:


Results for branch master
[build #1651 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1651/]: (x) 
*{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1651//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1651//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1651//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry befor

[jira] [Commented] (HBASE-22978) Online slow response log

2020-03-03 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050650#comment-17050650
 ] 

Hudson commented on HBASE-22978:


Results for branch branch-2
[build #2528 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an

[jira] [Commented] (HBASE-22978) Online slow response log

2020-03-03 Thread Michael Stack (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050406#comment-17050406
 ] 

Michael Stack commented on HBASE-22978:
---

FYI, I addendumed this change to fix errorprone ERROR seen in nightlies

 

{{commit f411e395d36acd8dee4d2545b1483cd95816f710 (HEAD -> m, origin/master, 
origin/HEAD)}}
{{Author: stack }}
{{Date: Tue Mar 3 09:15:18 2020 -0800}}{{HBASE-22978 : Online slow response log 
(#1228)}}
{{ Addendum to fix errorprone ERROR}}{{diff --git 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java
 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java}}
{{index 240230eede..6a56c099eb 100644}}
{{--- 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java}}
{{+++ 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java}}
{{@@ -556,7 +556,8 @@ public class TestSlowLogRecorder {}}
{{ .build();}}
{{ break;}}
{{ }}}
{{-}}
{{+ default:}}
{{+ throw new RuntimeException("Not supposed to get here?");}}
{{ }}}{{return message;}}

 

Pushed on master and branch-2. Did not do branch-1 because this patch is not in 
branch-1 yet? [~vjasani]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> O

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-29 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048417#comment-17048417
 ] 

Hudson commented on HBASE-22978:


Results for branch branch-2
[build #2520 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an ent

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-29 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048281#comment-17048281
 ] 

Hudson commented on HBASE-22978:


Results for branch master
[build #1647 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1647/]: (x) 
*{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1647//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1647//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1647//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-28 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047659#comment-17047659
 ] 

Viraj Jasani commented on HBASE-22978:
--

Got it, Thanks Sean

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded regi

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-27 Thread Sean Busbey (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046602#comment-17046602
 ] 

Sean Busbey commented on HBASE-22978:
-

No worries. I can definitely empathize with needing to chase after new 
reviewers on a long running issue. I think you did a good job of getting plenty 
of eyes on things; the key bit for me was just making sure someone got to the 
end and gave sign off.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-27 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046458#comment-17046458
 ] 

Viraj Jasani commented on HBASE-22978:
--

Sorry for missing this. Will take care of it going forward [~busbey]

And yes, many reviewers were involved for different duration hence, many GitHub 
responses, multiple repetitive conversations etc, and perhaps I requested 
reviews from too many reviewers, sorry for spamming you all with lots of 
emails, will be careful going forward.

Thanks to all reviewers over Jira and GitHub PR. Let me keep PR open one more 
day.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-26 Thread Sean Busbey (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046138#comment-17046138
 ] 

Sean Busbey commented on HBASE-22978:
-

{quote}Planning to commit this after ~2 days unless anyone has any 
objections.{quote}

for future reference, please try to reference having a review done when saying 
this kind of stuff. I see now that [~bharathv] gave a +1 on the PR, but I've 
spent the time since your comment fighting GitHub 500 responses to verify that 
you were following our RTC norm.

(Also thanks for pushing on this; I'm excited to see it land.)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the to

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-25 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044779#comment-17044779
 ] 

Viraj Jasani commented on HBASE-22978:
--

Planning to commit this after ~2 days unless anyone has any objections.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filte

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-22 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042638#comment-17042638
 ] 

Viraj Jasani commented on HBASE-22978:
--

Update on the progress:

I have addressed all concerns in latest reviews, almost ready for commit to 
master branch and then will prepare branch-2 backport. Will create all follow 
up tasks soon.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-13 Thread Andrew Kyle Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036470#comment-17036470
 ] 

Andrew Kyle Purtell commented on HBASE-22978:
-

We are releasing 1.6.0 now because of HBASE-23825, moving to 1.7.0.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGIO

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-12 Thread Nick Dimiduk (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035657#comment-17035657
 ] 

Nick Dimiduk commented on HBASE-22978:
--

For those not following the PR, here's a summary as I see them of the remaining 
todo items:
 * log large responses (follow-on task)?
 ** don’t assume “tooLarge” and “tooSlow” are mutually exclusive.
 * replicate log entries into a file on HDFS or a new system table (follow-on 
task)?
 * expose slow/large responses as a metric (follow-on task)?
 * do we need to ship with thrift api parity?
 * push filter and limit logic into the server, so all clients have the same 
experience
 ** I think this requires adding a `TooSlowRequest` protobuf message with 
filter and limit parameters
 ** Maybe then `SlowLogPayload` is renamed to `SlowLogRecord` or something like 
that
 * need separation of concerns between SlowLog.toString and formatting of 
output in the shell
 * avoid adding to the surface area of HConstants
 * proper interrupt handling within SlowLogEventHandler
 * update configuration adoc to reference the feature adoc
 * test cleanup
 ** avoid sleeping for fixed time units, instead use waitFor
 ** maybe combine some methods containing repeated logic, introduce a helper 
method or two

[https://github.com/apache/hbase/pull/754#issuecomment-585382957]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row count

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-12 Thread Nick Dimiduk (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035545#comment-17035545
 ] 

Nick Dimiduk commented on HBASE-22978:
--

Thanks for filling me in on the background here [~vjasani]. I admit to focusing 
more on the PR than all this background discussion here in Jira.

bq. Filtering or sorting on other attributes would be done locally in the 
client. I think it best to let the client index the list of ResponseDetail 
however it likes.
bq. The shell commands are one client of the admin APIs. This seems a good 
place to put additional convenience filtering.

Regardless of who made the comment, I would like to push back against this 
design decision. Leaving this kind of logic in the shell means it's not 
available to users of the Admin API. Having this kind of unequal interface 
across subcomponents makes for a bad user experience, and forces users to 
re-implement what the shell is doing. Just implement it on the server, make the 
filter parameters a part of the RPC interface, and everyone enjoys the feature.

bq. Although I agree with scalability concern but I believe filter done by 
client(shell) might not be too much of an overhead right?

So picture a cluster of 1000 workers, hosting an application that has gotten 
out of hand, has pathologically slow or large response access pattern. 256 log 
entries * 1000 servers. These entries are pretty fat with up to 14 fields and 
some potentially big ones like {{call_details}} and {{params}}... just save 
everyone the hassles and run the filter on the server side where the data 
volume is relatively small.

Come to think of it, this should have a built-in default limit of something 
relatively small... maybe 10? From sitting in the shell, the operator has no 
idea which RS's might have slow logs, or how many of them, so a naive 
invocation of the command could potentially surprise and overwhelm them with 
data. So in addition to a limit param with a default value, maybe we also want 
the response message from each server to include the number of results that 
server has. That way the client and response with something like "showing 
10/256,000".

WDYT?

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write reques

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-10 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034139#comment-17034139
 ] 

Viraj Jasani commented on HBASE-22978:
--

Thanks for the review [~ndimiduk]

I would like to bring one of the client side scalability concern raised here 
since as per the earlier discussions on this Jira, we decided to provide filter 
at client side, which is shell for now and in future, operator can also consume 
APIs.

 

Copying Andrew's earlier comment:

"Filtering or sorting on other attributes would be done locally in the client. 
I think it best to let the client index the list of ResponseDetail however it 
likes.

The shell commands are one client of the admin APIs. This seems a good place to 
put additional convenience filtering."

 

Although I agree with scalability concern but I believe filter done by 
client(shell) might not be too much of an overhead right? 

 

cc: [~apurtell]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Pr

[jira] [Commented] (HBASE-22978) Online slow response log

2020-02-01 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028114#comment-17028114
 ] 

Viraj Jasani commented on HBASE-22978:
--

[~stack] [~ndimiduk] Addressed latest review comments. Please review when you 
get some time.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that tabl

[jira] [Commented] (HBASE-22978) Online slow response log

2020-01-27 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024140#comment-17024140
 ] 

Viraj Jasani commented on HBASE-22978:
--

Moved this back to 2.3.0 and if there is any concern or this causes any 
unexpected delay, we will move it out of 2.3.0 again. Hopefully, this should be 
in branch-2 soon :)

FYI [~stack] [~ndimiduk]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE fi

[jira] [Commented] (HBASE-22978) Online slow response log

2020-01-26 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024137#comment-17024137
 ] 

Viraj Jasani commented on HBASE-22978:
--

[~stack] agree with you on providing feature turn on/off config, added a commit 
for the same. Please review.

Also, I am very much looking forward to pushing this feature to 2.3.0 and 1.6.0 
releases. Since 2.3.0 is going to happen sooner, soon as I get approval for 
master patch, I will start on backport work for branch-2 and then for branch-1 
after committing to branch-2.

If you have some bandwidth to review this new commit, please review and I will 
be on backport work.

[~apurtell] If you also have some bandwidth, please review.

 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell comm

[jira] [Commented] (HBASE-22978) Online slow response log

2020-01-23 Thread Michael Stack (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022369#comment-17022369
 ] 

Michael Stack commented on HBASE-22978:
---

For now, since this is a new feature and its marked minor, I'm moving it out of 
2.3.0. We can pull in the backport if done in time. Thanks [~vjasani]. More 
comments up on the PR.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a st

[jira] [Commented] (HBASE-22978) Online slow response log

2020-01-21 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020704#comment-17020704
 ] 

Viraj Jasani commented on HBASE-22978:
--

Thanks [~stack]

Sure let me try for 2.3.0 soon after we have enough reviews and confidence. 
Will address comments.

Also, 2.3.0 backport might not take more time after trunk commit compared to 
1.6.0 backport, will create backport subtask for 1.6.0.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters f

[jira] [Commented] (HBASE-22978) Online slow response log

2020-01-21 Thread Michael Stack (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020671#comment-17020671
 ] 

Michael Stack commented on HBASE-22978:
---

This supposed to got into hbase-2.3.0? Left some comments on PR

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, whi

[jira] [Commented] (HBASE-22978) Online slow response log

2019-12-16 Thread Xu Cang (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997702#comment-16997702
 ] 

Xu Cang commented on HBASE-22978:
-

Left some comments in PR. thanks [~vjasani]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing a

[jira] [Commented] (HBASE-22978) Online slow response log

2019-12-13 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995759#comment-16995759
 ] 

Viraj Jasani commented on HBASE-22978:
--

Please review the patch, it's been a while :)

Thanks

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a

[jira] [Commented] (HBASE-22978) Online slow response log

2019-11-08 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970182#comment-16970182
 ] 

Viraj Jasani commented on HBASE-22978:
--

Please review the pull request

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an e

[jira] [Commented] (HBASE-22978) Online slow response log

2019-10-24 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958654#comment-16958654
 ] 

Viraj Jasani commented on HBASE-22978:
--

Created PR for review: [https://github.com/apache/hbase/pull/754]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Attachments: HBASE-22978.master.000.patch, 
> HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, 
> Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 
> AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string conta

[jira] [Commented] (HBASE-22978) Online slow response log

2019-10-21 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956178#comment-16956178
 ] 

Viraj Jasani commented on HBASE-22978:
--

With the attached patch, tried to cover ringbuffer, shell scripts to get and 
clear buffers including filters(except USER filter). 

If it is fine, will create subtask for optionally dumping logs to HDFS. Also, 
just realized, I need to apply USER filter with get_slow_responses(). Will 
update the patch later.

Some screenshots of sample outputs:

!Screen Shot 2019-10-19 at 2.31.59 AM.png!

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Attachments: HBASE-22978.master.000.patch, 
> HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, 
> Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 
> AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the give

[jira] [Commented] (HBASE-22978) Online slow response log

2019-10-19 Thread HBase QA (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955358#comment-16955358
 ] 

HBase QA commented on HBASE-22978:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
40s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
1s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} prototool {color} | {color:blue}  0m  
0s{color} | {color:blue} prototool was not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
29s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
 5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
55s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
42s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  5m 
30s{color} | {color:blue} branch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
39s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  4m 
46s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 13m 
18s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 22m 
44s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  2m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
40s{color} | {color:green} root: The patch generated 0 new + 488 unchanged - 1 
fixed = 488 total (was 489) {color} |
| {color:red}-1{color} | {color:red} rubocop {color} | {color:red}  0m 
18s{color} | {color:red} The patch generated 26 new + 490 unchanged - 0 fixed = 
516 total (was 490) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  5m 
36s{color} | {color:blue} patch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
34s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
15m 26s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.8.5 2.9.2 or 3.1.2. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green}  
8m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  4m 
54s{color}

[jira] [Commented] (HBASE-22978) Online slow response log

2019-10-19 Thread HBase QA (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955254#comment-16955254
 ] 

HBase QA commented on HBASE-22978:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
43s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
1s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} prototool {color} | {color:blue}  0m  
0s{color} | {color:blue} prototool was not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
32s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
57s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  6m 
16s{color} | {color:blue} branch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
 0s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  4m 
45s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 14m 
46s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 24m 
52s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  3m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
 6s{color} | {color:green} root: The patch generated 0 new + 488 unchanged - 1 
fixed = 488 total (was 489) {color} |
| {color:red}-1{color} | {color:red} rubocop {color} | {color:red}  0m 
18s{color} | {color:red} The patch generated 39 new + 490 unchanged - 0 fixed = 
529 total (was 490) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  6m 
13s{color} | {color:blue} patch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
59s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
17m  8s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.8.5 2.9.2 or 3.1.2. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green}  
8m 39s{color} | {color:green} the patch passed {color} |
| {colo

[jira] [Commented] (HBASE-22978) Online slow response log

2019-10-10 Thread Andrew Kyle Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948769#comment-16948769
 ] 

Andrew Kyle Purtell commented on HBASE-22978:
-

[~vjasani] asked me offline if he could work on this. Please, and thank you!

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, whi

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-18 Thread Andrew Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932179#comment-16932179
 ] 

Andrew Purtell commented on HBASE-22978:


If need be but seems premature because the work hasn’t even started yet. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a str

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-18 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932156#comment-16932156
 ] 

Viraj Jasani commented on HBASE-22978:
--

In this case, is it fine to have "optionally dumping logs to HDFS" as a 
sub-task to this?

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 
> a user

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-16 Thread Andrew Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930926#comment-16930926
 ] 

Andrew Purtell commented on HBASE-22978:


Just to be clear if the HDFS file option is turned on then all history will be 
there even though it may be expensive to keep all the events - large file sizes 
even with compression, depending. The shell cannot be expected to process large 
data efficiently. Some kind of tool that can optionally use MR will be 
developed. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
>

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-16 Thread Andrew Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930921#comment-16930921
 ] 

Andrew Purtell commented on HBASE-22978:


Shell is ring buffers. This is an online slow request log. The goal isn’t to 
have complete history it is to keep a list of interesting events for as long as 
we can without consuming too many resources either RAM or files. 

Having an option also to dump to HDFS is important so we added that as a side 
goal. The tool that dumps the files will have the same capabilities as the 
shell because I expect to share common utility classes. However this is not the 
main goal. The main goal is shell access to ring buffers containing *latest* 
events of interest, for use in online and active debugging. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally pro

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-16 Thread Sean Busbey (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930655#comment-16930655
 ] 

Sean Busbey commented on HBASE-22978:
-

the shell command should be ring buffers by default. It can include a pointer 
to the HDFS location for older entries.

an option on the shell command to read from the HDFS location would be fine, 
but it'll definitely need some filtering options given the size.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter,

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-15 Thread Viraj Jasani (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929945#comment-16929945
 ] 

Viraj Jasani commented on HBASE-22978:
--

{quote}{{get_slow_responses [  ... ,  ] [ , \{ 
 } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
additional argument.
{quote}
 
{quote}I like the idea of persisting the complete slow log with best effort, 
though. Reminds me of the [MySQL slow log dump 
tool|https://dev.mysql.com/doc/refman/8.0/en/mysqldumpslow.html]. Initially I 
was thinking this could be part of general performance surveillance, where 
sampling is good enough, but maybe there could be a tough to debug case that's 
rare so would also be hard to catch in the ring buffers. For that we'd 
configure a slow log directory in site configuration, presumably in HDFS, into 
which regionservers would each append to a file, rolling at some configured 
bound.
{quote}
shell triggered get_slow_responses should retrieve records maintained by ring 
buffers only? or it should read records from HDFS?

If shell reads only the buffer only, we might miss on details of hard to catch 
rare cases (already written to HDFS)?

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-09 Thread Andrew Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925929#comment-16925929
 ] 

Andrew Purtell commented on HBASE-22978:


Going out on vacation for much of September. Unassigning, in case someone else 
is interested in picking it up in the meantime. I'll assign back to myself and 
do it upon return otherwise. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include o

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Sean Busbey (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923853#comment-16923853
 ] 

Sean Busbey commented on HBASE-22978:
-

bq. You can invoke the shell with a command and pipe the output to a file, like 
{{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped 
to something else. Nothing special need be done there.

Sure, but each RS writing to HDFS and then using tools to pull all or some 
binary blobs down to a local machine (or inspecting them on HDFS itself) is 
going to be less error prone then making things run through a jruby shell.

Agree that a general purpose "dump everything over here at best effort" is 
going to be better for the kind of problems I'm thinking about though.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923852#comment-16923852
 ] 

Andrew Purtell commented on HBASE-22978:


You can invoke the shell with a command and pipe the output to a file, like 
{{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped 
to something else. Nothing special need be done there. 

I like the idea of persisting the complete slow log with best effort, though. 
Reminds me of the [MySQL slow 
log|https://dev.mysql.com/doc/refman/8.0/en/slow-query-log.html]. Initially I 
was thinking this could be part of general performance surveillance, where 
sampling is good enough, but maybe there could be a tough to debug case that's 
rare so would also be hard to catch in the ring buffers. For that we'd 
configure a slow log directory in site configuration, presumably in HDFS, into 
which regionservers would each append to a file, rolling at some configured 
bound. A tool that decodes and prints to stdout, like HFilePrettyPrinter and 
such, can mostly share common code with what we put in the regionserver to do 
the same out to RPC for the shell. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printi

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Sean Busbey (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923836#comment-16923836
 ] 

Sean Busbey commented on HBASE-22978:
-

bq. I'll update the description to include write-behind of the ring buffer to a 
directory in HDFS, but it shouldn't block, so if we stall during writing some 
ring buffer entries may have been lost. If we can detect that we can log that 
it happened in the file.

oh that's really nice. I just meant like "send the ones that match to this HDFS 
directory instead of to my hbase shell here" on the assumption that I will 
eventually run into a case where the buffer is eventually big enough that I 
will want to look at the results with $EDITOR or other shell tools instead of 
our shell and I'd rather not have to write a java tool to do it then.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or al

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923772#comment-16923772
 ] 

Andrew Purtell commented on HBASE-22978:


Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 
> a user name, will include only entries pertaining to that user. Filters are 
> additive, for example if both CLIENT_IP and USER filters are given, entries 
> matching either or both con

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923760#comment-16923760
 ] 

Andrew Purtell commented on HBASE-22978:


I'll update the description to include write-behind of the ring buffer to a 
directory in HDFS, but it shouldn't block, so if we stall during writing some 
ring buffer entries may have been lost. If we can detect that we can log that 
it happened in the file. 

bq. Also saving the user and/or client IP of the request and being able to ask 
for requests by those would be extra nice

Ok, will include user and client IP in the request details set aside. 

As for query APIs, the admin API is sugar over fan out requests to 
regionservers for whatever is currently sitting in the ring buffers. Where we 
want to narrow the search by region or table we can get region locations and 
prune the regionserver set. Filtering or sorting on other attributes would be 
done locally in the client. I think it best to let the client index the list of 
ResponseDetail however it likes. 

The shell commands are one client of the admin APIs. This seems a good place to 
put additional convenience filtering. Will update the description for this too. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> —
> New shell commands:
> {{get_slow_responses  [ , \{ SERVERS=> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer. Provide a table name as first argument to find all regions and 
> retrieve too slow response entries for the given table from all servers 
> currently hosting it. Provide a region name as first argument to retrieve all 
> too slow response ent

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Sean Busbey (Jira)



[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923737#comment-16923737
 ] 

Sean Busbey commented on HBASE-22978:
-

Can we include the ability to have entries from the buffer(s) dumped to a 
FileSystem directory? I'd like to have something flexible available for when I 
need to use this with an outlier case.


Also saving the user and/or client IP of the request and being able to ask for 
requests by those would be extra nice

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> —
> New shell commands:
> {{get_slow_responses  [ , \{ SERVERS=> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer. Provide a table name as first argument to find all regions and 
> retrieve too slow response entries for the given table from all servers 
> currently hosting it. Provide a region name as first argument to retrieve all 
> too slow response entries for the given region. Optionally provide a map of 
> parameters as second argument. The SERVERS parameter, which expects a list of 
> server names, will constrain the search to the given set of servers. A server 
> name is its host, port, and start code, e.g. 
> "host187.example.com,60020,1289493121758".
> {{get_slow_responses [  ... ,  ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers on the cluster 
> if no argument is provided. A server name is its host, port, and start code, 
> e.g. "host187.example.com,60020,1289493121758".
> {{clear_slow_responses [  ... ,  ]}}
> Clear the too slow response ring buffer maintained

44 matches

Mail list logo