[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215500#comment-17215500 ] Hudson commented on HBASE-22978: Results for branch branch-1 [build #44 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/44//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 source release artifact{color} -- See build output for details. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0-alpha-1, 2.3.0 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0-alpha-1, 2.3.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that >
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052326#comment-17052326 ] Viraj Jasani commented on HBASE-22978: -- {quote}Pushed on master and branch-2. Did not do branch-1 because this patch is not in branch-1 yet? {quote} Thanks [~stack] for the addendum. This is yet to be backported to branch-1, will do it with HBASE-23935 as I want to backport a couple of more subtask commits together in branch-1. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided.
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051156#comment-17051156 ] Hudson commented on HBASE-22978: Results for branch master [build #1651 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1651/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1651//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1651//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1651//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry befor
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050650#comment-17050650 ] Hudson commented on HBASE-22978: Results for branch branch-2 [build #2528 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2528//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050406#comment-17050406 ] Michael Stack commented on HBASE-22978: --- FYI, I addendumed this change to fix errorprone ERROR seen in nightlies {{commit f411e395d36acd8dee4d2545b1483cd95816f710 (HEAD -> m, origin/master, origin/HEAD)}} {{Author: stack }} {{Date: Tue Mar 3 09:15:18 2020 -0800}}{{HBASE-22978 : Online slow response log (#1228)}} {{ Addendum to fix errorprone ERROR}}{{diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java}} {{index 240230eede..6a56c099eb 100644}} {{--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java}} {{+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/slowlog/TestSlowLogRecorder.java}} {{@@ -556,7 +556,8 @@ public class TestSlowLogRecorder {}} {{ .build();}} {{ break;}} {{ }}} {{-}} {{+ default:}} {{+ throw new RuntimeException("Not supposed to get here?");}} {{ }}}{{return message;}} Pushed on master and branch-2. Did not do branch-1 because this patch is not in branch-1 yet? [~vjasani] > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > O
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048417#comment-17048417 ] Hudson commented on HBASE-22978: Results for branch branch-2 [build #2520 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2520//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an ent
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048281#comment-17048281 ] Hudson commented on HBASE-22978: Results for branch master [build #1647 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1647/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1647//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1647//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1647//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047659#comment-17047659 ] Viraj Jasani commented on HBASE-22978: -- Got it, Thanks Sean > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded regi
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046602#comment-17046602 ] Sean Busbey commented on HBASE-22978: - No worries. I can definitely empathize with needing to chase after new reviewers on a long running issue. I think you did a good job of getting plenty of eyes on things; the key bit for me was just making sure someone got to the end and gave sign off. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046458#comment-17046458 ] Viraj Jasani commented on HBASE-22978: -- Sorry for missing this. Will take care of it going forward [~busbey] And yes, many reviewers were involved for different duration hence, many GitHub responses, multiple repetitive conversations etc, and perhaps I requested reviews from too many reviewers, sorry for spamming you all with lots of emails, will be careful going forward. Thanks to all reviewers over Jira and GitHub PR. Let me keep PR open one more day. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046138#comment-17046138 ] Sean Busbey commented on HBASE-22978: - {quote}Planning to commit this after ~2 days unless anyone has any objections.{quote} for future reference, please try to reference having a review done when saying this kind of stuff. I see now that [~bharathv] gave a +1 on the PR, but I've spent the time since your comment fighting GitHub 500 responses to verify that you were following our RTC norm. (Also thanks for pushing on this; I'm excited to see it land.) > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the to
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044779#comment-17044779 ] Viraj Jasani commented on HBASE-22978: -- Planning to commit this after ~2 days unless anyone has any objections. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filte
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042638#comment-17042638 ] Viraj Jasani commented on HBASE-22978: -- Update on the progress: I have addressed all concerns in latest reviews, almost ready for commit to master branch and then will prepare branch-2 backport. Will create all follow up tasks soon. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036470#comment-17036470 ] Andrew Kyle Purtell commented on HBASE-22978: - We are releasing 1.6.0 now because of HBASE-23825, moving to 1.7.0. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGIO
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035657#comment-17035657 ] Nick Dimiduk commented on HBASE-22978: -- For those not following the PR, here's a summary as I see them of the remaining todo items: * log large responses (follow-on task)? ** don’t assume “tooLarge” and “tooSlow” are mutually exclusive. * replicate log entries into a file on HDFS or a new system table (follow-on task)? * expose slow/large responses as a metric (follow-on task)? * do we need to ship with thrift api parity? * push filter and limit logic into the server, so all clients have the same experience ** I think this requires adding a `TooSlowRequest` protobuf message with filter and limit parameters ** Maybe then `SlowLogPayload` is renamed to `SlowLogRecord` or something like that * need separation of concerns between SlowLog.toString and formatting of output in the shell * avoid adding to the surface area of HConstants * proper interrupt handling within SlowLogEventHandler * update configuration adoc to reference the feature adoc * test cleanup ** avoid sleeping for fixed time units, instead use waitFor ** maybe combine some methods containing repeated logic, introduce a helper method or two [https://github.com/apache/hbase/pull/754#issuecomment-585382957] > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row count
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035545#comment-17035545 ] Nick Dimiduk commented on HBASE-22978: -- Thanks for filling me in on the background here [~vjasani]. I admit to focusing more on the PR than all this background discussion here in Jira. bq. Filtering or sorting on other attributes would be done locally in the client. I think it best to let the client index the list of ResponseDetail however it likes. bq. The shell commands are one client of the admin APIs. This seems a good place to put additional convenience filtering. Regardless of who made the comment, I would like to push back against this design decision. Leaving this kind of logic in the shell means it's not available to users of the Admin API. Having this kind of unequal interface across subcomponents makes for a bad user experience, and forces users to re-implement what the shell is doing. Just implement it on the server, make the filter parameters a part of the RPC interface, and everyone enjoys the feature. bq. Although I agree with scalability concern but I believe filter done by client(shell) might not be too much of an overhead right? So picture a cluster of 1000 workers, hosting an application that has gotten out of hand, has pathologically slow or large response access pattern. 256 log entries * 1000 servers. These entries are pretty fat with up to 14 fields and some potentially big ones like {{call_details}} and {{params}}... just save everyone the hassles and run the filter on the server side where the data volume is relatively small. Come to think of it, this should have a built-in default limit of something relatively small... maybe 10? From sitting in the shell, the operator has no idea which RS's might have slow logs, or how many of them, so a naive invocation of the command could potentially surprise and overwhelm them with data. So in addition to a limit param with a default value, maybe we also want the response message from each server to include the number of results that server has. That way the client and response with something like "showing 10/256,000". WDYT? > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write reques
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034139#comment-17034139 ] Viraj Jasani commented on HBASE-22978: -- Thanks for the review [~ndimiduk] I would like to bring one of the client side scalability concern raised here since as per the earlier discussions on this Jira, we decided to provide filter at client side, which is shell for now and in future, operator can also consume APIs. Copying Andrew's earlier comment: "Filtering or sorting on other attributes would be done locally in the client. I think it best to let the client index the list of ResponseDetail however it likes. The shell commands are one client of the admin APIs. This seems a good place to put additional convenience filtering." Although I agree with scalability concern but I believe filter done by client(shell) might not be too much of an overhead right? cc: [~apurtell] > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Pr
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028114#comment-17028114 ] Viraj Jasani commented on HBASE-22978: -- [~stack] [~ndimiduk] Addressed latest review comments. Please review when you get some time. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that tabl
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024140#comment-17024140 ] Viraj Jasani commented on HBASE-22978: -- Moved this back to 2.3.0 and if there is any concern or this causes any unexpected delay, we will move it out of 2.3.0 again. Hopefully, this should be in branch-2 soon :) FYI [~stack] [~ndimiduk] > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE fi
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024137#comment-17024137 ] Viraj Jasani commented on HBASE-22978: -- [~stack] agree with you on providing feature turn on/off config, added a commit for the same. Please review. Also, I am very much looking forward to pushing this feature to 2.3.0 and 1.6.0 releases. Since 2.3.0 is going to happen sooner, soon as I get approval for master patch, I will start on backport work for branch-2 and then for branch-1 after committing to branch-2. If you have some bandwidth to review this new commit, please review and I will be on backport work. [~apurtell] If you also have some bandwidth, please review. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell comm
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022369#comment-17022369 ] Michael Stack commented on HBASE-22978: --- For now, since this is a new feature and its marked minor, I'm moving it out of 2.3.0. We can pull in the backport if done in time. Thanks [~vjasani]. More comments up on the PR. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a st
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020704#comment-17020704 ] Viraj Jasani commented on HBASE-22978: -- Thanks [~stack] Sure let me try for 2.3.0 soon after we have enough reviews and confidence. Will address comments. Also, 2.3.0 backport might not take more time after trunk commit compared to 1.6.0 backport, will create backport subtask for 1.6.0. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters f
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020671#comment-17020671 ] Michael Stack commented on HBASE-22978: --- This supposed to got into hbase-2.3.0? Left some comments on PR > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, whi
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997702#comment-16997702 ] Xu Cang commented on HBASE-22978: - Left some comments in PR. thanks [~vjasani] > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing a
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995759#comment-16995759 ] Viraj Jasani commented on HBASE-22978: -- Please review the patch, it's been a while :) Thanks > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970182#comment-16970182 ] Viraj Jasani commented on HBASE-22978: -- Please review the pull request > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Affects Versions: 3.0.0, 2.3.0, 1.5.1 >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an e
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958654#comment-16958654 ] Viraj Jasani commented on HBASE-22978: -- Created PR for review: [https://github.com/apache/hbase/pull/754] > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Attachments: HBASE-22978.master.000.patch, > HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, > Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 > AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string conta
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956178#comment-16956178 ] Viraj Jasani commented on HBASE-22978: -- With the attached patch, tried to cover ringbuffer, shell scripts to get and clear buffers including filters(except USER filter). If it is fine, will create subtask for optionally dumping logs to HDFS. Also, just realized, I need to apply USER filter with get_slow_responses(). Will update the patch later. Some screenshots of sample outputs: !Screen Shot 2019-10-19 at 2.31.59 AM.png! > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > Attachments: HBASE-22978.master.000.patch, > HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, > Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 > AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the give
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955358#comment-16955358 ] HBase QA commented on HBASE-22978: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 40s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 29s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 5s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 55s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 42s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} refguide {color} | {color:blue} 5m 30s{color} | {color:blue} branch has no errors when building the reference guide. See footer for rendered docs, which you should manually inspect. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 39s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 46s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 13m 18s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 22m 44s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 40s{color} | {color:green} root: The patch generated 0 new + 488 unchanged - 1 fixed = 488 total (was 489) {color} | | {color:red}-1{color} | {color:red} rubocop {color} | {color:red} 0m 18s{color} | {color:red} The patch generated 26 new + 490 unchanged - 0 fixed = 516 total (was 490) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:blue}0{color} | {color:blue} refguide {color} | {color:blue} 5m 36s{color} | {color:blue} patch has no errors when building the reference guide. See footer for rendered docs, which you should manually inspect. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 34s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 15m 26s{color} | {color:green} Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2. {color} | | {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 8m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 54s{color}
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955254#comment-16955254 ] HBase QA commented on HBASE-22978: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 43s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} prototool {color} | {color:blue} 0m 0s{color} | {color:blue} prototool was not available. {color} | | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 32s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 45s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 11s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 57s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} refguide {color} | {color:blue} 6m 16s{color} | {color:blue} branch has no errors when building the reference guide. See footer for rendered docs, which you should manually inspect. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 5m 0s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 45s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 14m 46s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 24m 52s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 6s{color} | {color:green} root: The patch generated 0 new + 488 unchanged - 1 fixed = 488 total (was 489) {color} | | {color:red}-1{color} | {color:red} rubocop {color} | {color:red} 0m 18s{color} | {color:red} The patch generated 39 new + 490 unchanged - 0 fixed = 529 total (was 490) {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:blue}0{color} | {color:blue} refguide {color} | {color:blue} 6m 13s{color} | {color:blue} patch has no errors when building the reference guide. See footer for rendered docs, which you should manually inspect. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 59s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 17m 8s{color} | {color:green} Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2. {color} | | {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 8m 39s{color} | {color:green} the patch passed {color} | | {colo
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948769#comment-16948769 ] Andrew Kyle Purtell commented on HBASE-22978: - [~vjasani] asked me offline if he could work on this. Please, and thank you! > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Kyle Purtell >Assignee: Viraj Jasani >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will > include only entries pertaining to that region. The CLIENT_IP filter, which > expects a string containing an IP address, will include only entries > pertaining to that client. The USER filter, whi
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932179#comment-16932179 ] Andrew Purtell commented on HBASE-22978: If need be but seems premature because the work hasn’t even started yet. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Assignee: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will > include only entries pertaining to that region. The CLIENT_IP filter, which > expects a string containing an IP address, will include only entries > pertaining to that client. The USER filter, which expects a str
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932156#comment-16932156 ] Viraj Jasani commented on HBASE-22978: -- In this case, is it fine to have "optionally dumping logs to HDFS" as a sub-task to this? > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will > include only entries pertaining to that region. The CLIENT_IP filter, which > expects a string containing an IP address, will include only entries > pertaining to that client. The USER filter, which expects a string containing > a user
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930926#comment-16930926 ] Andrew Purtell commented on HBASE-22978: Just to be clear if the HDFS file option is turned on then all history will be there even though it may be expensive to keep all the events - large file sizes even with compression, depending. The shell cannot be expected to process large data efficiently. Some kind of tool that can optionally use MR will be developed. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will >
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930921#comment-16930921 ] Andrew Purtell commented on HBASE-22978: Shell is ring buffers. This is an online slow request log. The goal isn’t to have complete history it is to keep a list of interesting events for as long as we can without consuming too many resources either RAM or files. Having an option also to dump to HDFS is important so we added that as a side goal. The tool that dumps the files will have the same capabilities as the shell because I expect to share common utility classes. However this is not the main goal. The main goal is shell access to ring buffers containing *latest* events of interest, for use in online and active debugging. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally pro
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930655#comment-16930655 ] Sean Busbey commented on HBASE-22978: - the shell command should be ring buffers by default. It can include a pointer to the HDFS location for older entries. an option on the shell command to read from the HDFS location would be fine, but it'll definitely need some filtering options given the size. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will > include only entries pertaining to that region. The CLIENT_IP filter,
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929945#comment-16929945 ] Viraj Jasani commented on HBASE-22978: -- {quote}{{get_slow_responses [ ... , ] [ , \{ } ]}} Retrieve, decode, and pretty print the contents of the too slow response ring buffer maintained by the given list of servers; or all servers in the cluster if no list is provided. Optionally provide a map of parameters for filtering as additional argument. {quote} {quote}I like the idea of persisting the complete slow log with best effort, though. Reminds me of the [MySQL slow log dump tool|https://dev.mysql.com/doc/refman/8.0/en/mysqldumpslow.html]. Initially I was thinking this could be part of general performance surveillance, where sampling is good enough, but maybe there could be a tough to debug case that's rare so would also be hard to catch in the ring buffers. For that we'd configure a slow log directory in site configuration, presumably in HDFS, into which regionservers would each append to a file, rolling at some configured bound. {quote} shell triggered get_slow_responses should retrieve records maintained by ring buffers only? or it should read records from HDFS? If shell reads only the buffer only, we might miss on details of hard to catch rare cases (already written to HDFS)? > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925929#comment-16925929 ] Andrew Purtell commented on HBASE-22978: Going out on vacation for much of September. Unassigning, in case someone else is interested in picking it up in the meantime. I'll assign back to myself and do it upon return otherwise. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will > include only entries pertaining to that region. The CLIENT_IP filter, which > expects a string containing an IP address, will include o
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923853#comment-16923853 ] Sean Busbey commented on HBASE-22978: - bq. You can invoke the shell with a command and pipe the output to a file, like {{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped to something else. Nothing special need be done there. Sure, but each RS writing to HDFS and then using tools to pull all or some binary blobs down to a local machine (or inspecting them on HDFS itself) is going to be less error prone then making things run through a jruby shell. Agree that a general purpose "dump everything over here at best effort" is going to be better for the kind of problems I'm thinking about though. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Assignee: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923852#comment-16923852 ] Andrew Purtell commented on HBASE-22978: You can invoke the shell with a command and pipe the output to a file, like {{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped to something else. Nothing special need be done there. I like the idea of persisting the complete slow log with best effort, though. Reminds me of the [MySQL slow log|https://dev.mysql.com/doc/refman/8.0/en/slow-query-log.html]. Initially I was thinking this could be part of general performance surveillance, where sampling is good enough, but maybe there could be a tough to debug case that's rare so would also be hard to catch in the ring buffers. For that we'd configure a slow log directory in site configuration, presumably in HDFS, into which regionservers would each append to a file, rolling at some configured bound. A tool that decodes and prints to stdout, like HFilePrettyPrinter and such, can mostly share common code with what we put in the regionserver to do the same out to RPC for the shell. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell >Reporter: Andrew Purtell >Assignee: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printi
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923836#comment-16923836 ] Sean Busbey commented on HBASE-22978: - bq. I'll update the description to include write-behind of the ring buffer to a directory in HDFS, but it shouldn't block, so if we stall during writing some ring buffer entries may have been lost. If we can detect that we can log that it happened in the file. oh that's really nice. I just meant like "send the ones that match to this HDFS directory instead of to my hbase shell here" on the assumption that I will eventually run into a case where the buffer is eventually big enough that I will want to look at the results with $EDITOR or other shell tools instead of our shell and I'd rather not have to write a java tool to do it then. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, regionserver, shell >Reporter: Andrew Purtell >Assignee: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or al
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923772#comment-16923772 ] Andrew Purtell commented on HBASE-22978: Description updated. Admin API simplified. The {{get_slow_responses}} shell command is reconsidered as a general query tool. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. > — > New shell commands: > {{get_slow_responses [ ... , ] [ , \{ > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will > include only entries pertaining to that region. The CLIENT_IP filter, which > expects a string containing an IP address, will include only entries > pertaining to that client. The USER filter, which expects a string containing > a user name, will include only entries pertaining to that user. Filters are > additive, for example if both CLIENT_IP and USER filters are given, entries > matching either or both con
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923760#comment-16923760 ] Andrew Purtell commented on HBASE-22978: I'll update the description to include write-behind of the ring buffer to a directory in HDFS, but it shouldn't block, so if we stall during writing some ring buffer entries may have been lost. If we can detect that we can log that it happened in the file. bq. Also saving the user and/or client IP of the request and being able to ask for requests by those would be extra nice Ok, will include user and client IP in the request details set aside. As for query APIs, the admin API is sugar over fan out requests to regionservers for whatever is currently sitting in the ring buffers. Where we want to narrow the search by region or table we can get region locations and prune the regionserver set. Filtering or sorting on other attributes would be done locally in the client. I think it best to let the client index the list of ResponseDetail however it likes. The shell commands are one client of the admin APIs. This seems a good place to put additional convenience filtering. Will update the description for this too. > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > — > New shell commands: > {{get_slow_responses [ , \{ SERVERS=> } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer. Provide a table name as first argument to find all regions and > retrieve too slow response entries for the given table from all servers > currently hosting it. Provide a region name as first argument to retrieve all > too slow response ent
[jira] [Commented] (HBASE-22978) Online slow response log
[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923737#comment-16923737 ] Sean Busbey commented on HBASE-22978: - Can we include the ability to have entries from the buffer(s) dumped to a FileSystem directory? I'd like to have something flexible available for when I need to use this with an outlier case. Also saving the user and/or client IP of the request and being able to ask for requests by those would be extra nice > Online slow response log > > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, regionserver, shell >Reporter: Andrew Purtell >Priority: Minor > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > — > New shell commands: > {{get_slow_responses [ , \{ SERVERS=> } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer. Provide a table name as first argument to find all regions and > retrieve too slow response entries for the given table from all servers > currently hosting it. Provide a region name as first argument to retrieve all > too slow response entries for the given region. Optionally provide a map of > parameters as second argument. The SERVERS parameter, which expects a list of > server names, will constrain the search to the given set of servers. A server > name is its host, port, and start code, e.g. > "host187.example.com,60020,1289493121758". > {{get_slow_responses [ ... , ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers on the cluster > if no argument is provided. A server name is its host, port, and start code, > e.g. "host187.example.com,60020,1289493121758". > {{clear_slow_responses [ ... , ]}} > Clear the too slow response ring buffer maintained