[jira] [Comment Edited] (HBASE-22978) Online slow response log

2020-02-23 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042638#comment-17042638
 ] 

Viraj Jasani edited comment on HBASE-22978 at 2/24/20 6:55 AM:
---

Update on the progress:

I have addressed all concerns in latest reviews, almost ready for commit to 
master branch and then will prepare branch-2 backport. Will create all follow 
up tasks soon.

[~ndimiduk] if you could take one pass, that would be great.


was (Author: vjasani):
Update on the progress:

I have addressed all concerns in latest reviews, almost ready for commit to 
master branch and then will prepare branch-2 backport. Will create all follow 
up tasks soon.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2020-02-10 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034139#comment-17034139
 ] 

Viraj Jasani edited comment on HBASE-22978 at 2/11/20 7:30 AM:
---

Thanks for the review [~ndimiduk]

I would like to bring one of the client side scalability concern raised here 
since as per the earlier discussions on this Jira, we decided to provide filter 
at client side, which is shell for now and in future, operator can also consume 
APIs and filter with multiple params.

 

Copying Andrew's earlier comment:

"Filtering or sorting on other attributes would be done locally in the client. 
I think it best to let the client index the list of ResponseDetail however it 
likes.

The shell commands are one client of the admin APIs. This seems a good place to 
put additional convenience filtering."

 

Although I agree with scalability concern but I believe filter done by 
client(shell) might not be too much of an overhead right? 

 

cc: [~apurtell]


was (Author: vjasani):
Thanks for the review [~ndimiduk]

I would like to bring one of the client side scalability concern raised here 
since as per the earlier discussions on this Jira, we decided to provide filter 
at client side, which is shell for now and in future, operator can also consume 
APIs.

 

Copying Andrew's earlier comment:

"Filtering or sorting on other attributes would be done locally in the client. 
I think it best to let the client index the list of ResponseDetail however it 
likes.

The shell commands are one client of the admin APIs. This seems a good place to 
put additional convenience filtering."

 

Although I agree with scalability concern but I believe filter done by 
client(shell) might not be too much of an overhead right? 

 

cc: [~apurtell]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2020-02-05 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028114#comment-17028114
 ] 

Viraj Jasani edited comment on HBASE-22978 at 2/5/20 8:41 AM:
--

[~stack] [~ndimiduk] Addressed latest review comments.

Please review when you get some time.


was (Author: vjasani):
[~stack] [~ndimiduk] Addressed latest review comments. Please review when you 
get some time.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2020-01-27 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024137#comment-17024137
 ] 

Viraj Jasani edited comment on HBASE-22978 at 1/27/20 9:00 AM:
---

[~stack] agree with you on providing feature turn on/off config, added a commit 
for the same. Please review.

Also, I am very much looking forward to pushing this feature to 2.3.0 and 1.6.0 
releases. Since 2.3.0 is going to happen sooner, soon as I get approval for 
master patch, I will start on backport work for branch-2 and then for branch-1 
after committing to branch-2.

If you have some bandwidth to review this new commit, please review and I will 
start backport work.

[~apurtell] If you also have some bandwidth, please review.

 


was (Author: vjasani):
[~stack] agree with you on providing feature turn on/off config, added a commit 
for the same. Please review.

Also, I am very much looking forward to pushing this feature to 2.3.0 and 1.6.0 
releases. Since 2.3.0 is going to happen sooner, soon as I get approval for 
master patch, I will start on backport work for branch-2 and then for branch-1 
after committing to branch-2.

If you have some bandwidth to review this new commit, please review and I will 
be on backport work.

[~apurtell] If you also have some bandwidth, please review.

 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2020-01-21 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020704#comment-17020704
 ] 

Viraj Jasani edited comment on HBASE-22978 at 1/22/20 6:29 AM:
---

Thanks [~stack]

Sure let me try for 2.3.0 soon after this lands to master. Will address 
comments.

Also, 2.3.0 backport might not take more time compared to 1.6.0 backport, will 
create backport subtask for 1.6.0.

Looking forward to all backports :)


was (Author: vjasani):
Thanks [~stack]

Sure let me try for 2.3.0 soon after we have enough reviews and confidence. 
Will address comments.

Also, 2.3.0 backport might not take more time after trunk commit compared to 
1.6.0 backport, will create backport subtask for 1.6.0.

Looking forward to all backports :)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2020-01-21 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020704#comment-17020704
 ] 

Viraj Jasani edited comment on HBASE-22978 at 1/22/20 2:21 AM:
---

Thanks [~stack]

Sure let me try for 2.3.0 soon after we have enough reviews and confidence. 
Will address comments.

Also, 2.3.0 backport might not take more time after trunk commit compared to 
1.6.0 backport, will create backport subtask for 1.6.0.

Looking forward to all backports :)


was (Author: vjasani):
Thanks [~stack]

Sure let me try for 2.3.0 soon after we have enough reviews and confidence. 
Will address comments.

Also, 2.3.0 backport might not take more time after trunk commit compared to 
1.6.0 backport, will create backport subtask for 1.6.0.

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-11-19 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970182#comment-16970182
 ] 

Viraj Jasani edited comment on HBASE-22978 at 11/19/19 5:34 PM:


Please review the pull request: [https://github.com/apache/hbase/pull/754]


was (Author: vjasani):
Please review the pull request: [https://github.com/apache/hbase/pull/754]

 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-11-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970182#comment-16970182
 ] 

Viraj Jasani edited comment on HBASE-22978 at 11/18/19 2:33 PM:


Please review the pull request: [https://github.com/apache/hbase/pull/754]

 


was (Author: vjasani):
Please review the pull request: [https://github.com/apache/hbase/pull/754]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-11-17 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970182#comment-16970182
 ] 

Viraj Jasani edited comment on HBASE-22978 at 11/18/19 6:19 AM:


Please review the pull request: [https://github.com/apache/hbase/pull/754]


was (Author: vjasani):
Please review the pull request

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-11-08 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956178#comment-16956178
 ] 

Viraj Jasani edited comment on HBASE-22978 at 11/8/19 3:46 PM:
---

With the attached patch, tried to cover ringbuffer, shell scripts to get and 
clear buffers including filters. 

If it is fine, will create subtask for optionally dumping logs to HDFS.

Some screenshots of sample outputs:

[^Screen Shot 2019-10-19 at 2.31.59 AM.png]

[^Screen Shot 2019-10-19 at 2.32.54 AM.png]

[^Screen Shot 2019-10-19 at 2.34.11 AM.png]

[^Screen Shot 2019-10-19 at 2.36.14 AM.png]

 


was (Author: vjasani):
With the attached patch, tried to cover ringbuffer, shell scripts to get and 
clear buffers including filters(except USER filter). 

If it is fine, will create subtask for optionally dumping logs to HDFS. Also, 
just realized, I need to apply USER filter with get_slow_responses(). Will 
update the patch with USER filter.

Some screenshots of sample outputs:

[^Screen Shot 2019-10-19 at 2.31.59 AM.png]

[^Screen Shot 2019-10-19 at 2.32.54 AM.png]

[^Screen Shot 2019-10-19 at 2.34.11 AM.png]

[^Screen Shot 2019-10-19 at 2.36.14 AM.png]

 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-10-24 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958654#comment-16958654
 ] 

Viraj Jasani edited comment on HBASE-22978 at 10/24/19 3:07 PM:


Created PR for review: [https://github.com/apache/hbase/pull/754]


was (Author: vjasani):
Created PR for review: [https://github.com/apache/hbase/pull/754]

Attached patches are outdated

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-10-24 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958654#comment-16958654
 ] 

Viraj Jasani edited comment on HBASE-22978 at 10/24/19 8:36 AM:


Created PR for review: [https://github.com/apache/hbase/pull/754]

Attached patches are outdated


was (Author: vjasani):
Created PR for review: [https://github.com/apache/hbase/pull/754]

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Attachments: HBASE-22978.master.000.patch, 
> HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, 
> Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 
> AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-10-21 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956178#comment-16956178
 ] 

Viraj Jasani edited comment on HBASE-22978 at 10/21/19 3:05 PM:


With the attached patch, tried to cover ringbuffer, shell scripts to get and 
clear buffers including filters(except USER filter). 

If it is fine, will create subtask for optionally dumping logs to HDFS. Also, 
just realized, I need to apply USER filter with get_slow_responses(). Will 
update the patch with USER filter.

Some screenshots of sample outputs:

[^Screen Shot 2019-10-19 at 2.31.59 AM.png]

[^Screen Shot 2019-10-19 at 2.32.54 AM.png]

[^Screen Shot 2019-10-19 at 2.34.11 AM.png]

[^Screen Shot 2019-10-19 at 2.36.14 AM.png]

 


was (Author: vjasani):
With the attached patch, tried to cover ringbuffer, shell scripts to get and 
clear buffers including filters(except USER filter). 

If it is fine, will create subtask for optionally dumping logs to HDFS. Also, 
just realized, I need to apply USER filter with get_slow_responses(). Will 
update the patch later.

Some screenshots of sample outputs:

[^Screen Shot 2019-10-19 at 2.31.59 AM.png]

[^Screen Shot 2019-10-19 at 2.32.54 AM.png]

[^Screen Shot 2019-10-19 at 2.34.11 AM.png]

[^Screen Shot 2019-10-19 at 2.36.14 AM.png]

 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Attachments: HBASE-22978.master.000.patch, 
> HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, 
> Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 
> AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-10-21 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956178#comment-16956178
 ] 

Viraj Jasani edited comment on HBASE-22978 at 10/21/19 3:02 PM:


With the attached patch, tried to cover ringbuffer, shell scripts to get and 
clear buffers including filters(except USER filter). 

If it is fine, will create subtask for optionally dumping logs to HDFS. Also, 
just realized, I need to apply USER filter with get_slow_responses(). Will 
update the patch later.

Some screenshots of sample outputs:

[^Screen Shot 2019-10-19 at 2.31.59 AM.png]

[^Screen Shot 2019-10-19 at 2.32.54 AM.png]

[^Screen Shot 2019-10-19 at 2.34.11 AM.png]

[^Screen Shot 2019-10-19 at 2.36.14 AM.png]

 


was (Author: vjasani):
With the attached patch, tried to cover ringbuffer, shell scripts to get and 
clear buffers including filters(except USER filter). 

If it is fine, will create subtask for optionally dumping logs to HDFS. Also, 
just realized, I need to apply USER filter with get_slow_responses(). Will 
update the patch later.

Some screenshots of sample outputs:

!Screen Shot 2019-10-19 at 2.31.59 AM.png!

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Attachments: HBASE-22978.master.000.patch, 
> HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, 
> Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 
> AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923852#comment-16923852
 ] 

Andrew Purtell edited comment on HBASE-22978 at 9/6/19 1:19 AM:


You can invoke the shell with a command and pipe the output to a file, like 
{{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped 
to something else. Nothing special need be done there. 

I like the idea of persisting the complete slow log with best effort, though. 
Reminds me of the [MySQL slow log dump 
tool|https://dev.mysql.com/doc/refman/8.0/en/mysqldumpslow.html]. Initially I 
was thinking this could be part of general performance surveillance, where 
sampling is good enough, but maybe there could be a tough to debug case that's 
rare so would also be hard to catch in the ring buffers. For that we'd 
configure a slow log directory in site configuration, presumably in HDFS, into 
which regionservers would each append to a file, rolling at some configured 
bound. A tool that decodes and prints to stdout, like HFilePrettyPrinter and 
such, can mostly share common code with what we put in the regionserver to do 
the same out to RPC for the shell. 


was (Author: apurtell):
You can invoke the shell with a command and pipe the output to a file, like 
{{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped 
to something else. Nothing special need be done there. 

I like the idea of persisting the complete slow log with best effort, though. 
Reminds me of the [MySQL slow 
log|https://dev.mysql.com/doc/refman/8.0/en/slow-query-log.html]. Initially I 
was thinking this could be part of general performance surveillance, where 
sampling is good enough, but maybe there could be a tough to debug case that's 
rare so would also be hard to catch in the ring buffers. For that we'd 
configure a slow log directory in site configuration, presumably in HDFS, into 
which regionservers would each append to a file, rolling at some configured 
bound. A tool that decodes and prints to stdout, like HFilePrettyPrinter and 
such, can mostly share common code with what we put in the regionserver to do 
the same out to RPC for the shell. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923772#comment-16923772
 ] 

Andrew Purtell edited comment on HBASE-22978 at 9/5/19 10:37 PM:
-

Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

Pseudocode of this command would be like:
{noformat}
servers = nil
filters = ...
if filters['REGION'] 
servers = locateRegion( filters['REGION'] )
elsif filters['TABLE']
servers = locateRegions( filters['TABLE'] )
results = @admin.getSlowResponses(servers).map { 
// if a returned slow response entry matches any filter, return the 
value, else nil
}.compact // remove any nils
// pretty print 'results'
{noformat}


was (Author: apurtell):
Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

Pseudocode of this command would be like:
{noformat}
servers = []
filters = ...
if filters['REGION'] 
servers = locateRegion( filters['REGION'] )
elsif filters['TABLE']
servers = locateRegions( filters['TABLE'] )
results = @admin.getSlowResponses(servers.empty? ? nil : servers).map { 
// if a returned slow response entry matches any filter, return the 
value, else nil
}.compact // remove any nils
// pretty print 'results'
{noformat}

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923772#comment-16923772
 ] 

Andrew Purtell edited comment on HBASE-22978 at 9/5/19 10:25 PM:
-

Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

Pseudocode of this command would be like:
{noformat}
servers = []
filters = ...
if filters['REGION'] 
servers = locateRegion( filters['REGION'] )
elsif filters['TABLE']
servers = locateRegions( filters['TABLE'] )
results = @admin.getSlowResponses(servers.empty? ? nil : servers).map { 
// if a returned slow response entry matches any filter, return the 
value, else nil
}.compact // remove any nils
// pretty print 'results'
{noformat}


was (Author: apurtell):
Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as