[jira] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-16 Thread Michael Smith (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28595 ]


Michael Smith deleted comment on HBASE-28595:
---

was (Author: JIRAUSER288956):
I saw the OutOfOrderScannerException while working on various iterations of the
reproduction. I think it would cover the case of a successful scan followed by a
lost response: the client would retry, get OutOfOrderScannerException, and
recognize that it needs to reset the scanner.

> Losing exception from scan RPC can lead to partial results
> --
>
> Key: HBASE-28595
> URL: https://issues.apache.org/jira/browse/HBASE-28595
> Project: HBase
>  Issue Type: Bug
>  Components: Client, regionserver, Scanners
>Reporter: Csaba Ringhofer
>Assignee: Csaba Ringhofer
>Priority: Critical
>  Labels: pull-request-available
>
> This was discovered in Apache Impala using an HBase 2.2-based branch of the
> HBase client and server. It is not yet clear whether other branches are also
> affected.
> The issue happens when the server side of the scan throws an exception and
> closes the scanner, but at the same time the client gets an RPC
> connection-closed error and never processes the exception sent by the server.
> The client then assumes a network error, which leads to retrying the RPC
> instead of opening a new scanner. When the retried call reaches the server,
> the server returns an empty ScanResponse instead of an error, so the client
> closes the scanner without surfacing any error and silently returns partial
> results.
> A few pointers to critical parts:
> region server:
> 1st call throws an exception, leading to closing (but not deleting) the scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3539]
> 2nd call (the retry of the 1st) returns empty results:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3403]
> client:
> some exceptions are handled as non-retriable at the RPC level and are only
> handled by opening a new scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java#L214]
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java#L367]
> This mechanism in the client only works if it actually receives the exception
> from the server. If there are connection issues during the RPC, the client
> cannot really know the state of the scanner on the server.





[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-16 Thread Michael Smith (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847021#comment-17847021
 ] 

Michael Smith commented on HBASE-28595:
---

I saw the OutOfOrderScannerException while working on various iterations of the
reproduction. I think it would cover the case of a successful scan followed by a
lost response: the client would retry, get OutOfOrderScannerException, and
recognize that it needs to reset the scanner.
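For reference, the reset described here would look roughly like the sketch below at
the application level. This is only a hedged illustration: the helper class, table
handling and resume-from-last-row logic are made up for the example, and in practice
the recovery lives inside ClientScanner rather than in application code.
{code}
// Rough sketch: on a scanner-level failure such as OutOfOrderScannerNextException,
// drop the scanner and reopen it from the last row actually received instead of
// retrying the same scanner id. Illustrative only; not the HBase-internal handling.
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException;

public final class ScanWithReset {
  static long scanAll(Connection conn, TableName tableName) throws IOException {
    long rows = 0;
    byte[] lastRow = null;            // last row key we actually received
    boolean done = false;
    while (!done) {                   // a real implementation would cap the retries
      Scan scan = new Scan();
      if (lastRow != null) {
        scan.withStartRow(lastRow, false);   // resume just after the last row we saw
      }
      try (Table table = conn.getTable(tableName);
           ResultScanner scanner = table.getScanner(scan)) {
        Result r;
        while ((r = scanner.next()) != null) {
          lastRow = r.getRow();
          rows++;
        }
        done = true;                  // scanner ran to the end without an error
      } catch (OutOfOrderScannerNextException e) {
        // Server-side scanner state no longer matches ours: fall through and
        // reopen a fresh scanner from lastRow on the next loop iteration.
      }
    }
    return rows;
  }
}
{code}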

> Losing exception from scan RPC can lead to partial results
> --
>
> Key: HBASE-28595
> URL: https://issues.apache.org/jira/browse/HBASE-28595
> Project: HBase
>  Issue Type: Bug
>  Components: Client, regionserver, Scanners
>Reporter: Csaba Ringhofer
>Assignee: Csaba Ringhofer
>Priority: Critical
>  Labels: pull-request-available
>
> This was discovered in Apache Impala using an HBase 2.2-based branch of the
> HBase client and server. It is not yet clear whether other branches are also
> affected.
> The issue happens when the server side of the scan throws an exception and
> closes the scanner, but at the same time the client gets an RPC
> connection-closed error and never processes the exception sent by the server.
> The client then assumes a network error, which leads to retrying the RPC
> instead of opening a new scanner. When the retried call reaches the server,
> the server returns an empty ScanResponse instead of an error, so the client
> closes the scanner without surfacing any error and silently returns partial
> results.
> A few pointers to critical parts:
> region server:
> 1st call throws an exception, leading to closing (but not deleting) the scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3539]
> 2nd call (the retry of the 1st) returns empty results:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3403]
> client:
> some exceptions are handled as non-retriable at the RPC level and are only
> handled by opening a new scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java#L214]
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java#L367]
> This mechanism in the client only works if it actually receives the exception
> from the server. If there are connection issues during the RPC, the client
> cannot really know the state of the scanner on the server.
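Because the failure mode is silent (the scan simply ends early with no exception), one
way to make it visible in a test is to compare the scanned row count against the number
of rows that were loaded. A minimal sketch, assuming the expected row count is known and
with the table name passed in by the caller:
{code}
// Sketch of a completeness check: scan the whole table and compare the row count
// against what the test loaded. With this bug the scan ends early without any
// exception, so only a count (or checksum) comparison like this catches it.
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public final class VerifyScanCompleteness {
  static void verify(Connection conn, TableName tableName, long expectedRows)
      throws IOException {
    long seen = 0;
    try (Table table = conn.getTable(tableName);
         ResultScanner scanner = table.getScanner(new Scan())) {
      while (scanner.next() != null) {
        seen++;
      }
    }
    if (seen != expectedRows) {
      throw new IOException("Partial scan: expected " + expectedRows + " rows, got " + seen);
    }
  }
}
{code}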





[jira] [Comment Edited] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Michael Smith (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846755#comment-17846755
 ] 

Michael Smith edited comment on HBASE-28595 at 5/15/24 9:04 PM:


I have a setup that reproduces this issue against release 2.5.8: 
https://github.com/MikaelSmith/hbase/tree/hbase-28595

It differs slightly from the description above in that my demo code triggers
the connection close from the client side, which is what I think we actually
saw happening (usually triggered by Netty). It is still not clear why the
connection was closed; the server side doesn't log anything when it encounters
an exception during a scan, as it expects to be able to send it to the client.
{code}
I0508 19:30:24.107174 64862 ScannerCallable.java:181] Got exception making 
request scanner_id: 13987119624345627690 number_of_rows: 1024 close_scanner: 
false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: 
true track_scan_metrics: false renew: false to region=...
Java exception follows:
org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
host.example.net/169.169.169.169:16020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:333)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:91)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:576)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:42810)
at 
org.apache.hadoop.hbase.client.ScannerCallable.next(ScannerCallable.java:175)
at 
org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:244)
at 
org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
at 
org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:396)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:370)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
at 
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:79)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call 
to host.example.net/169.169.169.169:16020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:206)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:383)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:91)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:414)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:116)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:131)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.cleanupCalls(NettyRpcDuplexHandler.java:203)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelInactive(NettyRpcDuplexHandler.java:211)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
at 

[jira] [Comment Edited] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Michael Smith (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846755#comment-17846755
 ] 

Michael Smith edited comment on HBASE-28595 at 5/15/24 9:00 PM:


I have a setup that reproduces this issue against release 2.5.8: 
https://github.com/MikaelSmith/hbase/tree/hbase-28595

It differs slightly from the description above in that my demo code triggers
the connection close from the client side, which is what I think we actually
saw happening (usually triggered by Netty).
{code}
I0508 19:30:24.107174 64862 ScannerCallable.java:181] Got exception making 
request scanner_id: 13987119624345627690 number_of_rows: 1024 close_scanner: 
false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: 
true track_scan_metrics: false renew: false to region=...
Java exception follows:
org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
host.example.net/169.169.169.169:16020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:333)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:91)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:576)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:42810)
at 
org.apache.hadoop.hbase.client.ScannerCallable.next(ScannerCallable.java:175)
at 
org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:244)
at 
org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
at 
org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:396)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:370)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
at 
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:79)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call 
to host.example.net/169.169.169.169:16020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:206)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:383)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:91)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:414)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:116)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:131)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.cleanupCalls(NettyRpcDuplexHandler.java:203)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelInactive(NettyRpcDuplexHandler.java:211)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
   

[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

2024-05-15 Thread Michael Smith (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846755#comment-17846755
 ] 

Michael Smith commented on HBASE-28595:
---

I have a setup that reproduces this issue: 
https://github.com/MikaelSmith/hbase/tree/hbase-28595

It differs slightly from the description above in that my demo code triggers
the connection close from the client side, which is what I think we actually
saw happening (usually triggered by Netty).
{code}
I0508 19:30:24.107174 64862 ScannerCallable.java:181] Got exception making 
request scanner_id: 13987119624345627690 number_of_rows: 1024 close_scanner: 
false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: 
true track_scan_metrics: false renew: false to region=...
Java exception follows:
org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
host.example.net/169.169.169.169:16020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:333)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:91)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:576)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:42810)
at 
org.apache.hadoop.hbase.client.ScannerCallable.next(ScannerCallable.java:175)
at 
org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:244)
at 
org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
at 
org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:396)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:370)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
at 
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:79)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call 
to host.example.net/169.169.169.169:16020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:206)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:383)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:91)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:414)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:116)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:131)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.cleanupCalls(NettyRpcDuplexHandler.java:203)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelInactive(NettyRpcDuplexHandler.java:211)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at 

[jira] [Created] (HBASE-12097) Blocked threads on hbase, slowly increasing, appears to be updating metrics

2014-09-25 Thread Michael Smith (JIRA)
Michael Smith created HBASE-12097:
-

 Summary: Blocked threads on hbase, slowly increasing, appears to 
be updating metrics
 Key: HBASE-12097
 URL: https://issues.apache.org/jira/browse/HBASE-12097
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.6
 Environment: RHEL 6.2, CDH 4.3.0.0
Reporter: Michael Smith


HBase shows an increasing number of IPC threads in BLOCKED state.

Hundreds of these appear, with more and more over hours; performance degrades,
requiring a regionserver restart to restore it.

Thread:

Thread 421 (IPC Server handler 368 on 60201):
  State: BLOCKED
  Blocked count: 19314
  Waited count: 322565
  Blocked on org.apache.hadoop.metrics.util.MetricsIntValue@1ec5ca55
  Blocked by 236 (IPC Server handler 183 on 60201)
  Stack:
org.apache.hadoop.metrics.util.MetricsIntValue.set(MetricsIntValue.java:73)
org.apache.hadoop.hbase.ipc.HBaseServer.updateCallQueueLenMetrics(HBaseServer.java:1360)
   org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1399)

I don't actually know how to troubleshoot this much further... Happy to take
suggestions.
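For context on what that stack implies: every IPC handler funnels through a synchronized
setter on a single shared metrics object, so under load the handlers queue up on that
monitor and show as BLOCKED in the dump. A deliberately simplified illustration of the
pattern (not the actual Hadoop metrics code; the class and field names are made up):
{code}
// Stand-in for the contention pattern above: one gauge with a synchronized set(),
// updated by every "handler" thread. A thread dump taken while this runs shows
// most threads BLOCKED on the gauge's monitor, much like the regionserver dump.
public class MetricsContentionDemo {

  static final class SynchronizedGauge {
    private int value;
    synchronized void set(int v) { value = v; }   // every caller contends on this lock
  }

  static final SynchronizedGauge callQueueLen = new SynchronizedGauge();

  public static void main(String[] args) {
    for (int i = 0; i < 200; i++) {                // simulate many IPC handler threads
      new Thread(() -> {
        for (int n = 0; n < 1_000_000; n++) {
          callQueueLen.set(n);                     // one update per simulated request
        }
      }, "handler-" + i).start();
    }
  }
}
{code}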





[jira] [Updated] (HBASE-12097) Blocked threads on hbase, slowly increasing, appears to be updating metrics

2014-09-25 Thread Michael Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Smith updated HBASE-12097:
--
Attachment: total_blocks_versus_threads-week.png

The attached graph shows the increasing number of blocked threads. The massive
drops are when we do rolling restarts of all the regionservers; you then see
the number of blocked threads slowly starting to grow again.

We determine the number of blocked threads by hitting the web interface, i.e.
host:60030/dump, and counting the threads listed with 'State: BLOCKED' (a
sketch of that counting follows below).

Analysis of the blocked threads has revealed they are blocked on updating metrics.
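A rough sketch of that counting, assuming the dump format shown in the description
('State: BLOCKED' on each blocked thread) and with the host name left as a placeholder:
{code}
// Fetch the regionserver debug dump and count threads reported as BLOCKED.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CountBlockedThreads {
  public static void main(String[] args) throws IOException {
    URL dump = new URL("http://host:60030/dump");   // placeholder regionserver host
    int blocked = 0;
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(dump.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.contains("State: BLOCKED")) {      // matches the dump format above
          blocked++;
        }
      }
    }
    System.out.println("Blocked handler threads: " + blocked);
  }
}
{code}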

 Blocked threads on hbase, slowly increasing, appears to be updating metrics
 ---

 Key: HBASE-12097
 URL: https://issues.apache.org/jira/browse/HBASE-12097
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.6
 Environment: RHEL 6.2, CDH 4.3.0.0
Reporter: Michael Smith
 Attachments: total_blocks_versus_threads-week.png


 HBase shows an increasing number of IPC threads in BLOCKED state.
 Hundreds of these appear, with more and more over hours; performance degrades,
 requiring a regionserver restart to restore it.
 Thread:
 Thread 421 (IPC Server handler 368 on 60201):
   State: BLOCKED
   Blocked count: 19314
   Waited count: 322565
   Blocked on org.apache.hadoop.metrics.util.MetricsIntValue@1ec5ca55
   Blocked by 236 (IPC Server handler 183 on 60201)
   Stack:
 
 org.apache.hadoop.metrics.util.MetricsIntValue.set(MetricsIntValue.java:73)
 org.apache.hadoop.hbase.ipc.HBaseServer.updateCallQueueLenMetrics(HBaseServer.java:1360)
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1399)
 I don't actually know how to troubleshoot this much further... Happy to take
 suggestions.


