[jira] (HBASE-28595) Losing exception from scan RPC can lead to partial results

     [ https://issues.apache.org/jira/browse/HBASE-28595 ]

Michael Smith deleted comment on HBASE-28595:
---

    was (Author: JIRAUSER288956):
I saw the OutOfOrderScannerException while working on various iterations of reproduction. I think it would cover the case of a successful scan then lost response; the client would retry, get OutOfOrderScannerException, and recognize it needs to reset the scanner.

> Losing exception from scan RPC can lead to partial results
> --
>
>                 Key: HBASE-28595
>                 URL: https://issues.apache.org/jira/browse/HBASE-28595
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, regionserver, Scanners
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Critical
>              Labels: pull-request-available
>
> This was discovered in Apache Impala using HBase 2.2 based branch hbase
> client and server. It is not clear yet whether other branches are also
> affected.
> The issue happens if the server side of the scan throws an exception and
> closes the scanner, but at the same time, the client gets an rpc connection
> closed error and doesn't process the exception sent by the server. Client
> then thinks it got a network error, which leads to retrying the RPC instead
> of opening a new scanner. But then when the client retry reaches the server,
> the server returns an empty ScanResponse instead of an error, leading to
> closing the scanner on client side without returning any error.
> A few pointers to critical parts:
> region server:
> 1st call throws exception leading to closing (but not deleting) scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3539]
> 2nd call (retry of 1st) returns empty results:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3403]
> client:
> some exceptions are handled as non-retriable at RPC level and are only
> handled through opening a new scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java#L214]
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java#L367]
> This mechanism in the client only works if it gets the exception from the
> server. If there are connection issues during the RPC then the client won't
> really know the state of the server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
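
The sequence in the quoted issue description (server throws and closes the scanner, the exception never reaches the client, the client retries, the server answers with an empty response) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyServer`, `ServerScanner`, `clientScan` and the `exceptionLost` flag are hypothetical names invented for illustration and are not HBase classes, and an empty batch standing for "end of scan" is a simplification of the real ScanResponse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the lost-exception scenario; every name here is hypothetical, not HBase API. */
public class LostScanExceptionSketch {

    /** Server-side scanner state: a failed scan marks it closed but leaves it registered. */
    static final class ServerScanner {
        final List<String> rows;
        int position = 0;
        boolean closed = false;

        ServerScanner(List<String> rows) {
            this.rows = rows;
        }
    }

    /** Stand-in for the region server's scan RPC handler. */
    static final class ToyServer {
        final Map<Long, ServerScanner> scanners = new HashMap<>();
        int calls = 0;
        int failOnCall = -1; // 1-based call number that should fail; -1 means never

        List<String> scan(long scannerId, int limit) {
            calls++;
            ServerScanner s = scanners.get(scannerId);
            if (s == null) {
                throw new IllegalStateException("unknown scanner " + scannerId);
            }
            if (s.closed) {
                // The retry lands here: an empty response instead of an error.
                return new ArrayList<>();
            }
            if (calls == failOnCall) {
                s.closed = true; // closed, but not removed from the map
                throw new IllegalStateException("scan failed on server");
            }
            int end = Math.min(s.position + limit, s.rows.size());
            List<String> batch = new ArrayList<>(s.rows.subList(s.position, end));
            s.position = end;
            return batch;
        }
    }

    /** Client loop; an empty batch is treated as end of scan (a simplification). */
    static List<String> clientScan(ToyServer server, long scannerId, boolean exceptionLost) {
        List<String> results = new ArrayList<>();
        while (true) {
            List<String> batch;
            try {
                batch = server.scan(scannerId, 2);
            } catch (IllegalStateException serverError) {
                if (!exceptionLost) {
                    throw serverError; // the client sees the real error and can react
                }
                // The connection closed before the exception arrived, so the client
                // assumes a network error and retries the same RPC.
                batch = server.scan(scannerId, 2);
            }
            if (batch.isEmpty()) {
                return results;
            }
            results.addAll(batch);
        }
    }

    /** Five rows, batches of two, and the second server call fails. */
    static List<String> demo(boolean exceptionLost) {
        ToyServer server = new ToyServer();
        server.scanners.put(1L, new ServerScanner(Arrays.asList("a", "b", "c", "d", "e")));
        server.failOnCall = 2;
        return clientScan(server, 1L, exceptionLost);
    }

    public static void main(String[] args) {
        System.out.println(demo(true)); // prints [a, b]
    }
}
```

In this model `demo(true)` returns only the first batch and raises no error, which is the partial-result symptom the issue describes, while `demo(false)` propagates the server error to the caller.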
[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

    [ https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847021#comment-17847021 ]

Michael Smith commented on HBASE-28595:
---

I saw the OutOfOrderScannerException while working on various iterations of reproduction. I think it would cover the case of a successful scan then lost response; the client would retry, get OutOfOrderScannerException, and recognize it needs to reset the scanner.

> Losing exception from scan RPC can lead to partial results
> --
>
>                 Key: HBASE-28595
>                 URL: https://issues.apache.org/jira/browse/HBASE-28595
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, regionserver, Scanners
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Critical
>              Labels: pull-request-available
>
> This was discovered in Apache Impala using HBase 2.2 based branch hbase
> client and server. It is not clear yet whether other branches are also
> affected.
> The issue happens if the server side of the scan throws an exception and
> closes the scanner, but at the same time, the client gets an rpc connection
> closed error and doesn't process the exception sent by the server. Client
> then thinks it got a network error, which leads to retrying the RPC instead
> of opening a new scanner. But then when the client retry reaches the server,
> the server returns an empty ScanResponse instead of an error, leading to
> closing the scanner on client side without returning any error.
> A few pointers to critical parts:
> region server:
> 1st call throws exception leading to closing (but not deleting) scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3539]
> 2nd call (retry of 1st) returns empty results:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3403]
> client:
> some exceptions are handled as non-retriable at RPC level and are only
> handled through opening a new scanner:
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java#L214]
> [https://github.com/apache/hbase/blob/0c8607a35008b7dca15e9daaec41ec362d159d67/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java#L367]
> This mechanism in the client only works if it gets the exception from the
> server. If there are connection issues during the RPC then the client won't
> really know the state of the server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
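
The comment above relies on a per-scanner call sequence check: if a scan succeeds but its response is lost, the retry carries a stale sequence number and the server can reject it rather than silently advancing. The sketch below illustrates that idea only. `ToyScanner` and `OutOfOrderException` are hypothetical stand-ins, not the HBase implementation; the `callSeq` parameter loosely mirrors the `next_call_seq` field visible in the logged scan request elsewhere in this thread.

```java
/** Toy model of a per-scanner call sequence check; names are hypothetical, not HBase API. */
public class CallSeqSketch {

    /** Hypothetical stand-in for the out-of-order exception mentioned in the comment. */
    static final class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String message) {
            super(message);
        }
    }

    /** Server-side scanner that only accepts the sequence number it expects next. */
    static final class ToyScanner {
        private final String[] rows;
        private int position = 0;
        private long expectedSeq = 0;

        ToyScanner(String... rows) {
            this.rows = rows;
        }

        /** Returns the next row, or null when exhausted; rejects stale or skipped sequence numbers. */
        String next(long callSeq) {
            if (callSeq != expectedSeq) {
                throw new OutOfOrderException("expected " + expectedSeq + " but got " + callSeq);
            }
            expectedSeq++;
            return position < rows.length ? rows[position++] : null;
        }
    }

    /** Successful scan, lost response, then a retry with the unchanged sequence number. */
    static String demo() {
        ToyScanner scanner = new ToyScanner("a", "b", "c");
        long clientSeq = 0;
        scanner.next(clientSeq); // server served "a", but assume the response was lost
        try {
            scanner.next(clientSeq); // client never advanced clientSeq, so this is a retry
            return "row a silently skipped";
        } catch (OutOfOrderException e) {
            return "reset scanner"; // the client learns it must open a new scanner
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints reset scanner
    }
}
```

Note that this check only covers the "successful scan, lost response" case raised in the comment. In the scenario from the issue description the first call fails on the server, so whether the sequence number advances there depends on the server code and is not modeled here.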
[jira] [Comment Edited] (HBASE-28595) Losing exception from scan RPC can lead to partial results

    [ https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846755#comment-17846755 ]

Michael Smith edited comment on HBASE-28595 at 5/15/24 9:04 PM:

I have a setup that reproduces this issue against release 2.5.8: https://github.com/MikaelSmith/hbase/tree/hbase-28595

It differs slightly from the description above in that my demo code triggers connection close from the client side, which is what I think we actually saw happening (usually triggered by Netty). Still not clear why the connection was closed; server side doesn't log anything when it encounters an exception during scan, as it expects to be able to send it to the client.

{code}
I0508 19:30:24.107174 64862 ScannerCallable.java:181] Got exception making request scanner_id: 13987119624345627690 number_of_rows: 1024 close_scanner: false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: true track_scan_metrics: false renew: false to region=...
Java exception follows:
org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to host.example.net/169.169.169.169:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:333)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:91)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:576)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:42810)
    at org.apache.hadoop.hbase.client.ScannerCallable.next(ScannerCallable.java:175)
    at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:244)
    at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
    at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:396)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:370)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:79)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to host.example.net/169.169.169.169:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
    at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:206)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:383)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:91)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:414)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
    at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:116)
    at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:131)
    at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.cleanupCalls(NettyRpcDuplexHandler.java:203)
    at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelInactive(NettyRpcDuplexHandler.java:211)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
    at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
    at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
    at
[jira] [Comment Edited] (HBASE-28595) Losing exception from scan RPC can lead to partial results

    [ https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846755#comment-17846755 ]

Michael Smith edited comment on HBASE-28595 at 5/15/24 9:00 PM:

I have a setup that reproduces this issue against release 2.5.8: https://github.com/MikaelSmith/hbase/tree/hbase-28595

It differs slightly from the description above in that my demo code triggers connection close from the client side, which is what I think we actually saw happening (usually triggered by Netty)

{code}
I0508 19:30:24.107174 64862 ScannerCallable.java:181] Got exception making request scanner_id: 13987119624345627690 number_of_rows: 1024 close_scanner: false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: true track_scan_metrics: false renew: false to region=...
Java exception follows:
org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to host.example.net/169.169.169.169:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:333)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:91)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:576)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:42810)
    at org.apache.hadoop.hbase.client.ScannerCallable.next(ScannerCallable.java:175)
    at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:244)
    at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
    at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:396)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:370)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:79)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to host.example.net/169.169.169.169:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
    at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:206)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:383)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:91)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:414)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
    at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:116)
    at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:131)
    at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.cleanupCalls(NettyRpcDuplexHandler.java:203)
    at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelInactive(NettyRpcDuplexHandler.java:211)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
    at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
    at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
[jira] [Commented] (HBASE-28595) Losing exception from scan RPC can lead to partial results

    [ https://issues.apache.org/jira/browse/HBASE-28595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846755#comment-17846755 ]

Michael Smith commented on HBASE-28595:
---

I have a setup that reproduces this issue: https://github.com/MikaelSmith/hbase/tree/hbase-28595

It differs slightly from the description above in that my demo code triggers connection close from the client side, which is what I think we actually saw happening (usually triggered by Netty)

{code}
I0508 19:30:24.107174 64862 ScannerCallable.java:181] Got exception making request scanner_id: 13987119624345627690 number_of_rows: 1024 close_scanner: false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: true track_scan_metrics: false renew: false to region=...
Java exception follows:
org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to host.example.net/169.169.169.169:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:333)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:91)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:576)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:42810)
    at org.apache.hadoop.hbase.client.ScannerCallable.next(ScannerCallable.java:175)
    at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:244)
    at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
    at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:396)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:370)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:79)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to host.example.net/169.169.169.169:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
    at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:206)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:383)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:91)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:414)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
    at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:116)
    at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:131)
    at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.cleanupCalls(NettyRpcDuplexHandler.java:203)
    at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelInactive(NettyRpcDuplexHandler.java:211)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
    at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
    at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
    at
[jira] [Created] (HBASE-12097) Blocked threads on hbase, slowly increasing, appears to be updating metrics
Michael Smith created HBASE-12097:
-

             Summary: Blocked threads on hbase, slowly increasing, appears to be updating metrics
                 Key: HBASE-12097
                 URL: https://issues.apache.org/jira/browse/HBASE-12097
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.94.6
         Environment: RHEL 6.2, CDH 4.3.0.0
            Reporter: Michael Smith


HBase shows an increasing number of IPC threads in BLOCKED state.

There are hundreds of these, with more and more appearing over hours. Performance degrades until a regionserver restart is required to restore it.

Thread:
Thread 421 (IPC Server handler 368 on 60201):
  State: BLOCKED
  Blocked count: 19314
  Waited count: 322565
  Blocked on org.apache.hadoop.metrics.util.MetricsIntValue@1ec5ca55
  Blocked by 236 (IPC Server handler 183 on 60201)
  Stack:
    org.apache.hadoop.metrics.util.MetricsIntValue.set(MetricsIntValue.java:73)
    org.apache.hadoop.hbase.ipc.HBaseServer.updateCallQueueLenMetrics(HBaseServer.java:1360)
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1399)

I don't actually know how to troubleshoot this much further... Happy to take suggestions...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HBASE-12097) Blocked threads on hbase, slowly increasing, appears to be updating metrics

     [ https://issues.apache.org/jira/browse/HBASE-12097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Smith updated HBASE-12097:
--

    Attachment: total_blocks_versus_threads-week.png

The attached graph shows the increasing number of blocked threads. The massive drops are when we do rolling restarts of all the regionservers. You then see the number of blocked threads slowly starting to grow again.

We determine the number of blocked threads by hitting the web interface, i.e. host:60030/dump, and counting the threads reported as 'State: BLOCKED'.

Analysis of the blocked threads has revealed that they are blocked on updating metrics.

Blocked threads on hbase, slowly increasing, appears to be updating metrics
---

                Key: HBASE-12097
                URL: https://issues.apache.org/jira/browse/HBASE-12097
            Project: HBase
         Issue Type: Bug
         Components: regionserver
   Affects Versions: 0.94.6
        Environment: RHEL 6.2, CDH 4.3.0.0
           Reporter: Michael Smith
        Attachments: total_blocks_versus_threads-week.png

HBase shows an increasing number of IPC threads in BLOCKED state.

There are hundreds of these, with more and more appearing over hours. Performance degrades until a regionserver restart is required to restore it.

Thread:
Thread 421 (IPC Server handler 368 on 60201):
  State: BLOCKED
  Blocked count: 19314
  Waited count: 322565
  Blocked on org.apache.hadoop.metrics.util.MetricsIntValue@1ec5ca55
  Blocked by 236 (IPC Server handler 183 on 60201)
  Stack:
    org.apache.hadoop.metrics.util.MetricsIntValue.set(MetricsIntValue.java:73)
    org.apache.hadoop.hbase.ipc.HBaseServer.updateCallQueueLenMetrics(HBaseServer.java:1360)
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1399)

I don't actually know how to troubleshoot this much further... Happy to take suggestions...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
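
The counting step described in this update can be sketched as a small text scan over the dump output. This is a minimal sketch, assuming the dump lists each thread with a line of the form `State: BLOCKED` as in the excerpt above. `BlockedThreadCounter` and `countBlocked` are hypothetical names, and fetching the text from host:60030/dump is deliberately left out so the example stays self-contained.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Counts thread entries reported as BLOCKED in a thread dump given as plain text. */
public class BlockedThreadCounter {

    // Assumes the dump format shown in the issue: one "State: <STATE>" line per thread.
    private static final Pattern BLOCKED = Pattern.compile("State:\\s*BLOCKED");

    static int countBlocked(String dump) {
        Matcher matcher = BLOCKED.matcher(dump);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hand-written sample in the same shape as the dump excerpt; not real server output.
        String sample =
            "Thread 421 (IPC Server handler 368 on 60201):\n"
            + "  State: BLOCKED\n"
            + "Thread 236 (IPC Server handler 183 on 60201):\n"
            + "  State: RUNNABLE\n";
        System.out.println(countBlocked(sample)); // prints 1
    }
}
```

Sampling this count at a fixed interval would give a series comparable to the attached graph.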