[jira] [Commented] (HBASE-7989) Client with a cache info on a dead server will wait for 20s before trying another one.
[ https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596393#comment-13596393 ]

Jean-Daniel Cryans commented on HBASE-7989:
-------------------------------------------

This is something we saw yesterday, I think. First we saw tons of these starting a minute after the server died:

{noformat}
2013-03-07 01:27:57,065 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=someregion, hostname=sv4r20s13, port=10304
java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to sv4r20s13/10.4.20.13:10304 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.17.37:46591 remote=sv4r20s13/10.4.20.13:10304]
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1525)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1377)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:702)
	at org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.parallelGet(ThriftServerRunner.java:1410)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:65)
	at $Proxy5.parallelGet(Unknown Source)
	at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$parallelGet.getResult(Hbase.java:4930)
	at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$parallelGet.getResult(Hbase.java:4918)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
	at org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:287)
	at org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:62)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketTimeoutException: Call to sv4r20s13/10.4.20.13:10304 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.17.37:46591 remote=sv4r20s13/10.4.20.13:10304]
	at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1052)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1025)
	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
	at $Proxy6.multi(Unknown Source)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1354)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1352)
	at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:210)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1361)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1349)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	... 3 more
Caused by: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.17.37:46591 remote=sv4r20s13/10.4.20.13:10304]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
	at java.io.FilterInputStream.read(FilterInputStream.java:116)
	at java.io.FilterInputStream.read(FilterInputStream.java:116)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:399)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
{noformat}
[ https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596263#comment-13596263 ]

nkeywal commented on HBASE-7989:
--------------------------------

Yes. There is a 20s timeout for connect by default. And here there are two issues:
- we should be able to use a much lower timeout for connect: it doesn't depend on GC pauses, and a failed connect is a clear error (we are sure the action was not executed on the server, contrary to a read or write timeout)
- in some cases we should not even go to the server at all (we already know it's dead)

> Client with a cache info on a dead server will wait for 20s before trying
> another one.
> -------------------------------------------------------------------------
>
>                 Key: HBASE-7989
>                 URL: https://issues.apache.org/jira/browse/HBASE-7989
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.95.0, 0.98.0
>            Reporter: nkeywal
>
> Scenario is:
> - fetch the cache in the client
> - a server dies
> - try to use a region that is on the dead server
> This will lead to a 20 second connect timeout. We don't hit this in unit
> tests because it only happens when the remote box does not answer; in the
> unit tests we immediately get a connection refused from the OS.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
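The distinction drawn above, between a failed connect (the request certainly never ran, so a short timeout is safe) and a read timeout (the connected server may already have executed the request, so retrying is risky), can be sketched with plain java.net sockets. This is an illustrative standalone demo, not HBase code; the class name and timeout value are made up:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative only. A local server completes the TCP handshake (the OS
// backlog accepts it) but never sends a byte, so the client's connect
// succeeds while its read times out. At that point the client cannot know
// whether the server processed anything, which is why read/write timeouts
// must stay conservative while a connect timeout can be aggressive.
public class TimeoutDemo {

    static String probe(int readTimeoutMs) throws IOException {
        try (ServerSocket silentServer = new ServerSocket(0)) {
            try (Socket client = new Socket("127.0.0.1", silentServer.getLocalPort())) {
                client.setSoTimeout(readTimeoutMs);   // read timeout, not connect timeout
                try {
                    client.getInputStream().read();   // blocks until SO_TIMEOUT fires
                    return "unexpected-data";
                } catch (SocketTimeoutException expected) {
                    // Ambiguous failure: the request may or may not have run.
                    return "read-timeout";
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A connect timeout, by contrast, would be set via
        // new Socket().connect(addr, connectTimeoutMs) and is an unambiguous
        // failure: nothing reached the server, so it is safe to keep it low.
        System.out.println(probe(200)); // prints "read-timeout"
    }
}
```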
[ https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596156#comment-13596156 ]

Sergey Shelukhin commented on HBASE-7989:
-----------------------------------------

Hmm, never mind, is this about the TCP timeout?
[ https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596149#comment-13596149 ]

Sergey Shelukhin commented on HBASE-7989:
-----------------------------------------

Dup of HBASE-7649?