[jira] [Commented] (HBASE-7989) Client with a cache info on a dead server will wait for 20s before trying another one.

2013-03-07 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596393#comment-13596393
 ] 

Jean-Daniel Cryans commented on HBASE-7989:
---

I think this is something we saw yesterday.

First, we saw tons of these a minute after the server died:

{noformat}
2013-03-07 01:27:57,065 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=someregion, hostname=sv4r20s13, port=10304
java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to sv4r20s13/10.4.20.13:10304 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.17.37:46591 remote=sv4r20s13/10.4.20.13:10304]
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1525)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1377)
    at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:702)
    at org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.parallelGet(ThriftServerRunner.java:1410)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:65)
    at $Proxy5.parallelGet(Unknown Source)
    at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$parallelGet.getResult(Hbase.java:4930)
    at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$parallelGet.getResult(Hbase.java:4918)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
    at org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:287)
    at org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:62)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketTimeoutException: Call to sv4r20s13/10.4.20.13:10304 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.17.37:46591 remote=sv4r20s13/10.4.20.13:10304]
    at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1052)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1025)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
    at $Proxy6.multi(Unknown Source)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1354)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1352)
    at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:210)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1361)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1349)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    ... 3 more
Caused by: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.17.37:46591 remote=sv4r20s13/10.4.20.13:10304]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:399)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)

[jira] [Commented] (HBASE-7989) Client with a cache info on a dead server will wait for 20s before trying another one.

2013-03-07 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596263#comment-13596263
 ] 

nkeywal commented on HBASE-7989:


Yes. There is a 20s timeout for connect by default. There are two issues 
here:
- we should be able to use a much lower timeout for connect, as it doesn't 
depend on GC pauses and a failure is unambiguous (we are sure that the action 
was not executed on the server, contrary to a read or write timeout) 
- in some cases we should not even try to reach the server (we already know 
it's dead).
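
To make the two points concrete, here is a minimal sketch in plain java.net, 
not HBase client code. The class name, the timeout values and the "known 
dead" set are all assumptions made up for illustration. It keeps the connect 
timeout separate from (and much shorter than) the read timeout, and refuses 
to contact a server it already considers dead.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; none of these names exist in HBase.
final class ConnectTimeoutSketch {
    // Assumed values: a short connect timeout, a longer read timeout.
    static final int CONNECT_TIMEOUT_MS = 2_000;
    static final int READ_TIMEOUT_MS = 60_000;

    // Servers the client already believes to be dead, keyed by "host:port".
    private final Set<String> knownDead = ConcurrentHashMap.newKeySet();

    void markDead(String host, int port) {
        knownDead.add(host + ":" + port);
    }

    boolean isKnownDead(String host, int port) {
        return knownDead.contains(host + ":" + port);
    }

    Socket open(String host, int port) throws IOException {
        // Second point: do not even go to a server we know is dead.
        if (isKnownDead(host, port)) {
            throw new IOException("Skipping known-dead server " + host + ":" + port);
        }
        Socket socket = new Socket();
        try {
            // First point: a failed connect means nothing ran on the server,
            // so it can fail fast, independently of the read timeout.
            socket.connect(new InetSocketAddress(host, port), CONNECT_TIMEOUT_MS);
            socket.setSoTimeout(READ_TIMEOUT_MS);
            return socket;
        } catch (IOException e) {
            socket.close();
            // Simplification: treat any connect failure as "server is dead".
            markDead(host, port);
            throw e;
        }
    }
}
```

A real client would also need a way to remove a server from the set once it 
comes back; the sketch leaves that out.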

> Client with a cache info on a dead server will wait for 20s before trying 
> another one.
> --
>
> Key: HBASE-7989
> URL: https://issues.apache.org/jira/browse/HBASE-7989
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 0.95.0, 0.98.0
> Reporter: nkeywal
>
> Scenario is:
> - fetch the cache in the client
> - a server dies
> - try to use a region that is on the dead server
> This will lead to a 20 second connect timeout. We don't see this in unit 
> tests because it happens only if the remote box does not answer. In the unit 
> tests we immediately get a connection refused from the OS.
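
The difference between the two failure modes in the description can be seen 
with a small probe. This is a sketch under assumptions: the class and method 
names are hypothetical, and the mapping from exception type to outcome is a 
simplification (a ConnectException is usually, but not always, a refusal). 
On a local closed port the OS typically answers at once with a refusal, 
whereas a box that does not answer makes connect() block until the timeout 
expires.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical probe for illustration only; not part of HBase or its tests.
final class ConnectFailureProbe {
    enum Outcome { CONNECTED, REFUSED, TIMED_OUT, OTHER_ERROR }

    static Outcome probe(InetAddress host, int port, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            return Outcome.CONNECTED;
        } catch (SocketTimeoutException e) {
            // Nothing answered within connectTimeoutMs (the dead-box case).
            return Outcome.TIMED_OUT;
        } catch (ConnectException e) {
            // Usually an immediate "connection refused" from the remote OS.
            return Outcome.REFUSED;
        } catch (IOException e) {
            return Outcome.OTHER_ERROR;
        }
    }

    // Binds an ephemeral loopback port and releases it, so that nothing
    // is expected to be listening on the returned port afterwards.
    static int closedLocalPort() throws IOException {
        try (ServerSocket s = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return s.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = closedLocalPort();
        // Same 20 second budget as in the issue; a refusal returns long before it.
        System.out.println(probe(InetAddress.getLoopbackAddress(), port, 20_000));
    }
}
```

Reproducing the timeout case in a test needs a peer that silently drops 
packets, which is why a purely local unit test does not hit it.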

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7989) Client with a cache info on a dead server will wait for 20s before trying another one.

2013-03-07 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596156#comment-13596156
 ] 

Sergey Shelukhin commented on HBASE-7989:
-

Hmm, never mind, is this about the TCP timeout?


[jira] [Commented] (HBASE-7989) Client with a cache info on a dead server will wait for 20s before trying another one.

2013-03-07 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596149#comment-13596149
 ] 

Sergey Shelukhin commented on HBASE-7989:
-

Dup of HBASE-7649?
