[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913964#comment-13913964 ] Hudson commented on HBASE-10566: FAILURE: Integrated in HBase-TRUNK-on-Hadoop-1.1 #100 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/100/])
HBASE-10566 cleanup rpcTimeout in the client - addendum (nkeywal: rev 1572033)
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java

> cleanup rpcTimeout in the client
>
> Key: HBASE-10566
> URL: https://issues.apache.org/jira/browse/HBASE-10566
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 0.99.0
> Reporter: Nicolas Liochon
> Assignee: Nicolas Liochon
> Fix For: 0.99.0
>
> Attachments: 10566.sample.patch, 10566.v1.patch, 10566.v2.patch, 10566.v3.patch
>
> There are two issues:
> 1) A confusion between the socket timeout and the call timeout.
> Socket timeouts should be minimal: a default of around 20 seconds, which could be lowered to single-digit timeouts for some apps: if we cannot write to the socket in 10 seconds, we have an issue. This is different from the total call duration (send query + execute query + receive response), which can be longer, as it can include remote calls on the server and so on. Today we have a single value, so it does not allow us to have low socket read timeouts.
> 2) The timeout can differ between calls. Typically, if the total time, retries included, is 60 seconds but a call failed after 2 seconds, then the remaining budget is 58s. HBase does this today, but by hacking with a thread-local storage variable. It's a hack (it should have been a parameter of the methods; the TLS allowed bypassing all the layers. Maybe protobuf makes this complicated, to be confirmed), and it does not really work either, because we can have multithreading issues (we use someone else's updated rpc timeout, or we create a new BlockingRpcChannelImplementation with a random default timeout).
> Ideally, we could send the call timeout to the server as well: it would then be able to dismiss on its own the calls that it received but that got stuck in the request queue or in internal retries (on hdfs, for example). This will make the system more reactive to failures.
> I think we can solve this now, especially after HBASE-10525. The main issue is to find something that fits well with protobuf...
> Then it should be easy to have a pool of threads for writers and readers, without a single thread per region server as today.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
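The second issue in the description (58s remaining out of a 60s budget after a 2s failure) amounts to tracking an explicit per-call deadline and passing it as a parameter instead of through thread-local storage. A minimal sketch of that idea; the class name and API are invented for illustration and are not HBase's:

```java
// Hypothetical sketch of an explicit per-call deadline, passed as a parameter
// instead of stored in a thread-local. Not the HBase implementation.
public class CallDeadline {
    private final long deadlineMs;

    public CallDeadline(long startMs, long totalTimeoutMs) {
        this.deadlineMs = startMs + totalTimeoutMs;
    }

    // Time left for the next retry; 0 when the overall budget is spent.
    public long remainingMs(long nowMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }
}
```

With 60s total and a first attempt failing at t = 2s, remainingMs returns 58000, matching the example in the description; a server that also received this value could drop calls whose deadline has already passed.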
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913012#comment-13913012 ] Hudson commented on HBASE-10566: FAILURE: Integrated in HBase-TRUNK #4957 (See [https://builds.apache.org/job/HBase-TRUNK/4957/])
HBASE-10566 cleanup rpcTimeout in the client - addendum (nkeywal: rev 1572033)
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912826#comment-13912826 ] Nicolas Liochon commented on HBASE-10566: -
bq. Will do as an addendum.
Done.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912401#comment-13912401 ] Hudson commented on HBASE-10566: FAILURE: Integrated in HBase-TRUNK-on-Hadoop-1.1 #99 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/99/])
HBASE-10566 cleanup rpcTimeout in the client - missing TimeLimitedRpcController (nkeywal: rev 1571730)
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/TimeLimitedRpcController.java
HBASE-10566 cleanup rpcTimeout in the client (nkeywal: rev 1571727)
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSmallScanner.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/DelegatingRetryingCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HTable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/MultiServerCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Put.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RegionServerCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RetryingCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCaller.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/PayloadCarryingRpcController.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RegionCoprocessorRpcChannel.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ReplicationProtbufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALEditsReplaySink.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestHCM.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestEndToEndSplitTransaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911944#comment-13911944 ] Nicolas Liochon commented on HBASE-10566: -
bq. Belated +1
Thanks ;-)
bq. Yes.
Will do as an addendum.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911926#comment-13911926 ] Hudson commented on HBASE-10566: FAILURE: Integrated in HBase-TRUNK #4953 (See [https://builds.apache.org/job/HBase-TRUNK/4953/])
HBASE-10566 cleanup rpcTimeout in the client - missing TimeLimitedRpcController (nkeywal: rev 1571730)
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/TimeLimitedRpcController.java
HBASE-10566 cleanup rpcTimeout in the client (nkeywal: rev 1571727)
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSmallScanner.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/DelegatingRetryingCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HTable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/MultiServerCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Put.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RegionServerCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RetryingCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCaller.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/PayloadCarryingRpcController.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RegionCoprocessorRpcChannel.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/RpcServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ReplicationProtbufUtil.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALEditsReplaySink.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestHCM.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestEndToEndSplitTransaction.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911856#comment-13911856 ] Andrew Purtell commented on HBASE-10566: Belated +1
bq. btw, I'm seeing this only now, but I extended the original naming 'ipc.socket.timeout', and that was not prefixed by 'hbase.' I should change this, no?
Yes.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911703#comment-13911703 ] Nicolas Liochon commented on HBASE-10566: - btw, I'm seeing this only now, but I extended the original naming 'ipc.socket.timeout', and that was not prefixed by 'hbase.' I should change this, no?
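The renaming question above (prefixing 'ipc.socket.timeout' with 'hbase.') can be handled with a fallback read so that old configurations keep working. A minimal sketch using java.util.Properties; the helper class and the 20s default are illustrative assumptions, not HBase's actual configuration code:

```java
import java.util.Properties;

// Hypothetical helper: prefer the new 'hbase.'-prefixed key, fall back to the
// legacy un-prefixed key, then to a default. Sketch only; HBase's real
// Configuration handling may differ.
public class TimeoutConfig {
    static final int DEFAULT_SOCKET_TIMEOUT_MS = 20000; // 20s, per the issue description

    public static int socketTimeoutMs(Properties conf) {
        String v = conf.getProperty("hbase.ipc.socket.timeout");
        if (v == null) {
            v = conf.getProperty("ipc.socket.timeout"); // legacy name, kept for compatibility
        }
        return v == null ? DEFAULT_SOCKET_TIMEOUT_MS : Integer.parseInt(v);
    }
}
```

The new key wins when both are set, so renaming the property does not break deployments that still use the old name.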
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911682#comment-13911682 ] stack commented on HBASE-10566: --- +1 on fix in separate patch
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911557#comment-13911557 ] Nicolas Liochon commented on HBASE-10566: - I plan to commit this shortly. The ??callWithRetries(callable, HConstants.DEFAULT_HBASE_CLIENT_OPERATION_TIMEOUT);?? looks terrible, but maybe it's a 'feature by accident', so it's better to fix this in a separate patch.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911548#comment-13911548 ] Hadoop QA commented on HBASE-10566: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12630929/10566.v3.patch against trunk revision . ATTACHMENT ID: 12630929 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop1.1{color}. The patch compiles against the hadoop 1.1 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8797//console This message is automatically generated. 
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911488#comment-13911488 ] Nicolas Liochon commented on HBASE-10566:
---
Added to that, the existing code does this:
{code}
public synchronized T callWithRetries(RetryingCallable callable) throws IOException, RuntimeException {
  return callWithRetries(callable, HConstants.DEFAULT_HBASE_CLIENT_OPERATION_TIMEOUT);
}
{code}
In other words, the setting is not taken into account; we use the hardcoded default value in many cases. The patch does not change this.
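A minimal sketch of the direction this comment implies: look up the configured operation timeout and pass it through, instead of falling back on the hardcoded default. All names here (the map-backed config stand-in, the simplified callWithRetries) are illustrative, not the actual HBase API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class RetrySketch {
    // Stand-in for HConstants.DEFAULT_HBASE_CLIENT_OPERATION_TIMEOUT.
    static final int DEFAULT_OPERATION_TIMEOUT_MS = 1_200_000;

    // Stand-in for a Configuration lookup: prefer the user's setting.
    static int operationTimeout(Map<String, Integer> conf) {
        return conf.getOrDefault("hbase.client.operation.timeout", DEFAULT_OPERATION_TIMEOUT_MS);
    }

    // Simplified: the real method would loop over retries within this budget.
    static <T> T callWithRetries(Callable<T> callable, int timeoutMs) throws Exception {
        return callable.call();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> conf = new HashMap<>();
        conf.put("hbase.client.operation.timeout", 30_000);
        int t = operationTimeout(conf);        // the setting wins over the default
        System.out.println(t);                  // prints 30000
        System.out.println(callWithRetries(() -> "ok", t)); // prints ok
    }
}
```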
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911487#comment-13911487 ] Nicolas Liochon commented on HBASE-10566:
---
v3:
- fix the javadoc warnings
- add a test
- the test showed that some HTable paths were not setting the timeout. Fixed.
ProtobufUtils contains code that belongs to the client code. I haven't fixed it all; the patch would become too big. I think we're as good as before.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910954#comment-13910954 ] stack commented on HBASE-10566:
---
Fix javadoc warning on commit? This is a great comment: "We're spending a lot of time wrapping the exceptions, and then unwrapping them to discover what really happened." File an issue for this one when you get a chance. Patch looks great to me. Commit.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910746#comment-13910746 ] Hadoop QA commented on HBASE-10566:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12630733/10566.v2.patch against trunk revision .
ATTACHMENT ID: 12630733
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.
{color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile.
{color:green}+1 hadoop1.1{color}. The patch compiles against the hadoop 1.1 profile.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 3 warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8787//console
This message is automatically generated.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910602#comment-13910602 ] Nicolas Liochon commented on HBASE-10566:
---
v2 fixes the test error. I'm not sure we shouldn't get rid of 'wrapException', however. We're spending a lot of time wrapping the exceptions, and then unwrapping them to discover what really happened.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910592#comment-13910592 ] Hadoop QA commented on HBASE-10566:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12630699/10566.v1.patch against trunk revision .
ATTACHMENT ID: 12630699
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.
{color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile.
{color:green}+1 hadoop1.1{color}. The patch compiles against the hadoop 1.1 profile.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 3 warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}.
The patch failed these unit tests: org.apache.hadoop.hbase.client.TestClientOperationInterrupt
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8785//console
This message is automatically generated.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910452#comment-13910452 ] Nicolas Liochon commented on HBASE-10566:
---
v1 is a first attempt. I haven't run all the tests locally, but I had no error after a 30 minute run.
There are 3 different socket timeouts:
- connect
- read
- write
For all of them, we should be able to set them to low values, something like 2 / 5 / 5, without any impact. Likely I will need to write a test for this. The existing timeout of 60s is a global timeout for the operation. I need to double check how we were using the existing operationTimeout; my feeling is that it was buggy, and that it was overriding the individual timeout. If that's the case, it's still buggy.
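The connect/read distinction above can be sketched with a plain java.net.Socket (HBase's RpcClient wraps this differently, so treat the names here as illustrative). A write timeout is not directly supported by the blocking Socket API and generally needs NIO or an async layer, so only connect and read timeouts are shown.

```java
import java.net.Socket;

public class SocketTimeouts {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        // Read timeout (SO_TIMEOUT): fail a blocked read after 5s,
        // independently of how long the overall call is allowed to take.
        s.setSoTimeout(5_000);
        s.setTcpNoDelay(true);
        // The connect timeout is passed per call, e.g.:
        //   s.connect(new InetSocketAddress(host, port), 2_000);
        System.out.println(s.getSoTimeout()); // prints 5000
    }
}
```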
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910145#comment-13910145 ] Nicolas Liochon commented on HBASE-10566:
---
bq. I suppose it is ok. Maybe rename the class so it is not confused with Callable.
Actually, we never use the fact that it's a Java Callable. But changing the name can impact a lot of code. I will try (IntelliJ will do the change for me, but it can make the patch much bigger, I don't know).
bq. Is TimeLimitedRpcController left as an exercise to the reader
I forgot it (the usual stuff: not added to git, so not included in the git diff). But the patch globally compiles; it just does not set the timeout all the time.
bq. Doesn't "callTimeout" make more sense for this parameter name?
Often "timeout" indicates a duration, while here I used something like a cutoff time. That's what I wanted to express. There is an implication, however: the client and the server time must be in sync. Even if it's a common requirement, I'm not sure I'm not going to change my mind.
Thanks a lot for the feedback, I'm going to try to write the full patch.
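The duration-vs-cutoff distinction discussed above can be shown in a few lines: "callTimeout" is a duration, "finishBefore" is an absolute point in time derived from it, and each retry works against the remaining budget. This is a sketch of the idea only; the variable names mirror the discussion, not the committed code.

```java
public class CutoffDemo {
    public static void main(String[] args) throws Exception {
        long callTimeoutMs = 60_000;                                     // a duration
        long finishBefore = System.currentTimeMillis() + callTimeoutMs;  // a cutoff time
        Thread.sleep(20);                                                // a failed attempt burns time
        // The next retry only gets what is left of the budget.
        long remaining = finishBefore - System.currentTimeMillis();
        System.out.println(remaining > 0 && remaining < callTimeoutMs);  // prints true
    }
}
```

Note that deriving the cutoff locally, as above, keeps the comparison on one clock; sending the cutoff to the server is what introduces the clock-sync requirement mentioned in the comment.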
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908830#comment-13908830 ] stack commented on HBASE-10566:
---
I like Nick's namings.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908726#comment-13908726 ] Nick Dimiduk commented on HBASE-10566: -- bq. A confusion between the socket timeout and the call timeout Yet you chose the name "finishBefore". Doesn't "callTimeout" make more sense for this parameter name? Is {{TimeLimitedRpcController}} left as an exercise for the reader? Perhaps this is what you mean by "sample". Again on the name choice (mostly for my own clarity): I understand this to be an implementation of RpcController that honors an RpcCallTimeout setting. bq. Nick pointed me to this, it's quite interesting My intention was to show how that implementation manages the Rpc timeout. I've studied their implementation a bit, but don't know how to map those concepts back to ours. In their implementation, timeout management is handled by an ExecutorService implementation.
> cleanup rpcTimeout in the client
>
> Key: HBASE-10566
> URL: https://issues.apache.org/jira/browse/HBASE-10566
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 0.99.0
> Reporter: Nicolas Liochon
> Assignee: Nicolas Liochon
> Fix For: 0.99.0
> Attachments: 10566.sample.patch
>
> There are two issues:
> 1) A confusion between the socket timeout and the call timeout
> Socket timeouts should be minimal: a default like 20 seconds, which could be lowered to single-digit timeouts for some apps: if we cannot write to the socket in 10 seconds, we have an issue. This is different from the total duration (send query + do query + receive query), which can be longer, as it can include remote calls on the server and so on. Today we have a single value; it does not allow us to have low socket read timeouts.
> 2) The timeout can be different between the calls. Typically, if the total time, retries included, is 60 seconds but the first attempt failed after 2 seconds, then the remaining budget is 58s. HBase does this today, but by hacking with a thread-local storage variable. It's a hack (it should have been a parameter of the methods; the TLS allowed bypassing all the layers. Maybe protobuf makes this complicated, to be confirmed), but it also does not really work, because we can have multithreading issues (we use the updated rpc timeout of someone else, or we create a new BlockingRpcChannelImplementation with a random default timeout).
> Ideally, we could send the call timeout to the server as well: it would then be able to dismiss on its own the calls that it received but that got stuck in the request queue or in internal retries (on hdfs, for example). This will make the system more reactive to failure.
> I think we can solve this now, especially after 10525. The main issue is to find something that fits well with protobuf...
> Then it should be easy to have a pool of threads for writers and readers, w/o a single thread per region server as today.
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
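The retry budget described in point 2 can be sketched as follows. This is a hypothetical illustration, not code from the patch: names like {{CallBudget}} and {{remainingMillis}} are invented. It models how the call-level deadline shrinks as attempts fail, while the socket timeout (not shown) would stay a small fixed value.

```java
// Hypothetical sketch of the per-call retry budget (names are illustrative,
// not from the HBase patch). The whole call, retries included, must finish
// by an absolute deadline; each retry attempt gets whatever is left.
public class CallBudget {
    private final long deadlineMillis; // absolute time the whole call must finish by

    public CallBudget(long startMillis, long totalTimeoutMillis) {
        this.deadlineMillis = startMillis + totalTimeoutMillis;
    }

    /** Budget left for the next retry attempt; 0 means the call is out of time. */
    public long remainingMillis(long nowMillis) {
        return Math.max(0, deadlineMillis - nowMillis);
    }

    public static void main(String[] args) {
        CallBudget budget = new CallBudget(0, 60_000); // 60s total, retries included
        // first attempt fails 2s in: 58s remain for the other retries
        System.out.println(budget.remainingMillis(2_000));  // 58000
        // past the deadline, the budget is simply exhausted
        System.out.println(budget.remainingMillis(61_000)); // 0
    }
}
```

Passing the remaining budget explicitly per call is what removes the need for the thread-local hack.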
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908707#comment-13908707 ] stack commented on HBASE-10566: --- TimeLimitedRpcController.java is missing from the patch (but I think I can imagine what it looks like -- smile). Hmm, this is a radical difference:
-public interface RetryingCallable extends Callable {
+public interface RetryingCallable {
I suppose it is ok. Maybe rename the class so it is not confused with Callable. The cleanup in RpcRetryingCaller is great. Why does this have to be a data member in RpcClient? +final long finishBefore; ... oh I see ... there is a Call per method invocation... makes sense. Fix spelling: finisheBefore. Nice cleanup in RpcClient.java removing the TL. Patch lgtm.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908668#comment-13908668 ] stack commented on HBASE-10566: --- bq. I'm not that ambitious. That is because you are more sensible than I (smile). Sounds like we'd have to remove the TLS anyway, whether new transport or not. Let me review the patch. Can do new transport in another issue (smile).
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908647#comment-13908647 ] Nicolas Liochon commented on HBASE-10566: - bq. You think we should just do a new transport altogether? You are fixing ugly legacy. I'm not that ambitious :-). I'm not saying it does not make sense. I'm trying to remove the non-standard behaviors: here, the TLS is both buggy and impacts the threading model. By removing it I can have a different threading model, likely more compatible with any other rpc layer, but, in any case and in the short term, less buggy. If I remove the TLS, I can have a thread pool shared by all the servers, and this will allow me to have a single code path after 10525. In any case, we need to have standard features. Client side, the issues were the ping and the rpc timeout. I think that if we remove them, we're clean imho. Server side it's more complex: delayed calls, priorities, configurable schedulers, ... Thanks a lot for reviewing: the patch is not complete, but I need feedback on the direction, especially because it's going to touch a lot of parts (trivial changes, but all over the place).
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908619#comment-13908619 ] stack commented on HBASE-10566: --- bq. Nick pointed me to this, it's quite interesting: https://code.google.com/p/protobuf-rpc-pro/wiki/RpcTimeout Yes. It would be fun to try and drop in a new transport, one with fanciness like that of pb-rpc-pro: bidirectional messaging, cancel, etc. Or this dead one: https://code.google.com/p/netty-protobuf-rpc/ There are others too. This is a thorny issue N. Thanks for digging in. bq. if we can not write to the socket in 10 second, we have an issue. Yes. We've inherited a load of our timeouts from our batch-oriented parent and have yet to change them in many cases. bq. Today, we have a single value, it does not allow us to have low socket read timeouts. Yes. Excellent. This sloppiness has been allowed to prevail down through the years (sorry about that). bq. May be protobuf makes this complicated, to be confirmed), but as well it does not really work, because we can have multithreading issues This we inherited from the hadoop rpc. You think we should just do a new transport altogether? You are fixing ugly legacy. bq. we could send the call timeout to the server as well: Yes. The server might reject a call, even before it starts working on it, because it is already past its timeout (for whatever reason). bq. getStub().multi(null, request); << pcrc not used Good. bq. we would need to instantiate one rpcController per call (we more or less do that already) We do this already -- at least IIRC, this is what the model imposes on us. We use the rpcController at the moment for carrying our cellblock across the pb rpc interface (it doesn't allow for extra args as is). Let me look at the patch.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908563#comment-13908563 ] Nicolas Liochon commented on HBASE-10566: - Nick pointed me to this, it's quite interesting: https://code.google.com/p/protobuf-rpc-pro/wiki/RpcTimeout
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908493#comment-13908493 ] Nicolas Liochon commented on HBASE-10566: - The difficult point is to correlate this with cancellation. The RpcController supports cancellation, but we don't really show this class to the client code. However, doing a cancel is a client decision. So we can be protobuf-compliant for the timeout, as it's something we manage as a conf parameter, so we don't have to show the RpcController to the client; but we can't easily be protobuf-compliant for cancel, as it would require exposing this RpcController. The path for this jira is: clean rpc timeout -> shared thread pool for the connection instead of 2 threads per region server -> cancellation server side. Likely, this would help to put Netty or ByteBuffer in the loop as well. And this would allow shorter socket timeouts: today, if we want to support an operation that can last a few minutes (a scan with a lot of filters under heavy load, for example), we need to have a socket timeout of a few minutes as well. So I'm interested in any feedback. [~saint@gmail.com], [~andrew.purt...@gmail.com]; [~enis]; [~devaraj] ?
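The server-side dismissal of expired calls described in the issue could look roughly like this. This is a hypothetical sketch, not the actual RpcServer code: the names ({{shouldReject}}, {{receivedAtMillis}}, {{callTimeoutMillis}}) are invented for illustration.

```java
// Hypothetical sketch: if the client ships its call timeout with the
// request, the server can drop a call whose whole budget elapsed while it
// sat in the request queue, instead of doing work the client has already
// given up on. Names here are illustrative, not from HBase.
public class ExpiredCallCheck {
    static boolean shouldReject(long receivedAtMillis, long callTimeoutMillis, long nowMillis) {
        // A non-positive timeout means "no deadline was sent"; never reject those.
        return callTimeoutMillis > 0 && nowMillis - receivedAtMillis >= callTimeoutMillis;
    }

    public static void main(String[] args) {
        long t0 = 0;
        System.out.println(shouldReject(t0, 2_000, t0 + 5_000)); // true: expired while queued
        System.out.println(shouldReject(t0, 2_000, t0 + 1_000)); // false: still within budget
        System.out.println(shouldReject(t0, 0, t0 + 5_000));     // false: no deadline sent
    }
}
```

Rejecting up front is what makes the system "more reactive to failure": the client gets its error immediately instead of after a doomed execution.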
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908332#comment-13908332 ] Nicolas Liochon commented on HBASE-10566: - Interestingly: "A specific RpcController implementation may very well provide a SetTimeout() method — Google’s internal implementation does exactly this." - http://steve.vinoski.net/blog/2008/07/13/protocol-buffers-leaky-rpc/ So it's likely the right approach from a protobuf PoV, even if I'm not sure that we're totally protobuf-rpc friendly...
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908255#comment-13908255 ] Nicolas Liochon commented on HBASE-10566: - Also to include in this patch: the same policy for ServerRpcController: one throws UnsupportedException, one does nothing.
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907013#comment-13907013 ] Nicolas Liochon commented on HBASE-10566: - Any feedback? Is the approach for the rpcController ok?
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905868#comment-13905868 ] Nicolas Liochon commented on HBASE-10566: - Here is a sample, using the rpcController to carry the info:
- call() is replaced by call(finishBefore): it indicates that the call should finish before this time
- call(0) means default timeout
- we could forward the info to the server (somewhere in protobuf)
- we would need to instantiate one rpcController per call (we more or less do that already)
Any opinion on the approach? It seems that's how protobuf should be used, but I may be wrong.
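The call(finishBefore) convention from the sample, with 0 selecting the default timeout, could be modeled as below. This is an illustrative sketch under assumed semantics, not the patch code; {{effectiveDeadline}} and {{DEFAULT_TIMEOUT_MILLIS}} are invented names.

```java
// Hypothetical sketch of the call(finishBefore) convention: finishBefore is
// an absolute deadline, and 0 means "use the configured default timeout".
// Names are illustrative, not from the HBase patch.
public class FinishBefore {
    static final long DEFAULT_TIMEOUT_MILLIS = 60_000;

    /** Resolve the effective absolute deadline for a call. */
    static long effectiveDeadline(long finishBefore, long nowMillis) {
        return finishBefore == 0 ? nowMillis + DEFAULT_TIMEOUT_MILLIS : finishBefore;
    }

    public static void main(String[] args) {
        long now = 1_000;
        System.out.println(effectiveDeadline(0, now));     // 61000: default applied
        System.out.println(effectiveDeadline(5_000, now)); // 5000: explicit deadline kept
    }
}
```

An absolute deadline (rather than a relative timeout) is convenient here because it can be threaded through retries unchanged, with each layer computing its own remaining time.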
[jira] [Commented] (HBASE-10566) cleanup rpcTimeout in the client
[ https://issues.apache.org/jira/browse/HBASE-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905738#comment-13905738 ] Nicolas Liochon commented on HBASE-10566: - To be fixed in this patch, there is a bug in HTable#mutateRow:
{code}
public void mutateRow(final RowMutations rm) throws IOException {
  RegionServerCallable<Void> callable = new RegionServerCallable<Void>(connection, getName(), rm.getRow()) {
    public Void call() throws IOException {
      try {
        RegionAction.Builder regionMutationBuilder = RequestConverter.buildRegionAction(
            getLocation().getRegionInfo().getRegionName(), rm);
        regionMutationBuilder.setAtomic(true);
        MultiRequest request =
            MultiRequest.newBuilder().addRegionAction(regionMutationBuilder.build()).build();
        PayloadCarryingRpcController pcrc = new PayloadCarryingRpcController();
        pcrc.setPriority(tableName);
        getStub().multi(null, request); // << pcrc not used
      } catch (ServiceException se) {
        throw ProtobufUtil.getRemoteException(se);
      }
      return null;
    }
  };
  rpcCallerFactory.<Void> newCaller().callWithRetries(callable, this.operationTimeout);
}
{code}