[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2013-05-30 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13670176#comment-13670176
 ] 

stack commented on HBASE-6364:
--

@nkeywal nevermind.  The issue I am seeing is not particular to this issue.  
Please ignore.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: Nicolas Liochon
  Labels: client
 Fix For: 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2013-05-29 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13670062#comment-13670062
 ] 

stack commented on HBASE-6364:
--

[~nkeywal] What is supposed to happen when the FailedServerException is thrown?

The below is hard to read -- it is out of our loadtesttool used alot in 
hbase-it.

A loadtesttool thread is failing with a FailedServerException...  This server 
is in the failed servers list'.  Reading the above, I thought we were supposed 
to pause and retry but if you notice, we are doing connection setup using 
withoutRetries... because we are inside a Process.

Any input would help.  Sorry for being short.  Have to run (I'm trying to 
figure failing hbase-it tests up on ec2 and on internal jenkins).  Thanks.

{code}
2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_3] 
util.MultiThreadedWriter(191): Failed to insert: 48349; region information: 
cached: 
region=IntegrationTestDataIngestWithChaosMonkey,8cc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd.,
 hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is 
up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for 
[979402d0d20fb0f8ded281a8b8687ab9-48349]java.io.IOException: Call to 
a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: 
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in 
the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
at 
org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)er$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This 
server is in the failed servers list: 
a1007.halxg.cloudera.com/10.20.184.107:53752
at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:798)
at 
org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1422)
at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1314)
... 15 more

2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_0] 
util.MultiThreadedWriter(191): Failed to insert: 48366; region information: 
cached: 
region=IntegrationTestDataIngestWithChaosMonkey,8cc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd.,
 hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is 
up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for 
[92873a55c54f98db38508ba065852cc5-48366]java.io.IOException: Call to 
a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: 
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in 
the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
at 
org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)
at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1340)
at 
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1540)
at 
org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1597)
at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:21403)
at 
org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:102)
at 
org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:43)
at 
org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:250)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1993)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1988)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2473)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This 
server is in the failed 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448311#comment-13448311
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-0.94-security-on-Hadoop-23 #7 (See 
[https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/7/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table - security addendum (Revision 1376136)
HBASE-6364 Powering down the server host holding the .META. table causes HBase 
Client to take excessively long to recover and connect to reassigned .META. 
table (Revision 1376013)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/branches/0.94/security/src/main/java/org/apache/hadoop/hbase/ipc/SecureClient.java

nkeywal : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440213#comment-13440213
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-0.94-security #49 (See 
[https://builds.apache.org/job/HBase-0.94-security/49/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table - security addendum (Revision 1376136)

 Result = SUCCESS
nkeywal : 
Files : 
* 
/hbase/branches/0.94/security/src/main/java/org/apache/hadoop/hbase/ipc/SecureClient.java


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439525#comment-13439525
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-0.94-security #48 (See 
[https://builds.apache.org/job/HBase-0.94-security/48/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table (Revision 1376013)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439545#comment-13439545
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-0.94 #415 (See 
[https://builds.apache.org/job/HBase-0.94/415/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table (Revision 1376013)

 Result = SUCCESS
nkeywal : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439655#comment-13439655
 ] 

Zhihong Ted Yu commented on HBASE-6364:
---

Addendum looks good to me.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439659#comment-13439659
 ] 

Hadoop QA commented on HBASE-6364:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12541992/6364.94.v2.nolargetest.security-addendum.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2648//console

This message is automatically generated.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439672#comment-13439672
 ] 

nkeywal commented on HBASE-6364:


No failure on the unit tests small  medium with the security profile.
Committed revision 1376136.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439718#comment-13439718
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-0.94 #416 (See 
[https://builds.apache.org/job/HBase-0.94/416/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table - security addendum (Revision 1376136)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/branches/0.94/security/src/main/java/org/apache/hadoop/hbase/ipc/SecureClient.java


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439726#comment-13439726
 ] 

nkeywal commented on HBASE-6364:


This time it's the usual guilty so I think we're ok.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439725#comment-13439725
 ] 

stack commented on HBASE-6364:
--

+1 on addendum

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440025#comment-13440025
 ] 

Lars Hofhansl commented on HBASE-6364:
--

@Nicholas: Thank you very much for backporting the patch!

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438612#comment-13438612
 ] 

nkeywal commented on HBASE-6364:


Committed revision 1375473.

After another fishy and not reproduced error on 
org.apache.hadoop.hbase.TestMultiVersions

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438622#comment-13438622
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #140 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/140/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table (Revision 1375473)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/ClientCache.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestHBaseClient.java


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438684#comment-13438684
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-TRUNK #3248 (See 
[https://builds.apache.org/job/HBase-TRUNK/3248/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table (Revision 1375473)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/ClientCache.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestHBaseClient.java


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438855#comment-13438855
 ] 

Lars Hofhansl commented on HBASE-6364:
--

@Nicholas: What's your feeling here... Should this go into 0.94? The patch 
seems contained to me.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438874#comment-13438874
 ] 

nkeywal commented on HBASE-6364:


@lars
I think it's reasonably safe (well I wrote it :-) ). On the other hand to have 
the issue it solves you need to be a little bit unlucky as well. As the 
consequences are quite bad when you're unlucky I think it's better to do it. 
And if there is an issue you can deactivate it (by setting 
hbase.ipc.client.failed.servers.expiry to zero), or lower the expiry time to 
something as 100ms. If it breaks something I will fix it for both versions.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438931#comment-13438931
 ] 

stack commented on HBASE-6364:
--

+1 on backport so we get benefit of the MTTR work sooner.

@Nkeywal Should we lower the default  ipc.socket.timeout too?  I don't see that 
in your patch?

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438968#comment-13438968
 ] 

nkeywal commented on HBASE-6364:


bq. +1 on backport so we get benefit of the MTTR work sooner.
I can do it if you like.

bq. @Nkeywal Should we lower the default ipc.socket.timeout too? I don't see 
that in your patch?
Doable, there will be no side effect, it's used only in HBaseClient. How do you 
want it:
- safe: 10 seconds
- reasonably aggressive: 5 seconds
- instructive: 1 second

20 seconds is a common timeout value (used in the OS if I remember well). Not 
subject to GC effects. I don't know for large clusters under large failure 
conditions (with switches). Especially, in this case, it will be all the 
clients connecting to a single server (the one with meta), so likely going 
through a single network link at the end. I would vote 10s; but 5s is doable. 
1s could work, who knows?


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439080#comment-13439080
 ] 

stack commented on HBASE-6364:
--

20 is the current default?  If so, we can leave it.  Should we add a note to 
perf section of refguide on lowering connection timeout?

That'd be grand if you'd commit to 0.94.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439082#comment-13439082
 ] 

nkeywal commented on HBASE-6364:


For connect timeout, yes it's 20s. We could add a note as well, I will create a 
jira for this. And commit in 0.94.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-18 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13437294#comment-13437294
 ] 

nkeywal commented on HBASE-6364:


@stack. Sorry I missed your message. Yes, the last one is v9, and it's the one 
in RB as well.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-18 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13437420#comment-13437420
 ] 

stack commented on HBASE-6364:
--

I added some comment up on RB but this is close to committable IMO

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-17 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13437137#comment-13437137
 ] 

stack commented on HBASE-6364:
--

@nkeywal Where is v10?  I see v9 attached here.  Is it what is up in RB?

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-15 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13434952#comment-13434952
 ] 

nkeywal commented on HBASE-6364:


v10 with all comments taken into account. Unfortunately, TestReplication failed 
3 times in a row. I need to look at this. There is no clear reason, this last 
patch is very similar to the previous one. Will look at that later...

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-15 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13435095#comment-13435095
 ] 

nkeywal commented on HBASE-6364:


Ok, tried 3 more times, they're all successful. So I think it's ok. If there is 
nothing scary in the hadoop-qa-build, I will commit v10 within 2 or 3 days.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13435158#comment-13435158
 ] 

Hadoop QA commented on HBASE-6364:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12541031/6364.v9.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 2 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2587//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2587//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2587//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2587//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2587//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2587//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2587//console

This message is automatically generated.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-10 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432710#comment-13432710
 ] 

nkeywal commented on HBASE-6364:


@Suraj
Ok, we will see how it takes to finish the review process as well. But I've 
reproduced your scenario I think.

@All
v8 with all comments taken into account
TestHBaseClient to be in the final commit
Test_HBASE_6364 / HBaseRecoveryTestingUtility not for the final commit

I'm going to create a rb for this as well.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-10 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432713#comment-13432713
 ] 

nkeywal commented on HBASE-6364:


https://reviews.apache.org/r/6521/

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431774#comment-13431774
 ] 

nkeywal commented on HBASE-6364:


For the tests, I tried 2 options:
- Implementing a specific SocketFactory that would return configurable sockets. 
Too fragile  too complicated
- Adding a hook to get a specific HBaseClient implementation, that would add 
the sleep when necessary. That's in the v6.

On the long term, I think the best place to add test hooks is NetUtils, but 
this class is made of static methods and is in the hadoop.net package, not 
HBase.


I'm not a big fan of adding these new tests to the main build, but it can now 
be done.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431800#comment-13431800
 ] 

Hadoop QA commented on HBASE-6364:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12540017/6364.v6.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.TestLocalHBaseCluster
  org.apache.hadoop.hbase.master.TestSplitLogManager

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2538//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2538//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2538//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2538//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2538//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2538//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2538//console

This message is automatically generated.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431866#comment-13431866
 ] 

stack commented on HBASE-6364:
--

We already have a class named DeadServers here: 
./hbase-server/src/main/java/org/apache/hadoop/hbase/master/DeadServer.java  
Maybe name this something else?

Do you think the deadservers will hold reference to the original backing Map?  
If so, if we keep adding, are we adding to backing Map or to the tailMap?  I'm 
talking about here:

+deadServers = deadServers.tailMap(now);

I'm afraid we will be retaining reference to backing Map and it will grow 
without bound?  Is that possible?

So, if default timeout for items in list is two seconds and socket timeout is 
20 seconds, will items timeout in the dead list before we ever make use of it?

You should fix the formatting adding spaces around '+' etc., so it has 
formatting like the rest of the code in this class... or is that how its done 
elsewhere in this class?   If so, fine.

Your addition where we can override HBaseClient looks generally useful.

Use EnvironmentEdge instead of System.currentMillis...

+final long start = System.currentTimeMillis();

Test looks good.

Is the HBaseRecoveryTestingUtility generally useful do you think?  Maybe we 
don't add test but add this? (This class needs class comment explaining what 
goodies it has).

Good stuff.


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431893#comment-13431893
 ] 

nkeywal commented on HBASE-6364:


bq. We already have a class named DeadServers here: 
./hbase-server/src/main/java/org/apache/hadoop/hbase/master/DeadServer.java 
Maybe name this something else?
You're right. Ok.

bq. I'm afraid we will be retaining reference to backing Map and it will grow 
without bound? Is that possible?
It's even probable :-). I need to fix this.

bq. So, if default timeout for items in list is two seconds and socket timeout 
is 20 seconds, will items timeout in the dead list before we ever make use of 
it?

No, because it's added after the timeout. So:
t0: connect starts, will wait 20s
t 19: all threads get in the queue because of the synchronized
t20: timeout; added to the dead list, synchronized lock freed
t21: all threads gets in and gets out because of the dead server list
t22: if a new thread comes in with the wrong server it will wait again

bq. Use EnvironmentEdge instead of System.currentMillis...
Even in an unit test? I didn't know. Ok.

bq. You should fix the formatting adding spaces around '+' etc., so it has 
formatting like the rest of the code in this class... or is that how its done 
elsewhere in this class? If so, fine.
I will recheck. There are some '+' like this in HBaseClient already :-). But I 
will add spaces to mines.


bq. Is the HBaseRecoveryTestingUtility generally useful do you think? Maybe we 
don't add test but add this? (This class needs class comment explaining what 
goodies it has).
Yes, I think so. It helps to write more readable failure tests. It still has 
rough edges, and may be bugs. Also, some functions should be in 
HBaseTestingUtility. It will be ready soon, but it's better to wait a little 
before committing...

Thanks for the feedback.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432059#comment-13432059
 ] 

Zhihong Ted Yu commented on HBASE-6364:
---

{code}
+  deadServers.put(expiry, address.toString());
{code}
Do we need a MultiMap as the backing store for deadServers (possibly two 
servers having the same expiration time) ?

The hbase.ipc.client.recheckServersTimeout is for dead servers. Release note is 
needed so that people know what to look for.


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432071#comment-13432071
 ] 

nkeywal commented on HBASE-6364:


bq. Do we need a MultiMap as the backing store for deadServers (possibly two 
servers having the same expiration time) ?
You're right. I will need to fix this as well.

bq. The hbase.ipc.client.recheckServersTimeout is for dead servers. Release 
note is needed so that people know what to look for.
Will do. But the default should be good for most people imho. I will put a 
warning in the release notes (there could be an issue if a node disappears and 
comes back if the setting is too high: the client will have to wait for the end 
of the recheck before seeing the node coming back).

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432080#comment-13432080
 ] 

Zhihong Ted Yu commented on HBASE-6364:
---

@N:
Thanks for the quick response. Appreciate it.
nit: please add javadoc for remoteId:
{code}
+   * Creates a connection. Can be overridden by a subclass for testing.
+   */
+  protected Connection createConnection(ConnectionId remoteId) throws 
IOException {
{code}
{code}
+IOException e = new DeadServerIOException(
+This server is is the dead server list: +server);
{code}
'is is' - 'is in'
DeadServerIOException extends IOException, I think 'IO' doesn't have to appear 
in the name of exception.

Would be nice if the next patch is put up on review board.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread Suraj Varma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432344#comment-13432344
 ] 

Suraj Varma commented on HBASE-6364:


Thanks nkeywal - this looks great. I will try to rebase it back to my version 
and see if I can arrange a re-test. However, please don't wait for me for your 
commit as it may take a few days before I can arrange to run this in my 
environment (perhaps end of next week ... at best). 



 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-08 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431174#comment-13431174
 ] 

nkeywal commented on HBASE-6364:


New version, with the results of the test implementation on the mini cluster:
- I haven't been able to simulate a connect timeout. So I need to instrument 
the code a way or another to add a sleep if we want the push the tests in the 
test suite. I can do that by subclassing HBaseClient for example. It won't be 
very clean. Is there another solution?
- When I implemented Suraj scenario, I reproduced exactly his issue (while 
previously, in my tests the time lost was only a fraction of the number of 
threads). And my fix was not working on this specific scenario.
- So I finally added a dead servers list. The expiry time is configurable 
(hbase.ipc.client.recheckServersTimeout). It's not totally without impact: if 
someone was actually relying on the time spent on retries (for example if a 
server comes down  up again), he will be disappointed... I set a small default 
value for this reason (2 seconds, ie. less than most retries wait time). 
Increasing it should be done only if there are good reasons for this.


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429012#comment-13429012
 ] 

nkeywal commented on HBASE-6364:


Reanalyzing the fix, there is an issue with the v1: we could have a call added 
to a dying connection, and this call won't get cleaned up. This is was not 
possible previously. Will write a v2.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429052#comment-13429052
 ] 

nkeywal commented on HBASE-6364:


v2, fixes the problem mentioned above, works locally.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429072#comment-13429072
 ] 

Hadoop QA commented on HBASE-6364:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539258/6364.v2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 10 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2516//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2516//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2516//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2516//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2516//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2516//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2516//console

This message is automatically generated.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429078#comment-13429078
 ] 

nkeywal commented on HBASE-6364:


v3 just changes a comment, not the code itself.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429098#comment-13429098
 ] 

Hadoop QA commented on HBASE-6364:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539272/6364.v3.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 10 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.client.TestFromClientSide
  org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks
  org.apache.hadoop.hbase.master.TestAssignmentManager

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2517//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2517//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2517//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2517//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2517//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2517//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2517//console

This message is automatically generated.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429139#comment-13429139
 ] 

nkeywal commented on HBASE-6364:


errors are unrelated imho. Retrying to see if the third executions says 
something different.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429233#comment-13429233
 ] 

stack commented on HBASE-6364:
--

Good one lads.

Fix formatting before commit N.  Make it same as surrounding code... Add 
spacings around brackets -- the 'else' -- and the '+' in String concatenations.

I think I understand the notifying that is going on on the end of the addCall 
method.  They line up w/ waits on Call and waits on the calls data member?

Would it be hard making a test of this bit of code?

What speed up around recovery are you seeing N?  Should we change the default 
timeout too as Suraj does above?

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429238#comment-13429238
 ] 

Zhihong Ted Yu commented on HBASE-6364:
---

In addCall():
{code}
+calls.put(call.id, call);
+notify();
{code}
Do we need a 'synchronized (call)' block for the above notification ?

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429295#comment-13429295
 ] 

nkeywal commented on HBASE-6364:


bq. Fix formatting before commit N.
Ok.

Before committing, I would be interested by a feedback from Suraj. There are 
just a few lines of code, so rebasing won't be complicated if he needs some 
time to test it.

bq. I think I understand the notifying that is going on on the end of the 
addCall method. They line up w/ waits on Call and waits on the calls data 
member?
It's mainly playing with the synchronized: Connection#addCall  
Connection#setupIOstreams are both synchronized, so, on an exception during 
setupIOstreams, either:
- you were waiting just before setupIOstreams, and in this case you've been 
clean up during setupIOstreams exception management
- you were waiting before the addCall, and in this case you won't be added to 
the calls list.
- in both case when you enter yourself in setupIOstreams you are filtered by 
the test on shouldCloseConnection 

bq. Would it be hard making a test of this bit of code?
It's a difficult question, because there are both the behavior of this jira and 
both the generic behavior to be tested.
1) Just for this jira, when I tested it I added a sleep to simulate a 
connection timeout. I will provide soon a (small) set of utility functions to 
better simulated this, with real timeouts. This type of test (more in the 
category of regression tests than unit tests) could be added to the integration 
tests may be. I had various issues during the tests, it was more difficult than 
expected.
2) Testing the HBaseClient itself would be useful, but the interesting path is 
the multithreaded one.

bq. What speed up around recovery are you seeing N? Should we change the 
default timeout too as Suraj does above?

For the speed up, it's arbitrary, as it depends on the number of RS. on my 
tests, 20% of the calls were serialized. I.e. with an operation on 20 rs, the 
fix makes it 3 times faster. But it seems that Suraj had a much worse 
serialization, on a bigger cluster, so for him we could expect much better 
results, likely 20 times faster or better.
Another point is that in this fix we don't keep a list of dead rs, so we cut 
the connection attempts only of they are happening while another is taking 
place.  So if he could try it would be great.

For the default timeout, I think we can cut down the connect timeout. But I 
think it's safer to make it to 5 seconds, so this fix remains important. I will 
work on this on another jira.


bq. Do we need a 'synchronized (call)' block for the above notification ?
It's a notification on the connection itself, and addCall is synchronized, so 
it's ok. Then 'calls' is a 'ConcurrentSkipListMap' so we can access it 
concurrently.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429394#comment-13429394
 ] 

stack commented on HBASE-6364:
--

bq. Before committing, I would be interested by a feedback from Suraj. There 
are just a few lines of code, so rebasing won't be complicated if he needs some 
time to test it.

Sounds good.

bq. For the default timeout, I think we can cut down the connect timeout. But I 
think it's safer to make it to 5 seconds, so this fix remains important. I will 
work on this on another jira.

Sounds good too.

On test, even if its utility to simulate so you can prove your fix, that'd be 
great.

I like the numbers you are quoting above.



 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428879#comment-13428879
 ] 

nkeywal commented on HBASE-6364:


v1. There are some other ways of doing this, like adding a list of dead servers 
and a timeout, but it does not pass the unit tests, with numerous failures. I 
haven't sorted out if there is a common root cause...

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428886#comment-13428886
 ] 

Hadoop QA commented on HBASE-6364:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539211/6364.v1.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
  org.apache.hadoop.hbase.master.TestAssignmentManager
  org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//console

This message is automatically generated.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428890#comment-13428890
 ] 

nkeywal commented on HBASE-6364:


likely to be unrelated, worked twice locally. Let's retry.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428920#comment-13428920
 ] 

Zhihong Ted Yu commented on HBASE-6364:
---

https://builds.apache.org/job/PreCommit-HBASE-Build/2514/console got aborted.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428975#comment-13428975
 ] 

Lars Hofhansl commented on HBASE-6364:
--

Is this alleviated (at least somewhat) by HBASE-6326?

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428976#comment-13428976
 ] 

Lars Hofhansl commented on HBASE-6364:
--

Looking at the issue, no it isn't. NM me. :)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-07-28 Thread Suraj Varma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424289#comment-13424289
 ] 

Suraj Varma commented on HBASE-6364:


nkeywal, my test cluster had 5 regionservers with an average of 35 regions per 
server. The client is doing Get calls from all threads - each thread creates a 
separate HTable but with the same HBaseConfiguration (so, sharing the same 
underlying HConnection.)

Case 2 that you mention above is what I think I saw. Several threads blocked on 
#addCall with a lock owned by the setupIOStreams call (on NetUtils.connect()) 
which was timing out after 20s. 

I finally managed to locate the stack traces from the test ... I'm attaching a 
snippet from that with some BLOCKED threads and the setupIOStreams call holding 
the lock. 

I believe the same behaviour would occur on the hadoop client as well ... but 
perhaps because of multiple datanodes, the problem is not as acute? In the 
HBaseClient case, all threads repeatedly reaches out for META to rediscover the 
relocated regions. 
(Attached the thread dump snippet)


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: stacktrace.txt, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-07-28 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424308#comment-13424308
 ] 

nkeywal commented on HBASE-6364:


Thanks Suraj. So it seems that we have the root cause. You're right, on 
ipc.Client it's less likely to occur. Only the NN could have the issue I think, 
and as the hot failover is not yet deployed in production, it has not been an 
issue so far. I will ping Todd on this to see what he thinks.

The only strange thing is that when tested on the 0.96 + minicluster, I have 
the issue but many threads seem to be able to 'escape', i.e. manage to do the 
'addCall' between two blocking calls to setupIOstreams. But it could be a pure 
multithreading hazard. And I haven't been able to find a better cause than this.

I will propose a patch, likely end of next week. Suraj, If you have some time 
available to validate it on your env, it would be great.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-07-19 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418397#comment-13418397
 ] 

nkeywal commented on HBASE-6364:


That's a possible cause:

in HBaseClient#getConnection()
{noformat}
do {
  synchronized (connections) {
connection = connections.get(remoteId);
if (connection == null) {
  connection = new Connection(remoteId);
  connections.put(remoteId, connection);
}
  }
} while (!connection.addCall(call));

connection.setupIOstreams();
return connection;
  }
{noformat}
Connection#addCall and Connection#setupIOstreams are synchronized. if 
#setupIOstreams fails, it marks the connection as dead, remove it from the 
connections list, and throws an exception. In #addCall it returns false if the 
connection is marked as dead. So
case 1 - Sometimes, we may add the call to a connection that will be marked as 
dead:
 Thread 1: create the connection, add it to the connections list, call addCall
 Thread 2: get the connection, add it to the calls list 
 Thread 1: get into setupIOstreams, fails, and mark the connection as dead, 
throws an exception, done
 Thread 2: get into setupIOstreams, see that the connection is dead, done. The 
call has been added to a dead connection

case 2 - If we have a lot of threads on a dying connection, we will have:
 Thread 1: goes until setupIOstreams
 All other threads: get the connection from the list, wait on the synchronized 
addCall 
 Thread 1: exit from setupIOstreams with an exception after 20 seconds (socket 
timeout)
 All other threads: call addCall, as the connection is dead, reloop
 One of these threads will create a connection
 One of these thread will win the race on addCall
 One of them will win the race on setupIOstreams
 Most of them should be waiting on addCall so reloop
 So back to the case 1 or 2.


We would have the same behavior on pure hadoop client (ipc.Client), as the 
implementation is similar, at least on 1.0.3.


Suraj, does this match your analysis? How many region servers and regions did 
you have during your test? What's the client doing?

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client

 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-07-10 Thread Suraj Varma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410555#comment-13410555
 ] 

Suraj Varma commented on HBASE-6364:


Also wanted to capture NKeywal's comments on this from the mailing list: 

NKeywal said: 
What you're describing -the 35 minutes recovery time- seems to match the code. 
And it's a bug (still there on trunk). Could you please create a jira for it? 
If you have the logs it even better.

Lowering the ipc.socket.timeout seems to be an acceptable partial workaround. 
Setting it to 10s seems ok to me. Lower than this... I don't know.


 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client

 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Client's take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e it does not affect the normal RPC actiivity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira