[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2013-04-06 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6364:
-

Fix Version/s: 0.94.2 (was: 0.95.0)

Fix-up after the bulk move overwrote some 0.94.2 fix versions with 0.95.0 
(noticed by Lars Hofhansl).

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: Nicolas Liochon
  Labels: client
 Fix For: 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, the HBase cluster itself detects the failure and 
 reassigns the .META. table, but connected HBase clients take an excessively 
 long time to detect this and re-discover the reassigned .META. table.
 Workaround: decrease ipc.socket.timeout on the HBase client side to a low 
 value (the default of 20s leads to a 35-minute recovery time; we were able to 
 get acceptable results with 100ms, giving a 3-minute recovery).
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server host holding the .META. table (i.e. power 
 off ... and keep it off)
 3) Measure how long it takes for the cluster to reassign the .META. table and 
 for client threads to re-look up and re-orient to the reduced cluster (minus 
 the RS and DN on that host).
 Observation:
 1) Client threads spike up to the maxThreads size ... and take over 35 minutes 
 to recover (i.e. for the thread count to go back to normal); no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that holds the synchronized lock tries to connect to the 
 dead RS (until the socket times out after 20s), retries, and only then does 
 the next thread get in, and so on in a serial manner.
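 A rough sketch of why this serializes (hypothetical code, not the actual 
 HBase client; the class name and thread count are made up for illustration):

    // Hypothetical sketch: N client threads funnel through one synchronized
    // connection-setup method, and each waits out the full connect timeout
    // against the dead host before the next thread can even try.
    public class SerialConnectSketch {
        static final int CONNECT_TIMEOUT_MS = 20000; // default ipc.socket.timeout

        // Stand-in for the synchronized HBaseClient#setupIOStreams: one caller at a time.
        static synchronized void setupIOStreams() throws InterruptedException {
            Thread.sleep(CONNECT_TIMEOUT_MS); // models NetUtils.connect timing out
        }

        public static void main(String[] args) {
            int threads = 100; // hypothetical number of blocked client threads
            // Threads drain one by one behind the lock, so worst-case recovery is
            // roughly threads * timeout (plus retries): 100 * 20s is about 33 minutes,
            // in line with the ~35 minutes observed above.
            System.out.println("Worst case ~" + (threads * CONNECT_TIMEOUT_MS / 60000) + " minutes");
        }
    }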
 Workaround:
 ---
 The default ipc.socket.timeout is 20s. We dropped this to a low value 
 (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: ipc.socket.timeout is only ever used during the initial 
 HConnection setup via NetUtils.connect and should only come into play 
 when connectivity to a region server is lost and needs to be re-established; 
 i.e. it does not affect normal RPC activity, as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 The above timeout workaround applies only to the HBase client side.
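 A minimal sketch of this client-side workaround (assuming the standard 
 HBaseConfiguration/Configuration API; the 1000 ms value is only an example), 
 equivalent to setting the property in the client-side hbase-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class LowConnectTimeoutConf {
        // Returns a client configuration with a short connect timeout so that an
        // attempt against a dead RegionServer fails fast and the .META. location
        // is re-discovered sooner.
        public static Configuration create() {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            conf.setInt("ipc.socket.timeout", 1000);          // default is 20000 ms
            return conf;
        }
    }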

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.94.v2.nolargetest.patch

6364.94.v2.nolargetest.patch contains the patch for 0.94. My own test depends 
on a class that does not exist in 0.94, so I didn't test it on 0.95.

Unit tests ok, except   
testClientPoolRoundRobin(org.apache.hadoop.hbase.client.TestFromClientSide): 
The number of versions of '[B@4c9cde9a:[B@4eda77c1 did not match 4 expected:4 
but was:3

It failed once; the second try was OK. Committed.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-22 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.94.v2.nolargetest.security-addendum.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364.94.v2.nolargetest.patch, 
 6364.94.v2.nolargetest.security-addendum.patch, 
 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v11.nolargetest.patch

This is the version that will be committed if the local tests (in progress) are OK.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-6364:
-

Fix Version/s: 0.94.2

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 
 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-15 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v9.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-15 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.94.0, 0.92.1, 0.90.6
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-15 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Release Note: The client (ipc.HBaseClient) now keeps a list of failed 
connection attempts and does not retry a connection to the same server until 2 
seconds after a failure. This can be configured by setting 
hbase.ipc.client.failed.servers.expiry, the number of milliseconds to wait 
before retrying the same server. Note that some clients retry multiple times to 
tolerate transient errors; if this parameter is set to a large value, these 
clients may fail without the server actually being retried.
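
As a hedged sketch (standard Configuration API; 2000 ms is the default mentioned 
above, not a recommendation), the backoff can be tuned on the client like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FailedServerExpiryConf {
        // hbase.ipc.client.failed.servers.expiry: milliseconds during which a
        // just-failed server is not re-attempted. Keep it small relative to the
        // caller's retry budget, or the retries can be exhausted without the
        // server ever being re-tried.
        public static Configuration create() {
            Configuration conf = HBaseConfiguration.create();
            conf.setInt("hbase.ipc.client.failed.servers.expiry", 2000);
            return conf;
        }
    }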

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
 stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-10 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Fix Version/s: (was: 0.94.2)
   Status: Open  (was: Patch Available)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.94.0, 0.92.1, 0.90.6
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-10 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v7.withtests.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-10 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v8.withtests.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 6364.v7.withtests.patch, 6364.v8.withtests.patch, stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Open  (was: Patch Available)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.94.0, 0.92.1, 0.90.6
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v6.withtests.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.withtests.patch, stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v6.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.94.0, 0.92.1, 0.90.6
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Fix For: 0.96.0, 0.94.2

 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
 stacktrace.txt






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-08 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v5.patch





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-08 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Open  (was: Patch Available)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-08 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-08 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v5.withtests.patch





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Open  (was: Patch Available)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v2.patch





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Fix Version/s: 0.96.0
   Status: Patch Available  (was: Open)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v3.patch





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Open  (was: Patch Available)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Open  (was: Patch Available)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v3.patch





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-06 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-6364:
-

Fix Version/s: 0.94.2

I'm not sure I've wrapped my head around the issue completely, but from the 
discussion here and looking at the patch it looks right.
This should be in 0.94 as well.





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v1.patch





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Open  (was: Patch Available)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v1.patch





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-6364:
--

Attachment: 6364-host-serving-META.v1.patch

Patch from N.





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-07-28 Thread Suraj Varma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Varma updated HBASE-6364:
---

Attachment: stacktrace.txt





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-07-28 Thread Suraj Varma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Varma updated HBASE-6364:
---

Attachment: stacktrace.txt

Stack trace snippet from the test.





[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-07-28 Thread Suraj Varma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Varma updated HBASE-6364:
---

Attachment: (was: stacktrace.txt)


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira