[jira] [Commented] (HBASE-6512) Incorrect OfflineMetaRepair log class name

2012-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428814#comment-13428814
 ] 

Hadoop QA commented on HBASE-6512:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539179/HBASE-6512.diff
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2510//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2510//console

This message is automatically generated.

 Incorrect OfflineMetaRepair log class name
 -

 Key: HBASE-6512
 URL: https://issues.apache.org/jira/browse/HBASE-6512
 Project: HBase
  Issue Type: Bug
  Components: hbck
Affects Versions: 0.94.0, 0.96.0, 0.94.1, 0.94.2
Reporter: liang xie
 Attachments: HBASE-6512.diff


 At the beginning of OfflineMetaRepair.java, we can observe:
 private static final Log LOG = LogFactory.getLog(HBaseFsck.class.getName());
 It would be better to change it to:
 private static final Log LOG = LogFactory.getLog(OfflineMetaRepair.class.getName());
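 For reference, a minimal sketch of the suggested declaration in context (the imports and
 class skeleton are assumptions for illustration, not taken from the attached patch):
 {code}
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;

 public class OfflineMetaRepair {
   // Attribute log output to OfflineMetaRepair itself rather than to HBaseFsck.
   private static final Log LOG = LogFactory.getLog(OfflineMetaRepair.class.getName());
 }
 {code}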

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6372) Add scanner batching to Export job

2012-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428816#comment-13428816
 ] 

Hadoop QA commented on HBASE-6372:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539180/HBASE-6372.4.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.TestFullLogReconstruction

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2511//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2511//console

This message is automatically generated.

 Add scanner batching to Export job
 --

 Key: HBASE-6372
 URL: https://issues.apache.org/jira/browse/HBASE-6372
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Affects Versions: 0.96.0, 0.94.2
Reporter: Lars George
Assignee: Shengsheng Huang
Priority: Minor
  Labels: newbie
 Attachments: HBASE-6372.2.patch, HBASE-6372.3.patch, 
 HBASE-6372.4.patch, HBASE-6372.patch


 When a single row is too large for the RS heap then an OOME can take out the 
 entire RS. Setting scanner batching in custom scans helps avoiding this 
 scenario, but for the supplied Export job this is not set.
 Similar to HBASE-3421 we can set the batching to a low number - or if needed 
 make it a command line option.
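 As an illustration only, scanner batching can be applied to the Scan an Export-style job
 builds; the option name "hbase.export.scanner.batch" below is hypothetical, not necessarily
 what the attached patches use:
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.client.Scan;

 public class ExportBatchingSketch {
   // Apply batching to the Scan an Export-style job builds before submitting it.
   static Scan applyBatching(Scan scan, Configuration conf) {
     // "hbase.export.scanner.batch" is an illustrative option name, not from the patches.
     int batch = conf.getInt("hbase.export.scanner.batch", 100);
     if (batch > 0) {
       // Cap the columns returned per Result so one very wide row cannot exhaust the RS heap.
       scan.setBatch(batch);
     }
     return scan;
   }
 }
 {code}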

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6372) Add scanner batching to Export job

2012-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428820#comment-13428820
 ] 

Hadoop QA commented on HBASE-6372:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539193/HBASE-6372.4.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2512//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2512//console

This message is automatically generated.

 Add scanner batching to Export job
 --

 Key: HBASE-6372
 URL: https://issues.apache.org/jira/browse/HBASE-6372
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Affects Versions: 0.96.0, 0.94.2
Reporter: Lars George
Assignee: Shengsheng Huang
Priority: Minor
  Labels: newbie
 Attachments: HBASE-6372.2.patch, HBASE-6372.3.patch, 
 HBASE-6372.4.patch, HBASE-6372.patch


 When a single row is too large for the RS heap then an OOME can take out the 
 entire RS. Setting scanner batching in custom scans helps avoiding this 
 scenario, but for the supplied Export job this is not set.
 Similar to HBASE-3421 we can set the batching to a low number - or if needed 
 make it a command line option.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6466) Enable multi-thread for memstore flush

2012-08-05 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428821#comment-13428821
 ] 

ramkrishna.s.vasudevan commented on HBASE-6466:
---

@Ted
I will also try to check it out.  We are not able to test this as some other 
things are going on.

 Enable multi-thread for memstore flush
 --

 Key: HBASE-6466
 URL: https://issues.apache.org/jira/browse/HBASE-6466
 Project: HBase
  Issue Type: Improvement
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-6466.patch, HBASE-6466v2.patch, HBASE-6466v3.patch


 If KVs are large, or the HLog is closed under heavy write pressure, we found the
 memstore is often above the high water mark and blocks puts.
 So should we enable multi-threaded memstore flush?
 Some performance test data for reference:
 1. Test environment: random writing; upper memstore limit 5.6GB; lower memstore limit
 4.8GB; 400 regions per regionserver; row len = 50 bytes, value len = 1024 bytes;
 5 regionservers, 300 ipc handlers per regionserver; 5 clients, 50 writer threads per client
 2. Test results:
 one cacheFlush handler: tps 7.8k/s per regionserver, flush 10.1MB/s per regionserver,
 with frequent aboveGlobalMemstoreLimit blocking
 two cacheFlush handlers: tps 10.7k/s per regionserver, flush 12.46MB/s per regionserver
 200 thread handlers per client and two cacheFlush handlers: tps 16.1k/s per regionserver,
 flush 18.6MB/s per regionserver
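 For context, a sketch of the region server watermarks referenced in the test setup
 (the 5.6GB/4.8GB limits correspond to fractions of the heap; property names are the
 standard ones for this era, values below are illustrative):
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;

 // Sketch only: upper/lower global memstore limits as fractions of the RS heap.
 Configuration conf = HBaseConfiguration.create();
 conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.40f); // puts block above this
 conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.35f); // flush down toward this
 {code}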

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6505) Allow shared RegionObserver state

2012-08-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428835#comment-13428835
 ] 

Hudson commented on HBASE-6505:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #122 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/122/])
HBASE-6505 Allow shared RegionObserver state (Revision 1369516)

 Result = FAILURE
larsh : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/CoprocessorHost.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/RegionCoprocessorEnvironment.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RegionCoprocessorHost.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorInterface.java


 Allow shared RegionObserver state
 -

 Key: HBASE-6505
 URL: https://issues.apache.org/jira/browse/HBASE-6505
 Project: HBase
  Issue Type: Sub-task
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.2

 Attachments: 6505-0.94.txt, 6505-trunk.txt, 6505-v2.txt, 6505-v3.txt, 
 6505-v4.txt, 6505.txt




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6317) Master clean start up and Partially enabled tables make region assignment inconsistent.

2012-08-05 Thread rajeshbabu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428846#comment-13428846
 ] 

rajeshbabu commented on HBASE-6317:
---

@Jimmy,
bq.Your patch may still not cover all scenarios, for example, if the 
deadServers is not empty
You are correct in one scenario: if all the tables are in ENABLING
state (partially enabled), one or more region servers holding regions of
those tables went down, and the master restarted.

@Jimmy/@Stack
Updated the patch to address the above scenario and posted it on RB:
https://reviews.apache.org/r/6011/

Please review and provide your comments/suggestions.


 Master clean start up and Partially enabled tables make region assignment 
 inconsistent.
 ---

 Key: HBASE-6317
 URL: https://issues.apache.org/jira/browse/HBASE-6317
 Project: HBase
  Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
Assignee: rajeshbabu
 Fix For: 0.92.2, 0.96.0, 0.94.2

 Attachments: HBASE-6317_94.patch, HBASE-6317_94_3.patch


 If we have a table in partially enabled state (ENABLING), then on HMaster
 restart we treat it as a clean cluster start up and do a bulk assign.
 Currently in 0.94, bulk assign will not handle ALREADY_OPENED scenarios and it
 leads to region assignment problems. Analysing this further, we found that we
 have a better way to handle these scenarios.
 {code}
 if (false == checkIfRegionBelongsToDisabled(regionInfo)
     && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
   synchronized (this.regions) {
     regions.put(regionInfo, regionLocation);
     addToServers(regionLocation, regionInfo);
   }
 }
 {code}
 We don't add to the regions map so that the enable table handler can handle it. But
 as nothing is added to the regions map, we treat it as a clean cluster start up.
 Will come up with a patch tomorrow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6506) Setting CACHE_BLOCKS to false in an hbase shell scan doesn't work

2012-08-05 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428866#comment-13428866
 ] 

stack commented on HBASE-6506:
--

Can you make a patch Josh?  Thanks.

You probably don't want to turn off caching completely? You should cache at
least the blocks that hold the hfile indices. Otherwise, with caching fully off,
every read will have to bring the indices in again first.

 Setting CACHE_BLOCKS to false in an hbase shell scan doesn't work
 -

 Key: HBASE-6506
 URL: https://issues.apache.org/jira/browse/HBASE-6506
 Project: HBase
  Issue Type: Bug
  Components: shell
Affects Versions: 0.94.0
Reporter: Josh Wymer
Priority: Minor
  Labels: cache, ruby, scan, shell
   Original Estimate: 1m
  Remaining Estimate: 1m

 I was attempting to prevent blocks from being cached by setting CACHE_BLOCKS 
 = false in the hbase shell when doing a scan but I kept seeing tons of 
 evictions when I ran it. After inspecting table.rb I found this line:
 cache = args[CACHE_BLOCKS] || true
 The problem then is that if CACHE_BLOCKS is false then this expression will 
 always return true. Therefore, it's impossible to turn off block caching. 
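 For context, the shell option is meant to map to the Java client's Scan#setCacheBlocks;
 a minimal sketch of the intended behaviour (this is not the table.rb fix itself):
 {code}
 import org.apache.hadoop.hbase.client.Scan;

 Scan scan = new Scan();
 // An explicit false has to survive the option plumbing; a "value || true" style
 // default silently turns an explicit false back into true.
 scan.setCacheBlocks(false); // keep this scan's data blocks out of the block cache
 {code}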

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6476) Replace all occurrences of System.currentTimeMillis() with EnvironmentEdge equivalent

2012-08-05 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428869#comment-13428869
 ] 

stack commented on HBASE-6476:
--

+1 on commit.  Open new issue to add in check?

 Replace all occurrences of System.currentTimeMillis() with EnvironmentEdge 
 equivalent
 -

 Key: HBASE-6476
 URL: https://issues.apache.org/jira/browse/HBASE-6476
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Minor
 Fix For: 0.96.0

 Attachments: 6476-v2.txt, 6476-v2.txt, 6476.txt


 There are still some areas where System.currentTimeMillis() is used in HBase. 
 In order to make all parts of the code base testable and (potentially) to be 
 able to configure HBase's notion of time, this should generally be
 replaced with EnvironmentEdgeManager.currentTimeMillis().
 How hard would it be to add a maven task that checks for this, so we do not
 reintroduce System.currentTimeMillis in the future?
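 A before/after sketch of the substitution being proposed (variable names are illustrative):
 {code}
 import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

 // Before: hard-wired to the wall clock, which tests cannot control.
 long before = System.currentTimeMillis();

 // After: routed through the pluggable EnvironmentEdge so tests can inject a
 // controllable clock (e.g. ManualEnvironmentEdge).
 long after = EnvironmentEdgeManager.currentTimeMillis();
 {code}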

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v1.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.
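 As an illustration of the client-side workaround described above (the value is one of the
 examples mentioned, not a recommendation from this issue):
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;

 // Lower the connect timeout so a powered-off .META. host is abandoned quickly.
 Configuration conf = HBaseConfiguration.create();
 conf.setInt("ipc.socket.timeout", 1000); // default is 20000 ms
 {code}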

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.94.0, 0.92.1, 0.90.6
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428879#comment-13428879
 ] 

nkeywal commented on HBASE-6364:


v1. There are some other ways of doing this, like adding a list of dead servers 
and a timeout, but it does not pass the unit tests, with numerous failures. I 
haven't sorted out if there is a common root cause...

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428886#comment-13428886
 ] 

Hadoop QA commented on HBASE-6364:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12539211/6364.v1.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
  org.apache.hadoop.hbase.master.TestAssignmentManager
  org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2513//console

This message is automatically generated.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the 

[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428890#comment-13428890
 ] 

nkeywal commented on HBASE-6364:


likely to be unrelated, worked twice locally. Let's retry.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Open  (was: Patch Available)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.94.0, 0.92.1, 0.90.6
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v1.patch

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Status: Patch Available  (was: Open)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.94.0, 0.92.1, 0.90.6
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364.v1.patch, 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428920#comment-13428920
 ] 

Zhihong Ted Yu commented on HBASE-6364:
---

https://builds.apache.org/job/PreCommit-HBASE-Build/2514/console got aborted.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-6364:
--

Attachment: 6364-host-serving-META.v1.patch

Patch from N.

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
 value (default is 20s leading to 35 minute recovery time; we were able to get 
 acceptable results with 100ms getting a 3 minute recovery) 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via the NetUtils.connect and should only ever be used 
 when connectivity to a region server is lost and needs to be re-established. 
 i.e. it does not affect the normal RPC activity as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6414) Remove the WritableRpcEngine & associated Invocation classes

2012-08-05 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428957#comment-13428957
 ] 

Zhihong Ted Yu commented on HBASE-6414:
---

{code}
+  if (builder.mergeDelimitedFrom(in)) {
+value = builder.build();
+  }
{code}
Should there be an else block for the above if?
{code}
+  } catch (Exception e) {
+// TODO Auto-generated catch block
+e.printStackTrace();
{code}
Should something similar to closeException be introduced to save the caught
exception?

There are a few whitespace issues, visible if you put the patch on Review Board.
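
A rough sketch of what the two questions above point at (builder and in are as in the
snippet quoted earlier; value and deserializationException are illustrative names, not
from the patch):
{code}
import java.io.IOException;
import com.google.protobuf.Message;

// Handle the "nothing read" branch explicitly and keep the caught exception for
// the caller instead of only printing it.
Message value = null;
Exception deserializationException = null;
try {
  if (builder.mergeDelimitedFrom(in)) {
    value = builder.build();
  } else {
    value = null; // stream exhausted: make the empty case explicit
  }
} catch (IOException e) {
  deserializationException = e; // preserved, analogous to closeException
}
{code}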

 Remove the WritableRpcEngine & associated Invocation classes
 

 Key: HBASE-6414
 URL: https://issues.apache.org/jira/browse/HBASE-6414
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.96.0
Reporter: Devaraj Das
Assignee: Devaraj Das
 Fix For: 0.96.0

 Attachments: 6414-initial.patch.txt, 6414-initial.patch.txt


 Remove the WritableRpcEngine & Invocation classes once HBASE-5705 gets 
 committed and all the protocols are rebased to use PB.
 Raising this jira in advance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428975#comment-13428975
 ] 

Lars Hofhansl commented on HBASE-6364:
--

Is this alleviated (at least somewhat) by HBASE-6326?

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease ipc.socket.timeout on the HBase client side to a low 
 value (the default of 20s leads to a roughly 35 minute recovery time; we were 
 able to get acceptable results with 100ms, giving a roughly 3 minute recovery). 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via NetUtils.connect and should only ever come into play 
 when connectivity to a region server is lost and needs to be re-established, 
 i.e. it does not affect normal RPC activity, as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 The above timeout workaround applies only to the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-05 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428976#comment-13428976
 ] 

Lars Hofhansl commented on HBASE-6364:
--

Looking at the issue, no it isn't. NM me. :)

 Powering down the server host holding the .META. table causes HBase Client to 
 take excessively long to recover and connect to reassigned .META. table
 -

 Key: HBASE-6364
 URL: https://issues.apache.org/jira/browse/HBASE-6364
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
  Labels: client
 Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 
 6364.v1.patch, stacktrace.txt


 When a server host with a Region Server holding the .META. table is powered 
 down on a live cluster, while the HBase cluster itself detects and reassigns 
 the .META. table, connected HBase Clients take an excessively long time to 
 detect this and re-discover the reassigned .META. 
 Workaround: Decrease ipc.socket.timeout on the HBase client side to a low 
 value (the default of 20s leads to a roughly 35 minute recovery time; we were 
 able to get acceptable results with 100ms, giving a roughly 3 minute recovery). 
 This was found during some hardware failure testing scenarios. 
 Test Case:
 1) Apply load via client app on HBase cluster for several minutes
 2) Power down the region server holding the .META. server (i.e. power off ... 
 and keep it off)
 3) Measure how long it takes for cluster to reassign META table and for 
 client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
 and DN on that host).
 Observation:
 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
 recover (i.e. for the thread count to go back to normal) - no client calls 
 are serviced - they just back up on a synchronized method (see #2 below)
 2) All the client app threads queue up behind the 
 oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
 After taking several thread dumps we found that the thread within this 
 synchronized method was blocked on  NetUtils.connect(this.socket, 
 remoteId.getAddress(), getSocketTimeout(conf));
 The client thread that gets the synchronized lock would try to connect to the 
 dead RS (till socket times out after 20s), retries, and then the next thread 
 gets in and so forth in a serial manner.
 Workaround:
 ---
 Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
 (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, 
 the client threads recovered in a couple of minutes by failing fast and 
 re-discovering the .META. table on a reassigned RS.
 Assumption: This ipc.socket.timeout is only ever used during the initial 
 HConnection setup via NetUtils.connect and should only ever come into play 
 when connectivity to a region server is lost and needs to be re-established, 
 i.e. it does not affect normal RPC activity, as this is just the connect 
 timeout.
 During RS GC periods, any _new_ clients trying to connect will fail and will 
 require .META. table re-lookups.
 The above timeout workaround applies only to the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6414) Remove the WritableRpcEngine & associated Invocation classes

2012-08-05 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428980#comment-13428980
 ] 

Zhihong Ted Yu commented on HBASE-6414:
---

{code}
+public AuthenticationTokenSecretManager createSecretManager(){
{code}
I only find one reference to the above method. I guess it doesn't have to be 
public.
{code}
+  Class<? extends Message> rpcArgClassname = null;
{code}
The above variable represents a Class object, not a class name.
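
A small sketch of the two suggestions above (illustrative only; the wrapping class is
invented, and Object stands in for AuthenticationTokenSecretManager so the fragment
compiles without the security classes on the classpath):

{code}
import com.google.protobuf.Message;

class RpcEngineReviewSketch {

  // Package-private (was public): only one caller of this factory method was found.
  Object createSecretManager() {
    return null;
  }

  // Renamed from rpcArgClassname: the field holds a Class object, not a class name.
  Class<? extends Message> rpcArgClass = null;
}
{code}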

 Remove the WritableRpcEngine & associated Invocation classes
 

 Key: HBASE-6414
 URL: https://issues.apache.org/jira/browse/HBASE-6414
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.96.0
Reporter: Devaraj Das
Assignee: Devaraj Das
 Fix For: 0.96.0

 Attachments: 6414-initial.patch.txt, 6414-initial.patch.txt


 Remove the WritableRpcEngine & Invocation classes once HBASE-5705 gets 
 committed and all the protocols are rebased to use PB.
 Raising this jira in advance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6513) Test errors when building on MacOS

2012-08-05 Thread Archimedes Trajano (JIRA)
Archimedes Trajano created HBASE-6513:
-

 Summary: Test errors when building on MacOS
 Key: HBASE-6513
 URL: https://issues.apache.org/jira/browse/HBASE-6513
 Project: HBase
  Issue Type: Bug
  Components: build
 Environment: MacOSX 10.8 
Oracle JDK 1.7
Reporter: Archimedes Trajano



Results :

Failed tests:
  testBackgroundEvictionThread[0](org.apache.hadoop.hbase.io.hfile.TestLruBlockCache): expected:<2> but was:<1>
  testBackgroundEvictionThread[1](org.apache.hadoop.hbase.io.hfile.TestLruBlockCache): expected:<2> but was:<1>
  testSplitCalculatorEq(org.apache.hadoop.hbase.util.TestRegionSplitCalculator): expected:<2> but was:<1>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6514) unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram

2012-08-05 Thread Archimedes Trajano (JIRA)
Archimedes Trajano created HBASE-6514:
-

 Summary: unknown metrics type: 
org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 Key: HBASE-6514
 URL: https://issues.apache.org/jira/browse/HBASE-6514
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
 Environment: MacOS 10.8
Oracle JDK 1.7
Reporter: Archimedes Trajano


When trying to run a unit test that just starts up and shuts down the server, the 
following errors occur in System.out:

01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: 
org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: 
org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: 
org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: 
org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6514) unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram

2012-08-05 Thread Archimedes Trajano (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428983#comment-13428983
 ] 

Archimedes Trajano commented on HBASE-6514:
---

The test case does pass though.

 unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 

 Key: HBASE-6514
 URL: https://issues.apache.org/jira/browse/HBASE-6514
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
 Environment: MacOS 10.8
 Oracle JDK 1.7
Reporter: Archimedes Trajano

 When trying to run a unit test that just starts up and shuts down the server, 
 the following errors occur in System.out:
 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6514) unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram

2012-08-05 Thread Archimedes Trajano (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Archimedes Trajano updated HBASE-6514:
--

Attachment: FrameworkTest.java

Sample JUnit test I had used.
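
The attachment itself is not reproduced in this thread; a hypothetical test of the
shape described (one that just starts up and shuts down the server) might look like
the following, assuming the 0.94 HBaseTestingUtility mini-cluster API:

{code}
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class FrameworkTest {

  private static final HBaseTestingUtility UTIL = new HBaseTestingUtility();

  @BeforeClass
  public static void startCluster() throws Exception {
    // Spins up an in-process HBase (and HDFS) mini-cluster for the test.
    UTIL.startMiniCluster();
  }

  @AfterClass
  public static void stopCluster() throws Exception {
    UTIL.shutdownMiniCluster();
  }

  @Test
  public void clusterStartsAndStops() {
    // Intentionally empty: the point is only that start-up and shut-down succeed;
    // the MetricsHistogram errors described above appear on System.out during
    // that start-up/shut-down cycle.
  }
}
{code}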

 unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 

 Key: HBASE-6514
 URL: https://issues.apache.org/jira/browse/HBASE-6514
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
 Environment: MacOS 10.8
 Oracle JDK 1.7
Reporter: Archimedes Trajano
 Attachments: FrameworkTest.java


 When trying to run a unit test that just starts up and shuts down the server, 
 the following errors occur in System.out:
 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: 
 org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira