[jira] [Commented] (HBASE-6512) Incorrect OfflineMetaRepair log class name
[ https://issues.apache.org/jira/browse/HBASE-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428814#comment-13428814 ]

Hadoop QA commented on HBASE-6512:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12539179/HBASE-6512.diff
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).
-1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2510//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2510//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2510//console

This message is automatically generated.

Incorrect OfflineMetaRepair log class name
------------------------------------------
Key: HBASE-6512
URL: https://issues.apache.org/jira/browse/HBASE-6512
Project: HBase
Issue Type: Bug
Components: hbck
Affects Versions: 0.94.0, 0.96.0, 0.94.1, 0.94.2
Reporter: liang xie
Attachments: HBASE-6512.diff

At the beginning of OfflineMetaRepair.java, we can observe:

    private static final Log LOG = LogFactory.getLog(HBaseFsck.class.getName());

It would be better to change this to:

    private static final Log LOG = LogFactory.getLog(OfflineMetaRepair.class.getName());
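A minimal sketch of the proposed one-line fix in context, assuming the standard commons-logging imports HBase uses elsewhere:

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class OfflineMetaRepair {
  // Log under this tool's own class name instead of HBaseFsck's, so that
  // log output from the repair tool is attributed to the right component.
  private static final Log LOG =
      LogFactory.getLog(OfflineMetaRepair.class.getName());
}
{code}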
[jira] [Commented] (HBASE-6372) Add scanner batching to Export job
[ https://issues.apache.org/jira/browse/HBASE-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428816#comment-13428816 ]

Hadoop QA commented on HBASE-6372:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12539180/HBASE-6372.4.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).
-1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
  org.apache.hadoop.hbase.TestFullLogReconstruction

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2511//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2511//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2511//console

This message is automatically generated.

Add scanner batching to Export job
----------------------------------
Key: HBASE-6372
URL: https://issues.apache.org/jira/browse/HBASE-6372
Project: HBase
Issue Type: Improvement
Components: mapreduce
Affects Versions: 0.96.0, 0.94.2
Reporter: Lars George
Assignee: Shengsheng Huang
Priority: Minor
Labels: newbie
Attachments: HBASE-6372.2.patch, HBASE-6372.3.patch, HBASE-6372.4.patch, HBASE-6372.patch

When a single row is too large for the RS heap, an OOME can take out the entire RS. Setting scanner batching in custom scans helps avoid this scenario, but the supplied Export job does not set it. Similar to HBASE-3421, we can set the batching to a low number - or, if needed, make it a command line option.
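A minimal sketch of the improvement being proposed, using the public Scan API; the configuration key "hbase.export.scanner.batch" and the default batch size of 10 are illustrative assumptions here, not the committed patch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;

public class ExportScanBatching {
  public static Scan buildScan(Configuration conf) {
    Scan s = new Scan();
    // Cap the number of columns returned per Result so that a single very
    // wide row cannot blow up the region server heap during the export.
    s.setBatch(conf.getInt("hbase.export.scanner.batch", 10));
    return s;
  }

  public static void main(String[] args) {
    Scan s = buildScan(HBaseConfiguration.create());
    System.out.println("scanner batch size: " + s.getBatch());
  }
}
{code}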
[jira] [Commented] (HBASE-6372) Add scanner batching to Export job
[ https://issues.apache.org/jira/browse/HBASE-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428820#comment-13428820 ]

Hadoop QA commented on HBASE-6372:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12539193/HBASE-6372.4.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).
-1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2512//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2512//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2512//console

This message is automatically generated.
[jira] [Commented] (HBASE-6466) Enable multi-thread for memstore flush
[ https://issues.apache.org/jira/browse/HBASE-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428821#comment-13428821 ]

ramkrishna.s.vasudevan commented on HBASE-6466:
-----------------------------------------------

@Ted
I will also try to check it out. We are not able to test this as some other things are going on.

Enable multi-thread for memstore flush
--------------------------------------
Key: HBASE-6466
URL: https://issues.apache.org/jira/browse/HBASE-6466
Project: HBase
Issue Type: Improvement
Reporter: chunhui shen
Assignee: chunhui shen
Attachments: HBASE-6466.patch, HBASE-6466v2.patch, HBASE-6466v3.patch

If the KV is large, or the HLog is closed under high write pressure, we found the memstore is often above the high water mark and blocks the puts. So should we enable multiple threads for memstore flush?

Some performance test data for reference:

1. Test environment: random writing; upper memstore limit 5.6 GB; lower memstore limit 4.8 GB; 400 regions per regionserver; row length = 50 bytes, value length = 1024 bytes; 5 regionservers, 300 IPC handlers per regionserver; 5 clients, 50 writer threads per client.

2. Test results:
- one cacheFlush handler: tps 7.8k/s per regionserver, flush 10.1 MB/s per regionserver; many aboveGlobalMemstoreLimit blocks appear
- two cacheFlush handlers: tps 10.7k/s per regionserver, flush 12.46 MB/s per regionserver
- 200 writer threads per client, two cacheFlush handlers: tps 16.1k/s per regionserver, flush 18.6 MB/s per regionserver
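An illustrative sketch of the shape of the change under discussion - several flush handler threads draining one shared queue instead of a single handler. All names here are hypothetical; the actual patch modifies HBase's MemStoreFlusher:

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MultiThreadedFlusher {
  private final BlockingQueue<Runnable> flushQueue = new LinkedBlockingQueue<Runnable>();

  public MultiThreadedFlusher(int handlerCount) {
    for (int i = 0; i < handlerCount; i++) {
      Thread handler = new Thread(new Runnable() {
        public void run() {
          while (!Thread.currentThread().isInterrupted()) {
            try {
              // Each handler flushes one memstore at a time; with N handlers,
              // N flushes proceed in parallel instead of queueing behind one,
              // which is why two handlers raised both tps and flush MB/s above.
              flushQueue.take().run();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        }
      }, "cacheFlusher-" + i);
      handler.setDaemon(true);
      handler.start();
    }
  }

  public void requestFlush(Runnable flushTask) {
    flushQueue.add(flushTask);
  }
}
{code}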
[jira] [Commented] (HBASE-6505) Allow shared RegionObserver state
[ https://issues.apache.org/jira/browse/HBASE-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428835#comment-13428835 ]

Hudson commented on HBASE-6505:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #122 (see [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/122/])
HBASE-6505 Allow shared RegionObserver state (Revision 1369516)

Result = FAILURE
larsh :
Files :
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/CoprocessorHost.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/RegionCoprocessorEnvironment.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RegionCoprocessorHost.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorInterface.java

Allow shared RegionObserver state
---------------------------------
Key: HBASE-6505
URL: https://issues.apache.org/jira/browse/HBASE-6505
Project: HBase
Issue Type: Sub-task
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Fix For: 0.96.0, 0.94.2
Attachments: 6505-0.94.txt, 6505-trunk.txt, 6505-v2.txt, 6505-v3.txt, 6505-v4.txt, 6505.txt
[jira] [Commented] (HBASE-6317) Master clean start up and Partially enabled tables make region assignment inconsistent.
[ https://issues.apache.org/jira/browse/HBASE-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428846#comment-13428846 ]

rajeshbabu commented on HBASE-6317:
-----------------------------------

@Jimmy,
bq. Your patch may still not cover all scenarios, for example, if the deadServers is not empty

You are correct; there is one such scenario: all the tables are in ENABLING state (partially enabled), one or more region servers holding regions of those tables go down, and the master restarts.

@Jimmy/@Stack
Updated the patch to address the above scenario and posted it on RB: https://reviews.apache.org/r/6011/
Please review and provide your comments/suggestions.

Master clean start up and Partially enabled tables make region assignment inconsistent.
----------------------------------------------------------------------------------------
Key: HBASE-6317
URL: https://issues.apache.org/jira/browse/HBASE-6317
Project: HBase
Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
Assignee: rajeshbabu
Fix For: 0.92.2, 0.96.0, 0.94.2
Attachments: HBASE-6317_94.patch, HBASE-6317_94_3.patch

If we have a table in a partially enabled state (ENABLING), then on HMaster restart we treat it as a clean cluster start-up and do a bulk assign. Currently in 0.94, bulk assign does not handle ALREADY_OPENED scenarios, and this leads to region assignment problems. Analysing this further, we found that we have a better way to handle these scenarios:

{code}
if (false == checkIfRegionBelongsToDisabled(regionInfo)
    && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
  synchronized (this.regions) {
    regions.put(regionInfo, regionLocation);
    addToServers(regionLocation, regionInfo);
  }
}
{code}

We don't add the region to the regions map so that the enable table handler can handle it. But as nothing is added to the regions map, we think it is a clean cluster start-up. Will come up with a patch tomorrow.
[jira] [Commented] (HBASE-6506) Setting CACHE_BLOCKS to false in an hbase shell scan doesn't work
[ https://issues.apache.org/jira/browse/HBASE-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428866#comment-13428866 ]

stack commented on HBASE-6506:
------------------------------

Can you make a patch Josh? Thanks.

You probably don't want to turn off caching completely? You should cache at least the blocks that hold the hfile indices. Otherwise, with caching fully off, any read will have to bring in the indices each time first.

Setting CACHE_BLOCKS to false in an hbase shell scan doesn't work
-----------------------------------------------------------------
Key: HBASE-6506
URL: https://issues.apache.org/jira/browse/HBASE-6506
Project: HBase
Issue Type: Bug
Components: shell
Affects Versions: 0.94.0
Reporter: Josh Wymer
Priority: Minor
Labels: cache, ruby, scan, shell
Original Estimate: 1m
Remaining Estimate: 1m

I was attempting to prevent blocks from being cached by setting CACHE_BLOCKS => false in the hbase shell when doing a scan, but I kept seeing tons of evictions when I ran it. After inspecting table.rb I found this line:

    cache = args[CACHE_BLOCKS] || true

The problem is that if args[CACHE_BLOCKS] is false, this expression still returns true. Therefore, it's impossible to turn off block caching.
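For comparison, the Java client API exposes the same switch directly, and it only affects the scan's data blocks; a minimal sketch:

{code}
import org.apache.hadoop.hbase.client.Scan;

public class NoCacheScan {
  public static Scan build() {
    Scan scan = new Scan();
    // Skip the block cache for this scan's data blocks. Index and bloom
    // blocks are still cached, which is in line with stack's advice above.
    scan.setCacheBlocks(false);
    return scan;
  }
}
{code}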
[jira] [Commented] (HBASE-6476) Replace all occurrences of System.currentTimeMillis() with EnvironmentEdge equivalent
[ https://issues.apache.org/jira/browse/HBASE-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428869#comment-13428869 ]

stack commented on HBASE-6476:
------------------------------

+1 on commit. Open a new issue to add in the check?

Replace all occurrences of System.currentTimeMillis() with EnvironmentEdge equivalent
--------------------------------------------------------------------------------------
Key: HBASE-6476
URL: https://issues.apache.org/jira/browse/HBASE-6476
Project: HBase
Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Minor
Fix For: 0.96.0
Attachments: 6476-v2.txt, 6476-v2.txt, 6476.txt

There are still some areas where System.currentTimeMillis() is used in HBase. In order to make all parts of the code base testable and (potentially) to be able to configure HBase's notion of time, this should generally be replaced with EnvironmentEdgeManager.currentTimeMillis().

How hard would it be to add a Maven task that checks for this, so we do not reintroduce System.currentTimeMillis in the future?
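A minimal sketch of the replacement pattern and of why it helps testing - the clock can be swapped out by injecting an EnvironmentEdge (the fixed value below is only for illustration):

{code}
import org.apache.hadoop.hbase.util.EnvironmentEdge;
import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

public class ClockExample {
  public static void main(String[] args) {
    // Before: long now = System.currentTimeMillis();
    long now = EnvironmentEdgeManager.currentTimeMillis();
    System.out.println("wall clock: " + now);

    // In a test, time can be frozen by injecting a deterministic edge.
    EnvironmentEdgeManager.injectEdge(new EnvironmentEdge() {
      @Override
      public long currentTimeMillis() {
        return 42L; // fixed time for repeatable assertions
      }
    });
    System.out.println("frozen clock: " + EnvironmentEdgeManager.currentTimeMillis());
  }
}
{code}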
[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-6364:
---------------------------
Attachment: 6364.v1.patch

Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
------------------------------------------------------------------------------------------------------------------------------------------------------
Key: HBASE-6364
URL: https://issues.apache.org/jira/browse/HBASE-6364
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.90.6, 0.92.1, 0.94.0
Reporter: Suraj Varma
Assignee: nkeywal
Labels: client
Attachments: 6364.v1.patch, stacktrace.txt

When a server host with a Region Server holding the .META. table is powered down on a live cluster, the HBase cluster itself detects and reassigns the .META. table, but connected HBase clients take an excessively long time to detect this and re-discover the reassigned .META.

Workaround: Decrease ipc.socket.timeout on the HBase client side to a low value (the default is 20s, leading to a 35-minute recovery time; we were able to get acceptable results with 100ms, getting a 3-minute recovery).

This was found during some hardware failure testing scenarios.

Test Case:
1) Apply load via a client app on the HBase cluster for several minutes
2) Power down the region server holding the .META. server (i.e. power off ... and keep it off)
3) Measure how long it takes for the cluster to reassign the META table and for client threads to re-lookup and re-orient to the lesser cluster (minus the RS and DN on that host).

Observation:
1) Client threads spike up to maxThreads size ... and take over 35 mins to recover (i.e. for the thread count to go back to normal) - no client calls are serviced - they just back up on a synchronized method (see #2 below)
2) All the client app threads queue up behind the oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj

After taking several thread dumps we found that the thread within this synchronized method was blocked on NetUtils.connect(this.socket, remoteId.getAddress(), getSocketTimeout(conf)); The client thread that gets the synchronized lock tries to connect to the dead RS (until the socket times out after 20s) and retries, then the next thread gets in, and so forth in a serial manner.

Workaround:
-----------
The default ipc.socket.timeout is 20s. We dropped this to a low number (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, the client threads recovered in a couple of minutes by failing fast and re-discovering the .META. table on a reassigned RS.

Assumption: This ipc.socket.timeout is only ever used during the initial HConnection setup via NetUtils.connect, and should only ever be used when connectivity to a region server is lost and needs to be re-established, i.e. it does not affect normal RPC activity, as this is just the connect timeout. During RS GC periods, any _new_ clients trying to connect will fail and will require .META. table re-lookups.

This above timeout workaround is only for the HBase client side.
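A minimal sketch of the client-side workaround described above, setting the same property programmatically instead of in hbase-site.xml; the 100 ms value is one of the settings the reporter tested:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FastFailClientConf {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Connect timeout only: fail fast when the RS host is powered off,
    // instead of the 20s default that serializes into ~35 min of recovery.
    conf.setInt("ipc.socket.timeout", 100);
    return conf;
  }
}
{code}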
[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-6364:
---------------------------
Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428879#comment-13428879 ]

nkeywal commented on HBASE-6364:
--------------------------------

v1. There are some other ways of doing this, like adding a list of dead servers and a timeout, but that does not pass the unit tests, with numerous failures. I haven't sorted out whether there is a common root cause...
[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428886#comment-13428886 ]

Hadoop QA commented on HBASE-6364:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12539211/6364.v1.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).
-1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
  org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
  org.apache.hadoop.hbase.master.TestAssignmentManager
  org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2513//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2513//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2513//console

This message is automatically generated.
[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428890#comment-13428890 ]

nkeywal commented on HBASE-6364:
--------------------------------

Likely to be unrelated; it worked twice locally. Let's retry.
[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-6364:
---------------------------
Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-6364:
---------------------------
Attachment: 6364.v1.patch
[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-6364:
---------------------------
Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428920#comment-13428920 ]

Zhihong Ted Yu commented on HBASE-6364:
---------------------------------------

https://builds.apache.org/job/PreCommit-HBASE-Build/2514/console got aborted.
[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Ted Yu updated HBASE-6364:
----------------------------------
Attachment: 6364-host-serving-META.v1.patch

Patch from N.
[jira] [Commented] (HBASE-6414) Remove the WritableRpcEngine associated Invocation classes
[ https://issues.apache.org/jira/browse/HBASE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428957#comment-13428957 ]

Zhihong Ted Yu commented on HBASE-6414:
---------------------------------------

{code}
+    if (builder.mergeDelimitedFrom(in)) {
+      value = builder.build();
+    }
{code}
Should there be an else block for the above if?

{code}
+    } catch (Exception e) {
+      // TODO Auto-generated catch block
+      e.printStackTrace();
{code}
Should something similar to closeException be introduced to save the caught exception?

There are a few white spaces, visible if you put the patch on review board.

Remove the WritableRpcEngine associated Invocation classes
----------------------------------------------------------
Key: HBASE-6414
URL: https://issues.apache.org/jira/browse/HBASE-6414
Project: HBase
Issue Type: Improvement
Affects Versions: 0.96.0
Reporter: Devaraj Das
Assignee: Devaraj Das
Fix For: 0.96.0
Attachments: 6414-initial.patch.txt, 6414-initial.patch.txt

Remove the WritableRpcEngine Invocation classes once HBASE-5705 gets committed and all the protocols are rebased to use PB. Raising this jira in advance.
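A minimal sketch of what the reviewer is asking for, assuming a protobuf Message.Builder read; the class and method names here are illustrative, not the patch's actual code:

{code}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import com.google.protobuf.Message;

public class DelimitedRead {
  // mergeDelimitedFrom returns false when the stream is already at EOF,
  // so the caller should handle that case instead of silently leaving the
  // value null - the else block the review comment asks about.
  public static Message read(Message.Builder builder, InputStream in) throws IOException {
    if (builder.mergeDelimitedFrom(in)) {
      return builder.build();
    } else {
      throw new EOFException("stream ended before a delimited message could be read");
    }
  }
}
{code}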
[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428975#comment-13428975 ]

Lars Hofhansl commented on HBASE-6364:
--------------------------------------

Is this alleviated (at least somewhat) by HBASE-6326?
[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428976#comment-13428976 ] Lars Hofhansl commented on HBASE-6364: -- Looking at the issue, no it isn't. NM me. :) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table - Key: HBASE-6364 URL: https://issues.apache.org/jira/browse/HBASE-6364 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Suraj Varma Assignee: nkeywal Labels: client Attachments: 6364-host-serving-META.v1.patch, 6364.v1.patch, 6364.v1.patch, stacktrace.txt When a server host with a Region Server holding the .META. table is powered down on a live cluster, the HBase cluster itself detects and reassigns the .META. table, but connected HBase Clients take an excessively long time to detect this and re-discover the reassigned .META. Workaround: Decrease ipc.socket.timeout on the HBase Client side to a low value (the default is 20s, leading to a 35-minute recovery time; we were able to get acceptable results with 100ms, giving a 3-minute recovery). This was found during some hardware failure testing scenarios. Test Case: 1) Apply load via client app on HBase cluster for several minutes 2) Power down the region server holding the .META. table (i.e. power off ... and keep it off) 3) Measure how long it takes for the cluster to reassign the .META. table and for client threads to re-lookup and re-orient to the lesser cluster (minus the RS and DN on that host). Observation: 1) Client threads spike up to maxThreads size ... and take over 35 mins to recover (i.e. for the thread count to go back to normal) - no client calls are serviced - they just back up on a synchronized method (see #2 below) 2) All the client app threads queue up behind the oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj After taking several thread dumps we found that the thread within this synchronized method was blocked on NetUtils.connect(this.socket, remoteId.getAddress(), getSocketTimeout(conf)); The client thread that holds the synchronized lock tries to connect to the dead RS (until the socket times out after 20s), retries, and then the next thread gets in, and so forth in a serial manner. Workaround: --- The default ipc.socket.timeout is 20s. We dropped this to a low number (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, the client threads recovered in a couple of minutes by failing fast and re-discovering the .META. table on a reassigned RS. Assumption: ipc.socket.timeout is only used during initial HConnection setup via NetUtils.connect, i.e. only when connectivity to a region server is lost and needs to be re-established; it does not affect normal RPC activity, as it is just the connect timeout. During RS GC periods, any _new_ clients trying to connect will fail and will require .META. table re-lookups. The above timeout workaround applies only to the HBase client side. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
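To make the reported workaround concrete, here is a minimal sketch of a client that lowers the connect timeout programmatically instead of editing hbase-site.xml. This is an illustration, not code from the issue: the property name ipc.socket.timeout and the 100ms value come from the report above, while the class name is hypothetical.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical client-side sketch of the workaround described in HBASE-6364.
public class LowConnectTimeoutClient {
  public static void main(String[] args) {
    // Start from the standard HBase client configuration (reads hbase-site.xml).
    Configuration conf = HBaseConfiguration.create();

    // Fail fast when connecting to a dead region server: drop ipc.socket.timeout
    // from the 20s default to 100ms, as in the workaround above. Per the report,
    // this only affects the initial connect, not in-flight RPCs.
    conf.setInt("ipc.socket.timeout", 100);

    // ... build HTable / HConnection instances from this conf as usual ...
  }
}
{code}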
[jira] [Commented] (HBASE-6414) Remove the WritableRpcEngine associated Invocation classes
[ https://issues.apache.org/jira/browse/HBASE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428980#comment-13428980 ] Zhihong Ted Yu commented on HBASE-6414: --- {code} +public AuthenticationTokenSecretManager createSecretManager(){ {code} I found only one reference to the above method, so I guess it doesn't have to be public. {code} + Class<? extends Message> rpcArgClassname = null; {code} The above variable holds a Class, not a class name, so it should be renamed accordingly. Remove the WritableRpcEngine associated Invocation classes Key: HBASE-6414 URL: https://issues.apache.org/jira/browse/HBASE-6414 Project: HBase Issue Type: Improvement Affects Versions: 0.96.0 Reporter: Devaraj Das Assignee: Devaraj Das Fix For: 0.96.0 Attachments: 6414-initial.patch.txt, 6414-initial.patch.txt Remove the WritableRpcEngine Invocation classes once HBASE-5705 gets committed and all the protocols are rebased to use PB. Raising this jira in advance. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
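For clarity, a sketch of what the two review suggestions amount to. The enclosing class and field context here are hypothetical (the real context is in the 6414 patch, which is not reproduced in this digest); only the HBase and protobuf types are real.
{code}
import com.google.protobuf.Message;
import org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager;

// Hypothetical sketch of the two review suggestions; not the actual patch code.
class RpcServerSketch {
  // Suggestion 2: the field holds a Class object, not a String name,
  // so rpcArgClass is a better name than rpcArgClassname.
  private Class<? extends Message> rpcArgClass = null;

  // Suggestion 1: only one caller exists, so package-private visibility
  // is sufficient; no need for public.
  AuthenticationTokenSecretManager createSecretManager() {
    return null; // construction details omitted in this sketch
  }
}
{code}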
[jira] [Created] (HBASE-6513) Test errors when building on MacOS
Archimedes Trajano created HBASE-6513: - Summary: Test errors when building on MacOS Key: HBASE-6513 URL: https://issues.apache.org/jira/browse/HBASE-6513 Project: HBase Issue Type: Bug Components: build Environment: MacOSX 10.8 Oracle JDK 1.7 Reporter: Archimedes Trajano Results : Failed tests: testBackgroundEvictionThread[0](org.apache.hadoop.hbase.io.hfile.TestLruBlockCache): expected:<2> but was:<1> testBackgroundEvictionThread[1](org.apache.hadoop.hbase.io.hfile.TestLruBlockCache): expected:<2> but was:<1> testSplitCalculatorEq(org.apache.hadoop.hbase.util.TestRegionSplitCalculator): expected:<2> but was:<1> -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6514) unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
Archimedes Trajano created HBASE-6514: - Summary: unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram Key: HBASE-6514 URL: https://issues.apache.org/jira/browse/HBASE-6514 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Environment: MacOS 10.8 Oracle JDK 1.7 Reporter: Archimedes Trajano When trying to run a unit test that just starts up and shuts down the server, the following errors occur on System.out 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6514) unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
[ https://issues.apache.org/jira/browse/HBASE-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428983#comment-13428983 ] Archimedes Trajano commented on HBASE-6514: --- The test case does pass, though. unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram Key: HBASE-6514 URL: https://issues.apache.org/jira/browse/HBASE-6514 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Environment: MacOS 10.8 Oracle JDK 1.7 Reporter: Archimedes Trajano When trying to run a unit test that just starts up and shuts down the server, the following errors occur on System.out 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6514) unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram
[ https://issues.apache.org/jira/browse/HBASE-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Archimedes Trajano updated HBASE-6514: -- Attachment: FrameworkTest.java Sample JUnit test I had used. unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram Key: HBASE-6514 URL: https://issues.apache.org/jira/browse/HBASE-6514 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Environment: MacOS 10.8 Oracle JDK 1.7 Reporter: Archimedes Trajano Attachments: FrameworkTest.java When trying to run a unit test that just starts up and shuts down the server, the following errors occur on System.out 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,874 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram 01:10:59,875 ERROR MetricsUtil:116 - unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
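The attached FrameworkTest.java is not reproduced in this digest; a start-up/shut-down test of the kind described would look roughly like the sketch below. The class name and structure are assumptions; only HBaseTestingUtility and its startMiniCluster()/shutdownMiniCluster() calls are standard HBase test API.
{code}
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

// Hypothetical stand-in for the attached FrameworkTest.java, not the actual file.
public class FrameworkSketchTest {
  private static final HBaseTestingUtility UTIL = new HBaseTestingUtility();

  @BeforeClass
  public static void setUp() throws Exception {
    // Starting the mini cluster is what surfaces the MetricsUtil
    // "unknown metrics type" errors reported above on MacOS / JDK 1.7.
    UTIL.startMiniCluster();
  }

  @AfterClass
  public static void tearDown() throws Exception {
    UTIL.shutdownMiniCluster();
  }

  @Test
  public void startsAndStops() {
    // Intentionally empty: per the report, the test passes even though
    // the errors are printed during startup and shutdown.
  }
}
{code}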