[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228233#comment-13228233 ] chunhui shen commented on HBASE-5270: - I have submitted a patch here: HBASE-5270v11.patch. I only added some notes for joinCluster(), changed some comments explaining what deadNotExpired servers are in ServerManager, and fixed the spaces around '='. Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.2 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK: Reviewing 0.92v17: isDeadServerInProgress is a new public method in ServerManager but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. The method param names are not right; 'definitiveRootServer': what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it's carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? It should be doc'd at least. Ditto for meta. I think I've asked for this a few times - onlineServers needs to be explained... either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know but am unsure.
God love the poor noob that comes a-wandering this code trying to make sense of it all. It looks like we get the list by trawling zk for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? Are we doing it again here? Though distributed log splitting is configured, with this patch we will do single-process splitting in the master under some conditions. It's not explained in the code why we would do this. Why do we think master log splitting is 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know if concurrent distributed log splitting and master splitting work together? Why would we have dead servers in progress here in master startup? Because a ServerShutdownHandler fired? This patch is different from the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x and a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228236#comment-13228236 ] stack commented on HBASE-5569: -- Ugh. Indexing JIRA lost my comment. Looking at builds, we don't have much of a history on trunk builds, but TestAtomicOperation started failing today when HBASE-5399 'Cut the link between the client and the zookeeper ensemble' went in (among others). I see over in the hadoopqa builds that it doesn't fail if I go back twenty-odd builds. It did break here, https://builds.apache.org/view/G-L/view/HBase/job/PreCommit-HBASE-Build/1168/, and on a later build. Should I try reverting it? TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Priority: Minor What I have pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see an assertion failure in the log, I see this before it: {quote} 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0 {quote} The order of the Put and the Delete is sometimes reversed. The test threads should always see exactly one KV: if the 'before' was the Put, the threads see 0 KVs; if the 'before' was the Delete, the threads see 2 KVs. This debug message comes from StoreScanner#checkReseek. It seems we still have some consistency issue with scanning sometimes :(
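As a rough illustration of the invariant the test asserts, here is a simplified model (a sketch under stated assumptions, NOT HBase code; the class and its in-memory cell list are invented): each mutation writes the new Put and the DeleteColumn of the old version as one atomic unit, so a concurrent scanner must always count exactly one live version. Observing the Put without the Delete gives 2 KVs; the Delete without the Put gives 0 KVs -- the two failure modes described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the TestAtomicOperation invariant (hypothetical,
// not HBase code). A cell version is a positive timestamp; a negative
// entry -ts models a DeleteColumn masking the Put at ts.
public class ScanInvariant {
    private final List<Long> cells = new ArrayList<>();

    // One mutation = Put of the new version + DeleteColumn of the old,
    // applied as a single atomic unit (modeled here with synchronized).
    public synchronized void mutate(long oldTs, long newTs) {
        cells.add(newTs);
        cells.add(-oldTs);
    }

    // "Scanner": count Put versions not masked by a DeleteColumn.
    public synchronized int liveVersions() {
        Set<Long> deletes = new HashSet<>();
        for (long c : cells) if (c < 0) deletes.add(-c);
        int live = 0;
        for (long c : cells) if (c > 0 && !deletes.contains(c)) live++;
        return live;
    }

    public static void main(String[] args) throws Exception {
        ScanInvariant s = new ScanInvariant();
        s.mutate(0, 1); // seed: version 1 is live
        Thread writer = new Thread(() -> {
            for (long ts = 1; ts < 500; ts++) s.mutate(ts, ts + 1);
        });
        writer.start();
        for (int i = 0; i < 500; i++) {
            int n = s.liveVersions();
            // The invariant: never 0, never 2 -- always exactly 1.
            if (n != 1) throw new AssertionError("scanner saw " + n + " versions");
        }
        writer.join();
        System.out.println("scanner always saw exactly 1 live version");
    }
}
```

If the two `cells.add` calls were visible to readers separately (as the flaky test suggests happens in the real scanner), the reader loop would intermittently observe 0 or 2 versions.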
[jira] [Commented] (HBASE-5059) Tests for: Support deleted rows in CopyTable
[ https://issues.apache.org/jira/browse/HBASE-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228241#comment-13228241 ] stack commented on HBASE-5059: -- @Evan If you poke around, you can find the logs of your test run up on the apache build server. If you run the test locally, does it pass? To retry your patch against hadoopqa, hit 'cancel patch' above, then reattach, and then hit 'submit patch' again. See if your patch fails a second or third time; if it does, add debug to your patch to help figure out what's going on. Thanks. Tests for: Support deleted rows in CopyTable Key: HBASE-5059 URL: https://issues.apache.org/jira/browse/HBASE-5059 Project: HBase Issue Type: Sub-task Components: mapreduce Affects Versions: 0.94.0 Reporter: Lars Hofhansl Assignee: Evan Beard Priority: Minor Fix For: 0.94.0 Attachments: TestCopyTable_HBASE_5059.patch
[jira] [Updated] (HBASE-5434) [REST] Include more metrics in cluster status request
[ https://issues.apache.org/jira/browse/HBASE-5434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5434: - Attachment: HBASE-5434.trunk.v2.patch [REST] Include more metrics in cluster status request - Key: HBASE-5434 URL: https://issues.apache.org/jira/browse/HBASE-5434 Project: HBase Issue Type: Improvement Components: metrics, rest Affects Versions: 0.94.0 Reporter: Mubarak Seyed Assignee: Mubarak Seyed Priority: Minor Labels: noob Fix For: 0.94.0 Attachments: HBASE-5434.trunk.v1.patch, HBASE-5434.trunk.v2.patch, HBASE-5434.trunk.v2.patch /status/cluster shows only {code} stores=2 storefiles=0 storefileSizeMB=0 memstoreSizeMB=0 storefileIndexSizeMB=0 {code} for a region, but the master web-ui shows {code} stores=1, storefiles=0, storefileUncompressedSizeMB=0 storefileSizeMB=0 memstoreSizeMB=0 storefileIndexSizeMB=0 readRequestsCount=0 writeRequestsCount=0 rootIndexSizeKB=0 totalStaticIndexSizeKB=0 totalStaticBloomSizeKB=0 totalCompactingKVs=0 currentCompactedKVs=0 compactionProgressPct=NaN {code} In a write-heavy, REST-gateway-based production environment, the ops team needs to verify whether write counters are getting incremented per region (they do run /status/cluster on each REST server). We can get the same values from *rpc.metrics.put_num_ops* and *hbase.regionserver.writeRequestsCount*, but some home-grown tools need to parse the output of /status/cluster and update the dashboard.
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5270: - Status: Open (was: Patch Available)
[jira] [Resolved] (HBASE-5314) Gracefully rolling restart region servers in rolling-restart.sh
[ https://issues.apache.org/jira/browse/HBASE-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-5314. -- Resolution: Fixed Fix Version/s: 0.96.0 Hadoop Flags: Reviewed Committed to trunk. Thanks for the patch Yifeng. Gracefully rolling restart region servers in rolling-restart.sh --- Key: HBASE-5314 URL: https://issues.apache.org/jira/browse/HBASE-5314 Project: HBase Issue Type: Improvement Components: scripts Reporter: Yifeng Jiang Priority: Minor Fix For: 0.96.0 Attachments: HBASE-5314.patch, HBASE-5314.patch.2 The rolling-restart.sh script has a --rs-only option which simply restarts all region servers in the cluster. Consider improving it to gracefully restart region servers, to avoid offline time for the regions deployed on each server, and to keep the region distribution the same as it was before the restart.
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228251#comment-13228251 ] Hadoop QA commented on HBASE-5270: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518153/HBASE-5270v11.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 161 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestImportTsv Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1172//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1172//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1172//console This message is automatically generated. 
[jira] [Updated] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5571: Attachment: HBASE-5571.patch Table will be disabling forever --- Key: HBASE-5571 URL: https://issues.apache.org/jira/browse/HBASE-5571 Project: HBase Issue Type: Bug Components: master, regionserver Reporter: chunhui shen Assignee: chunhui shen Attachments: HBASE-5571.patch If we restart the master while it is disabling a table, the table will stay in the disabling state forever. With the current logic, the region CLOSE RPC will always return NotServingRegionException, because the RS had already closed the region before we restarted the master. So the table will be disabling forever, because the region stays in RIT all along. In another case, AssignmentManager#rebuildUserRegions() puts parent regions into AssignmentManager.regions, so if we disable the table we cannot close these parent regions until they are purged by the CatalogJanitor.
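The first failure mode above suggests the shape of a fix: during disable, treat NotServingRegionException from the close RPC as "already closed" rather than as a failure, so the region can leave RIT. The following is a hypothetical sketch of that idea only, not the actual HBASE-5571 patch; the exception stub and method names are invented stand-ins.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch (NOT the HBASE-5571 patch): during table disable,
// a NotServingRegionException from the close RPC means the RS already
// closed the region (e.g. before a master restart), so the master should
// count the close as done instead of leaving the region stuck in RIT.
public class DisableCloseSketch {
    static class NotServingRegionException extends Exception {}

    // Stand-in for the region server's close RPC: the region is gone.
    static void closeRegionRpc(String region) throws NotServingRegionException {
        throw new NotServingRegionException();
    }

    private final Set<String> regionsInTransition = new HashSet<>();

    public void closeForDisable(String region) {
        regionsInTransition.add(region);
        try {
            closeRegionRpc(region);
        } catch (NotServingRegionException e) {
            // RS no longer serves the region: it is effectively closed,
            // so fall through and clear RIT rather than retrying forever.
        }
        regionsInTransition.remove(region);
    }

    public boolean stuck(String region) {
        return regionsInTransition.contains(region);
    }

    public static void main(String[] args) {
        DisableCloseSketch m = new DisableCloseSketch();
        m.closeForDisable("region-1");
        if (m.stuck("region-1")) throw new AssertionError("region stuck in RIT");
        System.out.println("region-1 left RIT despite NotServingRegionException");
    }
}
```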
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228297#comment-13228297 ] Laxman commented on HBASE-5564: --- Scope of this issue: 1) Avoid the behavioral inconsistency with the timestamp parameter. {noformat} Currently in code, a) If the timestamp parameter is configured, duplicate records will be overwritten. b) If not configured, some duplicate records are maintained as different versions. {noformat} This fix should be in line with the expectation Todd has mentioned: bq. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it. 2) Provide an option to look up the timestamp column value from the input data (like the ROWKEY column). Example: importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code' I will submit a patch with the above-mentioned approach. Any other add-ons? Bulkload is discarding duplicate records Key: HBASE-5564 URL: https://issues.apache.org/jira/browse/HBASE-5564 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0 Environment: HBase 0.92 Reporter: Laxman Assignee: Laxman Labels: bulkloader Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically when they exist in the same split. Duplicate records are preserved only when they come from different splits. Version under test: HBase 0.92
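Point 2 above (an HBASE_TS_KEY marker column resolved from the input data, like HBASE_ROW_KEY) could look roughly like this. This is a hypothetical illustration of the proposed option, not the actual importtsv implementation; the class and method names are invented.

```java
// Hypothetical sketch of the proposed HBASE_TS_KEY option for importtsv
// (NOT the real importtsv parser): given the configured column spec and
// one input line, extract the timestamp field so duplicate rows loaded
// in one bulk load can carry distinct, data-driven timestamps.
public class TsvTsKeyParser {

    // Returns the timestamp from the HBASE_TS_KEY column, or -1 if the
    // spec has no such column (caller would fall back to the job-wide
    // timestamp, as importtsv does today).
    public static long extractTimestamp(String columnsSpec, String line, char sep) {
        String[] cols = columnsSpec.split(",");
        String[] fields = line.split(String.valueOf(sep));
        for (int i = 0; i < cols.length && i < fields.length; i++) {
            if (cols[i].trim().equals("HBASE_TS_KEY")) {
                return Long.parseLong(fields[i].trim());
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long ts = extractTimestamp(
            "HBASE_ROW_KEY, HBASE_TS_KEY, emp:name", "row1\t1331595000\tbob", '\t');
        if (ts != 1331595000L) throw new AssertionError();
        System.out.println("ts=" + ts);
    }
}
```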
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228300#comment-13228300 ] nkeywal commented on HBASE-5399: @stack Yes, this test is flaky... I reproduce the error on the trunk as of March 10th as well. I've seen it failing previously; I think it has been flaky for at least a month (and maybe much more). git log: {noformat} commit 0f3e025a62f89763fffbf8298d565a6c4e5b7d06 Date: Sat Mar 10 02:27:05 2012 + {noformat} With the same stack as in trunk #2676: {noformat} Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.444 sec <<< FAILURE! testMultiRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation) Time elapsed: 7.083 sec <<< FAILURE! junit.framework.AssertionFailedError: expected:<0> but was:<1> at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at junit.framework.Assert.assertEquals(Assert.java:199) at junit.framework.Assert.assertEquals(Assert.java:205) at org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testMultiRowMutationMultiThreads(TestAtomicOperation.java:416) {noformat} Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch The link is often considered as
an issue, for various reasons, one of them being that there is a limit on the number of connections that ZK can manage. Stack also suggested removing the link to the master from HConnection. There are choices to be made considering the existing API (which we don't want to break). The first patches I will submit to hadoop-qa should not be committed: they are here to show the progress on the direction taken. ZooKeeper is used for: - public getter, to let the client do whatever he wants, and close ZooKeeper when closing the connection => we have to deprecate this but keep it. - read the master address to create a master => now done with a temporary zookeeper connection. - read the root location => now done with a temporary zookeeper connection, but questionable. Used in the public function locateRegion. To be reworked. - read the cluster id => now done once with a temporary zookeeper connection. - check if the base node is available => now done once with a zookeeper connection given as a parameter. - isTableDisabled/isTableAvailable => public functions, now done with a temporary zookeeper connection. Called internally from HBaseAdmin and HTable. - getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads => now done with a temporary zookeeper connection. Master is used for: - getMaster: public getter, as for ZooKeeper => we have to deprecate this but keep it. - isMasterRunning(): public function, used internally by HMerge and HBaseAdmin. - getHTableDescriptor*: public functions offering access to the master => we could make them use a temporary master connection as well. Main points are: - hbase class for ZooKeeper: ZooKeeperWatcher is really designed for a strongly coupled architecture ;-). This can be changed, but it requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that).
Anyway, a non-connected client will always be much slower, because establishing a TCP connection is slow. - having a link between ZK and all the clients seems to make sense for some use cases; however, it won't scale if a TCP connection is required for every client. - if we move the table descriptor part away from the client, we need to find a new place for it. - we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic, however.
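The "temporary zookeeper connection" pattern the list above keeps referring to can be sketched as follows. This is a minimal illustration of the idea only; the ZkSession type and its readValue method are invented stand-ins, not the real ZooKeeperWatcher API.

```java
// Sketch of the "temporary zookeeper connection" approach: instead of a
// long-lived watcher, open a short-lived session, read one value (master
// address, root location, cluster id), and close it, so the client holds
// no permanent slot in the ZK ensemble. Hypothetical stand-in types.
public class TemporaryConnection {
    static class ZkSession implements AutoCloseable {
        private boolean open = true;
        String readValue(String znode) {
            if (!open) throw new IllegalStateException("session closed");
            return "value-at-" + znode; // stand-in for a real ZK read
        }
        @Override public void close() { open = false; }
    }

    static String readOnce(String znode) {
        // try-with-resources guarantees the session (and its TCP
        // connection) is released as soon as the read completes.
        try (ZkSession zk = new ZkSession()) {
            return zk.readValue(znode);
        }
    }

    public static void main(String[] args) {
        String master = readOnce("/hbase/master");
        if (!master.equals("value-at-/hbase/master")) throw new AssertionError();
        System.out.println("read master address via temporary session: " + master);
    }
}
```

The trade-off the comment itself notes: each such read pays a fresh TCP setup, which is why a non-connected client is inherently slower.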
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228406#comment-13228406 ] Laxman commented on HBASE-5564: --- While testing the patch locally, I'm getting the following error on trunk. Any hints on this, please? {noformat} java.lang.RuntimeException: java.io.IOException: Call to localhost/127.0.0.1:0 failed on local exception: java.net.BindException: Cannot assign requested address: no further information at org.apache.hadoop.mapred.MiniMRCluster.waitUntilIdle(MiniMRCluster.java:323) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:524) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:462) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:454) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:446) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:436) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:426) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:417) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1269) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1255) at org.apache.hadoop.hbase.mapreduce.TestImportTsv.doMROnTableTest(TestImportTsv.java:189) at org.apache.hadoop.hbase.mapreduce.TestImportTsv.testMROnTable(TestImportTsv.java:162) {noformat}
[jira] [Created] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
KeeperException.SessionExpiredException management could be improved in Master -- Key: HBASE-5572 URL: https://issues.apache.org/jira/browse/HBASE-5572 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Synthesis: 1) TestMasterZKSessionRecovery distinguishes two cases on SessionExpiredException. One is explicitly not managed; however, it seems that there is no reason for this. 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function with a useless recursive call. 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired. 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above. Changes are: 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and removing the recursion. 1) Removing TestMasterZKSessionRecovery. Detailed justification: testMasterZKSessionRecoveryFailure says: {noformat} /** * Negative test of master recovery from zk session expiry. * * Starts with one master. Fakes the master zk session expired. * Ensures the master cannot recover the expired zk session since * the master zk node is still there. */ public void testMasterZKSessionRecoveryFailure() throws Exception { MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster(); HMaster m = cluster.getMaster(); m.abort("Test recovery from zk session expired", new KeeperException.SessionExpiredException()); assertTrue(m.isStopped()); } {noformat} This test works, i.e. the assertion is always verified. But do we really want this behavior? When looking at the code, we see that what's happening is strange: - HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master. - HMaster#abortNow checks the exception type.
As it's a SessionExpiredException, it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stop the master). - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeper connection and then tries to become the active master. If it cannot, it will return false (and that will make HMaster#abort stop the master). - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it from becoming the active master. - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same host and port, it deletes the node, which starts a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. This result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even though we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master. In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there", but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return true, so we would not stop the master, and the test would fail.
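The control-flow bug described above is easy to see in miniature. The following is a hypothetical model, not the real ActiveMasterManager code: the recursive call succeeds and returns true, but the first invocation drops that result and returns false, so the caller stops a master that actually did become active.

```java
// Minimal model of the recursion bug in blockUntilBecomingActiveMaster
// (hypothetical stand-in, NOT the real ActiveMasterManager code).
public class RecursionBug {
    static boolean staleZnodePresent;

    // Buggy shape: deletes the stale master znode, recurses, then
    // ignores the recursive result and reports failure.
    static boolean blockUntilActiveBuggy() {
        if (staleZnodePresent) {
            staleZnodePresent = false;   // delete our stale znode
            blockUntilActiveBuggy();     // recursion succeeds...
            return false;                // ...but its result is dropped
        }
        return true;                     // became the active master
    }

    // Fixed shape: a single loop, no recursion, so the success result
    // cannot be lost on the way back up.
    static boolean blockUntilActiveFixed() {
        while (staleZnodePresent) {
            staleZnodePresent = false;   // delete stale znode and retry
        }
        return true;
    }

    public static void main(String[] args) {
        staleZnodePresent = true;
        boolean buggy = blockUntilActiveBuggy();
        staleZnodePresent = true;
        boolean fixed = blockUntilActiveFixed();
        if (buggy || !fixed) throw new AssertionError();
        System.out.println("buggy path reported failure despite success; fixed path reported success");
    }
}
```

This is exactly why the "negative" test always passes: the false propagates up through tryRecoveringExpiredZKSession, and HMaster#abort stops the master.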
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Attachment: 5572.v1.patch
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228421#comment-13228421 ] ramkrishna.s.vasudevan commented on HBASE-5206: --- For the 0.92 version the test case passed even with your updated patch Ted. The change that you made is needed. Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk-v2.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228453#comment-13228453 ] stack commented on HBASE-5270: -- I committed to trunk. Can you make a version for 0.92 please Chunhui? The trunk patch does not seem to apply. Thank you. Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.2 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK: Reviewing 0.92v17 isDeadServerInProgress is a new public method in ServerManager but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. This method's param names are not right: 'definitiveRootServer'; what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it's carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta. I think I've asked for this a few times - onlineServers needs to be explained... either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know but am unsure. God love the poor noob that comes awandering this code trying to make sense of it all. 
It looks like we get the list by trawling zk for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? Are we doing it again here? Though distributed split log is configured, with this patch we will do single-process splitting in the master under some conditions. It's not explained in the code why we would do this. Why do we think master log splitting is 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know if concurrent distributed log splitting and master splitting works? Why would we have dead servers in progress here in master startup? Because a ServerShutdownHandler fired? This patch is different from the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x and a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied.
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228460#comment-13228460 ] stack commented on HBASE-5572: -- @N Thanks for digging in. So, it looks like your patch retains the behavior where if the current master has the same host and port, we'll expire it, and then try to register ourselves (because we go around to the top of your new while loop)? Is that so? I believe we have a test to ensure this behavior IIRC. Patch looks good. +1.
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228464#comment-13228464 ] stack commented on HBASE-5399: -- bq. Yes, this test is flaky... I reproduce the error on the trunk as of March 10th as well. How do you do this? You run the test multiple times? bq. I think it's flaky for at the very least a month (and may be much more) In your estimation, we broke this a while back. Any clue what did it? Thanks. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch The link is often considered an issue, for various reasons, one of them being that there is a limit on the number of connections that ZK can manage. Stack was suggesting as well to remove the link to the master from HConnection. There are choices to be made considering the existing API (that we don't want to break). The first patches I will submit on hadoop-qa should not be committed: they are here to show the progress on the direction taken. ZooKeeper is used for: - public getter, to let the client do whatever it wants, and close ZooKeeper when closing the connection => we have to deprecate this but keep it. 
- read the master address to create a master connection => now done with a temporary zookeeper connection - read root location => now done with a temporary zookeeper connection, but questionable. Used in the public function locateRegion. To be reworked. - read cluster id => now done once with a temporary zookeeper connection. - check if the base node is available => now done once with a zookeeper connection given as a parameter - isTableDisabled/isTableAvailable => public functions, now done with a temporary zookeeper connection. Called internally from HBaseAdmin and HTable - getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads => now done with a temporary zookeeper connection Master is used for: - getMaster public getter, as for ZooKeeper => we have to deprecate this but keep it. - isMasterRunning(): public function, used internally by HMerge and HBaseAdmin - getHTableDescriptor*: public functions offering access to the master => we could make them use a temporary master connection as well. Main points are: - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be really slower, because it's a tcp connection, and establishing a tcp connection is slow. - having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client - if we move the table descriptor part away from the client, we need to find a new place for it - we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic however.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228467#comment-13228467 ] stack commented on HBASE-5564: -- Googling it, it's either something already listening on the port, or your 127.0.0.1 has been removed? See http://www-01.ibm.com/support/docview.wss?uid=swg21233733 Bulkload is discarding duplicate records Key: HBASE-5564 URL: https://issues.apache.org/jira/browse/HBASE-5564 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0 Environment: HBase 0.92 Reporter: Laxman Assignee: Laxman Labels: bulkloader Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically when they exist in the same split. Duplicate records are retained if the records come from different splits. Version under test: HBase 0.92
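One plausible mechanism for the same-split-only loss Laxman reports (a hedged sketch, not the actual bulkloader code) is a sort step that collects the records of one group into a set: two identical records arriving in the same group collapse to one, while identical records routed through different splits reach different groups and both survive. The `sortGroup` helper and its string "records" below are purely illustrative stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: a TreeSet-based sort step silently drops duplicates that land
// in the same group, while duplicates in separate groups both survive.
public class DedupSketch {
    // Sort one group of records; TreeSet keeps exactly one copy per
    // distinct element, so in-group duplicates vanish.
    public static List<String> sortGroup(List<String> recordsInOneGroup) {
        return new ArrayList<>(new TreeSet<>(recordsInOneGroup));
    }

    public static void main(String[] args) {
        // Two identical records in the SAME split/group: one is discarded.
        List<String> sameSplit = List.of("row1/cf:q/ts=5/valA", "row1/cf:q/ts=5/valA");
        System.out.println(sortGroup(sameSplit).size()); // 1

        // The same two records arriving via DIFFERENT splits are processed
        // as separate groups, so both survive.
        int total = sortGroup(List.of("row1/cf:q/ts=5/valA")).size()
                  + sortGroup(List.of("row1/cf:q/ts=5/valA")).size();
        System.out.println(total); // 2
    }
}
```

If the real bulkloader sorts with a set like this, a fix would need a collection that tolerates equal elements (for example a sorted list) or a tie-breaker in the comparison.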
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228476#comment-13228476 ] nkeywal commented on HBASE-5572: Yes. I've done 3 modifications in the code, two like for like (hopefully!) and one with a different behavior. I: - removed the variable named cleanSetOfActiveMaster, replaced by return true or return false. - replaced the recursive call by a while(true) loop. - implicitly (it's hidden because there is no recursive call anymore) changed the function behavior: we now return the final result. For this reason the function behaves differently (we return true instead of false), but it's more in line with the method contract. This change breaks testMasterZKSessionRecoveryFailure, because it does not fail anymore. TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure was testing explicitly the behavior with both SessionExpired AND a master with the same host and port. I removed it, but I can move it to TestZooKeeper (to save a cluster start/stop) and reverse the assertion in the test (now it does not fail).
[jira] [Commented] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228484#comment-13228484 ] stack commented on HBASE-5571: -- Oh, any chance of a test? Table will be disabling forever --- Key: HBASE-5571 URL: https://issues.apache.org/jira/browse/HBASE-5571 Project: HBase Issue Type: Bug Components: master, regionserver Reporter: chunhui shen Assignee: chunhui shen Attachments: HBASE-5571.patch If we restart the master while it is disabling a table, the table will be disabling forever. In the current logic, the region CLOSE RPC will always return NotServingRegionException because the RS already closed the region before we restarted the master. So the table will be disabling forever because the region will be in RIT all along. In another case, when AssignmentManager#rebuildUserRegions() runs, it puts parent regions into AssignmentManager.regions, so we can't close these parent regions until they are purged by CatalogJanitor if we disable the table.
[jira] [Commented] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228482#comment-13228482 ] stack commented on HBASE-5571: -- This patch has much merit. Thanks for digging in on it. I like how you made a method cancelClosingRegionIfDisabling to hold a bunch of code that was inside a catch block. I see why you want to know if a region is splitting -- it makes it so we can remove the comments where we speculate a region is splitting -- but I don't think keeping a list of outstanding regions in the HRS is the right place for it... in particular I don't think we should keep this state in the OnlineRegions interface (splitting regions are not online). When we go to split a region, we put this fact up into zk. Why not check there rather than trying to keep around a collection of splitting regions?
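Stack's point can be sketched with a minimal, self-contained Java model. This is an illustration only: the znode store is simulated with a map, the znode path is hypothetical, and none of this is the actual HBase or ZooKeeper API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Contrasts two ways of answering "is this region splitting?":
// a per-regionserver collection versus consulting the shared (simulated)
// znode store that the split workflow already writes to.
public class SplitStateSketch {
    // Local collection kept in the RS -- duplicated state that can drift.
    private final Set<String> splittingRegions = new HashSet<>();

    // Simulated znode store standing in for the ZooKeeper ensemble.
    private final Map<String, String> znodes = new HashMap<>();

    // Illustrative path layout, not the actual HBase znode structure.
    private static String splitZnode(String region) {
        return "/hbase/splitting/" + region;
    }

    public void startSplit(String region) {
        znodes.put(splitZnode(region), "SPLITTING"); // authoritative state
        splittingRegions.add(region);                // duplicated local state
    }

    public void finishSplitInZkOnly(String region) {
        // If only the znode is cleaned up, the local copy is now stale.
        znodes.remove(splitZnode(region));
    }

    public boolean isSplittingPerLocalList(String region) {
        return splittingRegions.contains(region);
    }

    public boolean isSplittingPerZk(String region) {
        return znodes.containsKey(splitZnode(region));
    }
}
```

In this model, a split that finishes only in zk leaves isSplittingPerLocalList returning a stale true while isSplittingPerZk reflects reality: having one authoritative store is the reason to "check there" instead of maintaining a second collection.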
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228487#comment-13228487 ] stack commented on HBASE-5572: -- Above sounds good. Would suggest we retain the test that verifies that we expire the znode if same host and port. That behavior can be useful in the single-master case.
If we were not ignoring the result, we would return true, so we would not stop the master, so the test would fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
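The walkthrough above describes a recursive retry whose boolean result is thrown away. A minimal Java sketch of that shape (hypothetical, self-contained code, not the actual ActiveMasterManager; the method names are illustrative) shows why the buggy variant reports failure even when the retry succeeded, and how the single-loop shape the synthesis proposes avoids it:

```java
// Hypothetical sketch, not HBase code: a retry that deletes a stale node and
// recurses, but drops the recursive call's result on the floor.
public class BlockingRetrySketch {

    // Buggy shape from the report: the second (recursive) attempt succeeds and
    // returns true, but the first call ignores it and falls through to false.
    static boolean becomeActiveBuggy(boolean staleNodeIsOurs, int depth) {
        if (depth > 0) {
            return true; // retry succeeded: we became the active master
        }
        if (staleNodeIsOurs) {
            becomeActiveBuggy(staleNodeIsOurs, depth + 1); // result ignored!
        }
        return false; // caller concludes failure and aborts the master
    }

    // Fixed shape: one loop, no recursion, and the outcome of the retry is
    // what the method actually returns.
    static boolean becomeActiveFixed(boolean staleNodeIsOurs) {
        for (int attempt = 0; attempt < 2; attempt++) {
            if (attempt > 0) {
                return true; // stale node was deleted; second attempt succeeds
            }
            if (!staleNodeIsOurs) {
                return false; // someone else's node: we really cannot proceed
            }
            // delete the stale node here, then loop and retry
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(becomeActiveBuggy(true, 0)); // false, despite success
        System.out.println(becomeActiveFixed(true));    // true
    }
}
```

In the buggy shape the caller chain (abort -> abortNow -> tryRecoveringExpiredZKSession -> becomeActiveMaster) sees false and stops the master even though the retry became active, which is exactly the behavior the test asserts.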
[jira] [Assigned] (HBASE-5570) Compression tool section is referring to wrong link in HBase Book.
[ https://issues.apache.org/jira/browse/HBASE-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack reassigned HBASE-5570: Assignee: Doug Meil Ok if I assign this to you Mr. Doug? Compression tool section is referring to wrong link in HBase Book. -- Key: HBASE-5570 URL: https://issues.apache.org/jira/browse/HBASE-5570 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0 Reporter: Laxman Assignee: Doug Meil Priority: Trivial Labels: documentaion http://hbase.apache.org/book/ops_mgt.html#compression.tool The above section refers to itself (recursively) in the HBase book. This needs to be corrected.
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228488#comment-13228488 ] Lars Hofhansl commented on HBASE-5399: -- Only testMultiRowMutationMultiThreads is failing, which I added recently. I now think the test always had this problem. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch The link is often considered an issue, for various reasons. One of them is that there is a limit on the number of connections that ZK can manage. Stack was suggesting as well to remove the link to master from HConnection. There are choices to be made considering the existing API (that we don't want to break). The first patches I will submit on hadoop-qa should not be committed: they are here to show the progress on the direction taken. ZooKeeper is used for:
- public getter, to let the client do whatever they want, and close ZooKeeper when closing the connection = we have to deprecate this but keep it.
- read the master address to create a master = now done with a temporary zookeeper connection
- read root location = now done with a temporary zookeeper connection, but questionable. Used in public function locateRegion. To be reworked.
- read cluster id = now done once with a temporary zookeeper connection.
- check if the base node is available = now done once with a zookeeper connection given as a parameter
- isTableDisabled/isTableAvailable = public functions, now done with a temporary zookeeper connection. Called internally from HBaseAdmin and HTable
- getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads = now done with a temporary zookeeper connection
Master is used for:
- getMaster: public getter, as for ZooKeeper = we have to deprecate this but keep it.
- isMasterRunning(): public function, used internally by HMerge and HBaseAdmin
- getHTableDescriptor*: public functions offering access to the master = we could make them use a temporary master connection as well.
Main points are:
- hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be slower, because establishing a tcp connection is slow.
- having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client.
- if we move the table descriptor part away from the client, we need to find a new place for it.
- we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic, however.
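Several of the bullets above replace a long-lived client connection with a "temporary zookeeper connection". A sketch of that pattern (the ZkClient class below is a hypothetical stand-in, not the real org.apache.zookeeper API): open a short-lived session, read one value, close it, so the ensemble never accumulates idle client sessions:

```java
// Hypothetical illustration of the temporary-connection pattern; ZkClient is
// a fake stand-in for a ZooKeeper client, used only to show the lifecycle.
public class TemporaryConnectionSketch {

    static class ZkClient implements AutoCloseable {
        static int openConnections = 0; // what a real ensemble has to track
        ZkClient() { openConnections++; }
        String read(String znode) { return "value-of:" + znode; }
        @Override public void close() { openConnections--; }
    }

    // One short-lived session per read; try-with-resources guarantees release
    // even if the read throws.
    static String readOnce(String znode) {
        try (ZkClient zk = new ZkClient()) {
            return zk.read(znode);
        }
    }

    public static void main(String[] args) {
        System.out.println(readOnce("/hbase/master"));
        System.out.println(ZkClient.openConnections); // nothing left open
    }
}
```

The trade-off noted in the description applies here too: every read pays a TCP connection setup, so a non-connected client is slower per operation even though it holds no standing resources.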
[jira] [Commented] (HBASE-5570) Compression tool section is referring to wrong link in HBase Book.
[ https://issues.apache.org/jira/browse/HBASE-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228491#comment-13228491 ] Doug Meil commented on HBASE-5570: -- that's fine Compression tool section is referring to wrong link in HBase Book. -- Key: HBASE-5570 URL: https://issues.apache.org/jira/browse/HBASE-5570 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0 Reporter: Laxman Assignee: Doug Meil Priority: Trivial Labels: documentaion
[jira] [Created] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
Replace client ZooKeeper watchers by simple ZooKeeper reads --- Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Some code in the package needs to read data in ZK. This could be done by a simple read, but is actually implemented with a watcher. This holds ZK resources. Fixing this could also be an opportunity to remove the need for the client to provide the master address and port.
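The resource cost the issue describes can be sketched as follows (FakeZk is a hypothetical stand-in, not the org.apache.zookeeper.ZooKeeper API): a watch leaves server-side state behind until it fires, while a plain one-shot read holds nothing after it returns:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of why a one-shot read is cheaper than a watch: a watch
// registers server-side state that lives until it fires; a plain read does not.
public class WatchVsReadSketch {

    static class FakeZk {
        final Map<String, String> data = new HashMap<>();
        int registeredWatches = 0; // server-side resource the issue wants to avoid

        String read(String znode) {            // simple read: no lingering state
            return data.get(znode);
        }

        String readAndWatch(String znode) {    // watcher read: state retained
            registeredWatches++;
            return data.get(znode);
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        zk.data.put("/hbase/master", "host:60000");
        zk.read("/hbase/master");                 // client learns the address...
        System.out.println(zk.registeredWatches); // ...and the ensemble holds 0 watches
        zk.readAndWatch("/hbase/master");
        System.out.println(zk.registeredWatches); // the watcher variant leaves 1 behind
    }
}
```

With one watch per client per znode, the retained state grows with the client population, which is the scaling concern behind replacing watchers with simple reads.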
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228494#comment-13228494 ] Lars Hofhansl commented on HBASE-5270: -- Stack, should I apply to 0.94? Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.2 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK: Reviewing 0.92v17: isDeadServerInProgress is a new public method in ServerManager but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. The method param names are not right: 'definitiveRootServer'; what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it's carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta. I think I've asked for this a few times - onlineServers needs to be explained... either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know but am unsure. God love the poor noob that comes awandering this code trying to make sense of it all. It looks like we get the list by trawling zk for regionserver znodes that have not checked in.
Don't we do this operation earlier in master setup? Are we doing it again here? Though distributed log splitting is configured, we will do single-process splitting in the master under some conditions with this patch. It's not explained in the code why we would do this. Why do we think master log splitting is 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know if concurrent distributed log splitting and master splitting work? Why would we have dead servers in progress here in master startup? Because a ServerShutdownHandler fired? This patch is different to the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x and a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied.
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228496#comment-13228496 ] stack commented on HBASE-5270: -- @Lars Yeah. Let me try this trunk patch. It's good for a master joining a cluster w/ concurrent crashing regionservers. Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.2 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228498#comment-13228498 ] Lars Hofhansl commented on HBASE-5569: -- I'll run this in a loop on my work machine (8 core + hyperthreading), which should increase the likelihood of this happening. Will then avoid the parallel flushing and see if that fixes the problem. I think the test always had this problem. On the other hand, I do think this indicates a problem with scanning. This is suspicious, and the code producing this was also added relatively recently: {quote} Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen {quote} TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Priority: Minor What I have pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see an assertion failure in the log I see this before it: {quote} 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0 {quote} The order of the Put and Delete is sometimes reversed. The test threads should always see exactly one KV; if the 'before' was the Put the threads see 0 KVs, and if the 'before' was the Delete the threads see 2 KVs. This debug message comes from StoreScanner's checkReseek. It seems we still have some consistency issue with scanning sometimes :(
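The "exactly one KV" expectation follows from HBase's delete-masking rule: a DeleteColumn at timestamp T hides every Put with timestamp <= T. A small sketch of that rule (a hypothetical helper, not the actual StoreScanner logic), using the timestamps from the log line above:

```java
import java.util.List;

// Hypothetical sketch of the DeleteColumn masking rule as it applies to the
// report: a DeleteColumn at timestamp T hides every Put with ts <= T, so a
// Put strictly newer than the delete survives.
public class DeleteMaskSketch {

    static long visiblePuts(List<Long> putTimestamps, long deleteColumnTs) {
        // Only Puts strictly newer than the delete remain visible.
        return putTimestamps.stream().filter(ts -> ts > deleteColumnTs).count();
    }

    public static void main(String[] args) {
        // Timestamps from the log: Put@75366, DeleteColumn@75203.
        System.out.println(visiblePuts(List.of(75366L), 75203L)); // 1 KV visible
        // If the scanner's view flipped so the delete covered the Put, the
        // same row would show 0 KVs -- one of the failure modes described.
        System.out.println(visiblePuts(List.of(75366L), 75366L)); // 0
    }
}
```

The 2-KV failure mode corresponds to the opposite error: the delete is scanned but applied to nothing, so both the masked and unmasked versions are returned.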
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228499#comment-13228499 ] stack commented on HBASE-5270: -- The trunk patch went into 0.94. I committed it. So, just need patch for 0.92.2 now. Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.2 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228501#comment-13228501 ] stack commented on HBASE-5573: -- Hurray! Replace client ZooKeeper watchers by simple ZooKeeper reads --- Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228526#comment-13228526 ] ramkrishna.s.vasudevan commented on HBASE-5270: --- So this one went in. Thanks Stack, Ted and mainly Chunhui for pursuing on this. The 0.90 patch that Chunhui gave in HBASE-5179 is running in our test clusters. Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.2 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Attachment: 5572.v2.patch KeeperException.SessionExpiredException management could be improved in Master -- Key: HBASE-5572 URL: https://issues.apache.org/jira/browse/HBASE-5572 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5572.v1.patch, 5572.v2.patch
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Status: Open (was: Patch Available) KeeperException.SessionExpiredException management could be improved in Master -- Key: HBASE-5572 URL: https://issues.apache.org/jira/browse/HBASE-5572 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5572.v1.patch, 5572.v2.patch
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Status: Patch Available (was: Open) KeeperException.SessionExpiredException management could be improved in Master -- Key: HBASE-5572 URL: https://issues.apache.org/jira/browse/HBASE-5572 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5572.v1.patch, 5572.v2.patch Synthesis: 1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this. 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call. 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above. Changes are: 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion 1) Removing TestMasterZKSessionRecovery Detailed justification: testMasterZKSessionRecoveryFailure says: {noformat} /** * Negative test of master recovery from zk session expiry. * * Starts with one master. Fakes the master zk session expired. * Ensures the master cannot recover the expired zk session since * the master zk node is still there. */ public void testMasterZKSessionRecoveryFailure() throws Exception { MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster(); HMaster m = cluster.getMaster(); m.abort(Test recovery from zk session expired, new KeeperException.SessionExpiredException()); assertTrue(m.isStopped()); } {noformat} This tests works, i.e. the assertion is always verified. But do we really want this behavior? 
When looking at the code, what's happening is strange: - HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master. - HMaster#abortNow checks the exception type. As it's a SessionExpiredException, it will try to recover by calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stop the master). - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeper connection and then tries to become the active master. If it cannot, it will return false (and that will make HMaster#abort stop the master). - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it from becoming the active master. - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same host and port, it deletes the node, which triggers a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. But this result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even though we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master. In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there.", but we're actually checking just for this and then deleting the node. If we were not ignoring the result, we would return true, so we would not stop the master, and the test would fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
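The control flow described above can be reduced to a small sketch (hypothetical, simplified names; not the actual HBase classes): the buggy shape makes the recursive retry, which succeeds, and then discards its result, while the fixed shape uses a single loop with one exit, as the patch proposes.

```java
// Minimal model of the bug (hypothetical names, not the real HBase code).
class ActiveMasterModel {
    private boolean staleNodePresent = true;  // our old master znode still exists

    // Buggy shape: the recursive retry succeeds, but its result is discarded,
    // so the caller concludes we failed to become the active master.
    public boolean blockBuggy() {
        if (staleNodePresent) {
            staleNodePresent = false;  // delete the stale znode
            blockBuggy();              // retry succeeds and returns true...
            return false;              // ...but we ignore it and report failure
        }
        return true;                   // became the active master
    }

    // Fixed shape: a single loop, no recursion, result never discarded.
    public boolean blockFixed() {
        while (staleNodePresent) {
            staleNodePresent = false;  // delete the stale znode and retry
        }
        return true;
    }
}
```

In this model `blockBuggy()` returns false even though the retry became the active master, which is exactly the behavior the justification above describes.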
[jira] [Created] (HBASE-5574) DEFAULT_MAX_FILE_SIZE defaults to a negative value
DEFAULT_MAX_FILE_SIZE defaults to a negative value -- Key: HBASE-5574 URL: https://issues.apache.org/jira/browse/HBASE-5574 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Michael Drzal Assignee: Michael Drzal HBASE-4365 changed the value of DEFAULT_MAX_FILE_SIZE from 256MB to 10G. Here is the line of code: public static final long DEFAULT_MAX_FILE_SIZE = 10 * 1024 * 1024 * 1024; The problem is that Java evaluates the constant expression as an int, which wraps, and the wrapped value gets assigned to the long. I verified this with a test. The quick fix is to change the last operand to 1024L.
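The wrap-around is easy to reproduce; this sketch (class name is illustrative) shows the 32-bit overflow and the `L`-suffix fix:

```java
// Java evaluates 10 * 1024 * 1024 * 1024 entirely in 32-bit int arithmetic
// (10737418240 mod 2^32 = 2^31, i.e. Integer.MIN_VALUE) and only then
// widens the already-wrapped result to long.
class MaxFileSizeDemo {
    static final long BUGGY = 10 * 1024 * 1024 * 1024;   // -2147483648
    // One long operand (1024L) promotes the whole expression to 64 bits.
    static final long FIXED = 10 * 1024 * 1024 * 1024L;  // 10737418240

    public static void main(String[] args) {
        System.out.println(BUGGY + " vs " + FIXED);
    }
}
```

Note the overflow is silent: the compiler accepts the wrapping constant expression without any warning, which is why the bug survived review.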
[jira] [Updated] (HBASE-5574) DEFAULT_MAX_FILE_SIZE defaults to a negative value
[ https://issues.apache.org/jira/browse/HBASE-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Drzal updated HBASE-5574: - Attachment: HBASE-5574.patch Changes constant evaluation to a long.
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228546#comment-13228546 ] Lars Hofhansl commented on HBASE-5569: -- I cannot make this test fail locally, it seems. Running in a loop for an hour now (the test takes ~12s on my machine). TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Priority: Minor What I pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see an assertion failure in the log I see this before: {quote} 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6, and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0 {quote} The order of the Put and Delete is sometimes reversed. The test threads should always see exactly one KV: if the 'before' was the Put, the threads see 0 KVs; if the 'before' was the Delete, the threads see 2 KVs. This debug message comes from StoreScanner#checkReseek. It seems we still have some consistency issue with scanning sometimes :(
[jira] [Commented] (HBASE-5574) DEFAULT_MAX_FILE_SIZE defaults to a negative value
[ https://issues.apache.org/jira/browse/HBASE-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228548#comment-13228548 ] stack commented on HBASE-5574: -- +1 Thanks Michael
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228553#comment-13228553 ] Lars Hofhansl commented on HBASE-5569: -- Are my assumptions about scanning wrong here? The test works as follows: A bunch of threads alternate putting a column on RowA and deleting that column on RowB in a transaction (next time the delete is on RowA and the put on RowB). Then they each scan starting with RowA and expect to always see exactly one KV (either the column in RowA or the one in RowB). So this relies on a scan providing an atomic view over the two rows (which I think should work if both RowA and RowB are rolled forward with the same MVCC writepoint).
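The invariant the test relies on can be sketched with a toy model (hypothetical, not the HBase API): when the multi-row mutation and the scan each see one consistent view, a scan over the two rows always counts exactly one value.

```java
// Toy model of the test's invariant: the column lives in exactly one of
// RowA/RowB at any time, and a transaction moves it atomically.
class TwoRowModel {
    private boolean valueInRowA = true;

    // "Multi-row mutation": atomically delete from one row, put in the other.
    public synchronized void swap() { valueInRowA = !valueInRowA; }

    // "Scan" both rows against one consistent snapshot: always exactly 1 KV.
    public synchronized int scanBothRows() {
        int kvsInRowA = valueInRowA ? 1 : 0;
        int kvsInRowB = valueInRowA ? 0 : 1;
        return kvsInRowA + kvsInRowB;
    }
}
```

Seeing 0 or 2 KVs, as in the failures above, corresponds to the scanner observing the two rows at different points, i.e. the Delete and the Put becoming visible separately rather than under the same MVCC writepoint.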
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228554#comment-13228554 ] Lars Hofhansl commented on HBASE-5569: -- Ok... Failed locally once now as well.
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228558#comment-13228558 ] Hudson commented on HBASE-5179: --- Integrated in HBase-0.94 #30 (See [https://builds.apache.org/job/HBase-0.94/30/]) HBASE-5179 Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler (Revision 1300222) Result = SUCCESS stack : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/HMaster.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/MasterServices.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/CreateTableHandler.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestCatalogJanitor.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.2 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v16.patch, 5179-90v17.txt, 5179-90v18.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 
5179-92v17.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v17.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may appear: 1. The master completes splitLogAfterStartup(). 2. RegionserverA restarts, and ServerShutdownHandler is processing it. 3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server. 4. The master starts to assign the regions of RegionserverA because it is a dead server by step 3. However, when doing step 4 (assigning regions), ServerShutdownHandler may still be splitting the log; therefore, it may cause data loss.
[jira] [Commented] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228563#comment-13228563 ] Phabricator commented on HBASE-5542: sc has commented on the revision HBASE-5542 [jira] Unify HRegion.mutateRowsWithLocks() and HRegion.processRow(). I think I should use a thread pool to at least cache the threads for the time-bound case. I will make a quick change and update this. REVISION DETAIL https://reviews.facebook.net/D2217 Unify HRegion.mutateRowsWithLocks() and HRegion.processRow() Key: HBASE-5542 URL: https://issues.apache.org/jira/browse/HBASE-5542 Project: HBase Issue Type: Improvement Reporter: Scott Chen Assignee: Scott Chen Fix For: 0.96.0 Attachments: HBASE-5542.D2217.1.patch, HBASE-5542.D2217.10.patch, HBASE-5542.D2217.2.patch, HBASE-5542.D2217.3.patch, HBASE-5542.D2217.4.patch, HBASE-5542.D2217.5.patch, HBASE-5542.D2217.6.patch, HBASE-5542.D2217.7.patch, HBASE-5542.D2217.8.patch, HBASE-5542.D2217.9.patch mutateRowsWithLocks() does atomic mutations on multiple rows. processRow() does atomic read-modify-writes on a single row. It will be useful to generalize both and have a processRowsWithLocks() that does atomic read-modify-writes on multiple rows. This also helps reduce some redundancy in the code.
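A sketch of the "thread pool to cache the threads for the time-bound case" idea mentioned above (hypothetical helper, not the actual patch): submit the row-processing step to a reusable cached pool and bound the wait, instead of spawning a fresh thread per call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper showing the time-bounded execution pattern.
class TimeBoundedRunner {
    // Cached pool: idle worker threads are reused across calls instead of
    // being created and destroyed for every time-bounded operation.
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public <T> T runWithTimeout(Callable<T> task, long timeoutMs) throws Exception {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);  // interrupt the worker if we gave up waiting
            throw e;
        }
    }
}
```

The caller still gets a hard time bound via `Future.get(timeout)`, but the per-call thread-creation cost is amortized by the cached pool.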
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Attachment: 5572.v2.patch
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Fix Version/s: 0.96.0 Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228587#comment-13228587 ] nkeywal commented on HBASE-5399: @Stack; I think it fails 20% of the time. I run it alone, i.e. with -Dtest=TestAtomicOperation#testMultiRowMutationMultiThreads with nothing else running on the machine, and a mvn clean. No clue on when it started to happen. @Lars: I'm not sure I haven't seen failures on testRowMutationMultiThreads as well, I will launch a few tests to see if it happens. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch The link is often considered as an issue, for various reasons. One of them being that there is a limit on the number of connection that ZK can manage. Stack was suggesting as well to remove the link to master from HConnection. There are choices to be made considering the existing API (that we don't want to break). The first patches I will submit on hadoop-qa should not be committed: they are here to show the progress on the direction taken. ZooKeeper is used for: - public getter, to let the client do whatever he wants, and close ZooKeeper when closing the connection = we have to deprecate this but keep it. 
- read the master address to create a master connection = now done with a temporary zookeeper connection - read the root location = now done with a temporary zookeeper connection, but questionable. Used in the public function locateRegion. To be reworked. - read the cluster id = now done once with a temporary zookeeper connection. - check if the base node is available = now done once with a zookeeper connection given as a parameter - isTableDisabled/isTableAvailable = public functions, now done with a temporary zookeeper connection. Called internally from HBaseAdmin and HTable - getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads = now done with a temporary zookeeper connection - Master is used for: - getMaster: public getter, as for ZooKeeper = we have to deprecate this but keep it. - isMasterRunning(): public function, used internally by HMerge and HBaseAdmin - getHTableDescriptor*: public functions offering access to the master = we could make them use a temporary master connection as well. Main points are: - the hbase class for ZooKeeper, ZooKeeperWatcher, is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be slower, because it needs a tcp connection, and establishing a tcp connection is slow. - having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client - if we move the table descriptor part away from the client, we need to find a new place for it. - we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic however. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
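The "temporary zookeeper connection" items in the list above all follow one pattern, sketched here with a hypothetical interface (not the real HConnection/ZooKeeperWatcher API): open a short-lived connection, read one value, and close it, rather than holding a watcher open for the client's lifetime.

```java
// Hypothetical reader interface standing in for a ZooKeeper client.
class TemporaryConnectionDemo {
    interface ZkReader extends AutoCloseable {
        String read(String path) throws Exception;
    }

    // Open, read once, close: the connection lives only for this call, so
    // no long-lived TCP connection is held per client.
    static String readOnce(java.util.concurrent.Callable<ZkReader> connect, String path)
            throws Exception {
        try (ZkReader zk = connect.call()) {
            return zk.read(path);
        }
    }
}
```

The trade-off the comment names is visible in the shape: every call pays the TCP-connection setup cost, which is why a non-connected client is inherently slower.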
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228592#comment-13228592 ] stack commented on HBASE-5569: -- hbase-5121 does mess w/ scanners... Seems like a real issue though, what hbase-5121 is trying to solve. Pity it's so hard verifying this started the failures, else we could back it out for now. Should we back it out anyway and see if we get failures over the next few days?
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228598#comment-13228598 ] stack commented on HBASE-5572: -- +1 on patch. Waiting on hadoopqa... KeeperException.SessionExpiredException management could be improved in Master -- Key: HBASE-5572 URL: https://issues.apache.org/jira/browse/HBASE-5572 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch Synthesis: 1) TestMasterZKSessionRecovery distinguishes two cases on SessionExpiredException. One is explicitly not managed; however, it seems that there is no reason for this. 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function with a useless recursive call. 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above. Changes are: 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove the recursion 1) Removing TestMasterZKSessionRecovery Detailed justification: testMasterZKSessionRecoveryFailure says: {noformat} /** * Negative test of master recovery from zk session expiry. * * Starts with one master. Fakes the master zk session expired. * Ensures the master cannot recover the expired zk session since * the master zk node is still there. */ public void testMasterZKSessionRecoveryFailure() throws Exception { MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster(); HMaster m = cluster.getMaster(); m.abort("Test recovery from zk session expired", new KeeperException.SessionExpiredException()); assertTrue(m.isStopped()); } {noformat} This test works, i.e. the assertion is always verified. But do we really want this behavior?
When looking at the code, we see that what's happening is strange:
- HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master.
- HMaster#abortNow checks the exception type. As it's a SessionExpiredException, it will try to recover by calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it returns false (and that makes HMaster#abort stop the master).
- HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeper connection and then tries to become the active master. If it cannot, it returns false (and that makes HMaster#abort stop the master).
- HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster, which says it will return false if there is any error preventing it from becoming the active master.
- ActiveMasterManager#blockUntilBecomingActiveMaster reads the master address from ZK. If it's our own host and port, it deletes the node, which triggers a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. But this result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even though we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master.

In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there", but we're actually checking just for this case and deleting the node. If we were not ignoring the result, we would return true, so we would not stop the master, and the test would fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
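The ignored-result bug in the walkthrough above can be sketched in a few lines. This is a hypothetical simplification, not the real ActiveMasterManager: a plain Map stands in for the ZK ensemble, and the class/znode names only mirror the ones the issue discusses.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the control flow described in the issue: on finding a
// stale node with our own address, we delete it and recurse, but the recursive
// call's (successful) result is discarded.
class ActiveMasterManagerSketch {
    private final String myAddress;
    private final Map<String, String> zk; // stands in for the ZK ensemble

    ActiveMasterManagerSketch(String myAddress, Map<String, String> zk) {
        this.myAddress = myAddress;
        this.zk = zk;
    }

    boolean blockUntilBecomingActiveMaster() {
        if (!zk.containsKey("/hbase/master")) {
            zk.put("/hbase/master", myAddress); // we become the active master
            return true;
        }
        if (zk.get("/hbase/master").equals(myAddress)) {
            // Stale node left by our expired session: delete it and retry.
            zk.remove("/hbase/master");
            blockUntilBecomingActiveMaster(); // succeeds and returns true...
            return false; // ...but the result is discarded: caller sees failure
        }
        return false; // another master legitimately owns the node
    }
}
```

Running this with a stale node owned by our own address returns false even though the node ends up re-created by us, which is exactly why HMaster#abort stops a master that had in fact recovered.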
[jira] [Resolved] (HBASE-5574) DEFAULT_MAX_FILE_SIZE defaults to a negative value
[ https://issues.apache.org/jira/browse/HBASE-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-5574. -- Resolution: Fixed Fix Version/s: 0.94.0 Hadoop Flags: Reviewed Applied to trunk and the 0.94 branch. DEFAULT_MAX_FILE_SIZE defaults to a negative value -- Key: HBASE-5574 URL: https://issues.apache.org/jira/browse/HBASE-5574 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Michael Drzal Assignee: Michael Drzal Fix For: 0.94.0 Attachments: HBASE-5574.patch HBASE-4365 changed the value of DEFAULT_MAX_FILE_SIZE from 256MB to 10G. Here is the line of code: public static final long DEFAULT_MAX_FILE_SIZE = 10 * 1024 * 1024 * 1024; The problem is that Java evaluates the right-hand side in int arithmetic, so the product wraps around before it gets assigned to the long. I verified this with a test. The quick fix is to change the last operand to 1024L. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
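The overflow is easy to reproduce in isolation. The constants below mirror the expression quoted in the issue; everything else is illustrative scaffolding:

```java
// BROKEN is the buggy form from the issue; FIXED is the one-character fix.
class MaxFileSizeOverflow {
    // All operands are int, so the product is computed in 32 bits and wraps
    // to Integer.MIN_VALUE before being widened to long.
    static final long BROKEN = 10 * 1024 * 1024 * 1024;
    // Making one operand long forces 64-bit arithmetic: 10 GB, as intended.
    static final long FIXED = 10 * 1024 * 1024 * 1024L;

    public static void main(String[] args) {
        System.out.println(BROKEN); // -2147483648
        System.out.println(FIXED);  // 10737418240
    }
}
```

10 * 1024^3 = 10737418240, which modulo 2^32 lands exactly on Integer.MIN_VALUE, hence the "negative value" in the issue title.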
[jira] [Updated] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5206: -- Attachment: (was: 5206_trunk-v2.patch) Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk-v2.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5206: -- Attachment: 5206_trunk-v2.patch Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk-v2.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228634#comment-13228634 ] Hadoop QA commented on HBASE-5206: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518221/5206_trunk-v2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 15 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1174//console This message is automatically generated. Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk-v2.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5206: -- Attachment: (was: 5206_trunk-v2.patch) Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5206: -- Attachment: 5206_trunk-v3.patch Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk-v3.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5206: -- Comment: was deleted (was: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518221/5206_trunk-v2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 15 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1174//console This message is automatically generated.) Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk-v3.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5574) DEFAULT_MAX_FILE_SIZE defaults to a negative value
[ https://issues.apache.org/jira/browse/HBASE-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228642#comment-13228642 ] Hudson commented on HBASE-5574: --- Integrated in HBase-0.94 #31 (See [https://builds.apache.org/job/HBase-0.94/31/]) HBASE-5574 DEFAULT_MAX_FILE_SIZE defaults to a negative value (Revision 1300289) Result = SUCCESS stack : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java DEFAULT_MAX_FILE_SIZE defaults to a negative value -- Key: HBASE-5574 URL: https://issues.apache.org/jira/browse/HBASE-5574 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Michael Drzal Assignee: Michael Drzal Fix For: 0.94.0 Attachments: HBASE-5574.patch HBASE-4365 changed the value of DEFAULT_MAX_FILE_SIZE from 256MB to 10G. Here is the line of code: public static final long DEFAULT_MAX_FILE_SIZE = 10 * 1024 * 1024 * 1024; The problem is that Java evaluates the right-hand side in int arithmetic, so the product wraps around before it gets assigned to the long. I verified this with a test. The quick fix is to change the last operand to 1024L. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228662#comment-13228662 ] nkeywal commented on HBASE-5399: @Lars, Stack: After 50 tries, on trunk (fbd4bebd5cca129f49e91ec9936f604998a7025a) + 5572 I got it: testRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation): expected:0 but was:5 at org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testRowMutationMultiThreads(TestAtomicOperation.java:331) So the probability for testRowMutationMultiThreads is about 10 times lower than for testMultiRowMutationMultiThreads, but it can occur as well. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch The link is often considered an issue, for various reasons, one of them being that there is a limit on the number of connections that ZK can manage. Stack suggested as well removing the link to the master from HConnection. There are choices to be made considering the existing API (which we don't want to break). The first patches I will submit to hadoop-qa should not be committed: they are here to show the progress in the direction taken.
ZooKeeper is used for:
- public getter, to let the client do whatever it wants, and close ZooKeeper when closing the connection => we have to deprecate this but keep it.
- read the master address to create a master => now done with a temporary ZooKeeper connection
- read the root location => now done with a temporary ZooKeeper connection, but questionable. Used in the public function locateRegion. To be reworked.
- read the cluster id => now done once with a temporary ZooKeeper connection.
- check if the base node is available => now done once with a ZooKeeper connection given as a parameter
- isTableDisabled/isTableAvailable => public functions, now done with a temporary ZooKeeper connection. Called internally from HBaseAdmin and HTable
- getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads => now done with a temporary ZooKeeper connection

The Master is used for:
- getMaster: public getter, as for ZooKeeper => we have to deprecate this but keep it.
- isMasterRunning(): public function, used internally by HMerge and HBaseAdmin
- getHTableDescriptor*: public functions offering access to the master => we could make them use a temporary master connection as well.

Main points are:
- the hbase class for ZooKeeper, ZooKeeperWatcher, is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be really slower, because it's a TCP connection, and establishing a TCP connection is slow.
- having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client.
- if we move the table descriptor part away from the client, we need to find a new place for it.
- we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection.
That would make the whole system less deterministic however. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
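The "temporary connection" pattern the list above keeps referring to can be sketched as follows. The interface and znode path here are hypothetical stand-ins, not the real HBase or ZooKeeper API; the point is only the lifecycle: connect, read one value, close, rather than holding a session open for the client's lifetime.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical short-lived connection: opened per operation, closed right after.
interface ShortLivedZk extends AutoCloseable {
    byte[] read(String znode);
    @Override void close(); // narrowed: this sketch's close() throws nothing
}

class ClusterIdReader {
    private final Supplier<ShortLivedZk> connector;

    ClusterIdReader(Supplier<ShortLivedZk> connector) {
        this.connector = connector;
    }

    String readClusterId() {
        // The TCP connection lives only for this one read; as the description
        // notes, paying the connection-setup cost each time is the trade-off.
        try (ShortLivedZk zk = connector.get()) {
            return new String(zk.read("/hbase/hbaseid"), StandardCharsets.UTF_8);
        }
    }
}
```

This is why the issue notes a non-connected client "will always be really slower": every such read pays TCP connection establishment, in exchange for not consuming one of ZK's limited long-lived connections per client.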
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228665#comment-13228665 ] Lars Hofhansl commented on HBASE-5569: -- I can try to back out HBASE-5121 and see if I can still get this to fail. I do think my assumptions about scanning were wrong, though. HBASE-5229 is still valid (in that it makes a bunch of operations across multiple rows either all fail or all succeed); it's just that there is currently no way to get a consistent scan over *multiple* rows when flushing is involved (which is OK, because the scanner contract never guaranteed that). If that is the case I should disable the test. TestAtomicOperation.testRowMutationMultiThreads basically does the same thing, only within the same row; I have never seen that one fail. TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Priority: Minor What I pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see an assertion failure in the log I see this before: {quote} 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0 {quote} The order of the Put and the Delete is sometimes reversed. The test threads should always see exactly one KV: if the "before" was the Put the threads see 0 KVs; if the "before" was the Delete the threads see 2 KVs. This debug message comes from StoreScanner#checkReseek. It seems we still have some consistency issues with scanning sometimes :( -- This message is automatically generated by JIRA. 
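To see why the debug message above signals trouble, consider the ordering of the two cells it names. For the same row/family/qualifier, cells sort by descending timestamp, so the Put at ts=75366 should always precede the DeleteColumn at ts=75203 in a stable view; a peek() that jumps from one to the other mid-scan means the store's cell list changed underneath the scanner. The comparator below is a simplified illustration (it ignores column family and the type tie-break the real KeyValue comparator also applies):

```java
// Simplified cell key: same-row/same-qualifier cells order by ts descending.
class CellKey implements Comparable<CellKey> {
    final String row;
    final String qualifier;
    final long ts;
    final String type;

    CellKey(String row, String qualifier, long ts, String type) {
        this.row = row;
        this.qualifier = qualifier;
        this.ts = ts;
        this.type = type;
    }

    @Override
    public int compareTo(CellKey o) {
        int c = row.compareTo(o.row);
        if (c != 0) return c;
        c = qualifier.compareTo(o.qualifier);
        if (c != 0) return c;
        return Long.compare(o.ts, ts); // descending: newer cells sort first
    }
}
```

Sorting the two cells from the log with this comparator always puts the newer Put first, which is the "before" the test expects; the reversed order in the failing runs is the symptom, not a legal ordering.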
[jira] [Commented] (HBASE-5206) Port HBASE-5155 to 0.92 and TRUNK
[ https://issues.apache.org/jira/browse/HBASE-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228675#comment-13228675 ] Hadoop QA commented on HBASE-5206: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518229/5206_trunk-v3.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 15 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 161 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestImportTsv Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1175//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1175//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1175//console This message is automatically generated. Port HBASE-5155 to 0.92 and TRUNK - Key: HBASE-5206 URL: https://issues.apache.org/jira/browse/HBASE-5206 Project: HBase Issue Type: Bug Affects Versions: 0.92.2, 0.96.0 Reporter: Zhihong Yu Attachments: 5206_92_1.patch, 5206_92_latest_1.patch, 5206_trunk-v3.patch, 5206_trunk_1.patch, 5206_trunk_latest_1.patch This JIRA ports HBASE-5155 (ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted) to 0.92 and TRUNK -- This message is automatically generated by JIRA. 
[jira] [Commented] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228676#comment-13228676 ] Hadoop QA commented on HBASE-5542: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518234/HBASE-5542.D2217.11.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1176//console This message is automatically generated. Unify HRegion.mutateRowsWithLocks() and HRegion.processRow() Key: HBASE-5542 URL: https://issues.apache.org/jira/browse/HBASE-5542 Project: HBase Issue Type: Improvement Reporter: Scott Chen Assignee: Scott Chen Fix For: 0.96.0 Attachments: HBASE-5542.D2217.1.patch, HBASE-5542.D2217.10.patch, HBASE-5542.D2217.11.patch, HBASE-5542.D2217.2.patch, HBASE-5542.D2217.3.patch, HBASE-5542.D2217.4.patch, HBASE-5542.D2217.5.patch, HBASE-5542.D2217.6.patch, HBASE-5542.D2217.7.patch, HBASE-5542.D2217.8.patch, HBASE-5542.D2217.9.patch mutateRowsWithLocks() does atomic mutations on multiple rows. processRow() does atomic read-modify-writes on a single row. It will be useful to generalize both and have a processRowsWithLocks() that does atomic read-modify-writes on multiple rows. This also helps reduce some redundancy in the code. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
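The generalization the issue proposes can be sketched as follows. This is a hypothetical toy, not HRegion's actual implementation: lock a sorted set of rows, run one atomic read-modify-write over all of them, then unlock, so that both multi-row mutation and single-row read-modify-write become special cases of one method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a processRowsWithLocks()-style primitive.
class RegionSketch {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();

    interface RowProcessor {
        void process(Map<String, byte[]> view); // runs while all row locks are held
    }

    void processRowsWithLocks(SortedSet<String> rows, RowProcessor p) {
        List<ReentrantLock> held = new ArrayList<>();
        try {
            // Acquire in sorted row order so concurrent callers cannot deadlock.
            for (String row : rows) {
                ReentrantLock l = rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
                l.lock();
                held.add(l);
            }
            // Both mutateRowsWithLocks() and processRow() reduce to this call:
            // a mutation-only processor, or a single-row read-modify-write.
            p.process(data);
        } finally {
            for (ReentrantLock l : held) {
                l.unlock();
            }
        }
    }
}
```

Passing a one-row set recovers processRow()'s behavior, and a processor that only writes recovers mutateRowsWithLocks(), which is the redundancy the issue wants to fold together.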
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Attachment: 5572.v2.patch KeeperException.SessionExpiredException management could be improved in Master -- Key: HBASE-5572 URL: https://issues.apache.org/jira/browse/HBASE-5572 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch Synthesis: 1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this. 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call. 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above. Changes are: 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion 1) Removing TestMasterZKSessionRecovery Detailed justification: testMasterZKSessionRecoveryFailure says: {noformat} /** * Negative test of master recovery from zk session expiry. * * Starts with one master. Fakes the master zk session expired. * Ensures the master cannot recover the expired zk session since * the master zk node is still there. */ public void testMasterZKSessionRecoveryFailure() throws Exception { MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster(); HMaster m = cluster.getMaster(); m.abort(Test recovery from zk session expired, new KeeperException.SessionExpiredException()); assertTrue(m.isStopped()); } {noformat} This tests works, i.e. the assertion is always verified. But do we really want this behavior? 
When looking at the code, we see that what's happening is strange:
- HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master.
- HMaster#abortNow checks the exception type. As it's a SessionExpiredException, it will try to recover by calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it returns false (and that makes HMaster#abort stop the master).
- HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeper connection and then tries to become the active master. If it cannot, it returns false (and that makes HMaster#abort stop the master).
- HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster, which says it will return false if there is any error preventing it from becoming the active master.
- ActiveMasterManager#blockUntilBecomingActiveMaster reads the master address from ZK. If it's our own host and port, it deletes the node, which triggers a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. But this result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even though we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master.

In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there", but we're actually checking just for this case and deleting the node. If we were not ignoring the result, we would return true, so we would not stop the master, and the test would fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5572: --- Status: Open (was: Patch Available) KeeperException.SessionExpiredException management could be improved in Master -- Key: HBASE-5572 URL: https://issues.apache.org/jira/browse/HBASE-5572 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch Synthesis: 1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this. 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call. 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above. Changes are: 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion 1) Removing TestMasterZKSessionRecovery Detailed justification: testMasterZKSessionRecoveryFailure says: {noformat} /** * Negative test of master recovery from zk session expiry. * * Starts with one master. Fakes the master zk session expired. * Ensures the master cannot recover the expired zk session since * the master zk node is still there. */ public void testMasterZKSessionRecoveryFailure() throws Exception { MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster(); HMaster m = cluster.getMaster(); m.abort(Test recovery from zk session expired, new KeeperException.SessionExpiredException()); assertTrue(m.isStopped()); } {noformat} This tests works, i.e. the assertion is always verified. But do we really want this behavior? 
When looking at the code, we see that what's happening is strange:
- HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master.
- HMaster#abortNow checks the exception type. As it's a SessionExpiredException, it will try to recover by calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it returns false (and that makes HMaster#abort stop the master).
- HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeper connection and then tries to become the active master. If it cannot, it returns false (and that makes HMaster#abort stop the master).
- HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster, which says it will return false if there is any error preventing it from becoming the active master.
- ActiveMasterManager#blockUntilBecomingActiveMaster reads the master address from ZK. If it's our own host and port, it deletes the node, which triggers a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. But this result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even though we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master.

In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there", but we're actually checking just for this case and deleting the node. If we were not ignoring the result, we would return true, so we would not stop the master, and the test would fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228689#comment-13228689 ] nkeywal commented on HBASE-5572: for an unknown reason the first two patches didn't make it to hadoop-qa. Rewriting once again.
[jira] [Updated] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-5542: --- Attachment: HBASE-5542.D2217.12.patch sc updated the revision "HBASE-5542 [jira] Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()". Reviewers: tedyu, lhofhansl, JIRA Rebase against trunk REVISION DETAIL https://reviews.facebook.net/D2217 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRowProcessorEndpoint.java src/main/java/org/apache/hadoop/hbase/coprocessor/RowProcessorProtocol.java src/main/java/org/apache/hadoop/hbase/regionserver/BaseRowProcessor.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/main/java/org/apache/hadoop/hbase/regionserver/MultiRowMutationProcessor.java src/main/java/org/apache/hadoop/hbase/regionserver/RowProcessor.java src/main/java/org/apache/hadoop/hbase/coprocessor/RowProcessor.java src/test/java/org/apache/hadoop/hbase/coprocessor/TestProcessRowEndpoint.java src/test/java/org/apache/hadoop/hbase/coprocessor/TestRowProcessorEndpoint.java Unify HRegion.mutateRowsWithLocks() and HRegion.processRow() Key: HBASE-5542 URL: https://issues.apache.org/jira/browse/HBASE-5542 Project: HBase Issue Type: Improvement Reporter: Scott Chen Assignee: Scott Chen Fix For: 0.96.0 Attachments: HBASE-5542.D2217.1.patch, HBASE-5542.D2217.10.patch, HBASE-5542.D2217.11.patch, HBASE-5542.D2217.12.patch, HBASE-5542.D2217.2.patch, HBASE-5542.D2217.3.patch, HBASE-5542.D2217.4.patch, HBASE-5542.D2217.5.patch, HBASE-5542.D2217.6.patch, HBASE-5542.D2217.7.patch, HBASE-5542.D2217.8.patch, HBASE-5542.D2217.9.patch mutateRowsWithLocks() does atomic mutations on multiple rows. processRow() does atomic read-modify-writes on a single row. It will be useful to generalize both and have a processRowsWithLocks() that does atomic read-modify-writes on multiple rows. This also helps reduce some redundancy in the code.
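The unification idea in the description above can be sketched without any HBase dependencies: one generic entry point that atomically applies a read-modify-write across a set of rows, with both old methods becoming thin wrappers. All names and the toy Map-based "table" below are illustrative, not the real HRegion API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Toy model of the proposed generalization: processRowsWithLocks() is the
// single atomic primitive; the two existing methods reduce to special cases.
public class RowOps {
    final Map<String, Long> table = new HashMap<>();

    // Generalized primitive: atomically read-modify-write every listed row
    // (here, "locking" is modeled by synchronizing on the whole table).
    synchronized void processRowsWithLocks(Collection<String> rows,
                                           UnaryOperator<Long> rmw) {
        for (String row : rows) {
            table.put(row, rmw.apply(table.getOrDefault(row, 0L)));
        }
    }

    // Old single-row processRow() becomes a one-row call.
    void processRow(String row, UnaryOperator<Long> rmw) {
        processRowsWithLocks(Arrays.asList(row), rmw);
    }

    // Old mutateRowsWithLocks() (blind multi-row writes) is the special case
    // where the "read" half of the read-modify-write is ignored.
    void mutateRowsWithLocks(Collection<String> rows, long value) {
        processRowsWithLocks(rows, old -> value);
    }
}
```

This is why the patch can delete the redundant code paths: both operations share one locking and atomicity story.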
[jira] [Commented] (HBASE-5335) Dynamic Schema Configurations
[ https://issues.apache.org/jira/browse/HBASE-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228732#comment-13228732 ] Phabricator commented on HBASE-5335: nspiegelberg has commented on the revision "[jira] [HBASE-5335] Dynamic Schema Config". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:736 okay. maybe getValues since that's the variable name? src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java:571 yeah, I think that HTD/HCD need a lot of unification work beyond this. src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:938 I'll write more comments; there's a nasty issue here: when you split, you take the 'conf' from the parent region and pass it into the daughter region's constructor. If you passed in the CompoundConfiguration, you would end up using both the HTD of the parent region and the new HTD of the daughter region. You really need to pass the base Configuration object used by HRegionServer to the daughter regions to avoid a tricky dedupe problem. REVISION DETAIL https://reviews.facebook.net/D2247 Dynamic Schema Configurations - Key: HBASE-5335 URL: https://issues.apache.org/jira/browse/HBASE-5335 Project: HBase Issue Type: New Feature Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Labels: configuration, schema Attachments: D2247.1.patch Currently, the ability for a core developer to add per-table or per-CF configuration settings is very heavyweight. You need to add a reserved keyword all the way up the stack, and you have to support this variable long-term if you're going to expose it explicitly to the user. This has ended up with Configuration.get() being used a lot, because it is lightweight and you can tweak settings while you're trying to understand system behavior [since there are many config params that may never need to be tuned]. We need to add the ability to put/read arbitrary KV settings in the HBase schema.
Combined with online schema change, this will allow us to safely iterate on configuration settings.
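The lookup order this feature implies — a per-column-family KV overrides a per-table KV, which overrides the base Configuration default — can be sketched as follows. This mirrors the CompoundConfiguration idea discussed in the review, but the class and field names below are stand-ins, not the actual HBase types.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of layered config resolution: most specific scope wins.
public class CompoundConfigSketch {
    final Map<String, String> base = new HashMap<>();     // hbase-site.xml level
    final Map<String, String> tableKVs = new HashMap<>(); // HTableDescriptor values
    final Map<String, String> cfKVs = new HashMap<>();    // HColumnDescriptor values

    String get(String key, String defaultValue) {
        if (cfKVs.containsKey(key)) return cfKVs.get(key);       // per-CF wins
        if (tableKVs.containsKey(key)) return tableKVs.get(key); // then per-table
        return base.getOrDefault(key, defaultValue);             // then base conf
    }
}
```

The split bug noted in the inline comments follows from this layering: if the daughter region is constructed from the parent's already-compounded view instead of the base Configuration, the parent's table-level values leak into the daughter's resolution chain.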
[jira] [Updated] (HBASE-5364) Fix source files missing licenses in 0.92 and trunk
[ https://issues.apache.org/jira/browse/HBASE-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-5364: -- Fix Version/s: 0.90.6 I marked this as fixed for 0.90.6 but I'm not changing the title since it's all over the CHANGES.txt files. Fix source files missing licenses in 0.92 and trunk --- Key: HBASE-5364 URL: https://issues.apache.org/jira/browse/HBASE-5364 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jonathan Hsieh Assignee: Elliott Clark Priority: Blocker Fix For: 0.90.6, 0.92.1, 0.94.0 Attachments: HBASE-5364-1.patch, hbase-5364-0.90.patch, hbase-5364-0.92.patch, hbase-5364-v2.patch Running 'mvn rat:check' shows that a few files have snuck in that do not have proper Apache licenses. Ideally we should fix these before we cut another release/release candidate. This is a blocker for 0.94, and probably should be for the other branches as well.
[jira] [Created] (HBASE-5575) Configure Arcanist lint engine for HBase
Configure Arcanist lint engine for HBase Key: HBASE-5575 URL: https://issues.apache.org/jira/browse/HBASE-5575 Project: HBase Issue Type: Improvement Reporter: Mikhail Bautin Assignee: Mikhail Bautin We need to enable the Arcanist lint engine in HBase, so that a commit can be checked by running 'arc lint'.
[jira] [Created] (HBASE-5576) Configure Arcanist lint engine for HBase
Configure Arcanist lint engine for HBase Key: HBASE-5576 URL: https://issues.apache.org/jira/browse/HBASE-5576 Project: HBase Issue Type: Improvement Reporter: Mikhail Bautin Assignee: Mikhail Bautin We need to be able to use arc lint to check a patch for code style errors before submission.
[jira] [Resolved] (HBASE-5576) Configure Arcanist lint engine for HBase
[ https://issues.apache.org/jira/browse/HBASE-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Bautin resolved HBASE-5576. --- Resolution: Duplicate See HBASE-5575. For some reason new JIRAs are not immediately visible, so I ended up creating a duplicate.
[jira] [Commented] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228743#comment-13228743 ] Phabricator commented on HBASE-5542: tedyu has commented on the revision "HBASE-5542 [jira] Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:4417 Exception includes IOE. Should we check against IOE so that we don't wrap again? REVISION DETAIL https://reviews.facebook.net/D2217
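The rethrow pattern tedyu is suggesting — check for IOException first so an existing IOE propagates as-is rather than being wrapped in another IOE — can be sketched like this. The helper name is hypothetical, not the actual HRegion code.

```java
import java.io.IOException;

// Sketch of the "don't double-wrap" idiom for a catch block that sees a
// broad Exception but must surface an IOException to its caller.
public class RethrowSketch {
    static IOException normalize(Exception e) {
        if (e instanceof IOException) {
            return (IOException) e;   // already an IOE: propagate unchanged
        }
        return new IOException(e);    // anything else: wrap once
    }
}
```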
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228753#comment-13228753 ] Lars Hofhansl commented on HBASE-5399: -- Arrgghhh... That's not good! Do you see the above message in the logs in that case? Do you still have the log output? (Feel free to send a zip via email or attach here). Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch The link is often considered an issue, for various reasons. One of them is that there is a limit on the number of connections that ZK can manage. Stack also suggested removing the link to the master from HConnection. There are choices to be made considering the existing API (which we don't want to break). The first patches I will submit to hadoop-qa should not be committed: they are here to show the progress on the direction taken. ZooKeeper is used for:
- public getter, to let the client do whatever he wants, and close ZooKeeper when closing the connection => we have to deprecate this but keep it.
- read the master address to create a master => now done with a temporary zookeeper connection
- read the root location => now done with a temporary zookeeper connection, but questionable. Used in the public function locateRegion. To be reworked.
- read the cluster id => now done once with a temporary zookeeper connection.
- check if the base znode is available => now done once with a zookeeper connection given as a parameter
- isTableDisabled/isTableAvailable => public functions, now done with a temporary zookeeper connection. Called internally from HBaseAdmin and HTable
- getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads => now done with a temporary zookeeper connection
Master is used for:
- getMaster: public getter, as for ZooKeeper => we have to deprecate this but keep it.
- isMasterRunning(): public function, used internally by HMerge and HBaseAdmin
- getHTableDescriptor*: public functions offering access to the master => we could make them use a temporary master connection as well.
Main points are:
- the hbase class for ZooKeeper, ZooKeeperWatcher, is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be slower, because establishing a TCP connection is slow.
- having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client.
- if we move the table descriptor part away from the client, we need to find a new place for it.
- we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic, however.
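The "temporary zookeeper connection" approach used for one-shot reads like the cluster id can be sketched with a short-lived session that is opened, read, and closed, instead of a watcher kept alive for the client's lifetime. ZkSession below is a hypothetical stand-in type, not a real ZooKeeper or HBase API.

```java
// Sketch of the short-lived connection pattern: no session survives the read.
public class TempConnSketch {
    static class ZkSession implements AutoCloseable {
        static int openSessions = 0;                 // for demonstration only
        ZkSession() { openSessions++; }              // "connect"
        String read(String znode) { return "value-of-" + znode; }
        @Override public void close() { openSessions--; }  // "disconnect"
    }

    // e.g. reading the cluster id once, leaving no lingering ZK session
    static String readClusterId() {
        try (ZkSession zk = new ZkSession()) {
            return zk.read("/hbase/hbaseid");
        }
    }
}
```

The trade-off named in the main points is visible here: every such call pays a TCP connection setup, which is why a non-connected client is inherently slower.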
[jira] [Commented] (HBASE-5575) Configure Arcanist lint engine for HBase
[ https://issues.apache.org/jira/browse/HBASE-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228761#comment-13228761 ] Nicolas Spiegelberg commented on HBASE-5575: +1. This should make every patch writer's job easier, and committers can run 'arc lint' before checkin instead of having to focus on style issues during peer review.
[jira] [Updated] (HBASE-5559) --presplit option creates a first split with rowkey-end=0
[ https://issues.apache.org/jira/browse/HBASE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujee Maniyam updated HBASE-5559: - Attachment: HBASE-5559-v2.patch revised patch --presplit option creates a first split with rowkey-end=0 - Key: HBASE-5559 URL: https://issues.apache.org/jira/browse/HBASE-5559 Project: HBase Issue Type: Bug Components: util Reporter: Sujee Maniyam Assignee: Sujee Maniyam Priority: Trivial Labels: benchmark Attachments: 5559_v1.patch, HBASE-5559-v2.patch HBASE-4440 adds a 'presplit' option to the PerformanceEvaluation utility. When the splits are generated, the first split has row-end-key=0 (zero); hence this split doesn't get any data. For example, if the total keyspace is 100 and the splits requested are 5, generated splits = [0, 20, 40, 60, 80]; it should be [20, 40, 60, 80, 100].
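The off-by-one in the report can be sketched without any HBase dependencies; the method names below are illustrative, not the actual PerformanceEvaluation code. The buggy version starts numbering boundaries at 0, so the first region's end key is 0 and it can never receive data.

```java
// Dependency-free sketch of the split-point calculation described above.
public class PresplitSketch {
    // Buggy shape: boundaries i * size for i = 0..n-1 -> leading 0.
    static long[] buggySplits(long totalRows, int numSplits) {
        long[] s = new long[numSplits];
        long size = totalRows / numSplits;
        for (int i = 0; i < numSplits; i++) s[i] = i * size;
        return s;                                   // [0, 20, 40, 60, 80]
    }

    // Fixed shape: boundaries (i + 1) * size -> first real boundary onward.
    static long[] fixedSplits(long totalRows, int numSplits) {
        long[] s = new long[numSplits];
        long size = totalRows / numSplits;
        for (int i = 0; i < numSplits; i++) s[i] = (i + 1) * size;
        return s;                                   // [20, 40, 60, 80, 100]
    }
}
```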
[jira] [Updated] (HBASE-5559) --presplit option creates a first split with rowkey-end=0
[ https://issues.apache.org/jira/browse/HBASE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujee Maniyam updated HBASE-5559: - Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5559) --presplit option creates a first split with rowkey-end=0
[ https://issues.apache.org/jira/browse/HBASE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujee Maniyam updated HBASE-5559: - Status: Patch Available (was: Open) HBASE-5559-v2.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228798#comment-13228798 ] Lars Hofhansl commented on HBASE-5399: -- Yeah. Let's move the discussion to HBASE-5569. Thanks for doing this Nicolas (pardon me if I misspelled your name). You don't have to, though. Your change here is almost certainly not causing this problem.
[jira] [Updated] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5569: --- Attachment: TestAtomicOperation-output.trunk_120313.rar testRowMutationMultiThreads logs, on trunk as of today. It failed after 200 iterations.
{noformat}
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 9.007 sec  FAILURE!
testRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation)  Time elapsed: 8.651 sec  FAILURE!
junit.framework.AssertionFailedError: expected:<0> but was:<8>
	at junit.framework.Assert.fail(Assert.java:50)
	at junit.framework.Assert.failNotEquals(Assert.java:287)
	at junit.framework.Assert.assertEquals(Assert.java:67)
	at junit.framework.Assert.assertEquals(Assert.java:199)
	at junit.framework.Assert.assertEquals(Assert.java:205)
	at org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testRowMutationMultiThreads(TestAtomicOperation.java:331)
{noformat}
TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Priority: Minor Attachments: TestAtomicOperation-output.trunk_120313.rar What I have pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see an assertion failure in the log I see this before it:
{quote}
2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6, and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0
{quote}
The order of the Put and Delete is sometimes reversed. The test threads should always see exactly one KV: if the "before" was the Put, the threads see 0 KVs; if the "before" was the Delete, the threads see 2 KVs. This debug message comes from StoreScanner's checkReseek.
It seems we still have some consistency issues with scanning sometimes :(
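The final "expected 0" assertion comes from a common multi-threaded test structure: worker threads verify an invariant (here, "a scan sees exactly one KV") and record violations in a shared counter, and the test asserts the counter is zero at the end. The sketch below shows only that skeleton with a stand-in invariant check, not the real scanner code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Dependency-free skeleton of the failure-counting test pattern.
public class FailureCountSketch {
    static int run(int nThreads) {
        AtomicInteger failures = new AtomicInteger();
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            ts[i] = new Thread(() -> {
                int kvsSeen = 1;            // stand-in for scanning the row
                if (kvsSeen != 1) {         // the invariant under test
                    failures.incrementAndGet();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return failures.get();              // the test asserts this is 0
    }
}
```

In the failing run above, 8 threads observed either 0 or 2 KVs instead of 1, so the final count was 8 rather than 0.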
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228816#comment-13228816 ] nkeywal commented on HBASE-5399: It's ok, we're all in the same boat :-) I've got the test running on a 2 weeks old version of the trunk, I will have the result tomorrow.
[jira] [Commented] (HBASE-5559) --presplit option creates a first split with rowkey-end=0
[ https://issues.apache.org/jira/browse/HBASE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228818#comment-13228818 ] Hadoop QA commented on HBASE-5559: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518265/HBASE-5559-v2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 161 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1178//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1178//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1178//console This message is automatically generated. --presplit option creates a first split with rowkey-end=0 - Key: HBASE-5559 URL: https://issues.apache.org/jira/browse/HBASE-5559 Project: HBase Issue Type: Bug Components: util Reporter: Sujee Maniyam Assignee: Sujee Maniyam Priority: Trivial Labels: benchmark Attachments: 5559_v1.patch, HBASE-5559-v2.patch HBASE-4440 adds a 'presplit' option to PerformanceEvaluation utility. when the splits are generated, the first split has row-end-key=0 (zero). Hence this split doesn't get any data. 
For example, if the total keyspace is 100 and 5 splits are requested, the generated splits are [0, 20, 40, 60, 80]; they should be [20, 40, 60, 80, 100]. 
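The off-by-one described above can be sketched as follows (illustrative only, not PerformanceEvaluation's actual code): emitting `i * size` as the i-th boundary makes the first split end at 0, so it can never receive a row; `(i + 1) * size` yields the expected boundaries.

```java
// Sketch of split end-key generation for a keyspace of totalKeys rows.
// The buggy variant used ends[i] = i * size, producing [0, 20, 40, 60, 80];
// using (i + 1) * size produces [20, 40, 60, 80, 100] as the report expects.
class PresplitSketch {
    static int[] endKeys(int totalKeys, int numSplits) {
        int size = totalKeys / numSplits;
        int[] ends = new int[numSplits];
        for (int i = 0; i < numSplits; i++) {
            ends[i] = (i + 1) * size;
        }
        return ends;
    }
}
```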
[jira] [Updated] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-4608: - Attachment: 4608v23.txt Renamed method enableCompression in all places to be setCompressionContext Made all instances of compression contexts have the same name rather than a new name every time used. Cleaned up unused 'compression' data member flags or moved them local from being data members when only used by a single method. Removed define of TRUE and repeat of ENABLE_WAL_COMPRESSION key from SequenceFileLogReader. No longer needed. Rather than have the sequencefile metadata-making code sprinkled over the reader and writer, do it all in the writer and have the reader use the writer's methods. Added a global WAL type as metadata. Added a compression type to metadata. Renamed method WALCompressionEnabled as isWALCompressionEnabled. Added some small tests to TestLRUDictionary and a new TestCompressor that taught me how this stuff works. Added documentation to methods where I was surprised; e.g. addEntry will happily add a new entry even though it already has a dictionary entry. Miscellaneous cleanup. I ran this compression on one of our production logs and it halved its size. See below. I then decompressed and then recompressed and I got the same size back. {code} -rwxrwxrwx 1 stack staff 28540761 Mar 13 16:47 sv4r25s8%3A60020.1331661889339.out.out.out -rwxrwxrwx 1 stack staff 64945799 Mar 13 16:45 sv4r25s8%3A60020.1331661889339.out.out -rwxrwxrwx 1 stack staff 28540761 Mar 13 16:44 sv4r25s8%3A60020.1331661889339.out -rw-r--r-- 1 stack staff 64928728 Mar 13 16:25 sv4r25s8%3A60020.1331661889339 {code} Will run more of our production logs through the compressor this evening to see if I can turn up bugs. 
HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: stack Fix For: 0.94.0 Attachments: 4608-v19.txt, 4608-v20.txt, 4608-v22.txt, 4608v1.txt, 4608v13.txt, 4608v13.txt, 4608v14.txt, 4608v15.txt, 4608v16.txt, 4608v17.txt, 4608v18.txt, 4608v23.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 4608v8fixed.txt The current bottleneck to HBase write speed is replicating the WAL appends across different datanodes. We can speed up this process by compressing the HLog. Current plan involves using a dictionary to compress table name, region id, cf name, and possibly other bits of repeated data. Also, HLog format may be changed in other ways to produce a smaller HLog.
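The dictionary idea in the description can be illustrated with a toy encoder (this is not HBase's real LRUDictionary, just the principle): the first occurrence of a repeated field such as a table or column-family name is written as a literal and remembered; every repeat is replaced by a small index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy dictionary encoder: repeated strings shrink to index tokens.
class DictSketch {
    private final Map<String, Integer> index = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    String encode(String s) {
        Integer i = index.get(s);
        if (i != null) return "IDX:" + i;   // repeat: cheap index token
        index.put(s, entries.size());
        entries.add(s);
        return "LIT:" + s;                  // first occurrence: full literal
    }
}
```

Since every WAL entry repeats the same table/region/cf names, almost all occurrences collapse to indexes, which is consistent with the roughly 2x shrinkage reported on the production log above.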
[jira] [Commented] (HBASE-5128) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.
[ https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228837#comment-13228837 ] Jonathan Hsieh commented on HBASE-5128: --- @Lars I believe the ports to 0.94.0 and 0.92.x are likely identical and nearly trivial, and I was intending on doing it. The initial version was also for 0.90.x and a version for that will be ported as well since my crew will be supporting that version for a while. I may try to do a 0.90.x release at some point. [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online. - Key: HBASE-5128 URL: https://issues.apache.org/jira/browse/HBASE-5128 Project: HBase Issue Type: New Feature Components: hbck Affects Versions: 0.90.5, 0.92.0, 0.94.0, 0.96.0 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Attachments: hbase-5128-trunk.patch The current (0.90.5, 0.92.0rc2) versions of hbck detect most region consistency and table integrity invariant violations. However with '-fix' it can only automatically repair region consistency cases having to do with deployment problems. This updated version should be able to handle all cases (including a new orphan regiondir case). When complete it will likely deprecate the OfflineMetaRepair tool and subsume several open META-hole related issues. Here's the approach (from the comment at the top of the new version of the file). {code} /** * HBaseFsck (hbck) is a tool for checking and repairing region consistency and * table integrity. * * Region consistency checks verify that META, region deployment on * region servers and the state of data in HDFS (.regioninfo files) all are in * accordance. * * Table integrity checks verify that all possible row keys can resolve to * exactly one region of a table. This means there are no individual degenerate * or backwards regions; no holes between regions; and that there are no overlapping * regions. 
* * The general repair strategy works in these steps. * 1) Repair Table Integrity on HDFS. (merge or fabricate regions) * 2) Repair Region Consistency with META and assignments * * For table integrity repairs, the tables' region directories are scanned * for .regioninfo files. Each table's integrity is then verified. If there * are any orphan regions (regions with no .regioninfo files), or holes, new * regions are fabricated. Backwards regions are sidelined, as are empty * degenerate (endkey==startkey) regions. If there are any overlapping regions, * a new region is created and all data is merged into the new region. * * Table integrity repairs deal solely with HDFS and can be done offline -- the * hbase region servers or master do not need to be running. These phases can be * used to completely reconstruct the META table in an offline fashion. * * Region consistency requires three conditions -- 1) valid .regioninfo file * present in an hdfs region dir, 2) valid row with .regioninfo data in META, * and 3) a region is deployed only at the regionserver that it was assigned to. * * Region consistency requires hbck to contact the HBase master and region * servers, so connect() must first be called successfully. Much of the * region consistency information is transient and less risky to repair. */ {code}
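The integrity invariant the javadoc states -- every row key resolves to exactly one region -- can be sketched as a linear scan over regions sorted by start key: each end key must equal the next start key; anything less is a hole, anything more is an overlap, and endKey == startKey is a degenerate region. (Illustrative only; hbck's real checks also handle empty boundary keys and byte[] comparisons.)

```java
import java.util.ArrayList;
import java.util.List;

// Toy table-integrity checker over regions given as {startKey, endKey} pairs,
// already sorted by start key.
class IntegritySketch {
    static List<String> check(String[][] regions) {
        List<String> problems = new ArrayList<>();
        for (String[] r : regions) {
            if (r[0].equals(r[1])) problems.add("degenerate region at " + r[0]);
        }
        for (int i = 1; i < regions.length; i++) {
            int cmp = regions[i - 1][1].compareTo(regions[i][0]);
            if (cmp < 0) problems.add("hole before " + regions[i][0]);      // gap in keyspace
            else if (cmp > 0) problems.add("overlap at " + regions[i][0]);  // double coverage
        }
        return problems;
    }
}
```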
[jira] [Commented] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228842#comment-13228842 ] Zhihong Yu commented on HBASE-4608: --- In isWALCompressionEnabled(): {code} +if (txt == null || Integer.parseInt(txt.toString()) != VERSION) return false; {code} What would happen when we have a newer version for WAL_VERSION_KEY? Looks like the following check should suffice for isWALCompressionEnabled(): {code} +txt = metadata.get(WAL_COMPRESSION_TYPE_KEY); +return txt != null && txt.equals(DICTIONARY_COMPRESSION_TYPE); {code} HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: stack Fix For: 0.94.0 Attachments: 4608-v19.txt, 4608-v20.txt, 4608-v22.txt, 4608v1.txt, 4608v13.txt, 4608v13.txt, 4608v14.txt, 4608v15.txt, 4608v16.txt, 4608v17.txt, 4608v18.txt, 4608v23.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 4608v8fixed.txt The current bottleneck to HBase write speed is replicating the WAL appends across different datanodes. We can speed up this process by compressing the HLog. Current plan involves using a dictionary to compress table name, region id, cf name, and possibly other bits of repeated data. Also, HLog format may be changed in other ways to produce a smaller HLog.
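A hedged sketch of the check Zhihong proposes: skip the version comparison entirely and treat compression as enabled iff the metadata records the dictionary compression type. The key and value strings below stand in for WAL_COMPRESSION_TYPE_KEY and DICTIONARY_COMPRESSION_TYPE, whose actual contents the thread does not show.

```java
import java.util.Map;

// Sketch of the simplified isWALCompressionEnabled() check suggested above.
class WalMetadataSketch {
    static final String WAL_COMPRESSION_TYPE_KEY = "compression.type"; // assumed value
    static final String DICTIONARY_COMPRESSION_TYPE = "dictionary";    // assumed value

    static boolean isWALCompressionEnabled(Map<String, String> metadata) {
        String txt = metadata.get(WAL_COMPRESSION_TYPE_KEY);
        return txt != null && txt.equals(DICTIONARY_COMPRESSION_TYPE);
    }
}
```

The advantage of this form is the one Zhihong raises: a WAL written by a newer version that still uses dictionary compression would not be misread as uncompressed.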
[jira] [Commented] (HBASE-5563) HRegionInfo#compareTo add the comparison of regionId
[ https://issues.apache.org/jira/browse/HBASE-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228846#comment-13228846 ] Jonathan Hsieh commented on HBASE-5563: --- @chunhui Check out the review patch in HBASE-5128 (it's big but in there) -- there is a fix for TestRegionObserverInterface in that patch which can probably go over here. We need to look into the other failures as well (the one I fixed was a Medium test -- the others are likely Large tests that aren't run until small and medium pass) HRegionInfo#compareTo add the comparison of regionId Key: HBASE-5563 URL: https://issues.apache.org/jira/browse/HBASE-5563 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Attachments: HBASE-5563.patch, HBASE-5563v2.patch, HBASE-5563v2.patch In the one-region-multi-assigned case, we could find that two regions have the same table name, same startKey, same endKey, and different regionId, so these two regions are the same in a TreeMap but different in a HashMap.
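The TreeMap/HashMap inconsistency described above falls out whenever compareTo() ignores a field that equals()/hashCode() include. A minimal illustration with a toy stand-in for HRegionInfo (RegionKey and its fields are illustrative, not HBase code):

```java
import java.util.Objects;

// compareTo() ignores regionId while equals()/hashCode() include it, so two
// regions differing only in regionId collapse into one TreeMap entry but
// remain two distinct HashMap entries -- the bug this issue fixes.
class RegionKey implements Comparable<RegionKey> {
    final String table, startKey, endKey;
    final long regionId;

    RegionKey(String table, String startKey, String endKey, long regionId) {
        this.table = table; this.startKey = startKey;
        this.endKey = endKey; this.regionId = regionId;
    }

    @Override public int compareTo(RegionKey o) {   // regionId omitted: the bug
        int c = table.compareTo(o.table);
        if (c == 0) c = startKey.compareTo(o.startKey);
        if (c == 0) c = endKey.compareTo(o.endKey);
        return c;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof RegionKey)) return false;
        RegionKey r = (RegionKey) o;
        return table.equals(r.table) && startKey.equals(r.startKey)
            && endKey.equals(r.endKey) && regionId == r.regionId;
    }

    @Override public int hashCode() { return Objects.hash(table, startKey, endKey, regionId); }
}
```

The fix is to make compareTo() fall back to comparing regionId, restoring the consistency-with-equals that sorted collections assume.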
[jira] [Commented] (HBASE-5128) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.
[ https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228848#comment-13228848 ] Jonathan Hsieh commented on HBASE-5128: --- @Zhihong No problem -- I intend to address the reviews. Sorry about the test failures -- these are actually related to HBASE-5563 -- I'll help chunhui there. I've been in 0.92 and 0.90 land and then away for a little bit and didn't realize that a failure in medium skips all the large tests. (I fixed the medium and expected it to pass, but then the large tests ran and failed.) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online. - Key: HBASE-5128 URL: https://issues.apache.org/jira/browse/HBASE-5128 Project: HBase Issue Type: New Feature Components: hbck Affects Versions: 0.90.5, 0.92.0, 0.94.0, 0.96.0 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Attachments: hbase-5128-trunk.patch The current (0.90.5, 0.92.0rc2) versions of hbck detect most region consistency and table integrity invariant violations. However with '-fix' it can only automatically repair region consistency cases having to do with deployment problems. This updated version should be able to handle all cases (including a new orphan regiondir case). When complete it will likely deprecate the OfflineMetaRepair tool and subsume several open META-hole related issues. Here's the approach (from the comment at the top of the new version of the file). {code} /** * HBaseFsck (hbck) is a tool for checking and repairing region consistency and * table integrity. * * Region consistency checks verify that META, region deployment on * region servers and the state of data in HDFS (.regioninfo files) all are in * accordance. * * Table integrity checks verify that all possible row keys can resolve to * exactly one region of a table. 
This means there are no individual degenerate * or backwards regions; no holes between regions; and that there are no overlapping * regions. * * The general repair strategy works in these steps. * 1) Repair Table Integrity on HDFS. (merge or fabricate regions) * 2) Repair Region Consistency with META and assignments * * For table integrity repairs, the tables' region directories are scanned * for .regioninfo files. Each table's integrity is then verified. If there * are any orphan regions (regions with no .regioninfo files), or holes, new * regions are fabricated. Backwards regions are sidelined, as are empty * degenerate (endkey==startkey) regions. If there are any overlapping regions, * a new region is created and all data is merged into the new region. * * Table integrity repairs deal solely with HDFS and can be done offline -- the * hbase region servers or master do not need to be running. These phases can be * used to completely reconstruct the META table in an offline fashion. * * Region consistency requires three conditions -- 1) valid .regioninfo file * present in an hdfs region dir, 2) valid row with .regioninfo data in META, * and 3) a region is deployed only at the regionserver that it was assigned to. * * Region consistency requires hbck to contact the HBase master and region * servers, so connect() must first be called successfully. Much of the * region consistency information is transient and less risky to repair. */ {code}
[jira] [Created] (HBASE-5577) improve 'patch submission' section in HBase book
improve 'patch submission' section in HBase book Key: HBASE-5577 URL: https://issues.apache.org/jira/browse/HBASE-5577 Project: HBase Issue Type: Improvement Components: documentation Reporter: Sujee Maniyam Assignee: Sujee Maniyam Improve patch section in the book http://hbase.apache.org/book/submitting.patches.html
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: HBASE-5270-92v11.patch patchv11 for 0.92 Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.2 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270-92v11.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK: Reviewing 0.92v17 isDeadServerInProgress is a new public method in ServerManager but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for meta version. This method param names are not right 'definitiveRootServer'; what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if its carrying root and meta? What is difference between asking assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta. I think I've asked for this a few times - onlineServers needs to be explained... either in javadoc or in comment. This is the param passed into joinCluster. How does it arise? I think I know but am unsure. God love the poor noob that comes awandering this code trying to make sense of it all. It looks like we get the list by trawling zk for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? 
Are we doing it again here? Though distributed split log is configured, we will do single-process splitting in the master under some conditions with this patch. It's not explained in the code why we would do this. Why do we think master log splitting 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know if concurrent distributed log splitting and master splitting works? Why would we have dead servers in progress here in master startup? Because a servershutdownhandler fired? This patch is different to the patch for 0.90. Should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x and a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied.
[jira] [Commented] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228931#comment-13228931 ] Hadoop QA commented on HBASE-4608: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518270/4608v23.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 161 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.wal.TestLRUDictionary Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1180//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1180//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1180//console This message is automatically generated. HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: stack Fix For: 0.94.0 Attachments: 4608-v19.txt, 4608-v20.txt, 4608-v22.txt, 4608v1.txt, 4608v13.txt, 4608v13.txt, 4608v14.txt, 4608v15.txt, 4608v16.txt, 4608v17.txt, 4608v18.txt, 4608v23.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 4608v8fixed.txt The current bottleneck to HBase write speed is replicating the WAL appends across different datanodes. We can speed up this process by compressing the HLog. Current plan involves using a dictionary to compress table name, region id, cf name, and possibly other bits of repeated data. 
Also, HLog format may be changed in other ways to produce a smaller HLog.
[jira] [Commented] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228933#comment-13228933 ] Phabricator commented on HBASE-5542: sc has commented on the revision HBASE-5542 [jira] Unify HRegion.mutateRowsWithLocks() and HRegion.processRow(). INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:4417 I just checked the javadoc. It seems that it throws TimeoutException, ExecutionException and InterruptedException: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/FutureTask.html#get(long, java.util.concurrent.TimeUnit) But I can see your point. If the exception is wrapped too many times, it will be hard to debug. REVISION DETAIL https://reviews.facebook.net/D2217 Unify HRegion.mutateRowsWithLocks() and HRegion.processRow() Key: HBASE-5542 URL: https://issues.apache.org/jira/browse/HBASE-5542 Project: HBase Issue Type: Improvement Reporter: Scott Chen Assignee: Scott Chen Fix For: 0.96.0 Attachments: HBASE-5542.D2217.1.patch, HBASE-5542.D2217.10.patch, HBASE-5542.D2217.11.patch, HBASE-5542.D2217.12.patch, HBASE-5542.D2217.2.patch, HBASE-5542.D2217.3.patch, HBASE-5542.D2217.4.patch, HBASE-5542.D2217.5.patch, HBASE-5542.D2217.6.patch, HBASE-5542.D2217.7.patch, HBASE-5542.D2217.8.patch, HBASE-5542.D2217.9.patch mutateRowsWithLocks() does atomic mutations on multiple rows. processRow() does atomic read-modify-writes on a single row. It will be useful to generalize both and have a processRowsWithLocks() that does atomic read-modify-writes on multiple rows. This also helps reduce some redundancy in the code.
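The generalization the description asks for -- one entry point that does atomic read-modify-writes over N rows -- can be sketched with a toy region (ToyRegion is a stand-in for HRegion, not its real locking code): acquire the per-row locks in sorted row order so concurrent callers cannot deadlock, run the caller's body over all rows as one atomic step, then release everything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Toy processRowsWithLocks(): subsumes both multi-row mutation and
// single-row read-modify-write (a one-element row set).
class ToyRegion {
    final Map<String, Long> store = new HashMap<>();
    private final Map<String, ReentrantLock> locks = new HashMap<>();

    private synchronized ReentrantLock lockFor(String row) {
        return locks.computeIfAbsent(row, r -> new ReentrantLock());
    }

    void processRowsWithLocks(SortedSet<String> rows, Consumer<Map<String, Long>> body) {
        List<ReentrantLock> held = new ArrayList<>();
        try {
            for (String row : rows) {       // SortedSet => consistent lock order
                ReentrantLock l = lockFor(row);
                l.lock();
                held.add(l);
            }
            body.accept(store);             // atomic with respect to these rows
        } finally {
            for (ReentrantLock l : held) l.unlock();
        }
    }
}
```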
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228936#comment-13228936 ] Laxman commented on HBASE-5564: --- Thanks Stack. Let me give it a try. Bulkload is discarding duplicate records Key: HBASE-5564 URL: https://issues.apache.org/jira/browse/HBASE-5564 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0 Environment: HBase 0.92 Reporter: Laxman Assignee: Laxman Labels: bulkloader Duplicate records are getting discarded when duplicate records exist in the same input file, and more specifically if they exist in the same split. Duplicate records are considered if the records are from different splits. Version under test: HBase 0.92
[jira] [Updated] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5571: Attachment: BASE-5571v2.patch Table will be disabling forever --- Key: HBASE-5571 URL: https://issues.apache.org/jira/browse/HBASE-5571 Project: HBase Issue Type: Bug Components: master, regionserver Reporter: chunhui shen Assignee: chunhui shen Attachments: BASE-5571v2.patch, HBASE-5571.patch If we restart the master when it is disabling one table, the table will be disabling forever. In the current logic, the Region CLOSE RPC will always return NotServingRegionException because the RS has already closed the region before we restart the master. So the table will be disabling forever because the region will be in RIT all along. In another case, AssignmentManager#rebuildUserRegions() will put parent regions into AssignmentManager.regions, so we can't close these parent regions until they are purged by CatalogJanitor if we disable the table.
[jira] [Updated] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-4608: - Attachment: 4608v24.txt Address Ted's comments. HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: stack Fix For: 0.94.0 Attachments: 4608-v19.txt, 4608-v20.txt, 4608-v22.txt, 4608v1.txt, 4608v13.txt, 4608v13.txt, 4608v14.txt, 4608v15.txt, 4608v16.txt, 4608v17.txt, 4608v18.txt, 4608v23.txt, 4608v24.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 4608v8fixed.txt The current bottleneck to HBase write speed is replicating the WAL appends across different datanodes. We can speed up this process by compressing the HLog. Current plan involves using a dictionary to compress table name, region id, cf name, and possibly other bits of repeated data. Also, HLog format may be changed in other ways to produce a smaller HLog.
[jira] [Commented] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228942#comment-13228942 ] chunhui shen commented on HBASE-5571: - Patch v2 checks whether a region is splitting through AssignmentManager#RegionState. Also, I fix another bug where a region is disabling and a split is rolled back: in the current logic, if the parent region is rolled back, it will not be closed if the table is disabling. Table will be disabling forever --- Key: HBASE-5571 URL: https://issues.apache.org/jira/browse/HBASE-5571 Project: HBase Issue Type: Bug Components: master, regionserver Reporter: chunhui shen Assignee: chunhui shen Attachments: BASE-5571v2.patch, HBASE-5571.patch If we restart the master when it is disabling one table, the table will be disabling forever. In the current logic, the Region CLOSE RPC will always return NotServingRegionException because the RS has already closed the region before we restart the master. So the table will be disabling forever because the region will be in RIT all along. In another case, AssignmentManager#rebuildUserRegions() will put parent regions into AssignmentManager.regions, so we can't close these parent regions until they are purged by CatalogJanitor if we disable the table.
[jira] [Commented] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228943#comment-13228943 ] Phabricator commented on HBASE-5542: sc has commented on the revision HBASE-5542 [jira] Unify HRegion.mutateRowsWithLocks() and HRegion.processRow(). @Ted: I think there are some problems with file renaming on reviews.facebook.net. My git patch actually works fine. It applies to trunk. If this patch doesn't apply again, I will manually upload the patch to JIRA. REVISION DETAIL https://reviews.facebook.net/D2217 Unify HRegion.mutateRowsWithLocks() and HRegion.processRow() Key: HBASE-5542 URL: https://issues.apache.org/jira/browse/HBASE-5542 Project: HBase Issue Type: Improvement Reporter: Scott Chen Assignee: Scott Chen Fix For: 0.96.0 Attachments: HBASE-5542.D2217.1.patch, HBASE-5542.D2217.10.patch, HBASE-5542.D2217.11.patch, HBASE-5542.D2217.12.patch, HBASE-5542.D2217.13.patch, HBASE-5542.D2217.2.patch, HBASE-5542.D2217.3.patch, HBASE-5542.D2217.4.patch, HBASE-5542.D2217.5.patch, HBASE-5542.D2217.6.patch, HBASE-5542.D2217.7.patch, HBASE-5542.D2217.8.patch, HBASE-5542.D2217.9.patch mutateRowsWithLocks() does atomic mutations on multiple rows. processRow() does atomic read-modify-writes on a single row. It will be useful to generalize both and have a processRowsWithLocks() that does atomic read-modify-writes on multiple rows. This also helps reduce some redundancy in the code.
[jira] [Updated] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-5542: --- Attachment: HBASE-5542.D2217.13.patch sc updated the revision HBASE-5542 [jira] Unify HRegion.mutateRowsWithLocks() and HRegion.processRow(). Reviewers: tedyu, lhofhansl, JIRA Log the IOE for easier debugging. REVISION DETAIL https://reviews.facebook.net/D2217 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRowProcessorEndpoint.java src/main/java/org/apache/hadoop/hbase/coprocessor/RowProcessorProtocol.java src/main/java/org/apache/hadoop/hbase/regionserver/BaseRowProcessor.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/main/java/org/apache/hadoop/hbase/regionserver/MultiRowMutationProcessor.java src/main/java/org/apache/hadoop/hbase/regionserver/RowProcessor.java src/main/java/org/apache/hadoop/hbase/coprocessor/RowProcessor.java src/test/java/org/apache/hadoop/hbase/coprocessor/TestProcessRowEndpoint.java src/test/java/org/apache/hadoop/hbase/coprocessor/TestRowProcessorEndpoint.java Unify HRegion.mutateRowsWithLocks() and HRegion.processRow() Key: HBASE-5542 URL: https://issues.apache.org/jira/browse/HBASE-5542 Project: HBase Issue Type: Improvement Reporter: Scott Chen Assignee: Scott Chen Fix For: 0.96.0 Attachments: HBASE-5542.D2217.1.patch, HBASE-5542.D2217.10.patch, HBASE-5542.D2217.11.patch, HBASE-5542.D2217.12.patch, HBASE-5542.D2217.13.patch, HBASE-5542.D2217.2.patch, HBASE-5542.D2217.3.patch, HBASE-5542.D2217.4.patch, HBASE-5542.D2217.5.patch, HBASE-5542.D2217.6.patch, HBASE-5542.D2217.7.patch, HBASE-5542.D2217.8.patch, HBASE-5542.D2217.9.patch mutateRowsWithLocks() does atomic mutations on multiple rows. processRow() does atomic read-modify-writes on a single row. It will be useful to generalize both and have a processRowsWithLocks() that does atomic read-modify-writes on multiple rows. This also helps reduce some redundancy in the codes. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
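The proposed processRowsWithLocks() generalization above can be sketched as follows. This is an illustrative stand-alone example of the locking pattern involved, not the actual HBase API: per-row locks are acquired in sorted row order (so concurrent callers cannot deadlock), then a single read-modify-write body runs atomically over all the rows. All class and method names here are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.UnaryOperator;

/** Sketch of a processRowsWithLocks-style helper over a toy key-value store. */
public class MultiRowProcessor {
    private final Map<String, Long> store = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Atomically applies {@code body} as one read-modify-write over {@code rows}. */
    public void processRowsWithLocks(Set<String> rows,
                                     UnaryOperator<Map<String, Long>> body) {
        // Sort the rows so every caller acquires locks in the same order.
        TreeMap<String, ReentrantLock> ordered = new TreeMap<>();
        for (String r : rows) {
            ordered.put(r, locks.computeIfAbsent(r, k -> new ReentrantLock()));
        }
        for (ReentrantLock l : ordered.values()) l.lock();
        try {
            // Read the current values, apply the caller's mutation, write back.
            Map<String, Long> snapshot = new TreeMap<>();
            for (String r : rows) snapshot.put(r, store.getOrDefault(r, 0L));
            store.putAll(body.apply(snapshot));
        } finally {
            // Release in reverse acquisition order.
            for (ReentrantLock l : ordered.descendingMap().values()) l.unlock();
        }
    }

    public long get(String row) {
        return store.getOrDefault(row, 0L);
    }
}
```

With this shape, mutateRowsWithLocks() is the special case where the body ignores the read snapshot, and processRow() is the special case of a single-row set.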
[jira] [Commented] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228946#comment-13228946 ] stack commented on HBASE-4608: -- Here's my compressing, decompressing, compressing again, decompressing again, then recompressing a random log file from our front end: {code}
-rw-r--r--  1 stack  staff  64928728 Mar 13 20:43 sv4r25s8%3A60020.1331661889339
-rwxrwxrwx  1 stack  staff  28540761 Mar 13 20:48 sv4r25s8%3A60020.1331661889339.compressed
-rwxrwxrwx  1 stack  staff  28540761 Mar 13 20:58 sv4r25s8%3A60020.1331661889339.compressed.again
-rwxrwxrwx  1 stack  staff  28540761 Mar 13 21:02 sv4r25s8%3A60020.1331661889339.compressed.again.again
-rwxrwxrwx  1 stack  staff  64945799 Mar 13 20:57 sv4r25s8%3A60020.1331661889339.decompressed
-rwxrwxrwx  1 stack  staff  64945799 Mar 13 21:02 sv4r25s8%3A60020.1331661889339.decompressed.again
{code} It's 44% of the original size. HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: stack Fix For: 0.94.0 Attachments: 4608-v19.txt, 4608-v20.txt, 4608-v22.txt, 4608v1.txt, 4608v13.txt, 4608v13.txt, 4608v14.txt, 4608v15.txt, 4608v16.txt, 4608v17.txt, 4608v18.txt, 4608v23.txt, 4608v24.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 4608v8fixed.txt The current bottleneck to HBase write speed is replicating the WAL appends across different datanodes. We can speed up this process by compressing the HLog. The current plan involves using a dictionary to compress the table name, region id, cf name, and possibly other bits of repeated data. Also, the HLog format may be changed in other ways to produce a smaller HLog. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
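The dictionary scheme described in the issue can be sketched in miniature. This is an illustrative example of the general idea, not the actual HBase WAL codec: the first time a repeated value (table name, region id, cf name) appears it is emitted in full and assigned an index; later occurrences are replaced by that index, and the decoder rebuilds the same dictionary as it reads literals. The token format and class name here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy dictionary coder for repeated WAL entry fields. */
public class WalDictionary {
    private final Map<String, Integer> toIndex = new HashMap<>();
    private final List<String> byIndex = new ArrayList<>();

    /** Returns "idx:N" for an already-seen value; else records it and returns "lit:value". */
    public String encode(String value) {
        Integer idx = toIndex.get(value);
        if (idx != null) {
            return "idx:" + idx;
        }
        toIndex.put(value, byIndex.size());
        byIndex.add(value);
        return "lit:" + value;
    }

    /** Inverse of encode; must see tokens in the same order the encoder produced them. */
    public String decode(String token) {
        if (token.startsWith("idx:")) {
            return byIndex.get(Integer.parseInt(token.substring(4)));
        }
        String value = token.substring(4);
        // The decoder rebuilds the dictionary from literals, so no table is shipped.
        toIndex.put(value, byIndex.size());
        byIndex.add(value);
        return value;
    }
}
```

Since a region server writes the same table and cf names into nearly every edit, long literals collapse to short indices after their first appearance, which is where savings like the 44% figure above come from.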
[jira] [Updated] (HBASE-5542) Unify HRegion.mutateRowsWithLocks() and HRegion.processRow()
[ https://issues.apache.org/jira/browse/HBASE-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-5542: --- Attachment: HBASE-5542.D2217.14.patch sc updated the revision HBASE-5542 [jira] Unify HRegion.mutateRowsWithLocks() and HRegion.processRow(). Reviewers: tedyu, lhofhansl, JIRA Fixed a typo. Sorry. REVISION DETAIL https://reviews.facebook.net/D2217 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRowProcessorEndpoint.java src/main/java/org/apache/hadoop/hbase/coprocessor/RowProcessorProtocol.java src/main/java/org/apache/hadoop/hbase/regionserver/BaseRowProcessor.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/main/java/org/apache/hadoop/hbase/regionserver/MultiRowMutationProcessor.java src/main/java/org/apache/hadoop/hbase/regionserver/RowProcessor.java src/main/java/org/apache/hadoop/hbase/coprocessor/RowProcessor.java src/test/java/org/apache/hadoop/hbase/coprocessor/TestProcessRowEndpoint.java src/test/java/org/apache/hadoop/hbase/coprocessor/TestRowProcessorEndpoint.java Unify HRegion.mutateRowsWithLocks() and HRegion.processRow() Key: HBASE-5542 URL: https://issues.apache.org/jira/browse/HBASE-5542 Project: HBase Issue Type: Improvement Reporter: Scott Chen Assignee: Scott Chen Fix For: 0.96.0 Attachments: HBASE-5542.D2217.1.patch, HBASE-5542.D2217.10.patch, HBASE-5542.D2217.11.patch, HBASE-5542.D2217.12.patch, HBASE-5542.D2217.13.patch, HBASE-5542.D2217.14.patch, HBASE-5542.D2217.2.patch, HBASE-5542.D2217.3.patch, HBASE-5542.D2217.4.patch, HBASE-5542.D2217.5.patch, HBASE-5542.D2217.6.patch, HBASE-5542.D2217.7.patch, HBASE-5542.D2217.8.patch, HBASE-5542.D2217.9.patch mutateRowsWithLocks() does atomic mutations on multiple rows. processRow() does atomic read-modify-writes on a single row. It will be useful to generalize both and have a processRowsWithLocks() that does atomic read-modify-writes on multiple rows. This also helps reduce some redundancy in the code. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira