[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258137#comment-13258137 ] nkeywal commented on HBASE-5844: Patch v1. Tested on a local pseudo-distributed cluster, stopping the server with kill -9. I will make some minor improvements to the Java code and then test on a real cluster, but I would welcome feedback on the script. Delete the region servers znode after a regions server crash Key: HBASE-5844 URL: https://issues.apache.org/jira/browse/HBASE-5844 Project: HBase Issue Type: Improvement Components: regionserver, scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Attachments: 5844.v1.patch Today, if the region server crashes, its znode is not deleted in ZooKeeper, so the recovery process starts only after a timeout, usually 30s. By deleting the znode in the start script, we remove this delay and the recovery starts immediately. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5729) Jenkins build failing; failsafe NPE'ing
[ https://issues.apache.org/jira/browse/HBASE-5729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248156#comment-13248156 ] nkeywal commented on HBASE-5729: For future reference if it happens again, the root exception was:
{noformat}
Caused by: java.lang.NullPointerException
	at org.apache.maven.surefire.util.DefaultDirectoryScanner.processIncludesExcludes(DefaultDirectoryScanner.java:136)
	at org.apache.maven.surefire.util.DefaultDirectoryScanner.collectTests(DefaultDirectoryScanner.java:113)
	at org.apache.maven.surefire.util.DefaultDirectoryScanner.locateTestClasses(DefaultDirectoryScanner.java:62)
	at org.apache.maven.surefire.junitcore.JUnitCoreProvider.scanClassPath(JUnitCoreProvider.java:188)
	at org.apache.maven.surefire.junitcore.JUnitCoreProvider.getSuitesAsList(JUnitCoreProvider.java:145)
	at org.apache.maven.surefire.junitcore.JUnitCoreProvider.getSuites(JUnitCoreProvider.java:103)
{noformat}
The initial error was:
{noformat}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test (integration-test) on project hbase: Execution integration-test of goal org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test failed: java.lang.reflect.InvocationTargetException; nested exception is java.lang.reflect.InvocationTargetException: null: NullPointerException -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test (integration-test) on project hbase: Execution integration-test of goal org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test failed: java.lang.reflect.InvocationTargetException; nested exception is java.lang.reflect.InvocationTargetException: null
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
{noformat}
Jenkins build failing; failsafe NPE'ing Key: HBASE-5729 URL: https://issues.apache.org/jira/browse/HBASE-5729 Project: HBase Issue Type: Bug Reporter: stack Assignee: stack Priority: Blocker Builds up on jenkins have been failing over the last few days. Looking at it w/ nkeywal, it's kinda odd. I ran the exact command locally, as did N, and it works fine. I removed all of my repo and it still works. N looked at the surefire source. It's the includes that are coming back empty, causing the NPE we see up on jenkins. Extra odd is that it does not seem like it is a checkin of ours that brought this on. See here where it's 'working' on the 0.94 branch: https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/76/ Then a little later Ted triggers a build w/ no changes made: https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/77/console It's failing running the integration test phase. Let me mess around and try and get it going again.
[jira] [Commented] (HBASE-5729) Jenkins build failing; failsafe NPE'ing
[ https://issues.apache.org/jira/browse/HBASE-5729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248243#comment-13248243 ] nkeywal commented on HBASE-5729:
{noformat}
mvn -e site install assembly:single -Dtest.exclude.pattern=whatever
{noformat}
works as well. It seems that surefire wants the variables to be defined in
{noformat}
<excludes>
  <exlude>${unittest.include}</exlude>
  <exclude>**/*$*</exclude>
  <exclude>${test.exclude.pattern}</exclude>
</excludes>
{noformat}
So removing the line from the pom should be OK, even if that does not explain why it was working before. Jenkins build failing; failsafe NPE'ing Key: HBASE-5729 URL: https://issues.apache.org/jira/browse/HBASE-5729 Project: HBase Issue Type: Bug Reporter: stack Assignee: stack Priority: Blocker
[jira] [Commented] (HBASE-5729) Jenkins build failing; failsafe NPE'ing
[ https://issues.apache.org/jira/browse/HBASE-5729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247414#comment-13247414 ] nkeywal commented on HBASE-5729: Locally, I reproduce the issue with
{noformat}
mvn -e -X site install assembly:single
{noformat}
Jenkins build failing; failsafe NPE'ing Key: HBASE-5729 URL: https://issues.apache.org/jira/browse/HBASE-5729 Project: HBase Issue Type: Bug Reporter: stack Priority: Blocker
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1320#comment-1320 ] nkeywal commented on HBASE-5682:

bq. none of ZKUtil methods actually throw exceptions

From what I see on 0.96 they should, as the return is not reached: the pattern is to call keeperException, and keeperException throws an exception.
{noformat}
public void keeperException(KeeperException ke) throws KeeperException {
  LOG.error(prefix("Received unexpected KeeperException, re-throwing exception"), ke);
  throw ke;
}
{noformat}
Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only) Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Priority: Critical Fix For: 0.94.0 Attachments: 5682-all-v2.txt, 5682-all-v3.txt, 5682-all.txt, 5682-v2.txt, 5682.txt Just realized that without this HBASE-4805 is broken. I.e. there's no point keeping a persistent HConnection around if it can be rendered permanently unusable if the ZK connection is lost temporarily. Note that this is fixed in 0.96 with HBASE-5399 (but that seems too big to backport)
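The pattern described in this comment can be sketched as follows. This is a minimal, self-contained illustration and not the HBase source: `KeeperException` is a local stand-in for the ZooKeeper class, and `WatcherSketch`, `readNode` and its `fail` flag are hypothetical names made up for the example. Only `keeperException` mirrors the method quoted in the comment.

```java
// Stand-in for org.apache.zookeeper.KeeperException so the sketch compiles on its own.
class KeeperException extends Exception {
    KeeperException(String message) {
        super(message);
    }
}

class WatcherSketch {
    // Mirrors the quoted method: log the error, then always rethrow it.
    void keeperException(KeeperException ke) throws KeeperException {
        System.err.println("Received unexpected KeeperException, re-throwing exception: "
                + ke.getMessage());
        throw ke;
    }

    // Hypothetical ZKUtil-style helper: the catch block delegates to keeperException.
    String readNode(String path, boolean fail) throws KeeperException {
        try {
            if (fail) {
                throw new KeeperException("connection loss on " + path);
            }
            return "data@" + path;
        } catch (KeeperException ke) {
            keeperException(ke);
            return null; // required by the compiler, but never reached at runtime
        }
    }
}
```

The compiler cannot know that `keeperException` always throws, so the `return null` has to be written, yet a caller never sees it: the exception propagates instead.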
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244484#comment-13244484 ] nkeywal commented on HBASE-5682: In 0.96 this should work, with the restriction that you can get a non-working connection, which gets fixed when you try to use it. It's a different mechanism from the one for HBaseAdmin, as HBaseAdmin first checks the connection. The ZK mechanism is more efficient (you save a remote call to check that the connection is really working), but it is more complex. However, it seems it does not work in the end:

bq. What I saw in 0.96 is that the client was blocked for a very long time (gave up after a few minutes), even though I had set all timeouts to low values. This is also deadly in an app server setting. Might be a simple fix there, didn't dig deeper.

@lars What exactly did you do? I can fix it on 0.96. Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only) Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Priority: Critical Fix For: 0.94.0 Attachments: 5682-all-v2.txt, 5682-all-v3.txt, 5682-all-v4.txt, 5682-all.txt, 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5693) When creating a region, the master initializes it and creates a memstore within the master server
[ https://issues.apache.org/jira/browse/HBASE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244519#comment-13244519 ] nkeywal commented on HBASE-5693: I don't think so. I didn't see them locally. When creating a region, the master initializes it and creates a memstore within the master server Key: HBASE-5693 URL: https://issues.apache.org/jira/browse/HBASE-5693 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5593.v2.patch, 5693.v1.patch I didn't do a complete analysis, but the attached patch saves more than 0.25s for each region creation and locally all the unit tests work.
[jira] [Commented] (HBASE-5693) When creating a region, the master initializes it and creates a memstore within the master server
[ https://issues.apache.org/jira/browse/HBASE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243751#comment-13243751 ] nkeywal commented on HBASE-5693: I didn't look very far into the code. CreateTableHandler is executed on the master. It does not need to initialize the memstore and so on. The underlying method is called from the region server as well, and there the initialization code is called. Maybe there is something more complex that I didn't see, but at least all the unit tests went well. On Sun, Apr 1, 2012 at 5:28 PM, Ted Yu (Commented) (JIRA) When creating a region, the master initializes it and creates a memstore within the master server Key: HBASE-5693 URL: https://issues.apache.org/jira/browse/HBASE-5693 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5693.v1.patch
[jira] [Commented] (HBASE-5693) When creating a region, the master initializes it and creates a memstore within the master server
[ https://issues.apache.org/jira/browse/HBASE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243775#comment-13243775 ] nkeywal commented on HBASE-5693: Ok, I will do that + a test on a real cluster. On Sun, Apr 1, 2012 at 6:12 PM, Ted Yu (Commented) (JIRA) When creating a region, the master initializes it and creates a memstore within the master server Key: HBASE-5693 URL: https://issues.apache.org/jira/browse/HBASE-5693 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5693.v1.patch
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242117#comment-13242117 ] nkeywal commented on HBASE-5573: It can be committed imho. Replace client ZooKeeper watchers by simple ZooKeeper reads Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5573.v1.patch, 5573.v2.patch, 5573.v4.patch, 5573.v6.patch, 5573.v7.patch, 5573.v8.patch Some code in the package needs to read data in ZK. This could be done by a simple read, but is actually implemented with a watcher. This holds ZK resources. Fixing this could also be an opportunity to remove the need for the client to provide the master address and port.
[jira] [Commented] (HBASE-5666) RegionServer doesn't retry to check if base node is available
[ https://issues.apache.org/jira/browse/HBASE-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242796#comment-13242796 ] nkeywal commented on HBASE-5666: I confirm I didn't modify this part in trunk. But who knows. I will have a look at it this weekend. RegionServer doesn't retry to check if base node is available Key: HBASE-5666 URL: https://issues.apache.org/jira/browse/HBASE-5666 Project: HBase Issue Type: Bug Components: regionserver, zookeeper Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Attachments: hbase-1-regionserver.log, hbase-2-regionserver.log, hbase-3-regionserver.log, hbase-master.log, hbase-regionserver.log, hbase-zookeeper.log I've a script that starts hbase and a couple of region servers in distributed mode (hbase.cluster.distributed = true)
{code}
$HBASE_HOME/bin/start-hbase.sh
$HBASE_HOME/bin/local-regionservers.sh start 1 2 3
{code}
but the region servers are not able to start... It seems that during the RS start the znode is still not available, and HRegionServer.initializeZooKeeper() checks just once whether the base node is available.
{code}
2012-03-28 21:54:05,013 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.
2012-03-28 21:54:08,598 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server localhost,60202,133296824: Initialization of RS failed. Hence aborting RS.
java.io.IOException: Received the shutdown message while waiting.
	at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:626)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:596)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:558)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:672)
	at java.lang.Thread.run(Thread.java:662)
{code}
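One way to picture the behaviour the report asks for (retrying the base node check instead of checking once) is the sketch below. It is only an illustration under stated assumptions: `BaseNodeCheck`, `waitForBaseNode`, the attempt count and the sleep time are hypothetical and are not taken from HRegionServer or from any attached patch. The existence check is passed in as a `BooleanSupplier` so the sketch needs no ZooKeeper dependency.

```java
import java.util.function.BooleanSupplier;

class BaseNodeCheck {
    // Polls baseNodeExists up to maxAttempts times, sleeping between attempts.
    // Returns true as soon as the check succeeds, false if every attempt fails
    // or the thread is interrupted while sleeping.
    static boolean waitForBaseNode(BooleanSupplier baseNodeExists, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (baseNodeExists.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
                    return false;
                }
            }
        }
        return false;
    }
}
```

A caller would abort the region server only when this returns false, which gives the master time to create the base znode during a concurrent start.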
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236499#comment-13236499 ] nkeywal commented on HBASE-5573:

bq. System.exit(1);

Actually it was already like that in hbasefsck. I replaced a tracker with a watcher that does not watch, just to read the data, that's all.

bq. Why not add a create method to ZooKeeperWatcher that takes a name, conf, and Abortable? Or is that a ZKW Constructor altogether?

Yes, the question is what to do when you're asked to abort. Here I reused the approach in hbasefsck: just exit.

bq. N, can you explain more about what is going on here. How is it that we are not taking a Watcher when we are creating a ZKW? Because we don't call start? (If so, that'd be 'elegant' solution)

A ZKW is a watcher. When you create a ZKW, you create a RecoverableZooKeeper with yourself as a parameter. Pseudo code is:
{noformat}
class RecoverableZooKeeper {
  ZooKeeper zk;
  RecoverableZooKeeper(Watcher w) {
    zk = new ZooKeeper(w);
  }
}

class ZooKeeperWatcher implements Watcher {
  RecoverableZooKeeper rz;
  ZooKeeperWatcher() {
    rz = new RecoverableZooKeeper(this);
  }
}
{noformat}
Using 'this' in a constructor is asking for trouble, but it works in this case (remember, that's the existing code, not mine :-) ). Basically all these classes are very strongly coupled. When I tried to partially decouple them it exploded in my hands, because you need a watcher anyway to manage the session expiry stuff. I don't have a middle solution here: it's either a full rewrite, with a lot of fun keeping the existing interfaces for backward compatibility, or nothing. So in the final patch I've just done some cleanup: I removed the last usage of getZooKeeperWatcher and the usage of any watcher. So there's no proof in the code, just that all the functions we use on the client don't actually use a watcher. Anyway, they have a session in the ZK servers, so they are expensive. But thanks to #5399 the session on ZK will be closed after 5 minutes. So if you have an architecture with clients coming up and down, you will be able to increase the number of clients.

Three last comments:
- One of the design issues is that there are two APIs: you can use any of the ZKW, RZK, RK objects directly, or you can go through the static ZKUtils. Maybe the intermediate solutions lie around this area.
- Even if the existing design should not be shown to innocent scholars, it's not that terrible, because it's small. I didn't really like my first patches because I was adding more classes and complexity without fixing the design.
- In the long term, I think that it actually makes sense to have a watcher in the client. It's not about the previous code: the previous code was not really using watchers, it was setting watchers without using them. The new code (after #5399 and #5573) does not use or set watchers. But when you have a fat client architecture like ours, it makes sense to share some global state information, and it scales better when the info is pushed rather than pulled. Having said that, there are many questions left: how expensive this is with ZooKeeper today, whether ZooKeeper is really designed for this (it's not really global coordination work, as the clients would be readers only), and so on.

FWIW, it seems that the current limit is around 10K sessions in ZK:
{panel}
Patrick Hunt / Nov 18, 2010; 8:57pm
Re: number of clients/watchers

fyi: I haven't heard of anyone running over 10k sessions. I've tried 20k before and had issues. [...] A session is represented by a ZooKeeper object. One session per object. So if you have 10 client hosts each creating it's own ZooKeeper instance you'll have 10 sessions. This is regardless of the number of znodes, watches, etc... Watches were designed to be lightweight and you can maintain a large number of them. (25million spread across 500 sessions in my example)
{panel}
There was also a discussion on the ZK mailing list about lightweight sessions: http://markmail.org/message/cyow2xkneh2t3juc

Replace client ZooKeeper watchers by simple ZooKeeper reads Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5573.v1.patch, 5573.v2.patch, 5573.v4.patch, 5573.v6.patch
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237143#comment-13237143 ] nkeywal commented on HBASE-5573:

bq. OK. Any recommendation you can make here having been down deep in this code? We should make everyone go via ZKUtils and via ZKAssign, etc., and clean up any other errant use of zkw directly? Would that be good to do?

It would do no harm, as it's not good to have two APIs. It could be a first step toward changing the internal design. I haven't checked the impact.

bq. Do you want to be consistent? You call methods getZKW most times and then getZooKeeperWatcher in this test code (I prefer the latter).

Ok, I will change all this to getZooKeeperWatcher.

Replace client ZooKeeper watchers by simple ZooKeeper reads Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5573.v1.patch, 5573.v2.patch, 5573.v4.patch, 5573.v6.patch
[jira] [Commented] (HBASE-4993) Performance regression in minicluster creation
[ https://issues.apache.org/jira/browse/HBASE-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237342#comment-13237342 ] nkeywal commented on HBASE-4993: Hi JD, You're right, there is a bug. The code does not do what the documentation says it does. It seems that changing the code to {code} (count < minToStart || lastCountChange+interval > now){code} would make the code work as documented. If you have 0 region servers that checked in and you are under the interval, you wait: (true or true) = true. If you have 0 region servers but you are above the interval, you wait: (true or false) = true. If you have 1 or more region servers that checked in and you are under the interval, you wait: (false or true) = true. Would that be OK for you, or do you want to change the documented behavior as well? Note that we could also change the default values for the min number of servers or the interval. On Fri, Mar 23, 2012 at 11:07 PM, Jean-Daniel Cryans (Commented) (JIRA) Performance regression in minicluster creation Key: HBASE-4993 URL: https://issues.apache.org/jira/browse/HBASE-4993 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Fix For: 0.94.0 Attachments: 4993.patch, 4993.v3.patch Side effect of 4610: the mini cluster needs 4,5 seconds to start
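The proposed condition can be checked with a small sketch. It reads the condition as `count < minToStart || lastCountChange + interval > now`, which is the form consistent with the three cases listed in the comment. The variable names come from the comment; the `StartupWait` class, the `shouldWait` signature and the sample numbers are assumptions made for the illustration and are not the master's actual code.

```java
class StartupWait {
    // True while the master should keep waiting for region servers to check in:
    // either too few servers have checked in, or the count changed less than
    // `interval` milliseconds ago.
    static boolean shouldWait(int count, int minToStart,
                              long lastCountChange, long interval, long now) {
        return count < minToStart || lastCountChange + interval > now;
    }
}

class StartupWaitDemo {
    public static void main(String[] args) {
        long lastChange = 1000L;
        long interval = 500L;
        // 0 servers, under the interval: (true or true)
        System.out.println(StartupWait.shouldWait(0, 1, lastChange, interval, 1200L));
        // 0 servers, above the interval: (true or false)
        System.out.println(StartupWait.shouldWait(0, 1, lastChange, interval, 2000L));
        // 1 server, under the interval: (false or true)
        System.out.println(StartupWait.shouldWait(1, 1, lastChange, interval, 1200L));
        // 1 server, above the interval: (false or false), the only case that stops waiting
        System.out.println(StartupWait.shouldWait(1, 1, lastChange, interval, 2000L));
    }
}
```

The fourth case, not spelled out in the comment, is the one where the wait ends: enough servers have checked in and the count has been stable for the whole interval.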
[jira] [Commented] (HBASE-4955) Use the official versions of surefire & junit
[ https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235605#comment-13235605 ] nkeywal commented on HBASE-4955: Monthly update:
- The JUnit team merged my fix on the JUnit trunk, so the next JUnit version (should be 4.11) will be OK for us. I don't have the release date, however.
- The surefire regression (SUREFIRE-827) is not yet fixed; it is still planned for 2.13. No release date.

Use the official versions of surefire & junit Key: HBASE-4955 URL: https://issues.apache.org/jira/browse/HBASE-4955 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor

We currently use private versions for Surefire & JUnit since HBASE-4763. This JIRA tracks what we need to move to official versions. Surefire 2.11 is just out but, after some tests, it does not contain everything we need.

JUnit: Could be for JUnit 4.11. Issue to monitor: https://github.com/KentBeck/junit/issues/359: fixed in our version, no feedback for an integration on trunk.

Surefire: Could be for Surefire 2.12. Issues to monitor are:
- 329 (category support): fixed, we use the official implementation from the trunk
- 786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk
- 791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk
- 793 (incorrect time in the XML report): not fixed (reopened) on trunk, fixed in our version
- 760 (does not take into account the test method): fixed in trunk, not fixed in our version
- 798 (print immediately the test class name): not fixed in trunk, not fixed in our version
- 799 (allow test parallelization when forkMode=always): not fixed in trunk, not fixed in our version
- 800 (redirectTestOutputToFile not taken into account): not yet fixed on trunk, fixed in our version

800 & 793 are the most important to monitor; they are the only ones that are fixed in our version but not on trunk.
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234254#comment-13234254 ] nkeywal commented on HBASE-5573: For the ZK and ZKW dependency, prefixing zk for the ZooKeeper API objects and hb for the HBase objects: a zk.ZooKeeper can be created with a zk.Watcher as a parameter for the constructor. A hb.ZooKeeperWatcher implements the zk.Watcher interface. So we have:
- hb.ZooKeeperWatcher implements zk.ZooKeeper.
- hb.ZooKeeperWatcher contains hb.RecoverableZooKeeper
- hb.RecoverableZooKeeper contains zk.ZooKeeper
- zk.ZooKeeper contains (hb.ZooKeeperWatcher implements zk.ZooKeeper)

Loop done. It was like this before my patch. After my patch there are two cases:
1) As above.
2) For ZooKeeperHBaseNodes, the watcher is actually null, so it becomes:
- hb.ZooKeeperHBaseNodes contains hb.RecoverableZooKeeper
- hb.RecoverableZooKeeper contains zk.ZooKeeper
- zk.ZooKeeper contains null

This allows the code to be shared, but it makes it more complex.

bq. I see. ROZK has a RZK. Can RZK have a ZooKeeper only?

Yes, but it would lead to some code duplication and we would lose the recoverable feature (or we would need to duplicate it as well).

bq. ZooKeeperHBaseNodes = ReadOnlyZooKeeper

It's not really a read-only ZooKeeper: you can write with it.

bq. Looks good

OK, I am going to finish it with this approach then.

Replace client ZooKeeper watchers by simple ZooKeeper reads Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5573.v1.patch, 5573.v2.patch
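The two cases described in this comment can be laid out side by side in a short sketch. Every type below is a simplified stand-in declared locally so that the example compiles on its own: they only borrow the names used in the comment and are not the real ZooKeeper or HBase classes. The single-argument constructors and the `String` event are assumptions made for brevity.

```java
// Stand-ins for the ZooKeeper API types (the "zk." side).
interface Watcher {
    void process(String event);
}

class ZooKeeper {
    final Watcher watcher; // may be null
    ZooKeeper(Watcher watcher) {
        this.watcher = watcher;
    }
}

// Stand-ins for the HBase types (the "hb." side).
class RecoverableZooKeeper {
    final ZooKeeper zk;
    RecoverableZooKeeper(Watcher watcher) {
        this.zk = new ZooKeeper(watcher);
    }
}

// Case 1: the watcher hands itself down, which closes the loop
// ZooKeeperWatcher -> RecoverableZooKeeper -> ZooKeeper -> ZooKeeperWatcher.
class ZooKeeperWatcher implements Watcher {
    final RecoverableZooKeeper rz;
    ZooKeeperWatcher() {
        this.rz = new RecoverableZooKeeper(this);
    }
    @Override
    public void process(String event) {
        // events would be handled here
    }
}

// Case 2: same RecoverableZooKeeper code, but no watcher at the end of the chain.
class ZooKeeperHBaseNodes {
    final RecoverableZooKeeper rz;
    ZooKeeperHBaseNodes() {
        this.rz = new RecoverableZooKeeper(null);
    }
}
```

Both classes reuse `RecoverableZooKeeper` unchanged, which is the code sharing the comment mentions; the price is that `ZooKeeper.watcher` may now be null.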
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234413#comment-13234413 ]
nkeywal commented on HBASE-5573:

bq. This seems way broke Nicolas: zk.ZooKeeper contains (hb.ZooKeeperWatcher implements zk.ZooKeeper)
I wanted to say that in the ZK API, you give a Watcher as a parameter to the ZooKeeper object. In HBase, this watcher is the ZooKeeperWatcher. And this ZooKeeperWatcher contains the RecoverableZK that contains the ZooKeeper object, so we have a loop.

bq. Is ZKW doing everything, not just Watching? If so, would fixing this help?
Yes, it does everything. With the split there is now a new object for when we just want to read/write. For the name, let's go for NoWatchZK.

I'm currently testing the patch. There is still an issue with things like ZKAssign.getData: it sets a watcher, but is it really needed?

Replace client ZooKeeper watchers by simple ZooKeeper reads
---
Key: HBASE-5573
URL: https://issues.apache.org/jira/browse/HBASE-5573
Project: HBase
Issue Type: Improvement
Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5573.v1.patch, 5573.v2.patch

Some code in the package needs to read data in ZK. This could be done by a simple read, but is actually implemented with a watcher. This holds ZK resources. Fixing this could also be an opportunity to remove the need for the client to provide the master address and port.
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234462#comment-13234462 ] nkeywal commented on HBASE-5573: Oops; there is another -big- issue: you need a watcher if you want to get the info on session expiry... So isolating it is not possible. That breaks the patch it seems. Replace client ZooKeeper watchers by simple ZooKeeper reads --- Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5573.v1.patch, 5573.v2.patch Some code in the package needs to read data in ZK. This could be done by a simple read, but is actually implemented with a watcher. This holds ZK resources. Fixing this could also be an opportunity to remove the need for the client to provide the master address and port. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5551) Some functions should not be used by customer code and must be deprecated in 0.94
[ https://issues.apache.org/jira/browse/HBASE-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233540#comment-13233540 ] nkeywal commented on HBASE-5551: @Jonathan The client code should not need to connect to the master directly, so all the functions should be available from HBaseAdmin or HConnection. Some functions should not be used by customer code and must be deprecated in 0.94 - Key: HBASE-5551 URL: https://issues.apache.org/jira/browse/HBASE-5551 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: nkeywal Assignee: nkeywal Fix For: 0.94.0 Attachments: 5551.092.patch They are: HBaseAdmin#getMaster HConnection#getZooKeeperWatcher HConnection#getMaster -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads
[ https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233721#comment-13233721 ]
nkeywal commented on HBASE-5573:

Patch to get a first round of feedback. Unfortunately, it's more a hack than anything else, because I'm trying to keep the existing code interface rather than rewriting everything.

Today HBase considers any ZK client as a client that will watch the values, and does not distinguish simple readers from watchers. To change this, I:
- Split ZooKeeperWatcher into two classes: one ZooKeeperWatcher with the same responsibilities as today, and another, ZooKeeperHBaseNodes, that contains the hbase znode definition. ZooKeeperWatcher extends ZooKeeperHBaseNodes.
- In ZKUtils, depending on whether a watch is involved or not, changed the expected type from ZooKeeperWatcher to ZooKeeperHBaseNodes.

That's not a hack yet. The issues are:
- The client is supposed to wait if the root location znode is not yet created in ZK. I don't think that the trunk implementation actually works, but it's done with a watcher. As we don't want a watcher, I changed it to a loop.
- As HConnectionImplementation now uses a simple connection and not a Watcher, the deprecated interface (that returns a ZooKeeperWatcher) cannot reuse the internal connection to ZK, so the connection must be duplicated.
- In trunk, the current dependencies are:
  - RecoverableZooKeeper depends on (contains) ZooKeeper
  - ZooKeeper depends on (contains) ZooKeeperWatcher
  - ZooKeeperWatcher depends on (contains) RecoverableZooKeeper
  - ZKUtils depends on (uses) RecoverableZooKeeper and ZooKeeperWatcher

That makes it difficult to reuse any part of the code without having a ZooKeeperWatcher. To be able to reuse it, when a ZooKeeperHBaseNodes is used the underlying ZooKeeperWatcher is actually null. I still have to do a lot of renaming if we go for this approach.
I had some failure that could be unrelated, but I haven't looked at them yet: org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster org.apache.hadoop.hbase.io.encoding.TestLoadAndSwitchEncodeOnDisk Replace client ZooKeeper watchers by simple ZooKeeper reads --- Key: HBASE-5573 URL: https://issues.apache.org/jira/browse/HBASE-5573 Project: HBase Issue Type: Improvement Components: client, zookeeper Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Some code in the package needs to read data in ZK. This could be done by a simple read, but is actually implemented with a watcher. This holds ZK resources. Fixing this could also be an opportunity to remove the need for the client to provide the master address and port. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
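The comment above mentions replacing the watcher on the root location znode by a loop. A minimal sketch of that idea follows, assuming a generic read function rather than the actual patch code: `readWithRetry`, its parameters and the example address are all hypothetical names chosen for illustration.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Sketch only: polls a read until it yields a value, instead of registering a watcher. */
final class PollingReadSketch {

    /**
     * Calls {@code read} until it returns a value or {@code timeoutMs} has elapsed,
     * sleeping {@code pauseMs} between attempts. Returns empty on timeout or interruption.
     */
    static <T> Optional<T> readWithRetry(Supplier<Optional<T>> read, long timeoutMs, long pauseMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            Optional<T> value = read.get();
            if (value.isPresent()) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // znode still absent after the timeout
            }
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return Optional.empty();
            }
        }
    }
}
```

Compared with a watcher, the loop holds no server-side resource between attempts, at the cost of up to one pause of extra latency once the znode appears.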
[jira] [Commented] (HBASE-5569) Do not collect deleted KVs when they are still in use by a scanner.
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232513#comment-13232513 ]
nkeywal commented on HBASE-5569:

I stopped it after 2700 iterations (10 hours) with no errors, so the patch seems to fix the issue...

Do not collect deleted KVs when they are still in use by a scanner.
---
Key: HBASE-5569
URL: https://issues.apache.org/jira/browse/HBASE-5569
Project: HBase
Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Fix For: 0.94.0, 0.96.0
Attachments: 5569-v2.txt, 5569-v3.txt, 5569-v4.txt, 5569.txt, TestAtomicOperation-output.trunk_120313.rar

I noticed this because TestAtomicOperation.testMultiRowMutationMultiThreads fails rarely. The solution is similar to HBASE-2856, where expired KVs are not collected when in use by a scanner.
---
What I pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see an assertion failure in the log I see this before:
{quote}
2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0
{quote}
The order of the Put and Delete is sometimes reversed. The test threads should always see exactly one KV; if the before was the Put the threads see 0 KVs, if the before was the Delete the threads see 2 KVs. This debug message comes from StoreScanner to checkReseek. It seems we still have some consistency issue with scanning sometimes :(
[jira] [Commented] (HBASE-5569) Do not collect deleted KVs when they are still in use by a scanner.
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232382#comment-13232382 ] nkeywal commented on HBASE-5569: I've got the testRowMutationMultiThreads running currently on the patch v3. No issue so far. I will make it run 5000 times, previously it always failed before 1000 iterations. Do not collect deleted KVs when they are still in use by a scanner. --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.0, 0.96.0 Attachments: 5569-v2.txt, 5569-v3.txt, 5569.txt, TestAtomicOperation-output.trunk_120313.rar I noticed this because TestAtomicOperation.testMultiRowMutationMultiThreads fails rarely. The solution is similar to HBASE-2856, where expired KVs are not collected when in use by a scanner. --- What I pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see a assertion failure in the log I see this before: {quote} 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0 {quote} The order of if the Put and Delete is sometimes reversed. The test threads should always see exactly one KV, if the before was the Put the thread see 0 KVs, if the before was the Delete the threads see 2 KVs. This debug message comes from StoreScanner to checkReseek. It seems we still some consistency issue with scanning sometimes :( -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5569) Do not collect deleted KVs when they are still in use by a scanner.
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232390#comment-13232390 ] nkeywal commented on HBASE-5569: Right now it's still running well. I'm doing the test on a small server, with a 4 core Intel Xeon E3-1220. Do not collect deleted KVs when they are still in use by a scanner. --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.0, 0.96.0 Attachments: 5569-v2.txt, 5569-v3.txt, 5569-v4.txt, 5569.txt, TestAtomicOperation-output.trunk_120313.rar I noticed this because TestAtomicOperation.testMultiRowMutationMultiThreads fails rarely. The solution is similar to HBASE-2856, where expired KVs are not collected when in use by a scanner. --- What I pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see a assertion failure in the log I see this before: {quote} 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0 {quote} The order of if the Put and Delete is sometimes reversed. The test threads should always see exactly one KV, if the before was the Put the thread see 0 KVs, if the before was the Delete the threads see 2 KVs. This debug message comes from StoreScanner to checkReseek. It seems we still some consistency issue with scanning sometimes :( -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires
[ https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232005#comment-13232005 ] nkeywal commented on HBASE-5549: @lars: there is no theoretical dependency. In practice, I think that both impacted some same tests so it may require some minor changes when applying the patch. Master can fail if ZooKeeper session expires Key: HBASE-5549 URL: https://issues.apache.org/jira/browse/HBASE-5549 Project: HBase Issue Type: Bug Components: master, zookeeper Affects Versions: 0.96.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5549.v10.patch, 5549.v11.patch, 5549.v6.patch, 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch There is a retry mechanism in RecoverableZooKeeper, but when the session expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism does not work in this case. This is why a sleep is needed in TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher to be recreated before using the connection. This can happen in real life, it can happen when: - master zookeeper starts - zookeeper connection is cut - master enters the retry loop - in the meantime the session expires - the network comes back, the session is recreated - the retries continues, but on the wrong object, hence fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
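The failure sequence in the issue description (the retries continue, but on the wrong object) can be illustrated with a small sketch. This is not the RecoverableZooKeeper code: `Session`, `retryOnCurrent` and the recreation callback are hypothetical names, and the point shown is only that each attempt looks up the current handle instead of reusing one captured before the loop.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Sketch only: a retry loop that survives the handle being recreated underneath it. */
final class RetrySketch {

    /** Hypothetical handle that becomes unusable once its session has expired. */
    static final class Session {
        final int id;
        volatile boolean expired;

        Session(int id) {
            this.id = id;
        }

        String read(String path) {
            if (expired) {
                throw new IllegalStateException("session " + id + " expired");
            }
            return path + "@session" + id;
        }
    }

    /**
     * Runs {@code op} up to {@code maxAttempts} times. The handle is fetched from
     * {@code current} before every attempt, so a session recreated by
     * {@code onFailure} (or by another thread) is picked up by the next retry.
     */
    static <T> T retryOnCurrent(AtomicReference<Session> current,
                                Function<Session, T> op,
                                int maxAttempts,
                                Runnable onFailure) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IllegalStateException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Session session = current.get(); // deliberately not cached across attempts
            try {
                return op.apply(session);
            } catch (IllegalStateException e) {
                last = e;
                onFailure.run();
            }
        }
        throw last;
    }
}
```

Had the loop captured the `Session` once before the first attempt, every retry would hit the expired object, which is the behavior the description reports.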
[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires
[ https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231305#comment-13231305 ] nkeywal commented on HBASE-5549: Can be committed. Hopefully this is the end of the ZooKeeper expiry flakiness. Master can fail if ZooKeeper session expires Key: HBASE-5549 URL: https://issues.apache.org/jira/browse/HBASE-5549 Project: HBase Issue Type: Bug Components: master, zookeeper Affects Versions: 0.96.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch There is a retry mechanism in RecoverableZooKeeper, but when the session expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism does not work in this case. This is why a sleep is needed in TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher to be recreated before using the connection. This can happen in real life, it can happen when: - master zookeeper starts - zookeeper connection is cut - master enters the retry loop - in the meantime the session expires - the network comes back, the session is recreated - the retries continues, but on the wrong object, hence fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires
[ https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231441#comment-13231441 ] nkeywal commented on HBASE-5549: Yes, see HBASE-5572 for the reasons... On Fri, Mar 16, 2012 at 6:41 PM, Zhihong Yu (Commented) (JIRA) Master can fail if ZooKeeper session expires Key: HBASE-5549 URL: https://issues.apache.org/jira/browse/HBASE-5549 Project: HBase Issue Type: Bug Components: master, zookeeper Affects Versions: 0.96.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch There is a retry mechanism in RecoverableZooKeeper, but when the session expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism does not work in this case. This is why a sleep is needed in TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher to be recreated before using the connection. This can happen in real life, it can happen when: - master zookeeper starts - zookeeper connection is cut - master enters the retry loop - in the meantime the session expires - the network comes back, the session is recreated - the retries continues, but on the wrong object, hence fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires
[ https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231444#comment-13231444 ] nkeywal commented on HBASE-5549: Can't create a review, I got error 500 as well... Master can fail if ZooKeeper session expires Key: HBASE-5549 URL: https://issues.apache.org/jira/browse/HBASE-5549 Project: HBase Issue Type: Bug Components: master, zookeeper Affects Versions: 0.96.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch There is a retry mechanism in RecoverableZooKeeper, but when the session expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism does not work in this case. This is why a sleep is needed in TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher to be recreated before using the connection. This can happen in real life, it can happen when: - master zookeeper starts - zookeeper connection is cut - master enters the retry loop - in the meantime the session expires - the network comes back, the session is recreated - the retries continues, but on the wrong object, hence fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires
[ https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231453#comment-13231453 ] nkeywal commented on HBASE-5549: Now it works: https://reviews.apache.org/r/4391/ Master can fail if ZooKeeper session expires Key: HBASE-5549 URL: https://issues.apache.org/jira/browse/HBASE-5549 Project: HBase Issue Type: Bug Components: master, zookeeper Affects Versions: 0.96.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch There is a retry mechanism in RecoverableZooKeeper, but when the session expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism does not work in this case. This is why a sleep is needed in TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher to be recreated before using the connection. This can happen in real life, it can happen when: - master zookeeper starts - zookeeper connection is cut - master enters the retry loop - in the meantime the session expires - the network comes back, the session is recreated - the retries continues, but on the wrong object, hence fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires
[ https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231598#comment-13231598 ] nkeywal commented on HBASE-5549: v11 with the comments taken into account... Thank you for the review. Master can fail if ZooKeeper session expires Key: HBASE-5549 URL: https://issues.apache.org/jira/browse/HBASE-5549 Project: HBase Issue Type: Bug Components: master, zookeeper Affects Versions: 0.96.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5549.v10.patch, 5549.v11.patch, 5549.v6.patch, 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch There is a retry mechanism in RecoverableZooKeeper, but when the session expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism does not work in this case. This is why a sleep is needed in TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher to be recreated before using the connection. This can happen in real life, it can happen when: - master zookeeper starts - zookeeper connection is cut - master enters the retry loop - in the meantime the session expires - the network comes back, the session is recreated - the retries continues, but on the wrong object, hence fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230035#comment-13230035 ]
nkeywal commented on HBASE-5572:

@stack: it seems the patch could be committed?

KeeperException.SessionExpiredException management could be improved in Master
--
Key: HBASE-5572
URL: https://issues.apache.org/jira/browse/HBASE-5572
Project: HBase
Issue Type: Improvement
Components: master
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Fix For: 0.96.0
Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch

Synthesis:
1) TestMasterZKSessionRecovery distinguishes two cases on SessionExpiredException. One is explicitly not managed. However, it seems that there is no reason for this.
2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.

Changes are:
2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove the recursion
1) Removing TestMasterZKSessionRecovery

Detailed justification: testMasterZKSessionRecoveryFailure says:
{noformat}
/**
 * Negative test of master recovery from zk session expiry.
 *
 * Starts with one master. Fakes the master zk session expired.
 * Ensures the master cannot recover the expired zk session since
 * the master zk node is still there.
 */
public void testMasterZKSessionRecoveryFailure() throws Exception {
  MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
  HMaster m = cluster.getMaster();
  m.abort("Test recovery from zk session expired",
    new KeeperException.SessionExpiredException());
  assertTrue(m.isStopped());
}
{noformat}
This test works, i.e. the assertion is always verified.
But do we really want this behavior? When looking at the code, we see that what is happening is strange:
- HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master.
- HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stop the master).
- HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then tries to become the active master. If it cannot, it will return false (and that will make HMaster#abort stop the master).
- HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it from becoming the active master.
- ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same host and port, it deletes the node, which starts a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. This result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even if we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master.

In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return true, so we would not stop the master, so the test would fail.
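The last step of the call chain above, where the result of the recursive call is dropped, can be reduced to a few lines. The sketch below is a single-threaded toy model and not the ActiveMasterManager code: `Znode`, `buggy` and `fixed` are hypothetical names, and the real method blocks on ZooKeeper rather than returning immediately.

```java
/** Sketch only: why ignoring the recursive result reports failure after a success. */
final class ActiveMasterSketch {

    /** Toy stand-in for the master znode: holds the active master's address, or null. */
    static final class Znode {
        String owner;
    }

    /** Mirrors the described behavior: the recursive call succeeds but its result is discarded. */
    static boolean buggy(Znode node, String self) {
        if (node.owner == null) {
            node.owner = self; // we became the active master
            return true;
        }
        if (node.owner.equals(self)) {
            node.owner = null;  // delete the stale node left by our expired session
            buggy(node, self);  // this call returns true, but nobody looks at it
        }
        return false;
    }

    /** Single-case variant without recursion: clean up our stale node, then try once. */
    static boolean fixed(Znode node, String self) {
        if (self.equals(node.owner)) {
            node.owner = null;
        }
        if (node.owner == null) {
            node.owner = self;
            return true;
        }
        return false; // another master is active
    }
}
```

With a stale node owned by the same address, `buggy` leaves the caller registered as active master yet returns false, while `fixed` returns true, which matches the observation that the test would fail if the result were not ignored.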
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230040#comment-13230040 ]
nkeywal commented on HBASE-5399:

I've actually been able to reproduce the issue on a previous trunk version, so it's not linked to my changes, and that's a great step forward in terms of rationality. Sometimes when we trigger a zk session expiry the zk connection remains in the connected state. I don't have any explanation for this, but at least it's not a new behavior. I will manage this case explicitly in the version I'm writing for #5549.

Cut the link between the client and the zookeeper ensemble
--
Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Fix For: 0.96.0
Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch

The link is often considered an issue, for various reasons. One of them is that there is a limit on the number of connections that ZK can manage. Stack was suggesting as well to remove the link to master from HConnection. There are choices to be made considering the existing API (that we don't want to break). The first patches I will submit on hadoop-qa should not be committed: they are here to show the progress on the direction taken.

ZooKeeper is used for:
- public getter, to let the client do whatever he wants, and close ZooKeeper when closing the connection => we have to deprecate this but keep it.
- read get master address to create a master = now done with a temporary zookeeper connection - read root location = now done with a temporary zookeeper connection, but questionable. Used in public function locateRegion. To be reworked. - read cluster id = now done once with a temporary zookeeper connection. - check if base done is available = now done once with a zookeeper connection given as a parameter - isTableDisabled/isTableAvailable = public functions, now done with a temporary zookeeper connection. - Called internally from HBaseAdmin and HTable - getCurrentNrHRS(): public function to get the number of region servers and create a pool of thread = now done with a temporary zookeeper connection - Master is used for: - getMaster public getter, as for ZooKeeper = we have to deprecate this but keep it. - isMasterRunning(): public function, used internally by HMerge HBaseAdmin - getHTableDescriptor*: public functions offering access to the master. = we could make them using a temporary master connection as well. Main points are: - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, non connected client will always be really slower, because it's a tcp connection, and establishing a tcp connection is slow. - having a link between ZK and all the client seems to make sense for some Use Cases. However, it won't scale if a TCP connection is required for every client - if we move the table descriptor part away from the client, we need to find a new place for it. - we will have the same issue if HBaseAdmin (for both ZK Master), may be we can put a timeout on the connection. That would make the whole system less deterministic however. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
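Several items in the issue description read "now done with a temporary zookeeper connection". As a rough illustration of that pattern, assuming nothing about the actual patch, a connection can be opened for one read and closed immediately afterwards. `TempConnection`, `withTemporaryConnection` and the `OPEN` counter are hypothetical and exist only for this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Sketch only: open a connection for a single operation, then release it. */
final class TemporaryConnectionSketch {

    /** Counts currently open toy connections, to show that none is left behind. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical short-lived connection; a real one would hold a TCP session. */
    static final class TempConnection implements AutoCloseable {
        TempConnection() {
            OPEN.incrementAndGet();
        }

        String read(String znode) {
            return "data-of:" + znode;
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Runs {@code work} on a fresh connection that is closed even if {@code work} throws. */
    static <T> T withTemporaryConnection(Function<TempConnection, T> work) {
        try (TempConnection connection = new TempConnection()) {
            return work.apply(connection);
        }
    }
}
```

The trade-off is the one raised in the description's main points: nothing is held on the ZooKeeper side between calls, but every call pays the cost of establishing a connection.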
[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13229035#comment-13229035 ] nkeywal commented on HBASE-5569: There's no message on Storescanner.peek, nor error or warning. Here's the log when it fails: {noformat} 2012-03-14 03:14:02,146 DEBUG [Thread-51] regionserver.TestAtomicOperation$1(305): keyvalues=NONE Exception in thread Thread-51 junit.framework.AssertionFailedError at junit.framework.Assert.fail(Assert.java:48) at junit.framework.Assert.fail(Assert.java:56) at org.apache.hadoop.hbase.regionserver.TestAtomicOperation$1.run(TestAtomicOperation.java:307) 2012-03-14 03:14:02,228 DEBUG [Thread-92] regionserver.TestAtomicOperation$1(279): flushing {noformat} Reproduced on Feb' 24th trunk as well, after ~700 iterations, same logs. TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally --- Key: HBASE-5569 URL: https://issues.apache.org/jira/browse/HBASE-5569 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Priority: Minor Attachments: TestAtomicOperation-output.trunk_120313.rar What I pieced together so far is that it is the *scanning* side that has problems sometimes. Every time I see a assertion failure in the log I see this before: {quote} 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): Storescanner.peek() is changed where before = rowB/colfamily11:qual1/75366/Put/vlen=6,and after = rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0 {quote} The order of if the Put and Delete is sometimes reversed. The test threads should always see exactly one KV, if the before was the Put the thread see 0 KVs, if the before was the Delete the threads see 2 KVs. This debug message comes from StoreScanner to checkReseek. It seems we still some consistency issue with scanning sometimes :( -- This message is automatically generated by JIRA. 
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229476#comment-13229476 ]

nkeywal commented on HBASE-5399:

You're right. It's a regression. I will fix this with 5549. I'm having a hard time understanding exactly what's going on here, and the sleep coding pattern does not help, to say the least.

Cut the link between the client and the zookeeper ensemble
--
 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
 Issue Type: Improvement
 Components: client
 Affects Versions: 0.94.0
 Environment: all
 Reporter: nkeywal
 Assignee: nkeywal
 Priority: Minor
 Fix For: 0.96.0
 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch

The link is often considered an issue, for various reasons. One of them is that there is a limit on the number of connections that ZK can manage. Stack was also suggesting removing the link to the master from HConnection.
There are choices to be made considering the existing API (which we don't want to break).
The first patches I will submit on hadoop-qa should not be committed: they are here to show the progress on the direction taken.

ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper when closing the connection => we have to deprecate this but keep it.
 - read the master address to create a master => now done with a temporary zookeeper connection
 - read root location => now done with a temporary zookeeper connection, but questionable. Used in public function locateRegion. To be reworked.
 - read cluster id => now done once with a temporary zookeeper connection.
 - check if base node is available => now done once with a zookeeper connection given as a parameter
 - isTableDisabled/isTableAvailable => public functions, now done with a temporary zookeeper connection.
   - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads => now done with a temporary zookeeper connection

Master is used for:
 - getMaster public getter, as for ZooKeeper => we have to deprecate this but keep it.
 - isMasterRunning(): public function, used internally by HMerge and HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master. => we could make them use a temporary master connection as well.

Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be much slower, because it's a TCP connection, and establishing a TCP connection is slow.
 - having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client
 - if we move the table descriptor part away from the client, we need to find a new place for it.
 - we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic, however.
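The "temporary zookeeper connection" approach listed in the description (open a connection, do one read, close it) might look roughly like the sketch below. This is only an illustration under stated assumptions: `KeeperLike`, `FakeKeeper`, `withTemporaryConnection` and the path `/example/clusterid` are hypothetical names invented here, not the HBase or ZooKeeper API, and the actual patch may be structured quite differently.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

class TemporaryConnectionSketch {

    /** Hypothetical minimal view of a ZooKeeper-like client (assumption, not a real API). */
    interface KeeperLike extends AutoCloseable {
        String read(String path);

        @Override
        void close(); // narrowed: no checked exception, keeps the call site simple
    }

    /**
     * Opens a connection, runs a single action against it, and always closes it,
     * so no long-lived link to the ensemble is kept by the client.
     */
    static <T> T withTemporaryConnection(Supplier<KeeperLike> opener,
                                         Function<KeeperLike, T> action) {
        try (KeeperLike keeper = opener.get()) {
            return action.apply(keeper);
        }
    }

    /** In-memory fake used only to make the sketch runnable. */
    static final class FakeKeeper implements KeeperLike {
        static int openCount = 0; // number of connections currently open
        private final Map<String, String> data;

        FakeKeeper(Map<String, String> data) {
            this.data = data;
            openCount++;
        }

        @Override
        public String read(String path) {
            return data.get(path);
        }

        @Override
        public void close() {
            openCount--;
        }
    }

    public static void main(String[] args) {
        Map<String, String> znodes = Map.of("/example/clusterid", "cluster-42");
        String id = withTemporaryConnection(
                () -> new FakeKeeper(znodes),
                keeper -> keeper.read("/example/clusterid"));
        System.out.println("cluster id = " + id
                + ", connections still open = " + FakeKeeper.openCount);
    }
}
```

The trade-off is the one raised under "Main points": every call pays the cost of establishing a connection, in exchange for not holding one open per client.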
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228300#comment-13228300 ]

nkeywal commented on HBASE-5399:

@stack Yes, this test is flaky... I reproduce the error on the trunk as of March 10th as well. I've seen it fail previously; I think it has been flaky for at least a month (and maybe much longer).

git log:
{noformat}
commit 0f3e025a62f89763fffbf8298d565a6c4e5b7d06
Date: Sat Mar 10 02:27:05 2012 +
{noformat}
With the same stack as in trunk #2676:
{noformat}
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.444 sec FAILURE!
testMultiRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation) Time elapsed: 7.083 sec FAILURE!
junit.framework.AssertionFailedError: expected:0 but was:1
 at junit.framework.Assert.fail(Assert.java:50)
 at junit.framework.Assert.failNotEquals(Assert.java:287)
 at junit.framework.Assert.assertEquals(Assert.java:67)
 at junit.framework.Assert.assertEquals(Assert.java:199)
 at junit.framework.Assert.assertEquals(Assert.java:205)
 at org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testMultiRowMutationMultiThreads(TestAtomicOperation.java:416)
{noformat}

Cut the link between the client and the zookeeper ensemble
--
 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228476#comment-13228476 ]

nkeywal commented on HBASE-5572:

Yes. I've done 3 modifications in the code, two like-for-like (hopefully!) and one with a different behavior. I:
 - removed the variable named cleanSetOfActiveMaster, replaced by return true or return false.
 - replaced the recursive call by a while(true) loop.
 - implicitly (it's hidden because there is no recursive call anymore) changed the function behavior: we now return the final result. For this reason the function behaves differently (we return true instead of false), but it's more in line with the method contract.

This change breaks testMasterZKSessionRecoveryFailure, because it does not fail anymore. TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure was explicitly testing the behavior with both SessionExpired AND a master with the same host and port. I removed it, but I can move it to TestZooKeeper (to save a cluster start/stop) and reverse the assertion in the test (now it does not fail).

KeeperException.SessionExpiredException management could be improved in Master
--
 Key: HBASE-5572
 URL: https://issues.apache.org/jira/browse/HBASE-5572
 Project: HBase
 Issue Type: Improvement
 Components: master
 Affects Versions: 0.96.0
 Reporter: nkeywal
 Assignee: nkeywal
 Priority: Minor
 Attachments: 5572.v1.patch

Synthesis:
 1) TestMasterZKSessionRecovery distinguishes two cases on SessionExpiredException. One is explicitly not managed. However, it seems that there is no reason for this.
 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.

Changes are:
 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
 1) Removing TestMasterZKSessionRecovery

Detailed justification:
testMasterZKSessionRecoveryFailure says:
{noformat}
  /**
   * Negative test of master recovery from zk session expiry.
   *
   * Starts with one master. Fakes the master zk session expired.
   * Ensures the master cannot recover the expired zk session since
   * the master zk node is still there.
   */
  public void testMasterZKSessionRecoveryFailure() throws Exception {
    MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
    HMaster m = cluster.getMaster();
    m.abort("Test recovery from zk session expired",
      new KeeperException.SessionExpiredException());
    assertTrue(m.isStopped());
  }
{noformat}
This test works, i.e. the assertion is always verified. But do we really want this behavior? When looking at the code, we see that what's happening is strange:
 - HMaster#abort calls HMaster#abortNow. If HMaster#abortNow returns false, HMaster#abort stops the master.
 - HMaster#abortNow checks the exception type. As it's a SessionExpiredException, it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stop the master).
 - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then tries to become the active master. If it cannot, it will return false (and that will make HMaster#abort stop the master).
 - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it from becoming the active master.
 - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same host and port, it deletes the node, which starts a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and returns true. This result is ignored by the first blockUntilBecomingActiveMaster: it returns false (even if we actually became the active master), hence the whole call chain returns false and HMaster#abort stops the master.

In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return true, so we would not stop the master, so the test would fail.
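To make the ignored-result problem concrete, here is a minimal sketch of the two shapes of the function described above. It is not the HBase code: `MasterNodeSketch`, its methods, and the in-memory `AtomicReference` standing in for the master znode are all hypothetical, and the real function blocks and waits when another master is active, which this sketch simply reports as false.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the master znode handling; not the HBase API. */
class MasterNodeSketch {
    // Address currently registered as active master, or null when the node is absent.
    private final AtomicReference<String> znode = new AtomicReference<>();

    MasterNodeSketch(String initialOwner) {
        znode.set(initialOwner);
    }

    /** Old shape: the recursive call succeeds, but its result is dropped. */
    boolean blockUntilActiveRecursive(String self) {
        if (znode.compareAndSet(null, self)) {
            return true; // node was free, we are now the active master
        }
        if (self.equals(znode.get())) {
            znode.set(null);                  // delete our own stale node
            blockUntilActiveRecursive(self);  // becomes active, result ignored
        }
        return false; // reported as a failure even though we may be active
    }

    /** New shape: a loop that returns the final outcome. */
    boolean blockUntilActiveLoop(String self) {
        while (true) {
            if (znode.compareAndSet(null, self)) {
                return true;
            }
            if (self.equals(znode.get())) {
                znode.set(null); // delete our own stale node, then retry
                continue;
            }
            return false; // another master holds the node (the real code would wait)
        }
    }

    String owner() {
        return znode.get();
    }

    public static void main(String[] args) {
        String self = "host:60000"; // illustrative address
        MasterNodeSketch recursive = new MasterNodeSketch(self);
        System.out.println("recursive returns " + recursive.blockUntilActiveRecursive(self)
                + ", node owner is " + recursive.owner());
        MasterNodeSketch loop = new MasterNodeSketch(self);
        System.out.println("loop returns " + loop.blockUntilActiveLoop(self)
                + ", node owner is " + loop.owner());
    }
}
```

In the recursive variant the node ends up owned by the caller while the caller is told it failed, which is the inconsistency that makes HMaster#abort stop a master that actually recovered.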
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228587#comment-13228587 ]

nkeywal commented on HBASE-5399:

@Stack: I think it fails 20% of the time. I run it alone, i.e. with -Dtest=TestAtomicOperation#testMultiRowMutationMultiThreads, with nothing else running on the machine, and a mvn clean. No clue as to when it started to happen.
@Lars: I'm not sure I haven't seen failures on testRowMutationMultiThreads as well; I will launch a few tests to see if it happens.

Cut the link between the client and the zookeeper ensemble
--
 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228662#comment-13228662 ]

nkeywal commented on HBASE-5399:

@Lars, Stack: After 50 tries, on trunk (fbd4bebd5cca129f49e91ec9936f604998a7025a) + 5572, I got it:
testRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation): expected:0 but was:5
 at org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testRowMutationMultiThreads(TestAtomicOperation.java:331)
So the probability of failure for testRowMutationMultiThreads is 10 times lower than for testMultiRowMutationMultiThreads, but it can occur as well.

Cut the link between the client and the zookeeper ensemble
--
 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
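The arithmetic behind the "10 times" remark can be checked with a few lines. The sketch below uses only the figures quoted in this thread (about 20% for testMultiRowMutationMultiThreads, one failure in 50 tries for testRowMutationMultiThreads); the class and method names are made up for illustration, runs are assumed independent, and a single observed failure is of course a very rough estimate.

```java
/** Back-of-the-envelope numbers for flaky tests; names are illustrative only. */
class FlakyTestOdds {

    /** Observed failure rate: failures divided by number of runs. */
    static double observedRate(int failures, int runs) {
        return (double) failures / runs;
    }

    /** Probability of seeing at least one failure in `runs` independent runs. */
    static double atLeastOneFailure(double failureRate, int runs) {
        return 1.0 - Math.pow(1.0 - failureRate, runs);
    }

    public static void main(String[] args) {
        double multiRow = 0.20;                  // "fails 20% of the time"
        double singleRow = observedRate(1, 50);  // one failure after 50 tries

        System.out.println("single-row estimate: " + singleRow);
        System.out.println("ratio multi/single:  " + multiRow / singleRow);
        // How likely a 50-run campaign is to hit the failure at least once.
        System.out.println("P(>=1 failure in 50 runs, single-row): "
                + atLeastOneFailure(singleRow, 50));
        System.out.println("P(>=1 failure in 50 runs, multi-row):  "
                + atLeastOneFailure(multiRow, 50));
    }
}
```

The second helper also shows why a rare failure needs many iterations to reproduce: at a low per-run rate, a short campaign can easily pass every time.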
[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master
[ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228689#comment-13228689 ]

nkeywal commented on HBASE-5572:

For an unknown reason, the first two patches didn't make it to hadoop-qa. Rewriting once again.

KeeperException.SessionExpiredException management could be improved in Master
--
 Key: HBASE-5572
 URL: https://issues.apache.org/jira/browse/HBASE-5572
 Project: HBase
 Issue Type: Improvement
 Components: master
 Affects Versions: 0.96.0
 Reporter: nkeywal
 Assignee: nkeywal
 Priority: Minor
 Fix For: 0.96.0
 Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228816#comment-13228816 ]

nkeywal commented on HBASE-5399:

It's ok, we're all in the same boat :-)
I've got the test running on a two-week-old version of the trunk; I will have the result tomorrow.

On Wed, Mar 14, 2012 at 12:12 AM, Lars Hofhansl (Commented) (JIRA)

Cut the link between the client and the zookeeper ensemble
--
 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227455#comment-13227455 ] nkeywal commented on HBASE-5399: OK, even an empty patch gets these errors from hadoop-qa, so v42 can be committed, IMHO.
Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch, nochange.patch
The link is often considered an issue, for various reasons. One of them is that there is a limit on the number of connections that ZK can manage. Stack was also suggesting removing the link to the master from HConnection. There are choices to be made considering the existing API (which we don't want to break). The first patches I will submit on hadoop-qa should not be committed: they are here to show the progress on the direction taken.
ZooKeeper is used for:
- public getter, to let the client do whatever it wants, and close ZooKeeper when closing the connection => we have to deprecate this but keep it.
- read the master address to create a master => now done with a temporary zookeeper connection
- read root location => now done with a temporary zookeeper connection, but questionable. Used in the public function locateRegion. To be reworked.
- read cluster id => now done once with a temporary zookeeper connection.
- check if base node is available => now done once with a zookeeper connection given as a parameter
- isTableDisabled/isTableAvailable => public functions, now done with a temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
- getCurrentNrHRS(): public function to get the number of region servers and create a thread pool => now done with a temporary zookeeper connection
Master is used for:
- getMaster public getter, as for ZooKeeper => we have to deprecate this but keep it.
- isMasterRunning(): public function, used internally by HMerge and HBaseAdmin
- getHTableDescriptor*: public functions offering access to the master. => we could make them use a temporary master connection as well.
Main points are:
- hbase class for ZooKeeper: ZooKeeperWatcher is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be much slower, because it needs a TCP connection, and establishing a TCP connection is slow.
- having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client.
- if we move the table descriptor part away from the client, we need to find a new place for it.
- we will have the same issue with HBaseAdmin (for both ZK and Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic, however.
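The "temporary zookeeper connection" idea in the description above can be sketched roughly as follows. This is only an illustration of the open/read/close pattern, not the code from the patch: TempConnection, readOnce, OPEN_CONNECTIONS and the node path and value used in main are all hypothetical names, and a plain in-memory map stands in for the ZooKeeper ensemble.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: every name here is a hypothetical stand-in,
// not the HBase or ZooKeeper API.
final class TemporaryConnectionSketch {

    /** Number of connections currently open against the fake ensemble. */
    static final AtomicInteger OPEN_CONNECTIONS = new AtomicInteger();

    /** Hypothetical short-lived connection; the map plays the role of the znode tree. */
    static final class TempConnection implements AutoCloseable {
        private final Map<String, String> nodes;

        TempConnection(Map<String, String> nodes) {
            this.nodes = nodes;
            OPEN_CONNECTIONS.incrementAndGet(); // stands in for the TCP connect
        }

        String read(String path) {
            return nodes.get(path);
        }

        @Override
        public void close() {
            OPEN_CONNECTIONS.decrementAndGet(); // released as soon as the read is done
        }
    }

    /** Opens a connection, reads a single value, and closes the connection again. */
    static String readOnce(Map<String, String> nodes, String path) {
        try (TempConnection conn = new TempConnection(nodes)) {
            return conn.read(path);
        }
    }

    public static void main(String[] args) {
        // Hypothetical node path and value, for illustration only.
        Map<String, String> nodes = Map.of("/example/master", "host1");
        System.out.println(readOnce(nodes, "/example/master"));
        System.out.println("open connections: " + OPEN_CONNECTIONS.get());
    }
}
```

The trade-off is the one the description points out: the client holds no long-lived link to the ensemble, but each call pays the cost of establishing a connection.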
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226903#comment-13226903 ] nkeywal commented on HBASE-5399: I don't know. The errors are the same, but I can't reproduce them locally. Let's retry; if we get them a third time, I will look at them tomorrow or Monday. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226195#comment-13226195 ] nkeywal commented on HBASE-5399: They're all different from the ones I got locally. It could be pure test flakiness. Let's retry. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5551) Some functions should not be used by customer code and must be deprecated in 0.94
[ https://issues.apache.org/jira/browse/HBASE-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226204#comment-13226204 ] nkeywal commented on HBASE-5551: The patch contains only comment and meta-info modifications. Done on the 0.94 branch. Can be committed. Some functions should not be used by customer code and must be deprecated in 0.94 - Key: HBASE-5551 URL: https://issues.apache.org/jira/browse/HBASE-5551 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: nkeywal Assignee: nkeywal Fix For: 0.94.0 Attachments: 5551.092.patch They are: HBaseAdmin#getMaster, HConnection#getZooKeeperWatcher, HConnection#getMaster
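A change limited to comments and meta info, as described above, usually amounts to adding the @Deprecated annotation and a javadoc @deprecated tag while leaving the signatures untouched. The sketch below assumes exactly that and is not taken from the patch: the nested Connection interface is a hypothetical stand-in for HConnection, its return types are simplified to Object, and isDeprecated is a helper added only to make the effect observable.

```java
import java.lang.reflect.Method;

// Sketch only: Connection is a hypothetical stand-in for HConnection,
// with return types simplified to Object.
final class DeprecationSketch {

    interface Connection {
        /** @deprecated kept for compatibility; client code should not depend on it. */
        @Deprecated
        Object getMaster();

        /** @deprecated kept for compatibility; client code should not depend on it. */
        @Deprecated
        Object getZooKeeperWatcher();

        /** Not deprecated: shown for contrast. */
        boolean isMasterRunning();
    }

    /** Returns true if the named method of Connection carries @Deprecated. */
    static boolean isDeprecated(String methodName) {
        for (Method m : Connection.class.getMethods()) {
            if (m.getName().equals(methodName)) {
                return m.isAnnotationPresent(Deprecated.class);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("getMaster deprecated: " + isDeprecated("getMaster"));
        System.out.println("isMasterRunning deprecated: " + isDeprecated("isMasterRunning"));
    }
}
```

Existing callers keep compiling and only get a compiler warning, which gives them time to move away from these methods before a later release changes them.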
[jira] [Commented] (HBASE-5551) Some functions should not be used by customer code and must be deprecated in 0.94
[ https://issues.apache.org/jira/browse/HBASE-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226235#comment-13226235 ] nkeywal commented on HBASE-5551: We expect that all the functions needed are already in HTable or HBaseAdmin. We can detail this in the javadoc of course (but then first in 5399 to ensure that all the info is in the trunk for 0.96). Some functions should not be used by customer code and must be deprecated in 0.94 - Key: HBASE-5551 URL: https://issues.apache.org/jira/browse/HBASE-5551 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: nkeywal Assignee: nkeywal Fix For: 0.94.0 Attachments: 5551.092.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223089#comment-13223089 ] nkeywal commented on HBASE-5399: TestRegionRebalancing: seems to be a flaky test. Will retry on Hadoop-QA, but I can't reproduce it here. TestRegionRebalancing: With the 7s sleep (i.e. the same sleep as before), I can't reproduce it. I will try to understand why this sleep changes the result, but anyway it's not a regression. So I think this patch is a good candidate for a commit. Further enhancements (clusterId, ZK watcher replacement by simple calls) could be put in another JIRA. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223657#comment-13223657 ] nkeywal commented on HBASE-5399: @all: OK, I will provide an updated patch with the comments taken into account. I also understood the issue behind TestZooKeeper. My fix currently breaks other unit tests, but if I manage to make it work I will include it. For 0.94 vs. 0.96, OK as well; however, it would make sense to deprecate some methods in 0.94, especially getMaster and getZooKeeper in HConnection: people will get more time to react this way. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223664#comment-13223664 ] nkeywal commented on HBASE-5399: I can do a patch with only the deprecated stuff, or it can be added in a raw commit, as you like. On Tue, Mar 6, 2012 at 10:03 PM, stack (Commented) (JIRA) Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399.v40.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222436#comment-13222436 ] nkeywal commented on HBASE-5399: org.apache.hadoop.hbase.TestZooKeeper is surprising, because:
- if we add a 7s sleep beforehand in testMasterSessionExpired(), then it's much more difficult to reproduce.
- in RecoverableZooKeeper, there is no test for SESSIONEXPIRED: if it happens there is no retry.
So I tend to think it's an existing issue, even if I need to understand how it's supposed to work when there is a session timeout. I tried to add it but it does not work.
Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch
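The behavior described in the comment above, where some failures are retried but a session expiry is not, can be illustrated with a small retry loop. This is a sketch under assumptions and does not reproduce RecoverableZooKeeper: the Failure enum, StoreException, Operation and callWithRetries are hypothetical names, and the back-off sleep a real retry loop would normally include is left out.

```java
// Sketch only: hypothetical types, not the RecoverableZooKeeper implementation.
final class RetrySketch {

    /** Hypothetical failure kinds; only CONNECTION_LOSS is treated as retryable here. */
    enum Failure { CONNECTION_LOSS, SESSION_EXPIRED }

    static final class StoreException extends Exception {
        final Failure failure;

        StoreException(Failure failure) {
            super(failure.name());
            this.failure = failure;
        }
    }

    interface Operation<T> {
        T run() throws StoreException;
    }

    /** Runs op, retrying up to maxRetries times on retryable failures only. */
    static <T> T callWithRetries(Operation<T> op, int maxRetries) throws StoreException {
        int retries = 0;
        while (true) {
            try {
                return op.run();
            } catch (StoreException e) {
                boolean retryable = e.failure == Failure.CONNECTION_LOSS;
                if (!retryable || retries >= maxRetries) {
                    throw e; // a session expiry is rethrown on the first attempt
                }
                retries++; // a real loop would sleep or back off here
            }
        }
    }

    /** Counts how often an operation that always fails with the given failure is run. */
    static int attemptsWhenAlwaysFailing(Failure failure, int maxRetries) {
        int[] attempts = {0};
        try {
            RetrySketch.<Void>callWithRetries(() -> {
                attempts[0]++;
                throw new StoreException(failure);
            }, maxRetries);
        } catch (StoreException expected) {
            // the operation never succeeds, so the exception always propagates
        }
        return attempts[0];
    }

    public static void main(String[] args) {
        System.out.println(attemptsWhenAlwaysFailing(Failure.CONNECTION_LOSS, 3));
        System.out.println(attemptsWhenAlwaysFailing(Failure.SESSION_EXPIRED, 3));
    }
}
```

With this structure, making session expiry retryable is not just a matter of adding it to the retryable set: an expired session generally needs a new connection before the operation can succeed, which may be why simply adding the case did not work.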
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220964#comment-13220964 ] nkeywal commented on HBASE-5399: @stack. A lot of variance, but not that much. So I know I broke something somewhere. I fixed a synchronization issue in v23 (plus the points mentioned in your review). Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220156#comment-13220156 ] nkeywal commented on HBASE-5399: The patch is still in progress. Submitted to the build to see how it goes in hadoop-qa. Cut the link between the client and the zookeeper ensemble -- Key: HBASE-5399 URL: https://issues.apache.org/jira/browse/HBASE-5399 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220171#comment-13220171 ] nkeywal commented on HBASE-5399:

@stack: done on https://reviews.apache.org/r/3967/

Cut the link between the client and the zookeeper ensemble
Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220317#comment-13220317 ] nkeywal commented on HBASE-5399:

fyi, here are the results on 4 runs.

HADOOP-QA 1
org.apache.hadoop.hbase.client.TestAdmin
org.apache.hadoop.hbase.client.TestScannerTimeout
org.apache.hadoop.hbase.replication.TestReplicationPeer
org.apache.hadoop.hbase.util.TestMergeTool
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
org.apache.hadoop.hbase.master.TestSplitLogManager

Local 1
org.apache.hadoop.hbase.regionserver.TestAtomicOperation
org.apache.hadoop.hbase.regionserver.wal.TestHLog
org.apache.hadoop.hbase.master.TestDistributedLogSplitting
org.apache.hadoop.hbase.replication.TestReplication
org.apache.hadoop.hbase.util.TestMergeTool
org.apache.hadoop.hbase.client.TestAdmin
org.apache.hadoop.hbase.client.TestScannerTimeout

Local 2
org.apache.hadoop.hbase.regionserver.TestSplitLogWorker
org.apache.hadoop.hbase.client.TestScannerTimeout
org.apache.hadoop.hbase.util.TestMergeTool

Local 3
org.apache.hadoop.hbase.coprocessor.TestClassLoading
org.apache.hadoop.hbase.master.TestSplitLogManager
org.apache.hadoop.hbase.util.TestMergeTool
org.apache.hadoop.hbase.coprocessor.TestAggregateProtocol
org.apache.hadoop.hbase.io.encoding.TestChangingEncoding
org.apache.hadoop.hbase.TestZooKeeper
org.apache.hadoop.hbase.client.TestScannerTimeout
org.apache.hadoop.hbase.replication.TestReplication

Cut the link between the client and the zookeeper ensemble
Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219681#comment-13219681 ] nkeywal commented on HBASE-5399:

v16 (still in progress: some unit tests fail, indentation and comments to redo, and so on), after a discussion with Stack.

HConnection is a connection to the cluster. However, the fact that the cluster is composed of master, zookeeper and region servers should be hidden from the client: some functions need a master, some others do not. This is not the client's problem. In particular, these functions can move (for example, getting a table descriptor, today a master function, could become a region server function). So getMaster and getZookeeperWatcher, shared or not, should not be in the HConnection interface.

Client functions are today split between two classes, HBaseAdmin and HConnection (with stuff like processBatch, listTables, getHTableDescriptor). It could make sense to split HConnection further, but anyway we already have two classes using the master, and the master connection should remain shared between these two classes. This should be handled by HConnection, as it is its core responsibility and it is also much simpler technically.

So we need package-visible functions to get them for HBaseAdmin. I prefer to have them in HConnectionImplementation only, even if it implies a cast in HBaseAdmin, as this keeps HConnection clean from a client point of view.

We stick to the keep-alive mechanism with the Closeable, and hence a dynamic proxy for master and a subclass for ZooKeeperMaster.

Note that if master and zookeeper are implementation details, the same should apply to HRegionInterface (HConnection#getHRegionConnection). But there are dependencies from the master package, so it cannot be included in this jira. The keep-alive mechanism could be applied to HRegionInterface as well.

Cut the link between the client and the zookeeper ensemble
Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
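The keep-alive idea mentioned in this comment (a shared connection that is reference counted and only closed after it has stayed unused for a while) could look roughly like the sketch below. This is not the code from the patch: `Resource`, `KeepAlive`, `closeIfIdle` and the injected clock are hypothetical names chosen for illustration, and the real implementation may differ.

```java
import java.util.function.LongSupplier;

final class KeepAliveSketch {
  /** Hypothetical expensive resource, e.g. a ZooKeeper or master connection. */
  public interface Resource {
    void open();
    void shutdown();
  }

  /** Reference-counted holder that delays the real close (assumed design, not the patch). */
  public static final class KeepAlive {
    private final Resource resource;
    private final long keepAliveMillis;
    private final LongSupplier clock; // injected so the behavior can be checked without sleeping
    private int users = 0;
    private boolean open = false;
    private long idleSince = -1;

    public KeepAlive(Resource resource, long keepAliveMillis, LongSupplier clock) {
      this.resource = resource;
      this.keepAliveMillis = keepAliveMillis;
      this.clock = clock;
    }

    /** Opens the resource on first use and counts the caller as a user. */
    public synchronized Resource acquire() {
      if (!open) {
        resource.open();
        open = true;
      }
      users++;
      idleSince = -1;
      return resource;
    }

    /** Drops one user; the resource stays open and only starts its idle period. */
    public synchronized void release() {
      if (users == 0) {
        throw new IllegalStateException("release without acquire");
      }
      users--;
      if (users == 0) {
        idleSince = clock.getAsLong();
      }
    }

    /** Meant to be called periodically; really closes only after the keep-alive delay. */
    public synchronized boolean closeIfIdle() {
      if (open && users == 0 && clock.getAsLong() - idleSince >= keepAliveMillis) {
        resource.shutdown();
        open = false;
        return true;
      }
      return false;
    }
  }
}
```

A client that acquires and releases several times in a row pays the connection cost once, which is the trade-off discussed in the issue description: fewer long-lived connections without reconnecting on every call.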
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218257#comment-13218257 ] nkeywal commented on HBASE-5399:

If we really want it, I found 3 options, and tried 2.

1) Adding 'close' to the HMasterInterface
After looking at it, I don't think it's a good option: HMasterInterface is an interface shared between the client and the server. So adding a close function to it would mean the server must implement it, while it's a client-side function only. I believe that's the reason why there is already a function 'stopProxy' in the RpcEngine instead of a close function.

2) Adding the possibility to have a delayed close in RpcEngine
Instead of doing it for HMasterInterface in Connection only, we could do it for all proxies and code this in RpcEngine. There is already reference counting in the hbase RpcEngine, so we could add some code there to allow a delayed close. I don't see why it would not be possible: all the code seems to be in the hbase package (and not hadoop). This would require a smart convention to make it configurable on a per-proxy basis, but it should work.

3) Adding a class with a delegation
So I've got this:
{noformat}
public interface SharedMaster extends HMasterInterface, Closeable {}
{noformat}
With this in HConnection:
{noformat}
public interface HConnection extends Abortable, Closeable {
  public SharedMaster getSharedMaster();
}
{noformat}
Then the client writes:
{noformat}
SharedMaster master = connection.getSharedMaster();
try {
  master.move(encodedRegionName, destServerName);
} finally {
  master.close();
}
{noformat}
With a java proxy to manage the delegation for us:
{noformat}
private static class SharedMasterHandler implements InvocationHandler {
  private HConnectionImplementation connection;
  private HMasterInterface master;

  SharedMasterHandler(HConnectionImplementation connection, HMasterInterface master) {
    this.connection = connection;
    this.master = master;
  }

  @Override
  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    // close() is handled locally: it only releases the shared master.
    if (method.getName().equals("close")) {
      connection.releaseSharedMaster((HMasterInterface) master);
      return null;
    } else {
      // Everything else is delegated to the real master proxy.
      return method.invoke(master, args);
    }
  }
}
{noformat}
releaseSharedMaster is private in this solution.

It was not really my first idea, but it's a reasonable way to get to the objective. The reflective delegation is not fast, obviously, but it doesn't matter here as there is a much more expensive remote call just after... I am currently testing it; it seems to work.

Cut the link between the client and the zookeeper ensemble
Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
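For readers who want to try the delegation idea from option 3 outside of HBase, here is a minimal self-contained sketch. `MasterApi` and `Connection` are hypothetical stand-ins for HMasterInterface and HConnectionImplementation, and the user counter is an assumption about what releasing a shared master does. Unlike the snippet in the comment, this version unwraps InvocationTargetException so callers see the original exception, and makes a second close() a no-op.

```java
import java.io.Closeable;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicBoolean;

final class SharedMasterSketch {
  /** Hypothetical stand-in for HMasterInterface (the real interface has many more methods). */
  public interface MasterApi {
    String move(String region, String destServer);
  }

  /** Client-side view: the remote API plus a purely local close(). */
  public interface SharedMaster extends MasterApi, Closeable {
    @Override
    void close(); // narrowed: no IOException, nothing remote happens here
  }

  /** Hypothetical stand-in for HConnectionImplementation. */
  public static final class Connection {
    private final MasterApi master;
    private int userCount = 0;

    public Connection(MasterApi master) {
      this.master = master;
    }

    /** Hands out a proxy that delegates to the shared master and counts the user. */
    public synchronized SharedMaster getSharedMaster() {
      userCount++;
      return (SharedMaster) Proxy.newProxyInstance(
          SharedMaster.class.getClassLoader(),
          new Class<?>[] {SharedMaster.class},
          new Handler(this, master));
    }

    private synchronized void releaseSharedMaster() {
      userCount--;
    }

    public synchronized int userCount() {
      return userCount;
    }
  }

  private static final class Handler implements InvocationHandler {
    private final Connection connection;
    private final MasterApi master;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    Handler(Connection connection, MasterApi master) {
      this.connection = connection;
      this.master = master;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
      if (method.getName().equals("close") && method.getParameterCount() == 0) {
        // Local call only: release once, ignore repeated close() calls.
        if (closed.compareAndSet(false, true)) {
          connection.releaseSharedMaster();
        }
        return null;
      }
      try {
        return method.invoke(master, args);
      } catch (InvocationTargetException e) {
        throw e.getCause(); // rethrow what the delegate actually threw
      }
    }
  }

  public static void main(String[] args) {
    Connection connection = new Connection((region, dest) -> region + " moved to " + dest);
    SharedMaster master = connection.getSharedMaster();
    try {
      System.out.println(master.move("region-a", "server-2"));
    } finally {
      master.close();
    }
    System.out.println("users after close: " + connection.userCount());
  }
}
```

Because `SharedMaster` extends Closeable, the try/finally in `main` could also be written as a try-with-resources statement on Java 7 and later.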
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217177#comment-13217177 ] nkeywal commented on HBASE-5399:

bq. Do we have to have the getSharedZookeeperWatcher and releaseSharedZookeeperWatcher and getSharedMaster, etc., in the HConnection API? Are these not implementation details? (Or would it be too hard to undo them – you'd have no way of counting zk and master connections?)

Yes, we need the user to explicitly release the connection, as we can't hide that we are managing the object life cycle. I would prefer to use close for this, but I didn't find an easy way to extend the master proxy to make it closeable. So I preferred to use release for both objects to make it consistent.

Cut the link between the client and the zookeeper ensemble
Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5075) regionserver crashed and failover
[ https://issues.apache.org/jira/browse/HBASE-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217183#comment-13217183 ] nkeywal commented on HBASE-5075:

In the case that you want to manage (a region server crash without any hardware issue, i.e. a pure application bug), another possible solution is a loop in the launch script. This typically allows a 20s failover (stop time + start time), and is even compatible with hot failover. It's faster and less error-prone than monitoring the pid. But for HBase it would be a new start mode. It could make sense if we observe many more application bugs than hardware issues (this solution is quite common with C/C++ software, as it's easy to crash a process with these languages...).

regionserver crashed and failover
Key: HBASE-5075
URL: https://issues.apache.org/jira/browse/HBASE-5075
Project: HBase
Issue Type: Improvement
Components: monitoring, regionserver, replication, zookeeper
Affects Versions: 0.92.1
Reporter: zhiyuan.dai
Fix For: 0.90.5
Attachments: Degion of Failure Detection.pdf, HBase-5075-shell.patch, HBase-5075-src.patch

When a regionserver crashes, it takes too long to notify the HMaster, and once the HMaster knows about the regionserver's shutdown, it takes a long time to fetch the hlog's lease. HBase is an online db, so availability is very important. I have an idea to improve availability: a monitor node checks the regionserver's pid. If this pid does not exist, I consider the RS down, delete the znode, and force-close the hlog file. So the period may be 100ms.
[jira] [Commented] (HBASE-4955) Use the official versions of surefire & junit
[ https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217200#comment-13217200 ] nkeywal commented on HBASE-4955:

Small update:
- No news from the JUnit team. The issue was raised in November and they classified it as a bug, but my one-line fix is not yet integrated. The main committers don't have a lot of time to spend on the project...
- Surefire 2.12 is available, but there is a regression (SUREFIRE-827) that they are currently fixing, so it seems better to wait for 2.13.

Use the official versions of surefire & junit
Key: HBASE-4955
URL: https://issues.apache.org/jira/browse/HBASE-4955
Project: HBase
Issue Type: Improvement
Components: test
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

We currently use private versions of Surefire and JUnit since HBASE-4763. This JIRA tracks what we need in order to move to official versions. Surefire 2.11 is just out but, after some tests, it does not contain all that we need.

JUnit: could be for JUnit 4.11. Issue to monitor:
https://github.com/KentBeck/junit/issues/359: fixed in our version, no feedback for an integration on trunk

Surefire: could be for Surefire 2.12. Issues to monitor are:
329 (category support): fixed, we use the official implementation from the trunk
786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk
791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk
793 (incorrect time in the XML report): not fixed (reopened) on trunk, fixed in our version
760 (does not take into account the test method): fixed in trunk, not fixed in our version
798 (print immediately the test class name): not fixed in trunk, not fixed in our version
799 (allow test parallelization when forkMode=always): not fixed in trunk, not fixed in our version
800 (redirectTestOutputToFile not taken into account): not yet fixed on trunk, fixed in our version

800 and 793 are the most important to monitor: they are the only ones that are fixed in our version but not on trunk.
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217305#comment-13217305 ] nkeywal commented on HBASE-5399:

yes :-)

The user would code something like:
{noformat}
HMasterInterface master = connection.getSharedMaster();
try {
  master.move(encodedRegionName, destServerName);
} finally {
  master.close();
}
{noformat}
HMasterInterface is an interface, with a proxy sending the calls to the master server. I would need to add a close that would not be a remote call but would decrement a reference counter. I could add another proxy object, but it's not very clean (it should work however, maybe it's an option). If I don't want to do that, I need to add the method in the object returned by getProxy.

You think it makes sense?

Cut the link between the client and the zookeeper ensemble
Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211422#comment-13211422 ] nkeywal commented on HBASE-5399: In v9 I added some comments and fixed some issues. bq. I'd say lets not check master is there till we need it. Seems like a PITA going ahead and checking master on construction. This changes the behavior but I think its one thats ok to change. Agreed, removed in v9. bq. You are doing your own Callables. You've seen the Callables that go on in HTable. Any reason you avoid them? I suppose this is different in that you want to let go of the shared master. Looks fine. I could also add the master management in the HTable callables. I would still need the one I wrote for HBaseAdmin, but it could be useful to be able to use master from the HTable callables? I still need to work on the unit tests, some large tests are failing today: {noformat} Failed tests: testBackgroundEvictionThread[1](org.apache.hadoop.hbase.io.hfile .TestLruBlockCache) testShutdownFixupWhenDaughterHasSplit(org.apache.hadoop.hbase.regionserver.Tes tSplitTransactionOnCluster): expected:1 but was:0 queueFailover(org.apache.hadoop.hbase.replication.TestReplication): Waited too much time for queueFailover replication testMergeTool(org.apache.hadoop.hbase.util.TestMergeTool): 'merging regions 0 and 1' failed with errCode -1 Tests in error: test3686a(org.apache.hadoop.hbase.client.TestScannerTimeout): 64142ms passed s ince the last invocation, timeout is currently set to 1 {noformat} For some of them it could be random and unrelated, but I am sure there are some real issues for of them. I am on vacations next week; so I will come back to it the week after... 
Cut the link between the client and the zookeeper ensemble
----------------------------------------------------------

Key: HBASE-5399
URL: https://issues.apache.org/jira/browse/HBASE-5399
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch

The link is often considered an issue, for various reasons. One of them is that there is a limit on the number of connections that ZK can manage. Stack was suggesting as well to remove the link to master from HConnection.
There are choices to be made considering the existing API (that we don't want to break).
The first patches I will submit on hadoop-qa should not be committed: they are here to show the progress on the direction taken.

ZooKeeper is used for:
- public getter, to let the client do whatever he wants, and close ZooKeeper when closing the connection = we have to deprecate this but keep it.
- read get master address to create a master = now done with a temporary zookeeper connection
- read root location = now done with a temporary zookeeper connection, but questionable. Used in public function locateRegion. To be reworked.
- read cluster id = now done once with a temporary zookeeper connection.
- check if base done is available = now done once with a zookeeper connection given as a parameter
- isTableDisabled/isTableAvailable = public functions, now done with a temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
- getCurrentNrHRS(): public function to get the number of region servers and create a pool of threads = now done with a temporary zookeeper connection

Master is used for:
- getMaster public getter, as for ZooKeeper = we have to deprecate this but keep it.
- isMasterRunning(): public function, used internally by HMerge & HBaseAdmin
- getHTableDescriptor*: public functions offering access to the master. = we could make them use a temporary master connection as well.

Main points are:
- hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a strongly coupled architecture ;-). This can be changed, but requires a lot of modifications in these classes (likely adding a class in the middle of the hierarchy, something like that). Anyway, a non-connected client will always be noticeably slower, because it needs a tcp connection, and establishing a tcp connection is slow.
- having a link between ZK and all the clients seems to make sense for some use cases. However, it won't scale if a TCP connection is required for every client.
- if we move the table descriptor part away from the client, we need to find a new place for it.
- we will have the same issue in HBaseAdmin (for both ZK & Master); maybe we can put a timeout on the connection. That would make the whole system less deterministic however.
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211435#comment-13211435 ]

nkeywal commented on HBASE-5399:

Moreover, the v9 patch works on trunk as of today.
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211703#comment-13211703 ]

nkeywal commented on HBASE-5399:

@Lars and all: done see https://reviews.apache.org/r/3967/
I sent it to the whole hbase group, I hope it's the right thing to do. I'm on vacation this week so I will see the comments after the 27th...
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210561#comment-13210561 ]

nkeywal commented on HBASE-5399:

It's not the last version (it needs more comments, unit tests and likely bug fixes), but there is already a lot.

Master & ZooKeeper connections are now created only when necessary, and are closed if not used for 5 minutes.

I added the keep alive stuff. It's not a nice-to-have; without it the unit tests take twice as long.

There is an issue with the masterCheck part; the previous behavior was strange. I need to review it in detail.

The patch is against Monday's trunk. I will make it compatible with current trunk this weekend.

I will move isTableEnabled and so on to another patch, this one is already too big...

Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209565#comment-13209565 ]

nkeywal commented on HBASE-5399:

After some thought and several attempts, I believe it makes sense to have a shared master connection and a shared zookeeper connection with a reference counter. The creation/release/close functions would be in HConnection. We could add the refCounter in ZooKeeperWatcher, but it's more difficult to add in the proxy (I could do it however).

Ideally, there would be a keep alive, i.e. we would not close the physical connection as soon as the counter becomes zero, but wait a minute or so. I'm just not sure how to cleanly distinguish a simple release from a jvm shutdown.

The patch is underway; I think the complete version for review will be ready tomorrow. However, I have not found a new place for the table descriptor part; today it's still in the connection.

Attachments: 5399_inprogress.patch
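The reference counter plus keep-alive idea from the comment above can be sketched as follows. This is only an illustration under stated assumptions, not the code of the patch: the class `KeepAliveShared`, its methods and the `epoch` field are hypothetical names, and the explicit `shutdown()` is just one possible answer to the "release vs. jvm shutdown" question raised in the comment.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper (not an HBase class): shared resource with refcount and keep-alive. */
final class KeepAliveShared<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private final long keepAliveMs;
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "keep-alive-closer");
            t.setDaemon(true); // the closer thread must not keep the JVM alive
            return t;
        });
    private T resource;  // null while no physical connection exists
    private int refCount;
    private long epoch;  // bumped on every acquire; invalidates older close tasks

    KeepAliveShared(Supplier<T> factory, long keepAliveMs) {
        this.factory = factory;
        this.keepAliveMs = keepAliveMs;
    }

    /** Returns the shared resource, creating it only when necessary. */
    synchronized T acquire() {
        epoch++;
        if (resource == null) {
            resource = factory.get();
        }
        refCount++;
        return resource;
    }

    /** Drops one reference; at zero the close is delayed by the keep-alive. */
    synchronized void release() {
        if (refCount == 0) {
            throw new IllegalStateException("release() without acquire()");
        }
        if (--refCount == 0) {
            final long seen = epoch;
            timer.schedule(() -> closeIfStillIdle(seen), keepAliveMs, TimeUnit.MILLISECONDS);
        }
    }

    private synchronized void closeIfStillIdle(long seen) {
        // Close only if nobody acquired the resource since this task was scheduled.
        if (refCount == 0 && seen == epoch) {
            closeNow();
        }
    }

    /** Explicit shutdown path (an assumption): closes immediately, ignoring the keep-alive. */
    synchronized void shutdown() {
        timer.shutdownNow();
        closeNow();
    }

    synchronized boolean isOpen() {
        return resource != null;
    }

    private void closeNow() {
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (Exception e) {
            // ignored in this sketch
        }
        resource = null;
    }

    public static void main(String[] args) {
        Supplier<AutoCloseable> factory = () -> () -> System.out.println("closed");
        KeepAliveShared<AutoCloseable> shared = new KeepAliveShared<>(factory, 60_000L);
        AutoCloseable first = shared.acquire();
        shared.release();                     // counter is zero, connection is kept
        AutoCloseable second = shared.acquire();
        System.out.println(first == second);  // prints true: same resource reused
        shared.release();
        shared.shutdown();
    }
}
```

The keep-alive trades a longer-lived idle connection for fewer reconnects, which matches the observation elsewhere in this thread that the unit tests are much slower without it.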
[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble
[ https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208051#comment-13208051 ]

nkeywal commented on HBASE-5399:

bq. Why are these deprecated: getHRegionConnection(HServerAddress regionServer)

It's because HServerAddress is deprecated. I will check that all the redirections are there.

bq. Do we get the clusterid on connection setup? Do we have to? Can we just get that when someone asks for it?

Today yes, but we also manage the case when there is no id.

bq. Should we be using straight ZooKeeper instead of ZooKeeperWatcher? We don't need watch facility?

We can do that, but it requires some work to do it well: ZooKeeperWatcher has spread too much and has a lot of responsibilities (for example, it's the owner of the znode names, created from the config parameters). It would be much cleaner, and a little bit faster. We would still pay for the tcp connection however.

bq. isTableEnabled, etc., should they be deprecated, moved out of HConnection

I was thinking of putting them in HBaseAdmin, does that make sense?

Attachments: 5399_inprogress.patch
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183655#comment-13183655 ]

nkeywal commented on HBASE-5163:

I've got a fix using a variation of #2. Tested 100 times without any failure. The advantage of #1 for me is that it eliminates a quite tricky behavior, but the fix would then be outside HBase...

TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

Key: HBASE-5163
URL: https://issues.apache.org/jira/browse/HBASE-5163
Project: HBase
Issue Type: Bug
Components: test
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

The stack is typically:
{noformat}
<error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked." type="java.io.IOException">java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
// ...
{noformat}
It can be reproduced without parallelization or without executing the other tests in the class. It seems to fail about 5% of the time.

This comes from the naming policy for the directories in MiniDFSCluster#startDataNodes. It depends on the number of nodes *currently* in the cluster, and does not take into account previous starts/stops:
{noformat}
for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
  if (manageDfsDirs) {
    File dir1 = new File(data_dir, "data"+(2*i+1));
    File dir2 = new File(data_dir, "data"+(2*i+2));
    dir1.mkdirs();
    dir2.mkdirs();
  // [...]
{noformat}
This means that if we want to stop/start a datanode, we should always stop the last one; otherwise the names will conflict.
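To see the collision without a real cluster, the naming rule can be modeled in a few lines. This is a toy model under assumptions, not Hadoop code: `DataDirNamingModel` and its methods are hypothetical names, and a "locked" directory is simply one held by a node that is still running.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the naming rule quoted above. This is not Hadoop code. */
class DataDirNamingModel {
    /** Directories held (locked) by each running datanode, in start order. */
    private final List<String[]> running = new ArrayList<>();

    /** Same formula as the quoted loop: names depend only on the index i. */
    static String[] dirsFor(int i) {
        return new String[] {"data" + (2 * i + 1), "data" + (2 * i + 2)};
    }

    Set<String> lockedDirs() {
        Set<String> locked = new HashSet<>();
        for (String[] dirs : running) {
            locked.add(dirs[0]);
            locked.add(dirs[1]);
        }
        return locked;
    }

    /** Starts n nodes; indices begin at the number of nodes currently running. */
    void startDataNodes(int n) {
        int cur = running.size();
        for (int i = cur; i < cur + n; i++) {
            String[] dirs = dirsFor(i);
            Set<String> locked = lockedDirs();
            for (String d : dirs) {
                if (locked.contains(d)) {
                    throw new IllegalStateException(
                        "Cannot lock storage " + d + ". The directory is already locked.");
                }
            }
            running.add(dirs);
        }
    }

    void stopDataNode(int index) {
        running.remove(index); // List.remove(int): drops the node and frees its dirs
    }

    public static void main(String[] args) {
        DataDirNamingModel cluster = new DataDirNamingModel();
        cluster.startDataNodes(2);   // node 0: data1,data2; node 1: data3,data4
        cluster.stopDataNode(1);     // last node stopped
        cluster.startDataNodes(1);   // i = 1: data3,data4 are free, so this works
        cluster.stopDataNode(0);     // first node stopped; survivor holds data3,data4
        try {
            cluster.startDataNodes(1); // i = 1 again, but data3 is still locked
        } catch (IllegalStateException e) {
            System.out.println("Conflict: " + e.getMessage());
        }
    }
}
```

In this model, stopping any node other than the last one leaves the survivor holding exactly the directory names that the next start computes.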
This test exhibits the behavior:
{noformat}
@Test
public void testMiniDFSCluster_startDataNode() throws Exception {
  assertTrue( dfsCluster.getDataNodes().size() == 2 );

  // Works, as we kill the last datanode, we can now start a datanode
  dfsCluster.stopDataNode(1);
  dfsCluster
    .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);

  // Fails, as it's not the last datanode, the directory will conflict on
  // creation
  dfsCluster.stopDataNode(0);
  try {
    dfsCluster
      .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
    fail("There should be an exception because the directory already exists");
  } catch (IOException e) {
    assertTrue( e.getMessage().contains("The directory is already locked."));
    LOG.info("Expected (!) exception caught " + e.getMessage());
  }

  // Works, as we kill the last datanode, we can now restart 2 datanodes
  // This makes us back with 2 nodes
  dfsCluster.stopDataNode(0);
  dfsCluster
    .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
}
{noformat}
And then this behavior is randomly triggered in testLogRollOnDatanodeDeath because when we do
{noformat}
DatanodeInfo[] pipeline = getPipeline(log);
assertTrue(pipeline.length == fs.getDefaultReplication());
{noformat}
and then kill the datanodes in the pipeline, we will have:
- most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a new datanode that will reuse the available 2's directory.
- sometimes: pipeline = 1 & 3. In this case, when we try to launch the
[jira] [Commented] (HBASE-5064) utilize surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178161#comment-13178161 ]

nkeywal commented on HBASE-5064:

bq. Anyway we need to find out why TestMetaReaderEditor wasn't executed.

Likely because it belongs to the too long list of flaky tests. A test should not fail randomly.

Here is the list of builds, from 2574 to 2598, in which TestMetaReaderEditor timed out:
https://builds.apache.org/job/HBase-TRUNK/2597/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2588/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2585/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2584/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2582/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2577/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2574/testReport/org.apache.hadoop.hbase.catalog/

It happens on hadoop-qa as well:
https://builds.apache.org/job/PreCommit-HBASE-Build/640/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/PreCommit-HBASE-Build/638/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/PreCommit-HBASE-Build/630/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/PreCommit-HBASE-Build/622/testReport/org.apache.hadoop.hbase.catalog/

That's around 25% of the time for trunk. I can't get back before 2574.

I will spend some time on it to see if there is anything simple to do (you never know). If I find something I will create a separate jira.
utilize surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) utilize surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178183#comment-13178183 ]

nkeywal commented on HBASE-5064:

I found some issues in TestMetaReaderEditor, explaining at least why it times out instead of failing (I don't know why it fails sometimes, still looking).

For 2588, when the process times out (i.e. forkedProcessTimeoutInSeconds is reached), there is no entry in the jenkins report nor in the console. The only way to find out is to notice that the test is missing, and you can see that TestMetaReaderEditor is actually missing from the list...
[jira] [Commented] (HBASE-5064) utilize surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178042#comment-13178042 ]

nkeywal commented on HBASE-5064:

The exact same test already timed out in trunk in
https://builds.apache.org/job/HBase-TRUNK/2588/testReport/

I don't think it's related to this patch.

On Sat, Dec 31, 2011 at 4:28 PM, Zhihong Yu (Commented) (JIRA)
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177637#comment-13177637 ] nkeywal commented on HBASE-5064: The v20 is OK for commit imho. There are two processes by default, and 4 on hadoop-qa. It is possible to change the number of processes used by specifying -Dsurefire.secondPartThreadCount=WhatYouWant on the mvn command line. Using -Dsurefire.secondPartThreadCount=1 means no parallelization. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
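The override described in this comment can be sketched as a small helper that assembles the Maven command line. The property name -Dsurefire.secondPartThreadCount is the one given above; the helper name and the choice of the `test` goal are assumptions made for illustration, and the command is only built, not executed.

```python
def mvn_test_command(process_count=None, goal="test"):
    """Build an mvn argument list; helper name and goal are illustrative."""
    if process_count is not None and process_count < 1:
        raise ValueError("process_count must be >= 1")
    cmd = ["mvn", goal]
    if process_count is not None:
        # Property name taken from the comment; a value of 1 means
        # no parallelization.
        cmd.append(f"-Dsurefire.secondPartThreadCount={process_count}")
    return cmd


# Without an override, the build's own default applies.
print(" ".join(mvn_test_command()))
# Force a sequential run.
print(" ".join(mvn_test_command(1)))
```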
[jira] [Commented] (HBASE-4955) Use the official versions of surefire junit
[ https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177644#comment-13177644 ] nkeywal commented on HBASE-4955:

We're now using 2.12-TRUNK-HBASE-2. It's a private version, built on the 2.12 trunk (i.e. it does not contain everything that will be in 2.12 final).

Surefire: Could be for Surefire 2.12. Issues to monitor are:
329 (category support): fixed, we use the official implementation from the trunk
773 (forked processes not killed after timeout): not fixed in trunk, not fixed in our version
786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk
791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk
793 (incorrect time in the XML report): not fixed (reopened) in trunk, partially fixed in our version
760 (does not take into account the test method): fixed, we use the official implementation from the trunk
798 (print immediately the test class name): not fixed in trunk, not fixed in our version
799 (Allow test parallelization when forkMode=always): fixed in trunk, fixed in our version with some minimal differences
800 (redirectTestOutputToFile not taken into account): not yet fixed in trunk, fixed in our version
806 (Ignore selection criteria when -Dtest= is specified): not fixed in trunk, not fixed in our version
813 (Randomly wrong tests count and empty summary files): fixed in trunk, fixed in our version

800 and 793 are the most important to monitor: they are the only ones that are fixed in our version but not in trunk.

Use the official versions of surefire junit - Key: HBASE-4955 URL: https://issues.apache.org/jira/browse/HBASE-4955 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor

We currently use private versions of Surefire and JUnit since HBASE-4763. This JIRA tracks what we need in order to move to official versions. Surefire 2.11 is just out but, after some tests, it does not contain everything we need.

JUnit: Could be for JUnit 4.11. Issue to monitor:
https://github.com/KentBeck/junit/issues/359: fixed in our version, no feedback for an integration on trunk

Surefire: Could be for Surefire 2.12. Issues to monitor are:
329 (category support): fixed, we use the official implementation from the trunk
786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk
791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk
793 (incorrect time in the XML report): not fixed (reopened) in trunk, fixed in our version
760 (does not take into account the test method): fixed in trunk, not fixed in our version
798 (print immediately the test class name): not fixed in trunk, not fixed in our version
799 (Allow test parallelization when forkMode=always): not fixed in trunk, not fixed in our version
800 (redirectTestOutputToFile not taken into account): not yet fixed in trunk, fixed in our version

800 and 793 are the most important to monitor: they are the only ones that are fixed in our version but not in trunk.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177274#comment-13177274 ] nkeywal commented on HBASE-5064: @ted: The last 3 tests seem OK on hadoop-qa, even if I'm sure that we will discover some issues. A 4th test is in progress. I will do a 5th as well. Anything meaningful in the logs for TestMiniClusterLoadParallel? Note that the v18 patch runs 4 tests in //; I believe we will need to commit it with 2 or 3 to be sure that a 1 year old laptop can run the tests smoothly. I use 4 here to maximize // and minimize test time. We can set a parameter on the hadoop-qa build to use 4 or more if we want to... use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 5064.v18.patch, 5064.v19.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177285#comment-13177285 ] nkeywal commented on HBASE-5064: You want 2 processes in // only for hadoop-qa? It seems that the machine never runs two builds in //, so we should maximize the number of processes. That makes the build last ~35 minutes instead of 2 hours. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177388#comment-13177388 ] nkeywal commented on HBASE-5064: From the last 5 executions, it seems we have the NumberFormatException plus some random failures that I would tend to attribute to the usual flakiness (I may be wrong :-)). The FileNotFound exception was not reproduced. @ted: from the time taken by TestAdmin (twice the time on hadoop-qa, and very close to the test timeout of 900 seconds), I think that the machine is heavily loaded, or lacks memory and swaps. That makes me think that we should default the // degree to 2 (and keep 4 on hadoop-qa). use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176658#comment-13176658 ] nkeywal commented on HBASE-5064: once again a java.io.FileNotFoundException: /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/../logs/userlogs/job_20111228130003142_0001/attempt_20111228130003142_0001_m_01_2/log.index (No such file or directory) Will have to understand this. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176830#comment-13176830 ] nkeywal commented on HBASE-5064: Yep, maybe the NumberFormatException was hiding the FileNotFoundException before (it's also possible to have both errors simultaneously). I haven't tried v16 on the whole test suite locally, but it seems to work on my subset. Let's see. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176860#comment-13176860 ] nkeywal commented on HBASE-5064: fail: java.io.FileNotFoundException: /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/238434d5-7c45-413f-ba99-1440a560e1f1/hadoop-log-dir/userlogs/job_20111228220208254_0001/attempt_20111228220208254_0001_m_02_0/log.index (No such file or directory) enough for today use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175718#comment-13175718 ] nkeywal commented on HBASE-5064:

Below is a review of the tests on trunk and prebuild. We often, nearly always, have hanging tests on hadoop-qa, while it occurs only sometimes on the main build. It seems that it hangs more often when // is activated. It seems that we get this kind of behavior when we cannot start a thread. I will look at whether we can enhance the reporting in surefire (i.e. it seems some errors are not reported today).

#2572
  Tests run: 789, Failures: 0, Errors: 0, Skipped: 9
#2570
  Tests run: 789, Failures: 0, Errors: 0, Skipped: 9
#2569
  Tests run: 789, Failures: 0, Errors: 0, Skipped: 9
#2568
  Tests run: 789, Failures: 0, Errors: 0, Skipped: 9
#2566
  Tests run: 789, Failures: 0, Errors: 0, Skipped: 9
#2565 (marked as success, but there is a hanging test)
  wal.TestLogRollAbort

#596 5064 - 1 process, min number of threads
  Tests run: 786, Failures: 3, Errors: 1, Skipped: 9
#595 5064.v7.patch - 3 processes, min number of threads
  Tests run: 785, Failures: 3, Errors: 1, Skipped: 9
  hbck.TestOfflineMetaRebuildOverlap
  master.TestMasterRestartAfterDisablingTable
  replication.TestMasterReplication
#594 5064.v7.patch - 3 processes, min number of threads
  Tests run: 786, Failures: 3, Errors: 1, Skipped: 9
  util.TestMiniClusterLoadParallel
  coprocessor.TestMasterCoprocessorExceptionWithRemove
#593
  Tests run: 779, Failures: 4, Errors: 2, Skipped: 9
  util.TestMiniClusterLoadSequential
#592 - Data-block-encoding-2011-12-23.patch - fails, but with a clear message
  # There is insufficient memory for the Java Runtime Environment to continue.
  # Native memory allocation (malloc) failed to allocate 32756 bytes for ChunkPool::allocate
  # An error report file with more information is saved as:
  # /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/hs_err_pid20773.log
  Tests run: 739, Failures: 3, Errors: 1, Skipped: 9
#591 - scannerMVCC1.txt
  Tests run: 780, Failures: 3, Errors: 1, Skipped: 9
  master.TestDistributedLogSplitting
#589 - 5064.v12.patch, 2 threads
  Tests run: 783, Failures: 3, Errors: 1, Skipped: 9
  coprocessor.TestMasterCoprocessorExceptionWithRemove
  hbase.TestRegionRebalancing
  master.TestDistributedLogSplitting
#588 - 5064 4 threads
  Tests run: 768, Failures: 3, Errors: 1, Skipped: 9
  util.hbck.TestOfflineMetaRebuildBase
  catalog.TestMetaReaderEditor
  coprocessor.TestMasterCoprocessorExceptionWithRemove
  replication.regionserver.TestReplicationSink
  replication.TestMasterReplication
#587 - 5064 (7 processes, standard config)
  replication.TestMasterReplication
  master.TestMasterFailover
  master.TestRollingRestart
#586 - 5064 (6 processes, min number of threads)
  Tests run: 775, Failures: 3, Errors: 1, Skipped: 9
  master.TestRestartCluster
  regionserver.wal.TestLogRollAbort
  client.TestMetaScanner
  TestGlobalMemStoreSize
  client.TestMultiParallel
#585 - 5064 (3 processes, min number of threads)
  Tests run: 788, Failures: 4, Errors: 1, Skipped: 9
#583 - HBASE-4720.trunk.v2.patch
  Tests run: 788, Failures: 3, Errors: 1, Skipped: 9
  replication.TestMasterReplication
#582 - Delta-encoding.patch-2011-12-22_11_52_07.patch
  unable to create new native thread
  replication.TestReplication: Tests run: 2, Failures: 0, Errors: 1, Skipped: 0

use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
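Comparing builds as in the review above mostly comes down to reading the "Tests run: ..." summary of each one. A minimal sketch of extracting those counters from a console line, assuming the summary keeps the exact format quoted in this thread (the function names are hypothetical):

```python
import re

# Matches the summary format quoted in this thread.
SUMMARY = re.compile(
    r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)"
)


def parse_summary(line):
    """Return the four counters as a dict, or None if the line has no summary."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    run, failures, errors, skipped = map(int, match.groups())
    return {"run": run, "failures": failures, "errors": errors, "skipped": skipped}


def is_clean(summary):
    """True when the build reports neither failures nor errors."""
    return summary["failures"] == 0 and summary["errors"] == 0


print(parse_summary("#2572 Tests run: 789, Failures: 0, Errors: 0, Skipped: 9"))
```

Note that a clean summary does not rule out a hung test: as reported for #2565, a build can be marked as success while a test class is hanging, so the "run" count is worth comparing across builds too.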
[jira] [Commented] (HBASE-5085) fix test-patch script from setting the ulimit
[ https://issues.apache.org/jira/browse/HBASE-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175374#comment-13175374 ] nkeywal commented on HBASE-5085: On prebuild #584, we have: open files (-n) 6 max user processes (-u) 10240 This should be ok, trying immediately. fix test-patch script from setting the ulimit - Key: HBASE-5085 URL: https://issues.apache.org/jira/browse/HBASE-5085 Project: HBase Issue Type: Bug Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Fix For: 0.94.0 Attachments: 5085-v2-experiment.txt, 5085-v2-experiment.txt, 5085-v3-experiment.txt, 5085-v3-experiment.txt, 5085-v4-experiment.txt, 5085-v5.txt, hbase-5085.patch test-patch.sh script sets the ulimit -n 1024 just after triggering the patch setting this overrides the underlying systems ulimit and hence failing the hbase tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
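The two limits quoted in this comment can also be read from inside a process, which helps to confirm what a forked test JVM's parent actually inherited. A minimal sketch using Python's standard `resource` module (Unix only); the 4096 threshold is an arbitrary illustrative value, not a figure from this thread, while 1024 is the value the issue description says test-patch.sh was setting.

```python
import resource


def describe_limits():
    """Return the (soft, hard) limits that `ulimit -n` and `ulimit -u` report."""
    limits = {"open files (-n)": resource.getrlimit(resource.RLIMIT_NOFILE)}
    # RLIMIT_NPROC is not defined on every platform.
    nproc = getattr(resource, "RLIMIT_NPROC", None)
    if nproc is not None:
        limits["max user processes (-u)"] = resource.getrlimit(nproc)
    return limits


def too_low(soft, minimum):
    """True when a finite soft limit is below the wanted minimum."""
    return soft != resource.RLIM_INFINITY and soft < minimum


for name, (soft, hard) in describe_limits().items():
    print(name, "soft:", soft, "hard:", hard)

# 4096 is a hypothetical threshold chosen for the example.
print("1024 open files too low?", too_low(1024, 4096))
```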
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175463#comment-13175463 ] nkeywal commented on HBASE-5064:

#585 (3 processes, min number of threads)
Total time: 40:46.855s
Tests run: 788, Failures: 4, Errors: 1, Skipped: 9

Invalid result expected:134 but was:190
  TestCoprocessorEndpoint.testAggregation
NumberFormatException
  TestTableMapReduce.testMultiRegionTable
  TestHFileOutputFormat.testMRIncrementalLoad
  TestHFileOutputFormat.testMRIncrementalLoadWithSplit
  TestHFileOutputFormat.testExcludeMinorCompaction
Hung
  None

Parallelization went well, with nearly 100% linearity: we're 3 times faster with 3 processes. TestCoprocessorEndpoint.testAggregation failed as well in prebuild #584, so it could be unrelated to the //. Let's try with 6 processes.

use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175486#comment-13175486 ] nkeywal commented on HBASE-5064:

#586 (6 processes, min number of threads)
Total time: 22:54.145s
Tests run: 775, Failures: 3, Errors: 1, Skipped: 9

NumberFormatException
  TestTableMapReduce.testMultiRegionTable
  TestHFileOutputFormat.testMRIncrementalLoad
  TestHFileOutputFormat.testMRIncrementalLoadWithSplit
  TestHFileOutputFormat.testExcludeMinorCompaction
Hung
  master.TestRestartCluster
  regionserver.wal.TestLogRollAbort
  client.TestMetaScanner
  TestGlobalMemStoreSize
  client.TestMultiParallel

Still nearly 100% linearity (5 times faster with 6 processes). A lot of tests hung. TestCoprocessorEndpoint.testAggregation didn't fail this time.

use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
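The "times faster" figures in these reports can be recomputed from the Maven total times. A rough sketch, under the assumption that the sequential prebuild #554 reported in this thread (1:55:37.351s, no //) is a fair baseline for #585 and #586; the runs used different patch versions, so the result is indicative only.

```python
def parse_total_time(text):
    """Convert a Maven 'Total time' value such as '1:55:37.351s' to seconds."""
    seconds = 0.0
    for part in text.rstrip("s").split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def speedup(baseline, parallel):
    """How many times faster the parallel run is than the baseline."""
    return parse_total_time(baseline) / parse_total_time(parallel)


def efficiency(baseline, parallel, processes):
    """Speedup divided by process count; 1.0 would be perfectly linear."""
    return speedup(baseline, parallel) / processes


BASELINE = "1:55:37.351s"  # prebuild #554, no //
for build, processes, total in [("#585", 3, "40:46.855s"), ("#586", 6, "22:54.145s")]:
    print(build, processes, "processes:",
          round(speedup(BASELINE, total), 2), "x,",
          round(efficiency(BASELINE, total, processes), 2), "efficiency")
```

Dividing the speedup by the process count makes it easier to see where adding processes stops paying off.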
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175496#comment-13175496 ] nkeywal commented on HBASE-5064:

#587 (7 processes, standard config)
Total time: 31:58.502s
Tests run: 782, Failures: 3, Errors: 1, Skipped: 9

NumberFormatException
  TestTableMapReduce.testMultiRegionTable
  TestHFileOutputFormat.testMRIncrementalLoad
  TestHFileOutputFormat.testMRIncrementalLoadWithSplit
  TestHFileOutputFormat.testExcludeMinorCompaction
Hung
  replication.TestMasterReplication
  master.TestMasterFailover
  master.TestRollingRestart

As usual, nearly 100% linearity. 3 tests hung. TestCoprocessorEndpoint.testAggregation didn't fail this time.

use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175517#comment-13175517 ] nkeywal commented on HBASE-5064: #588 Total time: 39:50.795s Tests run: 768, Failures: 3, Errors: 1, Skipped: 9 Hanging test: Running org.apache.hadoop.hbase.util.hbck.TestOfflineMetaRebuildBase Hanging test: Running org.apache.hadoop.hbase.catalog.TestMetaReaderEditor Hanging test: Running org.apache.hadoop.hbase.coprocessor.TestMasterCoprocessorExceptionWithRemove Hanging test: Running org.apache.hadoop.hbase.replication.regionserver.TestReplicationSink Hanging test: Running org.apache.hadoop.hbase.replication.TestMasterReplication Gonna try with only 2 threads. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173984#comment-13173984 ] nkeywal commented on HBASE-5064:

#554 (No //, min number of threads)
Total time: 1:55:37.351s
Tests run: 781, Failures: 3, Errors: 2, Skipped: 9

Too many open files
  TestInstantSchemaChange.testInstantSchemaJanitor
NumberFormatException
  TestTableMapReduce.testMultiRegionTable
  TestHFileOutputFormat.testMRIncrementalLoad
  TestHFileOutputFormat.testMRIncrementalLoadWithSplit
  TestHFileOutputFormat.testExcludeMinorCompaction
Hung
  TestReplication

use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174089#comment-13174089 ] nkeywal commented on HBASE-5064:

#565 (No //, min number of threads)
Total time: 1:59:24.242s
Tests run: 788, Failures: 3, Errors: 3, Skipped: 9

Too many open files
  TestInstantSchemaChange.testInstantSchemaJanitor
NumberFormatException
  TestTableMapReduce.testMultiRegionTable
  TestHFileOutputFormat.testMRIncrementalLoad
  TestHFileOutputFormat.testMRIncrementalLoadWithSplit
  TestHFileOutputFormat.testExcludeMinorCompaction
Hung
  None it seems, and at least not TestReplication
The directory is already locked.
  TestLogRolling.testLogRollOnDatanodeDeath

use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174207#comment-13174207 ] nkeywal commented on HBASE-5064:

#566 (No //, min number of threads)
Total time: 1:56:55.198s
Tests run: 789, Failures: 3, Errors: 2, Skipped: 9

Too many open files
  TestInstantSchemaChange.testInstantSchemaJanitor
NumberFormatException
  TestTableMapReduce.testMultiRegionTable
  TestHFileOutputFormat.testMRIncrementalLoad
  TestHFileOutputFormat.testMRIncrementalLoadWithSplit
  TestHFileOutputFormat.testExcludeMinorCompaction
Hung
  None

use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174277#comment-13174277 ] nkeywal commented on HBASE-5064: No, we need the hadoop-qa setting increased before activating //. Here, I am launching multiple tests to identify most of the random issues we can have; this will help to distinguish what is random from what is due to //. However, I can split this patch into smaller ones: 1) one to activate the latest version of surefire, 2) one for the minimal thread settings, 3) one for the // itself. From all the tests done so far, 1 and 2 seem OK today. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v8.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172086#comment-13172086 ]
nkeywal commented on HBASE-5064:
Tests run: 780, Failures: 4, Errors: 2, Skipped: 9
Total time: 1:05:24.474s
To be compared with prebuild #535:
Tests run: 785, Failures: 0, Errors: 3, Skipped: 9
Total time: 1:49:04.397s
Still faster (~45 minutes). Still some tests missing in the final report (795 vs. 797). Don't know why yet.
TestInstantSchemaChange.testInstantSchemaJanitor: Too many open files
TestFromClientSide.testPoolBehavior: Fixed in v4.
TestTableMapReduce.testMultiRegionTable: NumberFormatException
TestHFileOutputFormat.testMRIncrementalLoad: NumberFormatException
TestHFileOutputFormat.testMRIncrementalLoadWithSplit: NumberFormatException
TestHFileOutputFormat.testExcludeMinorCompaction: NumberFormatException
NumberFormatException: seems to be unrelated, I've got it on other patches as well.
@stack: it's a parallelization per process, so it should not be necessary: there is no added load on maven itself. However, the results of the first run seem to say that the hadoop-qa machine was overloaded. But it seems that the other runs went well.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172259#comment-13172259 ]
nkeywal commented on HBASE-5064:
Error while running command to get file permissions : java.io.IOException: Cannot run program /bin/ls: java.lang.OutOfMemoryError: unable to create new native thread
=> hadoop-qa is undersized
=> reducing the number of threads used didn't help much
=> that makes two failures due to undersizing out of 5 tries.
Let's do another run with only 1 thread to see if everything goes right when there is no parallelization.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5051) HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call
[ https://issues.apache.org/jira/browse/HBASE-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172263#comment-13172263 ]
nkeywal commented on HBASE-5051:
TestMasterReplication hung, but it hangs on trunk as well (see: https://builds.apache.org/job/HBase-TRUNK/2558/console).
TestReplication: usually flaky.
I think the patch is OK.
HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call -- Key: HBASE-5051 URL: https://issues.apache.org/jira/browse/HBASE-5051 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5051.patch, 5051.v2.patch, 5051.v2.patch, 5051.v2.patch, 5051.v2.patch As it's a new instance, it should be closed. As the function name seems to imply that it's an instance managed by HBaseTestingUtility, most of the users don't close it => leak
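The leak described in this issue (a getter that hands out a fresh instance nobody closes) can be illustrated with a small language-neutral sketch. This is not the HBase code and not necessarily what the attached patch does: `FakeAdmin`, `LeakyTestingUtility` and `ManagedTestingUtility` are hypothetical names, and caching one instance owned by the utility is only one possible way to avoid the leak.

```python
class FakeAdmin:
    """Hypothetical stand-in for an admin client holding a resource until closed."""

    open_count = 0  # number of instances created and not yet closed

    def __init__(self):
        FakeAdmin.open_count += 1
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            FakeAdmin.open_count -= 1


class LeakyTestingUtility:
    """Getter creates a new instance per call; callers rarely close it."""

    def get_admin(self):
        return FakeAdmin()


class ManagedTestingUtility:
    """Getter returns one cached instance that the utility itself closes."""

    def __init__(self):
        self._admin = None

    def get_admin(self):
        if self._admin is None:
            self._admin = FakeAdmin()
        return self._admin

    def shutdown(self):
        if self._admin is not None:
            self._admin.close()
            self._admin = None


if __name__ == "__main__":
    leaky = LeakyTestingUtility()
    for _ in range(3):
        leaky.get_admin()  # result never closed by the caller
    print("open after leaky calls:", FakeAdmin.open_count)

    FakeAdmin.open_count = 0
    managed = ManagedTestingUtility()
    for _ in range(3):
        managed.get_admin()
    managed.shutdown()
    print("open after managed calls + shutdown:", FakeAdmin.open_count)
```

With the managed variant the getter name matches its behavior: callers do not own the instance, so forgetting to close it is harmless.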
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172375#comment-13172375 ]
nkeywal commented on HBASE-5064:
#535 (with previous version of surefire)
Total time: 1:49:04.397s
Tests run: 785, Failures: 0, Errors: 3, Skipped: 9
#537 (2 threads)
Total time: 1:13:18.269s
Tests run: 771, Failures: 3, Errors: 1, Skipped: 9
#539 (2 threads, thread min configuration)
Total time: 1:05:24.474s
Tests run: 780, Failures: 4, Errors: 2, Skipped: 9
#546 (1 thread, thread min configuration)
Total time: 2:07:01.697s
Tests run: 771, Failures: 3, Errors: 2, Skipped: 9
Seems that the missing tests are more or less random as well. Let's do another try with 3 threads.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch To be tried multiple times on hadoop-qa before committing.
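To compare the build durations quoted in this comment, the totals can be parsed and subtracted from the #535 baseline. The sketch below only reuses the four "Total time" values from the comment; the helper names `parse_total_time` and `time_saved` are made up for illustration.

```python
from datetime import timedelta


def parse_total_time(text):
    """Parse a Maven-style total such as '1:49:04.397s' or '41:21.481s'."""
    parts = text.rstrip("s").split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Total times as quoted in the comment above.
RUNS = {
    "#535": "1:49:04.397s",  # previous version of surefire
    "#537": "1:13:18.269s",  # 2 threads
    "#539": "1:05:24.474s",  # 2 threads, thread min configuration
    "#546": "2:07:01.697s",  # 1 thread, thread min configuration
}


def time_saved(runs, baseline):
    """Return baseline duration minus each other run's duration."""
    base = parse_total_time(runs[baseline])
    return {
        name: base - parse_total_time(total)
        for name, total in runs.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, saved in time_saved(RUNS, "#535").items():
        minutes = saved.total_seconds() / 60
        print(f"{name}: {minutes:+.1f} minutes saved vs #535")
```

A negative value means the run was slower than the baseline, which is the case for the single-thread run #546.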
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172383#comment-13172383 ]
nkeywal commented on HBASE-5064:
@stack: yes, it should be that there are too many threads. It could be:
- a JVM limitation that we should manage with -Xmx or -Xss => but then we should reproduce it on our machines as well (as far as I know, no one reproduces it locally), and we already set -Xmx1900m
- a limitation on the total number of threads on the machine => but that would be quite strange
- the ulimit per process => but then running the tests in multiple processes should not change the probability of hitting the issue
The patch v2+ changed the HBase configuration. On TestAdmin, we need 40 fewer threads, so the benefit is not high, and I saw no real impact.
There are some things to look at, such as sysctl kernel.threads-max; I will check whether it is printed in the default script.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172393#comment-13172393 ]
nkeywal commented on HBASE-5064:
I've seen something. Even if it's not the problem, it is a problem. On hadoop-qa, we have:
max user processes (-u) 2048
Try this on your computer:
ulimit -u 256
Then run a test such as TestAdmin. You will get a nice java.lang.OutOfMemoryError: unable to create new native thread.
This is because this limit controls not only the number of processes but also the number of native threads. We should have this set to a much bigger value.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172420#comment-13172420 ]
nkeywal commented on HBASE-5064:
There are two issues:
1) open files (-n) 1024
This limit is per process.
This is the cause of "too many open files".
It occurs when a test, for one reason or another, goes beyond this limit (some of the tests are very close to it).
2) max user processes (-u) 2048
This limit is per user/machine, not per process!
This is the cause of "unable to create new native thread".
It occurs when the machine runs multiple tests in parallel and all of them happen to use a lot of threads.
Both must be changed. As we should limit the number of file descriptors used, it makes sense to set open files to something reasonable (like 2048 or 4096). The max user processes limit, however, could be set to unlimited without any issue: when it fails, it's because there are other processes using threads as well, and we don't control them.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
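The two limits discussed in this comment can be inspected from a script before launching a test run. The following is a minimal sketch for Unix-like systems, using Python's standard `resource` module; `describe_limit` is a hypothetical helper name, and the mapping of `ulimit -n` and `ulimit -u` to `RLIMIT_NOFILE` and `RLIMIT_NPROC` is the usual one on Linux.

```python
import resource


def describe_limit(label, rlimit):
    """Return a one-line description of the soft and hard value of a limit."""
    soft, hard = resource.getrlimit(rlimit)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{label}: soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    # Per-process cap on file descriptors ("too many open files").
    print(describe_limit("open files (ulimit -n)", resource.RLIMIT_NOFILE))
    # Per-user cap on processes; on Linux it also counts native threads
    # ("unable to create new native thread").
    print(describe_limit("max user processes (ulimit -u)", resource.RLIMIT_NPROC))
```

Printing these values at the start of a build makes it easier to tell an undersized machine from a genuinely leaking test.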
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172435#comment-13172435 ]
nkeywal commented on HBASE-5064:
For #547:
NumberFormatException:
mapreduce.TestHFileOutputFormat.testExcludeMinorCompaction
mapreduce.TestHFileOutputFormat.testMRIncrementalLoadWithSplit
mapreduce.TestHFileOutputFormat.testMRIncrementalLoad
mapred.TestTableMapReduce.testMultiRegionTable
Too many open files:
TestInstantSchemaChange.testInstantSchemaJanitor
??:
TestSplitLogManager.testMultipleResubmits
Total time: 41:21.481s
Tests run: 733, Failures: 4, Errors: 2, Skipped: 6
Many tests hung or are not in the report, so the final result is not meaningful. It will make more sense to retry once the ulimit settings are updated.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172438#comment-13172438 ]
nkeywal commented on HBASE-5064:
@stack: the summary is in the comment of 19/Dec/11 17:26, and yes, we really need these values to be changed :-)
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171811#comment-13171811 ]
nkeywal commented on HBASE-5064:
java.lang.OutOfMemoryError: unable to create new native thread: hadoop-qa can't execute 3 tests in parallel. Let's try with 2.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171815#comment-13171815 ]
nkeywal commented on HBASE-5064:
Hmm, it was already set to 2. Let's retry, but hadoop-qa is not really oversized...
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171847#comment-13171847 ]
nkeywal commented on HBASE-5064:
This time it worked: I have the errors above on another patch as well.
The logs say:
Total time: 1:13:18.269s
To be compared with prebuild #535:
Total time: 1:49:04.397s
=> about 36 minutes faster.
However, we have some tests without results:
Tests run: 771, Failures: 3, Errors: 1, Skipped: 9
To be compared with prebuild #535:
Tests run: 785, Failures: 0, Errors: 3, Skipped: 9
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Commented] (HBASE-5051) HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call
[ https://issues.apache.org/jira/browse/HBASE-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171522#comment-13171522 ]
nkeywal commented on HBASE-5051:
mapred.TestTableMapReduce.testMultiRegionTable: NumberFormatException: For input string: 18446743988037478956
mapreduce.TestHFileOutputFormat.testMRIncrementalLoad: NumberFormatException: For input string: 18446743988037169444
mapreduce.TestHFileOutputFormat.testMRIncrementalLoadWithSplit: NumberFormatException: For input string: 18446743988037182259
mapreduce.TestHFileOutputFormat.testExcludeMinorCompaction: NumberFormatException: For input string: 18446743988037186552
TestReplication.queueFailover: ??
TestInstantSchemaChange.testInstantSchemaJanitor: Too many open files
TestMasterReplication: hung
TestMasterReplication works locally. The patch should not have impacted it. It hangs on trunk as well (see: https://builds.apache.org/job/HBase-TRUNK/2553/testReport/org.apache.hadoop.hbase.replication/ : there is no entry for TestMasterReplication).
TestReplication works locally. This test is usually flaky.
mapred.TestTableMapReduce worked in the previous prebuild.
mapreduce.TestHFileOutputFormat worked in the previous prebuild.
Let's retry.
HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call -- Key: HBASE-5051 URL: https://issues.apache.org/jira/browse/HBASE-5051 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5051.patch, 5051.v2.patch, 5051.v2.patch, 5051.v2.patch As it's a new instance, it should be closed. As the function name seems to imply that it's an instance managed by HBaseTestingUtility, most of the users don't close it => leak
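A note on the NumberFormatException inputs quoted in this comment: all four strings are larger than the maximum signed 64-bit value, so any code that parses them as a signed 64-bit integer (for example a Java long) would reject them. The comment does not say which code does the parsing, so that part is an assumption. The sketch below only checks the range and shows what each value would be if reinterpreted as a two's complement signed 64-bit number; the helper names are made up.

```python
LONG_MAX = 2**63 - 1   # largest signed 64-bit value
LONG_MIN = -(2**63)    # smallest signed 64-bit value

# Input strings as reported in the comment above.
REPORTED = [
    "18446743988037478956",
    "18446743988037169444",
    "18446743988037182259",
    "18446743988037186552",
]


def fits_signed_64(text):
    """True if the decimal string is within the signed 64-bit range."""
    return LONG_MIN <= int(text) <= LONG_MAX


def as_signed_64(text):
    """Reinterpret an unsigned 64-bit decimal string as two's complement."""
    value = int(text)
    return value - 2**64 if value > LONG_MAX else value


if __name__ == "__main__":
    for text in REPORTED:
        print(text, "fits:", fits_signed_64(text), "as signed:", as_signed_64(text))
```

Every reported value maps to a negative number under this reinterpretation, which is consistent with a negative counter being printed as an unsigned value somewhere upstream, although the source does not confirm that.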
[jira] [Commented] (HBASE-5053) HCM Tests leaks connections
[ https://issues.apache.org/jira/browse/HBASE-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171011#comment-13171011 ]
nkeywal commented on HBASE-5053:
org.apache.hadoop.hbase.client.TestInstantSchemaChange => Too many open files
HCM Tests leaks connections --- Key: HBASE-5053 URL: https://issues.apache.org/jira/browse/HBASE-5053 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5053.patch
There are simple leaks and one more complex leak. The complex one comes from the fact that HConnectionManager.HConnectionImplementation keeps a *reference* to the configuration used for the creation. So if this configuration is updated later, the HConnectionKey created initially will differ from the current one. As a consequence, close() will no longer find the connection in the list, and the connection won't be deleted. I added a warning when a close does not find the connection in the list, but I wonder whether we should copy the HConnectionKey instead of keeping a reference.
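The "complex" leak described in this issue can be reproduced in miniature. The sketch below is not the HConnectionManager code: `ConnectionRegistry`, `Connection`, `connection_key` and the configuration entries are hypothetical, and the `copy_config` flag stands in for the copy-instead-of-reference idea raised at the end of the description.

```python
def connection_key(config):
    """Derive a hashable key from the current content of a configuration."""
    return tuple(sorted(config.items()))


class Connection:
    def __init__(self, registry, config, copy_config):
        self._registry = registry
        # Keeping a reference means later edits by the caller change our key.
        self.config = dict(config) if copy_config else config

    def close(self):
        """Ask the registry to drop this connection; True if it was found."""
        return self._registry.remove(connection_key(self.config))


class ConnectionRegistry:
    def __init__(self):
        self._connections = {}

    def open(self, config, copy_config=False):
        conn = Connection(self, config, copy_config)
        # The key is computed once, from the configuration as it is now.
        self._connections[connection_key(config)] = conn
        return conn

    def remove(self, key):
        if key not in self._connections:
            print("WARNING: close() did not find the connection in the list")
            return False
        del self._connections[key]
        return True

    def __len__(self):
        return len(self._connections)


if __name__ == "__main__":
    for copy_config in (False, True):
        registry = ConnectionRegistry()
        conf = {"quorum": "localhost"}   # hypothetical setting
        conn = registry.open(conf, copy_config=copy_config)
        conf["retries"] = "3"            # configuration updated after creation
        found = conn.close()
        print(f"copy_config={copy_config}: found={found}, still registered={len(registry)}")
```

With a shared reference, the key recomputed at close time no longer matches the one stored at creation, so the entry stays in the registry; with a private copy, the two keys stay identical and the entry is removed.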