[jira] [Created] (HBASE-24612) Consider allowing a separate EventLoopGroup for accepting new connections.
Mark Robert Miller created HBASE-24612: -- Summary: Consider allowing a separate EventLoopGroup for accepting new connections. Key: HBASE-24612 URL: https://issues.apache.org/jira/browse/HBASE-24612 Project: HBase Issue Type: Improvement Reporter: Mark Robert Miller Netty applications often use a separate thread pool for accepting connections rather than sharing a single pool between accepting new connections and the work those connections do. It would be interesting to allow configuring separate pools so that users can experiment with a pool dedicated to accepting new connections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
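For reference, the separation being proposed is the common Netty idiom of passing two EventLoopGroups to ServerBootstrap: a "boss" group that does nothing but accept connections, and a "worker" group that services them. A minimal sketch of that idiom (the class name, pool sizes, and empty handler are illustrative, not what HBase would necessarily use):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class SeparateAcceptPools {
  public static void main(String[] args) throws InterruptedException {
    // Dedicated pool that only accepts new connections; one thread is typical.
    EventLoopGroup bossGroup = new NioEventLoopGroup(1);
    // Pool that performs the I/O for the accepted connections.
    EventLoopGroup workerGroup = new NioEventLoopGroup();
    try {
      ServerBootstrap b = new ServerBootstrap();
      b.group(bossGroup, workerGroup)           // two pools instead of one shared pool
          .channel(NioServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              // application handlers would be added to ch.pipeline() here
            }
          });
      b.bind(0).sync().channel().closeFuture().sync();
    } finally {
      bossGroup.shutdownGracefully();
      workerGroup.shutdownGracefully();
    }
  }
}
```

With a single shared group, a storm of accepts competes with in-flight request work; the two-group form is what a configuration switch here would toggle.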
[jira] [Created] (HBASE-24447) Contribute a Test class that shows some examples for using the Async Client API
Mark Robert Miller created HBASE-24447: -- Summary: Contribute a Test class that shows some examples for using the Async Client API Key: HBASE-24447 URL: https://issues.apache.org/jira/browse/HBASE-24447 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller Kind of along the lines of [https://github.com/apache/hbase/blob/master/hbase-examples/src/main/java/org/apache/hadoop/hbase/client/example/AsyncClientExample.java] but initially in the form of a test to make verification and environment setup easy. This is basically some examples of how you can use the CompletableFuture API with the Async Client - it can be a little painful to do from scratch for a newcomer given the expressiveness and size of the CompletableFuture API, but it is much easier with some more example code to build on or tinker with.
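As a flavor of what such a test could demonstrate, here is a self-contained sketch of the kind of CompletableFuture chaining the async client pushes you toward. The `asyncGet` helper is a hypothetical stand-in for an async table operation, not the real AsyncTable API:

```java
import java.util.concurrent.CompletableFuture;

public class CompletableFutureChaining {
  // Hypothetical async "get": stands in for something like an AsyncTable call.
  static CompletableFuture<Integer> asyncGet(int key) {
    return CompletableFuture.supplyAsync(() -> key * 10);
  }

  public static void main(String[] args) {
    // Chain a dependent async call off the first, then transform the result,
    // with error handling kept inside the chain rather than try/catch.
    CompletableFuture<String> result =
        asyncGet(1)
            .thenCompose(v -> asyncGet(v + 1))    // dependent async call
            .thenApply(v -> "value=" + v)         // synchronous transform
            .exceptionally(t -> "failed: " + t);  // error handling in the chain
    System.out.println(result.join());
  }
}
```

The thenCompose/thenApply/exceptionally trio covers most of what a newcomer needs; a test class full of small chains like this is much easier to tinker with than the full CompletableFuture javadoc.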
[jira] [Commented] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113749#comment-17113749 ] Mark Robert Miller commented on HBASE-24155: It took me a bit longer, but I ended up tracking this down a bit further. Raising the socket cache size and expiration for hdfs had helped a fair amount, but still about 50% as many sockets were being created; a lot of that I tracked to *ReplicationSourceWALReader* and its reset to look for additional data to read. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. > -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle or perhaps proper reuse. > Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic effect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > In my experience, a much, much smaller number of connections in a test suite > would end up in TIME_WAIT when connection handling is all correct. > Notes to come in comments below.
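For anyone wanting to reproduce the "raising the socket cache size and expiration" mitigation mentioned above, the client-side HDFS knobs involved are, to the best of my knowledge, the ones below; the values shown are illustrative for experimenting, not recommendations:

```xml
<!-- hdfs-site.xml (client side): illustrative values only -->
<property>
  <name>dfs.client.socketcache.capacity</name>
  <value>64</value> <!-- default is 16 cached sockets -->
</property>
<property>
  <name>dfs.client.socketcache.expiryMsec</name>
  <value>30000</value> <!-- default is 3000 ms before a cached socket expires -->
</property>
```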
[jira] [Commented] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.
[ https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109348#comment-17109348 ] Mark Robert Miller commented on HBASE-23806: I’m done with my project. I was originally going to try and kind of document and share my trail, but it did not really pan out. My intention was that once the test suite was seen at another level, it might just be too tempting to take some benefit from that example of what can be done. Really though, the main benefit was for me, for general knowledge and so that I could understand how to do a couple things in the code with confidence. > Provide a much faster and efficient alternate option to maven and surefire > for running tests. > - > > Key: HBASE-23806 > URL: https://issues.apache.org/jira/browse/HBASE-23806 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Minor > > With HBASE-23795, the hope is to drive tests with maven and surefire much > closer to their potential. > That will still leave a lot of room for improvement. > For those that have some nice hardware and a need for speed, we can blow > right past maven+surefire.
[jira] [Resolved] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.
[ https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23806. Resolution: Won't Fix > Provide a much faster and efficient alternate option to maven and surefire > for running tests. > - > > Key: HBASE-23806 > URL: https://issues.apache.org/jira/browse/HBASE-23806 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Minor > > With HBASE-23795, the hope is to drive tests with maven and surefire much > closer to their potential. > That will still leave a lot of room for improvement. > For those that have some nice hardware and a need for speed, we can blow > right past maven+surefire.
[jira] [Resolved] (HBASE-23787) TestSyncTimeRangeTracker fails quite easily and allocates a very expensive array.
[ https://issues.apache.org/jira/browse/HBASE-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23787. Resolution: Not A Problem I think the expensive array may have already been dealt with elsewhere. > TestSyncTimeRangeTracker fails quite easily and allocates a very expensive > array. > - > > Key: HBASE-23787 > URL: https://issues.apache.org/jira/browse/HBASE-23787 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > I see this test fail a lot in my environments. It also uses such a large > array that it seems particularly memory wasteful and difficult to get good > contention in the test as well.
[jira] [Resolved] (HBASE-23849) Harden small and medium tests for lots of parallel runs with re-used jvms.
[ https://issues.apache.org/jira/browse/HBASE-23849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23849. Resolution: Won't Fix Small and medium tests don't actually take too long to get going; mostly you just have to deal with some statics issues. I had these working well on master at one point, but I have only been looking at branch-2 for a while, so I'm not looking to go back to that. > Harden small and medium tests for lots of parallel runs with re-used jvms. > -- > > Key: HBASE-23849 > URL: https://issues.apache.org/jira/browse/HBASE-23849 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Major >
[jira] [Comment Edited] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106302#comment-17106302 ] Mark Robert Miller edited comment on HBASE-24155 at 5/13/20, 1:27 PM: -- Man it took me a long time to finally see a lot of what was going on here. Mostly just seems to be hdfs short circuit read socket pooling management that you can just call me not a fan of and with defaults that you can call me an anti fan of. Couple that with some hbase fail and fast retry stuff, especially in like snapshotting or splitting stuff, and well, the number of potential sockets (many without a proper tcp lifecycle) are just one part of the resulting fun. And the number of datanode transfer threads (also will use sockets) that can be spun up in these cases is clearly beyond what makes sense to me. was (Author: markrmiller): Man it took me a long time to finally see a lot of what was going on here. Mostly just seems to be hdfs short circuit read socket polling management that you can just call me not a fan of and with defaults that you can call me an anti fan of. Couple that with some hbase fail and fast retry stuff, especially in like snapshooting or splitting stuff, and well, the number of potential sockets (many without a proper tcp lifecycle) are just one part of the resulting fun. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. > -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle or perhaps proper reuse. 
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic effect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > In my experience, a much, much smaller number of connections in a test suite > would end up in TIME_WAIT when connection handling is all correct. > Notes to come in comments below.
[jira] [Resolved] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-24155. Resolution: Information Provided Man it took me a long time to finally see a lot of what was going on here. Mostly just seems to be hdfs short circuit read socket pooling management that you can just call me not a fan of and with defaults that you can call me an anti fan of. Couple that with some hbase fail and fast retry stuff, especially in like snapshotting or splitting stuff, and well, the number of potential sockets (many without a proper tcp lifecycle) are just one part of the resulting fun. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. > -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle or perhaps proper reuse. > Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic effect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > In my experience, a much, much smaller number of connections in a test suite > would end up in TIME_WAIT when connection handling is all correct. > Notes to come in comments below.
[jira] [Resolved] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.
[ https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23830. Resolution: Not A Problem Don't see this so often anymore. > TestReplicationEndpoint appears to fail a lot in my attempts for a clean test > run locally. > -- > > Key: HBASE-23830 > URL: https://issues.apache.org/jira/browse/HBASE-23830 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > Attachments: test_fails.tar.xz > > > This test is failing for me like 30-40% of the time. The failure is usually > as below. I've tried increasing the wait timeout but that does not seem to > help at all. > {code} > [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 105.145 s <<< FAILURE! - in > org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: > 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - > in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication > Time elapsed: 38.725 s <<< FAILURE!java.lang.AssertionError: Waiting timed > out after [30,000] msec Failed to replicate all edits, expected = 2500 > replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code}
[jira] [Resolved] (HBASE-23918) Track sensitive resources to ensure they are closed and assist devs in finding leaks.
[ https://issues.apache.org/jira/browse/HBASE-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23918. Resolution: Information Provided So I use a tool like this when I'm tracking things to close - I just cut and paste it in when I'm hunting for closeable objects that are either not closed or missed due to concurrency or a bug or whatever. It's more controversial to use asserts like this in tests permanently, so I'll just leave this with the seed idea for some kind of auto shutdown/close enforcement options above the current hmaster/regionserver thread checker. > Track sensitive resources to ensure they are closed and assist devs in > finding leaks. > - > > Key: HBASE-23918 > URL: https://issues.apache.org/jira/browse/HBASE-23918 > Project: HBase > Issue Type: Improvement >Reporter: Mark Robert Miller >Priority: Major > > Closing some objects is quite critical. Issues with leaks can be quite > slippery, nasty, and prone to growing. Maintaining close integrity is an > embarrassing sport for humans. > In the past, those 3 thoughts led me to start tracking objects in tests to > alert of leaks. Even with an alert though, the job of tracking down all of > the leaks just based on what leaked was beyond my skill. If it's beyond the > skill of even one dev who is committing, that tends to end in trouble. So I added > the stack trace for the origin of the object. Things can still get a bit > tricky to track down in some cases, but now I had the start of a real > solution to all of the whack-a-mole games I spent too much time playing.
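A minimal version of the tracker idea described above (record a creation stack trace per resource, report anything not closed at teardown) might look like this. This is a sketch of the idea, not the actual tool from the comment:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CloseTracker {
  // Maps each tracked resource to the stack trace captured when it was opened.
  private static final Map<Object, Throwable> OPEN = new ConcurrentHashMap<>();

  /** Wrap resource creation sites: CloseTracker.track(new FSDataInputStream(...)). */
  public static <T> T track(T resource) {
    OPEN.put(resource, new Throwable("opened here"));  // origin stack trace
    return resource;
  }

  /** Call from the resource's close() path. */
  public static void closed(Object resource) {
    OPEN.remove(resource);
  }

  /** Call from test teardown: prints the origin of every leak, returns the count. */
  public static int reportLeaks() {
    OPEN.values().forEach(Throwable::printStackTrace);
    return OPEN.size();
  }
}
```

The key trick is the same one the comment describes: capturing the stack trace at creation time turns "something leaked" into "this call site leaked", which is what makes the whack-a-mole game winnable.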
[jira] [Comment Edited] (HBASE-23831) TestChoreService is very sensitive to resources.
[ https://issues.apache.org/jira/browse/HBASE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104878#comment-17104878 ] Mark Robert Miller edited comment on HBASE-23831 at 5/11/20, 8:20 PM: -- When I try and run this test in a VM on my iMac, it always fails in a couple ways. However, I don't see others with this issue, it doesn't seem to happen on my primary box, and addressing it is not easy to do cleanly; you just have to keep adding more fudge to allow slower envs with fewer cores to handle it. Doesn't really show up so easily in my other faster envs. was (Author: markrmiller): When I try and run this test in a VM on my iMac, it also fails in a couple ways. However, I don't see others with this issue and it doesn't seem to happen on my primary box and addressing is not easy to do cleanly, you just have to keep adding more fudge to allow slower envs with fewer cores to handle it. Doesn't really show up so easily in my other faster envs. > TestChoreService is very sensitive to resources. > > > Key: HBASE-23831 > URL: https://issues.apache.org/jira/browse/HBASE-23831 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > > More details following.
[jira] [Resolved] (HBASE-23831) TestChoreService is very sensitive to resources.
[ https://issues.apache.org/jira/browse/HBASE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23831. Resolution: Not A Problem When I try and run this test in a VM on my iMac, it always fails in a couple ways. However, I don't see others with this issue, it doesn't seem to happen on my primary box, and addressing it is not easy to do cleanly; you just have to keep adding more fudge to allow slower envs with fewer cores to handle it. Doesn't really show up so easily in my other faster envs. > TestChoreService is very sensitive to resources. > > > Key: HBASE-23831 > URL: https://issues.apache.org/jira/browse/HBASE-23831 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > > More details following.
[jira] [Resolved] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.
[ https://issues.apache.org/jira/browse/HBASE-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23796. Resolution: Won't Fix This was helpful to me in getting more successful test runs outside of Docker, but in recent days, branch-2 can actually pass for me a decent percentage of runs without this change or Docker, so I'm not sure how valuable anything remaining here is. Hadoop often looks up the hostname automatically regardless of this setting, and that appears very slow in my env when it happens a lot concurrently, so getting runs outside of Docker to perform like runs in Docker doesn't seem easily attainable in my case. In Docker it is. > Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as > well. > --- > > Key: HBASE-23796 > URL: https://issues.apache.org/jira/browse/HBASE-23796 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > This is perhaps controversial, but there are a variety of problems with > counting on dns hostname resolution, especially for localhost. > > # It can often be slow, slow under concurrency, or slow under specific > conditions. > # It can often not work at all - when on a VPN, with weird DNS hijacking > hi-jinks, when you have a real hostname for your machines, a custom /etc/hosts > file, or an OS that runs its own local/funny DNS services. > # This makes coming to HBase for new devs a hit-or-miss experience, and if > you miss, dealing with and diagnosing the issues is a large endeavor and not > straightforward or transparent. > # 99% of the difference doesn't matter in most cases - except that > 127.0.0.1 works and is fast pretty much universally.
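The core of the argument is that a literal IP never touches the resolver, while the name "localhost" may. In plain Java terms (a sketch; the printed address for the name lookup is environment-dependent):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LoopbackLookup {
  public static void main(String[] args) throws UnknownHostException {
    // A literal address is parsed directly; no DNS or hosts-file lookup happens.
    InetAddress literal = InetAddress.getByName("127.0.0.1");
    System.out.println(literal.getHostAddress()); // always 127.0.0.1

    // The name "localhost" goes through the platform resolver, which is where
    // VPNs, DNS hijacking, custom /etc/hosts entries, and per-OS resolver
    // daemons get a chance to make things slow or surprising.
    InetAddress byName = InetAddress.getByName("localhost");
    System.out.println(byName.getHostAddress()); // usually 127.0.0.1 or ::1, but env-dependent
  }
}
```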
[jira] [Resolved] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
[ https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23795. Resolution: Information Provided So I've gone through this pretty thoroughly, though at the time the tests failed on me a lot more than they do currently. If you use a statics checker, clear large statics, reinit the right test and runtime statics, and shut down the additional threads and resources that can remain outstanding, these tests can run in the same JVM about as well as Lucene and Solr tests do, taking advantage of class caching, hotspot, etc. I've seen all the tests run on my 16-core machine in about 30-40 minutes with a new JVM for every test and parallelism lifted to max load, so perhaps 20 minutes would be in sight with reuse. > Enable all tests to be run in parallel on reused JVMs. > -- > > Key: HBASE-23795 > URL: https://issues.apache.org/jira/browse/HBASE-23795 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Major > > I'd like to be able to run HBase tests in under 30-40 minutes on good > parallel hardware. > It will require some small changes / fixes for that wish to come true.
[jira] [Resolved] (HBASE-24332) TestJMXListener.setupBeforeClass can fail due to not getting a random port.
[ https://issues.apache.org/jira/browse/HBASE-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-24332. Resolution: Duplicate > TestJMXListener.setupBeforeClass can fail due to not getting a random port. > --- > > Key: HBASE-24332 > URL: https://issues.apache.org/jira/browse/HBASE-24332 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Minor > > [ERROR] Errors: > [ERROR] TestJMXListener.setupBeforeClass:61 » IO Shutting down
[jira] [Commented] (HBASE-24346) TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters fails too easily.
[ https://issues.apache.org/jira/browse/HBASE-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104724#comment-17104724 ] Mark Robert Miller commented on HBASE-24346: I added another second of sleep and the test at least fails much less often for me. > TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters > fails too easily. > --- > > Key: HBASE-24346 > URL: https://issues.apache.org/jira/browse/HBASE-24346 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Minor >
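A longer sleep narrows the race window without closing it; a condition-polling wait (the same idea as the Waiter.waitFor that HBase tests already use elsewhere) is usually the more robust fix. A generic, self-contained sketch of that pattern:

```java
import java.util.function.BooleanSupplier;

public class PollingWait {
  /** Poll until the condition holds or the timeout elapses; returns whether it held. */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        return false;  // timed out; the caller fails the test with context
      }
      try {
        Thread.sleep(intervalMs);  // back off between checks
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }
}
```

On a fast machine this returns as soon as the condition holds; on a slow one it keeps waiting up to the timeout, so a single bound serves both instead of a tuned-per-machine sleep.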
[jira] [Resolved] (HBASE-23882) Scale *MiniCluster config for the environment it runs in.
[ https://issues.apache.org/jira/browse/HBASE-23882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller resolved HBASE-23882. Resolution: Duplicate > Scale *MiniCluster config for the environment it runs in. > - > > Key: HBASE-23882 > URL: https://issues.apache.org/jira/browse/HBASE-23882 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor >
[jira] [Created] (HBASE-24346) TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters fails too easily.
Mark Robert Miller created HBASE-24346: -- Summary: TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters fails too easily. Key: HBASE-24346 URL: https://issues.apache.org/jira/browse/HBASE-24346 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller
[jira] [Commented] (HBASE-24327) TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail with retries exhausted on an admin call.
[ https://issues.apache.org/jira/browse/HBASE-24327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102559#comment-17102559 ] Mark Robert Miller commented on HBASE-24327: Yeah, fire away. > TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail > with retries exhausted on an admin call. > > > Key: HBASE-24327 > URL: https://issues.apache.org/jira/browse/HBASE-24327 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Minor >
[jira] [Commented] (HBASE-24342) [Flakey Tests] Disable TestClusterPortAssignment.testClusterPortAssignment as it can't pass 100% of the time
[ https://issues.apache.org/jira/browse/HBASE-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102021#comment-17102021 ] Mark Robert Miller commented on HBASE-24342: Nice! Also in my list. > [Flakey Tests] Disable TestClusterPortAssignment.testClusterPortAssignment as > it can't pass 100% of the time > > > Key: HBASE-24342 > URL: https://issues.apache.org/jira/browse/HBASE-24342 > Project: HBase > Issue Type: Bug > Components: flakies, test >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.0 > > > This is a BindException special. We get randomFreePort and then put up the > processes. > {code} > 2020-05-07 00:30:15,844 INFO [Time-limited test] http.HttpServer(1080): > HttpServer.start() threw a non Bind IOException > java.net.BindException: Port in use: 0.0.0.0:59568 > at > org.apache.hadoop.hbase.http.HttpServer.openListeners(HttpServer.java:1146) > at org.apache.hadoop.hbase.http.HttpServer.start(HttpServer.java:1077) > at org.apache.hadoop.hbase.http.InfoServer.start(InfoServer.java:148) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.putUpWebUI(HRegionServer.java:2133) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.(HRegionServer.java:670) > at org.apache.hadoop.hbase.master.HMaster.(HMaster.java:511) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:132) > at > org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:239) > at > org.apache.hadoop.hbase.LocalHBaseCluster.(LocalHBaseCluster.java:181) > at > 
org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:245) > at > org.apache.hadoop.hbase.MiniHBaseCluster.(MiniHBaseCluster.java:115) > at > org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1178) > at > org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1142) > at > org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1106) > at > org.apache.hadoop.hbase.TestClusterPortAssignment.testClusterPortAssignment(TestClusterPortAssignment.java:57) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) > at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) > at > 
org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at
[jira] [Commented] (HBASE-24331) [Flakey Test] TestJMXListener rmi port clash
[ https://issues.apache.org/jira/browse/HBASE-24331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100263#comment-17100263 ] Mark Robert Miller commented on HBASE-24331: Ha, I was dealing with this today as well: HBASE-24332. I went about it slightly differently; I'll put up the PR, though it's not cleaned up and checkstyled yet. > [Flakey Test] TestJMXListener rmi port clash > > > Key: HBASE-24331 > URL: https://issues.apache.org/jira/browse/HBASE-24331 > Project: HBase > Issue Type: Sub-task > Components: flakies, test >Reporter: Michael Stack >Priority: Major > > The TestJMXListener can fail because the random port it wants to put the jmx > listener on is occupied when it goes to run. Handle this case in test startup.
[jira] [Commented] (HBASE-24332) TestJMXListener.setupBeforeClass can fail due to not getting a random port.
[ https://issues.apache.org/jira/browse/HBASE-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100188#comment-17100188 ] Mark Robert Miller commented on HBASE-24332: I think this happens because reuseAddress is not set on the socket used to look for a free port. We can clean up and consolidate some of this port allocation test code. > TestJMXListener.setupBeforeClass can fail due to not getting a random port. > --- > > Key: HBASE-24332 > URL: https://issues.apache.org/jira/browse/HBASE-24332 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Minor > > [ERROR] Errors: > [ERROR] TestJMXListener.setupBeforeClass:61 » IO Shutting down
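The fix described above — setting reuseAddress on the probe socket — looks roughly like this in plain Java. This is a sketch of the technique, not the actual HBase test-utility code:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class RandomPort {
  /** Ask the OS for a free ephemeral port, with SO_REUSEADDR set on the probe socket. */
  static int randomFreePort() throws IOException {
    try (ServerSocket probe = new ServerSocket()) {
      // Without SO_REUSEADDR, the port just probed can linger in TIME_WAIT after
      // this socket closes and then fail the bind by the code that actually
      // wants to use the port.
      probe.setReuseAddress(true);      // must be set before bind()
      probe.bind(new InetSocketAddress(0));  // port 0 = let the OS pick
      return probe.getLocalPort();
    }
  }
}
```

Note the usual caveat with this technique: another process can still grab the port between the probe closing and the real bind, so callers should retry on BindException.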
[jira] [Created] (HBASE-24332) TestJMXListener.setupBeforeClass can fail due to not getting a random port.
Mark Robert Miller created HBASE-24332: -- Summary: TestJMXListener.setupBeforeClass can fail due to not getting a random port. Key: HBASE-24332 URL: https://issues.apache.org/jira/browse/HBASE-24332 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller [ERROR] Errors: [ERROR] TestJMXListener.setupBeforeClass:61 » IO Shutting down
[jira] [Commented] (HBASE-24327) TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail with retries exhausted on an admin call.
[ https://issues.apache.org/jira/browse/HBASE-24327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099572#comment-17099572 ] Mark Robert Miller commented on HBASE-24327: The fail for this is: [ERROR] Errors: [ERROR] TestMasterShutdown.testMasterShutdownBeforeStartingAnyRegionServer:166 ? RetriesExhausted > TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail > with retries exhausted on an admin call. > > > Key: HBASE-24327 > URL: https://issues.apache.org/jira/browse/HBASE-24327 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Minor >
[jira] [Created] (HBASE-24327) TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail with retries exhausted on an admin call.
Mark Robert Miller created HBASE-24327: -- Summary: TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail with retries exhausted on an admin call. Key: HBASE-24327 URL: https://issues.apache.org/jira/browse/HBASE-24327 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24325) TestJMXConnectorServer can fail to start the minicluster due to its port already having been chosen by another test.
Mark Robert Miller created HBASE-24325: -- Summary: TestJMXConnectorServer can fail to start the minicluster due to its port already having been chosen by another test. Key: HBASE-24325 URL: https://issues.apache.org/jira/browse/HBASE-24325 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24185) JUnit tests do not behave well with System.exit or Runtime.halt or JVM exits in general.
Mark Robert Miller created HBASE-24185: -- Summary: JUnit tests do not behave well with System.exit or Runtime.halt or JVM exits in general. Key: HBASE-24185 URL: https://issues.apache.org/jira/browse/HBASE-24185 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller This ends up exiting the JVM and confusing / erroring out the test runner that manages that JVM, as well as cutting off test output files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082353#comment-17082353 ] Mark Robert Miller commented on HBASE-24155: Still doing a little digging before I dump more info. Basically, the more JVMs I run in parallel to make the tests faster, the more often I hit a certain failure, across a large variety of tests, where the test times out. Looking at resource usage, the only thing that seems to approach or exceed limits is the number of connections that end up in TIME_WAIT. It feels like some number of tests is creating a huge number of connections. If I ignore enough of the tests that end up hanging, I can run the remaining 95% of the tests in as many JVMs as I have RAM for. I'm narrowing down which tests are creating the most connections so that I can inspect them a little closer. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. > -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle or perhaps proper reuse. > Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic affect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > In my experience, a much, much smaller number of connections in a test suite > would end up in TIME_WAIT when connection handling is all correct.
> Notes to come in comments below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
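For background on where those TIME_WAIT entries come from: the TCP endpoint that closes first is the one that holds the TIME_WAIT slot, for roughly 2*MSL. A minimal stdlib sketch (hypothetical illustration, not HBase code) of one loopback round trip makes the lifecycle concrete:

```java
import java.net.ServerSocket;
import java.net.Socket;

public class TimeWaitDemo {
    // One loopback round trip. try-with-resources closes in reverse
    // declaration order, so the accepted (server-side) socket closes first,
    // and the SERVER side of this connection lands in TIME_WAIT.
    static int roundTrip() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort());
                 Socket accepted = server.accept()) {
                accepted.getOutputStream().write('x');
                accepted.getOutputStream().flush();
                return client.getInputStream().read();
            }
            // Whichever end sent FIN first sits in TIME_WAIT (roughly 1-4
            // minutes depending on OS), which is exactly the buildup the
            // issue describes when tests churn through connections.
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read byte: " + (char) roundTrip());
    }
}
```

Multiply this by tens of thousands of short-lived connections per test run and the ephemeral port range starts to fill, which is when new connections slow down or fail outright.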
[jira] [Commented] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.
[ https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081129#comment-17081129 ] Mark Robert Miller commented on HBASE-23806: I look at this as not likely something that will be contributed back, but that will demonstrate the upper bounds of how fast tests can run. Once you are running efficiently in parallel on a lot of cores, it takes a little extra to keep all those cores busy vs having a long tail of fewer and fewer cores being utilized. This can make a very large difference, and while you can't always get there with standard test systems, it's good to know how far off you are as well. The order you start the parallel tests in and how you distribute them across JVMs on good hardware can easily halve the test time or more. > Provide a much faster and efficient alternate option to maven and surefire > for running tests. > - > > Key: HBASE-23806 > URL: https://issues.apache.org/jira/browse/HBASE-23806 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Minor > > With HBASE-23795, the hope is to drive tests with maven and surefire much > closer to their potential. > That will still leave a lot of room for improvement. > For those that have some nice hardware and a need for speed, we can blow > right past maven+surefire. -- This message was sent by Atlassian Jira (v8.3.4#803005)
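The "order you start the parallel tests in" point is classic makespan scheduling. A small hedged sketch (illustrative only, with made-up durations): running the longest tests first, each assigned to the currently least-loaded JVM slot, avoids the long tail that a naive start order can produce.

```java
import java.util.Arrays;

public class LptTestScheduler {
    // Longest-processing-time-first: take tests in descending estimated
    // duration and hand each to the currently least-loaded JVM slot.
    // Returns the per-JVM total load; the makespan is the max entry.
    static long[] schedule(long[] testDurations, int jvmCount) {
        long[] load = new long[jvmCount];
        long[] sorted = testDurations.clone();
        Arrays.sort(sorted); // ascending; walked backwards below
        for (int i = sorted.length - 1; i >= 0; i--) {
            int least = 0;
            for (int j = 1; j < jvmCount; j++) {
                if (load[j] < load[least]) least = j;
            }
            load[least] += sorted[i];
        }
        return load;
    }

    public static void main(String[] args) {
        long[] tests = {90, 80, 20, 20, 20, 10}; // hypothetical per-test seconds
        System.out.println(Arrays.toString(schedule(tests, 2)));
        // Both JVM slots finish at 120s here; start the 90s test last instead
        // and one JVM can sit idle while the other grinds through stragglers.
    }
}
```

Real runners only have duration estimates (usually the previous run's timings), so this is a heuristic, but LPT-style ordering is the standard way to shrink the long tail the comment describes.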
[jira] [Updated] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-24155: --- Description: When you run the test suite and monitor the number of connections in TIME_WAIT, it appears that a very large number of connections do not end up with a proper connection close lifecycle or perhaps proper reuse. Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, running the tests faster or with more tests in parallel increases the TIME_WAIT connection buildup. Some tests spin up a very, very large number of connections and if the wrong ones run at the same time, this can also greatly increase the number of connections put into TIME_WAIT. This can have a dramatic affect on performance (as it can take longer to create a new connection) or flat out fail or timeout. In my experience, a much, much smaller number of connections in a test suite would end up in TIME_WAIT when connection handling is all correct. Notes to come in comments below. was: When you run the test suite and monitor the number of connections in TIME_WAIT, it appears that a very large number of connections do not end up with a proper connection close lifecycle. Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, running the tests faster or with more tests in parallel increases the TIME_WAIT connection buildup. Some tests spin up a very, very large number of connections and if the wrong ones run at the same time, this can also greatly increase the number of connections put into TIME_WAIT. This can have a dramatic affect on performance (as it can take longer to create a new connection) or flat out fail or timeout. In my experience, a much, much smaller number of connections in a test suite would end up in TIME_WAIT when connection handling is all correct. Notes to come in comments below. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. 
> -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle or perhaps proper reuse. > Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic affect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > In my experience, a much, much smaller number of connections in a test suite > would end up in TIME_WAIT when connection handling is all correct. > Notes to come in comments below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-24155: --- Description: When you run the test suite and monitor the number of connections in TIME_WAIT, it appears that a very large number of connections do not end up with a proper connection close lifecycle. Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, running the tests faster or with more tests in parallel increases the TIME_WAIT connection buildup. Some tests spin up a very, very large number of connections and if the wrong ones run at the same time, this can also greatly increase the number of connections put into TIME_WAIT. This can have a dramatic affect on performance (as it can take longer to create a new connection) or flat out fail or timeout. In my experience, a much, much smaller number of connections in a test suite would end up in TIME_WAIT when connection handling is all correct. Notes to come in comments below. was: When you run the test suite and monitor the number of connections in TIME_WAIT, it appears that a very large number of connections do not end up with a proper connection close lifecycle. Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, running the tests faster or with more tests in parallel increases the TIME_WAIT connection buildup. Some tests spin up a very, very large number of connections and if the wrong ones run at the same time, this can also greatly increase the number of connections put into TIME_WAIT. This can have a dramatic affect on performance (as it can take longer to create a new connection) or flat out fail or timeout. Ideally, a small proportion of connections in a test suite would end up in TIME_WAIT in comparison to the number created. Notes to come in comments below. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. 
> -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle. > Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic affect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > In my experience, a much, much smaller number of connections in a test suite > would end up in TIME_WAIT when connection handling is all correct. > Notes to come in comments below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079417#comment-17079417 ] Mark Robert Miller commented on HBASE-24155: I've got some notes, comments, questions, but one thought is that, given the number of connections, I'm wondering if this could be such a high problem if there was a lot of connection reuse going on. Some tests appear to make *way* more connections that I'd expect with reuse/pooling. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. > -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle. > Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic affect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > Ideally, a small proportion of connections in a test suite would end up in > TIME_WAIT in comparison to the number created. > Notes to come in comments below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
[ https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079417#comment-17079417 ] Mark Robert Miller edited comment on HBASE-24155 at 4/9/20, 2:34 PM: - I've got some notes, comments, questions, but one thought is that, given the number of connections, I'm wondering if this could be such a high problem if there was a lot of connection reuse going on. Some tests appear to make *way* more connections than I'd expect with reuse/pooling. was (Author: markrmiller): I've got some notes, comments, questions, but one thought is that, given the number of connections, I'm wondering if this could be such a high problem if there was a lot of connection reuse going on. Some tests appear to make *way* more connections that I'd expect with reuse/pooling. > When running the tests, a tremendous number of connections are put into > TIME_WAIT. > -- > > Key: HBASE-24155 > URL: https://issues.apache.org/jira/browse/HBASE-24155 > Project: HBase > Issue Type: Test > Components: test >Reporter: Mark Robert Miller >Priority: Major > > When you run the test suite and monitor the number of connections in > TIME_WAIT, it appears that a very large number of connections do not end up > with a proper connection close lifecycle. > Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, > running the tests faster or with more tests in parallel increases the > TIME_WAIT connection buildup. Some tests spin up a very, very large number of > connections and if the wrong ones run at the same time, this can also greatly > increase the number of connections put into TIME_WAIT. This can have a > dramatic affect on performance (as it can take longer to create a new > connection) or flat out fail or timeout. > Ideally, a small proportion of connections in a test suite would end up in > TIME_WAIT in comparison to the number created. > Notes to come in comments below. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.
Mark Robert Miller created HBASE-24155: -- Summary: When running the tests, a tremendous number of connections are put into TIME_WAIT. Key: HBASE-24155 URL: https://issues.apache.org/jira/browse/HBASE-24155 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller When you run the test suite and monitor the number of connections in TIME_WAIT, it appears that a very large number of connections do not end up with a proper connection close lifecycle. Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, running the tests faster or with more tests in parallel increases the TIME_WAIT connection buildup. Some tests spin up a very, very large number of connections and if the wrong ones run at the same time, this can also greatly increase the number of connections put into TIME_WAIT. This can have a dramatic affect on performance (as it can take longer to create a new connection) or flat out fail or timeout. Ideally, a small proportion of connections in a test suite would end up in TIME_WAIT in comparison to the number created. Notes to come in comments below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24143) [JDK11] Switch default garbage collector from CMS
[ https://issues.apache.org/jira/browse/HBASE-24143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078642#comment-17078642 ] Mark Robert Miller commented on HBASE-24143: Couple notes off the top of my head: * I believe the Lucene project has found that CMS is the best collector performance wise for their test suite (vs G1 and the other classic collectors). I don't think CMS is actually removed until maybe Java 14? Lucene also found a variety of bugs and stability issues in earlier versions of G1 as compared to CMS, probably that was mostly worked out by Java 11 though. * I'm a big fan of pinning these things explicitly so that developer test runs can be a consistent experience across a wider range of envs and devs. The more that is picked up based on java version or env or hardware, the harder it is to deliver a solid developer experience. As long as you provide a way for a dev to override these things, they can still personalize, but checkout and run test experience is more consistent and known. > [JDK11] Switch default garbage collector from CMS > - > > Key: HBASE-24143 > URL: https://issues.apache.org/jira/browse/HBASE-24143 > Project: HBase > Issue Type: Sub-task > Components: scripts >Affects Versions: 3.0.0, 2.3.0 >Reporter: Nick Dimiduk >Priority: Major > > When running HBase tools on the cli, one of the warnings generated is > {noformat} > OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in > version 9.0 and will likely be removed in a future release. > {noformat} > Java9+ use G1GC as the default collector. Maybe we simply omit GC > configurations and use the default settings? Or someone has some suggested > settings we can ship out of the box? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23113) IPC Netty Optimization
[ https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076796#comment-17076796 ] Mark Robert Miller commented on HBASE-23113: One short term option to consider is to only enable it for tests as well. Tests put a very high number of connections into TIME_WAIT - I see a larger issue with it the faster I run tests. I think making that less common is not likely super short term, but more SO_REUSEPORT usage can alleviate it a bit. (lots of sources on this stuff, but here is one [http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html]) > IPC Netty Optimization > -- > > Key: HBASE-23113 > URL: https://issues.apache.org/jira/browse/HBASE-23113 > Project: HBase > Issue Type: Improvement > Components: IPC/RPC >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Minor > Attachments: Decoder.jpeg > > > Netty options in IPC Server/Client optimization: > 1.SO_BACKLOG setting:Two queues are maintained in the Linux system kernel: > syns queue and accept queue. The first is a semi-join queue that saves the > connections to the synrecv state after receiving the client syn. The default > netty is 128,io.netty.util.NetUtil#SOMAXCONN , and then read > /proc/sys/net/core /somaxconn to continue to determine, and then there are > some system level coverage logic.In some scenarios, if the client is far > redundant to the server and the connection is established, it may not be > enough. This value should not be too large, otherwise it will not prevent > SYN-Flood attacks. The current value has been changed to 1024. After setting, > the value set by yourself is equivalent to setting the upper limit because of > the setting of the system and the size of the system. 
If some settings of the > Linux system operation and maintenance are wrong, it can be avoided at the > code level.At present, our Linux level is usually set to 128, and the final > calculation will be set to 128. > 2.WRITE_BUFFER_WATER_MARK setting:After WRITEBUFFERWATERMARK sets the maximum > and minimum Buffer that can be temporarily stored on a connection, isWritable > returns unwritable if the amount of data waiting to be sent for the > connection is greater than the set value. In this way, the client can no > longer send, preventing this amount of continuous backlog, and eventually the > client may hang. If this happens, it is usually caused by slow processing on > the server side. This value can effectively protect the client. At this point > the data was not sent. > 3.SO_REUSEADDR - Port multiplexing (allowing multiple sockets to listen on > the same IP+ port): For time-wait links, it ensures that the server restarts > successfully. In the case where some servers start up very quickly, it can > prevent startup failure. > Netty decoder in IPC Server optimization: > Netty provides a convenient decoding tool class ByteToMessageDecoder, as > shown in the top half of the figure, this class has accumulate bulk unpacking > capability, can read bytes from the socket as much as possible, and then > synchronously call the decode method to decode the business object. And > compose a List. Finally, the traversal traverses the List and submits it to > ChannelPipeline for processing. Here we made a small change, as shown in the > bottom half of the figure, the content to be submitted is changed from a > single command to the entire List, which reduces the number of pipeline > executions and improves throughput. This mode has no advantage in > low-concurrency scenarios, and has a significant performance boost in boost > throughput in high-concurrency scenarios. > !Decoder.jpeg! -- This message was sent by Atlassian Jira (v8.3.4#803005)
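Two of the three options under discussion, SO_REUSEADDR and the accept backlog, also exist on plain `java.net` sockets, so their effect can be sketched without Netty (an illustrative snippet with a hypothetical helper, not the HBase patch; WRITE_BUFFER_WATER_MARK has no `java.net` equivalent, it is Netty's `ChannelOption.WRITE_BUFFER_WATER_MARK`):

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class TunedServerSocket {
    // Hypothetical helper mirroring the two portable options from the issue:
    //  - SO_REUSEADDR: lets a restarted server rebind its port even while old
    //    connections to it still sit in TIME_WAIT.
    //  - backlog: a hint for the accept-queue length; the kernel clamps it to
    //    somaxconn, so raising it in code only helps if the OS limit allows.
    static ServerSocket open(int port, int backlog) throws Exception {
        ServerSocket ss = new ServerSocket();
        ss.setReuseAddress(true); // must precede bind()
        ss.bind(new InetSocketAddress(port), backlog);
        return ss;
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket ss = open(0, 1024)) {
            System.out.println("bound with reuseAddress=" + ss.getReuseAddress());
        }
    }
}
```

This also illustrates the comment's point about pinning values in HBase config: the backlog passed here is only an upper-bound request, so a deployment that relies on it still needs the OS-level somaxconn raised to match.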
[jira] [Comment Edited] (HBASE-23113) IPC Netty Optimization
[ https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076674#comment-17076674 ] Mark Robert Miller edited comment on HBASE-23113 at 4/6/20, 9:13 PM: - I've put up a new pull request and spent a good amount of time looking for any negative results to the tests with these changes - so far I don't see any. * WRITE_BUFFER_WATER_MARK This is a good Netty feature that I think certainly makes sense to expose. It could be disabled by default, but personally I'd see setting it by default as a nice improvement as well. * SO_REUSEADDR This is a useful feature for good restart behavior as mentioned above, often very helpful in tests as much or more than production as they may restart things very quickly and the speed/behavior of that can be hardware/OS dependent. * SO_BACKLOG I don't like increasing queue sizes as a default stance, but this one is known to improve things when connection management life cycle is often off and/or connection reuse is not great, or lots of retries, etc. It's also nice to pin it in HBase config vs relying on changing/varying external defaults. Given what I've seen of HBase, I think raising the default here is a good idea, but again, could default to previous behavior and just allow configuration as well. was (Author: markrmiller): I've put up a new pull request and spent a good amount of time looking for any negative results to the tests with these changes - so far I don't see any. * WRITE_BUFFER_WATER_MARK This is a good Netty feature that I think certainly makes sense to expose. It could be disabled by default, but personally I'd see setting it by default as a nice improvement as well. * SO_REUSEADDR This is a useful feature for good restart behavior as mentioned above, often very helpful in tests as much or more than production as they may restart things very quickly and the speed/behavior of that can be hardware/OS dependent.
* SO_BACKLOG I don't like increasing queue sizes as a default stance, but this one is known to improve things when connection management cycle is often off and/or connection reuse is not great, or lot's of retries, etc. It's also nice to pin it in HBase config vs relying on changing/varying external defaults. Given what I've seen of HBase, I think raising the default here is a good idea, but again, could default to previous behavior and just allow configuration as well. > IPC Netty Optimization > -- > > Key: HBASE-23113 > URL: https://issues.apache.org/jira/browse/HBASE-23113 > Project: HBase > Issue Type: Improvement > Components: IPC/RPC >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Minor > Attachments: Decoder.jpeg > > > Netty options in IPC Server/Client optimization: > 1.SO_BACKLOG setting:Two queues are maintained in the Linux system kernel: > syns queue and accept queue. The first is a semi-join queue that saves the > connections to the synrecv state after receiving the client syn. The default > netty is 128,io.netty.util.NetUtil#SOMAXCONN , and then read > /proc/sys/net/core /somaxconn to continue to determine, and then there are > some system level coverage logic.In some scenarios, if the client is far > redundant to the server and the connection is established, it may not be > enough. This value should not be too large, otherwise it will not prevent > SYN-Flood attacks. The current value has been changed to 1024. After setting, > the value set by yourself is equivalent to setting the upper limit because of > the setting of the system and the size of the system. If some settings of the > Linux system operation and maintenance are wrong, it can be avoided at the > code level.At present, our Linux level is usually set to 128, and the final > calculation will be set to 128. 
> 2.WRITE_BUFFER_WATER_MARK setting:After WRITEBUFFERWATERMARK sets the maximum > and minimum Buffer that can be temporarily stored on a connection, isWritable > returns unwritable if the amount of data waiting to be sent for the > connection is greater than the set value. In this way, the client can no > longer send, preventing this amount of continuous backlog, and eventually the > client may hang. If this happens, it is usually caused by slow processing on > the server side. This value can effectively protect the client. At this point > the data was not sent. > 3.SO_REUSEADDR - Port multiplexing (allowing multiple sockets to listen on > the same IP+ port): For time-wait links, it ensures that the server restarts > successfully. In the case where some servers start up very quickly, it can > prevent startup failure. > Netty decoder in IPC Server optimization: > Netty provides a convenient decoding tool class
[jira] [Commented] (HBASE-23113) IPC Netty Optimization
[ https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076674#comment-17076674 ] Mark Robert Miller commented on HBASE-23113: I've put up a new pull request and spent a good amount of time looking for any negative results to the tests with these changes - so far I don't see any. * WRITE_BUFFER_WATER_MARK This is a good Netty feature that I think certainly makes sense to expose. It could be disabled by default, but personally I'd see setting it by default as a nice improvement as well. * SO_REUSEADDR This is a useful feature for good restart behavior as mentioned above, often very helpful in tests as much or more than production as they may restart things very quickly and the speed/behavior of that can be hardware/OS dependent. * SO_BACKLOG I don't like increasing queue sizes as a default stance, but this one is known to improve things when connection management cycle is often off and/or connection reuse is not great, or lot's of retries, etc. It's also nice to pin it in HBase config vs relying on changing/varying external defaults. Given what I've seen of HBase, I think raising the default here is a good idea, but again, could default to previous behavior and just allow configuration as well. > IPC Netty Optimization > -- > > Key: HBASE-23113 > URL: https://issues.apache.org/jira/browse/HBASE-23113 > Project: HBase > Issue Type: Improvement > Components: IPC/RPC >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Minor > Attachments: Decoder.jpeg > > > Netty options in IPC Server/Client optimization: > 1.SO_BACKLOG setting:Two queues are maintained in the Linux system kernel: > syns queue and accept queue. The first is a semi-join queue that saves the > connections to the synrecv state after receiving the client syn. 
The default > netty is 128,io.netty.util.NetUtil#SOMAXCONN , and then read > /proc/sys/net/core /somaxconn to continue to determine, and then there are > some system level coverage logic.In some scenarios, if the client is far > redundant to the server and the connection is established, it may not be > enough. This value should not be too large, otherwise it will not prevent > SYN-Flood attacks. The current value has been changed to 1024. After setting, > the value set by yourself is equivalent to setting the upper limit because of > the setting of the system and the size of the system. If some settings of the > Linux system operation and maintenance are wrong, it can be avoided at the > code level.At present, our Linux level is usually set to 128, and the final > calculation will be set to 128. > 2.WRITE_BUFFER_WATER_MARK setting:After WRITEBUFFERWATERMARK sets the maximum > and minimum Buffer that can be temporarily stored on a connection, isWritable > returns unwritable if the amount of data waiting to be sent for the > connection is greater than the set value. In this way, the client can no > longer send, preventing this amount of continuous backlog, and eventually the > client may hang. If this happens, it is usually caused by slow processing on > the server side. This value can effectively protect the client. At this point > the data was not sent. > 3.SO_REUSEADDR - Port multiplexing (allowing multiple sockets to listen on > the same IP+ port): For time-wait links, it ensures that the server restarts > successfully. In the case where some servers start up very quickly, it can > prevent startup failure. > Netty decoder in IPC Server optimization: > Netty provides a convenient decoding tool class ByteToMessageDecoder, as > shown in the top half of the figure, this class has accumulate bulk unpacking > capability, can read bytes from the socket as much as possible, and then > synchronously call the decode method to decode the business object. 
And > compose a List. Finally, the traversal traverses the List and submits it to > ChannelPipeline for processing. Here we made a small change, as shown in the > bottom half of the figure, the content to be submitted is changed from a > single command to the entire List, which reduces the number of pipeline > executions and improves throughput. This mode has no advantage in > low-concurrency scenarios, and has a significant performance boost in boost > throughput in high-concurrency scenarios. > !Decoder.jpeg! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23113) IPC Netty Optimization
[ https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074823#comment-17074823 ] Mark Robert Miller commented on HBASE-23113: I've got an updated PR for this I'll submit shortly. {quote}Netty decoder in IPC Server optimization {quote} I think this should probably be spun out into another issue. I took a stab at it a couple months ago, but I ran into some RAM usage changes, I believe, and it may need some additional periodic flush or something. I'll go back and look at that again shortly for another JIRA. > IPC Netty Optimization > -- > > Key: HBASE-23113 > URL: https://issues.apache.org/jira/browse/HBASE-23113 > Project: HBase > Issue Type: Improvement >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Minor > Attachments: Decoder.jpeg > > > Netty options in IPC Server/Client optimization: > 1. SO_BACKLOG setting: The Linux kernel maintains two queues for a listening socket: the SYN queue, which holds half-open connections in the SYN_RECV state after the client's SYN is received, and the accept queue. Netty's default backlog comes from io.netty.util.NetUtil#SOMAXCONN, which reads /proc/sys/net/core/somaxconn (falling back to 128), with some further system-level override logic. When clients greatly outnumber the server and connections arrive in bursts, that default may not be enough; the value should not be too large either, or it weakens protection against SYN-flood attacks. We changed it to 1024. Since the kernel caps the effective backlog at the system somaxconn, the value set in code is really an upper bound, but setting it there guards against a misconfigured system value. Our Linux hosts usually have somaxconn set to 128, so the final computed backlog still ends up as 128. 
> 2. WRITE_BUFFER_WATER_MARK setting: WRITE_BUFFER_WATER_MARK sets high and low water marks for the data that can be buffered on a connection. If the amount of data waiting to be sent on a connection exceeds the high water mark, isWritable returns false, so the client stops sending instead of continuously piling up a backlog that could eventually hang it. When this happens it is usually caused by slow processing on the server side; the water marks effectively protect the client, and the excess data simply has not been sent yet. > 3. SO_REUSEADDR - port reuse (allowing a socket to bind to an IP+port that still has connections in TIME_WAIT): this ensures that a server restart succeeds, preventing bind failures when servers are restarted very quickly. > Netty decoder in IPC Server optimization: > Netty provides a convenient decoding utility class, ByteToMessageDecoder. As shown in the top half of the figure, this class accumulates bytes and unpacks them in bulk: it reads as many bytes from the socket as possible, synchronously calls the decode method to decode business objects, and composes them into a List. It then traverses the List and submits each element to the ChannelPipeline for processing. Here we made a small change, shown in the bottom half of the figure: the content submitted is changed from a single command to the entire List, which reduces the number of pipeline executions and improves throughput. This mode has no advantage in low-concurrency scenarios but gives a significant throughput boost under high concurrency. > !Decoder.jpeg! -- This message was sent by Atlassian Jira (v8.3.4#803005)
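Two of the socket options discussed above have plain-JDK equivalents that illustrate their semantics. This is a hedged sketch using java.net only (the actual HBase patch would set these through Netty's ServerBootstrap and ChannelOption; the class name here is illustrative):

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// SO_BACKLOG maps to the backlog argument of bind(); SO_REUSEADDR maps to
// setReuseAddress(). WRITE_BUFFER_WATER_MARK has no java.net equivalent.
class IpcServerSocketSketch {
    static ServerSocket open(int port) throws Exception {
        ServerSocket ss = new ServerSocket();
        // Allow rebinding while old connections linger in TIME_WAIT,
        // so a fast server restart does not fail on bind.
        ss.setReuseAddress(true);
        // Accept-queue backlog of 1024, matching the value in the issue;
        // the kernel still caps this at /proc/sys/net/core/somaxconn.
        ss.bind(new InetSocketAddress("127.0.0.1", port), 1024);
        return ss;
    }
}
```

Note that setReuseAddress must be called before bind, which is why the unbound ServerSocket constructor is used.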
[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count
[ https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056296#comment-17056296 ] Mark Robert Miller commented on HBASE-23779: Most tests don’t really need more than 1GB of memory - I think that is the larger issue. The ‘limits’ creep that occurs in the simple search for more stable tests often ends up just being a cover for ugly stuff. > Up the default fork count to make builds complete faster; make count relative > to CPU count > -- > > Key: HBASE-23779 > URL: https://issues.apache.org/jira/browse/HBASE-23779 > Project: HBase > Issue Type: Bug > Components: test >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > Attachments: addendum2.patch, test_yetus_934.0.patch > > > Tests take a long time. Our fork counts running all tests are conservative -- > 1 (small) for the first part and 5 for the second part (medium and large). Rather > than hardcoding, we should set the fork count to be relative to machine size. > The suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box. > Looking up at jenkins, it seems like the boxes are 24 cores... at least going > by my random survey. The load reported on a few seems low, though this is not > representative (looking at machine/uptime). > More parallelism will probably mean more test failures. Let me take a look > see. -- This message was sent by Atlassian Jira (v8.3.4#803005)
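The 0.75C rule proposed in the quoted issue can be sketched in a few lines. This is illustrative, not the actual surefire configuration; the class name is made up:

```java
// Fork count relative to machine size: 0.75 * CPU count,
// floored at 1 so small machines still get one fork.
class ForkCountSketch {
    static int forkCount(int cpus) {
        return Math.max(1, (int) (cpus * 0.75));
    }
}
```

On the 24-core Jenkins boxes mentioned in the issue, this would yield 18 forks.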
[jira] [Created] (HBASE-23952) Address thread safety issue with Map used in BufferCallBeforeInitHandler.
Mark Robert Miller created HBASE-23952: -- Summary: Address thread safety issue with Map used in BufferCallBeforeInitHandler. Key: HBASE-23952 URL: https://issues.apache.org/jira/browse/HBASE-23952 Project: HBase Issue Type: Bug Affects Versions: master Reporter: Mark Robert Miller id2Call is a HashMap, but a callback method that accesses it can run via an executor while another method that accesses it can run from a different thread. id2Call should likely be a ConcurrentHashMap to be shared like this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
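A minimal sketch of the fix suggested above, assuming nothing about the actual handler beyond what the report states: the map is touched from two threads, so a plain HashMap can be corrupted, and ConcurrentHashMap gives safe concurrent access. Call and the method names here are stand-ins, not HBase's real types:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Stand-in for HBase's internal call type.
class Call {
    final int id;
    Call(int id) { this.id = id; }
}

class BufferCallSketch {
    // was: private final Map<Integer, Call> id2Call = new HashMap<>();
    private final ConcurrentMap<Integer, Call> id2Call = new ConcurrentHashMap<>();

    void onCallStarted(Call c) { id2Call.put(c.id, c); }       // may run on event-loop thread
    Call onCallFinished(int id) { return id2Call.remove(id); } // may run on executor thread
    int pending() { return id2Call.size(); }
}
```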
[jira] [Updated] (HBASE-23951) Avoid high speed recursion trap in AsyncRequestFutureImpl.
[ https://issues.apache.org/jira/browse/HBASE-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-23951: --- Description: While working on branch-2, I ran into an issue where a retryable error kept occurring and code in AsyncRequestFutureImpl would reduce the backoff wait to 0 and extremely rapidly eat up a lot of thread stack space with recursive retry calls. This little patch stops zeroing out the backoff wait after 3 retries. Chosen kind of arbitrarily, perhaps 5 is the right number, but I find large retry counts tend to hide things, and that has made me default to fairly conservative choices in all my arbitrary number picking. (was: While working on branch-2, I ran into an issue where a retryable error kept occurring and code in ) > Avoid high speed recursion trap in AsyncRequestFutureImpl. > -- > > Key: HBASE-23951 > URL: https://issues.apache.org/jira/browse/HBASE-23951 > Project: HBase > Issue Type: Improvement >Affects Versions: 2.3.0 >Reporter: Mark Robert Miller >Priority: Minor > > While working on branch-2, I ran into an issue where a retryable error kept > occurring and code in AsyncRequestFutureImpl would reduce the backoff wait to > 0 and extremely rapidly eat up a lot of thread stack space with recursive retry > calls. This little patch stops zeroing out the backoff wait after 3 retries. Chosen > kind of arbitrarily, perhaps 5 is the right number, but I find large retry > counts tend to hide things, and that has made me default to fairly > conservative choices in all my arbitrary number picking. -- This message was sent by Atlassian Jira (v8.3.4#803005)
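The shape of the guard described above can be sketched as follows. This is a hedged illustration, not the actual HBase patch; the class name, method signature, and the MAX_FAST_RETRIES constant are all made up for the example:

```java
// Only allow the backoff wait to be reduced to zero for the first few
// retries; after that, fall back to the real backoff so a persistently
// failing retryable call cannot recurse at full speed and eat stack.
class BackoffPolicy {
    private static final int MAX_FAST_RETRIES = 3;

    long backoffMillis(int retryCount, long normalBackoffMillis, boolean fastRetryAllowed) {
        if (fastRetryAllowed && retryCount < MAX_FAST_RETRIES) {
            return 0L; // immediate retry, but only a bounded number of times
        }
        return normalBackoffMillis; // then pay the normal backoff wait
    }
}
```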
[jira] [Updated] (HBASE-23951) Avoid high speed recursion trap in AsyncRequestFutureImpl.
[ https://issues.apache.org/jira/browse/HBASE-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-23951: --- Description: While working on branch-2, I ran into an issue where a retryable error kept occurring and code in > Avoid high speed recursion trap in AsyncRequestFutureImpl. > -- > > Key: HBASE-23951 > URL: https://issues.apache.org/jira/browse/HBASE-23951 > Project: HBase > Issue Type: Improvement >Affects Versions: 2.3.0 >Reporter: Mark Robert Miller >Priority: Minor > > While working on branch-2, I ran into an issue where a retryable error kept > occurring and code in -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23951) Avoid high speed recursion trap in AsyncRequestFutureImpl.
Mark Robert Miller created HBASE-23951: -- Summary: Avoid high speed recursion trap in AsyncRequestFutureImpl. Key: HBASE-23951 URL: https://issues.apache.org/jira/browse/HBASE-23951 Project: HBase Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Mark Robert Miller -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
[ https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048742#comment-17048742 ] Mark Robert Miller commented on HBASE-23795: I think the fail-safe finalizer asking everyone to be polite citizens and close connections may be both failing in its politeness and making a lot of the test situation even worse at best. You have all these extra costs and bad effects on GC with a finalizer, and the number of connections these tests roll through and do not close is very high. I think it's quite easily overwhelmed, and I think it contributes various bad things that probably vary by the garbage collector in place. Timely, proper connection closes and verification of reuse is a very fast, GC-agnostic recipe. Those good citizens could use an assist, so I filed HBASE-23918 Track sensitive resources to ensure they are closed and assist devs in finding leaks. > Enable all tests to be run in parallel on reused JVMs. > -- > > Key: HBASE-23795 > URL: https://issues.apache.org/jira/browse/HBASE-23795 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Major > > I'd like to be able to run HBase tests in under 30-40 minutes on good > parallel hardware. > It will require some small changes / fixes for that wish to come true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23918) Track sensitive resources to ensure they are closed and assist devs in finding leaks.
[ https://issues.apache.org/jira/browse/HBASE-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048672#comment-17048672 ] Mark Robert Miller commented on HBASE-23918: I chose to do this via the Java assert keyword. asserts should not be turned on in production systems, so the tracking costs nothing there. Tests would fail if run without asserts turned on. At the end of the constructor for an Object I needed to track: {code:java} assert ObjectReleaseTracker.track(this); {code} At the end of the close or shutdown method: {code:java} assert ObjectReleaseTracker.release(this); {code} At the end of each test: {code:java} /** * @return null if ok else error message */ String ObjectReleaseTracker.checkEmpty() {code} If checkEmpty finds objects that have not been released, it lists them, shows the stacktrace for the code origin, and then you can fail the test. > Track sensitive resources to ensure they are closed and assist devs in > finding leaks. > - > > Key: HBASE-23918 > URL: https://issues.apache.org/jira/browse/HBASE-23918 > Project: HBase > Issue Type: Improvement >Reporter: Mark Robert Miller >Priority: Major > > Closing some objects is quite critical. Issues with leaks can be quite > slippery and nasty and growy. Maintaining close integrity is an embarrassing > sport for humans. > In the past, those 3 thoughts led me to start tracking objects in tests to > alert of leaks. Even with an alert though, the job of tracking down all of > the leaks just based on what leaked was beyond my skill. If it's beyond the > skill of even one committing dev, that tends to end in trouble. So I added > the stack trace for the origin of the object. Things can still get a bit > tricky to track down in some cases, but now I had the start of a real > solution to all of the whack-a-mole games I spent too much time playing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
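A minimal sketch of what a tracker with the API shown above could look like. This is an assumption-laden illustration, not the actual HBASE-23918 code; a real version might want identity-keyed map semantics rather than equals/hashCode:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// track() and release() always return true so they can run under `assert`
// and cost nothing when assertions are disabled (e.g. in production).
class ObjectReleaseTracker {
    // Tracked object -> exception capturing the stack trace at creation.
    private static final Map<Object, Exception> OBJECTS = new ConcurrentHashMap<>();

    static boolean track(Object o) {
        OBJECTS.put(o, new Exception("creation point")); // records origin stack
        return true;
    }

    static boolean release(Object o) {
        OBJECTS.remove(o);
        return true;
    }

    /** @return null if ok else a message listing leaked objects and origins */
    static String checkEmpty() {
        if (OBJECTS.isEmpty()) {
            return null;
        }
        StringBuilder sb = new StringBuilder("Leaked objects:\n");
        for (Map.Entry<Object, Exception> e : OBJECTS.entrySet()) {
            sb.append(e.getKey().getClass().getName())
              .append(" created at ").append(e.getValue().getStackTrace()[0])
              .append('\n');
        }
        return sb.toString();
    }
}
```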
[jira] [Created] (HBASE-23918) Track sensitive resources to ensure they are closed and assist devs in finding leaks.
Mark Robert Miller created HBASE-23918: -- Summary: Track sensitive resources to ensure they are closed and assist devs in finding leaks. Key: HBASE-23918 URL: https://issues.apache.org/jira/browse/HBASE-23918 Project: HBase Issue Type: Improvement Reporter: Mark Robert Miller Closing some objects is quite critical. Issues with leaks can be quite slippery and nasty and growy. Maintaining close integrity is an embarrassing sport for humans. In the past, those 3 thoughts led me to start tracking objects in tests to alert of leaks. Even with an alert though, the job of tracking down all of the leaks just based on what leaked was beyond my skill. If it's beyond the skill of even one committing dev, that tends to end in trouble. So I added the stack trace for the origin of the object. Things can still get a bit tricky to track down in some cases, but now I had the start of a real solution to all of the whack-a-mole games I spent too much time playing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.
[ https://issues.apache.org/jira/browse/HBASE-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048662#comment-17048662 ] Mark Robert Miller commented on HBASE-23796: I still think this is a good, developer-friendly change, but I'm lowering it on my priority list for a bit. This was helping some major problems I was hitting in the tests, but likely because I had not changed everything to use 127.0.0.1 yet, and so it was giving me more ports to work with, which staved off the large connection leak eating all the ports and allowed tests to run fast and well for longer. > Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as > well. > --- > > Key: HBASE-23796 > URL: https://issues.apache.org/jira/browse/HBASE-23796 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > This is perhaps controversial, but there are a variety of problems with > counting on DNS hostname resolution, especially for localhost. > > # It can often be slow, slow under concurrency, or slow under specific > conditions. > # It can often not work at all - when on a VPN, with weird DNS-hijacking > hi-jinks, when you have a real hostname for your machines, a custom /etc/hosts > file, or an OS that runs its own local/funny DNS server services. > # This makes coming to HBase a hit-or-miss experience for new devs, and if > you miss, dealing with and diagnosing the issues is a large endeavor, neither > straightforward nor transparent. > # 99% of the difference doesn't matter in most cases - except that > 127.0.0.1 works and is fast pretty much universally. -- This message was sent by Atlassian Jira (v8.3.4#803005)
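The core of the argument above can be illustrated with the JDK directly. A hedged sketch (the class name is made up): a dotted-quad literal is parsed without touching the resolver, while the name "localhost" goes through the platform's resolution machinery (hosts file, DNS stack, VPN overrides), which is where the slowness and flakiness come from:

```java
import java.net.InetAddress;

class LoopbackSketch {
    static InetAddress loopback() throws Exception {
        // getByName with an IP-literal argument performs no lookup;
        // getByName("localhost") would be resolver-dependent.
        return InetAddress.getByName("127.0.0.1");
    }
}
```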
[jira] [Updated] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.
[ https://issues.apache.org/jira/browse/HBASE-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-23796: --- Priority: Minor (was: Major) > Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as > well. > --- > > Key: HBASE-23796 > URL: https://issues.apache.org/jira/browse/HBASE-23796 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > This is perhaps controversial, but there are a variety of problems with > counting on DNS hostname resolution, especially for localhost. > > # It can often be slow, slow under concurrency, or slow under specific > conditions. > # It can often not work at all - when on a VPN, with weird DNS-hijacking > hi-jinks, when you have a real hostname for your machines, a custom /etc/hosts > file, or an OS that runs its own local/funny DNS server services. > # This makes coming to HBase a hit-or-miss experience for new devs, and if > you miss, dealing with and diagnosing the issues is a large endeavor, neither > straightforward nor transparent. > # 99% of the difference doesn't matter in most cases - except that > 127.0.0.1 works and is fast pretty much universally. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
[ https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048462#comment-17048462 ] Mark Robert Miller commented on HBASE-23795: So, after getting to know the tests more individually, they did not match up with my experience of trying to run the test suite. So I jumped ahead a bit. I reduced resources and saw improvement. But the test suite still did not make sense. I started shutting down resources and resolving deadlock or deadlock-like situations on shutdown. I removed System.exit-type calls that are not appropriate for JUnit tests and kill JVMs on us. Again, I saw improvements, but the test runs still didn’t make sense. So I started hacking in higher limits to find out why nothing made sense. In the end, lots of connections are fired up and few are closed; unless you run the tests fairly slowly, you will be DOS attacked by them. The worst of the leaks appears to be within the rpc client. With those addressed, these tests are much faster, the rest of the flakies become conquerable, and running tests with more parallel JVMs and even in the same JVM is now not that difficult, though some of the code does make it a little exercise. Good news all around. > Enable all tests to be run in parallel on reused JVMs. > -- > > Key: HBASE-23795 > URL: https://issues.apache.org/jira/browse/HBASE-23795 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Major > > I'd like to be able to run HBase tests in under 30-40 minutes on good > parallel hardware. > It will require some small changes / fixes for that wish to come true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23899) [Flakey Test] Stabilizations and Debug
[ https://issues.apache.org/jira/browse/HBASE-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046501#comment-17046501 ] Mark Robert Miller commented on HBASE-23899: bq. Fix some issues particular where we ran into mismatched filesystem complaint. Great! I had been running into this sometimes and did not know what to make of it. Had to work around it. bq. Removal of unnecessary deletes +1 - unless it releases resources, it just eats time and adds things that can go wrong. bq. manifests as test failing in startup saying master didn't launch Cool, been biting me too. This and the first issue will improve my local test situation a lot without hacky workarounds. bq. TestReplicationStatus I looked at this for a while the other day as well. I added code to try and wait for the right state to show up. I found many possible thread safety issues that could be related and fixed them. And still this one failure would happen. It did seem to become much harder to hit on my 16-core box - I thought it was fixed. But I could still easily reproduce it on my 4- and 8-core boxes. > [Flakey Test] Stabilizations and Debug > -- > > Key: HBASE-23899 > URL: https://issues.apache.org/jira/browse/HBASE-23899 > Project: HBase > Issue Type: Bug > Components: flakies >Reporter: Michael Stack >Priority: Major > > Bunch of test stabilization and extra debug. This set of changes makes it so > local 'test' runs pass about 20% of the time (where before they didn't) when > run on a linux vm and on a mac. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23904) Procedure updating meta and Master shutdown are incompatible: CODE-BUG
[ https://issues.apache.org/jira/browse/HBASE-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046471#comment-17046471 ] Mark Robert Miller commented on HBASE-23904: Oh cool, I've been seeing some stuff like this in tests and didn't know what was expected. bq. The rejected exception is probably because the pool has been shutdown Yeah, the pool is TERMINATED, so it was shut down and that process fully completed: Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 5 bq. So I do not think we should let the procedures to finish when master is going to quit... I agree that this is not a great default behavior if the procedures are likely going to take a while - unless you have a clear, easy choice between shutting down ASAP and shutting down after finishing any outstanding work you might be chewing on, when that work is going to take more than a fairly short time. As a user, if you were going to prevent the procedures from running and do a speedy shutdown, I wouldn't want to see anything about CODE-BUG or even an exception - just a message about how the master is shutting down and procedures have been aborted or skipped. 
> Procedure updating meta and Master shutdown are incompatible: CODE-BUG > -- > > Key: HBASE-23904 > URL: https://issues.apache.org/jira/browse/HBASE-23904 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: Michael Stack >Priority: Major > > Chasing flakies, studying TestMasterAbortWhileMergingTable, I noticed a > failure because > {code:java} > 2020-02-27 00:57:51,702 ERROR [PEWorker-6] > procedure2.ProcedureExecutor(1688): CODE-BUG: Uncaught runtime exception: > pid=14, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, locked=true; > MergeTableRegionsProcedure table=test, > regions=[48c9be922fa4356bfc7fc61b5b0785f3, ef196d5377c5c1d143e9a2a2ea056a9c], > force=false > java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.FutureTask@28b956c7 rejected from > java.util.concurrent.ThreadPoolExecutor@639f20e5[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 5] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134) > at > org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:974) > at > org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:953) > at > org.apache.hadoop.hbase.MetaTableAccessor.multiMutate(MetaTableAccessor.java:1771) > at > org.apache.hadoop.hbase.MetaTableAccessor.mergeRegions(MetaTableAccessor.java:1637) > at > org.apache.hadoop.hbase.master.assignment.RegionStateStore.mergeRegions(RegionStateStore.java:268) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsMerged(AssignmentManager.java:1854) > at > 
org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.updateMetaForMergedRegions(MergeTableRegionsProcedure.java:687) > at > org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:229) > at > org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:77) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1669) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1416) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:79) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1986) > {code} > A few seconds above, as part of the test, we'd stopped Master > {code:java} > 2020-02-27 00:57:51,620 INFO [Time-limited test] > regionserver.HRegionServer(2212): * STOPPING region server > 'rn-hbased-lapp01.rno.exampl.com,36587,1582765058324' * > 2020-02-27 00:57:51,620 INFO [Time-limited test] > regionserver.HRegionServer(2226): STOPPED: Stopping master 0 {code} > The rejected execution damages the merge procedure. It shows as an unhandled > CODE-BUG. > Why we let a runtime exception out when trying to update meta
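The failure mode in the stack trace above can be reproduced minimally: submitting work to an ExecutorService that has already been shut down throws RejectedExecutionException. A hedged sketch (the class name is made up; the real race involves HTable.coprocessorService and the Master's pool):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

class ShutdownRaceSketch {
    static boolean submitAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.shutdown(); // analogous to the Master stopping its pool
        try {
            pool.submit(() -> { });
            return false; // not reached: a shut-down pool rejects new tasks
        } catch (RejectedExecutionException expected) {
            // A stopping Master could catch this path and emit a clean
            // "shutting down, procedure skipped" message instead of CODE-BUG.
            return true;
        }
    }
}
```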
[jira] [Commented] (HBASE-23882) Scale *MiniCluster config for the environment it runs in.
[ https://issues.apache.org/jira/browse/HBASE-23882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042245#comment-17042245 ] Mark Robert Miller commented on HBASE-23882: Here is some initial experimentation with bringing mini cluster settings down to scale. To start, I've been shooting for pretty minimal. I've pulled this out of an experimental branch and so some work may still be in order to find any tests these settings are too low for. I will continue to update this as I fine tune. I also have some other changes I'd like to dig out around thread pool sizing if I can find them again. I'll likely update again early next week. > Scale *MiniCluster config for the environment it runs in. > - > > Key: HBASE-23882 > URL: https://issues.apache.org/jira/browse/HBASE-23882 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23849) Harden small and medium tests for lots of parallel runs with re-used jvms.
[ https://issues.apache.org/jira/browse/HBASE-23849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042204#comment-17042204 ] Mark Robert Miller commented on HBASE-23849: HBASE-23882 will be useful here as well. > Harden small and medium tests for lots of parallel runs with re-used jvms. > -- > > Key: HBASE-23849 > URL: https://issues.apache.org/jira/browse/HBASE-23849 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23882) Scale *MiniCluster config for the environment it runs in.
Mark Robert Miller created HBASE-23882: -- Summary: Scale *MiniCluster config for the environment it runs in. Key: HBASE-23882 URL: https://issues.apache.org/jira/browse/HBASE-23882 Project: HBase Issue Type: Test Reporter: Mark Robert Miller -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
[ https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040572#comment-17040572 ] Mark Robert Miller commented on HBASE-23795: Now that I have spent enough time to understand what is possible here, to kind of zoom out, here is what would need to happen (this can take some time, but is easily done in parts): * Tune HBase and hdfs settings to make sense for the small world we are creating in tests. If you wash a test out with threads, it's really not very realistic at all. This means pinning settings for the JVM so that the right number of threads is created, settings for HBase and hdfs that make sense - handler counts, pool sizes, netty 'stuff' - and scaling down thousands of threads to 200-300 or so. Not many of these threads are running at the same time - the rest is just flooding and eating resources. * Make things close and shut down, and enforce this. There are lots and lots of various leaks currently. Once those are mostly removed, you can start reusing JVMs, but I think there are other benefits to actually closing and stopping your resources explicitly. Not closing or shutting down some things explicitly can have lingering OS ramifications even when creating new JVMs. * Have tests clean up their expensive or config statics and reset sys props, with enforcement. Needed for JVM reuse. * Clean up any heavy resource usage you can't currently control or that doesn't seem to make a lot of sense (I think an ipc thread pool is set to core and max size 200?) That's kind of the core of it; there is a lot of related useful stuff to do, I think. A lot of these tests are reasonably fast now. They could be even faster with a little work, but even without that, they are not that slow. Loading up a JVM, loading up 10-20k classes, warming up JIT, blah blah - that is super costly. Just getting to rerunning tests in the same JVM will be super helpful. 
There is a lot that can be done after that as well, but such a large win - good enough goal for now. Most of these tests don't even need that much RAM. They are just not running with sensible resources. > Enable all tests to be run in parallel on reused JVMs. > -- > > Key: HBASE-23795 > URL: https://issues.apache.org/jira/browse/HBASE-23795 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Major > > I'd like to be able to run HBase tests in under 30-40 minutes on good > parallel hardware. > It will require some small changes / fixes for that wish to come true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
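The "reset sys props, with enforcement" bullet in the comment above can be sketched with a small snapshot/restore helper that a test harness would invoke around each test. This is illustrative only, not HBase's actual harness; the class name is made up:

```java
import java.util.Properties;

// Snapshot System properties before a test and restore them afterwards,
// so JVM reuse does not leak configuration between tests.
class SystemPropertySnapshot {
    private final Properties saved = new Properties();

    void save() {
        saved.clear();
        saved.putAll(System.getProperties()); // copy current entries
    }

    void restore() {
        Properties p = new Properties();
        p.putAll(saved);
        System.setProperties(p); // drop anything the test set or changed
    }
}
```

Enforcement could then be a harness check that the property set after restore matches the snapshot, failing the test on any difference.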
[jira] [Comment Edited] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count
[ https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039560#comment-17039560 ] Mark Robert Miller edited comment on HBASE-23779 at 2/18/20 11:45 PM: -- In the meantime, some ideas on JVM args that may be able to help the situation a little. -XX:-UseContainerSupport - this disables container detection; with the default (container support on), a JVM in docker sizes itself from the container limits rather than the host. The container-aware behavior is what we want here - I have not played with it, but it makes sense to me. -XX:ActiveProcessorCount - fake the active processor count to 1 - when running multiple jvms, you don't want each one to think it should grab 32 gc threads on your 32 hyperthreaded-core machine. I'd much prefer to run 32 JVMs and have 32 gc threads. -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages - given so many JVM startups and shutdowns, some things might help more when there is more JVM reuse - so much time and cost is in creating large-heap jvms and spinning up lots of unused resources and threads that it's hard to win much with good tuning - but huge pages, always a nice win. -XX:+UseParallelGC -XX:TieredStopAtLevel=1 - even when G1 is available, this is the gold standard for test speed. Maybe with -XX:ParallelGCThreads=1 if not using ActiveProcessorCount. Probably not a big advantage when each med/large JVM is spinning up thousands of useless threads that eat ram, heap and os resources, but something, and probably pretty nice for small-test single-jvm runs now. It's also generally best to pin Xms and Xmx vs eating all the resizing cost. I suggest the opposite above given the RAM reqs - to keep the JVMs that don't need it from sucking up so much RAM unnecessarily - but with good reuse, you want to pin them. Most of the tests don't need these huge heaps and limits - I think that having them for the bad apples and outliers just allows the 95% of tests to easily be wasteful and misbehave. * Note: UseContainerSupport is enabled by default - the flag above disables it - so I guess the container-aware behavior should already be in effect. 
was (Author: markrmiller): In the meantime, some some ideas on JVM args that be able to help the situation a little. -XX:-UseContainerSupport - when running in docker, query docker for hardware info, not the host. I have not played with it, but makes sense to me. -XX:ActiveProcessorCount - fake the active processer count to 1 - when running multiple jvms, you don't want each one to think it should grab 32 gc threads on your 32 hyperthreaded core machine. I'd much prefer to run 32 JVMs and have 32 gc threads. -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages - given so much JVM startup and shutdowns, some things might help more when there is more JVM reuse - so much time and cost is in creating large heap jvms and spinning up lots of unused resources and threads that it's hard to win much with good tuning - but huge pages, always a nice win. -XX:+UseParallelGC -XX:TieredStopAtLevel=1 - even when G1 is available, this is gold standard for test speed. Maybe with like -XX:ParallelGCThreads=1 if not using ActiveProcessorCout. Probably not a big advantage when each med/large JVM is spinning up thousands of useless threads that eat ram, heap and os resources, but something, and probably pretty nice for small test single jvm runs now. It's also generally best to pin Xms and Xmx vs eat all the resizing cost. I suggest the opposite above given the RAM reqs - to keep the JVMs that don't need it from sucking up so much RAM unnecessarily - but with good reuse, you want to pin them. Most of the tests don't need these huge heaps and limits - I think that having them for the bad apples and outliers just allows the 95% of tests to easily be wasteful and misbehave. 
> Up the default fork count to make builds complete faster; make count relative > to CPU count > -- > > Key: HBASE-23779 > URL: https://issues.apache.org/jira/browse/HBASE-23779 > Project: HBase > Issue Type: Bug > Components: test >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > Attachments: addendum2.patch, test_yetus_934.0.patch > > > Tests take a long time. Our fork count running all tests is conservative -- > 1 (small) for first part and 5 for second part (medium and large). Rather > than hardcoding we should set the fork count to be relative to machine size. > Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box. > Looking up at jenkins, it seems like the boxes are 24 cores... at least going > by my random survey. The load reported on a few seems low though this is not > representative (looking at machine/uptime). > More parallelism will probably mean more test failure. Let me take a look > see. -- This message was sent by Atlassian Jira (v8.3.4#803005)
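The 0.75C suggestion quoted above is easy to sketch. Surefire's forkCount parameter also accepts the multiplier form directly (e.g. 0.75C, multiplied by the detected core count), so the arithmetic below is only for illustration; treat the `-DforkCount` property as an assumption if the pom binds fork counts to its own properties:

```shell
# Derive a fork count of 0.75C, where C is the number of online CPUs,
# rounding down with a floor of one fork.
cores=$(getconf _NPROCESSORS_ONLN)
forks=$(( cores * 3 / 4 ))
[ "$forks" -ge 1 ] || forks=1
# Surefire can also take the multiplier directly: mvn test -DforkCount=0.75C
echo "mvn test -DforkCount=$forks"
```

On the 24-core Jenkins boxes mentioned in the issue, 0.75C works out to 18 forks.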
[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
[ https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039393#comment-17039393 ] Mark Robert Miller commented on HBASE-23795: I've started making some progress here. With some tuning, these large tests are not so large after all. I have to do some work to make sure they clean up after themselves so that JVMs don't get dirtier and dirtier over time, but all of these tests can run in parallel and much faster than they are now. > Enable all tests to be run in parallel on reused JVMs. > -- > > Key: HBASE-23795 > URL: https://issues.apache.org/jira/browse/HBASE-23795 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Major > > I'd like to be able to run HBase tests in under 30-40 minutes on good > parallel hardware. > It will require some small changes / fixes for that wish to come true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count
[ https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038062#comment-17038062 ] Mark Robert Miller commented on HBASE-23779: I see what's going on here. A lot ;) To some degree Maven is not helping - the equivalent approximation to Gradle's awesome parallel build performance can be a fair bit more expensive at the least. That's just half the equation though. It's really largely small-med or potentially small-med tests masquerading as large or super large tests and/or waiting for a non-CI, less intense option. Expanding limits and trying to baby and isolate the tests has gotten HBase to like a billion tests, which I am both impressed and frustrated with. So. Many. Test classes. Just tossing that many no-op tests at an executor is going to take some time - toss a new 2800MB JVM in between for most of them and it will take a little more :) We could address it all within a few months' time (the basics anyway; really, so much can be done and improved on the basics). I'll convince someone to champion a reverse course and shrink resources and expose the tests to a bit of hell on purpose, for profit and pleasure. There is a lot of hidden flakiness, and hiding it is a valid strategy in these cases - if you can hide it well enough, at least it's a mostly rational signal. But with so many tests and hours of running time, any real stability will be forever elusive, a mirage, or hanging on a dime. You also just pay for it in so many ways, even if it does end up with some success. We can expose these tests to sunlight and it will force us to shape them right up. 
[jira] [Commented] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.
[ https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037448#comment-17037448 ] Mark Robert Miller commented on HBASE-23806: While things have been getting better over the years, build and test running tools in general have not gone very far down the road of good efficiency. In my experience, Gradle easily leads the pack; Maven is trying to match it with similar features, but it's a turbo charger on an Eclipse vs a solid Audi at best. But Gradle is mostly living off what it's done for builds, which is great, but there's no real focus on tests. We can do better for special occasions. > Provide a much faster and efficient alternate option to maven and surefire > for running tests. > - > > Key: HBASE-23806 > URL: https://issues.apache.org/jira/browse/HBASE-23806 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Minor > > With HBASE-23795, the hope is to drive tests with maven and surefire much > closer to their potential. > That will still leave a lot of room for improvement. > For those that have some nice hardware and a need for speed, we can blow > right past maven+surefire. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
[ https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037443#comment-17037443 ] Mark Robert Miller commented on HBASE-23795: I'll start with small and medium tests in HBASE-23849 Harden small and medium tests for lots of parallel runs with re-used jvms. Good way to get some solid, easy progress before the mountain that is large tests. In the meantime, I've been working in parallel on things related to HBASE-23806 Provide a much faster and efficient alternate option to maven and surefire for running tests. Not likely to be shared any time soon, but providing its own benefit to this related issue. > Enable all tests to be run in parallel on reused JVMs. > -- > > Key: HBASE-23795 > URL: https://issues.apache.org/jira/browse/HBASE-23795 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Major > > I'd like to be able to run HBase tests in under 30-40 minutes on good > parallel hardware. > It will require some small changes / fixes for that wish to come true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23849) Harden small and medium tests for lots of parallel runs with re-used jvms.
Mark Robert Miller created HBASE-23849: -- Summary: Harden small and medium tests for lots of parallel runs with re-used jvms. Key: HBASE-23849 URL: https://issues.apache.org/jira/browse/HBASE-23849 Project: HBase Issue Type: Test Reporter: Mark Robert Miller -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23835) TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2.
[ https://issues.apache.org/jira/browse/HBASE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036897#comment-17036897 ] Mark Robert Miller commented on HBASE-23835: I've started to dig into this test - I've played around with speeding it up and hardening it - that led to a couple other little things in other code - I'll put up a pr for this test soon and spin off a JIRA issue or two. > TestFromClientSide3 and subclasses often fail on > testScanAfterDeletingSpecifiedRowV2. > - > > Key: HBASE-23835 > URL: https://issues.apache.org/jira/browse/HBASE-23835 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > > This test method fails a fair amount on me with something like: > TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236 > expected:<3> but was:<2> > I had a hunch that it might be due to interference from other test methods > running first so I tried changing the table name for just this method to be > unique - still fails. > However, when I just run testScanAfterDeletingSpecifiedRowV2 on it's own > without the methods, it does not seem to fail so far. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23829) Get `-PrunSmallTests` passing on JDK11
[ https://issues.apache.org/jira/browse/HBASE-23829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036804#comment-17036804 ] Mark Robert Miller commented on HBASE-23829: Here is some work I have done towards this against small/med tests (so far have not needed/jumped to hadoop 3x): https://github.com/markrmiller/hbase/tree/jdk11 > Get `-PrunSmallTests` passing on JDK11 > -- > > Key: HBASE-23829 > URL: https://issues.apache.org/jira/browse/HBASE-23829 > Project: HBase > Issue Type: Sub-task > Components: test >Reporter: Nick Dimiduk >Priority: Major > > Start with the small tests, shaking out issues identified by the harness. So > far it seems like {{-Dhadoop.profile=3.0}} and > {{-Dhadoop-three.version=3.3.0-SNAPSHOT}} maybe be required. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23839) TestEntityLocks often fails in lower resource envs in testEntityLockTimeout
Mark Robert Miller created HBASE-23839: -- Summary: TestEntityLocks often fails in lower resource envs in testEntityLockTimeout Key: HBASE-23839 URL: https://issues.apache.org/jira/browse/HBASE-23839 Project: HBase Issue Type: Test Reporter: Mark Robert Miller The test waits for something to happen, and if the computer is a little slow, it will fail on line 178. Doing a check against 3x instead of 2x seems to help a lot to start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-23835) TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2.
[ https://issues.apache.org/jira/browse/HBASE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-23835: --- Description: This test method fails a fair amount on me with something like: TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236 expected:<3> but was:<2> I had a hunch that it might be due to interference from other test methods running first so I tried changing the table name for just this method to be unique - still fails. However, when I just run testScanAfterDeletingSpecifiedRowV2 on it's own without the methods, it does not seem to fail so far. was: This test method fails a fair amount on me with something like: TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236 expected:<3> but was:<2> I had a hunch that it might be due to interference from other test methods running first so I tried changing the table name for just this method to be unique - still fails. However, when I just run > TestFromClientSide3 and subclasses often fail on > testScanAfterDeletingSpecifiedRowV2. > - > > Key: HBASE-23835 > URL: https://issues.apache.org/jira/browse/HBASE-23835 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > > This test method fails a fair amount on me with something like: > TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236 > expected:<3> but was:<2> > I had a hunch that it might be due to interference from other test methods > running first so I tried changing the table name for just this method to be > unique - still fails. > However, when I just run testScanAfterDeletingSpecifiedRowV2 on it's own > without the methods, it does not seem to fail so far. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23835) TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2.
Mark Robert Miller created HBASE-23835: -- Summary: TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2. Key: HBASE-23835 URL: https://issues.apache.org/jira/browse/HBASE-23835 Project: HBase Issue Type: Test Affects Versions: master Reporter: Mark Robert Miller This test method fails a fair amount on me with something like: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.
[ https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035886#comment-17035886 ] Mark Robert Miller commented on HBASE-23830: Thanks [~stack]. I see similar fails in these other tests: [ERROR] TestReplicationEndpointWithMultipleAsyncWAL>TestReplicationEndpoint.testInterClusterReplication:231 Waiting timed out after [30,000] msec Failed to replicate all edits, expected = 2500 replicated = 2491 [ERROR] TestReplicationEndpointWithMultipleWAL>TestReplicationEndpoint.testInterClusterReplication:231 Waiting timed out after [30,000] msec Failed to replicate all edits, expected = 2500 replicated = 2440 Will attach some logs for those if this is likely the same issue. > TestReplicationEndpoint appears to fail a lot in my attempts for a clean test > run locally. > -- > > Key: HBASE-23830 > URL: https://issues.apache.org/jira/browse/HBASE-23830 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > Attachments: test_fails.tar.xz > > > This test is failing for me like 30-40% of the time. Fail seems to usually be > as below. I've tried increasing the wait timeout but that does not seem to > help at all. > {code} > [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 105.145 s <<< FAILURE! - in > org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: > 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! 
- > in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication > Time elapsed: 38.725 s <<< FAILURE!java.lang.AssertionError: Waiting timed > out after [30,000] msec Failed to replicate all edits, expected = 2500 > replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23831) TestChoreService is very sensitive to resources.
Mark Robert Miller created HBASE-23831: -- Summary: TestChoreService is very sensitive to resources. Key: HBASE-23831 URL: https://issues.apache.org/jira/browse/HBASE-23831 Project: HBase Issue Type: Test Affects Versions: master Reporter: Mark Robert Miller More details following. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.
[ https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034883#comment-17034883 ] Mark Robert Miller commented on HBASE-23830: I've attached the 13 fail logs that I got out of the last 30 runs on master. > TestReplicationEndpoint appears to fail a lot in my attempts for a clean test > run locally. > -- > > Key: HBASE-23830 > URL: https://issues.apache.org/jira/browse/HBASE-23830 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > Attachments: test_fails.tar.xz > > > This test is failing for me like 30-40% of the time. Fail seems to usually be > as below. I've tried increasing the wait timeout but that does not seem to > help at all. > {code} > [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 105.145 s <<< FAILURE! - in > org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: > 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - > in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication > Time elapsed: 38.725 s <<< FAILURE!java.lang.AssertionError: Waiting timed > out after [30,000] msec Failed to replicate all edits, expected = 2500 > replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.
[ https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-23830: --- Attachment: test_fails.tar.xz > TestReplicationEndpoint appears to fail a lot in my attempts for a clean test > run locally. > -- > > Key: HBASE-23830 > URL: https://issues.apache.org/jira/browse/HBASE-23830 > Project: HBase > Issue Type: Test >Affects Versions: master >Reporter: Mark Robert Miller >Priority: Major > Attachments: test_fails.tar.xz > > > This test is failing for me like 30-40% of the time. Fail seems to usually be > as below. I've tried increasing the wait timeout but that does not seem to > help at all. > {code} > [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 105.145 s <<< FAILURE! - in > org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: > 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - > in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication > Time elapsed: 38.725 s <<< FAILURE!java.lang.AssertionError: Waiting timed > out after [30,000] msec Failed to replicate all edits, expected = 2500 > replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at > org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at > org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.
Mark Robert Miller created HBASE-23830: -- Summary: TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally. Key: HBASE-23830 URL: https://issues.apache.org/jira/browse/HBASE-23830 Project: HBase Issue Type: Test Affects Versions: master Reporter: Mark Robert Miller This test is failing for me like 30-40% of the time. Fail seems to usually be as below. I've tried increasing the wait timeout but that does not seem to help at all. {code} [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication Time elapsed: 38.725 s <<< FAILURE!java.lang.AssertionError: Waiting timed out after [30,000] msec Failed to replicate all edits, expected = 2500 replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count
[ https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034097#comment-17034097 ] Mark Robert Miller commented on HBASE-23779: FYI re: the -T arg: I'm finding it's pretty sensitive to the Maven version - the 3.5.* I seem to get from the yetus Dockerfile in dev-support is crashing all the time, while 3.6.1 has been behaving as I'm used to (my main desktop had 3.6.1 to start). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count
[ https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033283#comment-17033283 ] Mark Robert Miller commented on HBASE-23779: A couple of notes I've noticed: # -T does not seem to work well for downloading deps, at the least. If I don't first do a non -T run to download and make sure I have all deps, I see crashes. # In my experience, a "cannot create native thread" OOM tends to be a RAM issue more often than an open-file-limit issue. You can often help that by not accepting the huge default stack sizes per thread - you rarely need as much as some of the high defaults these days - try a megabyte. # More threads and tests at the same time use more RAM, of course. Another way to help is to pin Xms to something like 256m rather than just setting Xmx - encourage the tests that don't need so much RAM not to claim it to begin with. > Up the default fork count to make builds complete faster; make count relative > to CPU count > -- > > Key: HBASE-23779 > URL: https://issues.apache.org/jira/browse/HBASE-23779 > Project: HBase > Issue Type: Bug > Components: test >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > Attachments: addendum2.patch, test_yetus_934.0.patch > > > Tests take a long time. Our fork count running all tests are conservative -- > 1 (small) for first part and 5 for second part (medium and large). Rather > than hardcoding we should set the fork count to be relative to machine size. > Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box. > Looking up at jenkins, it seems like the boxes are 24 cores... at least going > by my random survey. The load reported on a few seems low though this not > representative (looking at machine/uptime). > More parallelism willl probably mean more test failure. Let me take a look > see. -- This message was sent by Atlassian Jira (v8.3.4#803005)
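On the per-thread stack-size point: besides lowering `-Xss` globally, Java's four-argument `Thread` constructor accepts a stack-size hint in bytes, so code that spawns many threads can request a small stack selectively. A minimal sketch (the class and names are illustrative, and the JVM is free to round or ignore the hint):

```java
public class SmallStackThreads {
    static final long ONE_MB = 1024L * 1024L;

    // The fourth constructor argument is a stack-size hint in bytes; a 1 MB
    // stack is usually plenty and keeps many-thread tests from eating RAM.
    static Thread newWorker(Runnable task, String name) {
        return new Thread(null, task, name, ONE_MB);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = newWorker(
            () -> System.out.println("ran on " + Thread.currentThread().getName()),
            "small-stack-worker");
        t.start();
        t.join();
    }
}
```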
[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
[ https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032867#comment-17032867 ] Mark Robert Miller commented on HBASE-23795: Step one is to get to know the tests very well. This is normally a tall order for a mature distributed systems application, and the scale of the HBase tests is beyond anything I have run into before (>1 hour for all tests in a dev run). Because of this, I filed a few issues above and then kind of put on blinders for a bit. Unfortunately, it doesn't help matters that the tests hate my current DNS, VPN, and OS X environments out of the box. Anyway, I'm getting to know the cast of characters. I'm almost done, and as soon as I am, I will wrap up my PRs for at least two of those issues. 127.0.0.1 is likely a bit longer of a journey to fully complete. I'll have many more issues to file for step 2. > Enable all tests to be run in parallel on reused JVMs. > -- > > Key: HBASE-23795 > URL: https://issues.apache.org/jira/browse/HBASE-23795 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Major > > I'd like to be able to run HBase tests in under 30-40 minutes on good > parallel hardware. > It will require some small changes / fixes for that wish to come true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.
[ https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-23806: --- Description: With HBASE-23795, the hope is to drive tests with maven and surefire much closer to their potential. That will still leave a lot of room for improvement. For those that have some nice hardware and a need for speed, we can blow right past maven+surefire. > Provide a much faster and efficient alternate option to maven and surefire > for running tests. > - > > Key: HBASE-23806 > URL: https://issues.apache.org/jira/browse/HBASE-23806 > Project: HBase > Issue Type: Wish >Reporter: Mark Robert Miller >Priority: Minor > > With HBASE-23795, the hope is to drive tests with maven and surefire much > closer to their potential. > That will still leave a lot of room for improvement. > For those that have some nice hardware and a need for speed, we can blow > right past maven+surefire. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.
Mark Robert Miller created HBASE-23806: -- Summary: Provide a much faster and efficient alternate option to maven and surefire for running tests. Key: HBASE-23806 URL: https://issues.apache.org/jira/browse/HBASE-23806 Project: HBase Issue Type: Wish Reporter: Mark Robert Miller -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count
[ https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030336#comment-17030336 ] Mark Robert Miller commented on HBASE-23779: bq. More parallelism willl probably mean more test failure. Let me take a look see. I'm going to convince you this is a good thing! But maybe not on main branches until it's a bit smooth. > Up the default fork count to make builds complete faster; make count relative > to CPU count > -- > > Key: HBASE-23779 > URL: https://issues.apache.org/jira/browse/HBASE-23779 > Project: HBase > Issue Type: Bug > Components: test >Reporter: Michael Stack >Priority: Major > > Tests take a long time. Our fork count running all tests are conservative -- > 1 (small) for first part and 5 for second part (medium and large). Rather > than hardcoding we should set the fork count to be relative to machine size. > Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box. > Looking up at jenkins, it seems like the boxes are 24 cores... at least going > by my random survey. The load reported on a few seems low though this not > representative (looking at machine/uptime). > More parallelism willl probably mean more test failure. Let me take a look > see. -- This message was sent by Atlassian Jira (v8.3.4#803005)
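The `0.75C` suggestion above uses Surefire's convention of expressing `forkCount` as a multiple of the CPU count. A sketch of the arithmetic - whether Surefire floors or rounds the product is an assumption here, so this is an approximation rather than Surefire's exact implementation:

```java
public class ForkCount {
    // Approximates forkCount=<multiplier>C: multiply the CPU count by the
    // multiplier, truncate, and never go below one fork.
    static int effectiveForkCount(double multiplier, int cpus) {
        return Math.max(1, (int) (multiplier * cpus));
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("forks for 0.75C on this box: " + effectiveForkCount(0.75, cpus));
    }
}
```

On the 24-core Jenkins boxes mentioned above, `0.75C` would work out to roughly 18 forks.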
[jira] [Commented] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.
[ https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030335#comment-17030335 ] Mark Robert Miller commented on HBASE-23783: Thank you Mr Stack! I'll figure out all these checks some day - the whitespace slipped by me. With this committed, it will be easier for me to track down if there are any glaring remaining issues around HBASE-23779. > Address tests writing and reading SSL/Security files in a common location. > -- > > Key: HBASE-23783 > URL: https://issues.apache.org/jira/browse/HBASE-23783 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Assignee: Mark Robert Miller >Priority: Minor > Fix For: 3.0.0, 2.3.0 > > > This is causing me issues with parallel test runs because multiple tests can > write and read the same files in the test-classes directory. Some tests write > files in test-classes instead of their test data directory so that they can > put the files on the classpath. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23794) Consider setting -XX:MaxDirectMemorySize in the root Maven pom.xml file.
[ https://issues.apache.org/jira/browse/HBASE-23794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030018#comment-17030018 ] Mark Robert Miller commented on HBASE-23794: I'm still working out what a good suggested value might be. Very few of the tests need even 1g of the 2g of heap given, so I'm looking at numbers somewhere in between. Largely, it is just nice to be explicit so that all devs and CI envs get the same value. Older HotSpot might default to lower values depending on arch/client/server, more recent HotSpot defaults to Xmx, HotSpot could change again, and other JVMs could do whatever they like. So a lot of the improvement I imagine here is just consistency of the build and knowing the value has been set high enough for the tests. I've run into failures due to this while playing around with giving tests fewer resources - so I'd like to set it high enough to avoid any failures, and also remove the confusion of adjusting Xmx and then running into off-heap allocation failures and that sort of thing. > Consider setting -XX:MaxDirectMemorySize in the root Maven pom.xml file. > > > Key: HBASE-23794 > URL: https://issues.apache.org/jira/browse/HBASE-23794 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > -XX:MaxDirectMemorySize is an artificial governor on how much off heap memory > can be allocated. > It would be nice to specify explicitly because: > # The default can vary by platform / jvm impl - some devs may see random > fails > # It's just a limiter, it won't pre allocate or anything > # A test env should normally ensure a healthy limit as would be done in > production -- This message was sent by Atlassian Jira (v8.3.4#803005)
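For context on what the cap governs: `-XX:MaxDirectMemorySize` limits off-heap allocations such as direct `ByteBuffer`s, and exceeding it throws an `OutOfMemoryError` ("Direct buffer memory") even when the heap is nowhere near `-Xmx`. A small sketch of the kind of allocation it applies to:

```java
import java.nio.ByteBuffer;

public class DirectMemoryProbe {
    // Direct buffers live off-heap; their total size is what
    // -XX:MaxDirectMemorySize caps, independently of -Xmx.
    static ByteBuffer allocate(int bytes) {
        return ByteBuffer.allocateDirect(bytes);
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocate(1024 * 1024);
        System.out.println("allocated " + buf.capacity() + " bytes off-heap, direct=" + buf.isDirect());
    }
}
```

This is why lowering Xmx alone can still produce the off-heap failures described above: the two limits are independent.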
[jira] [Created] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.
Mark Robert Miller created HBASE-23796: -- Summary: Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well. Key: HBASE-23796 URL: https://issues.apache.org/jira/browse/HBASE-23796 Project: HBase Issue Type: Test Reporter: Mark Robert Miller This is perhaps controversial, but there are a variety of problems with counting on DNS hostname resolution, especially for localhost. # It can often be slow, slow under concurrency, or slow under specific conditions. # It can often not work at all - when on a VPN, with weird DNS-hijacking hijinks, when you have a real hostname for your machines, with a custom /etc/hosts file, or when the OS runs its own local/odd DNS services. # This makes coming to HBase a hit-or-miss experience for new devs, and if you miss, dealing with and diagnosing the issues is a large endeavor that is neither straightforward nor transparent. # 99% of the difference doesn't matter in most cases - except that 127.0.0.1 works and is fast pretty much universally. -- This message was sent by Atlassian Jira (v8.3.4#803005)
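One way Java code can sidestep the resolver entirely is `InetAddress.getLoopbackAddress()`, which returns the loopback address without any name lookup, in contrast to `getByName("localhost")`, which goes through resolution and can stall or fail in the VPN and custom-DNS setups described above. A small sketch of the contrast:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LoopbackOverLocalhost {
    // No name resolution involved - returns 127.0.0.1 (or ::1 when IPv6 is preferred).
    static InetAddress loopback() {
        return InetAddress.getLoopbackAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        // getByName consults the resolver stack and can be slow or wrong;
        // getLoopbackAddress never touches it.
        System.out.println("resolver: " + InetAddress.getByName("localhost").getHostAddress());
        System.out.println("loopback: " + loopback().getHostAddress());
    }
}
```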
[jira] [Created] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.
Mark Robert Miller created HBASE-23795: -- Summary: Enable all tests to be run in parallel on reused JVMs. Key: HBASE-23795 URL: https://issues.apache.org/jira/browse/HBASE-23795 Project: HBase Issue Type: Wish Reporter: Mark Robert Miller I'd like to be able to run HBase tests in under 30-40 minutes on good parallel hardware. It will require some small changes / fixes for that wish to come true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23794) Consider setting -XX:MaxDirectMemorySize in the root Maven pom.xml file.
Mark Robert Miller created HBASE-23794: -- Summary: Consider setting -XX:MaxDirectMemorySize in the root Maven pom.xml file. Key: HBASE-23794 URL: https://issues.apache.org/jira/browse/HBASE-23794 Project: HBase Issue Type: Test Reporter: Mark Robert Miller -XX:MaxDirectMemorySize is an artificial governor on how much off heap memory can be allocated. It would be nice to specify explicitly because: # The default can vary by platform / jvm impl - some devs may see random fails # It's just a limiter, it won't pre allocate or anything # A test env should normally ensure a healthy limit as would be done in production -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23787) TestSyncTimeRangeTracker fails quite easily and allocates a very expensive array.
Mark Robert Miller created HBASE-23787: -- Summary: TestSyncTimeRangeTracker fails quite easily and allocates a very expensive array. Key: HBASE-23787 URL: https://issues.apache.org/jira/browse/HBASE-23787 Project: HBase Issue Type: Test Components: test Reporter: Mark Robert Miller I see this test fail a lot in my environments. It also allocates such a large array that it seems particularly memory-wasteful, and the size makes it difficult to get good contention in the test as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.
[ https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029230#comment-17029230 ] Mark Robert Miller edited comment on HBASE-23783 at 2/3/20 8:04 PM: I would also like to add ${surefire.tempDir} To safely run surefire from multiple maven instances, you have to be able to specify a unique tmp directory. Otherwise, removal of the directory on JVM exit can interfere with tmp file creation. was (Author: markrmiller): I also like to add ${surefire.tempDir} To safely run surefire from multiple maven instances, you have to be able to specify a unique tmp directory. > Address tests writing and reading SSL/Security files in a common location. > -- > > Key: HBASE-23783 > URL: https://issues.apache.org/jira/browse/HBASE-23783 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > This is causing me issues with parallel test runs because multiple tests can > write and read the same files in the test-classes directory. Some tests write > files in test-classes instead of their test data directory so that they can > put the files on the classpath. -- This message was sent by Atlassian Jira (v8.3.4#803005)
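A sketch of the unique-temp-directory idea in the comment above: each Maven/JVM instance gets its own directory, so one instance cleaning up on exit cannot interfere with files another is still creating. How the resulting path is wired into `surefire.tempDir` is an assumption here, not the build's actual configuration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PerRunTempDir {
    // Each call yields a fresh, uniquely named directory under the system tmp
    // root, so concurrent runs never share (or delete) each other's tmp files.
    static Path uniqueTempDir(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = uniqueTempDir("surefire-run-");
        // Hypothetical wiring: pass the unique dir to the forked JVMs.
        System.out.println("-Dsurefire.tempDir=" + dir);
    }
}
```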
[jira] [Commented] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.
[ https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029230#comment-17029230 ] Mark Robert Miller commented on HBASE-23783: I also like to add ${surefire.tempDir} To safely run surefire from multiple maven instances, you have to be able to specify a unique tmp directory. > Address tests writing and reading SSL/Security files in a common location. > -- > > Key: HBASE-23783 > URL: https://issues.apache.org/jira/browse/HBASE-23783 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > This is causing me issues with parallel test runs because multiple tests can > write and read the same files in the test-classes directory. Some tests write > files in test-classes instead of their test data directory so that they can > put the files on the classpath. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.
[ https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Robert Miller updated HBASE-23783: --- Description: This is causing me issues with parallel test runs because multiple tests can write and read the same files in the test-classes directory. Some tests write files in test-classes instead of their test data directory so that they can put the files on the classpath. (was: This is causing me issues with parallel test runs.) > Address tests writing and reading SSL/Security files in a common location. > -- > > Key: HBASE-23783 > URL: https://issues.apache.org/jira/browse/HBASE-23783 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > This is causing me issues with parallel test runs because multiple tests can > write and read the same files in the test-classes directory. Some tests write > files in test-classes instead of their test data directory so that they can > put the files on the classpath. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.
[ https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028563#comment-17028563 ] Mark Robert Miller commented on HBASE-23783: I recently switched from Eclipse to IntelliJ - I have a little extra code formatting to clean up. This approach seems to be working better for me. I was creating a new unique subdirectory on the classpath for these files, but in the end it was simpler to just use unique file names and keep the files in the root test-classes directory in the cases where they were already located there. So far this has worked out well; I'm still doing some testing. > Address tests writing and reading SSL/Security files in a common location. > -- > > Key: HBASE-23783 > URL: https://issues.apache.org/jira/browse/HBASE-23783 > Project: HBase > Issue Type: Test >Reporter: Mark Robert Miller >Priority: Minor > > This is causing me issues with parallel test runs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
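The unique-file-name approach described in that comment can be sketched roughly like this (class and file names are illustrative, not the patch's actual code): keep generated SSL/security files in the shared directory, but give each test a randomized name so parallel tests never read or overwrite each other's files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

public class UniqueTestFiles {
    // A random UUID in the name makes collisions between parallel tests
    // writing to the same shared directory effectively impossible.
    static Path uniqueFile(Path dir, String base, String ext) {
        return dir.resolve(base + "-" + UUID.randomUUID() + ext);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the shared test-classes directory.
        Path dir = Files.createTempDirectory("test-classes-");
        Path keystore = uniqueFile(dir, "keystore", ".jks");
        Files.write(keystore, new byte[0]);
        System.out.println("created " + keystore.getFileName());
    }
}
```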
[jira] [Created] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.
Mark Robert Miller created HBASE-23783: -- Summary: Address tests writing and reading SSL/Security files in a common location. Key: HBASE-23783 URL: https://issues.apache.org/jira/browse/HBASE-23783 Project: HBase Issue Type: Test Reporter: Mark Robert Miller This is causing me issues with parallel test runs. -- This message was sent by Atlassian Jira (v8.3.4#803005)