[jira] [Created] (HBASE-24612) Consider allowing a separate EventLoopGroup for accepting new connections.

2020-06-22 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24612:
--

 Summary: Consider allowing a separate EventLoopGroup for accepting 
new connections.
 Key: HBASE-24612
 URL: https://issues.apache.org/jira/browse/HBASE-24612
 Project: HBase
  Issue Type: Improvement
Reporter: Mark Robert Miller


Netty applications often use a separate thread pool for accepting connections 
rather than sharing a single pool between accepting new connections and doing 
the work those connections generate.

It would be interesting to make this separation configurable, so that users 
can experiment with a pool dedicated to accepting new connections.
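
For illustration, a minimal generic Netty sketch of the split being suggested (this is not HBase's actual NettyRpcServer wiring; the group sizes and port are arbitrary): a small EventLoopGroup is passed as the "boss" group that only accepts connections, while a separate "worker" group services the accepted channels.

{code}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class SeparateAcceptorSketch {
  public static void main(String[] args) throws InterruptedException {
    // Dedicated pool for accepting new connections, separate from the pool
    // that does the work on those connections.
    EventLoopGroup acceptorGroup = new NioEventLoopGroup(1);
    EventLoopGroup workerGroup = new NioEventLoopGroup();
    try {
      ServerBootstrap b = new ServerBootstrap();
      b.group(acceptorGroup, workerGroup)
          .channel(NioServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              // request handlers would be added to ch.pipeline() here
            }
          });
      b.bind(16020).sync().channel().closeFuture().sync();
    } finally {
      acceptorGroup.shutdownGracefully();
      workerGroup.shutdownGracefully();
    }
  }
}
{code}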





[jira] [Created] (HBASE-24447) Contribute a Test class that shows some examples for using the Async Client API

2020-05-27 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24447:
--

 Summary: Contribute a Test class that shows some examples for 
using the Async Client API
 Key: HBASE-24447
 URL: https://issues.apache.org/jira/browse/HBASE-24447
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller


Kind of along the lines of 
[https://github.com/apache/hbase/blob/master/hbase-examples/src/main/java/org/apache/hadoop/hbase/client/example/AsyncClientExample.java]

but initially in the form of a test, to make verification and environment setup easy.

This is basically a set of examples of how you can use the CompletableFuture API 
with the Async Client. It can be a little painful to work out from scratch for a 
newcomer, given the expressiveness and size of the CompletableFuture API, but it is 
much easier with some example code to build on or tinker with.
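
Along those lines, here is a rough sketch of the kind of example such a test could hold (not the contributed test itself; it assumes a pre-existing table named "example" with a column family "cf"):

{code}
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.AsyncConnection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AsyncClientSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    TableName table = TableName.valueOf("example"); // assumed existing table
    byte[] cf = Bytes.toBytes("cf");                // assumed column family
    CompletableFuture<String> value =
        ConnectionFactory.createAsyncConnection(conf).thenCompose(conn ->
            // write a cell, read it back, and extract the value without blocking
            conn.getTable(table)
                .put(new Put(Bytes.toBytes("row"))
                    .addColumn(cf, Bytes.toBytes("q"), Bytes.toBytes("v")))
                .thenCompose(ignored -> conn.getTable(table).get(new Get(Bytes.toBytes("row"))))
                .thenApply(result -> Bytes.toString(result.getValue(cf, Bytes.toBytes("q"))))
                .whenComplete((r, e) -> closeQuietly(conn))); // release the connection either way
    System.out.println(value.join()); // block only at the very end, for the demo
  }

  private static void closeQuietly(AsyncConnection conn) {
    try {
      conn.close();
    } catch (Exception e) {
      // ignore for the sketch
    }
  }
}
{code}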





[jira] [Commented] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-05-21 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113749#comment-17113749
 ] 

Mark Robert Miller commented on HBASE-24155:


It took me a bit longer, but I ended up tracking this down a bit further. 
Raising the socket cache size and expiration for HDFS had helped a fair amount, 
but still about 50% as many sockets were being created; a lot of that I tracked 
to *ReplicationSourceWALReader* and its resets while looking for additional data 
to read.
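
For context, the socket cache settings referred to above are the HDFS client's cached-peer settings; a hedged example of raising them on a test Configuration (the key names are assumed from the stock HDFS client, and the values are arbitrary):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SocketCacheTuning {
  public static Configuration tuned() {
    Configuration conf = HBaseConfiguration.create();
    // Keep more cached sockets per client and keep them around longer, so reads
    // reuse existing connections instead of constantly reopening them.
    conf.setInt("dfs.client.socketcache.capacity", 64);        // assumed key, default is small
    conf.setLong("dfs.client.socketcache.expiryMsec", 30000L); // assumed key, default expires quickly
    return conf;
  }
}
{code}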

> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle or perhaps proper reuse.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> In my experience, a much, much smaller number of connections in a test suite 
> would end up in TIME_WAIT when connection handling is all correct.
> Notes to come in comments below.





[jira] [Commented] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.

2020-05-17 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109348#comment-17109348
 ] 

Mark Robert Miller commented on HBASE-23806:


I'm done with my project. I was originally going to try to document and share 
my trail, but it did not really pan out. My intention was that once the test 
suite was seen running at another level, it might be too tempting not to take 
some benefit from that example of what can be done. Really though, the main 
benefit was for me: general knowledge, and understanding how to do a couple of 
things in the code with confidence.

> Provide a much faster and efficient alternate option to maven and surefire 
> for running tests.
> -
>
> Key: HBASE-23806
> URL: https://issues.apache.org/jira/browse/HBASE-23806
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Minor
>
> With HBASE-23795, the hope is to drive tests with maven and surefire much 
> closer to their potential.
> That will still leave a lot of room for improvement.
> For those that have some nice hardware and a need for speed, we can blow 
> right past maven+surefire.





[jira] [Resolved] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.

2020-05-17 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23806.

Resolution: Won't Fix

> Provide a much faster and efficient alternate option to maven and surefire 
> for running tests.
> -
>
> Key: HBASE-23806
> URL: https://issues.apache.org/jira/browse/HBASE-23806
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Minor
>
> With HBASE-23795, the hope is to drive tests with maven and surefire much 
> closer to their potential.
> That will still leave a lot of room for improvement.
> For those that have some nice hardware and a need for speed, we can blow 
> right past maven+surefire.





[jira] [Resolved] (HBASE-23787) TestSyncTimeRangeTracker fails quite easily and allocates a very expensive array.

2020-05-14 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23787.

Resolution: Not A Problem

I think the expensive array may have already been dealt with elsewhere.

> TestSyncTimeRangeTracker fails quite easily and allocates a very expensive 
> array.
> -
>
> Key: HBASE-23787
> URL: https://issues.apache.org/jira/browse/HBASE-23787
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> I see this test fail a lot in my environments. It also uses such a large 
> array that it seems particularly memory wasteful and difficult to get good 
> contention in the test as well.





[jira] [Resolved] (HBASE-23849) Harden small and medium tests for lots of parallel runs with re-used jvms.

2020-05-14 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23849.

Resolution: Won't Fix

Small and medium tests don't actually take too long to get going; mostly you just 
have to deal with some statics issues. I had these working well on master at 
one point, but I have only been looking at branch-2 for a while, so I'm not 
looking to go back to that.

> Harden small and medium tests for lots of parallel runs with re-used jvms.
> --
>
> Key: HBASE-23849
> URL: https://issues.apache.org/jira/browse/HBASE-23849
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Major
>






[jira] [Comment Edited] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-05-13 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106302#comment-17106302
 ] 

Mark Robert Miller edited comment on HBASE-24155 at 5/13/20, 1:27 PM:
--

Man, it took me a long time to finally see a lot of what was going on here.

Mostly this seems to be HDFS short-circuit read socket pooling management, which 
I am not a fan of, with defaults I am even less a fan of. Couple that with some 
HBase fail-and-retry-fast behavior, especially around snapshotting or splitting, 
and the number of potential sockets (many without a proper TCP lifecycle) is just 
one part of the resulting fun. And the number of datanode transfer threads (which 
also use sockets) that can be spun up in these cases is clearly beyond what makes 
sense to me.


was (Author: markrmiller):
Man it took me a long time to finally see a lot of what was going on here.

Mostly just seems to be hdfs short circuit read socket polling management that 
you can just call me not a fan of and with defaults that you can call me an 
anti fan of. Couple that with some hbase fail and fast retry stuff, especially 
in like snapshooting or splitting stuff, and well, the number of potential 
sockets (many without a proper tcp lifecycle) are just one part of the 
resulting fun.

> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle or perhaps proper reuse.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> In my experience, a much, much smaller number of connections in a test suite 
> would end up in TIME_WAIT when connection handling is all correct.
> Notes to come in comments below.





[jira] [Resolved] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-05-13 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-24155.

Resolution: Information Provided

Man, it took me a long time to finally see a lot of what was going on here.

Mostly this seems to be HDFS short-circuit read socket pooling management, which 
I am not a fan of, with defaults I am even less a fan of. Couple that with some 
HBase fail-and-retry-fast behavior, especially around snapshotting or splitting, 
and the number of potential sockets (many without a proper TCP lifecycle) is just 
one part of the resulting fun.

> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle or perhaps proper reuse.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> In my experience, a much, much smaller number of connections in a test suite 
> would end up in TIME_WAIT when connection handling is all correct.
> Notes to come in comments below.





[jira] [Resolved] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.

2020-05-11 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23830.

Resolution: Not A Problem

Don't see this so often anymore.

> TestReplicationEndpoint appears to fail a lot in my attempts for a clean test 
> run locally.
> --
>
> Key: HBASE-23830
> URL: https://issues.apache.org/jira/browse/HBASE-23830
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
> Attachments: test_fails.tar.xz
>
>
> This test is failing for me like 30-40% of the time. Fail seems to usually be 
> as below. I've tried increasing the wait timeout but that does not seem to 
> help at all.
> {code}
> [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 105.145 s <<< FAILURE! - in 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: 
> 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - 
> in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication
>   Time elapsed: 38.725 s  <<< FAILURE!java.lang.AssertionError: Waiting timed 
> out after [30,000] msec Failed to replicate all edits, expected = 2500 
> replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code}





[jira] [Resolved] (HBASE-23918) Track sensitive resources to ensure they are closed and assist devs in finding leaks.

2020-05-11 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23918.

Resolution: Information Provided

So I use a tool like this when I'm tracking things to close - I just cut and 
paste it in when I'm hunting for closeable objects that are either not closed 
or missed due to concurrency, a bug, or whatever.

It's more controversial to use asserts like this in tests permanently, so I'll 
just leave this as the seed of an idea for some kind of automatic shutdown/close 
enforcement option on top of the current HMaster/RegionServer thread checker.
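
A minimal sketch of the kind of tracking tool described (a hypothetical helper, not the code referred to above): record a stack trace when each sensitive resource is created, and report anything still open at teardown.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical leak tracker illustrating the idea, not the actual tool. */
public final class CloseTracker {
  private static final Map<Object, Exception> OPEN = new ConcurrentHashMap<>();

  /** Call where the resource is created; the Exception captures the origin stack trace. */
  public static <T> T register(T resource) {
    OPEN.put(resource, new Exception("created here"));
    return resource;
  }

  /** Call from the resource's close(). */
  public static void unregister(Object resource) {
    OPEN.remove(resource);
  }

  /** Call from test teardown; prints where every leaked resource came from, then fails. */
  public static void assertAllClosed() {
    if (!OPEN.isEmpty()) {
      OPEN.values().forEach(Exception::printStackTrace);
      throw new AssertionError(OPEN.size() + " tracked resource(s) never closed");
    }
  }
}
{code}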

> Track sensitive resources to ensure they are closed and assist devs in 
> finding leaks.
> -
>
> Key: HBASE-23918
> URL: https://issues.apache.org/jira/browse/HBASE-23918
> Project: HBase
>  Issue Type: Improvement
>Reporter: Mark Robert Miller
>Priority: Major
>
> Closing some objects is quite critical. Issues with leaks can be quite 
> slippery and nasty and growy. Maintaining close integrity is an embarrassing 
> sport for humans.
> In the past, those 3 thoughts led me to start tracking objects in tests to 
> alert of leaks. Even with an alert though, the job of tracking down all of 
> the leaks just based on what leaked was beyond my skill. If it's beyond even 
> one devs skill that is committing, that tends to end up trouble. So I added 
> the stack trace for the origin of the object. Things can still get a bit 
> tricky to track down in some cases, but now I had the start of a real 
> solution to all of the whack-a-mole games I spent too much time playing.





[jira] [Comment Edited] (HBASE-23831) TestChoreService is very sensitive to resources.

2020-05-11 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104878#comment-17104878
 ] 

Mark Robert Miller edited comment on HBASE-23831 at 5/11/20, 8:20 PM:
--

When I try to run this test in a VM on my iMac, it always fails in a couple of 
ways. However, I don't see others hitting this issue, it doesn't seem to happen 
on my primary box, and addressing it is not easy to do cleanly; you just have to 
keep adding more fudge so that slower environments with fewer cores can handle 
it. It doesn't really show up so easily in my other, faster environments.


was (Author: markrmiller):
When I try and run this test in a VM on my iMac, it also fails in a couple 
ways. However, I don't see others with this issue and it doesn't seem to happen 
on my primary box and addressing is not easy to do cleanly, you just have to 
keep adding more fudge to allow slower envs with fewer cores to handle it. 
Doesn't really show up so easily in my other faster envs.

> TestChoreService is very sensitive to resources.
> 
>
> Key: HBASE-23831
> URL: https://issues.apache.org/jira/browse/HBASE-23831
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
>
> More details following.





[jira] [Resolved] (HBASE-23831) TestChoreService is very sensitive to resources.

2020-05-11 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23831.

Resolution: Not A Problem

When I try to run this test in a VM on my iMac, it also fails in a couple of 
ways. However, I don't see others hitting this issue, it doesn't seem to happen 
on my primary box, and addressing it is not easy to do cleanly; you just have to 
keep adding more fudge so that slower environments with fewer cores can handle 
it. It doesn't really show up so easily in my other, faster environments.

> TestChoreService is very sensitive to resources.
> 
>
> Key: HBASE-23831
> URL: https://issues.apache.org/jira/browse/HBASE-23831
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
>
> More details following.





[jira] [Resolved] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.

2020-05-11 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23796.

Resolution: Won't Fix

This was helpful to me in getting more successful test runs outside of Docker, 
but in recent days branch-2 can actually pass for me on a decent percentage of 
runs without this change or Docker, so I'm not sure how much value remains here. 
Hadoop frequently looks up the hostname automatically regardless of this setting, 
and that appears very slow in my environment when it happens a lot concurrently, 
so getting runs outside of Docker to perform like runs inside Docker doesn't seem 
easily attainable in my case. Inside Docker it is.

> Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as 
> well.
> ---
>
> Key: HBASE-23796
> URL: https://issues.apache.org/jira/browse/HBASE-23796
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> This is perhaps controversial, but there are a variety of problems with 
> counting on DNS hostname resolution, especially for localhost.
>  
>  # It can often be slow, slow under concurrency, or slow under specific 
> conditions.
>  # It can often not work at all - when on a VPN, with weird DNS hijacking 
> hi-jinks, when you have a real hostname for your machines, a custom /etc/hosts 
> file, or an OS that runs its own local/funny DNS server services.
>  # This makes coming to HBase a hit or miss experience for new devs, and if 
> you miss, dealing with and diagnosing the issues is a large endeavor and not 
> straightforward or transparent.
>  # 99% of the difference doesn't matter in most cases - except that 
> 127.0.0.1 works and is fast pretty much universally.





[jira] [Resolved] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-05-11 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23795.

Resolution: Information Provided

So I've gone through this pretty thoroughly, though at the time the tests failed 
on me a lot more than they do currently.

If you use a statics checker, clear large statics, reinitialize the right test 
and runtime statics, and shut down additional threads and resources that can 
remain outstanding, these tests can run in the same JVM about as well as Lucene 
and Solr tests do, taking advantage of class caching and HotSpot, etc. (see the 
sketch below).

I've seen all the tests run on my 16-core machine in about 30-40 minutes with a 
new JVM for every test and parallelism lifted to max load, so perhaps 20 
minutes would be in sight with reuse.
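
As a rough illustration of the "clear large statics" step (a hypothetical helper, not code from HBase): null out the non-final static reference fields of a finished test class via reflection so that large fixtures can be garbage collected before the JVM is reused.

{code}
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

/** Hypothetical helper for JVM reuse; run it after a test class finishes. */
public final class StaticCleaner {
  public static void clearStatics(Class<?> testClass) throws IllegalAccessException {
    for (Field f : testClass.getDeclaredFields()) {
      if (Modifier.isStatic(f.getModifiers())
          && !Modifier.isFinal(f.getModifiers())
          && !f.getType().isPrimitive()) {
        f.setAccessible(true);
        // Drop the reference so the fixture it points at can be collected
        // before the next test class runs in this JVM.
        f.set(null, null);
      }
    }
  }
}
{code}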

> Enable all tests to be run in parallel on reused JVMs.
> --
>
> Key: HBASE-23795
> URL: https://issues.apache.org/jira/browse/HBASE-23795
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Major
>
> I'd like to be able to run HBase tests in under 30-40 minutes on good 
> parallel hardware.
> It will require some small changes / fixes for that wish to come true.





[jira] [Resolved] (HBASE-24332) TestJMXListener.setupBeforeClass can fail due to not getting a random port.

2020-05-11 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-24332.

Resolution: Duplicate

> TestJMXListener.setupBeforeClass can fail due to not getting a random port.
> ---
>
> Key: HBASE-24332
> URL: https://issues.apache.org/jira/browse/HBASE-24332
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> [ERROR] Errors: 
> [ERROR] TestJMXListener.setupBeforeClass:61 » IO Shutting down





[jira] [Commented] (HBASE-24346) TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters fails too easily.

2020-05-11 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104724#comment-17104724
 ] 

Mark Robert Miller commented on HBASE-24346:


I added another second of sleep and the test at least fails much less often for 
me.

> TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters 
> fails too easily.
> ---
>
> Key: HBASE-24346
> URL: https://issues.apache.org/jira/browse/HBASE-24346
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Minor
>






[jira] [Resolved] (HBASE-23882) Scale *MiniCluster config for the environment it runs in.

2020-05-10 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller resolved HBASE-23882.

Resolution: Duplicate

> Scale *MiniCluster config for the environment it runs in.
> -
>
> Key: HBASE-23882
> URL: https://issues.apache.org/jira/browse/HBASE-23882
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>






[jira] [Created] (HBASE-24346) TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters fails too easily.

2020-05-10 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24346:
--

 Summary: 
TestWALProcedureStoreOnHDFS#testWalAbortOnLowReplicationWithQueuedWriters fails 
too easily.
 Key: HBASE-24346
 URL: https://issues.apache.org/jira/browse/HBASE-24346
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller








[jira] [Commented] (HBASE-24327) TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail with retries exhausted on an admin call.

2020-05-08 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102559#comment-17102559
 ] 

Mark Robert Miller commented on HBASE-24327:


Yeah, fire away.

> TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail 
> with retries exhausted on an admin call.
> 
>
> Key: HBASE-24327
> URL: https://issues.apache.org/jira/browse/HBASE-24327
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Minor
>






[jira] [Commented] (HBASE-24342) [Flakey Tests] Disable TestClusterPortAssignment.testClusterPortAssignment as it can't pass 100% of the time

2020-05-07 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102021#comment-17102021
 ] 

Mark Robert Miller commented on HBASE-24342:


Nice! Also on my list.

> [Flakey Tests] Disable TestClusterPortAssignment.testClusterPortAssignment as 
> it can't pass 100% of the time
> 
>
> Key: HBASE-24342
> URL: https://issues.apache.org/jira/browse/HBASE-24342
> Project: HBase
>  Issue Type: Bug
>  Components: flakies, test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> This is a BindException special. We get randomFreePort and then put up the 
> processes.
> {code}
> 2020-05-07 00:30:15,844 INFO  [Time-limited test] http.HttpServer(1080): 
> HttpServer.start() threw a non Bind IOException
> java.net.BindException: Port in use: 0.0.0.0:59568
>   at 
> org.apache.hadoop.hbase.http.HttpServer.openListeners(HttpServer.java:1146)
>   at org.apache.hadoop.hbase.http.HttpServer.start(HttpServer.java:1077)
>   at org.apache.hadoop.hbase.http.InfoServer.start(InfoServer.java:148)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.putUpWebUI(HRegionServer.java:2133)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.(HRegionServer.java:670)
>   at org.apache.hadoop.hbase.master.HMaster.(HMaster.java:511)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:132)
>   at 
> org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:239)
>   at 
> org.apache.hadoop.hbase.LocalHBaseCluster.(LocalHBaseCluster.java:181)
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:245)
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.(MiniHBaseCluster.java:115)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1178)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1142)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1106)
>   at 
> org.apache.hadoop.hbase.TestClusterPortAssignment.testClusterPortAssignment(TestClusterPortAssignment.java:57)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>   at 
> org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at 

[jira] [Commented] (HBASE-24331) [Flakey Test] TestJMXListener rmi port clash

2020-05-05 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100263#comment-17100263
 ] 

Mark Robert Miller commented on HBASE-24331:


Ha, I was dealing with this today as well. HBASE-24332

I went about it slightly differently; I'll put up the PR, though it's not cleaned 
up and checkstyled yet.

> [Flakey Test] TestJMXListener rmi port clash
> 
>
> Key: HBASE-24331
> URL: https://issues.apache.org/jira/browse/HBASE-24331
> Project: HBase
>  Issue Type: Sub-task
>  Components: flakies, test
>Reporter: Michael Stack
>Priority: Major
>
> The TestJMXListener can fail because the random port it wants to put the jmx 
> listener on is occupied when it goes to run. Handle this case in test startup.





[jira] [Commented] (HBASE-24332) TestJMXListener.setupBeforeClass can fail due to not getting a random port.

2020-05-05 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100188#comment-17100188
 ] 

Mark Robert Miller commented on HBASE-24332:


I think this happens because reuseAddress is not set on the socket used to 
look for a free port. We can clean up and consolidate some of this port 
allocation test code.
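
A hedged sketch of the kind of consolidation suggested (a hypothetical helper, not HBase's existing utility): set reuseAddress on the probing socket before binding, so a port the OS is still holding in TIME_WAIT from an earlier run can still be handed out.

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public final class PortProbe {
  /** Find a currently free port, with SO_REUSEADDR set before the bind. */
  public static int randomFreePort() throws IOException {
    try (ServerSocket ss = new ServerSocket()) {
      ss.setReuseAddress(true);          // must be set before bind() to take effect
      ss.bind(new InetSocketAddress(0)); // port 0 lets the OS pick a free ephemeral port
      return ss.getLocalPort();
    }
  }
}
{code}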

> TestJMXListener.setupBeforeClass can fail due to not getting a random port.
> ---
>
> Key: HBASE-24332
> URL: https://issues.apache.org/jira/browse/HBASE-24332
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> [ERROR] Errors: 
> [ERROR] TestJMXListener.setupBeforeClass:61 » IO Shutting down





[jira] [Created] (HBASE-24332) TestJMXListener.setupBeforeClass can fail due to not getting a random port.

2020-05-05 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24332:
--

 Summary: TestJMXListener.setupBeforeClass can fail due to not 
getting a random port.
 Key: HBASE-24332
 URL: https://issues.apache.org/jira/browse/HBASE-24332
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller


[ERROR] Errors: 
[ERROR] TestJMXListener.setupBeforeClass:61 » IO Shutting down





[jira] [Commented] (HBASE-24327) TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail with retries exhausted on an admin call.

2020-05-05 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099572#comment-17099572
 ] 

Mark Robert Miller commented on HBASE-24327:


The failure for this is:

 

[ERROR] Errors: 
[ERROR] TestMasterShutdown.testMasterShutdownBeforeStartingAnyRegionServer:166 
? RetriesExhausted

> TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail 
> with retries exhausted on an admin call.
> 
>
> Key: HBASE-24327
> URL: https://issues.apache.org/jira/browse/HBASE-24327
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Minor
>






[jira] [Created] (HBASE-24327) TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail with retries exhausted on an admin call.

2020-05-05 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24327:
--

 Summary: 
TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer can fail 
with retries exhausted on an admin call.
 Key: HBASE-24327
 URL: https://issues.apache.org/jira/browse/HBASE-24327
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller








[jira] [Created] (HBASE-24325) TestJMXConnectorServer can fail to start the minicluster due to it's port already having been chosen by another test.

2020-05-04 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24325:
--

 Summary: TestJMXConnectorServer can fail to start the minicluster 
due to it's port already having been chosen by another test.
 Key: HBASE-24325
 URL: https://issues.apache.org/jira/browse/HBASE-24325
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller








[jira] [Created] (HBASE-24185) Junit tests do not behave well with System.exit or Runtime.halt or JVM exits in general.

2020-04-14 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24185:
--

 Summary: Junit tests do not behave well with System.exit or 
Runtime.halt or JVM exits in general.
 Key: HBASE-24185
 URL: https://issues.apache.org/jira/browse/HBASE-24185
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller


Such exits end up killing the JVM, confusing or erroring out the test runner that 
manages that JVM, and cutting off test output files.
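
One common approach from this Java 8-11 era (a sketch only, not necessarily how HBase's SystemExitRule works) is a SecurityManager that turns exit attempts into exceptions, so the forked JVM and its output files survive:

{code}
import java.security.Permission;

/** Hypothetical guard for tests; turns JVM-killing calls into test failures. */
public final class NoExitSecurityManager extends SecurityManager {
  @Override
  public void checkExit(int status) {
    // Both System.exit and Runtime.halt route through here.
    throw new SecurityException("System.exit(" + status + ") called from a test");
  }

  @Override
  public void checkPermission(Permission perm) {
    // Allow everything else; we only care about exit/halt.
  }

  public static void install() {
    System.setSecurityManager(new NoExitSecurityManager());
  }
}
{code}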





[jira] [Commented] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-04-13 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082353#comment-17082353
 ] 

Mark Robert Miller commented on HBASE-24155:


Still doing a little digging before I dump more info.

Basically, the more JVMs I run in parallel to make the tests faster, the more 
I hit a particular failure across a large variety of tests, where the test times 
out.

Looking at resource usage, the only thing that seems to approach or exceed 
limits is the number of connections that end up in TIME_WAIT. It feels like 
some number of tests is creating a huge number of connections. If I ignore 
enough of the tests that end up hanging, I can run the remaining 95% of the 
tests in as many JVMs as I have RAM for. I'm narrowing down which tests are 
creating the most connections so that I can inspect them a little closer.

> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle or perhaps proper reuse.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> In my experience, a much, much smaller number of connections in a test suite 
> would end up in TIME_WAIT when connection handling is all correct.
> Notes to come in comments below.





[jira] [Commented] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.

2020-04-10 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081129#comment-17081129
 ] 

Mark Robert Miller commented on HBASE-23806:


I look at this as not likely something that will be contributed back, but that 
will demonstrate the upper bounds of how fast tests can run.

Once you are running efficiently in parallel on a lot of cores, it takes a 
little extra to keep all of those cores busy versus having a long tail where 
fewer and fewer cores are utilized. This can make a very large difference, and 
while you can't always get there with standard test systems, it's good to know 
how far off you are.

The order you start the parallel tests in and how you distribute them across 
JVMs on good hardware can easily halve the test time or better.

> Provide a much faster and efficient alternate option to maven and surefire 
> for running tests.
> -
>
> Key: HBASE-23806
> URL: https://issues.apache.org/jira/browse/HBASE-23806
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Minor
>
> With HBASE-23795, the hope is to drive tests with maven and surefire much 
> closer to their potential.
> That will still leave a lot of room for improvement.
> For those that have some nice hardware and a need for speed, we can blow 
> right past maven+surefire.





[jira] [Updated] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-04-09 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-24155:
---
Description: 
When you run the test suite and monitor the number of connections in TIME_WAIT, 
it appears that a very large number of connections do not end up with a proper 
connection close lifecycle or perhaps proper reuse.

Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
running the tests faster or with more tests in parallel increases the TIME_WAIT 
connection buildup. Some tests spin up a very, very large number of connections 
and if the wrong ones run at the same time, this can also greatly increase the 
number of connections put into TIME_WAIT. This can have a dramatic effect on 
performance (as it can take longer to create a new connection) or flat out fail 
or timeout.

In my experience, a much, much smaller number of connections in a test suite 
would end up in TIME_WAIT when connection handling is all correct.

Notes to come in comments below.

  was:
When you run the test suite and monitor the number of connections in TIME_WAIT, 
it appears that a very large number of connections do not end up with a proper 
connection close lifecycle.

Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
running the tests faster or with more tests in parallel increases the TIME_WAIT 
connection buildup. Some tests spin up a very, very large number of connections 
and if the wrong ones run at the same time, this can also greatly increase the 
number of connections put into TIME_WAIT. This can have a dramatic affect on 
performance (as it can take longer to create a new connection) or flat out fail 
or timeout.

In my experience, a much, much smaller number of connections in a test suite 
would end up in TIME_WAIT when connection handling is all correct.

Notes to come in comments below.


> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle or perhaps proper reuse.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> In my experience, a much, much smaller number of connections in a test suite 
> would end up in TIME_WAIT when connection handling is all correct.
> Notes to come in comments below.





[jira] [Updated] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-04-09 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-24155:
---
Description: 
When you run the test suite and monitor the number of connections in TIME_WAIT, 
it appears that a very large number of connections do not end up with a proper 
connection close lifecycle.

Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
running the tests faster or with more tests in parallel increases the TIME_WAIT 
connection buildup. Some tests spin up a very, very large number of connections 
and if the wrong ones run at the same time, this can also greatly increase the 
number of connections put into TIME_WAIT. This can have a dramatic effect on 
performance (as it can take longer to create a new connection) or flat out fail 
or timeout.

In my experience, a much, much smaller number of connections in a test suite 
would end up in TIME_WAIT when connection handling is all correct.

Notes to come in comments below.

  was:
When you run the test suite and monitor the number of connections in TIME_WAIT, 
it appears that a very large number of connections do not end up with a proper 
connection close lifecycle.

Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
running the tests faster or with more tests in parallel increases the TIME_WAIT 
connection buildup. Some tests spin up a very, very large number of connections 
and if the wrong ones run at the same time, this can also greatly increase the 
number of connections put into TIME_WAIT. This can have a dramatic affect on 
performance (as it can take longer to create a new connection) or flat out fail 
or timeout.

Ideally, a small proportion of connections in a test suite would end up in 
TIME_WAIT in comparison to the number created.

Notes to come in comments below.


> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> In my experience, a much, much smaller number of connections in a test suite 
> would end up in TIME_WAIT when connection handling is all correct.
> Notes to come in comments below.





[jira] [Commented] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-04-09 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079417#comment-17079417
 ] 

Mark Robert Miller commented on HBASE-24155:


I've got some notes, comments, questions, but one thought is that, given the 
number of connections, I'm wondering if this could be such a high problem if 
there was a lot of connection reuse going on. Some tests appear to make *way* 
more connections that I'd expect with reuse/pooling.

> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> Ideally, a small proportion of connections in a test suite would end up in 
> TIME_WAIT in comparison to the number created.
> Notes to come in comments below.





[jira] [Comment Edited] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-04-09 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079417#comment-17079417
 ] 

Mark Robert Miller edited comment on HBASE-24155 at 4/9/20, 2:34 PM:
-

I've got some notes, comments, questions, but one thought is that, given the 
number of connections, I'm wondering if this could be such a high problem if 
there was a lot of connection reuse going on. Some tests appear to make *way* 
more connections than I'd expect with reuse/pooling.


was (Author: markrmiller):
I've got some notes, comments, questions, but one thought is that, given the 
number of connections, I'm wondering if this could be such a high problem if 
there was a lot of connection reuse going on. Some tests appear to make *way* 
more connections that I'd expect with reuse/pooling.

> When running the tests, a tremendous number of connections are put into 
> TIME_WAIT.
> --
>
> Key: HBASE-24155
> URL: https://issues.apache.org/jira/browse/HBASE-24155
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Mark Robert Miller
>Priority: Major
>
> When you run the test suite and monitor the number of connections in 
> TIME_WAIT, it appears that a very large number of connections do not end up 
> with a proper connection close lifecycle.
> Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
> running the tests faster or with more tests in parallel increases the 
> TIME_WAIT connection buildup. Some tests spin up a very, very large number of 
> connections and if the wrong ones run at the same time, this can also greatly 
> increase the number of connections put into TIME_WAIT. This can have a 
> dramatic effect on performance (as it can take longer to create a new 
> connection) or flat out fail or timeout.
> Ideally, a small proportion of connections in a test suite would end up in 
> TIME_WAIT in comparison to the number created.
> Notes to come in comments below.





[jira] [Created] (HBASE-24155) When running the tests, a tremendous number of connections are put into TIME_WAIT.

2020-04-09 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-24155:
--

 Summary: When running the tests, a tremendous number of 
connections are put into TIME_WAIT.
 Key: HBASE-24155
 URL: https://issues.apache.org/jira/browse/HBASE-24155
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller


When you run the test suite and monitor the number of connections in TIME_WAIT, 
it appears that a very large number of connections do not end up with a proper 
connection close lifecycle.

Given connections can stay in TIME_WAIT from 1-4 minutes depending on OS/Env, 
running the tests faster or with more tests in parallel increases the TIME_WAIT 
connection buildup. Some tests spin up a very, very large number of connections 
and if the wrong ones run at the same time, this can also greatly increase the 
number of connections put into TIME_WAIT. This can have a dramatic effect on 
performance (as it can take longer to create a new connection) or flat out fail 
or timeout.

Ideally, a small proportion of connections in a test suite would end up in 
TIME_WAIT in comparison to the number created.

Notes to come in comments below.





[jira] [Commented] (HBASE-24143) [JDK11] Switch default garbage collector from CMS

2020-04-08 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078642#comment-17078642
 ] 

Mark Robert Miller commented on HBASE-24143:


A couple of notes off the top of my head:
 * I believe the Lucene project has found that CMS is the best collector 
performance-wise for their test suite (vs G1 and the other classic collectors). 
I don't think CMS is actually removed until Java 14. Lucene also found a 
variety of bugs and stability issues in earlier versions of G1 as compared to 
CMS, though that was probably mostly worked out by Java 11.
 * I'm a big fan of pinning these things explicitly so that developer test runs 
are a consistent experience across a wider range of environments and devs. The 
more that is picked up based on Java version, environment, or hardware, the 
harder it is to deliver a solid developer experience. As long as you provide a 
way for a dev to override these things, they can still personalize, but the 
checkout-and-run-tests experience is more consistent and known.

> [JDK11] Switch default garbage collector from CMS
> -
>
> Key: HBASE-24143
> URL: https://issues.apache.org/jira/browse/HBASE-24143
> Project: HBase
>  Issue Type: Sub-task
>  Components: scripts
>Affects Versions: 3.0.0, 2.3.0
>Reporter: Nick Dimiduk
>Priority: Major
>
> When running HBase tools on the cli, one of the warnings generated is
> {noformat}
> OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
> version 9.0 and will likely be removed in a future release.
> {noformat}
> Java9+ use G1GC as the default collector. Maybe we simply omit GC 
> configurations and use the default settings? Or someone has some suggested 
> settings we can ship out of the box?





[jira] [Commented] (HBASE-23113) IPC Netty Optimization

2020-04-06 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076796#comment-17076796
 ] 

Mark Robert Miller commented on HBASE-23113:


One short-term option to consider is to enable it only for tests as well.

Tests put a very high number of connections into TIME_WAIT - I see a larger 
issue with it the faster I run tests. I don't think making that less common is 
likely in the super short term, but more SO_REUSEPORT usage can alleviate it a 
bit. (There are lots of sources on this, but here is one: 
[http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html])
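
For reference, the server-side options discussed on this issue look roughly like this on a Netty ServerBootstrap (values are illustrative, not HBase's defaults; SO_REUSEPORT itself is only exposed through Netty's epoll transport, so the portable SO_REUSEADDR is shown here):

{code}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class IpcServerOptionsSketch {
  public static ServerBootstrap configure() {
    ServerBootstrap b = new ServerBootstrap();
    b.group(new NioEventLoopGroup(1), new NioEventLoopGroup())
        .channel(NioServerSocketChannel.class)
        // Larger accept queue so bursts of (re)connections are not refused.
        .option(ChannelOption.SO_BACKLOG, 1024)
        // Lets the listener rebind quickly even while old sockets sit in TIME_WAIT.
        .option(ChannelOption.SO_REUSEADDR, true)
        // Per-connection outbound buffer bounds; isWritable() flips at these marks.
        .childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
            new WriteBufferWaterMark(32 * 1024, 64 * 1024));
    return b;
  }
}
{code}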

> IPC Netty Optimization
> --
>
> Key: HBASE-23113
> URL: https://issues.apache.org/jira/browse/HBASE-23113
> Project: HBase
>  Issue Type: Improvement
>  Components: IPC/RPC
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Minor
> Attachments: Decoder.jpeg
>
>
> Netty options in IPC Server/Client optimization:
> 1.SO_BACKLOG setting:Two queues are maintained in the Linux system kernel: 
> syns queue and accept queue. The first is a semi-join queue that saves the 
> connections to the synrecv state after receiving the client syn. The default 
> netty is 128 (io.netty.util.NetUtil#SOMAXCONN), and then 
> /proc/sys/net/core/somaxconn is read to continue to determine it, and then there is 
> some system-level coverage logic. In some scenarios, if the client is far 
> redundant to the server and the connection is established, it may not be 
> enough. This value should not be too large, otherwise it will not prevent 
> SYN-Flood attacks. The current value has been changed to 1024. After setting, 
> the value set by yourself is equivalent to setting the upper limit because of 
> the setting of the system and the size of the system. If some settings of the 
> Linux system operation and maintenance are wrong, it can be avoided at the 
> code level.At present, our Linux level is usually set to 128, and the final 
> calculation will be set to 128.
> 2.WRITE_BUFFER_WATER_MARK setting:After WRITEBUFFERWATERMARK sets the maximum 
> and minimum Buffer that can be temporarily stored on a connection, isWritable 
> returns unwritable if the amount of data waiting to be sent for the 
> connection is greater than the set value. In this way, the client can no 
> longer send, preventing this amount of continuous backlog, and eventually the 
> client may hang. If this happens, it is usually caused by slow processing on 
> the server side. This value can effectively protect the client. At this point 
> the data was not sent.
> 3.SO_REUSEADDR - Port multiplexing (allowing multiple sockets to listen on 
> the same IP+ port): For time-wait links, it ensures that the server restarts 
> successfully. In the case where some servers start up very quickly, it can 
> prevent startup failure.
> Netty decoder in IPC Server optimization:
> Netty provides a convenient decoding tool class ByteToMessageDecoder, as 
> shown in the top half of the figure, this class has accumulate bulk unpacking 
> capability, can read bytes from the socket as much as possible, and then 
> synchronously call the decode method to decode the business object. And 
> compose a List. Finally, the traversal traverses the List and submits it to 
> ChannelPipeline for processing. Here we made a small change, as shown in the 
> bottom half of the figure, the content to be submitted is changed from a 
> single command to the entire List, which reduces the number of pipeline 
> executions and improves throughput. This mode has no advantage in 
> low-concurrency scenarios, and has a significant performance boost in boost 
> throughput in high-concurrency scenarios.
>  !Decoder.jpeg! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23113) IPC Netty Optimization

2020-04-06 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076674#comment-17076674
 ] 

Mark Robert Miller edited comment on HBASE-23113 at 4/6/20, 9:13 PM:
-

I've put up a new pull request and spent a good amount of time looking for any 
negative results in the tests with these changes - so far I don't see any.

 

* WRITE_BUFFER_WATER_MARK 

This is a good Netty feature that I think certainly makes sense to expose. It 
could be disabled by default, but personally I'd see setting it by default as a 
nice improvement as well.

 

* SO_REUSEADDR 

This is a useful feature for good restart behavior, as mentioned above. It is 
often as helpful in tests as in production, or more so, since tests may restart 
things very quickly and the speed/behavior of that can be hardware/OS dependent.

 

* SO_BACKLOG

I don't like increasing queue sizes as a default stance, but this one is known 
to improve things when the connection management life cycle is often off and/or 
connection reuse is not great, or there are lots of retries, etc. It's also 
nice to pin it in HBase config vs relying on changing/varying external 
defaults. Given what I've seen of HBase, I think raising the default here is a 
good idea, but again, it could default to the previous behavior and just allow 
configuration as well.
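
As a rough sketch of how these three options could be wired on the server 
bootstrap (the option names are Netty's; the specific water-mark and backlog 
values below are illustrative, not the ones in the pull request, and HBase 
really uses the shaded Netty packages):
{code:java}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public class IpcServerOptionsSketch {
  public static void apply(ServerBootstrap bootstrap) {
    // Per-connection outbound buffer limits: isWritable() flips to false above
    // the high mark so callers can stop writing instead of piling up data.
    bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(32 * 1024, 64 * 1024));
    // Let the server rebind promptly on restart even with sockets in TIME_WAIT.
    bootstrap.option(ChannelOption.SO_REUSEADDR, true);
    // Pin the accept queue length here instead of relying on external defaults.
    bootstrap.option(ChannelOption.SO_BACKLOG, 1024);
  }
}
{code}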

 


was (Author: markrmiller):
I've put up a new pull request and spent a good amount of time looking for any 
negative results to the tests with these changes - so far I don't see any.

 

* WRITE_BUFFER_WATER_MARK 

This is a good Netty feature that I think certainly makes sense to expose. It 
could be disabled by default, but personally I'd see setting it by default as a 
nice improvement as well.

 

* SO_REUSEADDR 

This is a useful feature for good restart behavior as mentioned above, often 
very helpful in tests as much or more than production as they may restart 
things very quickly and the speed/behavior of that can be hardware/OS dependent.

 

* SO_BACKLOG

I don't like increasing queue sizes as a default stance, but this one is known 
to improve things when connection management cycle is often off and/or 
connection reuse is not great, or lot's of retries, etc. It's also nice to pin 
it in HBase config vs relying on changing/varying external defaults. Given what 
I've seen of HBase, I think raising the default here is a good idea, but again, 
could default to previous behavior and just allow configuration as well.

 

> IPC Netty Optimization
> --
>
> Key: HBASE-23113
> URL: https://issues.apache.org/jira/browse/HBASE-23113
> Project: HBase
>  Issue Type: Improvement
>  Components: IPC/RPC
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Minor
> Attachments: Decoder.jpeg
>
>
> Netty options in IPC Server/Client optimization:
> 1.SO_BACKLOG setting:Two queues are maintained in the Linux system kernel: 
> syns queue and accept queue. The first is a semi-join queue that saves the 
> connections to the synrecv state after receiving the client syn. The default 
> netty is 128,io.netty.util.NetUtil#SOMAXCONN , and then read 
> /proc/sys/net/core /somaxconn to continue to determine, and then there are 
> some system level coverage logic.In some scenarios, if the client is far 
> redundant to the server and the connection is established, it may not be 
> enough. This value should not be too large, otherwise it will not prevent 
> SYN-Flood attacks. The current value has been changed to 1024. After setting, 
> the value set by yourself is equivalent to setting the upper limit because of 
> the setting of the system and the size of the system. If some settings of the 
> Linux system operation and maintenance are wrong, it can be avoided at the 
> code level.At present, our Linux level is usually set to 128, and the final 
> calculation will be set to 128.
> 2.WRITE_BUFFER_WATER_MARK setting:After WRITEBUFFERWATERMARK sets the maximum 
> and minimum Buffer that can be temporarily stored on a connection, isWritable 
> returns unwritable if the amount of data waiting to be sent for the 
> connection is greater than the set value. In this way, the client can no 
> longer send, preventing this amount of continuous backlog, and eventually the 
> client may hang. If this happens, it is usually caused by slow processing on 
> the server side. This value can effectively protect the client. At this point 
> the data was not sent.
> 3.SO_REUSEADDR - Port multiplexing (allowing multiple sockets to listen on 
> the same IP+ port): For time-wait links, it ensures that the server restarts 
> successfully. In the case where some servers start up very quickly, it can 
> prevent startup failure.
> Netty decoder in IPC Server optimization:
> Netty provides a convenient decoding tool class 

[jira] [Commented] (HBASE-23113) IPC Netty Optimization

2020-04-06 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076674#comment-17076674
 ] 

Mark Robert Miller commented on HBASE-23113:


I've put up a new pull request and spent a good amount of time looking for any 
negative results to the tests with these changes - so far I don't see any.

 

* WRITE_BUFFER_WATER_MARK 

This is a good Netty feature that I think certainly makes sense to expose. It 
could be disabled by default, but personally I'd see setting it by default as a 
nice improvement as well.

 

* SO_REUSEADDR 

This is a useful feature for good restart behavior as mentioned above, often 
very helpful in tests as much or more than production as they may restart 
things very quickly and the speed/behavior of that can be hardware/OS dependent.

 

* SO_BACKLOG

I don't like increasing queue sizes as a default stance, but this one is known 
to improve things when connection management cycle is often off and/or 
connection reuse is not great, or lot's of retries, etc. It's also nice to pin 
it in HBase config vs relying on changing/varying external defaults. Given what 
I've seen of HBase, I think raising the default here is a good idea, but again, 
could default to previous behavior and just allow configuration as well.

 

> IPC Netty Optimization
> --
>
> Key: HBASE-23113
> URL: https://issues.apache.org/jira/browse/HBASE-23113
> Project: HBase
>  Issue Type: Improvement
>  Components: IPC/RPC
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Minor
> Attachments: Decoder.jpeg
>
>
> Netty options in IPC Server/Client optimization:
> 1.SO_BACKLOG setting:Two queues are maintained in the Linux system kernel: 
> syns queue and accept queue. The first is a semi-join queue that saves the 
> connections to the synrecv state after receiving the client syn. The default 
> netty is 128,io.netty.util.NetUtil#SOMAXCONN , and then read 
> /proc/sys/net/core /somaxconn to continue to determine, and then there are 
> some system level coverage logic.In some scenarios, if the client is far 
> redundant to the server and the connection is established, it may not be 
> enough. This value should not be too large, otherwise it will not prevent 
> SYN-Flood attacks. The current value has been changed to 1024. After setting, 
> the value set by yourself is equivalent to setting the upper limit because of 
> the setting of the system and the size of the system. If some settings of the 
> Linux system operation and maintenance are wrong, it can be avoided at the 
> code level.At present, our Linux level is usually set to 128, and the final 
> calculation will be set to 128.
> 2.WRITE_BUFFER_WATER_MARK setting:After WRITEBUFFERWATERMARK sets the maximum 
> and minimum Buffer that can be temporarily stored on a connection, isWritable 
> returns unwritable if the amount of data waiting to be sent for the 
> connection is greater than the set value. In this way, the client can no 
> longer send, preventing this amount of continuous backlog, and eventually the 
> client may hang. If this happens, it is usually caused by slow processing on 
> the server side. This value can effectively protect the client. At this point 
> the data was not sent.
> 3.SO_REUSEADDR - Port multiplexing (allowing multiple sockets to listen on 
> the same IP+ port): For time-wait links, it ensures that the server restarts 
> successfully. In the case where some servers start up very quickly, it can 
> prevent startup failure.
> Netty decoder in IPC Server optimization:
> Netty provides a convenient decoding tool class ByteToMessageDecoder, as 
> shown in the top half of the figure, this class has accumulate bulk unpacking 
> capability, can read bytes from the socket as much as possible, and then 
> synchronously call the decode method to decode the business object. And 
> compose a List. Finally, the traversal traverses the List and submits it to 
> ChannelPipeline for processing. Here we made a small change, as shown in the 
> bottom half of the figure, the content to be submitted is changed from a 
> single command to the entire List, which reduces the number of pipeline 
> executions and improves throughput. This mode has no advantage in 
> low-concurrency scenarios, and has a significant performance boost in boost 
> throughput in high-concurrency scenarios.
>  !Decoder.jpeg! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23113) IPC Netty Optimization

2020-04-03 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074823#comment-17074823
 ] 

Mark Robert Miller commented on HBASE-23113:


I've got an updated PR for this I'll submit shortly.

 
{quote}Netty decoder in IPC Server optimization
{quote}
I think this should probably be spun out into another issue. I took a stab at 
it a couple of months ago, but I believe I ran into some RAM usage changes and 
it may need an additional periodic flush or something. I'll go back and look at 
that again shortly in another JIRA.
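
For reference, here is a rough, hypothetical sketch of the shape the batching 
idea could take - accumulate bytes, decode every complete length-prefixed 
frame, and hand the whole batch downstream in one pass. This is not HBase's 
actual decoder, the framing is simplified, and it skips the buffer-cleanup 
corner cases (and the RAM behavior mentioned above) that a real implementation 
has to handle:
{code:java}
import java.util.ArrayList;
import java.util.List;

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class BatchDecodingHandler extends ChannelInboundHandlerAdapter {
  private ByteBuf cumulation;

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) {
    ByteBuf in = (ByteBuf) msg;
    if (cumulation == null) {
      cumulation = in;
    } else {
      cumulation.writeBytes(in);
      in.release();
    }
    List<ByteBuf> batch = new ArrayList<>();
    while (cumulation.readableBytes() >= 4) {
      cumulation.markReaderIndex();
      int frameLength = cumulation.readInt();
      if (cumulation.readableBytes() < frameLength) {
        cumulation.resetReaderIndex();
        break; // incomplete frame; wait for more bytes
      }
      batch.add(cumulation.readBytes(frameLength)); // copies the frame out
    }
    if (!cumulation.isReadable()) {
      cumulation.release();
      cumulation = null;
    }
    if (!batch.isEmpty()) {
      // One downstream call for the whole batch instead of one per frame.
      ctx.fireChannelRead(batch);
    }
  }
  // channelInactive/exceptionCaught cleanup of the cumulation buffer omitted.
}
{code}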

> IPC Netty Optimization
> --
>
> Key: HBASE-23113
> URL: https://issues.apache.org/jira/browse/HBASE-23113
> Project: HBase
>  Issue Type: Improvement
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Minor
> Attachments: Decoder.jpeg
>
>
> Netty options in IPC Server/Client optimization:
> 1.SO_BACKLOG setting:Two queues are maintained in the Linux system kernel: 
> syns queue and accept queue. The first is a semi-join queue that saves the 
> connections to the synrecv state after receiving the client syn. The default 
> netty is 128,io.netty.util.NetUtil#SOMAXCONN , and then read 
> /proc/sys/net/core /somaxconn to continue to determine, and then there are 
> some system level coverage logic.In some scenarios, if the client is far 
> redundant to the server and the connection is established, it may not be 
> enough. This value should not be too large, otherwise it will not prevent 
> SYN-Flood attacks. The current value has been changed to 1024. After setting, 
> the value set by yourself is equivalent to setting the upper limit because of 
> the setting of the system and the size of the system. If some settings of the 
> Linux system operation and maintenance are wrong, it can be avoided at the 
> code level.At present, our Linux level is usually set to 128, and the final 
> calculation will be set to 128.
> 2.WRITE_BUFFER_WATER_MARK setting:After WRITEBUFFERWATERMARK sets the maximum 
> and minimum Buffer that can be temporarily stored on a connection, isWritable 
> returns unwritable if the amount of data waiting to be sent for the 
> connection is greater than the set value. In this way, the client can no 
> longer send, preventing this amount of continuous backlog, and eventually the 
> client may hang. If this happens, it is usually caused by slow processing on 
> the server side. This value can effectively protect the client. At this point 
> the data was not sent.
> 3.SO_REUSEADDR - Port multiplexing (allowing multiple sockets to listen on 
> the same IP+ port): For time-wait links, it ensures that the server restarts 
> successfully. In the case where some servers start up very quickly, it can 
> prevent startup failure.
> Netty decoder in IPC Server optimization:
> Netty provides a convenient decoding tool class ByteToMessageDecoder, as 
> shown in the top half of the figure, this class has accumulate bulk unpacking 
> capability, can read bytes from the socket as much as possible, and then 
> synchronously call the decode method to decode the business object. And 
> compose a List. Finally, the traversal traverses the List and submits it to 
> ChannelPipeline for processing. Here we made a small change, as shown in the 
> bottom half of the figure, the content to be submitted is changed from a 
> single command to the entire List, which reduces the number of pipeline 
> executions and improves throughput. This mode has no advantage in 
> low-concurrency scenarios, and has a significant performance boost in boost 
> throughput in high-concurrency scenarios.
>  !Decoder.jpeg! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-03-10 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056296#comment-17056296
 ] 

Mark Robert Miller commented on HBASE-23779:


Most tests don’t really need more than 1GB of memory at most - I think that is 
the larger issue. The ‘limits’ creep that occurs in the simple search for more 
stable tests often ends up just being a cover for ugly stuff.

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments: addendum2.patch, test_yetus_934.0.patch
>
>
> Tests take a long time. Our fork count running all tests are conservative -- 
> 1 (small) for first part and 5 for second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low though this not 
> representative (looking at machine/uptime).
> More parallelism willl probably mean more test failure. Let me take a look 
> see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23952) Address thread safety issue with Map used in BufferCallBeforeInitHandler.

2020-03-09 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23952:
--

 Summary: Address thread safety issue with Map used in 
BufferCallBeforeInitHandler.
 Key: HBASE-23952
 URL: https://issues.apache.org/jira/browse/HBASE-23952
 Project: HBase
  Issue Type: Bug
Affects Versions: master
Reporter: Mark Robert Miller


id2Call is a HashMap. A callback method that accesses it can be run via an 
executor, while another method that accesses it can be run from a different 
thread.

id2Call should likely be a ConcurrentHashMap if it is going to be shared like 
this.
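
A minimal sketch of the proposed change (the value type and method names here 
are stand-ins for illustration, not the actual BufferCallBeforeInitHandler 
code):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BufferCallBeforeInitHandlerSketch {
  // Before: new HashMap<>() - unsafe when touched from more than one thread.
  private final Map<Integer, Object> id2Call = new ConcurrentHashMap<>();

  void bufferCall(int id, Object call) {
    id2Call.put(id, call); // may run on the event loop
  }

  Object takeCall(int id) {
    return id2Call.remove(id); // may run via an executor callback
  }
}
{code}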



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23951) Avoid high speed recursion trap in AsyncRequestFutureImpl.

2020-03-09 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23951:
---
Description: While working on branch-2, I ran into an issue where a 
retryable error kept occurring and code in AsyncRequestFutureImpl would reduce 
the backoff wait to 0 and extremely rapidly eat up a lot of thread stack space 
with recursive retry calls. This little patch stops killing the backoff wait 
after 3 retries. That number was chosen somewhat arbitrarily - perhaps 5 is the 
right number - but I find large retry counts tend to hide things, and that has 
made me default to fairly conservative values in all my arbitrary number 
picking.  (was: While working on branch-2, I ran into an issue where a 
retryable error kept occurring and code in )
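
A hypothetical sketch of the idea (names and the exact cut-off are illustrative, 
not the actual patch): only the first few retries may skip the backoff, after 
which the planned wait is always honored so a hot retry loop cannot recurse at 
full speed.
{code:java}
public class RetryBackoffSketch {
  // Only the first few retries may skip the backoff; 3 per the discussion above.
  private static final int MAX_FAST_RETRIES = 3;

  public static long backoffMillis(long plannedBackoff, int retryCount, boolean errorLooksTransient) {
    if (errorLooksTransient && retryCount < MAX_FAST_RETRIES) {
      return 0L; // fast retry, as before the change
    }
    return Math.max(plannedBackoff, 1L); // never reduce the wait to zero again
  }
}
{code}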

> Avoid high speed recursion trap in AsyncRequestFutureImpl.
> --
>
> Key: HBASE-23951
> URL: https://issues.apache.org/jira/browse/HBASE-23951
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.3.0
>Reporter: Mark Robert Miller
>Priority: Minor
>
> While working on branch-2, I ran into an issue where a retryable error kept 
> occurring and code in AsyncRequestFutureImpl would reduce the backoff wait to 
> 0 and extremely rapidly eat up a lot of thread stack space with recursive 
> retry calls. This little patch stops killing the backoff wait after 3 
> retries. That number was chosen somewhat arbitrarily - perhaps 5 is the right 
> number - but I find large retry counts tend to hide things, and that has made 
> me default to fairly conservative values in all my arbitrary number picking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23951) Avoid high speed recursion trap in AsyncRequestFutureImpl.

2020-03-09 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23951:
---
Description: While working on branch-2, I ran into an issue where a 
retryable error kept occurring and code in 

> Avoid high speed recursion trap in AsyncRequestFutureImpl.
> --
>
> Key: HBASE-23951
> URL: https://issues.apache.org/jira/browse/HBASE-23951
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.3.0
>Reporter: Mark Robert Miller
>Priority: Minor
>
> While working on branch-2, I ran into an issue where a retryable error kept 
> occurring and code in 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23951) Avoid high speed recursion trap in AsyncRequestFutureImpl.

2020-03-09 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23951:
--

 Summary: Avoid high speed recursion trap in AsyncRequestFutureImpl.
 Key: HBASE-23951
 URL: https://issues.apache.org/jira/browse/HBASE-23951
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.3.0
Reporter: Mark Robert Miller






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-03-01 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048742#comment-17048742
 ] 

Mark Robert Miller commented on HBASE-23795:


I think the fail-safe finalizer asking everyone to be polite citizens and close 
connections may be both failing in its politeness and making a lot of the test 
situation even worse at best. You have all these extra costs and bad effects on 
GC with a finalizer, and the number of connections these tests roll through and 
do not close is very high. I think it's quite easily overwhelmed, and I think 
it contributes various bad things that probably vary by the garbage collector 
in place.

Timely, proper connection closes and verification of reuse is a very fast, 
GC-agnostic recipe.  Those good citizens could use an assist, so I filed 
HBASE-23918 Track sensitive resources to ensure they are closed and assist devs 
in finding leaks.

> Enable all tests to be run in parallel on reused JVMs.
> --
>
> Key: HBASE-23795
> URL: https://issues.apache.org/jira/browse/HBASE-23795
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Major
>
> I'd like to be able to run HBase tests in under 30-40 minutes on good 
> parallel hardware.
> It will require some small changes / fixes for that wish to come true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23918) Track sensitive resources to ensure they are closed and assist devs in finding leaks.

2020-03-01 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048672#comment-17048672
 ] 

Mark Robert Miller commented on HBASE-23918:


I chose to do this via the Java assert keyword.

asserts should not be turned on in production systems.

Tests would fail if run without asserts turned on.

At the end of the constructor for an Object I needed to track:
{code:java}
assert ObjectReleaseTracker.track(this);
{code}
At the end of the close or shutdown method:
{code:java}
assert ObjectReleaseTracker.release(this);
{code}
At the end of each test:
{code:java}
  /**
   * @return null if ok else error message
   */
 String ObjectReleaseTracker.checkEmpty()
{code}
If checkEmpty finds objects that have not been released, it lists them, shows 
the stacktrace for the code origin, and then you can fail the test.
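
For reference, a minimal sketch of a tracker exposing that API might look like 
the following (an illustration only, not the proposed HBase implementation; a 
real one would probably key on object identity and reset its state between 
tests):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class ObjectReleaseTracker {
  // Tracked object -> an exception capturing where it was created.
  private static final Map<Object, Exception> TRACKED = new ConcurrentHashMap<>();

  public static boolean track(Object o) {
    TRACKED.put(o, new Exception("Origin of " + o.getClass().getName()));
    return true; // always true so it can live behind the assert keyword
  }

  public static boolean release(Object o) {
    TRACKED.remove(o);
    return true;
  }

  /** @return null if everything was released, else a report with origin stack traces. */
  public static String checkEmpty() {
    if (TRACKED.isEmpty()) {
      return null;
    }
    StringBuilder sb = new StringBuilder("Unreleased objects:\n");
    TRACKED.forEach((o, origin) -> {
      sb.append(o).append(" created at:\n");
      for (StackTraceElement e : origin.getStackTrace()) {
        sb.append("  ").append(e).append('\n');
      }
    });
    return sb.toString(); // a test harness would also clear the map between tests
  }
}
{code}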

> Track sensitive resources to ensure they are closed and assist devs in 
> finding leaks.
> -
>
> Key: HBASE-23918
> URL: https://issues.apache.org/jira/browse/HBASE-23918
> Project: HBase
>  Issue Type: Improvement
>Reporter: Mark Robert Miller
>Priority: Major
>
> Closing some objects is quite critical. Issues with leaks can be quite 
> slippery and nasty and growy. Maintaining close integrity is an embarrassing 
> sport for humans.
> In the past, those 3 thoughts led me to start tracking objects in tests to 
> alert of leaks. Even with an alert though, the job of tracking down all of 
> the leaks just based on what leaked was beyond my skill. If it's beyond even 
> one devs skill that is committing, that tends to end up trouble. So I added 
> the stack trace for the origin of the object. Things can still get a bit 
> tricky to track down in some cases, but now I had the start of a real 
> solution to all of the whack-a-mole games I spent too much time playing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23918) Track sensitive resources to ensure they are closed and assist devs in finding leaks.

2020-03-01 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23918:
--

 Summary: Track sensitive resources to ensure they are closed and 
assist devs in finding leaks.
 Key: HBASE-23918
 URL: https://issues.apache.org/jira/browse/HBASE-23918
 Project: HBase
  Issue Type: Improvement
Reporter: Mark Robert Miller


Closing some objects is quite critical. Issues with leaks can be quite slippery 
and nasty and growy. Maintaining close integrity is an embarrassing sport for 
humans.

In the past, those 3 thoughts led me to start tracking objects in tests to 
alert on leaks. Even with an alert though, the job of tracking down all of the 
leaks just based on what leaked was beyond my skill. If it's beyond the skill 
of even one committing dev, that tends to end in trouble. So I added the 
stack trace for the origin of the object. Things can still get a bit tricky to 
track down in some cases, but now I had the start of a real solution to all of 
the whack-a-mole games I spent too much time playing.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.

2020-03-01 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048662#comment-17048662
 ] 

Mark Robert Miller commented on HBASE-23796:


I still think this is a good, developer-friendly change, but I am lowering it 
on my priority list for a bit.

This was helping with some major problems I was hitting in the tests, but 
likely only because I had not changed everything to use 127.0.0.1 yet, so it 
was giving me more ports to work with. That staved off the large connection 
leak from eating all the ports, which let tests run fast and well for longer.

> Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as 
> well.
> ---
>
> Key: HBASE-23796
> URL: https://issues.apache.org/jira/browse/HBASE-23796
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> This is perhaps controversial, but there are a variety of problems with 
> counting on DNS hostname resolution, especially for localhost.
>  
>  # It can often be slow, slow under concurrency, or slow under specific 
> conditions.
>  # It can often not work at all - when on a VPN, with weird DNS hijacking 
> hi-jinks, when you have a real hostname for your machines, with a custom 
> /etc/hosts file, or when the OS runs its own local/funny DNS server services.
>  # This makes coming to HBase a hit or miss experience for new devs, and if 
> you miss, dealing with and diagnosing the issues is a large endeavor and not 
> straightforward or transparent.
>  # 99% of the difference doesn't matter in most cases - except that 
> 127.0.0.1 works and is fast pretty much universally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.

2020-03-01 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23796:
---
Priority: Minor  (was: Major)

> Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as 
> well.
> ---
>
> Key: HBASE-23796
> URL: https://issues.apache.org/jira/browse/HBASE-23796
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> This is perhaps controversial, but there are a variety of problems with 
> counting on DNS hostname resolution, especially for localhost.
>  
>  # It can often be slow, slow under concurrency, or slow under specific 
> conditions.
>  # It can often not work at all - when on a VPN, with weird DNS hijacking 
> hi-jinks, when you have a real hostname for your machines, with a custom 
> /etc/hosts file, or when the OS runs its own local/funny DNS server services.
>  # This makes coming to HBase a hit or miss experience for new devs, and if 
> you miss, dealing with and diagnosing the issues is a large endeavor and not 
> straightforward or transparent.
>  # 99% of the difference doesn't matter in most cases - except that 
> 127.0.0.1 works and is fast pretty much universally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-02-29 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048462#comment-17048462
 ] 

Mark Robert Miller commented on HBASE-23795:


So, after getting to know the tests more individually, they did not match up 
with my experience of trying to run the test suite. So I jumped ahead a bit.

I reduced resources and saw improvement. But the test suite still did not make 
sense.

I started shutting down resources and resolving deadlock or deadlock-like 
situations on shutdown.

I removed System.exit type calls that are not appropriate for JUnit tests and 
that kill JVMs on us.

Again, I saw improvements but the test runs still didn't make sense.

So I started hacking and providing higher limits to find out why nothing made 
sense.

In the end, lots of connections are fired up and few are closed; unless you run 
the tests fairly slowly, you will be DOS attacked by them.

The worst of the leaks appears to be within the rpc client.

These tests are now much faster, the rest of the flakies become conquerable, 
and running tests with more parallel JVMs and even in the same JVM is not that 
difficult, though some of the code does make it a bit of an exercise. Good news 
all around.


> Enable all tests to be run in parallel on reused JVMs.
> --
>
> Key: HBASE-23795
> URL: https://issues.apache.org/jira/browse/HBASE-23795
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Major
>
> I'd like to be able to run HBase tests in under 30-40 minutes on good 
> parallel hardware.
> It will require some small changes / fixes for that wish to come true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23899) [Flakey Test] Stabilizations and Debug

2020-02-27 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046501#comment-17046501
 ] 

Mark Robert Miller commented on HBASE-23899:


bq. Fix some issues particular where we ran into mismatched filesystem 
complaint.

Great! I had been running into this sometimes and did not know what to make of 
it. Had to work around it.

bq. Removal of unnecessary deletes

+1 - unless it releases resources, it just eats time and adds things that can 
go wrong.

bq. manifests as test failing in startup saying master didn't launch

Cool, been biting me too. This and the first issue will improve my local test 
situation a lot without hack work arounds.

bq. TestReplicationStatus

I looked at this for a while the other day as well. I added code to try to 
wait for the right state to show up. I found many possible thread safety 
issues that could be related and fixed them. And still this one failure would 
happen. The changes did seem to make it much harder to happen on my 16 core 
box - I thought it was fixed. But I could still easily reproduce it on my 4 and 
8 core boxes.




> [Flakey Test] Stabilizations and Debug
> --
>
> Key: HBASE-23899
> URL: https://issues.apache.org/jira/browse/HBASE-23899
> Project: HBase
>  Issue Type: Bug
>  Components: flakies
>Reporter: Michael Stack
>Priority: Major
>
> Bunch of test stabilization and extra debug. This set of changes make it so 
> local 'test' runs pass about 20% of the time (where before they didn't) when 
> run on a linux vm and on a mac.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23904) Procedure updating meta and Master shutdown are incompatible: CODE-BUG

2020-02-27 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046471#comment-17046471
 ] 

Mark Robert Miller commented on HBASE-23904:


Oh cool, I've been seeing some stuff like this in tests and didn't know what 
was expected.

bq. The rejected exception is probably because the pool has been shutdown

Yeah, the pool is TERMINATED, so it was shut down and that process fully 
completed: Terminated, pool size = 0, active threads = 0, queued tasks = 0, 
completed tasks = 5.

bq. So I do not think we should let the procedures to finish when master is 
going to quit...

I agree that letting the procedures finish is not a great default behavior if 
they are likely to take a while - unless there is a clear, easy choice between 
shutting down ASAP and shutting down after finishing any outstanding work that 
will take more than a fairly short time. As a user, if you were going to 
prevent the procedures from running and do a speedy shutdown, I wouldn't want 
to see anything about CODE-BUG or even an exception - just a message saying the 
master is shutting down and procedures have been aborted or skipped.
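
Purely as an illustration of that behavior (a hypothetical helper, not HBase 
code), the submit path could downgrade a rejection during shutdown to an 
informational log and only treat it as a bug when the server is still running:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.BooleanSupplier;

public class ShutdownAwareSubmit {
  public static Future<?> submitOrSkip(ExecutorService pool, Runnable task, BooleanSupplier isStopping) {
    try {
      return pool.submit(task);
    } catch (RejectedExecutionException e) {
      if (isStopping.getAsBoolean() || pool.isShutdown()) {
        // Log "master is shutting down; skipping procedure work" instead of CODE-BUG.
        return null;
      }
      throw e; // a rejection while running normally is still a real bug
    }
  }
}
{code}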


> Procedure updating meta and Master shutdown are incompatible: CODE-BUG
> --
>
> Key: HBASE-23904
> URL: https://issues.apache.org/jira/browse/HBASE-23904
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Reporter: Michael Stack
>Priority: Major
>
> Chasing flakies, studying TestMasterAbortWhileMergingTable, I noticed a 
> failure because
> {code:java}
> 2020-02-27 00:57:51,702 ERROR [PEWorker-6] 
> procedure2.ProcedureExecutor(1688): CODE-BUG: Uncaught runtime exception: 
> pid=14, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, locked=true; 
> MergeTableRegionsProcedure table=test, 
> regions=[48c9be922fa4356bfc7fc61b5b0785f3, ef196d5377c5c1d143e9a2a2ea056a9c], 
> force=false
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@28b956c7 rejected from 
> java.util.concurrent.ThreadPoolExecutor@639f20e5[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 5]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
> at 
> org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:974)
> at 
> org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:953)
> at 
> org.apache.hadoop.hbase.MetaTableAccessor.multiMutate(MetaTableAccessor.java:1771)
> at 
> org.apache.hadoop.hbase.MetaTableAccessor.mergeRegions(MetaTableAccessor.java:1637)
> at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.mergeRegions(RegionStateStore.java:268)
> at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsMerged(AssignmentManager.java:1854)
> at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.updateMetaForMergedRegions(MergeTableRegionsProcedure.java:687)
> at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:229)
> at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:77)
> at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
> at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1669)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1416)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:79)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1986)
>  {code}
> A few seconds above, as part of the test, we'd stopped Master
> {code:java}
> 2020-02-27 00:57:51,620 INFO  [Time-limited test] 
> regionserver.HRegionServer(2212): * STOPPING region server 
> 'rn-hbased-lapp01.rno.exampl.com,36587,1582765058324' *
> 2020-02-27 00:57:51,620 INFO  [Time-limited test] 
> regionserver.HRegionServer(2226): STOPPED: Stopping master 0 {code}
> The rejected execution damages the merge procedure. It shows as an unhandled 
> CODE-BUG.
> Why we let a runtime exception out when trying to update meta 

[jira] [Commented] (HBASE-23882) Scale *MiniCluster config for the environment it runs in.

2020-02-21 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042245#comment-17042245
 ] 

Mark Robert Miller commented on HBASE-23882:


Here is some initial experimentation with bringing the mini cluster settings 
down to scale.

To start, I've been shooting for pretty minimal.

I've pulled this out of an experimental branch, so some work may still be 
needed to find any tests these settings are too low for.

I will continue to update this as I fine-tune. I also have some other changes 
I'd like to dig out around thread pool sizing if I can find them again. I'll 
likely update again early next week.
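
As a flavor of the kind of scaling down meant here (the specific keys and 
values below are illustrative picks, not the settings from the experimental 
branch), handler counts for the mini cluster's HBase and HDFS daemons can be 
pinned well under their production defaults:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MiniClusterScaleSketch {
  public static Configuration scaledDownConf() {
    Configuration conf = HBaseConfiguration.create();
    // RPC handlers sized for a single-JVM mini cluster, not a production node.
    conf.setInt("hbase.regionserver.handler.count", 5);
    conf.setInt("dfs.namenode.handler.count", 3);
    conf.setInt("dfs.datanode.handler.count", 3);
    return conf;
  }
}
{code}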

> Scale *MiniCluster config for the environment it runs in.
> -
>
> Key: HBASE-23882
> URL: https://issues.apache.org/jira/browse/HBASE-23882
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23849) Harden small and medium tests for lots of parallel runs with re-used jvms.

2020-02-21 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042204#comment-17042204
 ] 

Mark Robert Miller commented on HBASE-23849:


HBASE-23882 will be useful here as well.

> Harden small and medium tests for lots of parallel runs with re-used jvms.
> --
>
> Key: HBASE-23849
> URL: https://issues.apache.org/jira/browse/HBASE-23849
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23882) Scale *MiniCluster config for the environment it runs in.

2020-02-21 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23882:
--

 Summary: Scale *MiniCluster config for the environment it runs in.
 Key: HBASE-23882
 URL: https://issues.apache.org/jira/browse/HBASE-23882
 Project: HBase
  Issue Type: Test
Reporter: Mark Robert Miller






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-02-19 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040572#comment-17040572
 ] 

Mark Robert Miller commented on HBASE-23795:


Now that I have spent enough time to understand what is possible here, to kind 
of zoom out, here is what would need to happen (this can take some time, but is 
easily done in parts):
 * Tune HBase and hdfs settings to make sense for the small world we are 
creating in tests. If you wash a test out with threads, it's really not very 
realistic at all. This means pinning settings for the JVM so that the right 
number of threads is created, settings for HBase and hdfs that make sense - 
handler counts, pool sizes, netty 'stuff' - and scaling down thousands of 
threads to 200-300 or so. Not many of these threads are running at the same 
time - the rest is just flooding and eating resources.
 * Make things close and shut down, and enforce this. There are lots and lots 
of various leaks currently. Once they are mostly removed, you can start reusing 
JVMs, but I think there are other benefits to actually closing and stopping 
your resources explicitly. Not closing or shutting down some things explicitly 
can have lingering OS ramifications even when creating new JVMs.
 * Have tests clean up their expensive or config statics and reset sys props, 
with enforcement. Needed for JVM reuse (a small sketch of the sys-prop piece 
follows after this list).
 * Clean up any heavy resource usage you can't currently control or that 
doesn't seem to make a lot of sense (I think an ipc thread pool is set to core 
and max size 200?)
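
The sys-prop piece of the cleanup above could be enforced with something as 
small as a JUnit 4 rule like the following (a sketch assuming JUnit 4, which 
the HBase tests use; not an existing HBase utility):
{code:java}
import java.util.Properties;

import org.junit.rules.ExternalResource;

public class RestoreSystemPropertiesRule extends ExternalResource {
  private Properties snapshot;

  @Override
  protected void before() {
    // Snapshot the system properties before the test runs.
    snapshot = new Properties();
    snapshot.putAll(System.getProperties());
  }

  @Override
  protected void after() {
    // Drop anything the test set or changed so a reused JVM stays clean.
    System.setProperties(snapshot);
  }
}
{code}
A test class would declare it as a public @Rule field; enforcement of the other 
static/config cleanup would still live in the harness.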

That's kind of the core of it; there is a lot of related useful stuff to do, I 
think.

A lot of these tests are reasonably fast now. They could be even faster with a 
little work, but even without that, they are not that slow. Loading up a JVM, 
loading up 10-20k classes, warming up JIT, blah blah, that is super costly. 
Just getting to rerunning tests in the same JVM will be super helpful. There is 
a lot that can be done after that as well, but such a large win - good enough 
goal for now. Most of these tests don't even need that much RAM. They are just 
not running with sensible resources.

> Enable all tests to be run in parallel on reused JVMs.
> --
>
> Key: HBASE-23795
> URL: https://issues.apache.org/jira/browse/HBASE-23795
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Major
>
> I'd like to be able to run HBase tests in under 30-40 minutes on good 
> parallel hardware.
> It will require some small changes / fixes for that wish to come true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-02-18 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039560#comment-17039560
 ] 

Mark Robert Miller edited comment on HBASE-23779 at 2/18/20 11:45 PM:
--

In the meantime, here are some ideas on JVM args that may be able to help the 
situation a little.

-XX:-UseContainerSupport - when running in docker, query docker for hardware 
info, not the host. I have not played with it, but it makes sense to me.

-XX:ActiveProcessorCount - fake the active processor count to 1 - when running 
multiple jvms, you don't want each one to think it should grab 32 gc threads on 
your 32 hyperthreaded core machine. I'd much prefer to run 32 JVMs and have 32 
gc threads.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages - given so much JVM startup 
and shutdown, some things might help more once there is more JVM reuse - so 
much time and cost is in creating large-heap JVMs and spinning up lots of 
unused resources and threads that it's hard to win much with good tuning - but 
huge pages are always a nice win.

-XX:+UseParallelGC -XX:TieredStopAtLevel=1 - even when G1 is available, this is 
the gold standard for test speed. Maybe with something like 
-XX:ParallelGCThreads=1 if not using ActiveProcessorCount.

Probably not a big advantage when each med/large JVM is spinning up thousands 
of useless threads that eat RAM, heap and OS resources, but it's something, and 
probably pretty nice for small-test single-JVM runs now.

It's also generally best to pin Xms and Xmx vs eating all the resizing cost. I 
suggest the opposite above given the RAM reqs - to keep the JVMs that don't 
need it from sucking up so much RAM unnecessarily - but with good reuse, you 
want to pin them.

Most of the tests don't need these huge heaps and limits - I think that having 
them for the bad apples and outliers just allows the 95% of tests to easily be 
wasteful and misbehave.

 

 * Note: UseContainerSupport is enabled by default - the above disables it - so 
I guess that should already be in effect.

 


was (Author: markrmiller):
In the meantime, some some ideas on JVM args that be able to help the situation 
a little.

-XX:-UseContainerSupport - when running in docker, query docker for hardware 
info, not the host. I have not played with it, but makes sense to me.

-XX:ActiveProcessorCount - fake the active processer count to 1 - when running 
multiple jvms, you don't want each one to think it should grab 32 gc threads on 
your 32 hyperthreaded core machine. I'd much prefer to run 32 JVMs and have 32 
gc threads.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages - given so much JVM startup 
and shutdowns, some things might help more when there is more JVM reuse - so 
much time and cost is in creating large heap jvms and spinning up lots of 
unused resources and threads that it's hard to win much with good tuning - but 
huge pages, always a nice win.

-XX:+UseParallelGC -XX:TieredStopAtLevel=1 - even when G1 is available, this is 
gold standard for test speed. Maybe with like -XX:ParallelGCThreads=1 if not 
using ActiveProcessorCout. 

Probably not a big advantage when each med/large JVM is spinning up thousands 
of useless threads that eat ram, heap and os resources, but something, and 
probably pretty nice for small test single jvm runs now.

It's also generally best to pin Xms and Xmx vs eat all the resizing cost. I 
suggest the opposite above given the RAM reqs - to keep the JVMs that don't 
need it from sucking up so much RAM unnecessarily - but with good reuse, you 
want to pin them.

Most of the tests don't need these huge heaps and limits - I think that having 
them for the bad apples and outliers just allows the 95% of tests to easily be 
wasteful and misbehave.

 

 

 

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments: addendum2.patch, test_yetus_934.0.patch
>
>
> Tests take a long time. Our fork count running all tests are conservative -- 
> 1 (small) for first part and 5 for second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low though this not 
> representative (looking at machine/uptime).
> More parallelism willl probably mean more test failure. Let me take a 

[jira] [Comment Edited] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-02-18 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039560#comment-17039560
 ] 

Mark Robert Miller edited comment on HBASE-23779 at 2/18/20 11:40 PM:
--

In the meantime, some some ideas on JVM args that be able to help the situation 
a little.

-XX:-UseContainerSupport - when running in docker, query docker for hardware 
info, not the host. I have not played with it, but makes sense to me.

-XX:ActiveProcessorCount - fake the active processer count to 1 - when running 
multiple jvms, you don't want each one to think it should grab 32 gc threads on 
your 32 hyperthreaded core machine. I'd much prefer to run 32 JVMs and have 32 
gc threads.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages - given so much JVM startup 
and shutdowns, some things might help more when there is more JVM reuse - so 
much time and cost is in creating large heap jvms and spinning up lots of 
unused resources and threads that it's hard to win much with good tuning - but 
huge pages, always a nice win.

-XX:+UseParallelGC -XX:TieredStopAtLevel=1 - even when G1 is available, this is 
gold standard for test speed. Maybe with like -XX:ParallelGCThreads=1 if not 
using ActiveProcessorCout. 

Probably not a big advantage when each med/large JVM is spinning up thousands 
of useless threads that eat ram, heap and os resources, but something, and 
probably pretty nice for small test single jvm runs now.

It's also generally best to pin Xms and Xmx vs eat all the resizing cost. I 
suggest the opposite above given the RAM reqs - to keep the JVMs that don't 
need it from sucking up so much RAM unnecessarily - but with good reuse, you 
want to pin them.

Most of the tests don't need these huge heaps and limits - I think that having 
them for the bad apples and outliers just allows the 95% of tests to easily be 
wasteful and misbehave.

 

 

 


was (Author: markrmiller):
In the meantime, some some ideas on JVM args that be able to help the situation 
a little.

-XX:-UseContainerSupport - when running in docker, query docker for hardware 
info, not the host. I have no played with it, but makes sense to me.

-XX:ActiveProcessorCount - fake the active process count to 1 - when running 
multiple jvms, you don't want each one to think it should grab 32 gc threads on 
your 32 hyperthreaded core machine. I'd much prefer to run 32 JVMs and have 32 
gc threads.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages - given so much JVM startup 
and shutdowns, some things might help more when there is more JVM reuse - so 
much time and cost is in creating large heap jvms and spinning up lots of 
unused resources and threads that it's hard to win much with good tuning - but 
huge pages, always a nice win.

-XX:+UseParallelGC -XX:TieredStopAtLevel=1 - even when G1 is available, this is 
gold standard for test speed. Maybe with like -XX:ParallelGCThreads=1 if no 
using ActiveProcessorCout. 

Probably not a big advantage when each med/large JVM is spinning up thousands 
of useless threads that eat ram, heap and os resources, but something, and 
probably pretty nice for small test single jvm runs now.

It's also generally best to pin Xms and Xmx vs eat all the resizing cost. I 
suggest the opposite above given the RAM reqs - to keep the JVMs that don't 
need it from sucking up so much RAM unnecessarily - but with good reuse, you 
want to pin them.

Most of the tests don't need these huge heaps and limits - I think that having 
them for the bad apples and outliers just allows the 95% of tests to easily be 
wasteful and misbehave.

 

 

 

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments: addendum2.patch, test_yetus_934.0.patch
>
>
> Tests take a long time. Our fork count running all tests are conservative -- 
> 1 (small) for first part and 5 for second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low though this not 
> representative (looking at machine/uptime).
> More parallelism willl probably mean more test failure. Let me take a look 
> see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-02-18 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039560#comment-17039560
 ] 

Mark Robert Miller commented on HBASE-23779:


In the meantime, some some ideas on JVM args that be able to help the situation 
a little.

-XX:-UseContainerSupport - when running in docker, query docker for hardware 
info, not the host. I have no played with it, but makes sense to me.

-XX:ActiveProcessorCount - fake the active process count to 1 - when running 
multiple jvms, you don't want each one to think it should grab 32 gc threads on 
your 32 hyperthreaded core machine. I'd much prefer to run 32 JVMs and have 32 
gc threads.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages - given so much JVM startup 
and shutdowns, some things might help more when there is more JVM reuse - so 
much time and cost is in creating large heap jvms and spinning up lots of 
unused resources and threads that it's hard to win much with good tuning - but 
huge pages, always a nice win.

-XX:+UseParallelGC -XX:TieredStopAtLevel=1 - even when G1 is available, this is 
gold standard for test speed. Maybe with like -XX:ParallelGCThreads=1 if no 
using ActiveProcessorCout. 

Probably not a big advantage when each med/large JVM is spinning up thousands 
of useless threads that eat ram, heap and os resources, but something, and 
probably pretty nice for small test single jvm runs now.

It's also generally best to pin Xms and Xmx vs eat all the resizing cost. I 
suggest the opposite above given the RAM reqs - to keep the JVMs that don't 
need it from sucking up so much RAM unnecessarily - but with good reuse, you 
want to pin them.

Most of the tests don't need these huge heaps and limits - I think that having 
them for the bad apples and outliers just allows the 95% of tests to easily be 
wasteful and misbehave.

 

 

 

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments: addendum2.patch, test_yetus_934.0.patch
>
>
> Tests take a long time. Our fork count running all tests are conservative -- 
> 1 (small) for first part and 5 for second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low though this not 
> representative (looking at machine/uptime).
> More parallelism willl probably mean more test failure. Let me take a look 
> see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-02-18 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039393#comment-17039393
 ] 

Mark Robert Miller commented on HBASE-23795:


I've started making some progress here.

With some tuning, these large tests are not so large after all.

I have to do some work to make sure they clean up after themselves so that JVMs 
don't get dirtier and dirtier over time, but all of these tests can run in 
parallel and much faster than they do now.

> Enable all tests to be run in parallel on reused JVMs.
> --
>
> Key: HBASE-23795
> URL: https://issues.apache.org/jira/browse/HBASE-23795
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Major
>
> I'd like to be able to run HBase tests in under 30-40 minutes on good 
> parallel hardware.
> It will require some small changes / fixes for that wish to come true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-02-16 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038062#comment-17038062
 ] 

Mark Robert Miller commented on HBASE-23779:


I see what's going on here. A lot ;)

To some degree Maven is not helping - the closest approximation to Gradle's 
awesome parallel build performance can be a fair bit more expensive at the 
least. That's just half the equation though. It's really largely small-med (or 
potentially small-med) tests masquerading as large or super large tests and/or 
waiting for a non-CI, less intense option. Expanding limits and trying to baby 
and isolate the tests has gotten HBase to like a billion tests, which I am both 
impressed and frustrated by. So. Many. Test classes. Just tossing that many 
no-op tests at an executor is going to take some time - toss a new 2800Mb JVM 
in between for most of them and it will take a little more :)

We could address it all within a few months' time (the basics anyway - really, 
so much can be done and improved on the basics). I'll convince someone to 
champion a reverse course: shrink resources and expose the tests to a bit of 
hell on purpose, for profit and pleasure. There is a lot of hidden flakiness, 
and hiding it is a valid strategy in these cases - if you can hide it well 
enough, at least it's a mostly rational signal. But with so many tests and 
hours-long running times, any real stability will be forever elusive, a mirage, 
or hanging on a dime. You also just pay for it in so many ways, even if it does 
end up with some success.

We can expose these tests to sunlight and it will force us to shape them right 
up.

 

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments: addendum2.patch, test_yetus_934.0.patch
>
>
> Tests take a long time. Our fork count running all tests is conservative -- 
> 1 (small) for the first part and 5 for the second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low, though this is not 
> representative (looking at machine/uptime).
> More parallelism will probably mean more test failures. Let me take a look 
> see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.

2020-02-14 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037448#comment-17037448
 ] 

Mark Robert Miller commented on HBASE-23806:


While things have been getting better over the years, build and test-running 
tools in general have not gone very far down the road of good efficiency.

In my experience, Gradle easily leads the pack; Maven is trying to match it with 
similar features, but at best it's a turbocharger on an Eclipse vs a solid Audi. 
Gradle, though, is mostly living off what it has done for builds, which is 
great, with no real focus on tests.

We can do better for special occasions.

> Provide a much faster and efficient alternate option to maven and surefire 
> for running tests.
> -
>
> Key: HBASE-23806
> URL: https://issues.apache.org/jira/browse/HBASE-23806
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Minor
>
> With HBASE-23795, the hope is to drive tests with maven and surefire much 
> closer to their potential.
> That will still leave a lot of room for improvement.
> For those that have some nice hardware and a need for speed, we can blow 
> right past maven+surefire.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-02-14 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037443#comment-17037443
 ] 

Mark Robert Miller commented on HBASE-23795:


I'll start with small and medium tests in HBASE-23849, "Harden small and medium 
tests for lots of parallel runs with re-used jvms."

That is a good way to get some solid, easy progress before the mountain that is 
the large tests.

In the meantime, I've been working in parallel on things related to HBASE-23806, 
"Provide a much faster and efficient alternate option to maven and surefire for 
running tests." That work is not likely to be shared any time soon, but it is 
providing its own benefit to this related issue.

> Enable all tests to be run in parallel on reused JVMs.
> --
>
> Key: HBASE-23795
> URL: https://issues.apache.org/jira/browse/HBASE-23795
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Major
>
> I'd like to be able to run HBase tests in under 30-40 minutes on good 
> parallel hardware.
> It will require some small changes / fixes for that wish to come true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23849) Harden small and medium tests for lots of parallel runs with re-used jvms.

2020-02-14 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23849:
--

 Summary: Harden small and medium tests for lots of parallel runs 
with re-used jvms.
 Key: HBASE-23849
 URL: https://issues.apache.org/jira/browse/HBASE-23849
 Project: HBase
  Issue Type: Test
Reporter: Mark Robert Miller






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23835) TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2.

2020-02-14 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036897#comment-17036897
 ] 

Mark Robert Miller commented on HBASE-23835:


I've started to dig into this test, playing around with speeding it up and 
hardening it. That led to a couple of other little things in other code. I'll 
put up a PR for this test soon and spin off a JIRA issue or two.

> TestFromClientSide3 and subclasses often fail on 
> testScanAfterDeletingSpecifiedRowV2.
> -
>
> Key: HBASE-23835
> URL: https://issues.apache.org/jira/browse/HBASE-23835
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
>
> This test method fails a fair amount on me with something like:
> TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236
>  expected:<3> but was:<2>
> I had a hunch that it might be due to interference from other test methods 
> running first so I tried changing the table name for just this method to be 
> unique - still fails.
> However, when I just run testScanAfterDeletingSpecifiedRowV2 on its own 
> without the other methods, it does not seem to fail so far.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23829) Get `-PrunSmallTests` passing on JDK11

2020-02-14 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036804#comment-17036804
 ] 

Mark Robert Miller commented on HBASE-23829:


Here is some work I have done towards this against the small/medium tests (so 
far I have not needed to jump to hadoop 3.x): 
https://github.com/markrmiller/hbase/tree/jdk11

> Get `-PrunSmallTests` passing on JDK11
> --
>
> Key: HBASE-23829
> URL: https://issues.apache.org/jira/browse/HBASE-23829
> Project: HBase
>  Issue Type: Sub-task
>  Components: test
>Reporter: Nick Dimiduk
>Priority: Major
>
> Start with the small tests, shaking out issues identified by the harness. So 
> far it seems like {{-Dhadoop.profile=3.0}} and 
> {{-Dhadoop-three.version=3.3.0-SNAPSHOT}} may be required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23839) TestEntityLocks often fails in lower resource envs in testEntityLockTimeout

2020-02-13 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23839:
--

 Summary: TestEntityLocks often fails in lower resource envs in 
testEntityLockTimeout
 Key: HBASE-23839
 URL: https://issues.apache.org/jira/browse/HBASE-23839
 Project: HBase
  Issue Type: Test
Reporter: Mark Robert Miller


The test waits for something to happen, and if the computer is a little slow, 
it will fail at line 178.

Doing the check against 3x instead of 2x seems to help a lot as a start.
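
A tiny, generic sketch of the kind of change meant here (plain Java and JUnit, 
not the actual TestEntityLocks code; the helper name, interval, and condition 
are placeholders): give the condition 3x the expected interval instead of 2x 
before failing.

{code}
import static org.junit.Assert.fail;

import java.util.function.BooleanSupplier;

public class SlackWaitSketch {
  // Poll for a condition, allowing 3x the expected interval (was 2x) so a
  // slow or loaded machine has more headroom before the assertion fires.
  static void waitForCondition(BooleanSupplier condition, long expectedIntervalMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + 3 * expectedIntervalMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return;
      }
      Thread.sleep(10);
    }
    fail("condition not reached within 3x the expected interval");
  }
}
{code}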



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23835) TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2.

2020-02-13 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23835:
---
Description: 
This test method fails a fair amount on me with something like:

TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236
 expected:<3> but was:<2>

I had a hunch that it might be due to interference from other test methods 
running first so I tried changing the table name for just this method to be 
unique - still fails.

However, when I just run testScanAfterDeletingSpecifiedRowV2 on its own 
without the other methods, it does not seem to fail so far.




  was:
This test method fails a fair amount on me with something like:

TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236
 expected:<3> but was:<2>

I had a hunch that it might be due to interference from other test methods 
running first so I tried changing the table name for just this method to be 
unique - still fails.

However, when I just run 





> TestFromClientSide3 and subclasses often fail on 
> testScanAfterDeletingSpecifiedRowV2.
> -
>
> Key: HBASE-23835
> URL: https://issues.apache.org/jira/browse/HBASE-23835
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
>
> This test method fails a fair amount on me with something like:
> TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236
>  expected:<3> but was:<2>
> I had a hunch that it might be due to interference from other test methods 
> running first so I tried changing the table name for just this method to be 
> unique - still fails.
> However, when I just run testScanAfterDeletingSpecifiedRowV2 on its own 
> without the other methods, it does not seem to fail so far.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23835) TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2.

2020-02-13 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23835:
---
Description: 
This test method fails a fair amount on me with something like:

TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236
 expected:<3> but was:<2>

I had a hunch that it might be due to interference from other test methods 
running first so I tried changing the table name for just this method to be 
unique - still fails.

However, when I just run 




  was:
This test method fails a fair amount on me with something like:





> TestFromClientSide3 and subclasses often fail on 
> testScanAfterDeletingSpecifiedRowV2.
> -
>
> Key: HBASE-23835
> URL: https://issues.apache.org/jira/browse/HBASE-23835
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
>
> This test method fails a fair amount on me with something like:
> TestFromClientSide3WoUnsafe>TestFromClientSide3.testScanAfterDeletingSpecifiedRowV2:236
>  expected:<3> but was:<2>
> I had a hunch that it might be due to interference from other test methods 
> running first so I tried changing the table name for just this method to be 
> unique - still fails.
> However, when I just run 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23835) TestFromClientSide3 and subclasses often fail on testScanAfterDeletingSpecifiedRowV2.

2020-02-13 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23835:
--

 Summary: TestFromClientSide3 and subclasses often fail on 
testScanAfterDeletingSpecifiedRowV2.
 Key: HBASE-23835
 URL: https://issues.apache.org/jira/browse/HBASE-23835
 Project: HBase
  Issue Type: Test
Affects Versions: master
Reporter: Mark Robert Miller


This test method fails a fair amount on me with something like:






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.

2020-02-12 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035886#comment-17035886
 ] 

Mark Robert Miller commented on HBASE-23830:


Thanks [~stack].

I see similar fails in these other tests:

[ERROR]   
TestReplicationEndpointWithMultipleAsyncWAL>TestReplicationEndpoint.testInterClusterReplication:231
 Waiting timed out after [30,000] msec Failed to replicate all edits, expected 
= 2500 replicated = 2491
[ERROR]   
TestReplicationEndpointWithMultipleWAL>TestReplicationEndpoint.testInterClusterReplication:231
 Waiting timed out after [30,000] msec Failed to replicate all edits, expected 
= 2500 replicated = 2440

Will attach some logs for those if this is likely the same issue.

> TestReplicationEndpoint appears to fail a lot in my attempts for a clean test 
> run locally.
> --
>
> Key: HBASE-23830
> URL: https://issues.apache.org/jira/browse/HBASE-23830
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
> Attachments: test_fails.tar.xz
>
>
> This test is failing for me something like 30-40% of the time. The failure is 
> usually as shown below. I've tried increasing the wait timeout, but that does 
> not seem to help at all.
> {code}
> [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 105.145 s <<< FAILURE! - in 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: 
> 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - 
> in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication
>   Time elapsed: 38.725 s  <<< FAILURE!java.lang.AssertionError: Waiting timed 
> out after [30,000] msec Failed to replicate all edits, expected = 2500 
> replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23831) TestChoreService is very sensitive to resources.

2020-02-11 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23831:
--

 Summary: TestChoreService is very sensitive to resources.
 Key: HBASE-23831
 URL: https://issues.apache.org/jira/browse/HBASE-23831
 Project: HBase
  Issue Type: Test
Affects Versions: master
Reporter: Mark Robert Miller


More details following.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.

2020-02-11 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034883#comment-17034883
 ] 

Mark Robert Miller commented on HBASE-23830:


I've attached the 13 fail logs that I got out of the last 30 runs on master.

> TestReplicationEndpoint appears to fail a lot in my attempts for a clean test 
> run locally.
> --
>
> Key: HBASE-23830
> URL: https://issues.apache.org/jira/browse/HBASE-23830
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
> Attachments: test_fails.tar.xz
>
>
> This test is failing for me something like 30-40% of the time. The failure is 
> usually as shown below. I've tried increasing the wait timeout, but that does 
> not seem to help at all.
> {code}
> [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 105.145 s <<< FAILURE! - in 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: 
> 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - 
> in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication
>   Time elapsed: 38.725 s  <<< FAILURE!java.lang.AssertionError: Waiting timed 
> out after [30,000] msec Failed to replicate all edits, expected = 2500 
> replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.

2020-02-11 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23830:
---
Attachment: test_fails.tar.xz

> TestReplicationEndpoint appears to fail a lot in my attempts for a clean test 
> run locally.
> --
>
> Key: HBASE-23830
> URL: https://issues.apache.org/jira/browse/HBASE-23830
> Project: HBase
>  Issue Type: Test
>Affects Versions: master
>Reporter: Mark Robert Miller
>Priority: Major
> Attachments: test_fails.tar.xz
>
>
> This test is failing for me something like 30-40% of the time. The failure is 
> usually as shown below. I've tried increasing the wait timeout, but that does 
> not seem to help at all.
> {code}
> [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 105.145 s <<< FAILURE! - in 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: 
> 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - 
> in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication
>   Time elapsed: 38.725 s  <<< FAILURE!java.lang.AssertionError: Waiting timed 
> out after [30,000] msec Failed to replicate all edits, expected = 2500 
> replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at 
> org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at 
> org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23830) TestReplicationEndpoint appears to fail a lot in my attempts for a clean test run locally.

2020-02-11 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23830:
--

 Summary: TestReplicationEndpoint appears to fail a lot in my 
attempts for a clean test run locally.
 Key: HBASE-23830
 URL: https://issues.apache.org/jira/browse/HBASE-23830
 Project: HBase
  Issue Type: Test
Affects Versions: master
Reporter: Mark Robert Miller


This test is failing for me something like 30-40% of the time. The failure is 
usually as shown below. I've tried increasing the wait timeout, but that does 
not seem to help at all.

{code}
[ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 
s <<< FAILURE! - in 
org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] Tests run: 
7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.145 s <<< FAILURE! - 
in org.apache.hadoop.hbase.replication.TestReplicationEndpoint[ERROR] 
org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication
  Time elapsed: 38.725 s  <<< FAILURE!java.lang.AssertionError: Waiting timed 
out after [30,000] msec Failed to replicate all edits, expected = 2500 
replicated = 2476 at org.junit.Assert.fail(Assert.java:89) at 
org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:203) at 
org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:137) at 
org.apache.hadoop.hbase.replication.TestReplicationEndpoint.testInterClusterReplication(TestReplicationEndpoint.java:235){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-02-10 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034097#comment-17034097
 ] 

Mark Robert Miller commented on HBASE-23779:


FYI re: the -T arg

I'm finding it's pretty sensitive to the Maven version - the 3.5.* I seem to 
get from the yetus Dockerfile in dev-support is crashing all the time, while 
3.6.1 has been behaving as I'm used to (my main desktop had 3.6.1 to start 
with).

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments: addendum2.patch, test_yetus_934.0.patch
>
>
> Tests take a long time. Our fork count running all tests is conservative -- 
> 1 (small) for the first part and 5 for the second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low, though this is not 
> representative (looking at machine/uptime).
> More parallelism will probably mean more test failures. Let me take a look 
> see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-02-09 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033283#comment-17033283
 ] 

Mark Robert Miller commented on HBASE-23779:


A couple of notes I've noticed:
 # -T doesn't seem to work so well for downloading deps, at the least. If I 
don't first do a run without -T to download and make sure I have all deps at 
some point, I see crashes.
 # In my experience, "cannot create native thread" OOMs tend to be a RAM issue 
more often than an open file limit issue. You can often help that by not 
accepting the huge default stack size per thread - you don't often need nearly 
as much as some of the high defaults these days - try a megabyte (see the 
sketch below).
 # More threads and tests at the same time use more RAM of course; another way 
to help is to peg Xms at something like 256m rather than just setting Xmx - 
that encourages the tests that don't need so much RAM not to claim it to begin 
with.
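
A minimal, illustrative Java sketch of the stack-size point (this is not HBase 
code; the thread-group name and count are made up). The same effect as 
shrinking stacks with -Xss on the forked test JVM's command line can be seen 
with the four-arg Thread constructor, which takes a requested stack size in 
bytes:

{code}
public class StackSizeSketch {
  public static void main(String[] args) {
    // Illustrative only: request ~1MB stacks instead of the platform default.
    // The JVM treats the value as a hint, but smaller per-thread stacks mean
    // the same native memory budget supports more threads -- which is what
    // -Xss1m does for every thread in the JVM.
    ThreadGroup group = new ThreadGroup("test-workers");
    long oneMbStack = 1L << 20;
    for (int i = 0; i < 100; i++) {
      new Thread(group, () -> { /* test work would run here */ },
          "worker-" + i, oneMbStack).start();
    }
  }
}
{code}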

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments: addendum2.patch, test_yetus_934.0.patch
>
>
> Tests take a long time. Our fork count running all tests is conservative -- 
> 1 (small) for the first part and 5 for the second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low, though this is not 
> representative (looking at machine/uptime).
> More parallelism will probably mean more test failures. Let me take a look 
> see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-02-08 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032867#comment-17032867
 ] 

Mark Robert Miller commented on HBASE-23795:


Step one is to get to know the tests very well. This is normally a tall order 
for a mature distributed-systems application, and the scale of the HBase tests 
is beyond what I have run into as normal (>1 hour for all tests in a dev run). 
Because of this, I filed a few issues above and then kind of put on blinders 
for a bit. Unfortunately, it doesn't help matters that the tests hate my 
current DNS, VPN, and OS X environments out of the box. Anyway, I'm getting to 
know the cast of characters. I'm almost done, and as soon as I am, I will wrap 
up my PRs for at least two of those issues. 127.0.0.1 is likely a bit longer of 
a journey to fully complete.

I'll have many more issues to file for step 2.

> Enable all tests to be run in parallel on reused JVMs.
> --
>
> Key: HBASE-23795
> URL: https://issues.apache.org/jira/browse/HBASE-23795
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Major
>
> I'd like to be able to run HBase tests in under 30-40 minutes on good 
> parallel hardware.
> It will require some small changes / fixes for that wish to come true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.

2020-02-06 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23806:
---
Description: 
With HBASE-23795, the hope is to drive tests with maven and surefire much 
closer to their potential.

That will still leave a lot of room for improvement.

For those that have some nice hardware and a need for speed, we can blow right 
past maven+surefire.

> Provide a much faster and efficient alternate option to maven and surefire 
> for running tests.
> -
>
> Key: HBASE-23806
> URL: https://issues.apache.org/jira/browse/HBASE-23806
> Project: HBase
>  Issue Type: Wish
>Reporter: Mark Robert Miller
>Priority: Minor
>
> With HBASE-23795, the hope is to drive tests with maven and surefire much 
> closer to their potential.
> That will still leave a lot of room for improvement.
> For those that have some nice hardware and a need for speed, we can blow 
> right past maven+surefire.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23806) Provide a much faster and efficient alternate option to maven and surefire for running tests.

2020-02-06 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23806:
--

 Summary: Provide a much faster and efficient alternate option to 
maven and surefire for running tests.
 Key: HBASE-23806
 URL: https://issues.apache.org/jira/browse/HBASE-23806
 Project: HBase
  Issue Type: Wish
Reporter: Mark Robert Miller






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23779) Up the default fork count to make builds complete faster; make count relative to CPU count

2020-02-04 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030336#comment-17030336
 ] 

Mark Robert Miller commented on HBASE-23779:


bq. More parallelism will probably mean more test failures. Let me take a look 
see.

I'm going to convince you this is a good thing! But maybe not on the main 
branches until it's a bit smoother.

> Up the default fork count to make builds complete faster; make count relative 
> to CPU count
> --
>
> Key: HBASE-23779
> URL: https://issues.apache.org/jira/browse/HBASE-23779
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Michael Stack
>Priority: Major
>
> Tests take a long time. Our fork count running all tests is conservative -- 
> 1 (small) for the first part and 5 for the second part (medium and large). Rather 
> than hardcoding we should set the fork count to be relative to machine size. 
> Suggestion here is 0.75C where C is CPU count. This ups the CPU use on my box.
> Looking up at jenkins, it seems like the boxes are 24 cores... at least going 
> by my random survey. The load reported on a few seems low, though this is not 
> representative (looking at machine/uptime).
> More parallelism will probably mean more test failures. Let me take a look 
> see.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.

2020-02-04 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030335#comment-17030335
 ] 

Mark Robert Miller commented on HBASE-23783:


Thank you Mr Stack!

I'll figure out all these checks some day - the whitespace slipped by me.

With this committed, it will be easier for me to track down whether there are 
any glaring remaining issues around HBASE-23779.

> Address tests writing and reading SSL/Security files in a common location.
> --
>
> Key: HBASE-23783
> URL: https://issues.apache.org/jira/browse/HBASE-23783
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Assignee: Mark Robert Miller
>Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
>
> This is causing me issues with parallel test runs because multiple tests can 
> write and read the same files in the test-classes directory. Some tests write 
> files in test-classes instead of their test data directory so that they can 
> put the files on the classpath.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23794) Consider setting -XX:MaxDirectMemorySize in the root Maven pom.xml file.

2020-02-04 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030018#comment-17030018
 ] 

Mark Robert Miller commented on HBASE-23794:


I'm still working out what a good suggested value might be. Very few of the 
tests need more than 1g of the 2g of heap they are given, so I'm looking into 
some numbers between those two points.

Largely, it is just nice to be explicit so that all devs and CI envs get the 
same value. Older HotSpot might default to lower values depending on 
arch/client/server, more recent HotSpot defaults to Xmx, HotSpot can change 
again, and other JVMs could do whatever. So a lot of the improvement I imagine 
here is just consistency of the build and knowing the value has been set high 
enough for the tests.

I've run into fails due to this while playing around with giving the tests 
fewer resources - so I'd like to set it high enough to avoid any fails, but 
also remove the confusion of messing with Xmx and then running into off-heap 
allocation failures and that type of thing.
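
To make the off-heap point concrete, here is a small, self-contained sketch 
(not HBase code) showing that -XX:MaxDirectMemorySize is only a cap: nothing is 
reserved up front, and the limit only bites once direct allocations cross it, 
at which point you get the kind of off-heap allocation failure described above.

{code}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with e.g. -XX:MaxDirectMemorySize=64m to see the cap in action.
public class DirectMemoryCapSketch {
  public static void main(String[] args) {
    List<ByteBuffer> buffers = new ArrayList<>();
    try {
      while (true) {
        // Each direct buffer counts against MaxDirectMemorySize, not -Xmx.
        buffers.add(ByteBuffer.allocateDirect(16 * 1024 * 1024)); // 16MB
      }
    } catch (OutOfMemoryError e) {
      // Typically "Direct buffer memory" once the cap is exceeded.
      System.out.println("Hit the direct memory cap after " + buffers.size()
          + " buffers: " + e.getMessage());
    }
  }
}
{code}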

> Consider setting -XX:MaxDirectMemorySize in the root Maven pom.xml file.
> 
>
> Key: HBASE-23794
> URL: https://issues.apache.org/jira/browse/HBASE-23794
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> -XX:MaxDirectMemorySize is an artificial governor on how much off heap memory 
> can be allocated.
> It would be nice to specify explicitly because:
>  # The default can vary by platform / jvm impl - some devs may see random 
> fails
>  # It's just a limiter, it won't pre-allocate or anything
>  # A test env should normally ensure a healthy limit as would be done in 
> production



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23796) Consider using 127.0.0.1 instead of localhost and binding to 127.0.0.1 as well.

2020-02-04 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23796:
--

 Summary: Consider using 127.0.0.1 instead of localhost and binding 
to 127.0.0.1 as well.
 Key: HBASE-23796
 URL: https://issues.apache.org/jira/browse/HBASE-23796
 Project: HBase
  Issue Type: Test
Reporter: Mark Robert Miller


This is perhaps controversial, but there are a variety of problems with 
counting on DNS hostname resolution, especially for localhost.

 
 # It can often be slow, slow under concurrency, or slow under specific 
conditions.
 # It can often not work at all - when on a VPN, with weird DNS hijacking 
hi-jinks, when you have a real hostname for your machines, with a custom 
/etc/hosts file, or when the OS runs its own local/funny DNS server services.
 # This makes coming to HBase a hit-or-miss experience for new devs, and if you 
miss, dealing with and diagnosing the issues is a large endeavor and not 
straightforward or transparent.
 # 99% of the difference doesn't matter in most cases - except that 127.0.0.1 
works and is fast pretty much universally (see the small sketch below).
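
A minimal Java sketch of the difference (purely illustrative, not HBase code): 
resolving the name "localhost" goes through the resolver stack and can be 
remapped or slowed down by /etc/hosts, VPN software, or local DNS services, 
while binding to the 127.0.0.1 literal never asks the resolver anything.

{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class LoopbackSketch {
  public static void main(String[] args) throws Exception {
    long start = System.nanoTime();
    InetAddress byName = InetAddress.getByName("localhost"); // resolver involved
    System.out.println("localhost -> " + byName.getHostAddress() + " in "
        + (System.nanoTime() - start) / 1_000_000 + " ms");

    // Binding explicitly to the loopback literal (ephemeral port) skips name
    // resolution entirely and behaves the same on every machine.
    try (ServerSocket server = new ServerSocket()) {
      server.bind(new InetSocketAddress("127.0.0.1", 0));
      System.out.println("bound to " + server.getLocalSocketAddress());
    }
  }
}
{code}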



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23795) Enable all tests to be run in parallel on reused JVMs.

2020-02-04 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23795:
--

 Summary: Enable all tests to be run in parallel on reused JVMs.
 Key: HBASE-23795
 URL: https://issues.apache.org/jira/browse/HBASE-23795
 Project: HBase
  Issue Type: Wish
Reporter: Mark Robert Miller


I'd like to be able to run HBase tests in under 30-40 minutes on good parallel 
hardware.

It will require some small changes / fixes for that wish to come true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23794) Consider setting -XX:MaxDirectMemorySize in the root Maven pom.xml file.

2020-02-04 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23794:
--

 Summary: Consider setting -XX:MaxDirectMemorySize in the root 
Maven pom.xml file.
 Key: HBASE-23794
 URL: https://issues.apache.org/jira/browse/HBASE-23794
 Project: HBase
  Issue Type: Test
Reporter: Mark Robert Miller


-XX:MaxDirectMemorySize is an artificial governor on how much off heap memory 
can be allocated.

It would be nice to specify explicitly because:
 # The default can vary by platform / jvm impl - some devs may see random fails
 # It's just a limiter, it won't pre-allocate or anything
 # A test env should normally ensure a healthy limit as would be done in 
production



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23787) TestSyncTimeRangeTracker fails quite easily and allocates a very expensive array.

2020-02-03 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23787:
--

 Summary: TestSyncTimeRangeTracker fails quite easily and allocates 
a very expensive array.
 Key: HBASE-23787
 URL: https://issues.apache.org/jira/browse/HBASE-23787
 Project: HBase
  Issue Type: Test
  Components: test
Reporter: Mark Robert Miller


I see this test fail a lot in my environments. It also uses such a large array 
that it seems particularly wasteful of memory, and the size makes it difficult 
to get good contention in the test as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.

2020-02-03 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029230#comment-17029230
 ] 

Mark Robert Miller edited comment on HBASE-23783 at 2/3/20 8:04 PM:


I would also like to add ${surefire.tempDir}.

To safely run surefire from multiple maven instances, you have to be able to 
specify a unique tmp directory. Otherwise, removal of the directory on JVM exit 
can interfere with tmp file creation.


was (Author: markrmiller):
I also like to add   ${surefire.tempDir}

To safely run surefire from multiple maven instances, you have to be able to 
specify a unique tmp directory. 

> Address tests writing and reading SSL/Security files in a common location.
> --
>
> Key: HBASE-23783
> URL: https://issues.apache.org/jira/browse/HBASE-23783
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> This is causing me issues with parallel test runs because multiple tests can 
> write and read the same files in the test-classes directory. Some tests write 
> files in test-classes instead of their test data directory so that they can 
> put the files on the classpath.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.

2020-02-03 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029230#comment-17029230
 ] 

Mark Robert Miller commented on HBASE-23783:


I also like to add   ${surefire.tempDir}

To safely run surefire from multiple maven instances, you have to be able to 
specify a unique tmp directory. 

> Address tests writing and reading SSL/Security files in a common location.
> --
>
> Key: HBASE-23783
> URL: https://issues.apache.org/jira/browse/HBASE-23783
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> This is causing me issues with parallel test runs because multiple tests can 
> write and read the same files in the test-classes directory. Some tests write 
> files in test-classes instead of their test data directory so that they can 
> put the files on the classpath.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.

2020-02-02 Thread Mark Robert Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated HBASE-23783:
---
Description: This is causing me issues with parallel test runs because 
multiple tests can write and read the same files in the test-classes directory. 
Some tests write files in test-classes instead of their test data directory so 
that they can put the files on the classpath.  (was: This is causing me issues 
with parallel test runs.)

> Address tests writing and reading SSL/Security files in a common location.
> --
>
> Key: HBASE-23783
> URL: https://issues.apache.org/jira/browse/HBASE-23783
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> This is causing me issues with parallel test runs because multiple tests can 
> write and read the same files in the test-classes directory. Some tests write 
> files in test-classes instead of their test data directory so that they can 
> put the files on the classpath.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.

2020-02-02 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028563#comment-17028563
 ] 

Mark Robert Miller commented on HBASE-23783:


I recently switched from Eclipse to IntelliJ - I have a little extra code 
formatting to clean up.

This seems to be working better for me. I was creating a new unique 
subdirectory on the classpath for these files, but it was simpler in the end 
just to use unique file names and keep the files in the root test-classes 
directory in the cases where they were already located there.

So far this has worked out well, still doing some testing.
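
As a rough illustration of the unique-name approach (the class, helper, and 
file names here are hypothetical, not the actual test code): keep the generated 
file under the test-classes root so it stays on the classpath, but give it a 
per-run name so parallel JVMs never read or overwrite each other's files.

{code}
import java.io.File;
import java.util.UUID;

public class UniqueTestFileSketch {
  // Hypothetical helper: build a per-run unique keystore name under the given
  // test-classes directory so concurrent test JVMs cannot collide.
  static File uniqueKeystoreFile(File testClassesDir) {
    return new File(testClassesDir, "test-keystore-" + UUID.randomUUID() + ".jks");
  }

  public static void main(String[] args) {
    File testClassesDir = new File("target/test-classes"); // typical Maven layout
    System.out.println("would write keystore to: "
        + uniqueKeystoreFile(testClassesDir));
  }
}
{code}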

> Address tests writing and reading SSL/Security files in a common location.
> --
>
> Key: HBASE-23783
> URL: https://issues.apache.org/jira/browse/HBASE-23783
> Project: HBase
>  Issue Type: Test
>Reporter: Mark Robert Miller
>Priority: Minor
>
> This is causing me issues with parallel test runs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23783) Address tests writing and reading SSL/Security files in a common location.

2020-02-02 Thread Mark Robert Miller (Jira)
Mark Robert Miller created HBASE-23783:
--

 Summary: Address tests writing and reading SSL/Security files in a 
common location.
 Key: HBASE-23783
 URL: https://issues.apache.org/jira/browse/HBASE-23783
 Project: HBase
  Issue Type: Test
Reporter: Mark Robert Miller


This is causing me issues with parallel test runs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)