On Thu, 14 Sep 2023 05:44:51 GMT, David Holmes <dhol...@openjdk.org> wrote:

> > and consume the usual amount of memory.
> 
> And how much is that? And at what concurrency level will we not be able to 
> run these tests in parallel without potentially impacting the way they run 
> i.e. running out of memory sooner than expected?

They run at the standard heap sizes for tests, driven by the `MaxRAMPercentage` 
set up by the build system. On my 18-core test servers, most of them run with 
~700 MB RSS, sometimes peaking at ~1.1 GB. AFAICS, this is a common RSS for 
VM/GC tests. These tests eat Java heap / class memory and exit as soon as they 
catch the OOME or finish loading all the classes. The extended parallelism 
might delay that a bit, but I don't see it manifesting in practice.
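
For illustration (a minimal sketch, not the exact CI setup; the percentage and 
machine size are assumptions), the heap cap works like this:

```
# -XX:MaxRAMPercentage caps the max Java heap as a fraction of physical RAM;
# on a 32 GB machine, 6.25% yields a ~2 GB heap per test JVM, no matter how
# many test JVMs run in parallel. (6.25 and 32 GB are illustrative values.)
java -XX:MaxRAMPercentage=6.25 -XX:+PrintFlagsFinal -version | grep MaxHeapSize
```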

> I'm concerned that these set of PRs to remove exclusive testing are going to 
> cause a headache for those of us who have to monitor and triage CI testing. 
> If I see one of these tests fail after this change goes in, there is nothing 
> to give me any hint as to what has changed - no git log for the test file 
> will show me something was modified!

True. That's one of the reasons to avoid external test configs, whether they 
live in `TEST.properties` next to the tests or in the global suite `TEST.ROOT`.
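
For reference, the kind of per-directory config in question looks roughly like 
this (a sketch; the path and exact contents vary between the actual files):

```
# test/hotspot/jtreg/gc/stress/TEST.properties  (illustrative path)
# Tells jtreg to run every test under this directory with exclusive access,
# i.e. nothing else from the same jtreg invocation runs concurrently with it.
exclusiveAccess.dirs=.
```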

There are two bonus points from a maintenance perspective:

 1. (technical) Note that the current `exclusiveDirs` only limit the _in-group_ 
parallelism. This means there is still a random chance that something else runs 
concurrently with these tests, as long as that something is outside this test 
group. So we are not really deciding whether these tests run in complete 
resource isolation from everything else -- they already are not isolated. Which 
means that if these tests experience resource starvation, it manifests fairly 
randomly, depending on what happened to be running in parallel. Unblocking the 
_in-group_ parallelism makes these conditions manifest more reliably. That, I 
argue, benefits test maintainability: if a test can fail due to resource 
starvation, it would do so more often than once in a blue moon. We verify that 
this is unlikely to happen by stress-testing multiple iterations of these tests 
(see the sketch after this list).
 
 2. (organizational) Due to these parallelism blockages, `tier4` is remarkably 
slow. It is >10x slower than `tier3`, for example, and the more untapped 
parallelism the machine has, the worse it gets. This is why I see both ad-hoc 
developer testing and vendor testing pipelines running `tier4` less frequently 
than `tier{1,2,3}`. Making `tier4` more parallel, and thus faster to run, means 
more frequent testing, more attention to test failures there, and thus _less_ 
individual headache for the handful of people who actually run `tier4` and have 
to analyze its failures.
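
As a rough illustration of both points (the group name, job count, and 
iteration count below are placeholders, not the exact commands from my runs), 
the stress check and the extra parallelism boil down to something like:

```
# Run the test group repeatedly with jtreg concurrency matching the machine,
# so any resource-starvation failures would surface quickly rather than once
# in a blue moon.
make test TEST="hotspot:tier4" JTREG="JOBS=32;REPEAT_COUNT=10"
```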
 
As the stability check for all of this, I have been running `tier4` with _all_ 
`TEST.properties` removed, and it has not failed once (yet?), while running >5x 
faster. Making `tier4` run in under 1 hour on a heavily parallel machine would 
be my dream goal. With these PRs, we are slowly chipping away at that. The 
intent was to lift the parallelism limitations gradually, test group by test 
group, so we can backtrack if problems arise. If you want to go slower, we can.
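
That stability check was a local experiment along these lines (a sketch, not 
part of these PRs; the group spec is illustrative and the cleanup is 
deliberately blunt):

```
# Strip all per-directory jtreg configs, then run tier4 with full parallelism
# to see whether anything starts failing once every exclusivity limit is gone.
find test -name TEST.properties -delete
make test TEST="tier4"
```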

I hope that gives a broader context for this.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15689#issuecomment-1718955534
