On Thu, 14 Sep 2023 08:00:56 GMT, Aleksey Shipilev <[email protected]> wrote:
>
>> > and consume the usual amount of memory.
>>
>> And how much is that? And at what concurrency level will we not be able to
>> run these tests in parallel without potentially impacting the way they run
>> i.e. running out of memory sooner than expected?
>
> They run at the standard heap sizes for the tests, driven by the
> `MaxRAMPercentage` set by the build system. On my 18-core test servers, most of
> them run with ~700 MB RSS, sometimes peaking at ~1.1 GB. AFAICS, this is a
> common RSS for VM/GC tests. These tests eat Java heap / class memory and exit
> as soon as they catch OOME or load all the classes. The extended parallelism
> might delay that a bit, but I don't see this manifesting in practice.
>
>> I'm concerned that this set of PRs to remove exclusive testing is going to
>> cause a headache for those of us who have to monitor and triage CI testing.
>> If I see one of these tests fail after this change goes in, there is nothing
>> to give me any hint as to what has changed - no git log for the test file
>> will show me something was modified!
>
> True. That's one of the reasons to avoid external test configs, whether it is
> `TEST.properties` near the tests, or the settings in global suite `TEST.ROOT`.
>
> There are two bonus points from a maintenance perspective:
>
> 1. (technical) Note that the current `exclusiveDirs` limit only the _in-group_
> parallelism. This means there is a random chance that something else is
> running concurrently with these tests, if that test is outside of this
> test group. So it is not like we are deciding if these tests should run in
> complete resource isolation from everything else or not -- they already are
> not isolated. Which means, if tests experience resource starvation, it would
> manifest pretty randomly, depending on what had been running in parallel.
> Unblocking the _in-group_ parallelism makes these conditions manifest more
> reliably. Which, I argue, benefits test maintainability: if tests can fail
> due to resource starvation, they would do so more often than once in a blue
> moon. We verify that this is unlikely to happen by stress-testing multiple
> iterations of these tests.
>
> 2. (organizational) Due to these parallelism blockages, `tier4` is
> remarkably slow. It is >10x slower than `tier3`, for example, and it gets
> worse the more untapped parallelism there is on the machine. Which is why I
> see that both ad-hoc developer and vendor testing pipelines do not run `tier4`
> as frequently as they run `tier{1,2,3}`. Making `tier4` more parallel, and thus
> faster to run, me...
@shipilev thanks for the broader context, but what platforms and
configurations are you actually testing on?
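
For a rough feel of the concurrency question quoted above, here is a back-of-the-envelope sketch. It assumes the pessimistic case where every concurrently running test sits at its peak RSS at once. The class name, method name, and the 32 GiB host size are hypothetical; only the ~700 MB and ~1.1G figures come from the quoted numbers, taken here as 700 MiB and 1100 MiB.

```java
public class ConcurrencyBudget {

    /**
     * Largest number of tests that fit into the memory budget if every
     * test is at its peak RSS at the same moment (a pessimistic bound,
     * not a measurement of what jtreg actually does).
     */
    static long maxConcurrentTests(long availableBytes, long peakRssBytes) {
        if (availableBytes < 0 || peakRssBytes <= 0) {
            throw new IllegalArgumentException("memory sizes must be positive");
        }
        return availableBytes / peakRssBytes;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;

        long typicalRss = 700L * mib;   // "~700 MB RSS" from the quoted reply
        long peakRss = 1100L * mib;     // "~1.1G" peak from the quoted reply
        long hostMemory = 32L * 1024L * mib; // hypothetical 32 GiB test host

        System.out.println("At typical RSS: "
                + maxConcurrentTests(hostMemory, typicalRss) + " tests");
        System.out.println("At peak RSS:    "
                + maxConcurrentTests(hostMemory, peakRss) + " tests");
    }
}
```

This ignores everything else running on the host, so real headroom would be smaller. The answer also depends on how much memory each platform and configuration has.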
-------------
PR Comment: https://git.openjdk.org/jdk/pull/15689#issuecomment-1720374685