Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

David Holmes Thu, 01 Aug 2019 00:57:36 -0700

On 1/08/2019 5:13 pm, Baesken, Matthias wrote:

Hi David + JC , thanks for the reviews .


David - I added the  suggested print-outputs  , and  also  the parameter to 
executeThreadDumps .


Okay - thanks for that.

David
-----

Best regards, Matthias

-----Original Message-----
From: David Holmes <[email protected]>
Sent: Mittwoch, 31. Juli 2019 23:57
To: Baesken, Matthias <[email protected]>; Jean Christophe
Beyler <[email protected]>
Cc: [email protected]; serviceability-dev <serviceability-
[email protected]>
Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast
Linux machines with Total safepoint time 0 ms

On 1/08/2019 12:01 am, Baesken, Matthias wrote:


Hi upload works again, now with webrev :

http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/


Could you please add, for diagnostic purposes:

System.out.println("Total safepoint time (ms): " + value);

after:

   60         long value = executeThreadDumps();

and

   68         long value2 = executeThreadDumps();

that way if the test fails we can check logs to see what kind of
safepoint times have been observed previously. No need to see an updated
webrev just for that.

I have one further suggestion, take it or leave it, that
executeThreadDumps() takes a parameter to specify the initial value, so
we'd have:

   60         long value = executeThreadDumps(0);

and

   68         long value2 = executeThreadDumps(value);

This might help detect getTotalSafepointTime() going backwards slightly
better than current code.

Thanks,
David

Best regards, Matthias

-----Original Message-----
From: Baesken, Matthias
Sent: Mittwoch, 31. Juli 2019 14:05
To: 'David Holmes' <[email protected]>; Jean Christophe Beyler
<[email protected]>
Cc: [email protected]; serviceability-dev <serviceability-
[email protected]>
Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on

fast

Linux machines with Total safepoint time 0 ms

Hello, here is a version following  the latest proposal of JC .

Unfortunately  attached as patch,  sorry for that - the uploads / pushes
currently  do not work from here .

Best regards, Matthias

-----Original Message-----
From: David Holmes <[email protected]>
Sent: Mittwoch, 31. Juli 2019 05:04
To: Jean Christophe Beyler <[email protected]>
Cc: Baesken, Matthias <[email protected]>; hotspot-
[email protected]; serviceability-dev <serviceability-
[email protected]>
Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on

fast

Linux machines with Total safepoint time 0 ms

On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:

FWIW, I would have done something like what David was suggesting,

just

slightly tweaked:

public static long executeThreadDumps() {
    long value;
    long initial_value = mbean.getTotalSafepointTime();
    do {
        Thread.getAllStackTraces();
        value = mbean.getTotalSafepointTime();
    } while (value == initial_value);
    return value;
}

This ensures that the value is a new value as opposed to the current
value and if something goes wrong, as David said, it will timeout; which
is ok.


Works for me.

But I come back to not really understanding why we are doing this at
this point of relaxing (just get a new value of safepoint time).
Because, if we accept timeouts now as a failure here, then really the
whole test becomes:

executeThreadDumps();
executeThreadDumps();

Since the first call will return when value > 0 and the second call will
return when value2 > value (I still wonder why we want to ensure it
works twice...).


The test is trying to sanity check that we are actually recording the
time used by safepoints. So first check is that we can get a non-zero
value; second check is we get a greater non-zero value. It's just a
sanity test to try and catch if something gets unexpectedly broken in
the time tracking code.

So both failures and even testing for it is kind of redundant, once you
have a do/while until a change?


Yes - the problem with the tests that try to check internal VM behaviour
is that we have no specified way to do something, in this case execute
safepoints, that relates to internal VM behaviour, so we have to do
something we know will currently work even if not specified to do so -
e.g. dumping all thread stacks uses a global safepoint. The second
problem is that the timer granularity is so coarse that we then have to
guess how many times we need to do that something before seeing a
change. To make the test robust we can keep doing stuff until we see a
change and so the only way that will fail is if the overall timeout of
the test kicks in. Or we can try and second guess how long it should
take by introducing our own internal timeout - either directly or by
limiting the number of loops in this case. That has its own problems and
in general we have tried to reduce internal test timeouts (by removing
them) and let overall timeouts take charge.

No ideal solution. And this has already consumed way too much of
everyone's time.

Cheers,
David

Thanks,
Jc


On Tue, Jul 30, 2019 at 2:35 PM David Holmes

<[email protected]

<mailto:[email protected]>> wrote:

      On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
       > Hi David,   "put that whole code (the while loop) in a helper
      method."   was JC's idea,  and I like the idea .

      Regardless I think the way you are using NUM_THREAD_DUMPS is

really

      confusing. As an all-caps static you'd expect it to be a constant.

      Thanks,
      David

       > Let's see what others think .
       >
       >>
       >> Overall tests like this are not very useful, yet very fragile.
       >>
       >
       > I am also  fine with putting the test on the exclude list.
       >
       > Best regards, Matthias
       >
       >
       >> -----Original Message-----
       >> From: David Holmes <[email protected]
      <mailto:[email protected]>>
       >> Sent: Dienstag, 30. Juli 2019 14:12
       >> To: Baesken, Matthias <[email protected]
      <mailto:[email protected]>>; Jean Christophe
       >> Beyler <[email protected] <mailto:[email protected]>>
       >> Cc: [email protected]
      <mailto:[email protected]>; serviceability-dev
      <serviceability-
       >> [email protected] <mailto:[email protected]>>
       >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java
      fails on fast
       >> Linux machines with Total safepoint time 0 ms
       >>
       >> Hi Matthias,
       >>
       >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
       >>> Hello  JC / David,   here is a second webrev  :
       >>>
       >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
       >>>
       >>> It moves   the  thread dump execution into a  method
       >>> executeThreadDumps(long)     , and also adds  while loops
      (but with a
       >>> limitation  for the number of thread dumps, really don’t
       >>> want to cause timeouts etc.).    I removed a check for
       >>> MAX_VALUE_FOR_PASS   because we cannot go over

Long.MAX_VALUE .

       >>
       >> I don't think executeThreadDumps is worth factoring out like out.
       >>
       >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd

rather

it

       >> remains a constant 100, and then you set a simple loop iteration
      count
       >> limit. Further with the proposed code when you get here:
       >>
       >>    85         NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
       >>
       >> you don't even know what value you may be starting with.
       >>
       >> But I was thinking of simply:
       >>
       >> long value = 0;
       >> do {
       >>       Thread.getAllStackTraces();
       >>       value = mbean.getTotalSafepointTime();
       >> } while (value == 0);
       >>
       >> We'd only hit a timeout if something is completely broken -
      which is fine.
       >>
       >> Overall tests like this are not very useful, yet very fragile.
       >>
       >> Thanks,
       >> David
       >>
       >>> Hope you like this version  better.
       >>>
       >>> Best regards, Matthias
       >>>
       >>> *From:*Jean Christophe Beyler <[email protected]
      <mailto:[email protected]>>
       >>> *Sent:* Dienstag, 30. Juli 2019 05:39
       >>> *To:* David Holmes <[email protected]
      <mailto:[email protected]>>
       >>> *Cc:* Baesken, Matthias <[email protected]
      <mailto:[email protected]>>;
       >>> [email protected]
      <mailto:[email protected]>; serviceability-dev
       >>> <[email protected]
      <mailto:[email protected]>>
       >>> *Subject:* Re: RFR: [XS] 8228658: test
      GetTotalSafepointTime.java fails
       >>> on fast Linux machines with Total safepoint time 0 ms
       >>>
       >>> Hi Matthias,
       >>>
       >>> I wonder if you should not do what David is suggesting and then
      put that
       >>> whole code (the while loop) in a helper method. Below you

have a

       >>> calculation again using value2 (which I wonder what the added
      value of
       >>> it is though) but anyway, that value2 could also be 0 at some
      point, no?
       >>>
       >>> So would it not be best to just refactor the getAllStackTraces

and

       >>> calculate safepoint time in a helper method for both value /

value2

       >>> variables?
       >>>
       >>> Thanks,
       >>>
       >>> Jc
       >>>
       >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
      <[email protected] <mailto:[email protected]>
       >>> <mailto:[email protected]
      <mailto:[email protected]>>> wrote:
       >>>
       >>>      Hi Matthias,
       >>>
       >>>      On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
       >>>       > Hello , please review this small test fix .
       >>>       >
       >>>       > The test
       >>>
       >>

test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.

       >> java
       >>>      fails sometimes on fast Linux machines with this error
      message :
       >>>       >
       >>>       > java.lang.RuntimeException: Total safepoint time
      illegal value: 0
       >>>      ms (MIN = 1; MAX = 9223372036854775807)
       >>>       >
       >>>       > looks like the total safepoint time is too low
      currently on these
       >>>      machines, it is < 1 ms.
       >>>       >
       >>>       > There might be several ways to handle this :
       >>>       >
       >>>       >    *   Change the test  in a way that it might generate
      nigher
       >>>      safepoint times
       >>>       >    *   Allow  safepoint time  == 0 ms
       >>>       >    *   Offer an additional interface that gives
      safepoint times
       >>>      with finer granularity ( currently the HS has safepoint
      time values
       >>>      in ns , see  jdk/src/hotspot/share/runtime/safepoint.cpp
       >>>        SafepointTracing::end
       >>>       >
       >>>       > But it is converted on ms in this code
       >>>       >
       >>>       > 114jlong RuntimeService::safepoint_time_ms() {
       >>>       > 115  return UsePerfData ?
       >>>       > 116
       >>>
      Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -

1;

       >>>       > 117}
       >>>       >
       >>>       > 064jlong Management::ticks_to_ms(jlong ticks) {
       >>>       > 2065  assert(os::elapsed_frequency() > 0, "Must be
      non-zero");
       >>>       > 2066  return (jlong)(((double)ticks /
       >>>      (double)os::elapsed_frequency())
       >>>       > 2067                 * (double)1000.0);
       >>>       > 2068}
       >>>       >
       >>>       >
       >>>       >
       >>>       > Currently I go for  the first attempt (and try to generate
       >>>      higher safepoint times in my patch) .
       >>>
       >>>      Yes that's probably best. Coarse-grained timing on very
      fast machines
       >>>      was bound to eventually lead to problems.
       >>>
       >>>      But perhaps a more future-proof approach is to just add a
      do-while loop
       >>>      around the stack dumps and only exit when we have a non-

zero

       >> safepoint
       >>>      time?
       >>>
       >>>      Thanks,
       >>>      David
       >>>      -----
       >>>
       >>>       > Bug/webrev :
       >>>       >
       >>>       > https://bugs.openjdk.java.net/browse/JDK-8228658
       >>>       >
       >>>       >

http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/

       >>>       >
       >>>       >
       >>>       > Thanks, Matthias
       >>>       >
       >>>
       >>>
       >>> --
       >>>
       >>> Thanks,
       >>>
       >>> Jc
       >>>



--

Thanks,
Jc

Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

Reply via email to