RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

Baesken, Matthias Wed, 31 Jul 2019 07:03:37 -0700

Hi upload works again, now with webrev :

http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/


Best regards, Matthias


> -----Original Message-----
> From: Baesken, Matthias
> Sent: Mittwoch, 31. Juli 2019 14:05
> To: 'David Holmes' <[email protected]>; Jean Christophe Beyler
> <[email protected]>
> Cc: [email protected]; serviceability-dev <serviceability-
> [email protected]>
> Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast
> Linux machines with Total safepoint time 0 ms
> 
> Hello, here is a version following  the latest proposal of JC .
> 
> Unfortunately  attached as patch,  sorry for that - the uploads / pushes
> currently  do not work from here .
> 
> Best regards, Matthias
> 
> 
> > -----Original Message-----
> > From: David Holmes <[email protected]>
> > Sent: Mittwoch, 31. Juli 2019 05:04
> > To: Jean Christophe Beyler <[email protected]>
> > Cc: Baesken, Matthias <[email protected]>; hotspot-
> > [email protected]; serviceability-dev <serviceability-
> > [email protected]>
> > Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on
> fast
> > Linux machines with Total safepoint time 0 ms
> >
> > On 31/07/2019 9:08 am, Jean Christophe Beyler wrote:
> > > FWIW, I would have done something like what David was suggesting, just
> > > slightly tweaked:
> > >
> > > public static long executeThreadDumps() {
> > >   long value;
> > >   long initial_value = mbean.getTotalSafepointTime();
> > >   do {
> > >       Thread.getAllStackTraces();
> > >       value = mbean.getTotalSafepointTime();
> > >   } while (value == initial_value);
> > >   return value;
> > > }
> > >
> > > This ensures that the value is a new value as opposed to the current
> > > value and if something goes wrong, as David said, it will timeout; which
> > > is ok.
> >
> > Works for me.
> >
> > > But I come back to not really understanding why we are doing this at
> > > this point of relaxing (just get a new value of safepoint time).
> > > Because, if we accept timeouts now as a failure here, then really the
> > > whole test becomes:
> > >
> > > executeThreadDumps();
> > > executeThreadDumps();
> > >
> > > Since the first call will return when value > 0 and the second call will
> > > return when value2 > value (I still wonder why we want to ensure it
> > > works twice...).
> >
> > The test is trying to sanity check that we are actually recording the
> > time used by safepoints. So first check is that we can get a non-zero
> > value; second check is we get a greater non-zero value. It's just a
> > sanity test to try and catch if something gets unexpectedly broken in
> > the time tracking code.
> >
> > > So both failures and even testing for it is kind of redundant, once you
> > > have a do/while until a change?
> >
> > Yes - the problem with the tests that try to check internal VM behaviour
> > is that we have no specified way to do something, in this case execute
> > safepoints, that relates to internal VM behaviour, so we have to do
> > something we know will currently work even if not specified to do so -
> > e.g. dumping all thread stacks uses a global safepoint. The second
> > problem is that the timer granularity is so coarse that we then have to
> > guess how many times we need to do that something before seeing a
> > change. To make the test robust we can keep doing stuff until we see a
> > change and so the only way that will fail is if the overall timeout of
> > the test kicks in. Or we can try and second guess how long it should
> > take by introducing our own internal timeout - either directly or by
> > limiting the number of loops in this case. That has its own problems and
> > in general we have tried to reduce internal test timeouts (by removing
> > them) and let overall timeouts take charge.
> >
> > No ideal solution. And this has already consumed way too much of
> > everyone's time.
> >
> > Cheers,
> > David
> >
> > > Thanks,
> > > Jc
> > >
> > >
> > > On Tue, Jul 30, 2019 at 2:35 PM David Holmes <[email protected]
> > > <mailto:[email protected]>> wrote:
> > >
> > >     On 30/07/2019 10:39 pm, Baesken, Matthias wrote:
> > >      > Hi David,   "put that whole code (the while loop) in a helper
> > >     method."   was JC's idea,  and I like the idea .
> > >
> > >     Regardless I think the way you are using NUM_THREAD_DUMPS is
> really
> > >     confusing. As an all-caps static you'd expect it to be a constant.
> > >
> > >     Thanks,
> > >     David
> > >
> > >      > Let's see what others think .
> > >      >
> > >      >>
> > >      >> Overall tests like this are not very useful, yet very fragile.
> > >      >>
> > >      >
> > >      > I am also  fine with putting the test on the exclude list.
> > >      >
> > >      > Best regards, Matthias
> > >      >
> > >      >
> > >      >> -----Original Message-----
> > >      >> From: David Holmes <[email protected]
> > >     <mailto:[email protected]>>
> > >      >> Sent: Dienstag, 30. Juli 2019 14:12
> > >      >> To: Baesken, Matthias <[email protected]
> > >     <mailto:[email protected]>>; Jean Christophe
> > >      >> Beyler <[email protected] <mailto:[email protected]>>
> > >      >> Cc: [email protected]
> > >     <mailto:[email protected]>; serviceability-dev
> > >     <serviceability-
> > >      >> [email protected] <mailto:[email protected]>>
> > >      >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java
> > >     fails on fast
> > >      >> Linux machines with Total safepoint time 0 ms
> > >      >>
> > >      >> Hi Matthias,
> > >      >>
> > >      >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote:
> > >      >>> Hello  JC / David,   here is a second webrev  :
> > >      >>>
> > >      >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/
> > >      >>>
> > >      >>> It moves   the  thread dump execution into a  method
> > >      >>> executeThreadDumps(long)     , and also adds  while loops
> > >     (but with a
> > >      >>> limitation  for the number of thread dumps, really don’t
> > >      >>> want to cause timeouts etc.).    I removed a check for
> > >      >>> MAX_VALUE_FOR_PASS   because we cannot go over
> > Long.MAX_VALUE .
> > >      >>
> > >      >> I don't think executeThreadDumps is worth factoring out like out.
> > >      >>
> > >      >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd rather
> it
> > >      >> remains a constant 100, and then you set a simple loop iteration
> > >     count
> > >      >> limit. Further with the proposed code when you get here:
> > >      >>
> > >      >>    85         NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2;
> > >      >>
> > >      >> you don't even know what value you may be starting with.
> > >      >>
> > >      >> But I was thinking of simply:
> > >      >>
> > >      >> long value = 0;
> > >      >> do {
> > >      >>       Thread.getAllStackTraces();
> > >      >>       value = mbean.getTotalSafepointTime();
> > >      >> } while (value == 0);
> > >      >>
> > >      >> We'd only hit a timeout if something is completely broken -
> > >     which is fine.
> > >      >>
> > >      >> Overall tests like this are not very useful, yet very fragile.
> > >      >>
> > >      >> Thanks,
> > >      >> David
> > >      >>
> > >      >>> Hope you like this version  better.
> > >      >>>
> > >      >>> Best regards, Matthias
> > >      >>>
> > >      >>> *From:*Jean Christophe Beyler <[email protected]
> > >     <mailto:[email protected]>>
> > >      >>> *Sent:* Dienstag, 30. Juli 2019 05:39
> > >      >>> *To:* David Holmes <[email protected]
> > >     <mailto:[email protected]>>
> > >      >>> *Cc:* Baesken, Matthias <[email protected]
> > >     <mailto:[email protected]>>;
> > >      >>> [email protected]
> > >     <mailto:[email protected]>; serviceability-dev
> > >      >>> <[email protected]
> > >     <mailto:[email protected]>>
> > >      >>> *Subject:* Re: RFR: [XS] 8228658: test
> > >     GetTotalSafepointTime.java fails
> > >      >>> on fast Linux machines with Total safepoint time 0 ms
> > >      >>>
> > >      >>> Hi Matthias,
> > >      >>>
> > >      >>> I wonder if you should not do what David is suggesting and then
> > >     put that
> > >      >>> whole code (the while loop) in a helper method. Below you have a
> > >      >>> calculation again using value2 (which I wonder what the added
> > >     value of
> > >      >>> it is though) but anyway, that value2 could also be 0 at some
> > >     point, no?
> > >      >>>
> > >      >>> So would it not be best to just refactor the getAllStackTraces 
> > > and
> > >      >>> calculate safepoint time in a helper method for both value / 
> > > value2
> > >      >>> variables?
> > >      >>>
> > >      >>> Thanks,
> > >      >>>
> > >      >>> Jc
> > >      >>>
> > >      >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes
> > >     <[email protected] <mailto:[email protected]>
> > >      >>> <mailto:[email protected]
> > >     <mailto:[email protected]>>> wrote:
> > >      >>>
> > >      >>>      Hi Matthias,
> > >      >>>
> > >      >>>      On 29/07/2019 8:20 pm, Baesken, Matthias wrote:
> > >      >>>       > Hello , please review this small test fix .
> > >      >>>       >
> > >      >>>       > The test
> > >      >>>
> > >      >>
> >
> test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime.
> > >      >> java
> > >      >>>      fails sometimes on fast Linux machines with this error
> > >     message :
> > >      >>>       >
> > >      >>>       > java.lang.RuntimeException: Total safepoint time
> > >     illegal value: 0
> > >      >>>      ms (MIN = 1; MAX = 9223372036854775807)
> > >      >>>       >
> > >      >>>       > looks like the total safepoint time is too low
> > >     currently on these
> > >      >>>      machines, it is < 1 ms.
> > >      >>>       >
> > >      >>>       > There might be several ways to handle this :
> > >      >>>       >
> > >      >>>       >    *   Change the test  in a way that it might generate
> > >     nigher
> > >      >>>      safepoint times
> > >      >>>       >    *   Allow  safepoint time  == 0 ms
> > >      >>>       >    *   Offer an additional interface that gives
> > >     safepoint times
> > >      >>>      with finer granularity ( currently the HS has safepoint
> > >     time values
> > >      >>>      in ns , see  jdk/src/hotspot/share/runtime/safepoint.cpp
> > >      >>>        SafepointTracing::end
> > >      >>>       >
> > >      >>>       > But it is converted on ms in this code
> > >      >>>       >
> > >      >>>       > 114jlong RuntimeService::safepoint_time_ms() {
> > >      >>>       > 115  return UsePerfData ?
> > >      >>>       > 116
> > >      >>>
> > >     Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : -1;
> > >      >>>       > 117}
> > >      >>>       >
> > >      >>>       > 064jlong Management::ticks_to_ms(jlong ticks) {
> > >      >>>       > 2065  assert(os::elapsed_frequency() > 0, "Must be
> > >     non-zero");
> > >      >>>       > 2066  return (jlong)(((double)ticks /
> > >      >>>      (double)os::elapsed_frequency())
> > >      >>>       > 2067                 * (double)1000.0);
> > >      >>>       > 2068}
> > >      >>>       >
> > >      >>>       >
> > >      >>>       >
> > >      >>>       > Currently I go for  the first attempt (and try to 
> > > generate
> > >      >>>      higher safepoint times in my patch) .
> > >      >>>
> > >      >>>      Yes that's probably best. Coarse-grained timing on very
> > >     fast machines
> > >      >>>      was bound to eventually lead to problems.
> > >      >>>
> > >      >>>      But perhaps a more future-proof approach is to just add a
> > >     do-while loop
> > >      >>>      around the stack dumps and only exit when we have a non-zero
> > >      >> safepoint
> > >      >>>      time?
> > >      >>>
> > >      >>>      Thanks,
> > >      >>>      David
> > >      >>>      -----
> > >      >>>
> > >      >>>       > Bug/webrev :
> > >      >>>       >
> > >      >>>       > https://bugs.openjdk.java.net/browse/JDK-8228658
> > >      >>>       >
> > >      >>>       > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/
> > >      >>>       >
> > >      >>>       >
> > >      >>>       > Thanks, Matthias
> > >      >>>       >
> > >      >>>
> > >      >>>
> > >      >>> --
> > >      >>>
> > >      >>> Thanks,
> > >      >>>
> > >      >>> Jc
> > >      >>>
> > >
> > >
> > >
> > > --
> > >
> > > Thanks,
> > > Jc

RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast Linux machines with Total safepoint time 0 ms

Reply via email to