Hi David + JC , thanks for the reviews . David - I added the suggested print-outputs , and also the parameter to executeThreadDumps .
Best regards, Matthias > -----Original Message----- > From: David Holmes <david.hol...@oracle.com> > Sent: Mittwoch, 31. Juli 2019 23:57 > To: Baesken, Matthias <matthias.baes...@sap.com>; Jean Christophe > Beyler <jcbey...@google.com> > Cc: hotspot-...@openjdk.java.net; serviceability-dev <serviceability- > d...@openjdk.java.net> > Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on fast > Linux machines with Total safepoint time 0 ms > > On 1/08/2019 12:01 am, Baesken, Matthias wrote: > > > > Hi upload works again, now with webrev : > > > > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.2/ > > Could you please add, for diagnostic purposes: > > System.out.println("Total safepoint time (ms): " + value); > > after: > > 60 long value = executeThreadDumps(); > > and > > 68 long value2 = executeThreadDumps(); > > that way if the test fails we can check logs to see what kind of > safepoint times have been observed previously. No need to see an updated > webrev just for that. > > I have one further suggestion, take it or leave it, that > executeThreadDumps() takes a parameter to specify the initial value, so > we'd have: > > 60 long value = executeThreadDumps(0); > > and > > 68 long value2 = executeThreadDumps(value); > > This might help detect getTotalSafepointTime() going backwards slightly > better than current code. > > Thanks, > David > > > Best regards, Matthias > > > > > >> -----Original Message----- > >> From: Baesken, Matthias > >> Sent: Mittwoch, 31. Juli 2019 14:05 > >> To: 'David Holmes' <david.hol...@oracle.com>; Jean Christophe Beyler > >> <jcbey...@google.com> > >> Cc: hotspot-...@openjdk.java.net; serviceability-dev <serviceability- > >> d...@openjdk.java.net> > >> Subject: RE: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on > fast > >> Linux machines with Total safepoint time 0 ms > >> > >> Hello, here is a version following the latest proposal of JC . > >> > >> Unfortunately attached as patch, sorry for that - the uploads / pushes > >> currently do not work from here . > >> > >> Best regards, Matthias > >> > >> > >>> -----Original Message----- > >>> From: David Holmes <david.hol...@oracle.com> > >>> Sent: Mittwoch, 31. Juli 2019 05:04 > >>> To: Jean Christophe Beyler <jcbey...@google.com> > >>> Cc: Baesken, Matthias <matthias.baes...@sap.com>; hotspot- > >>> d...@openjdk.java.net; serviceability-dev <serviceability- > >>> d...@openjdk.java.net> > >>> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java fails on > >> fast > >>> Linux machines with Total safepoint time 0 ms > >>> > >>> On 31/07/2019 9:08 am, Jean Christophe Beyler wrote: > >>>> FWIW, I would have done something like what David was suggesting, > just > >>>> slightly tweaked: > >>>> > >>>> public static long executeThreadDumps() { > >>>> long value; > >>>> long initial_value = mbean.getTotalSafepointTime(); > >>>> do { > >>>> Thread.getAllStackTraces(); > >>>> value = mbean.getTotalSafepointTime(); > >>>> } while (value == initial_value); > >>>> return value; > >>>> } > >>>> > >>>> This ensures that the value is a new value as opposed to the current > >>>> value and if something goes wrong, as David said, it will timeout; which > >>>> is ok. > >>> > >>> Works for me. > >>> > >>>> But I come back to not really understanding why we are doing this at > >>>> this point of relaxing (just get a new value of safepoint time). > >>>> Because, if we accept timeouts now as a failure here, then really the > >>>> whole test becomes: > >>>> > >>>> executeThreadDumps(); > >>>> executeThreadDumps(); > >>>> > >>>> Since the first call will return when value > 0 and the second call will > >>>> return when value2 > value (I still wonder why we want to ensure it > >>>> works twice...). > >>> > >>> The test is trying to sanity check that we are actually recording the > >>> time used by safepoints. So first check is that we can get a non-zero > >>> value; second check is we get a greater non-zero value. It's just a > >>> sanity test to try and catch if something gets unexpectedly broken in > >>> the time tracking code. > >>> > >>>> So both failures and even testing for it is kind of redundant, once you > >>>> have a do/while until a change? > >>> > >>> Yes - the problem with the tests that try to check internal VM behaviour > >>> is that we have no specified way to do something, in this case execute > >>> safepoints, that relates to internal VM behaviour, so we have to do > >>> something we know will currently work even if not specified to do so - > >>> e.g. dumping all thread stacks uses a global safepoint. The second > >>> problem is that the timer granularity is so coarse that we then have to > >>> guess how many times we need to do that something before seeing a > >>> change. To make the test robust we can keep doing stuff until we see a > >>> change and so the only way that will fail is if the overall timeout of > >>> the test kicks in. Or we can try and second guess how long it should > >>> take by introducing our own internal timeout - either directly or by > >>> limiting the number of loops in this case. That has its own problems and > >>> in general we have tried to reduce internal test timeouts (by removing > >>> them) and let overall timeouts take charge. > >>> > >>> No ideal solution. And this has already consumed way too much of > >>> everyone's time. > >>> > >>> Cheers, > >>> David > >>> > >>>> Thanks, > >>>> Jc > >>>> > >>>> > >>>> On Tue, Jul 30, 2019 at 2:35 PM David Holmes > <david.hol...@oracle.com > >>>> <mailto:david.hol...@oracle.com>> wrote: > >>>> > >>>> On 30/07/2019 10:39 pm, Baesken, Matthias wrote: > >>>> > Hi David, "put that whole code (the while loop) in a helper > >>>> method." was JC's idea, and I like the idea . > >>>> > >>>> Regardless I think the way you are using NUM_THREAD_DUMPS is > >> really > >>>> confusing. As an all-caps static you'd expect it to be a constant. > >>>> > >>>> Thanks, > >>>> David > >>>> > >>>> > Let's see what others think . > >>>> > > >>>> >> > >>>> >> Overall tests like this are not very useful, yet very fragile. > >>>> >> > >>>> > > >>>> > I am also fine with putting the test on the exclude list. > >>>> > > >>>> > Best regards, Matthias > >>>> > > >>>> > > >>>> >> -----Original Message----- > >>>> >> From: David Holmes <david.hol...@oracle.com > >>>> <mailto:david.hol...@oracle.com>> > >>>> >> Sent: Dienstag, 30. Juli 2019 14:12 > >>>> >> To: Baesken, Matthias <matthias.baes...@sap.com > >>>> <mailto:matthias.baes...@sap.com>>; Jean Christophe > >>>> >> Beyler <jcbey...@google.com <mailto:jcbey...@google.com>> > >>>> >> Cc: hotspot-...@openjdk.java.net > >>>> <mailto:hotspot-...@openjdk.java.net>; serviceability-dev > >>>> <serviceability- > >>>> >> d...@openjdk.java.net <mailto:d...@openjdk.java.net>> > >>>> >> Subject: Re: RFR: [XS] 8228658: test GetTotalSafepointTime.java > >>>> fails on fast > >>>> >> Linux machines with Total safepoint time 0 ms > >>>> >> > >>>> >> Hi Matthias, > >>>> >> > >>>> >> On 30/07/2019 9:25 pm, Baesken, Matthias wrote: > >>>> >>> Hello JC / David, here is a second webrev : > >>>> >>> > >>>> >>> http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.1/ > >>>> >>> > >>>> >>> It moves the thread dump execution into a method > >>>> >>> executeThreadDumps(long) , and also adds while loops > >>>> (but with a > >>>> >>> limitation for the number of thread dumps, really don’t > >>>> >>> want to cause timeouts etc.). I removed a check for > >>>> >>> MAX_VALUE_FOR_PASS because we cannot go over > >>> Long.MAX_VALUE . > >>>> >> > >>>> >> I don't think executeThreadDumps is worth factoring out like > >>>> out. > >>>> >> > >>>> >> The handling of NUM_THREAD_DUMPS is a bit confusing. I'd > rather > >> it > >>>> >> remains a constant 100, and then you set a simple loop iteration > >>>> count > >>>> >> limit. Further with the proposed code when you get here: > >>>> >> > >>>> >> 85 NUM_THREAD_DUMPS = NUM_THREAD_DUMPS * 2; > >>>> >> > >>>> >> you don't even know what value you may be starting with. > >>>> >> > >>>> >> But I was thinking of simply: > >>>> >> > >>>> >> long value = 0; > >>>> >> do { > >>>> >> Thread.getAllStackTraces(); > >>>> >> value = mbean.getTotalSafepointTime(); > >>>> >> } while (value == 0); > >>>> >> > >>>> >> We'd only hit a timeout if something is completely broken - > >>>> which is fine. > >>>> >> > >>>> >> Overall tests like this are not very useful, yet very fragile. > >>>> >> > >>>> >> Thanks, > >>>> >> David > >>>> >> > >>>> >>> Hope you like this version better. > >>>> >>> > >>>> >>> Best regards, Matthias > >>>> >>> > >>>> >>> *From:*Jean Christophe Beyler <jcbey...@google.com > >>>> <mailto:jcbey...@google.com>> > >>>> >>> *Sent:* Dienstag, 30. Juli 2019 05:39 > >>>> >>> *To:* David Holmes <david.hol...@oracle.com > >>>> <mailto:david.hol...@oracle.com>> > >>>> >>> *Cc:* Baesken, Matthias <matthias.baes...@sap.com > >>>> <mailto:matthias.baes...@sap.com>>; > >>>> >>> hotspot-...@openjdk.java.net > >>>> <mailto:hotspot-...@openjdk.java.net>; serviceability-dev > >>>> >>> <serviceability-dev@openjdk.java.net > >>>> <mailto:serviceability-dev@openjdk.java.net>> > >>>> >>> *Subject:* Re: RFR: [XS] 8228658: test > >>>> GetTotalSafepointTime.java fails > >>>> >>> on fast Linux machines with Total safepoint time 0 ms > >>>> >>> > >>>> >>> Hi Matthias, > >>>> >>> > >>>> >>> I wonder if you should not do what David is suggesting and then > >>>> put that > >>>> >>> whole code (the while loop) in a helper method. Below you > have a > >>>> >>> calculation again using value2 (which I wonder what the added > >>>> value of > >>>> >>> it is though) but anyway, that value2 could also be 0 at some > >>>> point, no? > >>>> >>> > >>>> >>> So would it not be best to just refactor the getAllStackTraces > and > >>>> >>> calculate safepoint time in a helper method for both value / > value2 > >>>> >>> variables? > >>>> >>> > >>>> >>> Thanks, > >>>> >>> > >>>> >>> Jc > >>>> >>> > >>>> >>> On Mon, Jul 29, 2019 at 7:50 PM David Holmes > >>>> <david.hol...@oracle.com <mailto:david.hol...@oracle.com> > >>>> >>> <mailto:david.hol...@oracle.com > >>>> <mailto:david.hol...@oracle.com>>> wrote: > >>>> >>> > >>>> >>> Hi Matthias, > >>>> >>> > >>>> >>> On 29/07/2019 8:20 pm, Baesken, Matthias wrote: > >>>> >>> > Hello , please review this small test fix . > >>>> >>> > > >>>> >>> > The test > >>>> >>> > >>>> >> > >>> > >> > test/jdk/sun/management/HotspotRuntimeMBean/GetTotalSafepointTime. > >>>> >> java > >>>> >>> fails sometimes on fast Linux machines with this error > >>>> message : > >>>> >>> > > >>>> >>> > java.lang.RuntimeException: Total safepoint time > >>>> illegal value: 0 > >>>> >>> ms (MIN = 1; MAX = 9223372036854775807) > >>>> >>> > > >>>> >>> > looks like the total safepoint time is too low > >>>> currently on these > >>>> >>> machines, it is < 1 ms. > >>>> >>> > > >>>> >>> > There might be several ways to handle this : > >>>> >>> > > >>>> >>> > * Change the test in a way that it might generate > >>>> nigher > >>>> >>> safepoint times > >>>> >>> > * Allow safepoint time == 0 ms > >>>> >>> > * Offer an additional interface that gives > >>>> safepoint times > >>>> >>> with finer granularity ( currently the HS has safepoint > >>>> time values > >>>> >>> in ns , see jdk/src/hotspot/share/runtime/safepoint.cpp > >>>> >>> SafepointTracing::end > >>>> >>> > > >>>> >>> > But it is converted on ms in this code > >>>> >>> > > >>>> >>> > 114jlong RuntimeService::safepoint_time_ms() { > >>>> >>> > 115 return UsePerfData ? > >>>> >>> > 116 > >>>> >>> > >>>> Management::ticks_to_ms(_safepoint_time_ticks->get_value()) : - > 1; > >>>> >>> > 117} > >>>> >>> > > >>>> >>> > 064jlong Management::ticks_to_ms(jlong ticks) { > >>>> >>> > 2065 assert(os::elapsed_frequency() > 0, "Must be > >>>> non-zero"); > >>>> >>> > 2066 return (jlong)(((double)ticks / > >>>> >>> (double)os::elapsed_frequency()) > >>>> >>> > 2067 * (double)1000.0); > >>>> >>> > 2068} > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> > Currently I go for the first attempt (and try to > >>>> generate > >>>> >>> higher safepoint times in my patch) . > >>>> >>> > >>>> >>> Yes that's probably best. Coarse-grained timing on very > >>>> fast machines > >>>> >>> was bound to eventually lead to problems. > >>>> >>> > >>>> >>> But perhaps a more future-proof approach is to just add a > >>>> do-while loop > >>>> >>> around the stack dumps and only exit when we have a non- > zero > >>>> >> safepoint > >>>> >>> time? > >>>> >>> > >>>> >>> Thanks, > >>>> >>> David > >>>> >>> ----- > >>>> >>> > >>>> >>> > Bug/webrev : > >>>> >>> > > >>>> >>> > https://bugs.openjdk.java.net/browse/JDK-8228658 > >>>> >>> > > >>>> >>> > > http://cr.openjdk.java.net/~mbaesken/webrevs/8228658.0/ > >>>> >>> > > >>>> >>> > > >>>> >>> > Thanks, Matthias > >>>> >>> > > >>>> >>> > >>>> >>> > >>>> >>> -- > >>>> >>> > >>>> >>> Thanks, > >>>> >>> > >>>> >>> Jc > >>>> >>> > >>>> > >>>> > >>>> > >>>> -- > >>>> > >>>> Thanks, > >>>> Jc