Hi JC,

I saw a webrev.12 in the directory, with only test changes (11->12), so I took
that version.
I had a look and ran the tests; they worked fine!

At first glance at the code (looking at the full v12), some minor comments below, mostly
about unused stuff.

Thanks, Robbin

diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
--- a/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16 17:42:42 2017 +0200
@@ -211,2 +211,3 @@
   void initialize(int max_storage) {
+    // Validate max_storage to a sane value? What would 0 mean?
     MutexLocker mu(HeapMonitor_lock);
@@ -227,8 +228,4 @@
   bool initialized() { return _initialized; }
-  volatile bool *initialized_address() { return &_initialized; }

  private:
-  // Protects the traces currently sampled (below).
-  volatile intptr_t _stack_storage_lock[1];
-
   // The traces currently sampled.
@@ -313,3 +310,2 @@
   _initialized(false) {
-    _stack_storage_lock[0] = 0;
 }
@@ -532,13 +528,2 @@

-// Delegate the initialization question to the underlying storage system.
-bool HeapMonitoring::initialized() {
-  return StackTraceStorage::storage()->initialized();
-}
-
-// Delegate the initialization question to the underlying storage system.
-bool *HeapMonitoring::initialized_address() {
-  return
-      const_cast<bool*>(StackTraceStorage::storage()->initialized_address());
-}
-
 void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
--- a/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16 17:42:42 2017 +0200
@@ -35,3 +35,2 @@
   static uint64_t _rnd;
-  static bool _initialized;
   static jint _monitoring_rate;
@@ -92,7 +91,2 @@

-  // Is the profiler initialized and where is the address to the initialized
-  // boolean.
-  static bool initialized();
-  static bool *initialized_address();
-
   // Called when o is to be sampled from a given thread and a given size.



On 10/10/2017 12:57 AM, JC Beyler wrote:
Dear all,

Thread-safety is back!! Here is the updated webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/

Full webrev is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/

In order to really test this, I needed to add it, so I thought now was a good time. It required a few changes to the creation path to ensure correctness and safety. We now keep the static pointer but clear the data internally, so re-initialization will be a bit more costly than before. I don't think this is a common use case, so I did not consider it a problem. I used the internal MutexLocker; I think I used it correctly, but let me know.

I also added three tests:

1) Stack depth test:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch

This test shows that the maximum stack depth system is working.

2) Thread safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch

The test creates 24 threads and they all allocate at the same time. The test
then checks that it finds samples from all of the threads (a minimal sketch of the
allocation pattern is below).

3) Thread on/off safety
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch

The test creates 24 threads that all allocate a bunch of memory. Then another 
thread turns the sampling on/off.

Btw, both tests 2 & 3 failed without the locks.
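
For reference, a minimal sketch of the allocation pattern these thread tests exercise (the class name, thread body, and all counts other than the 24 threads are made up for illustration; this is not the actual test code):

import java.util.ArrayList;
import java.util.List;

public class ThreadedAllocationSketch {
  private static final int NUM_THREADS = 24;

  public static void main(String[] args) throws InterruptedException {
    List<Thread> threads = new ArrayList<>();
    for (int i = 0; i < NUM_THREADS; i++) {
      threads.add(new Thread(() -> {
        long sum = 0;
        // Each thread allocates a bunch of arrays, so the sampler should end up
        // with samples attributed to every one of the 24 threads.
        for (int j = 0; j < 100_000; j++) {
          int[] tmp = new int[1024];
          sum += tmp.length;  // keep the allocations observable
        }
        if (sum == 0) throw new AssertionError("unexpected");
      }));
    }
    for (Thread t : threads) { t.start(); }
    for (Thread t : threads) { t.join(); }
  }
}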

As I worked on this, I saw a lot of places where the tests do very similar things, so I'm going to clean up the code a bit and add a HeapAllocator class that all tests can call directly. This will greatly simplify the code.
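
A possible shape for such a shared helper (the HeapAllocator name comes from the plan above; the method name and parameters are hypothetical):

public final class HeapAllocator {
  private HeapAllocator() {}

  // Allocates 'count' int arrays of 'size' elements (size must be >= 1) and
  // returns a checksum so the JIT cannot optimize the allocations away.
  public static long allocate(int count, int size) {
    long sum = 0;
    for (int i = 0; i < count; i++) {
      int[] tmp = new int[size];
      tmp[0] = i;
      sum += tmp[0];
    }
    return sum;
  }
}

Each test could then just call HeapAllocator.allocate(...) and keep its own code focused on checking the samples it gets back.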

Thanks for any comments/criticisms!
Jc


On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <jcbey...@google.com> wrote:

    Dear all,

    Small update to the webrev:
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/

    Full webrev is here:
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/

    I updated a bit of the naming, removed a TODO comment, and added a test for
    the sampling rate. I also updated the maximum stack depth to 1024; there is
    no reason to keep it so small. I ran a micro-benchmark that tests the
    overhead, and it seems roughly the same.

    I compared allocations from a stack depth of 10 and allocations from a 
stack depth of 1024 (allocations are from the same helper method in
    
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
               - For an array of 1 integer allocated in a loop; stack depth 
1024 vs stack depth 10: 1% slower
               - For an array of 200k integers allocated in a loop; stack depth 
1024 vs stack depth 10: 3% slower

    So basically the maximum stack depth is now 1024, but we only copy over the
    stack frames actually used.

    For the next webrev, I will be adding a stack depth test to show that it
    works, and probably put back the mutex locking so that we can see how
    difficult it is to keep it thread safe.

    Let me know what you think!
    Jc



    On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <jcbey...@google.com> wrote:

        Forgot to say that for my numbers:
          - The actual numbers I got for the various array sizes are not in the
        test. I ran the program 30 times and parsed the output; here are the
        averages and standard deviations:
               1000:     1.28% average; 1.13% standard deviation
               10000:    1.59% average; 1.25% standard deviation
               100000:   1.26% average; 1.26% standard deviation

        The 1000/10000/100000 are the sizes of the arrays being allocated. 
These are allocated 100k times and the sampling rate is 111 times the size of 
the array.
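
        (A back-of-the-envelope check, assuming "111 times the size of the array"
        means the sampling interval is 111x the array's byte size: each
        configuration should then yield roughly the same expected number of
        samples, independent of the element count,

        \mathbb{E}[\text{samples}] \approx \frac{100000 \cdot s}{111 \cdot s} = \frac{100000}{111} \approx 900

        per run.)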

        Thanks!
        Jc


        On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <jcbey...@google.com> wrote:

            Hi all,

            After a bit of a break, I am back working on this :). As before, 
here are two webrevs:

            - Full change set: http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/
            - Compared to version 8: http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/
                 (This version is compared to version 8 I last showed but 
ported to the new folder hierarchy)

            In this version I have:
               - Handled Thomas' comments from his email of 07/03:
                    - Merged the logging to be standard
                    - Fixed up the code a bit where asked
                    - Added some notes about the code not being thread-safe yet
                - Removed additional dead code from the version that modifies 
interpreter/c1/c2
                - Fixed compiler issues so that it compiles with
                  --disable-precompiled-headers
                     - Tested with ./configure --with-boot-jdk=<jdk8> 
--with-debug-level=slowdebug --disable-precompiled-headers

            Additionally, I added a test to check the sanity of the sampler: 
HeapMonitorStatCorrectnessTest
            
(http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch)
                - This allocates a number of arrays and checks that we obtain
                  the number of samples we want, with an accepted error of 5%. I
                  tested it 100 times and it passed every time; I can test more
                  if wanted.
                - The actual numbers I got for the various array sizes are not
                  in the test. I ran the program 30 times and parsed the output;
                  here are the averages and standard deviations:
                   1000:     1.28% average; 1.13% standard deviation
                   10000:    1.59% average; 1.25% standard deviation
                   100000:   1.26% average; 1.26% standard deviation

            What this means is that we were always within about 1~2% of the
            number of samples the test expected.

            Let me know what you think,
            Jc

            On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <jcbey...@google.com> wrote:

                Hi all,

                I apologize, I have not yet handled your remarks, but I thought
                this new webrev might also be useful to see and comment on.

                Here is the latest webrev; it is generated slightly differently
                than the others since I'm now using webrev.ksh without the -N option:
                http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/

                And the webrev.07 to webrev.08 diff is here:
                http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/

                (Let me know if it works well)

                It's a small change between versions, but it:
                   - provides a fix that makes the average sample rate correct
                     (more on that below)
                   - fixes the code to actually play nicely with the fast TLAB refill
                   - cleans up the JVMTI text a bit and now uses jvmtiFrameInfo
                   - moves the capability to be onload solo

                With this webrev, I've done a small study of the random number 
generator we use here for the sampling rate. I took a small program and it can 
be simplified to:

                for (int i = 0; i < outerLoop; i++) {
                  for (int j = 0; j < innerLoop; j++) {
                    int[] tmp = new int[arraySize];
                  }
                }

                - I've fixed the outer and inner loops to 800 for this
                  experiment, meaning we allocate an array of a given size 640000 times.

                - Each program provides the average sample size used for the 
whole execution

                - Then I ran each variation 30 times and calculated the average
                  of the average sample size used for the various array sizes. I
                  selected the array size to be one of the following: 1, 10, 100, 1000.

                - When compared to 512kb, the error of the average sample size over 30 runs was:
                1: 4.62% of error
                10: 3.09% of error
                100: 0.36% of error
                1000: 0.1% of error
                10000: 0.03% of error

                What it shows is that, depending on the number of samples, the
                average does become better. This is because, with an allocation
                of 1 element per array, it takes longer to hit one of the
                sampling thresholds. This is seen by looking at the sample count
                statistic I put in. For the same number of iterations (800 *
                800), the different array sizes produce:
                1: 62 samples
                10: 125 samples
                100: 788 samples
                1000: 6166 samples
                10000: 57721 samples

                And of course, the more samples you have, the more sampling
                intervals you draw, which means that your average gets closer to
                the expected value.
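
                (In standard-error terms: if a run draws n sampling intervals
                roughly independently, the spread of their average shrinks about as

                \sigma_{\bar{X}} \approx \frac{\sigma}{\sqrt{n}}

                so the configurations with more samples per run are expected to
                land closer to the 512kb target, which matches the trend above.)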

                Thanks,
                Jc

                On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <jcbey...@google.com> wrote:

                    Thanks Robbin,

                    This seems to have worked. When I have the next webrev
                    ready, we will find out, but I'm fairly confident it will work!

                    Thanks again!
                    Jc

                    On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <robbin....@oracle.com> wrote:

                        Hi JC,

                        On 06/29/2017 12:15 AM, JC Beyler wrote:

                            B) Incremental changes


                        I guess the most common workflow here is using mq:
                        hg qnew fix_v1
                        edit files
                        hg qrefresh
                        hg qnew fix_v2
                        edit files
                        hg qrefresh

                        If you do hg log, you will see 2 commits.

                        webrev.ksh -r -2 -o my_inc_v1_v2
                        webrev.ksh -o my_full_v2


                        In your .hgrc you might need:
                        [extensions]
                        mq =

                        /Robbin


                            Again, another newbie question here...

                            For showing the incremental changes, is there a 
link that explains how to do that? I apologize for my newbie questions all the 
time :)

                            Right now, I do:

                               ksh ../webrev.ksh -m -N

                            That generates a webrev.zip, which I send to Chuck
                            Rasbold. He then uploads it to a new webrev.

                            I tried committing my change and then adding a small
                            change. Then, if I just do ksh ../webrev.ksh without
                            any options, it seems to produce a similar page, but
                            now with only the changes I made (so the 06-07
                            comparison you were talking about) and a changeset
                            that has it all. I imagine that is what you meant.

                            Which means that my workflow would become:

                            1) Make changes
                            2) Make a webrev without any options to show just 
the differences with the tip
                            3) Amend my changes to my local commit so that I 
have it done with
                            4) Go to 1

                            Does that seem correct to you?

                            Note that when I do this, I only see the full 
change of a file in the full change set (Side note here: now the page says 
change set and not
                            patch, which is maybe why Serguei was having 
issues?).

                            Thanks!
                            Jc



                            On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <robbin....@oracle.com> wrote:

                                 Hi,

                                 On 06/28/2017 12:04 AM, JC Beyler wrote:

                                     Dear Thomas et al,

                                     Here is the newest webrev:
                            http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/



                                 You have some more bits in there, but generally this looks good, and it
                                 is really nice with more tests.
                                 I'll do a deep dive and re-test this when I get back from my long
                                 vacation, with whatever patch version you have then.

                                 Also, I think it's time you provide incremental (v06->v07) changes as
                                 well as complete changesets.

                                 Thanks, Robbin




                                     Thomas, I "think" I have answered all your 
remarks. The summary is:

                                     - The statistics system is up and provides insight into what the heap
                                       sampler is doing
                                          - I've noticed that, though the sampling rate is at the right
                                            mean, we are missing some samples; I have not yet tracked down
                                            why (details below)

                                     - I've run a tiny benchmark that is the worst case: a very tight loop
                                       that allocates a small array
                                          - In this case, I see no overhead when the system is off, so that
                                            is a good start :)
                                          - Right now I see a high overhead in this case when sampling is
                                            on. This is not really too surprising, but I'm going to see if
                                            it is consistent with our internal implementation. The benchmark
                                            is very allocation-heavy, so I'm not too surprised, but I want
                                            to do the due diligence.

                                        - The statistics system is up and I have a new test
                            
                            http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
                                           - I did a bit of a study of the random generator here; more
                                             details are below, but basically it seems to work well

                                        - I added a capability but since this 
is the first time doing this, I was not sure I did it right
                                          - I did add a test for it though, and the test seems to do what I
                                            expect (all methods fail with the
                                            JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
                                              -
                            
                            http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch

                                        - I still need to figure out what to do 
about the multi-agent vs single-agent issue

                                        - As far as measurements, it seems I 
still need to look at:
                                          - Why we do the 20 random calls 
first, are they necessary?
                                          - Look at the mean of the sampling 
rate that the random generator does and also what is actually sampled
                                          - What is the overhead in terms of 
memory/performance when on?

                                     I have inlined my answers; I think I got them all in the new webrev.
                                     Let me know your thoughts.

                                     Thanks again!
                                     Jc


                                     On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <thomas.scha...@oracle.com> wrote:

                                          Hi,

                                          On Wed, 2017-06-21 at 13:45 -0700, JC 
Beyler wrote:
                                          > Hi all,
                                          >
                                          > First off: Thanks again to Robbin 
and Thomas for their reviews :)
                                          >
                                          > Next, I've uploaded a new webrev:
                                          > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

                                          >
                                          > Here is an update:
                                          >
                                          > - @Robbin, I forgot to say that yes 
I need to look at implementing
                                          > this for the other architectures 
and testing it before it is all
                                          > ready to go. Is it common to have 
it working on all possible
                                          > combinations or is there a subset 
that I should be doing first and we
                                          > can do the others later?
                                          > - I've tested slowdebug, built and 
ran the JTreg tests I wrote with
                                          > slowdebug and fixed a few more 
issues
                                          > - I've refactored a bit of the code 
following Thomas' comments
                                          >    - I think I've handled all the 
comments from Thomas (I put
                                          > comments inline below for the 
specifics)

                                          Thanks for handling all those.

                                          > - Following Thomas' comments on 
statistics, I want to add some
                                          > quality assurance tests and find 
that the easiest way would be to
                                          > have a few counters of what is 
happening in the sampler and expose
                                          > that to the user.
                                          >    - I'll be adding that in the 
next version if no one sees any
                                          > objections to that.
                                          >    - This will allow me to add a 
sanity test in JTreg about number of
                                          > samples and average of sampling rate
                                          >
                                          > @Thomas: I had a few questions that 
I inlined below but I will
                                          > summarize the "bigger ones" here:
                                          >    - You mentioned constants are 
not using the right conventions, I
                                          > looked around and didn't see any 
convention except normal naming then
                                          > for static constants. Is that right?

                                          I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
                                          and the rule is to "follow an
existing pattern and must have a
                                          distinct appearance from other 
names". Which does not help a lot I
                                          guess :/ The GC team started using 
upper camel case, e.g.
                                          SomeOtherConstant, but very likely 
this is probably not applied
                                          consistently throughout. So not adding another style (like
                                          kMaxStackDepth with the "k" in front with some unknown meaning)
                                          is fine.

                                          (Chances are you will find that style 
somewhere used anyway too,
                                          apologies if so :/)


                                     Thanks for that link, now I know where to 
look. I used the upper camel case in my code as well then :) I should have 
gotten them all.


                                           > PS: I've also inlined my answers 
to Thomas below:
                                           >
                                           > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
                                           > > Hi all,
                                           > >
                                           > > On Mon, 2017-06-12 at 11:11 
-0700, JC Beyler wrote:
                                           > > > Dear all,
                                           > > >
                                           > > > I've continued working on this 
and have done the following
                                           > > webrev:
                                           > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

                                           > >
                                           > > [...]
                                           > > > Things I still need to do:
                                           > > >    - Have to fix that TLAB 
case for the FastTLABRefill
                                           > > >    - Have to start looking at 
the data to see that it is
                                           > > consistent and does gather the 
right samples, right frequency, etc.
                                           > > >    - Have to check the GC 
elements and what that produces
                                           > > >    - Run a slowdebug run and 
ensure I fixed all those issues you
                                           > > saw > Robbin
                                           > > >
                                           > > > Thanks for looking at the 
webrev and have a great week!
                                           > >
                                           > >   scratching a bit on the 
surface of this change, so apologies for
                                           > > rather shallow comments:
                                           > >
                                           > > - macroAssembler_x86.cpp:5604: 
while this is compiler code, and I
                                           > > am not sure this is final, 
please avoid littering the code with
                                           > > TODO remarks :) They tend to be 
candidates for later wtf moments
                                           > > only.
                                           > >
                                           > > Just file a CR for that.
                                           > >
                                           > Newcomer question: what is a CR 
and not sure I have the rights to do
                                           > that yet ? :)

                                          Apologies. CR is a change request; this means filing a bug in the
                                          bug tracker. And you are right, you can't just create a new
                                          account in the OpenJDK JIRA yourself. :(


                                     Ok good to know, I'll continue with my own 
todo list but I'll work hard on not letting it slip in the webrevs anymore :)


                                          I was mostly referring to the "... but it 
is a TODO" part of that
                                          comment in macroassembler_x86.cpp. 
Comments about the why of the code
                                          are appreciated.

                                          [Note that I now understand that this 
is to some degree still work in
                                          progress. As long as the final 
changeset does not contain TODO's I am
                                          fine (and it's not a hard objection, rather 
their use in "final" code
                                          is typically limited in my 
experience)]

                                          5603   // Currently, if this happens, 
just set back the actual end to
                                          where it was.
                                          5604   // We miss a chance to sample 
here.

                                          It would be okay, but explaining the "this" and the "why" of
                                          missing a chance to sample here would be best.

                                          Like maybe:

                                          // If we needed to refill TLABs, just 
set the actual end point to
                                          // the end of the TLAB again. We do 
not sample here although we could.

                                     Done with your comment, it works well in 
my mind.

                                          I am not sure whether "miss a chance to 
sample" meant "we could, but
                                          consciously don't because it's not that 
useful" or "it would be
                                          necessary but don't because it's too 
complicated to do.".

                                          Looking at the original comment once 
more, I am also not sure if that
                                          comment shouldn't referring to the 
"end" variable (not actual_end)
                                          because that's the variable that is 
responsible for taking the sampling
                                          path? (Going from the member 
description of ThreadLocalAllocBuffer).


                                     I've moved this code and it no longer 
shows up here but the rationale and answer was:

                                     So.. Yes, end is the variable provoking 
the sampling. Actual end is the actual end of the TLAB.

                                     What was happening here is that the code 
is resetting _end to point towards the end of the new TLAB. Because, we now 
have the end for
                            sampling and _actual_end for
                                     the actual end, we need to update the 
actual_end as well.

                                     Normally, were we to do the real work 
here, we would calculate the (end - start) offset, then do:

                                     - Set the new end to: start + (old_end - old_start)
                                     - Set the actual end like we do here now, because it is the actual end.

                                     Why is this not done here now anymore?
                                         - I was still debating which path to 
take:
                                            - Do it in the fast refill code, it 
has its perks:
                                                - In a world where fast refills 
are happening all the time or a lot, we can augment there the code to do the 
sampling
                                            - Remember what we had as an end 
before leaving the slowpath and check on return
                                                - This is what I'm doing now, 
it removes the need to go fix up all fast refill paths but if you remain in 
fast refill paths,
                            you won't get sampling. I
                                     have to think of the consequences of that, 
maybe a future change later on?
                                                   - I have the statistics now 
so I'm going to study that
                                                      -> By the way, though my 
statistics are showing I'm missing some samples, if I turn off FastTlabRefill, it 
is the same
                            loss so for now, it seems
                                     this does not occur in my simple test.



                                          But maybe I am only confused and it's 
best to just leave the comment
                                          away. :)

                                          Thinking about it some more, doesn't 
this not-sampling in this case
                                          mean that sampling does not work in 
any collector that does inline TLAB
                                          allocation at the moment? (Or is 
inline TLAB alloc automatically
                                          disabled with sampling somehow?)

                                          That would indeed be a bigger TODO 
then :)


                                     Agreed, this remark made me think that 
perhaps as a first step the new way of doing it is better but I did have to:
                                        - Remove the const of the 
ThreadLocalBuffer remaining and hard_end methods
                                        - Move hard_end out of the header file 
to have a bit more logic there

                                     Please let me know what you think of that 
and if you prefer it this way or changing the fast refills. (I prefer this way 
now because it
                            is more incremental).


                                          > > - calling 
HeapMonitoring::do_weak_oops() (which should probably be
                                          > > called weak_oops_do() like other 
similar methods) only if string
                                          > > deduplication is enabled (in 
g1CollectedHeap.cpp:4511) seems wrong.
                                          >
                                          > The call should be at least around 
6 lines up outside the if.
                                          >
                                          > Preferentially in a method like 
process_weak_jni_handles(), including
                                          > additional logging. (No new (G1) gc 
phase without minimal logging
                                          > :)).
                                          > Done but really not sure because:
                                          >
                                          > I put for logging:
                                          >   log_develop_trace(gc, 
freelist)("G1ConcRegionFreeing [other] : heap
                                          > monitoring");

                                          I would think that "gc, ref" would be 
more appropriate log tags for
                                          this similar to jni handles.
                                          (I am also not sure what weak 
reference handling has to do with
                                          G1ConcRegionFreeing, so I am a bit 
puzzled)


                                     I was not sure what to put for the tags or 
really as the message. I cleaned it up a bit now to:
                                          log_develop_trace(gc, ref)("HeapSampling 
[other] : heap monitoring processing");



                                          > Since weak_jni_handles didn't have 
logging for me to be inspired
                                          > from, I did that but unconvinced 
this is what should be done.

                                          The JNI handle processing does have 
logging, but only in
                                          
ReferenceProcessor::process_discovered_references(). In
                                          process_weak_jni_handles() only 
overall time is measured (in a G1
                                          specific way, since only G1 supports 
disabling reference processing) :/

                                          The code in ReferenceProcessor prints 
both time taken
                                          referenceProcessor.cpp:254, as well 
as the count, but strangely only in
                                          debug VMs.

                                          I have no idea why this logging is 
that unimportant to only print that
                                          in a debug VM. However there are 
reviews out for changing this area a
                                          bit, so it might be useful to wait 
for that (JDK-8173335).


                                     I cleaned it up a bit anyway and now it 
returns the count of objects that are in the system.


                                          > > - the change doubles the size of
                                          > > CollectedHeap::allocate_from_tlab_slow() 
above the "small and nice"
                                          > > threshold. Maybe it could be 
refactored a bit.
                                          > Done I think, it looks better to me 
:).

                                          In 
ThreadLocalAllocBuffer::handle_sample() I think the
                                          
set_back_actual_end()/pick_next_sample() calls could be hoisted out of
                                          the "if" :)


                                     Done!


                                          > > - referenceProcessor.cpp:261: the 
change should add logging about
                                          > > the number of references 
encountered, maybe after the corresponding
                                          > > "JNI weak reference count" log 
message.
                                          > Just to double check, are you 
saying that you'd like to have the heap
                                          > sampler to keep in store how many 
sampled objects were encountered in
                                          > the HeapMonitoring::weak_oops_do?
                                          >    - Would a return of the method 
with the number of handled
                                          > references and logging that work?

                                          Yes, it's fine if 
HeapMonitoring::weak_oops_do() only returned the
                                          number of processed weak oops.


                                     Done also (but I admit I have not tested 
the output yet) :)


                                          >    - Additionally, would you prefer 
it in a separate block with its
                                          > GCTraceTime?

                                          Yes. Both kinds of information is 
interesting: while the time taken is
                                          typically more important, the next 
question would be why, and the
                                          number of references typically goes a 
long way there.

                                          See above though, it is probably best 
to wait a bit.


                                     Agreed that I "could" wait but, if it's 
ok, I'll just refactor/remove this when we get closer to something final. Either, 
JDK-8173335
                                     has gone in and I will notice it now or it 
will soon and I can change it then.


                                          > > - threadLocalAllocBuffer.cpp:331: one more 
"TODO"
                                          > Removed it and added it to my 
personal todos to look at.
                                          >      > >
                                          > > - threadLocalAllocBuffer.hpp: 
ThreadLocalAllocBuffer class
                                          > > documentation should be updated 
about the sampling additions. I
                                          > > would have no clue what the difference 
between "actual_end" and
                                          > > "end" would be from the given 
information.
                                          > If you are talking about the 
comments in this file, I made them more
                                          > clear I hope in the new webrev. If 
it was somewhere else, let me know
                                          > where to change.

                                          Thanks, that's much better. Maybe a 
note in the comment of the class
                                          that ThreadLocalBuffer provides some 
sampling facility by modifying the
                                          end() of the TLAB to cause "frequent" 
calls into the runtime call where
                                          actual sampling takes place.


                                     Done, I think it's better now. Added 
something about the slow_path_end as well.


                                          > > - in heapMonitoring.hpp: there 
are some random comments about some
                                          > > code that has been grabbed from 
"util/math/fastmath.[h|cc]". I
                                          > > can't tell whether this is code 
that can be used but I assume that
                                          > > Noam Shazeer is okay with that 
(i.e. that's all Google code).
                                          > Jeremy and I double checked and we 
can release that as I thought. I
                                          > removed the comment from that piece 
of code entirely.

                                          Thanks.

                                          > > - heapMonitoring.hpp/cpp static 
constant naming does not correspond
                                          > > to Hotspot's. Additionally, in 
Hotspot static methods are cased
                                          > > like other methods.
                                          > I think I fixed the methods to be 
cased the same way as all other
                                          > methods. For static constants, I 
was not sure. I fixed a few other
                                          > variables but I could not seem to 
really see a consistent trend for
                                          > constants. I made them as variables 
but I'm not sure now.

                                          Sorry again, style is a kind of mess. 
The goal of my suggestions here
                                          is only to prevent yet another style 
creeping in.

                                          > > - in heapMonitoring.cpp there are 
a few cryptic comments at the top
                                          > > that seem to refer to internal 
stuff that should probably be
                                          > > removed.
                                          > Sorry about that! My personal todos 
not cleared out.

                                          I am happy about comments, but I simply did not understand any of
                                          that, and I do not know whether other readers would either.

                                          It is fine if you just remember to remove/update them before the
                                          review proper (I misunderstood the review situation a little, it
                                          seems).

                                          > > I did not think through the 
impact of the TLAB changes on collector
                                          > > behavior yet (if there are). Also 
I did not check for problems with
                                          > > concurrent mark and SATB/G1 (if 
there are).
                                          > I would love to know your thoughts 
on this, I think this is fine. I

                                          I think so too now. No objects are 
made live out of thin air :)

                                          > see issues with multiple threads 
right now hitting the stack storage
                                          > instance. Previous webrevs had a 
mutex lock here but we took it out
                                          > for simplicity (and only for now).

                                          :) When looking at this after some 
thinking I now assume for this
                                          review that this code is not MT safe 
at all. There seems to be more
                                          synchronization missing than just the 
one for the StackTraceStorage. So
                                          no comments about this here.


                                     I double-checked a bit (quickly, I admit) 
but it seems that synchronization in StackTraceStorage is really all you need 
(all methods
                            lead to a StackTraceStorage one
                                     and can be multithreaded outside of that).
                                     There is a question about the 
initialization where the method HeapMonitoring::initialize_profiling is not 
thread safe.
                                     It would work (famous last words) and not 
crash if there was a race but we could add a synchronization point there as 
well (and
                            therefore on the stop as well).

                                     But anyway I will really check and do this 
once we add back synchronization.


                                          Also, this would require some kind of 
specification of what is allowed
                                          to be called when and where.


                                     Would we specify this with the methods in 
the jvmti.xml file? We could start by specifying in each that they are not 
thread safe but I
                            saw no mention of that for
                                     other methods.


                                          One potentially relevant observation 
about locking here: depending on
                                          sampling frequency, 
StackTraceStore::add_trace() may be rather
                                          frequently called. I assume that you 
are going to do measurements :)


                                     Though we don't have the TLAB implementation in our code, the
                                     compiler-generated sampler shows 2% overhead with a 512k sampling rate.
                            I can do real measurements
                                     when the code settles and we can see how 
costly this is as a TLAB implementation.
                                     However, my theory is that if the rate is 
512k, the memory/performance overhead should be minimal since it is what we saw 
with our
                            code/workloads (though not called
                                     the same way, we call it essentially at 
the same rate).
                                     If you have a benchmark you'd like me to 
test, let me know!

                                     Right now, with my really small test, this 
does use a bit of overhead even for a 512k sample size. I don't know yet why, 
I'm going to
                            see what is going on.

                                     Finally, I think it is not reasonable to 
suppose the overhead to be negligible if the sampling rate used is too low. The 
user should
                            know that the lower the rate,
                                     the higher the overhead (documentation 
TODO?).


                                          I am not sure what the expected usage 
of the API is, but
                                          StackTraceStore::add_trace() seems to 
be able to grow without bounds.
                                          Only a GC truncates them to the live 
ones. That in itself seems to be
                                          problematic (GCs can be *wide* 
apart), and of course some of the API
                                          methods add to that because they 
duplicate that unbounded array. Do you
                                          have any concerns/measurements about 
this?


                                     So, the theory is that yes, add_trace can grow without bounds, but it
                                     grows at one sample per 512k of allocated space (so, for example, 1 GB
                                     of allocation churn adds on the order of 2000 samples). The stacks it
                                     gathers are currently maxed at 64 (I'd like to turn that into a user
                                     option at some point). So I have no concerns because:

                                     - If really this is taking a lot of space, 
that means the job is keeping a lot of objects in memory as well, therefore the 
entire heap
                            is getting huge
                                     - If this is the case, you will be 
triggering a GC at some point anyway.

                                     (I'm putting under the rug the issue of "What 
if we set the rate to 1 for example" because as you lower the sampling rate, we 
cannot
                            guarantee low overhead; the
                                     idea behind this feature is to have a 
means of having meaningful allocated samples at a low overhead)

                                     I have no measurements really right now 
but since I now have some statistics I can poll, I will look a bit more at this 
question.

                                     I have the same last sentence as above: 
the user should expect this to happen if the sampling rate is too small. That 
probably can be
                            reflected in the
                                     StartHeapSampling as a note : careful this 
might impact your performance.


                                          Also, these stack traces might hold 
on to huge arrays. Any
                                          consideration of that? Particularly 
it might be the cause for OOMEs in
                                          tight memory situations.


                                     There is a stack size maximum that is set 
to 64 so it should not hold huge arrays. I don't think this is an issue but I 
can double
                            check with a test or two.


                                          - please consider adding a safepoint 
check in
                                          HeapMonitoring::weak_oops_do to 
prevent accidental misuse.

                                          - in struct StackTraceStorage, the 
public fields may also need
                                          underscores. At least some files in 
the runtime directory have structs
                                          with underscored public members (and 
some don't). The runtime team
                                          should probably comment on that.


                                     Agreed I did not know. I looked around and 
a lot of structs did not have them it seemed so I left it as is. I will happily 
change it if
                            someone prefers (I was not
                                     sure if you really preferred or not, your 
sentence seemed to be more a note of "this might need to change but I don't 
know if the
                            runtime team enforces that", let
                                     me know if I read that wrongly).


                                          - In 
StackTraceStorage::weak_oops_do(), when examining the
                                          StackTraceData, maybe it is useful to 
consider having a non-NULL
                                          reference outside of the heap's 
reserved space an error. There should
                                          be no oop outside of the heap's 
reserved space ever.

                                          Unless you allow storing random 
values in StackTraceData::obj, which I
                                          would not encourage.


                                     I suppose you are talking about this part:
                                     if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
                                         (is_alive == NULL || is_alive->do_object_b(value))) {

                                     What you are saying is that I could have 
something like:
                                     if (value != my_non_null_reference &&
                                         (is_alive == NULL || is_alive->do_object_b(value))) {

                                     Is that what you meant? Is there really a 
reason to do so? When I look at the code, is_in_reserved seems like an O(1) 
method call. I'm
                            not even sure we can have a
                                     NULL value to be honest. I might have to 
study that to see if this was not a paranoid test to begin with.

                                     The is_alive code has now morphed due to 
the comment below.



                                          - HeapMonitoring::weak_oops_do() does 
not seem to use the
                                          passed AbstractRefProcTaskExecutor.


                                     It did use it:
                                        size_t HeapMonitoring::weak_oops_do(
                                            AbstractRefProcTaskExecutor *task_executor,
                                            BoolObjectClosure* is_alive,
                                            OopClosure *f,
                                            VoidClosure *complete_gc) {
                                          assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");

                                          if (task_executor != NULL) {
                                            task_executor->set_single_threaded_mode();
                                          }
                                          return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
                                        }

                                     But due to the comment below, I refactored 
this, so this is no longer here. Now I have an always true closure that is 
passed.


                                          - I do not understand allowing to 
call this method with a NULL
                                          complete_gc closure. This would mean 
that objects referenced from the
                                          object that is referenced by the 
StackTraceData are not pulled, meaning
                                          they would get stale.

                                          - same with is_alive parameter value 
of NULL


                                     So these questions made me look a bit closer at this code. I think
                                     this code was written this way to have a very small impact on the
                                     file, but you are right, there is no reason for it here. I've
                                     simplified the code by adding a process_HeapSampling method in
                                     referenceProcessor.cpp that handles everything there.

                                     The code allowed NULLs because it depended 
on where you were coming from and how the code was being called.

                                     - I added a static always_true variable 
and pass that now to be more consistent with the rest of the code.
                                     - I moved the complete_gc into 
process_phaseHeapSampling now (new method) and handle the task_executor and the 
complete_gc there
                                          - Newbie question: in our code we did 
a set_single_threaded_mode but I see that process_phaseJNI does it right before 
its call, do
                            I need to do it for the
                                     process_phaseHeapSample?
                                     That API is much cleaner (in my mind) and 
is consistent with what is done around it (again in my mind).


                                          - heapMonitoring.cpp:590: I do not 
completely understand the purpose of
                                          this code: in the end this results in 
a fixed value directly dependent
                                          on the Thread address anyway?
                                          IOW, what is special about exactly 20 
rounds?


                                     So we really want a fast random number generator that has a specific
                                     mean (512k is the default we use). The code uses the thread address as
                                     the start number of the sequence (the rationale being: why not, it is
                                     random enough). Then, instead of just starting there, we prime the
                                     sequence and really only start at the 21st number; this is arbitrary
                                     and I have not done a study to see whether we could do more or less of
                                     that.

                                     As I have the statistics of the system up and running, I'll run some
                                     experiments to see whether this priming is needed and whether 20 is a
                                     good number.
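
                                     As an illustration only (this is the textbook approach, not
                                     necessarily the exact generator in the webrev): to get sampling
                                     intervals with a chosen mean, one can draw them from an exponential
                                     distribution, and "priming" then just means discarding the first few
                                     outputs after seeding:

                                     import java.util.SplittableRandom;

                                     final class SamplingIntervalSketch {
                                       // 512k default mean, as described above.
                                       static final long MEAN_SAMPLE_BYTES = 512 * 1024;

                                       // Exponential draw whose long-run average is MEAN_SAMPLE_BYTES.
                                       static long nextInterval(SplittableRandom rng) {
                                         return (long) (-Math.log(1.0 - rng.nextDouble()) * MEAN_SAMPLE_BYTES);
                                       }

                                       public static void main(String[] args) {
                                         // Stand-in seed; the webrev is described as seeding from the thread address.
                                         SplittableRandom rng = new SplittableRandom(42);
                                         for (int i = 0; i < 20; i++) {
                                           rng.nextDouble();  // "prime" the sequence: skip the first 20 draws
                                         }
                                         System.out.println(nextInterval(rng));
                                       }
                                     }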


                                          - also I would consider stripping a 
few bits of the threads' address as
                                          initialization value for your rng. 
The last three bits (and probably
                                          more, check whether the Thread object 
is allocated on special
                                          boundaries) are always zero for them.
                                          Not sure if the given "random" value 
is random enough before/after,
                                          this method, so just skip that 
comment if you think this is not
                                          required.


                                     The honest answer is that I don't know. I think what is important is
                                     that we tend towards a mean and that it is random "enough" not to fall
                                     into the pitfall of only sampling a subset of objects due to their
                                     allocation order. I added that as a test to do, to see if it changes
                                     the mean in any way for the 512k default value and/or if the first
                                     1000 elements look better.


                                          Some more random nits I did not find 
a place to put anywhere:

                                          - 
ThreadLocalAllocBuffer::_extra_space does not seem to be used
                                          anywhere?


                                     Good catch :).


                                          - Maybe indent the declaration of 
ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of 
that group.


                                     Done; I moved it up a bit to have the non-static members together and
                                     the static ones separate.

                                          Thanks,
                                             Thomas


                                     Thanks for your review!
                                     Jc







