Dear all,

Thread-safety is back!!

Here is the incremental webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/

Full webrev is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/

In order to really test the thread safety, I needed to add tests for it, so this seemed like a good time to bring it back. Doing so required a few changes to the creation path to ensure correctness and safety: we now keep the static pointer but clear the data internally, so re-initialization is a bit more costly than before. I don't think re-initialization is a common use case, so that should not be a problem. I used the internal MutexLocker; I believe I used it correctly, but please let me know.

I also added three tests:

1) Stack depth test:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch
This test shows that the maximum stack depth system is working.

2) Thread safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch
This test creates 24 threads that all allocate at the same time, then checks that samples are found from every thread.

3) Thread on/off safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch
This test creates 24 threads that all allocate a lot of memory while another thread turns the sampling on and off.

By the way, tests 2 and 3 both failed without the locks. A minimal sketch of the allocation pattern these two tests exercise is included further down in this mail.

While working on this, I noticed that the tests do a lot of very similar things, so I am going to clean up the code a bit and add a HeapAllocator class that all tests can call directly. That will greatly simplify the code.

Thanks for any comments/criticisms!
Jc

On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <jcbey...@google.com> wrote:

> Dear all,
>
> Small update to the webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/
>
> Full webrev is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/
>
> I updated a bit of the naming, removed a TODO comment, and added a test
> for the sampling rate. I also raised the maximum stack depth to 1024;
> there is no reason to keep it so small. I ran a micro benchmark of the
> overhead and it seems about the same.
>
> I compared allocations from a stack depth of 10 and allocations from a
> stack depth of 1024 (allocations are from the same helper method in
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
>   - For an array of 1 integer allocated in a loop; stack depth 1024 vs stack depth 10: 1% slower
>   - For an array of 200k integers allocated in a loop; stack depth 1024 vs stack depth 10: 3% slower
>
> So the maximum stack depth is now 1024, but we only copy over the frames
> that are actually used.
>
> For the next webrev, I will add a stack depth test to show that it works,
> and probably put back the mutex locking so that we can see how difficult
> it is to keep this thread safe.
>
> Let me know what you think!
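Coming back to the thread tests described at the top of this mail: the sketch below shows the kind of allocation pattern they exercise. It is only a minimal, self-contained illustration; the class and constant names are made up for this mail, and the real HeapMonitorThreadTest/HeapMonitorThreadOnOffTest additionally call into the JVMTI agent to enable sampling and to verify that samples were recorded for each thread.

public class HeapMonitorThreadSketch {
  // Illustrative only: 24 threads allocating concurrently, mirroring the
  // description of the thread-safety tests above.
  private static final int NUM_THREADS = 24;
  private static final int ITERATIONS = 100_000;   // arbitrary, just "a bunch of memory"

  public static void main(String[] args) throws InterruptedException {
    // In the real tests, sampling would be enabled here via the JVMTI agent.
    Thread[] threads = new Thread[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
      threads[i] = new Thread(() -> {
        long sink = 0;
        for (int j = 0; j < ITERATIONS; j++) {
          // Each iteration allocates a small array; over enough iterations,
          // every thread crosses the sampling interval many times.
          int[] tmp = new int[64];
          sink += tmp.length;   // keep the allocation from being trivially elided
        }
        System.out.println(Thread.currentThread().getName() + " done, sink = " + sink);
      }, "allocator-" + i);
    }
    for (Thread t : threads) {
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }
    // The real test would now ask the agent whether it found samples for all
    // 24 threads (and, for the on/off variant, another thread would have been
    // toggling sampling the whole time).
  }
}

The interesting part is simply that many threads hit the sampling path at the same time, which is what exposed the races that the MutexLocker now guards against.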
> Jc > > > > On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <jcbey...@google.com> wrote: > >> Forgot to say that for my numbers: >> - Not in the test are the actual numbers I got for the various array >> sizes, I ran the program 30 times and parsed the output; here are the >> averages and standard deviation: >> 1000: 1.28% average; 1.13% standard deviation >> 10000: 1.59% average; 1.25% standard deviation >> 100000: 1.26% average; 1.26% standard deviation >> >> The 1000/10000/100000 are the sizes of the arrays being allocated. These >> are allocated 100k times and the sampling rate is 111 times the size of the >> array. >> >> Thanks! >> Jc >> >> >> On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <jcbey...@google.com> wrote: >> >>> Hi all, >>> >>> After a bit of a break, I am back working on this :). As before, here >>> are two webrevs: >>> >>> - Full change set: http://cr.openjdk.java.ne >>> t/~rasbold/8171119/webrev.09/ >>> - Compared to version 8: http://cr.openjdk.java.net/ >>> ~rasbold/8171119/webrev.08_09/ >>> (This version is compared to version 8 I last showed but ported to >>> the new folder hierarchy) >>> >>> In this version I have: >>> - Handled Thomas' comments from his email of 07/03: >>> - Merged the logging to be standard >>> - Fixed up the code a bit where asked >>> - Added some notes about the code not being thread-safe yet >>> - Removed additional dead code from the version that modifies >>> interpreter/c1/c2 >>> - Fixed compiler issues so that it compiles with >>> --disable-precompiled-header >>> - Tested with ./configure --with-boot-jdk=<jdk8> >>> --with-debug-level=slowdebug --disable-precompiled-headers >>> >>> Additionally, I added a test to check the sanity of the sampler: >>> HeapMonitorStatCorrectnessTest (http://cr.openjdk.java.net/~r >>> asbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceabilit >>> y/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch) >>> - This allocates a number of arrays and checks that we obtain the >>> number of samples we want with an accepted error of 5%. I tested it 100 >>> times and it passed everytime, I can test more if wanted >>> - Not in the test are the actual numbers I got for the various array >>> sizes, I ran the program 30 times and parsed the output; here are the >>> averages and standard deviation: >>> 1000: 1.28% average; 1.13% standard deviation >>> 10000: 1.59% average; 1.25% standard deviation >>> 100000: 1.26% average; 1.26% standard deviation >>> >>> What this means is that we were always at about 1~2% of the number of >>> samples the test expected. >>> >>> Let me know what you think, >>> Jc >>> >>> >>> >>> On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <jcbey...@google.com> wrote: >>> >>>> Hi all, >>>> >>>> I apologize, I have not yet handled your remarks but thought this new >>>> webrev would also be useful to see and comment on perhaps. >>>> >>>> Here is the latest webrev, it is generated slightly different than the >>>> others since now I'm using webrev.ksh without the -N option: >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/ >>>> >>>> And the webrev.07 to webrev.08 diff is here: >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/ >>>> >>>> (Let me know if it works well) >>>> >>>> It's a small change between versions but it: >>>> - provides a fix that makes the average sample rate correct (more on >>>> that below). 
>>>> - fixes the code to actually have it play nicely with the fast tlab >>>> refill >>>> - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo >>>> - moved the capability to be onload solo >>>> >>>> With this webrev, I've done a small study of the random number >>>> generator we use here for the sampling rate. I took a small program and it >>>> can be simplified to: >>>> >>>> for (outer loop) >>>> for (inner loop) >>>> int[] tmp = new int[arraySize]; >>>> >>>> - I've fixed the outer and inner loops to being 800 for this >>>> experiment, meaning we allocate 640000 times an array of a given array >>>> size. >>>> >>>> - Each program provides the average sample size used for the whole >>>> execution >>>> >>>> - Then, I ran each variation 30 times and then calculated the average >>>> of the average sample size used for various array sizes. I selected the >>>> array size to be one of the following: 1, 10, 100, 1000. >>>> >>>> - When compared to 512kb, the average sample size of 30 runs: >>>> 1: 4.62% of error >>>> 10: 3.09% of error >>>> 100: 0.36% of error >>>> 1000: 0.1% of error >>>> 10000: 0.03% of error >>>> >>>> What it shows is that, depending on the number of samples, the average >>>> does become better. This is because with an allocation of 1 element per >>>> array, it will take longer to hit one of the thresholds. This is seen by >>>> looking at the sample count statistic I put in. For the same number of >>>> iterations (800 * 800), the different array sizes provoke: >>>> 1: 62 samples >>>> 10: 125 samples >>>> 100: 788 samples >>>> 1000: 6166 samples >>>> 10000: 57721 samples >>>> >>>> And of course, the more samples you have, the more sample rates you >>>> pick, which means that your average gets closer using that math. >>>> >>>> Thanks, >>>> Jc >>>> >>>> On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <jcbey...@google.com> >>>> wrote: >>>> >>>>> Thanks Robbin, >>>>> >>>>> This seems to have worked. When I have the next webrev ready, we will >>>>> find out but I'm fairly confident it will work! >>>>> >>>>> Thanks agian! >>>>> Jc >>>>> >>>>> On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <robbin....@oracle.com> >>>>> wrote: >>>>> >>>>>> Hi JC, >>>>>> >>>>>> On 06/29/2017 12:15 AM, JC Beyler wrote: >>>>>> >>>>>>> B) Incremental changes >>>>>>> >>>>>> >>>>>> I guess the most common work flow here is using mq : >>>>>> hg qnew fix_v1 >>>>>> edit files >>>>>> hg qrefresh >>>>>> hg qnew fix_v2 >>>>>> edit files >>>>>> hg qrefresh >>>>>> >>>>>> if you do hg log you will see 2 commits >>>>>> >>>>>> webrev.ksh -r -2 -o my_inc_v1_v2 >>>>>> webrev.ksh -o my_full_v2 >>>>>> >>>>>> >>>>>> In your .hgrc you might need: >>>>>> [extensions] >>>>>> mq = >>>>>> >>>>>> /Robbin >>>>>> >>>>>> >>>>>>> Again another newbiew question here... >>>>>>> >>>>>>> For showing the incremental changes, is there a link that explains >>>>>>> how to do that? I apologize for my newbie questions all the time :) >>>>>>> >>>>>>> Right now, I do: >>>>>>> >>>>>>> ksh ../webrev.ksh -m -N >>>>>>> >>>>>>> That generates a webrev.zip and send it to Chuck Rasbold. He then >>>>>>> uploads it to a new webrev. >>>>>>> >>>>>>> I tried commiting my change and adding a small change. Then if I >>>>>>> just do ksh ../webrev.ksh without any options, it seems to produce a >>>>>>> similar page but now with only the changes I had (so the 06-07 >>>>>>> comparison >>>>>>> you were talking about) and a changeset that has it all. I imagine that >>>>>>> is >>>>>>> what you meant. 
>>>>>>> >>>>>>> Which means that my workflow would become: >>>>>>> >>>>>>> 1) Make changes >>>>>>> 2) Make a webrev without any options to show just the differences >>>>>>> with the tip >>>>>>> 3) Amend my changes to my local commit so that I have it done with >>>>>>> 4) Go to 1 >>>>>>> >>>>>>> Does that seem correct to you? >>>>>>> >>>>>>> Note that when I do this, I only see the full change of a file in >>>>>>> the full change set (Side note here: now the page says change set and >>>>>>> not >>>>>>> patch, which is maybe why Serguei was having issues?). >>>>>>> >>>>>>> Thanks! >>>>>>> Jc >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <robbin....@oracle.com >>>>>>> <mailto:robbin....@oracle.com>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> On 06/28/2017 12:04 AM, JC Beyler wrote: >>>>>>> >>>>>>> Dear Thomas et al, >>>>>>> >>>>>>> Here is the newest webrev: >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/ < >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/> >>>>>>> >>>>>>> >>>>>>> >>>>>>> You have some more bits to in there but generally this looks >>>>>>> good and really nice with more tests. >>>>>>> I'll do and deep dive and re-test this when I get back from my >>>>>>> long vacation with whatever patch version you have then. >>>>>>> >>>>>>> Also I think it's time you provide incremental (v06->07 changes) >>>>>>> as well as complete change-sets. >>>>>>> >>>>>>> Thanks, Robbin >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Thomas, I "think" I have answered all your remarks. The >>>>>>> summary is: >>>>>>> >>>>>>> - The statistic system is up and provides insight on what >>>>>>> the heap sampler is doing >>>>>>> - I've noticed that, though the sampling rate is at the >>>>>>> right mean, we are missing some samples, I have not yet tracked out why >>>>>>> (details below) >>>>>>> >>>>>>> - I've run a tiny benchmark that is the worse case: it is a >>>>>>> very tight loop and allocated a small array >>>>>>> - In this case, I see no overhead when the system is >>>>>>> off so that is a good start :) >>>>>>> - I see right now a high overhead in this case when >>>>>>> sampling is on. This is not a really too surprising but I'm going to >>>>>>> see if >>>>>>> this is consistent with our >>>>>>> internal implementation. The benchmark is really allocation >>>>>>> stressful so I'm not too surprised but I want to do the due diligence. >>>>>>> >>>>>>> - The statistic system up is up and I have a new test >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/s >>>>>>> erviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTes >>>>>>> t.java.patch >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/ >>>>>>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTe >>>>>>> st.java.patch> >>>>>>> - I did a bit of a study about the random generator >>>>>>> here, more details are below but basically it seems to work well >>>>>>> >>>>>>> - I added a capability but since this is the first time >>>>>>> doing this, I was not sure I did it right >>>>>>> - I did add a test though for it and the test seems to >>>>>>> do what I expect (all methods are failing with the >>>>>>> JVMTI_ERROR_MUST_POSSESS_CAPABILITY error). 
>>>>>>> - http://cr.openjdk.java.net/~ra >>>>>>> sbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonito >>>>>>> r/MyPackage/HeapMonitorNoCapabilityTest.java.patch >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/ >>>>>>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapa >>>>>>> bilityTest.java.patch> >>>>>>> >>>>>>> - I still need to figure out what to do about the >>>>>>> multi-agent vs single-agent issue >>>>>>> >>>>>>> - As far as measurements, it seems I still need to look >>>>>>> at: >>>>>>> - Why we do the 20 random calls first, are they >>>>>>> necessary? >>>>>>> - Look at the mean of the sampling rate that the random >>>>>>> generator does and also what is actually sampled >>>>>>> - What is the overhead in terms of memory/performance >>>>>>> when on? >>>>>>> >>>>>>> I have inlined my answers, I think I got them all in the new >>>>>>> webrev, let me know your thoughts. >>>>>>> >>>>>>> Thanks again! >>>>>>> Jc >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl < >>>>>>> thomas.scha...@oracle.com <mailto:thomas.scha...@oracle.com> >>>>>>> <mailto:thomas.scha...@oracle.com >>>>>>> >>>>>>> <mailto:thomas.scha...@oracle.com>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote: >>>>>>> > Hi all, >>>>>>> > >>>>>>> > First off: Thanks again to Robbin and Thomas for >>>>>>> their reviews :) >>>>>>> > >>>>>>> > Next, I've uploaded a new webrev: >>>>>>> > http://cr.openjdk.java.net/~ra >>>>>>> sbold/8171119/webrev.06/ <http://cr.openjdk.java.net/~r >>>>>>> asbold/8171119/webrev.06/> >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/ < >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/>> >>>>>>> >>>>>>> > >>>>>>> > Here is an update: >>>>>>> > >>>>>>> > - @Robbin, I forgot to say that yes I need to look at >>>>>>> implementing >>>>>>> > this for the other architectures and testing it >>>>>>> before it is all >>>>>>> > ready to go. Is it common to have it working on all >>>>>>> possible >>>>>>> > combinations or is there a subset that I should be >>>>>>> doing first and we >>>>>>> > can do the others later? >>>>>>> > - I've tested slowdebug, built and ran the JTreg >>>>>>> tests I wrote with >>>>>>> > slowdebug and fixed a few more issues >>>>>>> > - I've refactored a bit of the code following Thomas' >>>>>>> comments >>>>>>> > - I think I've handled all the comments from >>>>>>> Thomas (I put >>>>>>> > comments inline below for the specifics) >>>>>>> >>>>>>> Thanks for handling all those. >>>>>>> >>>>>>> > - Following Thomas' comments on statistics, I want to >>>>>>> add some >>>>>>> > quality assurance tests and find that the easiest way >>>>>>> would be to >>>>>>> > have a few counters of what is happening in the >>>>>>> sampler and expose >>>>>>> > that to the user. >>>>>>> > - I'll be adding that in the next version if no >>>>>>> one sees any >>>>>>> > objections to that. >>>>>>> > - This will allow me to add a sanity test in JTreg >>>>>>> about number of >>>>>>> > samples and average of sampling rate >>>>>>> > >>>>>>> > @Thomas: I had a few questions that I inlined below >>>>>>> but I will >>>>>>> > summarize the "bigger ones" here: >>>>>>> > - You mentioned constants are not using the right >>>>>>> conventions, I >>>>>>> > looked around and didn't see any convention except >>>>>>> normal naming then >>>>>>> > for static constants. Is that right? 
>>>>>>> >>>>>>> I looked through https://wiki.openjdk.java.net/ >>>>>>> display/HotSpot/StyleGui <https://wiki.openjdk.java.net >>>>>>> /display/HotSpot/StyleGui> >>>>>>> <https://wiki.openjdk.java.net/display/HotSpot/StyleGui < >>>>>>> https://wiki.openjdk.java.net/display/HotSpot/StyleGui>> >>>>>>> de and the rule is to "follow an existing pattern and >>>>>>> must have a >>>>>>> distinct appearance from other names". Which does not >>>>>>> help a lot I >>>>>>> guess :/ The GC team started using upper camel case, >>>>>>> e.g. >>>>>>> SomeOtherConstant, but very likely this is probably not >>>>>>> applied >>>>>>> consistently throughout. So I am fine with not adding >>>>>>> another style >>>>>>> (like kMaxStackDepth with the "k" in front with some >>>>>>> unknown meaning) >>>>>>> is fine. >>>>>>> >>>>>>> (Chances are you will find that style somewhere used >>>>>>> anyway too, >>>>>>> apologies if so :/) >>>>>>> >>>>>>> >>>>>>> Thanks for that link, now I know where to look. I used the >>>>>>> upper camel case in my code as well then :) I should have gotten them >>>>>>> all. >>>>>>> >>>>>>> >>>>>>> > PS: I've also inlined my answers to Thomas below: >>>>>>> > >>>>>>> > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl >>>>>>> <thomas.schatzl@oracl >>>>>>> > e.com <http://e.com> <http://e.com>> wrote: >>>>>>> > > Hi all, >>>>>>> > > >>>>>>> > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote: >>>>>>> > > > Dear all, >>>>>>> > > > >>>>>>> > > > I've continued working on this and have done the >>>>>>> following >>>>>>> > > webrev: >>>>>>> > > > http://cr.openjdk.java.net/~ra >>>>>>> sbold/8171119/webrev.05/ <http://cr.openjdk.java.net/~r >>>>>>> asbold/8171119/webrev.05/> >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/ < >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/>> >>>>>>> >>>>>>> > > >>>>>>> > > [...] >>>>>>> > > > Things I still need to do: >>>>>>> > > > - Have to fix that TLAB case for the >>>>>>> FastTLABRefill >>>>>>> > > > - Have to start looking at the data to see >>>>>>> that it is >>>>>>> > > consistent and does gather the right samples, >>>>>>> right frequency, etc. >>>>>>> > > > - Have to check the GC elements and what that >>>>>>> produces >>>>>>> > > > - Run a slowdebug run and ensure I fixed all >>>>>>> those issues you >>>>>>> > > saw > Robbin >>>>>>> > > > >>>>>>> > > > Thanks for looking at the webrev and have a >>>>>>> great week! >>>>>>> > > >>>>>>> > > scratching a bit on the surface of this change, >>>>>>> so apologies for >>>>>>> > > rather shallow comments: >>>>>>> > > >>>>>>> > > - macroAssembler_x86.cpp:5604: while this is >>>>>>> compiler code, and I >>>>>>> > > am not sure this is final, please avoid littering >>>>>>> the code with >>>>>>> > > TODO remarks :) They tend to be candidates for >>>>>>> later wtf moments >>>>>>> > > only. >>>>>>> > > >>>>>>> > > Just file a CR for that. >>>>>>> > > >>>>>>> > Newcomer question: what is a CR and not sure I have >>>>>>> the rights to do >>>>>>> > that yet ? :) >>>>>>> >>>>>>> Apologies. CR is a change request, this suggests to >>>>>>> file a bug in the >>>>>>> bug tracker. And you are right, you can't just create a >>>>>>> new account in >>>>>>> the OpenJDK JIRA yourselves. :( >>>>>>> >>>>>>> >>>>>>> Ok good to know, I'll continue with my own todo list but >>>>>>> I'll work hard on not letting it slip in the webrevs anymore :) >>>>>>> >>>>>>> >>>>>>> I was mostly referring to the "... but it is a TODO" >>>>>>> part of that >>>>>>> comment in macroassembler_x86.cpp. 
Comments about the >>>>>>> why of the code >>>>>>> are appreciated. >>>>>>> >>>>>>> [Note that I now understand that this is to some degree >>>>>>> still work in >>>>>>> progress. As long as the final changeset does no >>>>>>> contain TODO's I am >>>>>>> fine (and it's not a hard objection, rather their use >>>>>>> in "final" code >>>>>>> is typically limited in my experience)] >>>>>>> >>>>>>> 5603 // Currently, if this happens, just set back the >>>>>>> actual end to >>>>>>> where it was. >>>>>>> 5604 // We miss a chance to sample here. >>>>>>> >>>>>>> Would be okay, if explaining "this" and the "why" of >>>>>>> missing a chance >>>>>>> to sample here would be best. >>>>>>> >>>>>>> Like maybe: >>>>>>> >>>>>>> // If we needed to refill TLABs, just set the actual >>>>>>> end point to >>>>>>> // the end of the TLAB again. We do not sample here >>>>>>> although we could. >>>>>>> >>>>>>> Done with your comment, it works well in my mind. >>>>>>> >>>>>>> I am not sure whether "miss a chance to sample" meant >>>>>>> "we could, but >>>>>>> consciously don't because it's not that useful" or "it >>>>>>> would be >>>>>>> necessary but don't because it's too complicated to >>>>>>> do.". >>>>>>> >>>>>>> Looking at the original comment once more, I am also >>>>>>> not sure if that >>>>>>> comment shouldn't referring to the "end" variable (not >>>>>>> actual_end) >>>>>>> because that's the variable that is responsible for >>>>>>> taking the sampling >>>>>>> path? (Going from the member description of >>>>>>> ThreadLocalAllocBuffer). >>>>>>> >>>>>>> >>>>>>> I've moved this code and it no longer shows up here but the >>>>>>> rationale and answer was: >>>>>>> >>>>>>> So.. Yes, end is the variable provoking the sampling. Actual >>>>>>> end is the actual end of the TLAB. >>>>>>> >>>>>>> What was happening here is that the code is resetting _end >>>>>>> to point towards the end of the new TLAB. Because, we now have the end >>>>>>> for >>>>>>> sampling and _actual_end for >>>>>>> the actual end, we need to update the actual_end as well. >>>>>>> >>>>>>> Normally, were we to do the real work here, we would >>>>>>> calculate the (end - start) offset, then do: >>>>>>> >>>>>>> - Set the new end to : start + (old_end - old_start) >>>>>>> - Set the actual end like we do here now where it because it >>>>>>> is the actual end. >>>>>>> >>>>>>> Why is this not done here now anymore? >>>>>>> - I was still debating which path to take: >>>>>>> - Do it in the fast refill code, it has its perks: >>>>>>> - In a world where fast refills are happening all >>>>>>> the time or a lot, we can augment there the code to do the sampling >>>>>>> - Remember what we had as an end before leaving the >>>>>>> slowpath and check on return >>>>>>> - This is what I'm doing now, it removes the need >>>>>>> to go fix up all fast refill paths but if you remain in fast refill >>>>>>> paths, >>>>>>> you won't get sampling. I >>>>>>> have to think of the consequences of that, maybe a future >>>>>>> change later on? >>>>>>> - I have the statistics now so I'm going to >>>>>>> study that >>>>>>> -> By the way, though my statistics are >>>>>>> showing I'm missing some samples, if I turn off FastTlabRefill, it is >>>>>>> the >>>>>>> same loss so for now, it seems >>>>>>> this does not occur in my simple test. >>>>>>> >>>>>>> >>>>>>> >>>>>>> But maybe I am only confused and it's best to just >>>>>>> leave the comment >>>>>>> away. 
:) >>>>>>> >>>>>>> Thinking about it some more, doesn't this not-sampling >>>>>>> in this case >>>>>>> mean that sampling does not work in any collector that >>>>>>> does inline TLAB >>>>>>> allocation at the moment? (Or is inline TLAB alloc >>>>>>> automatically >>>>>>> disabled with sampling somehow?) >>>>>>> >>>>>>> That would indeed be a bigger TODO then :) >>>>>>> >>>>>>> >>>>>>> Agreed, this remark made me think that perhaps as a first >>>>>>> step the new way of doing it is better but I did have to: >>>>>>> - Remove the const of the ThreadLocalBuffer remaining and >>>>>>> hard_end methods >>>>>>> - Move hard_end out of the header file to have a bit more >>>>>>> logic there >>>>>>> >>>>>>> Please let me know what you think of that and if you prefer >>>>>>> it this way or changing the fast refills. (I prefer this way now >>>>>>> because it >>>>>>> is more incremental). >>>>>>> >>>>>>> >>>>>>> > > - calling HeapMonitoring::do_weak_oops() (which >>>>>>> should probably be >>>>>>> > > called weak_oops_do() like other similar methods) >>>>>>> only if string >>>>>>> > > deduplication is enabled (in >>>>>>> g1CollectedHeap.cpp:4511) seems wrong. >>>>>>> > >>>>>>> > The call should be at least around 6 lines up outside >>>>>>> the if. >>>>>>> > >>>>>>> > Preferentially in a method like >>>>>>> process_weak_jni_handles(), including >>>>>>> > additional logging. (No new (G1) gc phase without >>>>>>> minimal logging >>>>>>> > :)). >>>>>>> > Done but really not sure because: >>>>>>> > >>>>>>> > I put for logging: >>>>>>> > log_develop_trace(gc, >>>>>>> freelist)("G1ConcRegionFreeing [other] : heap >>>>>>> > monitoring"); >>>>>>> >>>>>>> I would think that "gc, ref" would be more appropriate >>>>>>> log tags for >>>>>>> this similar to jni handles. >>>>>>> (I am als not sure what weak reference handling has to >>>>>>> do with >>>>>>> G1ConcRegionFreeing, so I am a bit puzzled) >>>>>>> >>>>>>> >>>>>>> I was not sure what to put for the tags or really as the >>>>>>> message. I cleaned it up a bit now to: >>>>>>> log_develop_trace(gc, ref)("HeapSampling [other] : heap >>>>>>> monitoring processing"); >>>>>>> >>>>>>> >>>>>>> >>>>>>> > Since weak_jni_handles didn't have logging for me to >>>>>>> be inspired >>>>>>> > from, I did that but unconvinced this is what should >>>>>>> be done. >>>>>>> >>>>>>> The JNI handle processing does have logging, but only in >>>>>>> ReferenceProcessor::process_discovered_references(). In >>>>>>> process_weak_jni_handles() only overall time is >>>>>>> measured (in a G1 >>>>>>> specific way, since only G1 supports disabling >>>>>>> reference procesing) :/ >>>>>>> >>>>>>> The code in ReferenceProcessor prints both time taken >>>>>>> referenceProcessor.cpp:254, as well as the count, but >>>>>>> strangely only in >>>>>>> debug VMs. >>>>>>> >>>>>>> I have no idea why this logging is that unimportant to >>>>>>> only print that >>>>>>> in a debug VM. However there are reviews out for >>>>>>> changing this area a >>>>>>> bit, so it might be useful to wait for that >>>>>>> (JDK-8173335). >>>>>>> >>>>>>> >>>>>>> I cleaned it up a bit anyway and now it returns the count of >>>>>>> objects that are in the system. >>>>>>> >>>>>>> >>>>>>> > > - the change doubles the size of >>>>>>> > > CollectedHeap::allocate_from_tlab_slow() above the >>>>>>> "small and nice" >>>>>>> > > threshold. Maybe it could be refactored a bit. >>>>>>> > Done I think, it looks better to me :). 
>>>>>>> >>>>>>> In ThreadLocalAllocBuffer::handle_sample() I think the >>>>>>> set_back_actual_end()/pick_next_sample() calls could >>>>>>> be hoisted out of >>>>>>> the "if" :) >>>>>>> >>>>>>> >>>>>>> Done! >>>>>>> >>>>>>> >>>>>>> > > - referenceProcessor.cpp:261: the change should add >>>>>>> logging about >>>>>>> > > the number of references encountered, maybe after >>>>>>> the corresponding >>>>>>> > > "JNI weak reference count" log message. >>>>>>> > Just to double check, are you saying that you'd like >>>>>>> to have the heap >>>>>>> > sampler to keep in store how many sampled objects >>>>>>> were encountered in >>>>>>> > the HeapMonitoring::weak_oops_do? >>>>>>> > - Would a return of the method with the number of >>>>>>> handled >>>>>>> > references and logging that work? >>>>>>> >>>>>>> Yes, it's fine if HeapMonitoring::weak_oops_do() only >>>>>>> returned the >>>>>>> number of processed weak oops. >>>>>>> >>>>>>> >>>>>>> Done also (but I admit I have not tested the output yet) :) >>>>>>> >>>>>>> >>>>>>> > - Additionally, would you prefer it in a separate >>>>>>> block with its >>>>>>> > GCTraceTime? >>>>>>> >>>>>>> Yes. Both kinds of information is interesting: while >>>>>>> the time taken is >>>>>>> typically more important, the next question would be >>>>>>> why, and the >>>>>>> number of references typically goes a long way there. >>>>>>> >>>>>>> See above though, it is probably best to wait a bit. >>>>>>> >>>>>>> >>>>>>> Agreed that I "could" wait but, if it's ok, I'll just >>>>>>> refactor/remove this when we get closer to something final. Either, >>>>>>> JDK-8173335 >>>>>>> has gone in and I will notice it now or it will soon and I >>>>>>> can change it then. >>>>>>> >>>>>>> >>>>>>> > > - threadLocalAllocBuffer.cpp:331: one more "TODO" >>>>>>> > Removed it and added it to my personal todos to look >>>>>>> at. >>>>>>> > > > >>>>>>> > > - threadLocalAllocBuffer.hpp: >>>>>>> ThreadLocalAllocBuffer class >>>>>>> > > documentation should be updated about the sampling >>>>>>> additions. I >>>>>>> > > would have no clue what the difference between >>>>>>> "actual_end" and >>>>>>> > > "end" would be from the given information. >>>>>>> > If you are talking about the comments in this file, I >>>>>>> made them more >>>>>>> > clear I hope in the new webrev. If it was somewhere >>>>>>> else, let me know >>>>>>> > where to change. >>>>>>> >>>>>>> Thanks, that's much better. Maybe a note in the comment >>>>>>> of the class >>>>>>> that ThreadLocalBuffer provides some sampling facility >>>>>>> by modifying the >>>>>>> end() of the TLAB to cause "frequent" calls into the >>>>>>> runtime call where >>>>>>> actual sampling takes place. >>>>>>> >>>>>>> >>>>>>> Done, I think it's better now. Added something about the >>>>>>> slow_path_end as well. >>>>>>> >>>>>>> >>>>>>> > > - in heapMonitoring.hpp: there are some random >>>>>>> comments about some >>>>>>> > > code that has been grabbed from >>>>>>> "util/math/fastmath.[h|cc]". I >>>>>>> > > can't tell whether this is code that can be used >>>>>>> but I assume that >>>>>>> > > Noam Shazeer is okay with that (i.e. that's all >>>>>>> Google code). >>>>>>> > Jeremy and I double checked and we can release that >>>>>>> as I thought. I >>>>>>> > removed the comment from that piece of code entirely. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> > > - heapMonitoring.hpp/cpp static constant naming >>>>>>> does not correspond >>>>>>> > > to Hotspot's. Additionally, in Hotspot static >>>>>>> methods are cased >>>>>>> > > like other methods. 
>>>>>>> > I think I fixed the methods to be cased the same way >>>>>>> as all other >>>>>>> > methods. For static constants, I was not sure. I >>>>>>> fixed a few other >>>>>>> > variables but I could not seem to really see a >>>>>>> consistent trend for >>>>>>> > constants. I made them as variables but I'm not sure >>>>>>> now. >>>>>>> >>>>>>> Sorry again, style is a kind of mess. The goal of my >>>>>>> suggestions here >>>>>>> is only to prevent yet another style creeping in. >>>>>>> >>>>>>> > > - in heapMonitoring.cpp there are a few cryptic >>>>>>> comments at the top >>>>>>> > > that seem to refer to internal stuff that should >>>>>>> probably be >>>>>>> > > removed. >>>>>>> > Sorry about that! My personal todos not cleared out. >>>>>>> >>>>>>> I am happy about comments, but I simply did not >>>>>>> understand any of that >>>>>>> and I do not know about other readers as well. >>>>>>> >>>>>>> If you think you will remember removing/updating them >>>>>>> until the review >>>>>>> proper (I misunderstood the review situation a little >>>>>>> it seems). >>>>>>> >>>>>>> > > I did not think through the impact of the TLAB >>>>>>> changes on collector >>>>>>> > > behavior yet (if there are). Also I did not check >>>>>>> for problems with >>>>>>> > > concurrent mark and SATB/G1 (if there are). >>>>>>> > I would love to know your thoughts on this, I think >>>>>>> this is fine. I >>>>>>> >>>>>>> I think so too now. No objects are made live out of >>>>>>> thin air :) >>>>>>> >>>>>>> > see issues with multiple threads right now hitting >>>>>>> the stack storage >>>>>>> > instance. Previous webrevs had a mutex lock here but >>>>>>> we took it out >>>>>>> > for simplificity (and only for now). >>>>>>> >>>>>>> :) When looking at this after some thinking I now >>>>>>> assume for this >>>>>>> review that this code is not MT safe at all. There >>>>>>> seems to be more >>>>>>> synchronization missing than just the one for the >>>>>>> StackTraceStorage. So >>>>>>> no comments about this here. >>>>>>> >>>>>>> >>>>>>> I doubled checked a bit (quickly I admit) but it seems that >>>>>>> synchronization in StackTraceStorage is really all you need (all methods >>>>>>> lead to a StackTraceStorage one >>>>>>> and can be multithreaded outside of that). >>>>>>> There is a question about the initialization where the >>>>>>> method HeapMonitoring::initialize_profiling is not thread safe. >>>>>>> It would work (famous last words) and not crash if there was >>>>>>> a race but we could add a synchronization point there as well (and >>>>>>> therefore on the stop as well). >>>>>>> >>>>>>> But anyway I will really check and do this once we add back >>>>>>> synchronization. >>>>>>> >>>>>>> >>>>>>> Also, this would require some kind of specification of >>>>>>> what is allowed >>>>>>> to be called when and where. >>>>>>> >>>>>>> >>>>>>> Would we specify this with the methods in the jvmti.xml >>>>>>> file? We could start by specifying in each that they are not thread safe >>>>>>> but I saw no mention of that for >>>>>>> other methods. >>>>>>> >>>>>>> >>>>>>> One potentially relevant observation about locking >>>>>>> here: depending on >>>>>>> sampling frequency, StackTraceStore::add_trace() may be >>>>>>> rather >>>>>>> frequently called. I assume that you are going to do >>>>>>> measurements :) >>>>>>> >>>>>>> >>>>>>> Though we don't have the TLAB implementation in our code, >>>>>>> the compiler generated sampler uses 2% of overhead with a 512k sampling >>>>>>> rate. 
I can do real measurements >>>>>>> when the code settles and we can see how costly this is as a >>>>>>> TLAB implementation. >>>>>>> However, my theory is that if the rate is 512k, the >>>>>>> memory/performance overhead should be minimal since it is what we saw >>>>>>> with >>>>>>> our code/workloads (though not called >>>>>>> the same way, we call it essentially at the same rate). >>>>>>> If you have a benchmark you'd like me to test, let me know! >>>>>>> >>>>>>> Right now, with my really small test, this does use a bit of >>>>>>> overhead even for a 512k sample size. I don't know yet why, I'm going to >>>>>>> see what is going on. >>>>>>> >>>>>>> Finally, I think it is not reasonable to suppose the >>>>>>> overhead to be negligible if the sampling rate used is too low. The user >>>>>>> should know that the lower the rate, >>>>>>> the higher the overhead (documentation TODO?). >>>>>>> >>>>>>> >>>>>>> I am not sure what the expected usage of the API is, but >>>>>>> StackTraceStore::add_trace() seems to be able to grow >>>>>>> without bounds. >>>>>>> Only a GC truncates them to the live ones. That in >>>>>>> itself seems to be >>>>>>> problematic (GCs can be *wide* apart), and of course >>>>>>> some of the API >>>>>>> methods add to that because they duplicate that >>>>>>> unbounded array. Do you >>>>>>> have any concerns/measurements about this? >>>>>>> >>>>>>> >>>>>>> So, the theory is that yes add_trace can be able to grow >>>>>>> without bounds but it grows at a sample per 512k of allocated space. The >>>>>>> stacks it gathers are currently >>>>>>> maxed at 64 (I'd like to expand that to an option to the >>>>>>> user though at some point). So I have no concerns because: >>>>>>> >>>>>>> - If really this is taking a lot of space, that means the >>>>>>> job is keeping a lot of objects in memory as well, therefore the entire >>>>>>> heap is getting huge >>>>>>> - If this is the case, you will be triggering a GC at some >>>>>>> point anyway. >>>>>>> >>>>>>> (I'm putting under the rug the issue of "What if we set the >>>>>>> rate to 1 for example" because as you lower the sampling rate, we cannot >>>>>>> guarantee low overhead; the >>>>>>> idea behind this feature is to have a means of having >>>>>>> meaningful allocated samples at a low overhead) >>>>>>> >>>>>>> I have no measurements really right now but since I now have >>>>>>> some statistics I can poll, I will look a bit more at this question. >>>>>>> >>>>>>> I have the same last sentence than above: the user should >>>>>>> expect this to happen if the sampling rate is too small. That probably >>>>>>> can >>>>>>> be reflected in the >>>>>>> StartHeapSampling as a note : careful this might impact your >>>>>>> performance. >>>>>>> >>>>>>> >>>>>>> Also, these stack traces might hold on to huge arrays. >>>>>>> Any >>>>>>> consideration of that? Particularly it might be the >>>>>>> cause for OOMEs in >>>>>>> tight memory situations. >>>>>>> >>>>>>> >>>>>>> There is a stack size maximum that is set to 64 so it should >>>>>>> not hold huge arrays. I don't think this is an issue but I can double >>>>>>> check >>>>>>> with a test or two. >>>>>>> >>>>>>> >>>>>>> - please consider adding a safepoint check in >>>>>>> HeapMonitoring::weak_oops_do to prevent accidental >>>>>>> misuse. >>>>>>> >>>>>>> - in struct StackTraceStorage, the public fields may >>>>>>> also need >>>>>>> underscores. At least some files in the runtime >>>>>>> directory have structs >>>>>>> with underscored public members (and some don't). 
The >>>>>>> runtime team >>>>>>> should probably comment on that. >>>>>>> >>>>>>> >>>>>>> Agreed I did not know. I looked around and a lot of structs >>>>>>> did not have them it seemed so I left it as is. I will happily change >>>>>>> it if >>>>>>> someone prefers (I was not >>>>>>> sure if you really preferred or not, your sentence seemed to >>>>>>> be more a note of "this might need to change but I don't know if the >>>>>>> runtime team enforces that", let >>>>>>> me know if I read that wrongly). >>>>>>> >>>>>>> >>>>>>> - In StackTraceStorage::weak_oops_do(), when examining >>>>>>> the >>>>>>> StackTraceData, maybe it is useful to consider having a >>>>>>> non-NULL >>>>>>> reference outside of the heap's reserved space an >>>>>>> error. There should >>>>>>> be no oop outside of the heap's reserved space ever. >>>>>>> >>>>>>> Unless you allow storing random values in >>>>>>> StackTraceData::obj, which I >>>>>>> would not encourage. >>>>>>> >>>>>>> >>>>>>> I suppose you are talking about this part: >>>>>>> if ((value != NULL && Universe::heap()->is_in_reserved(value)) >>>>>>> && >>>>>>> (is_alive == NULL || >>>>>>> is_alive->do_object_b(value))) { >>>>>>> >>>>>>> What you are saying is that I could have something like: >>>>>>> if (value != my_non_null_reference && >>>>>>> (is_alive == NULL || >>>>>>> is_alive->do_object_b(value))) { >>>>>>> >>>>>>> Is that what you meant? Is there really a reason to do so? >>>>>>> When I look at the code, is_in_reserved seems like a O(1) method call. >>>>>>> I'm >>>>>>> not even sure we can have a >>>>>>> NULL value to be honest. I might have to study that to see >>>>>>> if this was not a paranoid test to begin with. >>>>>>> >>>>>>> The is_alive code has now morphed due to the comment below. >>>>>>> >>>>>>> >>>>>>> >>>>>>> - HeapMonitoring::weak_oops_do() does not seem to use >>>>>>> the >>>>>>> passed AbstractRefProcTaskExecutor. >>>>>>> >>>>>>> >>>>>>> It did use it: >>>>>>> size_t HeapMonitoring::weak_oops_do( >>>>>>> AbstractRefProcTaskExecutor *task_executor, >>>>>>> BoolObjectClosure* is_alive, >>>>>>> OopClosure *f, >>>>>>> VoidClosure *complete_gc) { >>>>>>> assert(SafepointSynchronize::is_at_safepoint(), "must >>>>>>> be at safepoint"); >>>>>>> >>>>>>> if (task_executor != NULL) { >>>>>>> task_executor->set_single_threaded_mode(); >>>>>>> } >>>>>>> return StackTraceStorage::storage()->weak_oops_do(is_alive, >>>>>>> f, complete_gc); >>>>>>> } >>>>>>> >>>>>>> But due to the comment below, I refactored this, so this is >>>>>>> no longer here. Now I have an always true closure that is passed. >>>>>>> >>>>>>> >>>>>>> - I do not understand allowing to call this method with >>>>>>> a NULL >>>>>>> complete_gc closure. This would mean that objects >>>>>>> referenced from the >>>>>>> object that is referenced by the StackTraceData are not >>>>>>> pulled, meaning >>>>>>> they would get stale. >>>>>>> >>>>>>> - same with is_alive parameter value of NULL >>>>>>> >>>>>>> >>>>>>> So these questions made me look a bit closer at this code. >>>>>>> This code I think was written this way to have a very small impact on >>>>>>> the >>>>>>> file but you are right, there >>>>>>> is no reason for this here. I've simplified the code by >>>>>>> making in referenceProcessor.cpp a process_HeapSampling method that >>>>>>> handles >>>>>>> everything there. >>>>>>> >>>>>>> The code allowed NULLs because it depended on where you were >>>>>>> coming from and how the code was being called. 
>>>>>>> >>>>>>> - I added a static always_true variable and pass that now to >>>>>>> be more consistent with the rest of the code. >>>>>>> - I moved the complete_gc into process_phaseHeapSampling now >>>>>>> (new method) and handle the task_executor and the complete_gc there >>>>>>> - Newbie question: in our code we did a >>>>>>> set_single_threaded_mode but I see that process_phaseJNI does it right >>>>>>> before its call, do I need to do it for the >>>>>>> process_phaseHeapSample? >>>>>>> That API is much cleaner (in my mind) and is consistent with >>>>>>> what is done around it (again in my mind). >>>>>>> >>>>>>> >>>>>>> - heapMonitoring.cpp:590: I do not completely >>>>>>> understand the purpose of >>>>>>> this code: in the end this results in a fixed value >>>>>>> directly dependent >>>>>>> on the Thread address anyway? In the end this results >>>>>>> in a fixed value >>>>>>> directly dependent on the Thread address anyway? >>>>>>> IOW, what is special about exactly 20 rounds? >>>>>>> >>>>>>> >>>>>>> So we really want a fast random number generator that has a >>>>>>> specific mean (512k is the default we use). The code uses the thread >>>>>>> address as the start number of the >>>>>>> sequence (why not, it is random enough is rationale). Then >>>>>>> instead of just starting there, we prime the sequence and really only >>>>>>> start >>>>>>> at the 21st number, it is >>>>>>> arbitrary and I have not done a study to see if we could do >>>>>>> more or less of that. >>>>>>> >>>>>>> As I have the statistics of the system up and running, I'll >>>>>>> run some experiments to see if this is needed, is 20 good, or not. >>>>>>> >>>>>>> >>>>>>> - also I would consider stripping a few bits of the >>>>>>> threads' address as >>>>>>> initialization value for your rng. The last three bits >>>>>>> (and probably >>>>>>> more, check whether the Thread object is allocated on >>>>>>> special >>>>>>> boundaries) are always zero for them. >>>>>>> Not sure if the given "random" value is random enough >>>>>>> before/after, >>>>>>> this method, so just skip that comment if you think >>>>>>> this is not >>>>>>> required. >>>>>>> >>>>>>> >>>>>>> I don't know is the honest answer. I think what is important >>>>>>> is that we tend towards a mean and it is random "enough" to not fall in >>>>>>> pitfalls of only sampling a >>>>>>> subset of objects due to their allocation order. I added >>>>>>> that as test to do to see if it changes the mean in any way for the 512k >>>>>>> default value and/or if the first >>>>>>> 1000 elements look better. >>>>>>> >>>>>>> >>>>>>> Some more random nits I did not find a place to put >>>>>>> anywhere: >>>>>>> >>>>>>> - ThreadLocalAllocBuffer::_extra_space does not seem >>>>>>> to be used >>>>>>> anywhere? >>>>>>> >>>>>>> >>>>>>> Good catch :). >>>>>>> >>>>>>> >>>>>>> - Maybe indent the declaration of >>>>>>> ThreadLocalAllocBuffer::_bytes_until_sample to align below the >>>>>>> other members of that group. >>>>>>> >>>>>>> >>>>>>> Done moved it up a bit to have non static members together >>>>>>> and static separate. >>>>>>> >>>>>>> Thanks, >>>>>>> Thomas >>>>>>> >>>>>>> >>>>>>> Thanks for your review! >>>>>>> Jc >>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>> >> >
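As a small aside on the "20 rounds" priming question discussed above, here is a purely illustrative sketch. It is not the code in heapMonitoring.cpp: the xorshift update, the mixing constant, and the exponential interval math are stand-ins I chose for this mail. It only shows the general shape of the idea: seed a per-thread generator from a thread-specific value, discard the first 20 outputs (the priming step whose usefulness still needs to be measured), and then draw sampling intervals around the 512 KB mean.

public final class SamplingIntervalSketch {
  private long state;

  SamplingIntervalSketch(long seed) {
    // The low bits of a Thread address are likely zero (per the remark above),
    // so mix the seed before using it; xorshift only needs a non-zero state.
    state = seed ^ 0x9E3779B97F4A7C15L;
    if (state == 0) {
      state = 1;
    }
    // "Prime" the generator: skip the first 20 outputs, as discussed above.
    for (int i = 0; i < 20; i++) {
      next();
    }
  }

  private long next() {
    // Plain xorshift64; the real generator may use something else entirely.
    long x = state;
    x ^= x << 13;
    x ^= x >>> 7;
    x ^= x << 17;
    state = x;
    return x;
  }

  // Next number of bytes to allocate before taking a sample, with a ~512 KB mean.
  long nextSamplingInterval() {
    double u = (next() >>> 11) * 0x1.0p-53;   // uniform in [0, 1)
    if (u == 0.0) {
      u = Double.MIN_VALUE;
    }
    // An exponentially distributed interval gives every allocated byte the same
    // chance of being the sampled one, and its mean is exactly 512 * 1024 bytes.
    return (long) (-Math.log(u) * 512.0 * 1024.0);
  }

  public static void main(String[] args) {
    SamplingIntervalSketch rng = new SamplingIntervalSketch(Thread.currentThread().getId());
    long sum = 0;
    int n = 1_000_000;
    for (int i = 0; i < n; i++) {
      sum += rng.nextSamplingInterval();
    }
    System.out.println("observed mean interval: " + (sum / n) + " bytes");
  }
}

With the seed mixed and the sequence primed, the observed mean lands close to 512 KB even for very regular seeds; whether 20 rounds of priming is actually needed is exactly the question the new statistics should be able to answer.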