Dear all,

Thread-safety is back!!

Here is the incremental webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/

Full webrev is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/

In order to really test the thread safety, I needed to add tests for it, so this seemed like a good time to bring it back. Doing so required a few changes to the creation path to ensure correctness and safety: we now keep the static pointer but clear the data internally, so re-initialization is a bit more costly than before. I don't think re-initialization is a common use case, so that should not be a problem. I used the internal MutexLocker; I believe I used it correctly, but please let me know.

I also added three tests:

1) Stack depth test:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch
This test shows that the maximum stack depth system is working.

2) Thread safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch
This test creates 24 threads that all allocate at the same time, then checks that samples are found from every thread.

3) Thread on/off safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch
This test creates 24 threads that all allocate a lot of memory while another thread turns the sampling on and off.

By the way, tests 2 and 3 both failed without the locks. A minimal sketch of the allocation pattern these two tests exercise is included further down in this mail.

While working on this, I noticed that the tests do a lot of very similar things, so I am going to clean up the code a bit and add a HeapAllocator class that all tests can call directly. That will greatly simplify the code.

Thanks for any comments/criticisms!
Jc

On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <jcbey...@google.com> wrote:

> Dear all,
>
> Small update to the webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/
>
> Full webrev is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/
>
> I updated a bit of the naming, removed a TODO comment, and added a test
> for the sampling rate. I also raised the maximum stack depth to 1024;
> there is no reason to keep it so small. I ran a micro benchmark of the
> overhead and it seems about the same.
>
> I compared allocations from a stack depth of 10 and allocations from a
> stack depth of 1024 (allocations are from the same helper method in
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
>   - For an array of 1 integer allocated in a loop; stack depth 1024 vs stack depth 10: 1% slower
>   - For an array of 200k integers allocated in a loop; stack depth 1024 vs stack depth 10: 3% slower
>
> So the maximum stack depth is now 1024, but we only copy over the frames
> that are actually used.
>
> For the next webrev, I will add a stack depth test to show that it works,
> and probably put back the mutex locking so that we can see how difficult
> it is to keep this thread safe.
>
> Let me know what you think!
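Coming back to the thread tests described at the top of this mail: the sketch below shows the kind of allocation pattern they exercise. It is only a minimal, self-contained illustration; the class and constant names are made up for this mail, and the real HeapMonitorThreadTest/HeapMonitorThreadOnOffTest additionally call into the JVMTI agent to enable sampling and to verify that samples were recorded for each thread.

public class HeapMonitorThreadSketch {
  // Illustrative only: 24 threads allocating concurrently, mirroring the
  // description of the thread-safety tests above.
  private static final int NUM_THREADS = 24;
  private static final int ITERATIONS = 100_000;   // arbitrary, just "a bunch of memory"

  public static void main(String[] args) throws InterruptedException {
    // In the real tests, sampling would be enabled here via the JVMTI agent.
    Thread[] threads = new Thread[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
      threads[i] = new Thread(() -> {
        long sink = 0;
        for (int j = 0; j < ITERATIONS; j++) {
          // Each iteration allocates a small array; over enough iterations,
          // every thread crosses the sampling interval many times.
          int[] tmp = new int[64];
          sink += tmp.length;   // keep the allocation from being trivially elided
        }
        System.out.println(Thread.currentThread().getName() + " done, sink = " + sink);
      }, "allocator-" + i);
    }
    for (Thread t : threads) {
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }
    // The real test would now ask the agent whether it found samples for all
    // 24 threads (and, for the on/off variant, another thread would have been
    // toggling sampling the whole time).
  }
}

The interesting part is simply that many threads hit the sampling path at the same time, which is what exposed the races that the MutexLocker now guards against.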
> Jc > > > > On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <jcbey...@google.com> wrote: > >> Forgot to say that for my numbers: >> - Not in the test are the actual numbers I got for the various array >> sizes, I ran the program 30 times and parsed the output; here are the >> averages and standard deviation: >> 1000: 1.28% average; 1.13% standard deviation >> 10000: 1.59% average; 1.25% standard deviation >> 100000: 1.26% average; 1.26% standard deviation >> >> The 1000/10000/100000 are the sizes of the arrays being allocated. These >> are allocated 100k times and the sampling rate is 111 times the size of the >> array. >> >> Thanks! >> Jc >> >> >> On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <jcbey...@google.com> wrote: >> >>> Hi all, >>> >>> After a bit of a break, I am back working on this :). As before, here >>> are two webrevs: >>> >>> - Full change set: http://cr.openjdk.java.ne >>> t/~rasbold/8171119/webrev.09/ >>> - Compared to version 8: http://cr.openjdk.java.net/ >>> ~rasbold/8171119/webrev.08_09/ >>> (This version is compared to version 8 I last showed but ported to >>> the new folder hierarchy) >>> >>> In this version I have: >>> - Handled Thomas' comments from his email of 07/03: >>> - Merged the logging to be standard >>> - Fixed up the code a bit where asked >>> - Added some notes about the code not being thread-safe yet >>> - Removed additional dead code from the version that modifies >>> interpreter/c1/c2 >>> - Fixed compiler issues so that it compiles with >>> --disable-precompiled-header >>> - Tested with ./configure --with-boot-jdk=<jdk8> >>> --with-debug-level=slowdebug --disable-precompiled-headers >>> >>> Additionally, I added a test to check the sanity of the sampler: >>> HeapMonitorStatCorrectnessTest (http://cr.openjdk.java.net/~r >>> asbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceabilit >>> y/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch) >>> - This allocates a number of arrays and checks that we obtain the >>> number of samples we want with an accepted error of 5%. I tested it 100 >>> times and it passed everytime, I can test more if wanted >>> - Not in the test are the actual numbers I got for the various array >>> sizes, I ran the program 30 times and parsed the output; here are the >>> averages and standard deviation: >>> 1000: 1.28% average; 1.13% standard deviation >>> 10000: 1.59% average; 1.25% standard deviation >>> 100000: 1.26% average; 1.26% standard deviation >>> >>> What this means is that we were always at about 1~2% of the number of >>> samples the test expected. >>> >>> Let me know what you think, >>> Jc >>> >>> >>> >>> On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <jcbey...@google.com> wrote: >>> >>>> Hi all, >>>> >>>> I apologize, I have not yet handled your remarks but thought this new >>>> webrev would also be useful to see and comment on perhaps. >>>> >>>> Here is the latest webrev, it is generated slightly different than the >>>> others since now I'm using webrev.ksh without the -N option: >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/ >>>> >>>> And the webrev.07 to webrev.08 diff is here: >>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/ >>>> >>>> (Let me know if it works well) >>>> >>>> It's a small change between versions but it: >>>> - provides a fix that makes the average sample rate correct (more on >>>> that below). 
>>>> - fixes the code to actually have it play nicely with the fast tlab >>>> refill >>>> - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo >>>> - moved the capability to be onload solo >>>> >>>> With this webrev, I've done a small study of the random number >>>> generator we use here for the sampling rate. I took a small program and it >>>> can be simplified to: >>>> >>>> for (outer loop) >>>> for (inner loop) >>>> int[] tmp = new int[arraySize]; >>>> >>>> - I've fixed the outer and inner loops to being 800 for this >>>> experiment, meaning we allocate 640000 times an array of a given array >>>> size. >>>> >>>> - Each program provides the average sample size used for the whole >>>> execution >>>> >>>> - Then, I ran each variation 30 times and then calculated the average >>>> of the average sample size used for various array sizes. I selected the >>>> array size to be one of the following: 1, 10, 100, 1000. >>>> >>>> - When compared to 512kb, the average sample size of 30 runs: >>>> 1: 4.62% of error >>>> 10: 3.09% of error >>>> 100: 0.36% of error >>>> 1000: 0.1% of error >>>> 10000: 0.03% of error >>>> >>>> What it shows is that, depending on the number of samples, the average >>>> does become better. This is because with an allocation of 1 element per >>>> array, it will take longer to hit one of the thresholds. This is seen by >>>> looking at the sample count statistic I put in. For the same number of >>>> iterations (800 * 800), the different array sizes provoke: >>>> 1: 62 samples >>>> 10: 125 samples >>>> 100: 788 samples >>>> 1000: 6166 samples >>>> 10000: 57721 samples >>>> >>>> And of course, the more samples you have, the more sample rates you >>>> pick, which means that your average gets closer using that math. >>>> >>>> Thanks, >>>> Jc >>>> >>>> On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <jcbey...@google.com> >>>> wrote: >>>> >>>>> Thanks Robbin, >>>>> >>>>> This seems to have worked. When I have the next webrev ready, we will >>>>> find out but I'm fairly confident it will work! >>>>> >>>>> Thanks agian! >>>>> Jc >>>>> >>>>> On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <robbin....@oracle.com> >>>>> wrote: >>>>> >>>>>> Hi JC, >>>>>> >>>>>> On 06/29/2017 12:15 AM, JC Beyler wrote: >>>>>> >>>>>>> B) Incremental changes >>>>>>> >>>>>> >>>>>> I guess the most common work flow here is using mq : >>>>>> hg qnew fix_v1 >>>>>> edit files >>>>>> hg qrefresh >>>>>> hg qnew fix_v2 >>>>>> edit files >>>>>> hg qrefresh >>>>>> >>>>>> if you do hg log you will see 2 commits >>>>>> >>>>>> webrev.ksh -r -2 -o my_inc_v1_v2 >>>>>> webrev.ksh -o my_full_v2 >>>>>> >>>>>> >>>>>> In your .hgrc you might need: >>>>>> [extensions] >>>>>> mq = >>>>>> >>>>>> /Robbin >>>>>> >>>>>> >>>>>>> Again another newbiew question here... >>>>>>> >>>>>>> For showing the incremental changes, is there a link that explains >>>>>>> how to do that? I apologize for my newbie questions all the time :) >>>>>>> >>>>>>> Right now, I do: >>>>>>> >>>>>>> ksh ../webrev.ksh -m -N >>>>>>> >>>>>>> That generates a webrev.zip and send it to Chuck Rasbold. He then >>>>>>> uploads it to a new webrev. >>>>>>> >>>>>>> I tried commiting my change and adding a small change. Then if I >>>>>>> just do ksh ../webrev.ksh without any options, it seems to produce a >>>>>>> similar page but now with only the changes I had (so the 06-07 >>>>>>> comparison >>>>>>> you were talking about) and a changeset that has it all. I imagine that >>>>>>> is >>>>>>> what you meant. 
>>>>>>> >>>>>>> Which means that my workflow would become: >>>>>>> >>>>>>> 1) Make changes >>>>>>> 2) Make a webrev without any options to show just the differences >>>>>>> with the tip >>>>>>> 3) Amend my changes to my local commit so that I have it done with >>>>>>> 4) Go to 1 >>>>>>> >>>>>>> Does that seem correct to you? >>>>>>> >>>>>>> Note that when I do this, I only see the full change of a file in >>>>>>> the full change set (Side note here: now the page says change set and >>>>>>> not >>>>>>> patch, which is maybe why Serguei was having issues?). >>>>>>> >>>>>>> Thanks! >>>>>>> Jc >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <robbin....@oracle.com >>>>>>> <mailto:robbin....@oracle.com>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> On 06/28/2017 12:04 AM, JC Beyler wrote: >>>>>>> >>>>>>> Dear Thomas et al, >>>>>>> >>>>>>> Here is the newest webrev: >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/ < >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/> >>>>>>> >>>>>>> >>>>>>> >>>>>>> You have some more bits to in there but generally this looks >>>>>>> good and really nice with more tests. >>>>>>> I'll do and deep dive and re-test this when I get back from my >>>>>>> long vacation with whatever patch version you have then. >>>>>>> >>>>>>> Also I think it's time you provide incremental (v06->07 changes) >>>>>>> as well as complete change-sets. >>>>>>> >>>>>>> Thanks, Robbin >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Thomas, I "think" I have answered all your remarks. The >>>>>>> summary is: >>>>>>> >>>>>>> - The statistic system is up and provides insight on what >>>>>>> the heap sampler is doing >>>>>>> - I've noticed that, though the sampling rate is at the >>>>>>> right mean, we are missing some samples, I have not yet tracked out why >>>>>>> (details below) >>>>>>> >>>>>>> - I've run a tiny benchmark that is the worse case: it is a >>>>>>> very tight loop and allocated a small array >>>>>>> - In this case, I see no overhead when the system is >>>>>>> off so that is a good start :) >>>>>>> - I see right now a high overhead in this case when >>>>>>> sampling is on. This is not a really too surprising but I'm going to >>>>>>> see if >>>>>>> this is consistent with our >>>>>>> internal implementation. The benchmark is really allocation >>>>>>> stressful so I'm not too surprised but I want to do the due diligence. >>>>>>> >>>>>>> - The statistic system up is up and I have a new test >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/s >>>>>>> erviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTes >>>>>>> t.java.patch >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/ >>>>>>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTe >>>>>>> st.java.patch> >>>>>>> - I did a bit of a study about the random generator >>>>>>> here, more details are below but basically it seems to work well >>>>>>> >>>>>>> - I added a capability but since this is the first time >>>>>>> doing this, I was not sure I did it right >>>>>>> - I did add a test though for it and the test seems to >>>>>>> do what I expect (all methods are failing with the >>>>>>> JVMTI_ERROR_MUST_POSSESS_CAPABILITY error). 
>>>>>>> - http://cr.openjdk.java.net/~ra >>>>>>> sbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonito >>>>>>> r/MyPackage/HeapMonitorNoCapabilityTest.java.patch >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/ >>>>>>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapa >>>>>>> bilityTest.java.patch> >>>>>>> >>>>>>> - I still need to figure out what to do about the >>>>>>> multi-agent vs single-agent issue >>>>>>> >>>>>>> - As far as measurements, it seems I still need to look >>>>>>> at: >>>>>>> - Why we do the 20 random calls first, are they >>>>>>> necessary? >>>>>>> - Look at the mean of the sampling rate that the random >>>>>>> generator does and also what is actually sampled >>>>>>> - What is the overhead in terms of memory/performance >>>>>>> when on? >>>>>>> >>>>>>> I have inlined my answers, I think I got them all in the new >>>>>>> webrev, let me know your thoughts. >>>>>>> >>>>>>> Thanks again! >>>>>>> Jc >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl < >>>>>>> thomas.scha...@oracle.com <mailto:thomas.scha...@oracle.com> >>>>>>> <mailto:thomas.scha...@oracle.com >>>>>>> >>>>>>> <mailto:thomas.scha...@oracle.com>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote: >>>>>>> > Hi all, >>>>>>> > >>>>>>> > First off: Thanks again to Robbin and Thomas for >>>>>>> their reviews :) >>>>>>> > >>>>>>> > Next, I've uploaded a new webrev: >>>>>>> > http://cr.openjdk.java.net/~ra >>>>>>> sbold/8171119/webrev.06/ <http://cr.openjdk.java.net/~r >>>>>>> asbold/8171119/webrev.06/> >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/ < >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/>> >>>>>>> >>>>>>> > >>>>>>> > Here is an update: >>>>>>> > >>>>>>> > - @Robbin, I forgot to say that yes I need to look at >>>>>>> implementing >>>>>>> > this for the other architectures and testing it >>>>>>> before it is all >>>>>>> > ready to go. Is it common to have it working on all >>>>>>> possible >>>>>>> > combinations or is there a subset that I should be >>>>>>> doing first and we >>>>>>> > can do the others later? >>>>>>> > - I've tested slowdebug, built and ran the JTreg >>>>>>> tests I wrote with >>>>>>> > slowdebug and fixed a few more issues >>>>>>> > - I've refactored a bit of the code following Thomas' >>>>>>> comments >>>>>>> > - I think I've handled all the comments from >>>>>>> Thomas (I put >>>>>>> > comments inline below for the specifics) >>>>>>> >>>>>>> Thanks for handling all those. >>>>>>> >>>>>>> > - Following Thomas' comments on statistics, I want to >>>>>>> add some >>>>>>> > quality assurance tests and find that the easiest way >>>>>>> would be to >>>>>>> > have a few counters of what is happening in the >>>>>>> sampler and expose >>>>>>> > that to the user. >>>>>>> > - I'll be adding that in the next version if no >>>>>>> one sees any >>>>>>> > objections to that. >>>>>>> > - This will allow me to add a sanity test in JTreg >>>>>>> about number of >>>>>>> > samples and average of sampling rate >>>>>>> > >>>>>>> > @Thomas: I had a few questions that I inlined below >>>>>>> but I will >>>>>>> > summarize the "bigger ones" here: >>>>>>> > - You mentioned constants are not using the right >>>>>>> conventions, I >>>>>>> > looked around and didn't see any convention except >>>>>>> normal naming then >>>>>>> > for static constants. Is that right? 
>>>>>>> >>>>>>> I looked through https://wiki.openjdk.java.net/ >>>>>>> display/HotSpot/StyleGui <https://wiki.openjdk.java.net >>>>>>> /display/HotSpot/StyleGui> >>>>>>> <https://wiki.openjdk.java.net/display/HotSpot/StyleGui < >>>>>>> https://wiki.openjdk.java.net/display/HotSpot/StyleGui>> >>>>>>> de and the rule is to "follow an existing pattern and >>>>>>> must have a >>>>>>> distinct appearance from other names". Which does not >>>>>>> help a lot I >>>>>>> guess :/ The GC team started using upper camel case, >>>>>>> e.g. >>>>>>> SomeOtherConstant, but very likely this is probably not >>>>>>> applied >>>>>>> consistently throughout. So I am fine with not adding >>>>>>> another style >>>>>>> (like kMaxStackDepth with the "k" in front with some >>>>>>> unknown meaning) >>>>>>> is fine. >>>>>>> >>>>>>> (Chances are you will find that style somewhere used >>>>>>> anyway too, >>>>>>> apologies if so :/) >>>>>>> >>>>>>> >>>>>>> Thanks for that link, now I know where to look. I used the >>>>>>> upper camel case in my code as well then :) I should have gotten them >>>>>>> all. >>>>>>> >>>>>>> >>>>>>> > PS: I've also inlined my answers to Thomas below: >>>>>>> > >>>>>>> > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl >>>>>>> <thomas.schatzl@oracl >>>>>>> > e.com <http://e.com> <http://e.com>> wrote: >>>>>>> > > Hi all, >>>>>>> > > >>>>>>> > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote: >>>>>>> > > > Dear all, >>>>>>> > > > >>>>>>> > > > I've continued working on this and have done the >>>>>>> following >>>>>>> > > webrev: >>>>>>> > > > http://cr.openjdk.java.net/~ra >>>>>>> sbold/8171119/webrev.05/ <http://cr.openjdk.java.net/~r >>>>>>> asbold/8171119/webrev.05/> >>>>>>> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/ < >>>>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/>> >>>>>>> >>>>>>> > > >>>>>>> > > [...] >>>>>>> > > > Things I still need to do: >>>>>>> > > > - Have to fix that TLAB case for the >>>>>>> FastTLABRefill >>>>>>> > > > - Have to start looking at the data to see >>>>>>> that it is >>>>>>> > > consistent and does gather the right samples, >>>>>>> right frequency, etc. >>>>>>> > > > - Have to check the GC elements and what that >>>>>>> produces >>>>>>> > > > - Run a slowdebug run and ensure I fixed all >>>>>>> those issues you >>>>>>> > > saw > Robbin >>>>>>> > > > >>>>>>> > > > Thanks for looking at the webrev and have a >>>>>>> great week! >>>>>>> > > >>>>>>> > > scratching a bit on the surface of this change, >>>>>>> so apologies for >>>>>>> > > rather shallow comments: >>>>>>> > > >>>>>>> > > - macroAssembler_x86.cpp:5604: while this is >>>>>>> compiler code, and I >>>>>>> > > am not sure this is final, please avoid littering >>>>>>> the code with >>>>>>> > > TODO remarks :) They tend to be candidates for >>>>>>> later wtf moments >>>>>>> > > only. >>>>>>> > > >>>>>>> > > Just file a CR for that. >>>>>>> > > >>>>>>> > Newcomer question: what is a CR and not sure I have >>>>>>> the rights to do >>>>>>> > that yet ? :) >>>>>>> >>>>>>> Apologies. CR is a change request, this suggests to >>>>>>> file a bug in the >>>>>>> bug tracker. And you are right, you can't just create a >>>>>>> new account in >>>>>>> the OpenJDK JIRA yourselves. :( >>>>>>> >>>>>>> >>>>>>> Ok good to know, I'll continue with my own todo list but >>>>>>> I'll work hard on not letting it slip in the webrevs anymore :) >>>>>>> >>>>>>> >>>>>>> I was mostly referring to the "... but it is a TODO" >>>>>>> part of that >>>>>>> comment in macroassembler_x86.cpp. 
Comments about the >>>>>>> why of the code >>>>>>> are appreciated. >>>>>>> >>>>>>> [Note that I now understand that this is to some degree >>>>>>> still work in >>>>>>> progress. As long as the final changeset does no >>>>>>> contain TODO's I am >>>>>>> fine (and it's not a hard objection, rather their use >>>>>>> in "final" code >>>>>>> is typically limited in my experience)] >>>>>>> >>>>>>> 5603 // Currently, if this happens, just set back the >>>>>>> actual end to >>>>>>> where it was. >>>>>>> 5604 // We miss a chance to sample here. >>>>>>> >>>>>>> Would be okay, if explaining "this" and the "why" of >>>>>>> missing a chance >>>>>>> to sample here would be best. >>>>>>> >>>>>>> Like maybe: >>>>>>> >>>>>>> // If we needed to refill TLABs, just set the actual >>>>>>> end point to >>>>>>> // the end of the TLAB again. We do not sample here >>>>>>> although we could. >>>>>>> >>>>>>> Done with your comment, it works well in my mind. >>>>>>> >>>>>>> I am not sure whether "miss a chance to sample" meant >>>>>>> "we could, but >>>>>>> consciously don't because it's not that useful" or "it >>>>>>> would be >>>>>>> necessary but don't because it's too complicated to >>>>>>> do.". >>>>>>> >>>>>>> Looking at the original comment once more, I am also >>>>>>> not sure if that >>>>>>> comment shouldn't referring to the "end" variable (not >>>>>>> actual_end) >>>>>>> because that's the variable that is responsible for >>>>>>> taking the sampling >>>>>>> path? (Going from the member description of >>>>>>> ThreadLocalAllocBuffer). >>>>>>> >>>>>>> >>>>>>> I've moved this code and it no longer shows up here but the >>>>>>> rationale and answer was: >>>>>>> >>>>>>> So.. Yes, end is the variable provoking the sampling. Actual >>>>>>> end is the actual end of the TLAB. >>>>>>> >>>>>>> What was happening here is that the code is resetting _end >>>>>>> to point towards the end of the new TLAB. Because, we now have the end >>>>>>> for >>>>>>> sampling and _actual_end for >>>>>>> the actual end, we need to update the actual_end as well. >>>>>>> >>>>>>> Normally, were we to do the real work here, we would >>>>>>> calculate the (end - start) offset, then do: >>>>>>> >>>>>>> - Set the new end to : start + (old_end - old_start) >>>>>>> - Set the actual end like we do here now where it because it >>>>>>> is the actual end. >>>>>>> >>>>>>> Why is this not done here now anymore? >>>>>>> - I was still debating which path to take: >>>>>>> - Do it in the fast refill code, it has its perks: >>>>>>> - In a world where fast refills are happening all >>>>>>> the time or a lot, we can augment there the code to do the sampling >>>>>>> - Remember what we had as an end before leaving the >>>>>>> slowpath and check on return >>>>>>> - This is what I'm doing now, it removes the need >>>>>>> to go fix up all fast refill paths but if you remain in fast refill >>>>>>> paths, >>>>>>> you won't get sampling. I >>>>>>> have to think of the consequences of that, maybe a future >>>>>>> change later on? >>>>>>> - I have the statistics now so I'm going to >>>>>>> study that >>>>>>> -> By the way, though my statistics are >>>>>>> showing I'm missing some samples, if I turn off FastTlabRefill, it is >>>>>>> the >>>>>>> same loss so for now, it seems >>>>>>> this does not occur in my simple test. >>>>>>> >>>>>>> >>>>>>> >>>>>>> But maybe I am only confused and it's best to just >>>>>>> leave the comment >>>>>>> away. 
:) >>>>>>> >>>>>>> Thinking about it some more, doesn't this not-sampling >>>>>>> in this case >>>>>>> mean that sampling does not work in any collector that >>>>>>> does inline TLAB >>>>>>> allocation at the moment? (Or is inline TLAB alloc >>>>>>> automatically >>>>>>> disabled with sampling somehow?) >>>>>>> >>>>>>> That would indeed be a bigger TODO then :) >>>>>>> >>>>>>> >>>>>>> Agreed, this remark made me think that perhaps as a first >>>>>>> step the new way of doing it is better but I did have to: >>>>>>> - Remove the const of the ThreadLocalBuffer remaining and >>>>>>> hard_end methods >>>>>>> - Move hard_end out of the header file to have a bit more >>>>>>> logic there >>>>>>> >>>>>>> Please let me know what you think of that and if you prefer >>>>>>> it this way or changing the fast refills. (I prefer this way now >>>>>>> because it >>>>>>> is more incremental). >>>>>>> >>>>>>> >>>>>>> > > - calling HeapMonitoring::do_weak_oops() (which >>>>>>> should probably be >>>>>>> > > called weak_oops_do() like other similar methods) >>>>>>> only if string >>>>>>> > > deduplication is enabled (in >>>>>>> g1CollectedHeap.cpp:4511) seems wrong. >>>>>>> > >>>>>>> > The call should be at least around 6 lines up outside >>>>>>> the if. >>>>>>> > >>>>>>> > Preferentially in a method like >>>>>>> process_weak_jni_handles(), including >>>>>>> > additional logging. (No new (G1) gc phase without >>>>>>> minimal logging >>>>>>> > :)). >>>>>>> > Done but really not sure because: >>>>>>> > >>>>>>> > I put for logging: >>>>>>> > log_develop_trace(gc, >>>>>>> freelist)("G1ConcRegionFreeing [other] : heap >>>>>>> > monitoring"); >>>>>>> >>>>>>> I would think that "gc, ref" would be more appropriate >>>>>>> log tags for >>>>>>> this similar to jni handles. >>>>>>> (I am als not sure what weak reference handling has to >>>>>>> do with >>>>>>> G1ConcRegionFreeing, so I am a bit puzzled) >>>>>>> >>>>>>> >>>>>>> I was not sure what to put for the tags or really as the >>>>>>> message. I cleaned it up a bit now to: >>>>>>> log_develop_trace(gc, ref)("HeapSampling [other] : heap >>>>>>> monitoring processing"); >>>>>>> >>>>>>> >>>>>>> >>>>>>> > Since weak_jni_handles didn't have logging for me to >>>>>>> be inspired >>>>>>> > from, I did that but unconvinced this is what should >>>>>>> be done. >>>>>>> >>>>>>> The JNI handle processing does have logging, but only in >>>>>>> ReferenceProcessor::process_discovered_references(). In >>>>>>> process_weak_jni_handles() only overall time is >>>>>>> measured (in a G1 >>>>>>> specific way, since only G1 supports disabling >>>>>>> reference procesing) :/ >>>>>>> >>>>>>> The code in ReferenceProcessor prints both time taken >>>>>>> referenceProcessor.cpp:254, as well as the count, but >>>>>>> strangely only in >>>>>>> debug VMs. >>>>>>> >>>>>>> I have no idea why this logging is that unimportant to >>>>>>> only print that >>>>>>> in a debug VM. However there are reviews out for >>>>>>> changing this area a >>>>>>> bit, so it might be useful to wait for that >>>>>>> (JDK-8173335). >>>>>>> >>>>>>> >>>>>>> I cleaned it up a bit anyway and now it returns the count of >>>>>>> objects that are in the system. >>>>>>> >>>>>>> >>>>>>> > > - the change doubles the size of >>>>>>> > > CollectedHeap::allocate_from_tlab_slow() above the >>>>>>> "small and nice" >>>>>>> > > threshold. Maybe it could be refactored a bit. >>>>>>> > Done I think, it looks better to me :). 
>>>>>>> >>>>>>> In ThreadLocalAllocBuffer::handle_sample() I think the >>>>>>> set_back_actual_end()/pick_next_sample() calls could >>>>>>> be hoisted out of >>>>>>> the "if" :) >>>>>>> >>>>>>> >>>>>>> Done! >>>>>>> >>>>>>> >>>>>>> > > - referenceProcessor.cpp:261: the change should add >>>>>>> logging about >>>>>>> > > the number of references encountered, maybe after >>>>>>> the corresponding >>>>>>> > > "JNI weak reference count" log message. >>>>>>> > Just to double check, are you saying that you'd like >>>>>>> to have the heap >>>>>>> > sampler to keep in store how many sampled objects >>>>>>> were encountered in >>>>>>> > the HeapMonitoring::weak_oops_do? >>>>>>> > - Would a return of the method with the number of >>>>>>> handled >>>>>>> > references and logging that work? >>>>>>> >>>>>>> Yes, it's fine if HeapMonitoring::weak_oops_do() only >>>>>>> returned the >>>>>>> number of processed weak oops. >>>>>>> >>>>>>> >>>>>>> Done also (but I admit I have not tested the output yet) :) >>>>>>> >>>>>>> >>>>>>> > - Additionally, would you prefer it in a separate >>>>>>> block with its >>>>>>> > GCTraceTime? >>>>>>> >>>>>>> Yes. Both kinds of information is interesting: while >>>>>>> the time taken is >>>>>>> typically more important, the next question would be >>>>>>> why, and the >>>>>>> number of references typically goes a long way there. >>>>>>> >>>>>>> See above though, it is probably best to wait a bit. >>>>>>> >>>>>>> >>>>>>> Agreed that I "could" wait but, if it's ok, I'll just >>>>>>> refactor/remove this when we get closer to something final. Either, >>>>>>> JDK-8173335 >>>>>>> has gone in and I will notice it now or it will soon and I >>>>>>> can change it then. >>>>>>> >>>>>>> >>>>>>> > > - threadLocalAllocBuffer.cpp:331: one more "TODO" >>>>>>> > Removed it and added it to my personal todos to look >>>>>>> at. >>>>>>> > > > >>>>>>> > > - threadLocalAllocBuffer.hpp: >>>>>>> ThreadLocalAllocBuffer class >>>>>>> > > documentation should be updated about the sampling >>>>>>> additions. I >>>>>>> > > would have no clue what the difference between >>>>>>> "actual_end" and >>>>>>> > > "end" would be from the given information. >>>>>>> > If you are talking about the comments in this file, I >>>>>>> made them more >>>>>>> > clear I hope in the new webrev. If it was somewhere >>>>>>> else, let me know >>>>>>> > where to change. >>>>>>> >>>>>>> Thanks, that's much better. Maybe a note in the comment >>>>>>> of the class >>>>>>> that ThreadLocalBuffer provides some sampling facility >>>>>>> by modifying the >>>>>>> end() of the TLAB to cause "frequent" calls into the >>>>>>> runtime call where >>>>>>> actual sampling takes place. >>>>>>> >>>>>>> >>>>>>> Done, I think it's better now. Added something about the >>>>>>> slow_path_end as well. >>>>>>> >>>>>>> >>>>>>> > > - in heapMonitoring.hpp: there are some random >>>>>>> comments about some >>>>>>> > > code that has been grabbed from >>>>>>> "util/math/fastmath.[h|cc]". I >>>>>>> > > can't tell whether this is code that can be used >>>>>>> but I assume that >>>>>>> > > Noam Shazeer is okay with that (i.e. that's all >>>>>>> Google code). >>>>>>> > Jeremy and I double checked and we can release that >>>>>>> as I thought. I >>>>>>> > removed the comment from that piece of code entirely. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> > > - heapMonitoring.hpp/cpp static constant naming >>>>>>> does not correspond >>>>>>> > > to Hotspot's. Additionally, in Hotspot static >>>>>>> methods are cased >>>>>>> > > like other methods. 
>>>>>>> > I think I fixed the methods to be cased the same way >>>>>>> as all other >>>>>>> > methods. For static constants, I was not sure. I >>>>>>> fixed a few other >>>>>>> > variables but I could not seem to really see a >>>>>>> consistent trend for >>>>>>> > constants. I made them as variables but I'm not sure >>>>>>> now. >>>>>>> >>>>>>> Sorry again, style is a kind of mess. The goal of my >>>>>>> suggestions here >>>>>>> is only to prevent yet another style creeping in. >>>>>>> >>>>>>> > > - in heapMonitoring.cpp there are a few cryptic >>>>>>> comments at the top >>>>>>> > > that seem to refer to internal stuff that should >>>>>>> probably be >>>>>>> > > removed. >>>>>>> > Sorry about that! My personal todos not cleared out. >>>>>>> >>>>>>> I am happy about comments, but I simply did not >>>>>>> understand any of that >>>>>>> and I do not know about other readers as well. >>>>>>> >>>>>>> If you think you will remember removing/updating them >>>>>>> until the review >>>>>>> proper (I misunderstood the review situation a little >>>>>>> it seems). >>>>>>> >>>>>>> > > I did not think through the impact of the TLAB >>>>>>> changes on collector >>>>>>> > > behavior yet (if there are). Also I did not check >>>>>>> for problems with >>>>>>> > > concurrent mark and SATB/G1 (if there are). >>>>>>> > I would love to know your thoughts on this, I think >>>>>>> this is fine. I >>>>>>> >>>>>>> I think so too now. No objects are made live out of >>>>>>> thin air :) >>>>>>> >>>>>>> > see issues with multiple threads right now hitting >>>>>>> the stack storage >>>>>>> > instance. Previous webrevs had a mutex lock here but >>>>>>> we took it out >>>>>>> > for simplificity (and only for now). >>>>>>> >>>>>>> :) When looking at this after some thinking I now >>>>>>> assume for this >>>>>>> review that this code is not MT safe at all. There >>>>>>> seems to be more >>>>>>> synchronization missing than just the one for the >>>>>>> StackTraceStorage. So >>>>>>> no comments about this here. >>>>>>> >>>>>>> >>>>>>> I doubled checked a bit (quickly I admit) but it seems that >>>>>>> synchronization in StackTraceStorage is really all you need (all methods >>>>>>> lead to a StackTraceStorage one >>>>>>> and can be multithreaded outside of that). >>>>>>> There is a question about the initialization where the >>>>>>> method HeapMonitoring::initialize_profiling is not thread safe. >>>>>>> It would work (famous last words) and not crash if there was >>>>>>> a race but we could add a synchronization point there as well (and >>>>>>> therefore on the stop as well). >>>>>>> >>>>>>> But anyway I will really check and do this once we add back >>>>>>> synchronization. >>>>>>> >>>>>>> >>>>>>> Also, this would require some kind of specification of >>>>>>> what is allowed >>>>>>> to be called when and where. >>>>>>> >>>>>>> >>>>>>> Would we specify this with the methods in the jvmti.xml >>>>>>> file? We could start by specifying in each that they are not thread safe >>>>>>> but I saw no mention of that for >>>>>>> other methods. >>>>>>> >>>>>>> >>>>>>> One potentially relevant observation about locking >>>>>>> here: depending on >>>>>>> sampling frequency, StackTraceStore::add_trace() may be >>>>>>> rather >>>>>>> frequently called. I assume that you are going to do >>>>>>> measurements :) >>>>>>> >>>>>>> >>>>>>> Though we don't have the TLAB implementation in our code, >>>>>>> the compiler generated sampler uses 2% of overhead with a 512k sampling >>>>>>> rate. 
I can do real measurements >>>>>>> when the code settles and we can see how costly this is as a >>>>>>> TLAB implementation. >>>>>>> However, my theory is that if the rate is 512k, the >>>>>>> memory/performance overhead should be minimal since it is what we saw >>>>>>> with >>>>>>> our code/workloads (though not called >>>>>>> the same way, we call it essentially at the same rate). >>>>>>> If you have a benchmark you'd like me to test, let me know! >>>>>>> >>>>>>> Right now, with my really small test, this does use a bit of >>>>>>> overhead even for a 512k sample size. I don't know yet why, I'm going to >>>>>>> see what is going on. >>>>>>> >>>>>>> Finally, I think it is not reasonable to suppose the >>>>>>> overhead to be negligible if the sampling rate used is too low. The user >>>>>>> should know that the lower the rate, >>>>>>> the higher the overhead (documentation TODO?). >>>>>>> >>>>>>> >>>>>>> I am not sure what the expected usage of the API is, but >>>>>>> StackTraceStore::add_trace() seems to be able to grow >>>>>>> without bounds. >>>>>>> Only a GC truncates them to the live ones. That in >>>>>>> itself seems to be >>>>>>> problematic (GCs can be *wide* apart), and of course >>>>>>> some of the API >>>>>>> methods add to that because they duplicate that >>>>>>> unbounded array. Do you >>>>>>> have any concerns/measurements about this? >>>>>>> >>>>>>> >>>>>>> So, the theory is that yes add_trace can be able to grow >>>>>>> without bounds but it grows at a sample per 512k of allocated space. The >>>>>>> stacks it gathers are currently >>>>>>> maxed at 64 (I'd like to expand that to an option to the >>>>>>> user though at some point). So I have no concerns because: >>>>>>> >>>>>>> - If really this is taking a lot of space, that means the >>>>>>> job is keeping a lot of objects in memory as well, therefore the entire >>>>>>> heap is getting huge >>>>>>> - If this is the case, you will be triggering a GC at some >>>>>>> point anyway. >>>>>>> >>>>>>> (I'm putting under the rug the issue of "What if we set the >>>>>>> rate to 1 for example" because as you lower the sampling rate, we cannot >>>>>>> guarantee low overhead; the >>>>>>> idea behind this feature is to have a means of having >>>>>>> meaningful allocated samples at a low overhead) >>>>>>> >>>>>>> I have no measurements really right now but since I now have >>>>>>> some statistics I can poll, I will look a bit more at this question. >>>>>>> >>>>>>> I have the same last sentence than above: the user should >>>>>>> expect this to happen if the sampling rate is too small. That probably >>>>>>> can >>>>>>> be reflected in the >>>>>>> StartHeapSampling as a note : careful this might impact your >>>>>>> performance. >>>>>>> >>>>>>> >>>>>>> Also, these stack traces might hold on to huge arrays. >>>>>>> Any >>>>>>> consideration of that? Particularly it might be the >>>>>>> cause for OOMEs in >>>>>>> tight memory situations. >>>>>>> >>>>>>> >>>>>>> There is a stack size maximum that is set to 64 so it should >>>>>>> not hold huge arrays. I don't think this is an issue but I can double >>>>>>> check >>>>>>> with a test or two. >>>>>>> >>>>>>> >>>>>>> - please consider adding a safepoint check in >>>>>>> HeapMonitoring::weak_oops_do to prevent accidental >>>>>>> misuse. >>>>>>> >>>>>>> - in struct StackTraceStorage, the public fields may >>>>>>> also need >>>>>>> underscores. At least some files in the runtime >>>>>>> directory have structs >>>>>>> with underscored public members (and some don't). 
The >>>>>>> runtime team >>>>>>> should probably comment on that. >>>>>>> >>>>>>> >>>>>>> Agreed I did not know. I looked around and a lot of structs >>>>>>> did not have them it seemed so I left it as is. I will happily change >>>>>>> it if >>>>>>> someone prefers (I was not >>>>>>> sure if you really preferred or not, your sentence seemed to >>>>>>> be more a note of "this might need to change but I don't know if the >>>>>>> runtime team enforces that", let >>>>>>> me know if I read that wrongly). >>>>>>> >>>>>>> >>>>>>> - In StackTraceStorage::weak_oops_do(), when examining >>>>>>> the >>>>>>> StackTraceData, maybe it is useful to consider having a >>>>>>> non-NULL >>>>>>> reference outside of the heap's reserved space an >>>>>>> error. There should >>>>>>> be no oop outside of the heap's reserved space ever. >>>>>>> >>>>>>> Unless you allow storing random values in >>>>>>> StackTraceData::obj, which I >>>>>>> would not encourage. >>>>>>> >>>>>>> >>>>>>> I suppose you are talking about this part: >>>>>>> if ((value != NULL && Universe::heap()->is_in_reserved(value)) >>>>>>> && >>>>>>> (is_alive == NULL || >>>>>>> is_alive->do_object_b(value))) { >>>>>>> >>>>>>> What you are saying is that I could have something like: >>>>>>> if (value != my_non_null_reference && >>>>>>> (is_alive == NULL || >>>>>>> is_alive->do_object_b(value))) { >>>>>>> >>>>>>> Is that what you meant? Is there really a reason to do so? >>>>>>> When I look at the code, is_in_reserved seems like a O(1) method call. >>>>>>> I'm >>>>>>> not even sure we can have a >>>>>>> NULL value to be honest. I might have to study that to see >>>>>>> if this was not a paranoid test to begin with. >>>>>>> >>>>>>> The is_alive code has now morphed due to the comment below. >>>>>>> >>>>>>> >>>>>>> >>>>>>> - HeapMonitoring::weak_oops_do() does not seem to use >>>>>>> the >>>>>>> passed AbstractRefProcTaskExecutor. >>>>>>> >>>>>>> >>>>>>> It did use it: >>>>>>> size_t HeapMonitoring::weak_oops_do( >>>>>>> AbstractRefProcTaskExecutor *task_executor, >>>>>>> BoolObjectClosure* is_alive, >>>>>>> OopClosure *f, >>>>>>> VoidClosure *complete_gc) { >>>>>>> assert(SafepointSynchronize::is_at_safepoint(), "must >>>>>>> be at safepoint"); >>>>>>> >>>>>>> if (task_executor != NULL) { >>>>>>> task_executor->set_single_threaded_mode(); >>>>>>> } >>>>>>> return StackTraceStorage::storage()->weak_oops_do(is_alive, >>>>>>> f, complete_gc); >>>>>>> } >>>>>>> >>>>>>> But due to the comment below, I refactored this, so this is >>>>>>> no longer here. Now I have an always true closure that is passed. >>>>>>> >>>>>>> >>>>>>> - I do not understand allowing to call this method with >>>>>>> a NULL >>>>>>> complete_gc closure. This would mean that objects >>>>>>> referenced from the >>>>>>> object that is referenced by the StackTraceData are not >>>>>>> pulled, meaning >>>>>>> they would get stale. >>>>>>> >>>>>>> - same with is_alive parameter value of NULL >>>>>>> >>>>>>> >>>>>>> So these questions made me look a bit closer at this code. >>>>>>> This code I think was written this way to have a very small impact on >>>>>>> the >>>>>>> file but you are right, there >>>>>>> is no reason for this here. I've simplified the code by >>>>>>> making in referenceProcessor.cpp a process_HeapSampling method that >>>>>>> handles >>>>>>> everything there. >>>>>>> >>>>>>> The code allowed NULLs because it depended on where you were >>>>>>> coming from and how the code was being called. 
>>>>>>> >>>>>>> - I added a static always_true variable and pass that now to >>>>>>> be more consistent with the rest of the code. >>>>>>> - I moved the complete_gc into process_phaseHeapSampling now >>>>>>> (new method) and handle the task_executor and the complete_gc there >>>>>>> - Newbie question: in our code we did a >>>>>>> set_single_threaded_mode but I see that process_phaseJNI does it right >>>>>>> before its call, do I need to do it for the >>>>>>> process_phaseHeapSample? >>>>>>> That API is much cleaner (in my mind) and is consistent with >>>>>>> what is done around it (again in my mind). >>>>>>> >>>>>>> >>>>>>> - heapMonitoring.cpp:590: I do not completely >>>>>>> understand the purpose of >>>>>>> this code: in the end this results in a fixed value >>>>>>> directly dependent >>>>>>> on the Thread address anyway? In the end this results >>>>>>> in a fixed value >>>>>>> directly dependent on the Thread address anyway? >>>>>>> IOW, what is special about exactly 20 rounds? >>>>>>> >>>>>>> >>>>>>> So we really want a fast random number generator that has a >>>>>>> specific mean (512k is the default we use). The code uses the thread >>>>>>> address as the start number of the >>>>>>> sequence (why not, it is random enough is rationale). Then >>>>>>> instead of just starting there, we prime the sequence and really only >>>>>>> start >>>>>>> at the 21st number, it is >>>>>>> arbitrary and I have not done a study to see if we could do >>>>>>> more or less of that. >>>>>>> >>>>>>> As I have the statistics of the system up and running, I'll >>>>>>> run some experiments to see if this is needed, is 20 good, or not. >>>>>>> >>>>>>> >>>>>>> - also I would consider stripping a few bits of the >>>>>>> threads' address as >>>>>>> initialization value for your rng. The last three bits >>>>>>> (and probably >>>>>>> more, check whether the Thread object is allocated on >>>>>>> special >>>>>>> boundaries) are always zero for them. >>>>>>> Not sure if the given "random" value is random enough >>>>>>> before/after, >>>>>>> this method, so just skip that comment if you think >>>>>>> this is not >>>>>>> required. >>>>>>> >>>>>>> >>>>>>> I don't know is the honest answer. I think what is important >>>>>>> is that we tend towards a mean and it is random "enough" to not fall in >>>>>>> pitfalls of only sampling a >>>>>>> subset of objects due to their allocation order. I added >>>>>>> that as test to do to see if it changes the mean in any way for the 512k >>>>>>> default value and/or if the first >>>>>>> 1000 elements look better. >>>>>>> >>>>>>> >>>>>>> Some more random nits I did not find a place to put >>>>>>> anywhere: >>>>>>> >>>>>>> - ThreadLocalAllocBuffer::_extra_space does not seem >>>>>>> to be used >>>>>>> anywhere? >>>>>>> >>>>>>> >>>>>>> Good catch :). >>>>>>> >>>>>>> >>>>>>> - Maybe indent the declaration of >>>>>>> ThreadLocalAllocBuffer::_bytes_until_sample to align below the >>>>>>> other members of that group. >>>>>>> >>>>>>> >>>>>>> Done moved it up a bit to have non static members together >>>>>>> and static separate. >>>>>>> >>>>>>> Thanks, >>>>>>> Thomas >>>>>>> >>>>>>> >>>>>>> Thanks for your review! >>>>>>> Jc >>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>> >> >
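As a small aside on the "20 rounds" priming question discussed above, here is a purely illustrative sketch. It is not the code in heapMonitoring.cpp: the xorshift update, the mixing constant, and the exponential interval math are stand-ins I chose for this mail. It only shows the general shape of the idea: seed a per-thread generator from a thread-specific value, discard the first 20 outputs (the priming step whose usefulness still needs to be measured), and then draw sampling intervals around the 512 KB mean.

public final class SamplingIntervalSketch {
  private long state;

  SamplingIntervalSketch(long seed) {
    // The low bits of a Thread address are likely zero (per the remark above),
    // so mix the seed before using it; xorshift only needs a non-zero state.
    state = seed ^ 0x9E3779B97F4A7C15L;
    if (state == 0) {
      state = 1;
    }
    // "Prime" the generator: skip the first 20 outputs, as discussed above.
    for (int i = 0; i < 20; i++) {
      next();
    }
  }

  private long next() {
    // Plain xorshift64; the real generator may use something else entirely.
    long x = state;
    x ^= x << 13;
    x ^= x >>> 7;
    x ^= x << 17;
    state = x;
    return x;
  }

  // Next number of bytes to allocate before taking a sample, with a ~512 KB mean.
  long nextSamplingInterval() {
    double u = (next() >>> 11) * 0x1.0p-53;   // uniform in [0, 1)
    if (u == 0.0) {
      u = Double.MIN_VALUE;
    }
    // An exponentially distributed interval gives every allocated byte the same
    // chance of being the sampled one, and its mean is exactly 512 * 1024 bytes.
    return (long) (-Math.log(u) * 512.0 * 1024.0);
  }

  public static void main(String[] args) {
    SamplingIntervalSketch rng = new SamplingIntervalSketch(Thread.currentThread().getId());
    long sum = 0;
    int n = 1_000_000;
    for (int i = 0; i < n; i++) {
      sum += rng.nextSamplingInterval();
    }
    System.out.println("observed mean interval: " + (sum / n) + " bytes");
  }
}

With the seed mixed and the sequence primed, the observed mean lands close to 512 KB even for very regular seeds; whether 20 rounds of priming is actually needed is exactly the question the new statistics should be able to answer.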