Thanks Jeremy - that helps - not such a big deal in terms of maintaining - so
back to the bigger discussion.
thanks,
Karen
On Jun 26, 2015, at 1:34 AM, Jeremy Manson wrote:
> Karen,
>
> I understand your concerns. For reference, this is the additional code in
> the x86 assembler. There are very small stubs in C1 and the template
> interpreter to call out to this macro (forgive the gratuitous use of the
> string "google" - we sprinkle it around a bit because it makes it a little
> easier to distinguish our code from upstream code).
>
> #define GOOGLE_HEAP_MONITORING(ma, thread, var_size_in_bytes, \
>     con_size_in_bytes, object, t1, t2, sample_invocation) \
>   do { \
>     { \
>       SkipIfEqual skip_if(ma, &GoogleHeapMonitor, 0); \
>       Label skip_sample; \
>       Register thr = thread; \
>       if (!thr->is_valid()) { \
>         NOT_LP64(assert(t1 != noreg, \
>                         "Need temporary register for constants")); \
>         thr = NOT_LP64(t1) LP64_ONLY(r15_thread); \
>         NOT_LP64(ma->get_thread(thr);) \
>       } \
>       /* Trigger heap monitoring event */ \
>       Address bus(thr, \
>                   JavaThread::google_bytes_until_sample_offset()); \
>       \
>       if (var_size_in_bytes->is_valid()) { \
>         ma->NOT_LP64(subl) LP64_ONLY(subq)(bus, var_size_in_bytes); \
>       } else { \
>         int csib = (con_size_in_bytes); \
>         assert(t2 != noreg, \
>                "Need temporary register for constants"); \
>         ma->NOT_LP64(movl) LP64_ONLY(mov64)(t2, csib); \
>         ma->NOT_LP64(subl) LP64_ONLY(subq)(bus, t2); \
>       } \
>       \
>       ma->jcc(Assembler::positive, skip_sample); \
>       \
>       { \
>         sample_invocation \
>       } \
>       ma->bind(skip_sample); \
>     } \
>   } while (0)
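>
> Roughly, in C-like pseudocode, the macro boils down to this (illustrative
> only; google_bytes_until_sample is the per-JavaThread countdown used above):
>
>   // decrement the countdown by the size of this allocation
>   thread->google_bytes_until_sample -= size_in_bytes;
>   if (thread->google_bytes_until_sample <= 0) {
>     // countdown crossed zero: take a sample (out-of-line slow path)
>     sample_invocation;
>   }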
>
> It's not all that hard to port to additional architectures, but we'll have to
> think about it.
>
> Jeremy
>
> On Thu, Jun 25, 2015 at 1:41 PM, Karen Kinnear <[email protected]>
> wrote:
> Jeremy,
>
> Did I follow this correctly - that your approach modifies the compilers and
> interpreters and Tony's modifies the
> common allocation code?
>
> Given that the number of compilers and interpreters and interpreter platforms
> keeps expanding - I'd like to
> add a vote to have heap allocation profiling in common allocation code.
>
> thanks,
> Karen
>
> On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:
>
>> Hi Jeremy,
>>
>> Inline.
>>
>> On June 24, 2015 at 7:26:55 PM, Jeremy Manson ([email protected])
>> wrote:
>>
>>>
>>>
>>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[email protected]>
>>> wrote:
>>> Hi Jeremy,
>>>
>>> Please see inline.
>>>
>>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([email protected])
>>> wrote:
>>>
>>>> I don't want the size of the TLAB, which is ergonomically adjusted, to be
>>>> tied to the sampling rate. There is no reason to do that. I want
>>>> reasonable statistical sampling of the allocations.
>>>
>>>
>>> As I said explicitly in my e-mail, I totally agree with this. Which is why
>>> I never suggested to resize TLABs in order to vary the sampling rate.
>>> (Apologies if my e-mail was not clear.)
>>>
>>>
>>> My fault - I misread it. Doesn't your proposal miss out-of-TLAB allocs
>>> entirely
>>
>>
>> This is correct: We’ll also have to intercept the outside-TLAB allocs. But,
>> IMHO, this is a feature as it’s helpful to know how many (and which)
>> allocations happen outside TLABs. These are generally very infrequent (and
>> slow anyway), so sampling all of those, instead of only sampling some of
>> them, does not have much of an overhead. But, you could also do sampling for
>> the outside-TLAB allocs too, if you want: just accumulate their size on a
>> separate per-thread counter and sample the alloc that bumps that counter
>> over a limit.
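>>
>> As a sketch (all names made up for illustration; this is not actual HotSpot
>> code):
>>
>> // called once per outside-TLAB allocation, from the common slow path
>> void maybe_sample_outside_tlab(JavaThread* thread, oop obj,
>>                                size_t size_in_bytes) {
>>   size_t total = thread->outside_tlab_bytes() + size_in_bytes;
>>   if (total >= OutsideTLABSampleIntervalBytes) {
>>     // this alloc bumped the counter over the limit: sample it
>>     record_allocation_sample(thread, obj, size_in_bytes);
>>     total = 0;  // start accumulating towards the next sample
>>   }
>>   thread->set_outside_tlab_bytes(total);
>> }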
>>
>> An additional observation (orthogonal to the main point, but I thought I’d
>> mention it anyway): For the outside-TLAB allocs it’d be helpful to also know
>> which generation the object ended up in (e.g., young gen or
>> direct-to-old-gen). This is very helpful in some situations when you’re
>> trying to work out which allocation(s) grew the old gen occupancy between
>> two young GCs.
>>
>> FWIW, the existing JFR events follow the approach I described above:
>>
>> * one event for each new TLAB + first alloc in that TLAB (my proposal
>> basically generalizes this and removes the 1-1 relationship between object
>> alloc sampling and new TLAB operation)
>>
>> * one event for all allocs outside a TLAB
>>
>> I think the above separation is helpful. But if you think it could confuse
>> users, you can of course easily just combine the information (but I strongly
>> believe it’s better to report the information separately).
>>
>>
>>
>>> (and, in fact, not work if TLAB support is turned off)?
>>
>>
>> Who turns off TLABs? Is -XX:-UseTLAB even tested by Oracle? (This is a genuine
>> question.)
>>
>>
>>
>>> I might be missing something obvious (and see my response below).
>>
>>
>>>
>>>> All this requires is a separate counter that is set to the next sampling
>>>> interval, and decremented when an allocation happens, which goes into a
>>>> slow path when the decrement hits 0. Doing a subtraction and a pointer
>>>> bump in allocation instead of just a pointer bump is basically free.
>>>
>>>
>>> Maybe on Intel it’s cheap, but maybe it’s not on other platforms that other
>>> folks care about.
>>>
>>> Really? A memory read and a subtraction? Which architectures care about
>>> that?
>>
>>
>> I was not concerned with the read and subtraction, I was more concerned with
>> the conditional that follows them (intel has great branch prediction).
>>
>> And a personal pet peeve (based on past experience): How many “free”
>> instructions do you have to add before they are not free any more?
>>
>>
>>
>>>
>>> Again, notice that no one has complained about the addition that was added
>>> for total bytes allocated per thread. I note that was actually added in
>>> the 6u20 timeframe.
>>>
>>>> Note that it has been doing an additional addition (to keep track of per
>>>> thread allocation) as part of allocation since Java 7,
>>>
>>>
>>> Interesting. I hadn’t realized that. Does that keep track of total size
>>> allocated per thread or number of allocated objects per thread? If it’s the
>>> former, why isn’t it possible to calculate that from the TLABs information?
>>>
>>>
>>> Total size allocated per thread. It isn't possible to calculate that from
>>> the TLAB because of out-of-TLAB allocation
>>
>>
>> The allocating Thread is passed to the slow (outside-TLAB) alloc path so it
>> would be trivial to update the per-thread allocation stats from there too
>> (in fact, it does; see below).
>>
>>
>>
>>> (and hypothetically disabled TLABs).
>>
>>
>> Does anyone care? :-)
>>
>>
>>
>>>
>>> For some reason, they never included it in the ThreadMXBean interface, but
>>> it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean
>>> to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on
>>> it.
>>>
>>
>> Thanks for the tip. I’ll look into this...
>>
>>>> and no one has complained.
>>>>
>>>> I'm not worried about the ease of implementation here, because we've
>>>> already implemented it.
>>>
>>>
>>> Yeah, but someone will have to maintain it moving forward.
>>>
>>>
>>> I've been maintaining it internally to Google for 5 years. It's actually
>>> pretty self-contained. The only work involved is when they refactor
>>> something (so I've had to move it), or when a bug in the existing
>>> implementation is discovered. It is very closely parallel to the TLAB
>>> code, which doesn't change much / at all.
>>>
>>
>> The TLAB code has really not changed much for a while. ;-) (though I haven’t
>> looked at the JDK 9 source very closely…)
>>
>>>> It hasn't even been hard for us to do the forward port, except when the
>>>> relevant Hotspot code is significantly refactored.
>>>>
>>>> We can also turn the sampling off, if we want. We can set the sampling
>>>> rate to 2^32, have the sampling code do nothing, and no one will ever
>>>> notice.
>>>
>>>
>>> You still have extra instructions in the allocation path, so it’s not
>>> turned off (i.e., you have the tax without any benefit).
>>>
>>>
>>> Hey, you have a counter in your allocation path you've never noticed, which
>>> none of your code uses. Pipelining is a wonderful thing. :)
>>
>>
>> See above re: “free” instructions.
>>
>>
>>
>>>>
>>>> In fact, we could just have the sampling code do nothing, and no one would
>>>> ever notice.
>>>>
>>>> Honestly, no one ever notices the overhead of the sampling, anyway. JDK8
>>>> made it more expensive to grab a stack trace (the cost became proportional
>>>> to the number of loaded classes), but we have a patch that mitigates that,
>>>> which we would also be happy to upstream.
>>>>
>>>> As for the other concern: my concern about *just* having the callback
>>>> mechanism is that there is quite a lot you can't do from user code during
>>>> an allocation, because of lack of access to JNI.
>>>
>>>
>>> Maybe I missed something. Are the callbacks in Java? I.e., do you call them
>>> using JNI from the slow path you call directly from the allocation code?
>>>
>>> (For context: this referred to the hypothetical feature where we can
>>> provide a callback that invokes some code from allocation.)
>>>
>>> (It's not actually hypothetical, because we've already implemented it, but
>>> let's call it hypothetical for the moment.)
>>
>>
>> OK.
>>
>>
>>
>>>
>>> We invoke native code. You can't invoke any Java code during allocation,
>>> including calling JNI methods, because that would make allocation
>>> potentially reentrant, which doesn't work for all sorts of reasons.
>>
>>
>> That’s what I was worried about….
>>
>>
>>
>>> The native code doesn't even get passed a JNIEnv * - there is nothing it
>>> can do with it without making the VM crash a lot.
>>
>>
>> So, thanks for the clarification. Being able to attach a callback to this
>> in, say, the JVM would be totally fine. I was worried that you wanted to
>> call Java. :-)
>>
>>
>>
>>>
>>> Or, rather, you might be able to do that, but it would take a lot of
>>> Hotspot rearchitecting. When I tried to do it, I realized it would be an
>>> extremely deep dive.
>>
>>
>> I believe you. :-)
>>
>>
>>
>>>>
>>>> However, you can do pretty much anything from the VM itself. Crucially
>>>> (for us), we don't just log the stack traces, we also keep track of which
>>>> are live and which aren't. We can't do this in a callback, if the
>>>> callback can't create weak refs to the object.
>>>>
>>>> What we do at Google is to have two methods: one that you pass a callback
>>>> to (the callback gets invoked with a StackTraceData object, as I've
>>>> defined above), and another that just tells you which sampled objects are
>>>> still live. We could also add a third, which allowed a callback to set
>>>> the sampling interval (basically, the VM would call it to get the integer
>>>> number of bytes to be allocated before the next sample).
>>>>
>>>> Would people be amenable to that? It makes the code more complex, but, as
>>>> I say, it's nice for detecting memory leaks ("Hey! Where did that 1 GB
>>>> object come from?").
>>>
>>>
>>> Well, that 1GB object would have most likely been allocated outside a TLAB
>>> and you could have identified it by instrumenting the “outside-of-TLAB
>>> allocation path” (just saying…).
>>>
>>>
>>> That's orthogonal to the point I was making in the quote above - the point
>>> I was making there was that we want to be able to detect what sampled
>>> objects are live. We can do that regardless of how we implement the
>>> sampling (although it did involve my making a new kind of weak oop
>>> processing mechanism inside the VM).
>>
>>
>> Yeah, I was thinking of doing something similar (tracking object lifetimes,
>> and other attributes, with WeakRefs).
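>>
>> For concreteness, I picture the native interface having roughly this shape
>> (names entirely made up here, just to anchor the discussion; StackTraceData
>> is whatever struct the VM records at sample time):
>>
>> // invoked by the VM, outside allocation, for each sampled allocation
>> typedef void (*HeapSampleCallback)(const StackTraceData* data);
>> void RegisterHeapSampleCallback(HeapSampleCallback cb);
>>
>> // query which previously sampled objects are still live
>> void GetLiveSampledObjects(StackTraceData** out_samples, int* out_count);
>>
>> // optional third hook: the VM asks for the next sampling interval in bytes
>> typedef jlong (*SampleIntervalCallback)();
>> void RegisterSampleIntervalCallback(SampleIntervalCallback cb);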
>>
>>
>>
>>>
>>> But to the question of whether we can just instrument the outside-of-tlab
>>> allocation path... There are a few weirdnesses here. The first one that
>>> jumps to mind is that there's also a fast path for allocating in the YG
>>> outside of TLABs, if an object is too large to fit in the current TLAB.
>>> Those objects would never get sampled. So "outside of tlab" doesn't always
>>> mean "slow path".
>>
>>
>> CollectedHeap::common_mem_allocate_noinit() is the first level of the slow
>> path, called when a TLAB allocation fails because the object doesn’t fit in
>> the current TLAB. It checks (allocate_from_tlab() /
>> allocate_from_tlab_slow()) whether to refill the current TLAB or keep the
>> TLAB and delegate to the GC (mem_allocate()) to allocate the object outside
>> a TLAB (either in the young or old gen; the GC might also decide to do a
>> collection at this point if, say, the eden is full...). So, it depends on
>> what you mean by slow path but, yes, any allocations that go through the
>> above path should be considered “slow path” allocations.
>>
>> One more piece of data: AllocTracer::send_allocation_outside_tlab_event()
>> (the JFR entry point for outside-TLAB allocs) is fired from
>> common_mem_allocate_noinit(). So, if there are other non-TLAB allocation
>> paths outside that method, that entry point has been placed incorrectly
>> (it’s possible of course; but I think that it’s actually placed correctly).
>>
>> (note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources
>> yet, the above might have been changed)
>>
>> BTW, when looking at the common_mem_allocate_noinit() code I noticed the
>> following:
>>
>> THREAD->incr_allocated_bytes(size * HeapWordSize);
>> (as predicted earlier)
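>>
>> In outline, the JDK 8 code is roughly this (simplified from memory, error
>> and OOM handling elided; check the actual sources before relying on it):
>>
>> HeapWord* CollectedHeap::common_mem_allocate_noinit(KlassHandle klass,
>>                                                     size_t size, TRAPS) {
>>   HeapWord* result = NULL;
>>   if (UseTLAB) {
>>     // in-TLAB path; may refill via allocate_from_tlab_slow()
>>     result = allocate_from_tlab(klass, THREAD, size);
>>     if (result != NULL) return result;
>>   }
>>   // outside-TLAB path: delegate to the GC (may trigger a collection)
>>   bool gc_overhead_limit_was_exceeded = false;
>>   result = Universe::heap()->mem_allocate(size,
>>                                           &gc_overhead_limit_was_exceeded);
>>   if (result != NULL) {
>>     THREAD->incr_allocated_bytes(size * HeapWordSize);
>>     AllocTracer::send_allocation_outside_tlab_event(klass,
>>                                                     size * HeapWordSize);
>>     return result;
>>   }
>>   // ... error paths elided ...
>> }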
>>
>>
>>
>>>
>>> Another one that jumps to mind is that we don't know whether the
>>> outside-of-TLAB path actually passes the sampling threshold, especially if
>>> we let users configure the sampling threshold. So how would we know
>>> whether to sample it?
>>
>>
>> See above (IMHO: sample all of them).
>>
>>
>>
>>>
>>> You also have to keep track of the sampling interval in the code where we
>>> allocate new TLABs, in case the sampling threshold is larger than the TLAB
>>> size. That's not a big deal, of course.
>>
>>
>> Of course, but that’s kinda trivial. BTW, one approach here would be “given
>> that refilling a TLAB is slow anyway, always sample the first object in each
>> TLAB irrespective of the desired sampling frequency”. Another would be “don’t
>> do that; I set the sampling frequency pretty low so as not to be flooded with
>> data when the TLABs are very small”. I have to say I’m in the latter camp.
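>>
>> Concretely, the refill path would only need something like this (a sketch
>> with made-up names; it approximates TLAB usage by TLAB size):
>>
>> // at TLAB refill: does the next sample point land inside the new TLAB?
>> // if not, carry the countdown over, so small TLABs don't inflate the rate
>> bool sample_in_new_tlab(JavaThread* t, size_t new_tlab_size_in_bytes) {
>>   jlong remaining = t->bytes_until_sample();
>>   if (remaining <= (jlong)new_tlab_size_in_bytes) {
>>     return true;   // arm a sample point inside this TLAB
>>   }
>>   t->set_bytes_until_sample(remaining - (jlong)new_tlab_size_in_bytes);
>>   return false;    // no sample in this TLAB; countdown carries over
>> }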
>>
>>
>>
>>>
>>>
>>> And, every time the TLAB code changes, we have to consider whether / how
>>> those changes affect this sampling mechanism.
>>
>>
>> Yes, but how often does the TLAB code change? :-)
>>
>>
>>
>>>
>>> I guess my larger point is that there are so many little corner cases with
>>> TLAB allocation, including whether it even happens, that basing the
>>> sampling strategy around it seems like a cop-out.
>>
>>
>> There are not many little corner cases. There are two cases: allocation
>> inside a TLAB, allocation outside a TLAB. The former is by far the most
>> common. The latter is generally very infrequent and has a well-defined code
>> path (I described it earlier). And, as I said, it could be very helpful and
>> informative to treat (and account for) the two cases separately.
>>
>>
>>
>>> And my belief is that the arguments against our strategy don't really hold
>>> water, especially given the presence of the per-thread allocation counter
>>> that no one noticed.
>>
>>
>> I’ve already addressed that.
>>
>>
>>
>>>
>>> Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck
>>> Rasbold). All we really need is to write an acceptable JEP, to adjust the
>>> code based on the changes the community wants, and someone from Oracle
>>> willing to say "yes".
>>
>>
>>>
>>> For reference, to keep track of sampling, the delta to C2 is about 150 LOC
>>> (much of which is newlines-because-of-formatting for methods that take a
>>> lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86
>>> template interpreter is about 20 LOC, and the delta for the assembler is
>>> about 40 LOC. It's not completely trivial, but the code hasn't changed
>>> substantially in the 5 years since I wrote it (other than a couple of
>>> bugfixes).
>>>
>>> Obviously, assembler/template interpreter would have to be dup'd across
>>> platforms - we can do that for PPC and aarch64, on which we do active
>>> development, at least.
>>
>>
>> I’ll again vote for the simplicity of having a simple change in only one
>> place (OK, two places…).
>>
>>
>>
>>>
>>> But, seriously, why didn’t you like my proposal? It can do anything your
>>> scheme can with fewer and simpler code changes. The only thing that it
>>> cannot do is to sample based on object count (i.e., every 100 objects)
>>> instead of based on object size (i.e., every 1MB of allocations). But I
>>> think doing sampling based on size is the right approach here (IMHO).
>>>
>>> I agree that sampling based on size is the right approach.
>>>
>>> (And your approach is definitely simpler - I don't mean to discount it.
>>> And if that's what it takes to get this feature accepted, we'll do it, but
>>> I'll grumble about it.)
>>
>>
>> That’s fine. :-)
>>
>> Tony
>>
>>
>>
>>
>> -----
>>
>> Tony Printezis | JVM/GC Engineer / VM Team | Twitter: @TonyPrintezis
>> [email protected]
>>
>>
>
>