Thanks Jeremy - that helps - not such a big deal in terms of maintaining - so
back to the bigger discussion.
thanks,
Karen
On Jun 26, 2015, at 1:34 AM, Jeremy Manson wrote:
> Karen,
>
> I understand your concerns. For reference, this is the additional code in
> the x86 assembler. There are very small stubs in C1 and the template
> interpreter to call out to this macro (forgive the gratuitous use of the
> string "google" - we sprinkle it around a bit because it makes it a little
> easier to distinguish our code from upstream code).
>
> #define GOOGLE_HEAP_MONITORING(ma, thread, var_size_in_bytes, \
>     con_size_in_bytes, object, t1, t2, sample_invocation) \
>   do { \
>     { \
>       SkipIfEqual skip_if(ma, &GoogleHeapMonitor, 0); \
>       Label skip_sample; \
>       Register thr = thread; \
>       if (!thr->is_valid()) { \
>         NOT_LP64(assert(t1 != noreg, \
>                         "Need temporary register for constants")); \
>         thr = NOT_LP64(t1) LP64_ONLY(r15_thread); \
>         NOT_LP64(ma->get_thread(thr);) \
>       } \
>       /* Trigger heap monitoring event */ \
>       Address bus(thr, \
>                   JavaThread::google_bytes_until_sample_offset()); \
>       \
>       if (var_size_in_bytes->is_valid()) { \
>         ma->NOT_LP64(subl) LP64_ONLY(subq)(bus, var_size_in_bytes); \
>       } else { \
>         int csib = (con_size_in_bytes); \
>         assert(t2 != noreg, \
>                "Need temporary register for constants"); \
>         ma->NOT_LP64(movl) LP64_ONLY(mov64)(t2, csib); \
>         ma->NOT_LP64(subl) LP64_ONLY(subq)(bus, t2); \
>       } \
>       \
>       ma->jcc(Assembler::positive, skip_sample); \
>       \
>       { \
>         sample_invocation \
>       } \
>       ma->bind(skip_sample); \
>     } \
>   } while (0)
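>
> Roughly, in C-like pseudocode, the macro boils down to this (illustrative
> only; google_bytes_until_sample is the per-JavaThread countdown used above):
>
>   // decrement the countdown by the size of this allocation
>   thread->google_bytes_until_sample -= size_in_bytes;
>   if (thread->google_bytes_until_sample <= 0) {
>     // countdown crossed zero: take a sample (out-of-line slow path)
>     sample_invocation;
>   }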
>
> It's not all that hard to port to additional architectures, but we'll have to
> think about it.
>
> Jeremy
>
> On Thu, Jun 25, 2015 at 1:41 PM, Karen Kinnear <[email protected]>
> wrote:
> Jeremy,
>
> Did I follow this correctly - that your approach modifies the compilers and
> interpreters and Tony's modifies the
> common allocation code?
>
> Given that the number of compilers and interpreters and interpreter platforms
> keeps expanding - I'd like to
> add a vote to have heap allocation profiling in common allocation code.
>
> thanks,
> Karen
>
> On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:
>
>> Hi Jeremy,
>>
>> Inline.
>>
>> On June 24, 2015 at 7:26:55 PM, Jeremy Manson ([email protected])
>> wrote:
>>
>>>
>>>
>>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[email protected]>
>>> wrote:
>>> Hi Jeremy,
>>>
>>> Please see inline.
>>>
>>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([email protected])
>>> wrote:
>>>
>>>> I don't want the size of the TLAB, which is ergonomically adjusted, to be
>>>> tied to the sampling rate. There is no reason to do that. I want
>>>> reasonable statistical sampling of the allocations.
>>>
>>>
>>> As I said explicitly in my e-mail, I totally agree with this. Which is why
>>> I never suggested to resize TLABs in order to vary the sampling rate.
>>> (Apologies if my e-mail was not clear.)
>>>
>>>
>>> My fault - I misread it. Doesn't your proposal miss out-of-TLAB allocs
>>> entirely
>>
>>
>> This is correct: We’ll also have to intercept the outside-TLAB allocs. But,
>> IMHO, this is a feature as it’s helpful to know how many (and which)
>> allocations happen outside TLABs. These are generally very infrequent (and
>> slow anyway), so sampling all of those, instead of only sampling some of
>> them, does not have much of an overhead. But, you could also do sampling for
>> the outside-TLAB allocs too, if you want: just accumulate their size on a
>> separate per-thread counter and sample the alloc that bumps that counter
>> over a limit.
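>>
>> As a sketch (all names made up for illustration; this is not actual HotSpot
>> code):
>>
>> // called once per outside-TLAB allocation, from the common slow path
>> void maybe_sample_outside_tlab(JavaThread* thread, oop obj,
>>                                size_t size_in_bytes) {
>>   size_t total = thread->outside_tlab_bytes() + size_in_bytes;
>>   if (total >= OutsideTLABSampleIntervalBytes) {
>>     // this alloc bumped the counter over the limit: sample it
>>     record_allocation_sample(thread, obj, size_in_bytes);
>>     total = 0;  // start accumulating towards the next sample
>>   }
>>   thread->set_outside_tlab_bytes(total);
>> }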
>>
>> An additional observation (orthogonal to the main point, but I thought I’d
>> mention it anyway): For the outside-TLAB allocs it’d be helpful to also know
>> which generation the object ended up in (e.g., young gen or
>> direct-to-old-gen). This is very helpful in some situations when you’re
>> trying to work out which allocation(s) grew the old gen occupancy between
>> two young GCs.
>>
>> FWIW, the existing JFR events follow the approach I described above:
>>
>> * one event for each new TLAB + first alloc in that TLAB (my proposal
>> basically generalizes this and removes the 1-1 relationship between object
>> alloc sampling and new TLAB operation)
>>
>> * one event for all allocs outside a TLAB
>>
>> I think the above separation is helpful. But if you think it could confuse
>> users, you can of course easily just combine the information (but I strongly
>> believe it’s better to report the information separately).
>>
>>
>>
>>> (and, in fact, not work if TLAB support is turned off)?
>>
>>
>> Who turns off TLABs? Is -XX:-UseTLAB even tested by Oracle? (This is a genuine
>> question.)
>>
>>
>>
>>> I might be missing something obvious (and see my response below).
>>
>>
>>>
>>>> All this requires is a separate counter that is set to the next sampling
>>>> interval, and decremented when an allocation happens, which goes into a
>>>> slow path when the decrement hits 0. Doing a subtraction and a pointer
>>>> bump in allocation instead of just a pointer bump is basically free.
>>>
>>>
>>> Maybe on Intel it’s cheap, but maybe it’s not on other platforms that other
>>> folks care about.
>>>
>>> Really? A memory read and a subtraction? Which architectures care about
>>> that?
>>
>>
>> I was not concerned with the read and subtraction, I was more concerned with
>> the conditional that follows them (intel has great branch prediction).
>>
>> And a personal pet peeve (based on past experience): How many “free”
>> instructions do you have to add before they are not free any more?
>>
>>
>>
>>>
>>> Again, notice that no one has complained about the addition that was added
>>> for total bytes allocated per thread. I note that was actually added in
>>> the 6u20 timeframe.
>>>
>>>> Note that it has been doing an additional addition (to keep track of per
>>>> thread allocation) as part of allocation since Java 7,
>>>
>>>
>>> Interesting. I hadn’t realized that. Does that keep track of total size
>>> allocated per thread or number of allocated objects per thread? If it’s the
>>> former, why isn’t it possible to calculate that from the TLABs information?
>>>
>>>
>>> Total size allocated per thread. It isn't possible to calculate that from
>>> the TLAB because of out-of-TLAB allocation
>>
>>
>> The allocating Thread is passed to the slow (outside-TLAB) alloc path so it
>> would be trivial to update the per-thread allocation stats from there too
>> (in fact, it does; see below).
>>
>>
>>
>>> (and hypothetically disabled TLABs).
>>
>>
>> Does anyone care? :-)
>>
>>
>>
>>>
>>> For some reason, they never included it in the ThreadMXBean interface, but
>>> it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean
>>> to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on
>>> it.
>>>
>>
>> Thanks for the tip. I’ll look into this...
>>
>>>> and no one has complained.
>>>>
>>>> I'm not worried about the ease of implementation here, because we've
>>>> already implemented it.
>>>
>>>
>>> Yeah, but someone will have to maintain it moving forward.
>>>
>>>
>>> I've been maintaining it internally to Google for 5 years. It's actually
>>> pretty self-contained. The only work involved is when they refactor
>>> something (so I've had to move it), or when a bug in the existing
>>> implementation is discovered. It is very closely parallel to the TLAB
>>> code, which doesn't change much / at all.
>>>
>>
>> The TLAB code has really not changed much for a while. ;-) (though I haven’t
>> looked at the JDK 9 source very closely…)
>>
>>>> It hasn't even been hard for us to do the forward port, except when the
>>>> relevant Hotspot code is significantly refactored.
>>>>
>>>> We can also turn the sampling off, if we want. We can set the sampling
>>>> rate to 2^32, have the sampling code do nothing, and no one will ever
>>>> notice.
>>>
>>>
>>> You still have extra instructions in the allocation path, so it’s not
>>> turned off (i.e., you have the tax without any benefit).
>>>
>>>
>>> Hey, you have a counter in your allocation path you've never noticed, which
>>> none of your code uses. Pipelining is a wonderful thing. :)
>>
>>
>> See above re: “free” instructions.
>>
>>
>>
>>>>
>>>> In fact, we could just have the sampling code do nothing, and no one would
>>>> ever notice.
>>>>
>>>> Honestly, no one ever notices the overhead of the sampling, anyway. JDK8
>>>> made it more expensive to grab a stack trace (the cost became proportional
>>>> to the number of loaded classes), but we have a patch that mitigates that,
>>>> which we would also be happy to upstream.
>>>>
>>>> As for the other concern: my concern about *just* having the callback
>>>> mechanism is that there is quite a lot you can't do from user code during
>>>> an allocation, because of lack of access to JNI.
>>>
>>>
>>> Maybe I missed something. Are the callbacks in Java? I.e., do you call them
>>> using JNI from the slow path you call directly from the allocation code?
>>>
>>> (For context: this referred to the hypothetical feature where we can
>>> provide a callback that invokes some code from allocation.)
>>>
>>> (It's not actually hypothetical, because we've already implemented it, but
>>> let's call it hypothetical for the moment.)
>>
>>
>> OK.
>>
>>
>>
>>>
>>> We invoke native code. You can't invoke any Java code during allocation,
>>> including calling JNI methods, because that would make allocation
>>> potentially reentrant, which doesn't work for all sorts of reasons.
>>
>>
>> That’s what I was worried about….
>>
>>
>>
>>> The native code doesn't even get passed a JNIEnv * - there is nothing it
>>> can do with it without making the VM crash a lot.
>>
>>
>> So, thanks for the clarification. Being able to attach a callback to this
>> in, say, the JVM would be totally fine. I was worried that you wanted to
>> call Java. :-)
>>
>>
>>
>>>
>>> Or, rather, you might be able to do that, but it would take a lot of
>>> Hotspot rearchitecting. When I tried to do it, I realized it would be an
>>> extremely deep dive.
>>
>>
>> I believe you. :-)
>>
>>
>>
>>>>
>>>> However, you can do pretty much anything from the VM itself. Crucially
>>>> (for us), we don't just log the stack traces, we also keep track of which
>>>> are live and which aren't. We can't do this in a callback, if the
>>>> callback can't create weak refs to the object.
>>>>
>>>> What we do at Google is to have two methods: one that you pass a callback
>>>> to (the callback gets invoked with a StackTraceData object, as I've
>>>> defined above), and another that just tells you which sampled objects are
>>>> still live. We could also add a third, which allowed a callback to set
>>>> the sampling interval (basically, the VM would call it to get the integer
>>>> number of bytes to be allocated before the next sample).
>>>>
>>>> Would people be amenable to that? It makes the code more complex, but, as
>>>> I say, it's nice for detecting memory leaks ("Hey! Where did that 1 GB
>>>> object come from?").
>>>
>>>
>>> Well, that 1GB object would have most likely been allocated outside a TLAB
>>> and you could have identified it by instrumenting the “outside-of-TLAB
>>> allocation path” (just saying…).
>>>
>>>
>>> That's orthogonal to the point I was making in the quote above - the point
>>> I was making there was that we want to be able to detect what sampled
>>> objects are live. We can do that regardless of how we implement the
>>> sampling (although it did involve my making a new kind of weak oop
>>> processing mechanism inside the VM).
>>
>>
>> Yeah, I was thinking of doing something similar (tracking object lifetimes,
>> and other attributes, with WeakRefs).
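>>
>> For concreteness, I picture the native interface having roughly this shape
>> (names entirely made up here, just to anchor the discussion; StackTraceData
>> is whatever struct the VM records at sample time):
>>
>> // invoked by the VM, outside allocation, for each sampled allocation
>> typedef void (*HeapSampleCallback)(const StackTraceData* data);
>> void RegisterHeapSampleCallback(HeapSampleCallback cb);
>>
>> // query which previously sampled objects are still live
>> void GetLiveSampledObjects(StackTraceData** out_samples, int* out_count);
>>
>> // optional third hook: the VM asks for the next sampling interval in bytes
>> typedef jlong (*SampleIntervalCallback)();
>> void RegisterSampleIntervalCallback(SampleIntervalCallback cb);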
>>
>>
>>
>>>
>>> But to the question of whether we can just instrument the outside-of-tlab
>>> allocation path... There are a few weirdnesses here. The first one that
>>> jumps to mind is that there's also a fast path for allocating in the YG
>>> outside of TLABs, if an object is too large to fit in the current TLAB.
>>> Those objects would never get sampled. So "outside of tlab" doesn't always
>>> mean "slow path".
>>
>>
>> CollectedHeap::common_mem_allocate_noinit() is the first level of the slow
>> path, called when a TLAB allocation fails because the object doesn’t fit in
>> the current TLAB. It checks (allocate_from_tlab() /
>> allocate_from_tlab_slow()) whether to refill the current TLAB or keep the
>> TLAB and delegate to the GC (mem_allocate()) to allocate the object outside
>> a TLAB (either in the young or old gen; the GC might also decide to do a
>> collection at this point if, say, the eden is full...). So, it depends on
>> what you mean by slow path but, yes, any allocations that go through the
>> above path should be considered “slow path” allocations.
>>
>> One more piece of data: AllocTracer::send_allocation_outside_tlab_event()
>> (the JFR entry point for outside-TLAB allocs) is fired from
>> common_mem_allocate_noinit(). So, if there are other non-TLAB allocation
>> paths outside that method, that entry point has been placed incorrectly
>> (it’s possible of course; but I think that it’s actually placed correctly).
>>
>> (note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources
>> yet, the above might have been changed)
>>
>> BTW, when looking at the common_mem_allocate_noinit() code I noticed the
>> following:
>>
>> THREAD->incr_allocated_bytes(size * HeapWordSize);
>> (as predicted earlier)
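>>
>> In outline, the JDK 8 code is roughly this (simplified from memory, error
>> and OOM handling elided; check the actual sources before relying on it):
>>
>> HeapWord* CollectedHeap::common_mem_allocate_noinit(KlassHandle klass,
>>                                                     size_t size, TRAPS) {
>>   HeapWord* result = NULL;
>>   if (UseTLAB) {
>>     // in-TLAB path; may refill via allocate_from_tlab_slow()
>>     result = allocate_from_tlab(klass, THREAD, size);
>>     if (result != NULL) return result;
>>   }
>>   // outside-TLAB path: delegate to the GC (may trigger a collection)
>>   bool gc_overhead_limit_was_exceeded = false;
>>   result = Universe::heap()->mem_allocate(size,
>>                                           &gc_overhead_limit_was_exceeded);
>>   if (result != NULL) {
>>     THREAD->incr_allocated_bytes(size * HeapWordSize);
>>     AllocTracer::send_allocation_outside_tlab_event(klass,
>>                                                     size * HeapWordSize);
>>     return result;
>>   }
>>   // ... error paths elided ...
>> }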
>>
>>
>>
>>>
>>> Another one that jumps to mind is that we don't know whether the
>>> outside-of-TLAB path actually passes the sampling threshold, especially if
>>> we let users configure the sampling threshold. So how would we know
>>> whether to sample it?
>>
>>
>> See above (IMHO: sample all of them).
>>
>>
>>
>>>
>>> You also have to keep track of the sampling interval in the code where we
>>> allocate new TLABs, in case the sampling threshold is larger than the TLAB
>>> size. That's not a big deal, of course.
>>
>>
>> Of course, but that’s kinda trivial. BTW, one approach here would be “given
>> that refilling a TLAB is slow anyway, always sample the first object in each
>> TLAB irrespective of the desired sampling frequency”. Another would be “don’t
>> do that; I set the sampling frequency pretty low so as not to be flooded with
>> data when the TLABs are very small”. I have to say I’m in the latter camp.
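>>
>> Concretely, the refill path would only need something like this (a sketch
>> with made-up names; it approximates TLAB usage by TLAB size):
>>
>> // at TLAB refill: does the next sample point land inside the new TLAB?
>> // if not, carry the countdown over, so small TLABs don't inflate the rate
>> bool sample_in_new_tlab(JavaThread* t, size_t new_tlab_size_in_bytes) {
>>   jlong remaining = t->bytes_until_sample();
>>   if (remaining <= (jlong)new_tlab_size_in_bytes) {
>>     return true;   // arm a sample point inside this TLAB
>>   }
>>   t->set_bytes_until_sample(remaining - (jlong)new_tlab_size_in_bytes);
>>   return false;    // no sample in this TLAB; countdown carries over
>> }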
>>
>>
>>
>>>
>>>
>>> And, every time the TLAB code changes, we have to consider whether / how
>>> those changes affect this sampling mechanism.
>>
>>
>> Yes, but how often does the TLAB code change? :-)
>>
>>
>>
>>>
>>> I guess my larger point is that there are so many little corner cases with
>>> TLAB allocation, including whether it even happens, that basing the
>>> sampling strategy around it seems like a cop-out.
>>
>>
>> There are not many little corner cases. There are two cases: allocation
>> inside a TLAB, allocation outside a TLAB. The former is by far the most
>> common. The latter is generally very infrequent and has a well-defined code
>> path (I described it earlier). And, as I said, it could be very helpful and
>> informative to treat (and account for) the two cases separately.
>>
>>
>>
>>> And my belief is that the arguments against our strategy don't really hold
>>> water, especially given the presence of the per-thread allocation counter
>>> that no one noticed.
>>
>>
>> I’ve already addressed that.
>>
>>
>>
>>>
>>> Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck
>>> Rasbold). All we really need is to write an acceptable JEP, to adjust the
>>> code based on the changes the community wants, and someone from Oracle
>>> willing to say "yes".
>>
>>
>>>
>>> For reference, to keep track of sampling, the delta to C2 is about 150 LOC
>>> (much of which is newlines-because-of-formatting for methods that take a
>>> lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86
>>> template interpreter is about 20 LOC, and the delta for the assembler is
>>> about 40 LOC. It's not completely trivial, but the code hasn't changed
>>> substantially in the 5 years since I wrote it (other than a couple of
>>> bugfixes).
>>>
>>> Obviously, assembler/template interpreter would have to be dup'd across
>>> platforms - we can do that for PPC and aarch64, on which we do active
>>> development, at least.
>>
>>
>> I’ll again vote for the simplicity of having a simple change in only one
>> place (OK, two places…).
>>
>>
>>
>>>
>>> But, seriously, why didn’t you like my proposal? It can do anything your
>>> scheme can with fewer and simpler code changes. The only thing that it
>>> cannot do is to sample based on object count (i.e., every 100 objects)
>>> instead of based on object size (i.e., every 1MB of allocations). But I
>>> think doing sampling based on size is the right approach here (IMHO).
>>>
>>> I agree that sampling based on size is the right approach.
>>>
>>> (And your approach is definitely simpler - I don't mean to discount it.
>>> And if that's what it takes to get this feature accepted, we'll do it, but
>>> I'll grumble about it.)
>>
>>
>> That’s fine. :-)
>>
>> Tony
>>
>>
>>
>>
>> -----
>>
>> Tony Printezis | JVM/GC Engineer / VM Team | Twitter: @TonyPrintezis
>> [email protected]
>>
>>
>
>