Re: Low-Overhead Heap Profiling

2018-02-02 Thread Erik Österlund

Hi JC,

Hope I am reviewing the right version of your work. Here goes...

src/hotspot/share/gc/shared/collectedHeap.inline.hpp:

 159 AllocTracer::send_allocation_outside_tlab(klass, result, size * HeapWordSize, THREAD);

 160
 161 THREAD->tlab().handle_sample(THREAD, result, size);
 162 return result;
 163   }

Should not call tlab().X without checking if (UseTLAB) IMO.
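
A minimal sketch of the guarded form, reusing the slow-path snippet quoted
above (the guard itself is my suggestion, not the patch):

  // Only touch the thread's TLAB when TLABs are in use at all.
  if (UseTLAB) {
    THREAD->tlab().handle_sample(THREAD, result, size);
  }
  return result;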

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:

So first of all, there seem to be quite a few ends. There is an "end", a
"hard end", a "slow path end", and an "actual end". Moreover, it seems
like the "hard end" is actually further away than the "actual end", so
the "hard end" reads more like a "really definitely actual end" or
something. I don't know about you, but I think it looks kind of messy.
In particular, I don't feel like the name "actual end" reflects what it
represents, especially when there is another end that is behind the
"actual end".


 413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
 414   // Did a fast TLAB refill occur?
 415   if (_slow_path_end != _end) {
 416     // Fix up the actual end to be now the end of this TLAB.
 417     _slow_path_end = _end;
 418     _actual_end = _end;
 419   }
 420
 421   return _actual_end + alignment_reserve();
 422 }

I really do not like making getters unexpectedly have this kind of side
effect. It is not expected that when you ask for the "hard end", you
implicitly update the "slow path end" and "actual end" to new values.
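
One way to avoid the side effect, as a sketch (the split and the fix-up
method name are my suggestion, not the patch):

  // Explicit fix-up, called from the slow path after a fast TLAB refill:
  void ThreadLocalAllocBuffer::fix_up_after_fast_refill() {
    if (_slow_path_end != _end) {
      _slow_path_end = _end;
      _actual_end = _end;
    }
  }

  // Side-effect-free getter:
  HeapWord* ThreadLocalAllocBuffer::hard_end() const {
    return _actual_end + alignment_reserve();
  }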


src/hotspot/share/prims/jvmti.xml:

10357   
10358 
10359   Can sample the heap.
10360   If this capability is enabled then the heap sampling methods can be called.
10361 
10362   

Looks like this capability should not be "since 9" if it gets integrated 
now.


src/hotspot/share/runtime/heapMonitoring.cpp:

 448   if (is_alive->do_object_b(value)) {
 449     // Update the oop to point to the new object if it is still alive.
 450     f->do_oop(&(trace.obj));
 451
 452     // Copy the old trace, if it is still live.
 453     _allocated_traces->at_put(curr_pos++, trace);
 454
 455     // Store the live trace in a cache, to be served up on /heapz.
 456     _traces_on_last_full_gc->append(trace);
 457
 458     count++;
 459   } else {
 460     // If the old trace is no longer live, add it to the list of
 461     // recently collected garbage.
 462     store_garbage_trace(trace);
 463   }

In the case where the oop was not live, I would like it to be explicitly 
cleared.
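
A minimal sketch of the else branch with the explicit clear (the direct
NULL store is illustrative; clearing through the Access API, as discussed
below, may look nicer):

  } else {
    // Explicitly clear the dead oop so no dangling reference survives,
    // then record the trace as recently collected garbage.
    trace.obj = NULL;
    store_garbage_trace(trace);
  }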


Also I see a lot of concurrent-looking use of the following field:
 267   volatile bool _initialized;

Please note that the "volatile" qualifier does not help with reordering
here. Reordering between volatile and non-volatile fields is completely
free for both compiler and hardware, except on Windows with MSVC, where
volatile is defined to have acquire/release semantics, and the hardware
is TSO. But for the general case, I would expect this field to be stored
with OrderAccess::release_store and loaded with OrderAccess::load_acquire.
Otherwise it is not thread safe.
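
Something like this, as a sketch of the suggested pairing:

  // Writer side: publish _initialized only after all earlier
  // initializing stores are complete.
  OrderAccess::release_store(&_initialized, true);

  // Reader side: the acquire pairs with the release above, so everything
  // stored before the release is visible once this load returns true.
  bool is_initialized = OrderAccess::load_acquire(&_initialized);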


As a kind of meta comment, I wonder if it would make sense to also add
sampling for non-TLAB allocations. If someone is rapidly allocating a
whole bunch of 1 MB objects that never fit in a TLAB, I might still be
interested in seeing that in my traces, rather than being surprised that
the allocation rate is very high yet nothing shows up in any profiles.
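
A hedged sketch of what that could look like on the outside-TLAB slow path
quoted at the top of this mail (object_alloc_do_sample is a hypothetical
entry point, not something in the current webrev):

  AllocTracer::send_allocation_outside_tlab(klass, result, size * HeapWordSize, THREAD);
  if (HeapMonitoring::enabled()) {
    // Large allocations bypass the TLAB entirely, so sample them here
    // rather than via the TLAB end-pointer trick.
    HeapMonitoring::object_alloc_do_sample(THREAD, result, size * HeapWordSize);
  }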


Thanks,
/Erik

On 2018-01-26 06:45, JC Beyler wrote:

Thanks Robbin for the reviews :)

The new full webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/
The incremental webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/

I inlined my answers:

On Thu, Jan 25, 2018 at 1:15 AM, Robbin Ehn  wrote:

Hi JC, great to see another revision!


heapMonitoring.cpp

StackTraceData should not contain the oop for 'safety' reasons.
When StackTraceData is moved from _allocated_traces:
L452 store_garbage_trace(trace);
it contains a dead oop.
_allocated_traces could instead be a tuple of oop and StackTraceData, thus
dead oops are not kept.

Done. I used inheritance to make the copier work regardless, but the
idea is the same.


You should use the new Access API for loading the oop, something like this:
RootAccess::load(...)
I don't think you need to use Access API for clearing the oop, but it would
look nicer. And you probably shouldn't be using:
Universe::heap()->is_in_reserved(value)

I am unfamiliar with this but I think I did do it like you wanted me
to (all tests pass so that's a start). I'm not sure how to clear the
oop exactly, is there somewhere that does that, which I can use to do
the same?

I removed the is_in_reserved; this came from our internal version. I
don't know why it was there, but my tests work without it so I removed it
:)



The lock:
L424   MutexLocker mu(HeapMonitorStorage_lock);
Is not needed as far as I can see.

Re: Low-Overhead Heap Profiling

2018-01-26 Thread Robbin Ehn

Hi JC, (dropping compiler on mail thread)

On 01/26/2018 06:45 AM, JC Beyler wrote:

Thanks Robbin for the reviews :)

The new full webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/
The incremental webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/


Thanks!

I got this compile issue:
/home/rehn/source/jdk/small/open/src/hotspot/share/gc/shared/threadLocalAllocBuffer.hpp: 
In static member function 'static void 
G1ResManTLABCache::put(ThreadLocalAllocBuffer&, AllocationContext_t)':
/home/rehn/source/jdk/small/open/src/hotspot/share/gc/shared/threadLocalAllocBuffer.hpp:97:13: 
error: 'HeapWord* ThreadLocalAllocBuffer::hard_end()' is private

   HeapWord* hard_end();
 ^
/home/rehn/source/jdk/small/closed/src/hotspot/share/gc/g1/g1ResManTLABCache.cpp:52:50: 
error: within this context

   size_t remaining = pointer_delta(tlab.hard_end(), tlab.top());



I inlined my answers:

On Thu, Jan 25, 2018 at 1:15 AM, Robbin Ehn  wrote:

Hi JC, great to see another revision!


heapMonitoring.cpp

StackTraceData should not contain the oop for 'safety' reasons.
When StackTraceData is moved from _allocated_traces:
L452 store_garbage_trace(trace);
it contains a dead oop.
_allocated_traces could instead be a tuple of oop and StackTraceData, thus
dead oops are not kept.


Done. I used inheritance to make the copier work regardless, but the
idea is the same.


Looks good.





You should use the new Access API for loading the oop, something like this:
RootAccess::load(...)
I don't think you need to use Access API for clearing the oop, but it would
look nicer. And you probably shouldn't be using:
Universe::heap()->is_in_reserved(value)


I am unfamiliar with this but I think I did do it like you wanted me
to (all tests pass so that's a start). I'm not sure how to clear the
oop exactly, is there somewhere that does that, which I can use to do
the same?

I removed the is_in_reserved; this came from our internal version. I
don't know why it was there, but my tests work without it so I removed it
:)


I talked a bit with GC folks about this today.
So actually the oop should be in the oopStorage and you should have a
weak handle to that oop (at least in the near future).

But this should work for now...
In the future when we have the oop in oopStorage you will not get called, so you
will not know when the oops are dead; either GC must trigger a concurrent VM
operation (i.e. _not_ a safepoint operation) so we can clear dead oops, or do
periodic scanning.


It would be good to get some input here from Thomas, who knows what you are
doing and knows GC.







The lock:
L424   MutexLocker mu(HeapMonitorStorage_lock);
Is not needed as far as I can see.
weak_oops_do is called in a safepoint, no TLAB allocation can happen and
JVMTI thread can't access these data-structures. Is there something more to
this lock that I'm missing?


Since a thread can call the JVMTI GetLiveTraces (or any of the other
ones), it can get to the point of trying to copy the
_allocated_traces. I imagine it is possible that this happens
during a GC, or that it starts and a GC happens afterwards.
Therefore, it seems to me that you want this protected, no?


A thread calling GetLiveTraces will be stopped in the HeapThreadTransition.
A safepoint doesn't allow any threads to be in_vm or in_java during the
safepoint. Only threads in native can execute, but they will be stopped on
any transition. If a thread is in_vm, the safepoint waits for a transition
(locking a mutex is also a transition to blocked). So if weak_oops is called
you have a guarantee that no threads are executing inside the VM or
executing Java code (not counting GC threads, of course).
This also means that the lock can never be contended when weak_oops is
called, so it's not harmful.








You have 6 files without any changes in them (any more):
g1CollectedHeap.cpp
psMarkSweep.cpp
psParallelCompact.cpp
genCollectedHeap.cpp
referenceProcessor.cpp
thread.hpp



Done.



I have not looked closely, but is it possible to hide heap sampling in
AllocTracer ? (with some minor changes to the AllocTracer API)



I imagine you are saying to move the code that does the
sampling (change the tlab end, do the call to HeapMonitoring,
etc.) into the AllocTracer code itself? I think that is right, and I'll
look at whether that is possible and prepare a webrev to show what would
be needed to make that happen.


Sounds good.





Minor nit: when declaring pointers there is a bit of a mix between putting
the '*' next to the type name and next to the variable name. (Most HotSpot
code puts it next to the type name.)
E.g.
heapMonitoring.cpp:711 jvmtiStackTrace *trace = 
heapMonitoring.cpp:733 Method* m = vfst.method();
(not just this file)



Done!



HeapMonitorThreadOnOffTest.java:77
I would make g_tmp volatile, otherwise the assignment in the loop may
theoretically be skipped.



Also done!


Looks good, thanks!

/Robbin



Thanks again!
Jc



Re: Low-Overhead Heap Profiling

2018-01-25 Thread Robbin Ehn

Hi JC, great to see another revision!


heapMonitoring.cpp

StackTraceData should not contain the oop for 'safety' reasons.
When StackTraceData is moved from _allocated_traces:
L452 store_garbage_trace(trace);
it contains a dead oop.
_allocated_traces could instead be a tuple of oop and StackTraceData, thus dead
oops are not kept.


You should use the new Access API for loading the oop, something like this:
RootAccess::load(...)
I don't think you need to use Access API for clearing the oop, but it would look 
nicer. And you probably shouldn't be using: Universe::heap()->is_in_reserved(value)
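
Roughly like this for the load, as a sketch (the ON_PHANTOM_OOP_REF
decorator is my guess for a weakly reachable root):

  // Load the oop through the Access API instead of a raw read.
  oop value = RootAccess<ON_PHANTOM_OOP_REF>::oop_load(&trace.obj);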


The lock:
L424   MutexLocker mu(HeapMonitorStorage_lock);
Is not needed as far as I can see.
weak_oops_do is called in a safepoint, no TLAB allocation can happen and JVMTI 
thread can't access these data-structures. Is there something more to this lock 
that I'm missing?



You have 6 files without any changes in them (any more):
g1CollectedHeap.cpp
psMarkSweep.cpp
psParallelCompact.cpp
genCollectedHeap.cpp
referenceProcessor.cpp
thread.hpp


I have not looked closely, but is it possible to hide heap sampling in 
AllocTracer ? (with some minor changes to the AllocTracer API)



Minor nit: when declaring pointers there is a bit of a mix between putting the
'*' next to the type name and next to the variable name. (Most HotSpot code
puts it next to the type name.)

E.g.
heapMonitoring.cpp:711 jvmtiStackTrace *trace = 
heapMonitoring.cpp:733 Method* m = vfst.method();
(not just this file)


HeapMonitorThreadOnOffTest.java:77
I would make g_tmp volatile, otherwise the assignment in the loop may
theoretically be skipped.


Thanks!

/Robbin

On 01/24/2018 01:40 AM, JC Beyler wrote:

And it has been exactly two months since I posted an update here :)

Thanksgiving, Christmas, and handling
https://bugs.openjdk.java.net/browse/JDK-8190862 will do that to you
:)

I have gotten back to this item now that JDK-8190862 is done and I
have the following webrev ready:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02/

With the incremental here:
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.01_02/

The updates are:

a) GetCachedTraces

- I added a GC collection when GetLiveTraces is called, because you
really want to know what is currently live when you call that method
- If you are worried about performance and only care about what was
live at the last GC, you can now call GetCachedTraces, which does not
provoke a GC

Let me know if there are any questions here or concerns. I'm happy to
explain and defend the choices :).
Note: I added a new test for this and updated other tests to double
check the live call. (see for example
http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.01_02/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorCachedTest.java.html)

b) Disabling samples wipes out the data

I had originally implemented for OpenJDK a version that keeps the data
in memory after disabling the sampler, allowing a user to get traces
post-sampling. Because of this, we would always run the weak_oops_do
method, whether enabled or disabled. This led to a slight regression
in GC reference processing time. I had initially fixed this with a
small "was this ever enabled" flag. That would have spared a program
that never uses this from any regression, but a program that enables
then disables the sampler would still pay the price after disabling
it for the rest of the execution.

Because of this, I have moved things back to where they probably
should be: if you disable the sampler, you lose the data. But this
allows a simpler code path: if the sampler is disabled, skip over the
GC weak_oops_do method.

Let me know what you think and happy 2018 to all!
Jc

On Thu, Nov 23, 2017 at 7:20 AM, Thomas Schatzl
 wrote:

On Tue, 2017-11-21 at 13:50 -0800, JC Beyler wrote:

Hi all,

I have a new webrev here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.15/

With the incremental one here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.14a_15/

I think I did most items from Thomas and Robbin. I've especially
added a new test:


Thanks. Looks good.

Thanks,
   Thomas



Re: Low-Overhead Heap Profiling

2017-11-23 Thread Thomas Schatzl
On Tue, 2017-11-21 at 13:50 -0800, JC Beyler wrote:
> Hi all,
> 
> I have a new webrev here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.15/
> 
> With the incremental one here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.14a_15/
> 
> I think I did most items from Thomas and Robbin. I've especially
> added a new test:

Thanks. Looks good.

Thanks,
  Thomas



Re: Low-Overhead Heap Profiling

2017-11-08 Thread Thomas Schatzl
Hi JC,

  sorry for the long wait.

On Wed, 2017-11-01 at 10:46 -0700, JC Beyler wrote:
> Dear all,
> 
> Here is the next webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.14a/
> 
> Incremental since the rebase:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.14_14a/
> 
> (I'm still not too familiar with hg so I had to do a fresh rebase;
> v14 is the state right after the rebase and v14a integrates the changes
> requested from Thomas).

Yeah, there seems to be something missing in the incremental webrev, but
thanks for the effort.

> I have also inlined answers to Thomas' email:
> > A few minor issues:
>  
> >   - in the declaration of CollectedHeap::sample_allocation, it
> > would be nice if the fix_sample_rate parameter were described -
> > it takes time to figure out what it's used for. I.e. in case an
> > allocation goes beyond the sampling watermark, this value which
> > represents the amount of overallocation is used to adjust the next
> > sampling watermark to sample at the correct rate.
> > Something like this - and if what I wrote is incorrect, there is
> > even more reason to document it.
> > Or maybe just renaming "fix_sample_rate" to something more
> > descriptive - but I have no good idea about that.
> > With lack of units in the type, it would also be nice to have the
> > unit in the identifier name, as done elsewhere.
> 

Thanks. Could you s/passed/past in that documentation?

> Done for Robbin's issue and changed it to 
> >   - some (or most actually) of the new setters and getters in the
> > ThreadLocalAllocBuffer class could be private I think. Also, we
> > typically do not use "simple" getters that just return a member in
> > the class where they are defined.
> 
> I removed all that were not used that I could see (not used outside
> the class) moved the ones that are not simple to private if they
> could be. I think it's a bit weird because now some of the setting of
> the tlab internal data is using methods, others are directly setting.
> Let me know what you think.

That's fine with me. You need to start somewhere I guess.

> >   - ThreadLocalAllocBuffer::pick_next_sample() - I recommend making
> > the first check an assert - it seems that it is only useful to call
> > this with heap monitoring enabled, as is done right now.
> 
> Longer conversation below about this; it cannot be an assert (I could
> remove the test altogether though).

I looked at the description, and you are right. I missed that. Keep it
as is. :)

> > - HeapMonitoring::next_random() - the different names for the
> > constants use different formatting. Preferable (to me) is
> > UpperCamelCase, but at least make them uniform.
> > 
> 
> I think done the way you wanted!

In heapMonitoring.hpp:50-53 the constants still have different format?
 
> >  
> >   - not really convinced that it is a good idea to not somehow
> > guard
> > StartHeapSampling() and StopHeapSampling() against being called by
> > multiple threads.
> > 
> 
> I added another mutex for the start/stop so that way it will protect
> from that.
> 

Thanks!

>  
> > Otherwise looks okay from what I can see.

Still okay. I do not need a re-review for the changes suggested in this
email.

> 
> Awesome, what do you think I still need for this before going to the
> next step (which is what by the way? :)).

I think:

- look through the JEP if it is still current and fix the descriptions
if required
  - add a link to the latest webrev to the JEP as comment
- if not done already, file CSRs [0] for
  - the new flag
  - JVMTI changes (I think, not sure actually)
- find a sponsor from Oracle to do some internal work (pushing, and
before that there is iirc still some background work related to JEPs
that can only be done by Oracle, mostly shepherding :/).

I talked to Robbin about this, and he offered to help you with that.

Thanks,
  Thomas

[0] https://wiki.openjdk.java.net/display/csr/Main



Re: Low-Overhead Heap Profiling

2017-11-01 Thread JC Beyler
Dear all,

Here is the next webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.14a/

Incremental since the rebase:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.14_14a/

(I'm still not too familiar with hg so I had to do a fresh rebase; v14 is
the state right after the rebase and v14a integrates the changes requested
from Thomas).

I have also inlined answers to Thomas' email:

>
> A few minor issues:
>
>   - weak reference handling has been factored out in JDK-8189359, now
> you only need to add the additions required for this change to one
> place. :) Please update the webrev :)
>
>
That is awesome and very happily done :)



>   - the one issue Robin noticed.
>
>   - in the declaration of CollectedHeap::sample_allocation, it would be
> nice if the fix_sample_rate parameter were described - it takes
> time to figure out what it's used for. I.e. in case an allocation goes
> beyond the sampling watermark, this value which represents the amount
> of overallocation is used to adjust the next sampling watermark to
> sample at the correct rate.
> Something like this - and if what I wrote is incorrect, there is even
> more reason to document it.
> Or maybe just renaming "fix_sample_rate" to something more descriptive
> - but I have no good idea about that.
> With lack of units in the type, it would also be nice to have the unit
> in the identifier name, as done elsewhere.
>

Done for Robbin's issue and changed it to

>
>   - some (or most actually) of the new setters and getters in the
> ThreadLocalAllocBuffer class could be private I think. Also, we
> typically do not use "simple" getters that just return a member in the
> class where they are defined.
>

I removed all that were not used that I could see (not used outside the
class) moved the ones that are not simple to private if they could be. I
think it's a bit weird because now some of the setting of the tlab internal
data is using methods, others are directly setting. Let me know what you
think.


>
>   - ThreadLocalAllocBuffer::set_sample_end(): please use
> pointer_delta() for pointer subtractions.
>

Done!
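
For reference, a minimal sketch of the idiom (the field names are
illustrative):

  // pointer_delta() returns left - right in HeapWord units and, unlike a
  // raw pointer subtraction, guards against underflow in debug builds.
  size_t words_to_sample = pointer_delta(_sample_end, _top);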


>
>   - ThreadLocalAllocBuffer::pick_next_sample() - I recommend making the
> first check an assert - it seems that it is only useful to call this
> with heap monitoring enabled, as is done right now.
>

Longer conversation below about this; it cannot be an assert (I could
remove the test altogether though).


>
>   - ThreadLocalAllocBuffer::pick_next_sample() - please use
> "PTR_FORMAT" (or INTPTR_FORMAT - they are the same) as format string
> for printing pointer values as is customary within Hotspot. %p output
> is OS dependent. I.e. I heard that e.g. on Ubuntu it prints "null"
> instead of 0x0...0  which is kind of annoying.
>

Done :)
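
A minimal sketch of the suggested formatting (the log tag and field are
illustrative):

  // PTR_FORMAT prints the pointer the same way on every platform;
  // p2i() converts the pointer for the varargs call.
  log_trace(gc, tlab)("Next sample end: " PTR_FORMAT, p2i(_sample_end));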


>
>   - personal preference: do not allocate
> HeapMonitoring::AlwaysTrueClosure globally, but only locally when it's
> used. Setting it up seems to be very cheap.
>

Done!


>
>   - HeapMonitoring::next_random() - the different names for the
> constants use different formatting. Preferable (to me) is
> UpperCamelCase, but at least make them uniform.
>
>
I think done the way you wanted!


>   - in HeapMonitoring::next_random(), you might want to use
> right_n_bits() to create your mask.


Done!
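
A minimal sketch of the masking (the constant names are illustrative):

  // right_n_bits(n) expands to a mask with the low n bits set, e.g.
  // right_n_bits(48) == 0x0000FFFFFFFFFFFF.
  uint64_t next = (PrngMult * _rnd + PrngAdd) & right_n_bits(PrngModPower);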


>
>
>   - not really convinced that it is a good idea to not somehow guard
> StartHeapSampling() and StopHeapSampling() against being called by
> multiple threads.
>
>
I added another mutex for the start/stop so that way it will protect from
that.



> Otherwise looks okay from what I can see.
>

Awesome, what do you think I still need for this before going to the next
step (which is what by the way? :)).


---

Ok now the longer conversation about this remark:


>   - ThreadLocalAllocBuffer::pick_next_sample() - I recommend making the
> first check an assert - it seems that it is only useful to call this
> with heap monitoring enabled, as is done right now.
>


You can't do this right now because this is how the mutexes work right now
to ensure we allow things to work fast but safely:

I) Background before I go into details

1) You can turn sampling on/off whenever you like, which flips the switch
of the HeapMonitoring::enabled() method.
2) When you hit the end of a tlab, you go to the slow path and check
HeapMonitoring::enabled()
3) Later, in the handling of the sample, if enabled() returned true, you
get to the point of choosing a new sampling rate

Now imagine we have two threads A & B and imagine that enabled is currently
returning true. Here is a series of events:

i) Thread A hits the end of tlab, checks enabled at entrance, gets a stack
ii) Meanwhile, Thread B turned off HeapMonitoring
iii) Thread A now goes to pick a new sample rate and is going to check
HeapMonitoring::enabled

If put as an assert, clearly (iii) will fail; we don't want this.
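
So the check has to stay a runtime check; a minimal sketch of the shape
(the body is illustrative, not the actual patch):

  void ThreadLocalAllocBuffer::pick_next_sample() {
    // Another thread may have disabled sampling between the slow-path
    // entrance check (step i) and this point (step iii), so this must
    // stay a check rather than an assert.
    if (!HeapMonitoring::enabled()) {
      return;
    }
    // ... pick the new sampling watermark ...
  }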

II) So is this really an issue? Is there something really broken?

Not really. I can go into a bigger explanation as to why, but really it
boils down to

Re: Low-Overhead Heap Profiling

2017-10-25 Thread Thomas Schatzl
Hi Jc,

  sorry for taking a bit long to respond ;)

On Mon, 2017-10-23 at 08:27 -0700, JC Beyler wrote:
> Dear all,
> 
> Small update this week with this new webrev:
>   - http://cr.openjdk.java.net/~rasbold/8171119/webrev.13/
>   - Incremental is here: http://cr.openjdk.java.net/~rasbold/8171119/webrev.12_13/
> 
> I patched the code changes showed by Robbin last week and I
> refactored collectedHeap.cpp:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.12_13/src/hotspot/share/gc/shared/collectedHeap.cpp.patch
> 
> The original code became a bit too complex in my opinion with the
> handle_heap_sampling handling too many things. So I subdivided the
> logic into two smaller methods and moved out a bit of the logic to
> make it more clear. Hopefully it is :)
> 
> Let me know if you have any questions/comments :)
> Jc

A few minor issues:

  - weak reference handling has been factored out in JDK-8189359, now
you only need to add the additions required for this change to one
place. :) Please update the webrev :)

  - the one issue Robin noticed.

  - in the declaration of CollectedHeap::sample_allocation, it would be
nice if the fix_sample_rate parameter were described - it takes
time to figure out what it's used for. I.e. in case an allocation goes
beyond the sampling watermark, this value which represents the amount
of overallocation is used to adjust the next sampling watermark to
sample at the correct rate.
Something like this - and if what I wrote is incorrect, there is even
more reason to document it.
Or maybe just renaming "fix_sample_rate" to something more descriptive
- but I have no good idea about that.
With lack of units in the type, it would also be nice to have the unit
in the identifier name, as done elsewhere.

  - some (or most actually) of the new setters and getters in the
ThreadLocalAllocBuffer class could be private I think. Also, we
typically do not use "simple" getters that just return a member in the
class where they are defined.

  - ThreadLocalAllocBuffer::set_sample_end(): please use
pointer_delta() for pointer subtractions.

  - ThreadLocalAllocBuffer::pick_next_sample() - I recommend making the
first check an assert - it seems that it is only useful to call this
with heap monitoring enabled, as is done right now.

  - ThreadLocalAllocBuffer::pick_next_sample() - please use
"PTR_FORMAT" (or INTPTR_FORMAT - they are the same) as format string
for printing pointer values as is customary within Hotspot. %p output
is OS dependent. I.e. I heard that e.g. on Ubuntu it prints "null"
instead of 0x0...0  which is kind of annoying.

  - personal preference: do not allocate
HeapMonitoring::AlwaysTrueClosure globally, but only locally when it's
used. Setting it up seems to be very cheap.

  - HeapMonitoring::next_random() - the different names for the
constants use different formatting. Preferable (to me) is
UpperCamelCase, but at least make them uniform.

  - in HeapMonitoring::next_random(), you might want to use
right_n_bits() to create your mask.

  - not really convinced that it is a good idea to not somehow guard
StartHeapSampling() and StopHeapSampling() against being called by
multiple threads.

Otherwise looks okay from what I can see. 

Thanks,
  Thomas



Re: Low-Overhead Heap Profiling

2017-10-25 Thread Robbin Ehn

Hi,

325 HeapWord *tlab_old_end = thread->tlab().return end();

Should be something like:

325 HeapWord *tlab_old_end = thread->tlab().end();

Thanks, Robbin

On 2017-10-23 17:27, JC Beyler wrote:

Dear all,

Small update this week with this new webrev:
   - http://cr.openjdk.java.net/~rasbold/8171119/webrev.13/
   - Incremental is here: 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.12_13/

I patched the code changes showed by Robbin last week and I refactored 
collectedHeap.cpp:

http://cr.openjdk.java.net/~rasbold/8171119/webrev.12_13/src/hotspot/share/gc/shared/collectedHeap.cpp.patch

The original code became a bit too complex in my opinion with the 
handle_heap_sampling handling too many things. So I subdivided the logic into 
two smaller methods and moved out a bit of the logic to make it more clear. 
Hopefully it is :)


Let me know if you have any questions/comments :)
Jc

On Mon, Oct 16, 2017 at 9:34 AM, JC Beyler wrote:


Hi Robbin,

That is because version 11 to 12 was only a test change. I was going to
write about it and say here are the webrev links:
Incremental:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11_12/


Full webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.12/


This change focused only on refactoring the tests to be more manageable,
readable, and maintainable. As all tests are looking at allocations, I moved
common code to a java class:

http://cr.openjdk.java.net/~rasbold/8171119/webrev.11_12/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitor.java.patch



And then most tests call into that class to turn on/off the sampling,
allocate, etc. This has removed almost 500 lines of test code so I'm happy
about that.

Thanks for your changes - a bit of a relic of previous versions :). I've
already integrated them into my code and will make a new webrev at the end
of this week with a refactor of the code handling the tlab slow path. I
find it could use some restructuring to make it easier to follow, so I'm
going to take a stab at it this week.

Any other issues/comments?

Thanks!
Jc


On Mon, Oct 16, 2017 at 8:46 AM, Robbin Ehn <robbin@oracle.com> wrote:

Hi JC,

I saw a webrev.12 in the directory, with only test changes(11->12), so I
took that version.
I had a look and tested the tests, worked fine!

First glance at the code (looking at full v12) some minor things below,
mostly unused stuff.

Thanks, Robbin

diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
--- a/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16
16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16
17:42:42 2017 +0200
@@ -211,2 +211,3 @@
    void initialize(int max_storage) {
+    // validate max_storage to sane value ? What would 0 mean ?
      MutexLocker mu(HeapMonitor_lock);
@@ -227,8 +228,4 @@
    bool initialized() { return _initialized; }
-  volatile bool *initialized_address() { return &_initialized; }

   private:
-  // Protects the traces currently sampled (below).
-  volatile intptr_t _stack_storage_lock[1];
-
    // The traces currently sampled.
@@ -313,3 +310,2 @@
    _initialized(false) {
-    _stack_storage_lock[0] = 0;
  }
@@ -532,13 +528,2 @@

-// Delegate the initialization question to the underlying storage system.
-bool HeapMonitoring::initialized() {
-  return StackTraceStorage::storage()->initialized();
-}
-
-// Delegate the initialization question to the underlying storage system.
-bool *HeapMonitoring::initialized_address() {
-  return
-  const_cast(StackTraceStorage::storage()->initialized_address());
-}
-
  void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
--- a/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16
16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16
17:42:42 2017 +0200
@@ -35,3 +35,2 @@
    static uint64_t _rnd;
-  static bool _initialized;
    static jint _monitoring_rate;
@@ -92,7 +91,2 @@

-  // Is the profiler initialized and where is the address to the
initialized
-  // boolean.
-  static bool initializ

Re: Low-Overhead Heap Profiling

2017-10-16 Thread JC Beyler
Hi Robbin,

That is because version 11 to 12 was only a test change. I was going to
write about it and say here are the webrev links:
Incremental:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11_12/

Full webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.12/

This change focused only on refactoring the tests to be more manageable,
readable, and maintainable. As all tests are looking at allocations, I moved
common code to a java class:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11_12/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitor.java.patch

And then most tests call into that class to turn on/off the sampling,
allocate, etc. This has removed almost 500 lines of test code so I'm happy
about that.

Thanks for your changes - a bit of a relic of previous versions :). I've
already integrated them into my code and will make a new webrev at the end
of this week with a refactor of the code handling the tlab slow path. I
find it could use some restructuring to make it easier to follow, so I'm
going to take a stab at it this week.

Any other issues/comments?

Thanks!
Jc


On Mon, Oct 16, 2017 at 8:46 AM, Robbin Ehn  wrote:

> Hi JC,
>
> I saw a webrev.12 in the directory, with only test changes(11->12), so I
> took that version.
> I had a look and tested the tests, worked fine!
>
> First glance at the code (looking at full v12) some minor things below,
> mostly unused stuff.
>
> Thanks, Robbin
>
> diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
> --- a/src/hotspot/share/runtime/heapMonitoring.cpp  Mon Oct 16
> 16:54:06 2017 +0200
> +++ b/src/hotspot/share/runtime/heapMonitoring.cpp  Mon Oct 16
> 17:42:42 2017 +0200
> @@ -211,2 +211,3 @@
>void initialize(int max_storage) {
> +// validate max_storage to sane value ? What would 0 mean ?
>  MutexLocker mu(HeapMonitor_lock);
> @@ -227,8 +228,4 @@
>bool initialized() { return _initialized; }
> -  volatile bool *initialized_address() { return &_initialized; }
>
>   private:
> -  // Protects the traces currently sampled (below).
> -  volatile intptr_t _stack_storage_lock[1];
> -
>// The traces currently sampled.
> @@ -313,3 +310,2 @@
>_initialized(false) {
> -_stack_storage_lock[0] = 0;
>  }
> @@ -532,13 +528,2 @@
>
> -// Delegate the initialization question to the underlying storage system.
> -bool HeapMonitoring::initialized() {
> -  return StackTraceStorage::storage()->initialized();
> -}
> -
> -// Delegate the initialization question to the underlying storage system.
> -bool *HeapMonitoring::initialized_address() {
> -  return
> -  const_cast(StackTraceStorage::storage()->initialized_address());
> -}
> -
>  void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
> diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
> --- a/src/hotspot/share/runtime/heapMonitoring.hpp  Mon Oct 16
> 16:54:06 2017 +0200
> +++ b/src/hotspot/share/runtime/heapMonitoring.hpp  Mon Oct 16
> 17:42:42 2017 +0200
> @@ -35,3 +35,2 @@
>static uint64_t _rnd;
> -  static bool _initialized;
>static jint _monitoring_rate;
> @@ -92,7 +91,2 @@
>
> -  // Is the profiler initialized and where is the address to the
> initialized
> -  // boolean.
> -  static bool initialized();
> -  static bool *initialized_address();
> -
>// Called when o is to be sampled from a given thread and a given size.
>
>
>
> On 10/10/2017 12:57 AM, JC Beyler wrote:
>
>> Dear all,
>>
>> Thread-safety is back!! Here is the update webrev:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/
>>
>> Full webrev is here:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/
>>
>> In order to really test this, I needed to add this, so I thought now was a
>> good time. It required a few changes here for the creation to ensure
>> correctness and safety. Now we keep the static pointer but clear the data
>> internally, so on re-initialize it will be a bit more costly than before. I
>> don't think this is a huge use-case so I did not think it was a problem. I
>> used the internal MutexLocker; I think I used it well, let me know.
>>
>> I also added three tests:
>>
>> 1) Stack depth test:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/tes
>> t/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/H
>> eapMonitorStackDepthTest.java.patch
>>
>> This test shows that the maximum stack depth system is working.
>>
>> 2) Thread safety:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/tes
>> t/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/H
>> eapMonitorThreadTest.java.patch
>>
>> The test creates 24 threads and they all allocate at the same time. The
>> test then checks it does find samples from all the threads.
>>
>> 3) Thread on/off safety
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/tes
>> t/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/H
>> eapMonitorThreadOnOffTest.java.patch
>>
>> The test creates 24 threads that all allocate a bun

Re: Low-Overhead Heap Profiling

2017-10-16 Thread Robbin Ehn

Hi JC,

I saw a webrev.12 in the directory, with only test changes(11->12), so I took 
that version.
I had a look and tested the tests, worked fine!

First glance at the code (looking at full v12) some minor things below, mostly 
unused stuff.

Thanks, Robbin

diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
--- a/src/hotspot/share/runtime/heapMonitoring.cpp  Mon Oct 16 16:54:06 
2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.cpp  Mon Oct 16 17:42:42 
2017 +0200
@@ -211,2 +211,3 @@
   void initialize(int max_storage) {
+// validate max_storage to sane value ? What would 0 mean ?
 MutexLocker mu(HeapMonitor_lock);
@@ -227,8 +228,4 @@
   bool initialized() { return _initialized; }
-  volatile bool *initialized_address() { return &_initialized; }

  private:
-  // Protects the traces currently sampled (below).
-  volatile intptr_t _stack_storage_lock[1];
-
   // The traces currently sampled.
@@ -313,3 +310,2 @@
   _initialized(false) {
-_stack_storage_lock[0] = 0;
 }
@@ -532,13 +528,2 @@

-// Delegate the initialization question to the underlying storage system.
-bool HeapMonitoring::initialized() {
-  return StackTraceStorage::storage()->initialized();
-}
-
-// Delegate the initialization question to the underlying storage system.
-bool *HeapMonitoring::initialized_address() {
-  return
-  const_cast(StackTraceStorage::storage()->initialized_address());
-}
-
 void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
--- a/src/hotspot/share/runtime/heapMonitoring.hpp  Mon Oct 16 16:54:06 
2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.hpp  Mon Oct 16 17:42:42 
2017 +0200
@@ -35,3 +35,2 @@
   static uint64_t _rnd;
-  static bool _initialized;
   static jint _monitoring_rate;
@@ -92,7 +91,2 @@

-  // Is the profiler initialized and where is the address to the initialized
-  // boolean.
-  static bool initialized();
-  static bool *initialized_address();
-
   // Called when o is to be sampled from a given thread and a given size.



On 10/10/2017 12:57 AM, JC Beyler wrote:

Dear all,

Thread-safety is back!! Here is the update webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/

Full webrev is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/

In order to really test this, I needed to add this, so I thought now was a good time. It required a few changes here for the creation to ensure correctness and safety. Now we
keep the static pointer but clear the data internally, so on re-initialize it will be a bit more costly than before. I don't think this is a huge use-case so I did not
think it was a problem. I used the internal MutexLocker; I think I used it well, let me know.


I also added three tests:

1) Stack depth test:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch

This test shows that the maximum stack depth system is working.

2) Thread safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch

The test creates 24 threads and they all allocate at the same time. The test 
then checks it does find samples from all the threads.

3) Thread on/off safety
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch

The test creates 24 threads that all allocate a bunch of memory. Then another 
thread turns the sampling on/off.

Btw, both tests 2 & 3 failed without the locks.

As I worked on this, I saw a lot of places where the tests are doing very similar things, I'm going to clean up the code a bit and make a HeapAllocator class that all tests 
can call directly. This will greatly simplify the code.


Thanks for any comments/criticisms!
Jc


On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <jcbey...@google.com> wrote:

Dear all,

Small update to the webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/ 


Full webrev is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/ 


I updated a bit of the naming, removed a TODO comment, and added a test
for the sampling rate. I also updated the maximum stack depth to 1024;
there is no reason to keep it so small. I did a micro benchmark that tests
the overhead and it seems relatively the same.

I compared allocations from a stack depth of 10 and allocations from a 
stack depth of 1024 (allocations are from the same helper method in

http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java



Re: Low-Overhead Heap Profiling

2017-10-09 Thread JC Beyler
Dear all,

Thread-safety is back!! Here is the update webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/

Full webrev is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/

In order to really test this, I needed to add this, so I thought now was a
good time. It required a few changes here for the creation to ensure
correctness and safety. Now we keep the static pointer but clear the data
internally, so on re-initialize it will be a bit more costly than before. I
don't think this is a huge use-case so I did not think it was a problem. I
used the internal MutexLocker; I think I used it well, let me know.

I also added three tests:

1) Stack depth test:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch

This test shows that the maximum stack depth system is working.

2) Thread safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch

The test creates 24 threads and they all allocate at the same time. The
test then checks it does find samples from all the threads.

3) Thread on/off safety
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch

The test creates 24 threads that all allocate a bunch of memory. Then
another thread turns the sampling on/off.

Btw, both tests 2 & 3 failed without the locks.

As I worked on this, I saw a lot of places where the tests are doing very
similar things, I'm going to clean up the code a bit and make a
HeapAllocator class that all tests can call directly. This will greatly
simplify the code.

Thanks for any comments/criticisms!
Jc


On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler  wrote:

> Dear all,
>
> Small update to the webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/
>
> Full webrev is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/
>
> I updated a bit of the naming, removed a TODO comment, and added a test
> for the sampling rate. I also updated the maximum stack depth to
> 1024; there is no reason to keep it so small. I did a micro benchmark that
> tests the overhead and it seems relatively the same.
>
> I compared allocations from a stack depth of 10 and allocations from a
> stack depth of 1024 (allocations are from the same helper method in
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/
> raw_files/new/test/hotspot/jtreg/serviceability/jvmti/
> HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
>   - For an array of 1 integer allocated in a loop; stack depth
> 1024 vs stack depth 10: 1% slower
>   - For an array of 200k integers allocated in a loop; stack depth
> 1024 vs stack depth 10: 3% slower
>
> So basically now moving the maximum stack depth to 1024 but we only copy
> over the stack depths actually used.
>
> For the next webrev, I will be adding a stack depth test to show that it
> works and probably put back the mutex locking so that we can see how
> difficult it is to keep thread safe.
>
> Let me know what you think!
> Jc
>
>
>
> On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler  wrote:
>
>> Forgot to say that for my numbers:
>>  - Not in the test are the actual numbers I got for the various array
>> sizes, I ran the program 30 times and parsed the output; here are the
>> averages and standard deviation:
>>   1000:   1.28% average; 1.13% standard deviation
>>   10000:  1.59% average; 1.25% standard deviation
>>   100000: 1.26% average; 1.26% standard deviation
>>
>> The 1000/10000/100000 are the sizes of the arrays being allocated. These
>> are allocated 100k times and the sampling rate is 111 times the size of the
>> array.
>>
>> Thanks!
>> Jc
>>
>>
>> On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler  wrote:
>>
>>> Hi all,
>>>
>>> After a bit of a break, I am back working on this :). As before, here
>>> are two webrevs:
>>>
>>> - Full change set: http://cr.openjdk.java.ne
>>> t/~rasbold/8171119/webrev.09/
>>> - Compared to version 8: http://cr.openjdk.java.net/
>>> ~rasbold/8171119/webrev.08_09/
>>> (This version is compared to version 8 I last showed but ported to
>>> the new folder hierarchy)
>>>
>>> In this version I have:
>>>   - Handled Thomas' comments from his email of 07/03:
>>>- Merged the logging to be standard
>>>- Fixed up the code a bit where asked
>>>- Added some notes about the code not being thread-safe yet
>>>- Removed additional dead code from the version that modifies
>>> interpreter/c1/c2
>>>- Fixed compiler issues so that it compiles with
>>> --disable-precompiled-header
>>> - Tested with ./configure --with-boot-jdk=
>>> --with-debug-level=slowdebug --disable-precompiled-headers
>>>
>>> Additionally, I added a test to check the sanity of the sampler:
>>> HeapMonitorStatCorrectnessTest (http://cr.o

Re: Low-Overhead Heap Profiling

2017-07-05 Thread JC Beyler
Hi all,

I apologize, I have not yet handled your remarks but thought this new
webrev would also be useful to see and comment on perhaps.

Here is the latest webrev, it is generated slightly different than the
others since now I'm using webrev.ksh without the -N option:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/

And the webrev.07 to webrev.08 diff is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/

(Let me know if it works well)

It's a small change between versions but it:
  - provides a fix that makes the average sample rate correct (more on that
below).
  - fixes the code to actually have it play nicely with the fast tlab refill
  - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo
- moved the capability to be onload solo

With this webrev, I've done a small study of the random number generator we
use here for the sampling rate. I took a small program and it can be
simplified to:

for (outer loop)
    for (inner loop)
        int[] tmp = new int[arraySize];

- I've fixed the outer and inner loops to be 800 for this experiment,
meaning we allocate an array of a given size 800 * 800 = 640,000 times.

- Each program provides the average sample size used for the whole execution

- Then, I ran each variation 30 times and then calculated the average of
the average sample size used for various array sizes. I selected the array
size to be one of the following: 1, 10, 100, 1000, 10000.

- When compared to 512kb, the average sample size of 30 runs:
1: 4.62% of error
10: 3.09% of error
100: 0.36% of error
1000: 0.1% of error
10000: 0.03% of error

What it shows is that, depending on the number of samples, the average does
become better. This is because with an allocation of 1 element per array,
it will take longer to hit one of the thresholds. This is seen by looking
at the sample count statistic I put in. For the same number of iterations
(800 * 800), the different array sizes provoke:
1: 62 samples
10: 125 samples
100: 788 samples
1000: 6166 samples
10000: 57721 samples

And of course, the more samples you have, the more sample rates you pick,
which means that your average gets closer using that math.
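
A back-of-the-envelope sketch of that math (my framing, assuming the
picked sample intervals are i.i.d. with mean $m$ and standard deviation
$\sigma$): the average over $n$ samples has standard error

  $\mathrm{SE} = \sigma / \sqrt{n}$

so the relative error shrinks roughly as $1/\sqrt{n}$, consistent with the
error falling from 4.62% at 62 samples to 0.03% at 57721 samples.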

Thanks,
Jc

On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler  wrote:

> Thanks Robbin,
>
> This seems to have worked. When I have the next webrev ready, we will find
> out but I'm fairly confident it will work!
>
> Thanks again!
> Jc
>
> On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn 
> wrote:
>
>> Hi JC,
>>
>> On 06/29/2017 12:15 AM, JC Beyler wrote:
>>
>>> B) Incremental changes
>>>
>>
>> I guess the most common work flow here is using mq :
>> hg qnew fix_v1
>> edit files
>> hg qrefresh
>> hg qnew fix_v2
>> edit files
>> hg qrefresh
>>
>> if you do hg log you will see 2 commits
>>
>> webrev.ksh -r -2 -o my_inc_v1_v2
>> webrev.ksh -o my_full_v2
>>
>>
>> In  your .hgrc you might need:
>> [extensions]
>> mq =
>>
>> /Robbin
>>
>>
>>> Again another newbiew question here...
>>>
>>> For showing the incremental changes, is there a link that explains how
>>> to do that? I apologize for my newbie questions all the time :)
>>>
>>> Right now, I do:
>>>
>>>   ksh ../webrev.ksh -m -N
>>>
>>> That generates a webrev.zip and send it to Chuck Rasbold. He then
>>> uploads it to a new webrev.
>>>
>>> I tried commiting my change and adding a small change. Then if I just do
>>> ksh ../webrev.ksh without any options, it seems to produce a similar page
>>> but now with only the changes I had (so the 06-07 comparison you were
>>> talking about) and a changeset that has it all. I imagine that is what you
>>> meant.
>>>
>>> Which means that my workflow would become:
>>>
>>> 1) Make changes
>>> 2) Make a webrev without any options to show just the differences with
>>> the tip
>>> 3) Amend my changes to my local commit so that I have it done with
>>> 4) Go to 1
>>>
>>> Does that seem correct to you?
>>>
>>> Note that when I do this, I only see the full change of a file in the
>>> full change set (Side note here: now the page says change set and not
>>> patch, which is maybe why Serguei was having issues?).
>>>
>>> Thanks!
>>> Jc
>>>
>>>
>>>
>>> On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn wrote:
>>>
>>> Hi,
>>>
>>> On 06/28/2017 12:04 AM, JC Beyler wrote:
>>>
>>> Dear Thomas et al,
>>>
>>> Here is the newest webrev:
>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/
>>>
>>>
>>>
>>> You have some more bits to fix in there but generally this looks good
>>> and really nice with more tests.
>>> I'll do a deep dive and re-test this when I get back from my long
>>> vacation with whatever patch version you have then.
>>>
>>> Also I think it's time you provide incremental (v06->07 changes) as
>>> well as complete change-sets.
>>>
>>> Thanks, Robbin
>>>
>>>
>>>
>>>
>>> Thomas, I "think" I have answered all your remarks. The summary
>>> is:
>>>
>>> - The statistic system is up a

Re: Low-Overhead Heap Profiling

2017-07-04 Thread Robbin Ehn

Hi Thomas,

First I want to thank you for helping out with this!

On 07/03/2017 02:44 PM, Thomas Schatzl wrote:

Hi,

On Wed, 2017-06-28 at 09:51 +0200, Robbin Ehn wrote:

Hi,

On 06/21/2017 10:45 PM, JC Beyler wrote:


Hi all,

First off: Thanks again to Robbin and Thomas for their reviews :)

Next, I've uploaded a new webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

Here is an update:

- @Robbin, I forgot to say that yes I need to look at implementing
this for the other architectures and testing it before it is all
ready to go. Is it common to have it  working on all possible
combinations or is there a subset that I should be doing first and
we can do the others later?


I will change my mind here and propose a different approach.
Right now the only compiler platform change you have is for the
FastTLABRefill.
FastTLABRefill works for all gc except G1 (CMS is deprecated) but
only in C1 compiled code.
The performance angle of this is not really relevant, since then you
will have C2 compiled code.


Afaik C2 compiled code also uses the stubs for TLAB refill/inline
allocation too (I may remember wrongly).


The FastTLABRefill is only checked in e.g.:
c1_Runtime1_x86.cpp




So I suggested that we should remove FastTLABRefill completely; I
will try to make this happen but most likely not before my long
vacation.

That means you can avoid all platform specific code (as patch is
now), correct?

If so just keep the x86 support in there and we remove that if I
succeed, plus you can support all platforms when
FastTLABRefill=false.

Sounds good?


Not sure it is easy to remove FastTLABRefill, but that may be
something to discuss in more detail. Just noting that while the runtime
call can be negligible on large heaps (with large TLABs) it might not
be on small ones (or you have many Java threads which causes your TLABs
to get smallish due to how TLAB size is calculated iirc).

E.g. Serial GC may still be used a lot on embedded platforms which tend
to use smaller heaps and benefit the most from this optimization.


This was a worry I also had, I was told not to worry (by several folks).



Now one can always automatically disable FastTLABRefill if you want
this kind of heap monitoring (and it does not work with
FastTLABRefill).


That would be good.

When I get back I will start the CSR for removing FastTLABRefill.
The CSR can of course be 'denied', but I will give it a go.

Thanks!

/Robbin



Thanks,
   Thomas



Re: Low-Overhead Heap Profiling

2017-07-03 Thread Thomas Schatzl
Hi,

  here's a first cut of my thoughts of the latest changes. The
compilation problems (at the bottom) prevented me to dig further into
it a bit.

Please strongly consider providing incremental webrevs, it takes some
effort to find changes otherwise.

On Tue, 2017-06-27 at 15:04 -0700, JC Beyler wrote:
> Dear Thomas et al,
> 
> Here is the newest webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/
> 
> Thomas, I "think" I have answered all your remarks. The summary is:
> 
> - The statistic system is up and provides insight on what the heap
> sampler is doing
>    - I've noticed that, though the sampling rate is at the right
> mean, we are missing some samples; I have not yet tracked down why
> (details below)
> - I've run a tiny benchmark that is the worst case: it is a very
> tight loop allocating a small array
>    - In this case, I see no overhead when the system is off so that
> is a good start :)
>    - I see right now a high overhead in this case when sampling is
> on. This is not really too surprising, but I'm going to see if this
> is consistent with our internal implementation. The benchmark is
> really allocation stressful so I'm not too surprised but I want to do
> the due diligence.

I am mostly curious in what happens if you run heap sampling with an
otherwise "reasonable" sampling rate on very large machines (e.g. SPARC
M6/M7 with 2k threads)... costs of single locks tend to explode when
running such pieces of code on such machines.

But yes, that is probably not a use case we should worry too much about
here.

>  - The statistic system is up and I have a new test:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
>     - I did a bit of a study about the random generator here, more
> details are below but basically it seems to work well

I think the RNG itself is fine, it uses a well-known algorithm. I was
only doubting the need for the 20 iterations "priming" it. I.e. do
the first 20 random numbers really have a different distribution than
the next 20?
 
> > I am not sure whether "miss a chance to sample" meant "we could,
> > but consciously don't because it's not that useful" or "it would be
> > necessary but don't because it's too complicated to do.".
> > 
> > Looking at the original comment once more, I am also not sure if
> > that comment shouldn't referring to the "end" variable (not
> > actual_end) because that's the variable that is responsible for
> > taking the sampling path? (Going from the member description of
> > ThreadLocalAllocBuffer).
> > I've moved this code and it no longer shows up here but the 
> > rationale and answer was:
> 
> So.. Yes, end is the variable provoking the sampling. Actual end is
> the actual end of the TLAB.
> 
> What was happening here is that the code is resetting _end to point
> towards the end of the new TLAB. Because, we now have the end for
> sampling and _actual_end for the actual end, we need to update the
> actual_end as well.
> 
> Normally, were we to do the real work here, we would calculate the
> (end - start) offset, then do:
> 
> - Set the new end to : start + (old_end - old_start)
> - Set the actual end like we do here now, because it is the
> actual end.
> 
> Why is this not done here now anymore?
>   - I was still debating which path to take:
>      - Do it in the fast refill code, it has its perks:
>          - In a world where fast refills are happening all the time
> or a lot, we can augment there the code to do the sampling
>      - Remember what we had as an end before leaving the slowpath and
> check on return
>          - This is what I'm doing now, it removes the need to go fix
> up all fast refill paths but if you remain in fast refill paths, you
> won't get sampling. I have to think of the consequences of that,
> maybe a future change later on?
>             - I have the statistics now so I'm going to study that
>                -> By the way, though my statistics are showing I'm
> missing some samples, if I turn off FastTlabRefill, it is the same
> loss, so for now it seems this does not occur in my simple test.
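
To make the end-pointer bookkeeping quoted above concrete, here is a
minimal sketch (not the patch's code; the helper name and the
sample-point arithmetic are illustrative only):

  // Keep the real end of the TLAB in _actual_end and pull _end back so
  // that an allocation crossing it takes the slow path and can be sampled.
  void ThreadLocalAllocBuffer::set_sampling_point(size_t bytes_until_sample) {
    _actual_end = _end;                               // remember the real end
    HeapWord* sample_point = top() + bytes_until_sample / HeapWordSize;
    if (sample_point < _actual_end) {
      _end = sample_point;                            // force the slow path there
    }
  }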

For all but G1 fast TLAB refill is the default. Threads basically never
go into the runtime for allocation, as allocation itself is easy there:
eden is always a contiguous range of memory.

There may be a fast TLAB refill path for G1 in some distant future too
(it can be done), particularly if we wanted G1 to scale down a bit
better. Fast TLAB refill tends to have its main advantages with
smallish TLABs, and that tends to happen with smaller heaps the most as
indicated in the other email.

> > But maybe I am only confused and it's best to just leave the
> > comment away. :)
> > 
> > Thinking about it some more, doesn't this not-sampling in this case
> > mean that sampling does not work in any collector that does inline
> > TLAB allocation at the moment? (Or is inline TLAB alloc
> > automatically disabled in that case?)

Re: Low-Overhead Heap Profiling

2017-07-03 Thread Thomas Schatzl
Hi,

On Wed, 2017-06-28 at 09:51 +0200, Robbin Ehn wrote:
> Hi,
> 
> On 06/21/2017 10:45 PM, JC Beyler wrote:
> > 
> > Hi all,
> > 
> > First off: Thanks again to Robbin and Thomas for their reviews :)
> > 
> > Next, I've uploaded a new webrev:
> > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/
> > 
> > Here is an update:
> > 
> > - @Robbin, I forgot to say that yes I need to look at implementing
> > this for the other architectures and testing it before it is all
> > ready to go. Is it common to have it working on all possible
> > combinations or is there a subset that I should be doing first and
> > we can do the others later?
>
> I will change my mind here and propose a different approach.
> Right now the only compiler platform change you have is for the
> FastTLABRefill.
> FastTLABRefill works for all gc except G1 (CMS is deprecated) but
> only in C1 compiled code.
> The performance angle of this is not really relevant, since then you
> will have C2 compiled code.

Afaik C2 compiled code uses the stubs for TLAB refill/inline
allocation too (I may remember wrongly).

> So I suggested that we should remove FastTLABRefill completely, I
> will try to make this happen but most likely not before my long
> vacation.
>
> That means you can avoid all platform-specific code (as the patch is
> now), correct?
> 
> If so just keep the x86 support in there and we remove that if I
> succeed, plus you can support all platforms when
> FastTLABRefill=false.
> 
> Sounds good?

Not sure it is easy to remove FastTLABRefill, but that may be
something to discuss in more detail. Just noting that while the runtime
call can be negligible on large heaps (with large TLABs) it might not
be on small ones (or you have many Java threads which causes your TLABs
to get smallish due to how TLAB size is calculated iirc).

E.g. Serial GC may still be used a lot on embedded platforms which tend
to use smaller heaps and benefit the most from this optimization.

Now one can always automatically disable FastTLABRefill if you want
this kind of heap monitoring (and it does not work with
FastTLABRefill).
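
If one did auto-disable it, the reconciliation could be as simple as the
following sketch at the point sampling gets switched on (illustrative
only; since heap sampling is enabled via JVMTI rather than a VM flag,
the exact hook is an assumption):

  // Hypothetical: force-disable the incompatible fast path.
  if (FastTLABRefill) {
    if (!FLAG_IS_DEFAULT(FastTLABRefill)) {
      warning("FastTLABRefill is incompatible with heap sampling; disabling it");
    }
    FLAG_SET_ERGO(bool, FastTLABRefill, false);
  }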

Thanks,
  Thomas



Re: Low-Overhead Heap Profiling

2017-06-28 Thread Robbin Ehn

Hi JC,

On 06/29/2017 12:15 AM, JC Beyler wrote:

B) Incremental changes


I guess the most common workflow here is using mq:
hg qnew fix_v1
edit files
hg qrefresh
hg qnew fix_v2
edit files
hg qrefresh

if you do hg log you will see 2 commits

webrev.ksh -r -2 -o my_inc_v1_v2
webrev.ksh -o my_full_v2


In your .hgrc you might need:
[extensions]
mq =

/Robbin



Again, another newbie question here...

For showing the incremental changes, is there a link that explains how to do 
that? I apologize for my newbie questions all the time :)

Right now, I do:

  ksh ../webrev.ksh -m -N

That generates a webrev.zip, which I send to Chuck Rasbold. He then uploads it to
a new webrev.

I tried committing my change and adding a small change. Then if I just do ksh ../webrev.ksh without any options, it seems to produce a similar page, but now with only the
changes I had (so the 06-07 comparison you were talking about) and a changeset that has it all. I imagine that is what you meant.


Which means that my workflow would become:

1) Make changes
2) Make a webrev without any options to show just the differences with the tip
3) Amend my changes to my local commit so that I have it done with
4) Go to 1

Does that seem correct to you?

Note that when I do this, I only see the full change of a file in the full change set (Side note here: now the page says change set and not patch, which is maybe why 
Serguei was having issues?).


Thanks!
Jc



On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <robbin@oracle.com> wrote:

Hi,

On 06/28/2017 12:04 AM, JC Beyler wrote:

Dear Thomas et al,

Here is the newest webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/ 



You have some more bits to fix in there but generally this looks good and
really nice with more tests.
I'll do a deep dive and re-test this when I get back from my long
vacation with whatever patch version you have then.

Also I think it's time you provide incremental (v06->07 changes) as well as 
complete change-sets.

Thanks, Robbin




Thomas, I "think" I have answered all your remarks. The summary is:

- The statistics system is up and provides insight on what the heap
sampler is doing
 - I've noticed that, though the sampling rate is at the right
mean, we are missing some samples; I have not yet tracked down why (details
below)

- I've run a tiny benchmark that is the worst case: it is a very tight
loop that allocates a small array
 - In this case, I see no overhead when the system is off so that 
is a good start :)
 - I see right now a high overhead in this case when sampling is 
on. This is not really too surprising, but I'm going to see if this is
consistent with our
internal implementation. The benchmark is really allocation stressful 
so I'm not too surprised but I want to do the due diligence.

   - The statistics system is up and I have a new test

http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch


  - I did a bit of a study about the random generator here, more 
details are below but basically it seems to work well

   - I added a capability but since this is the first time doing this, 
I was not sure I did it right
 - I did add a test though for it and the test seems to do what I 
expect (all methods are failing with the JVMTI_ERROR_MUST_POSSESS_CAPABILITY 
error).
 - 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch



   - I still need to figure out what to do about the multi-agent vs 
single-agent issue

   - As far as measurements, it seems I still need to look at:
 - Why we do the 20 random calls first, are they necessary?
 - Look at the mean of the sampling rate that the random generator 
does and also what is actually sampled
 - What is the overhead in terms of memory/performance when on?

I have inlined my answers, I think I got them all in the new webrev, 
let me know your thoughts.

Thanks again!
Jc


On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <thomas.scha...@oracle.com> wrote:

 Hi,

 On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
 > Hi all,
 >
 > First off: Thanks again to Robbin and Thomas for their reviews :)

Re: Low-Overhead Heap Profiling

2017-06-28 Thread Robbin Ehn

Hi,

On 06/28/2017 12:04 AM, JC Beyler wrote:

Dear Thomas et al,

Here is the newest webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/


You have some more bits to fix in there but generally this looks good and really
nice with more tests.
I'll do a deep dive and re-test this when I get back from my long vacation
with whatever patch version you have then.

Also I think it's time you provide incremental (v06->07 changes) as well as 
complete change-sets.

Thanks, Robbin





Thomas, I "think" I have answered all your remarks. The summary is:

- The statistics system is up and provides insight on what the heap sampler is
doing
- I've noticed that, though the sampling rate is at the right mean, we are
missing some samples; I have not yet tracked down why (details below)

- I've run a tiny benchmark that is the worst case: it is a very tight loop that
allocates a small array
- In this case, I see no overhead when the system is off so that is a good
start :)
- I see right now a high overhead in this case when sampling is on. This is not really too surprising, but I'm going to see if this is consistent with our internal
implementation. The benchmark is really allocation stressful so I'm not too surprised, but I want to do the due diligence.


  - The statistics system is up and I have a new test
http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch

 - I did a bit of a study about the random generator here, more details are 
below but basically it seems to work well

  - I added a capability but since this is the first time doing this, I was not 
sure I did it right
- I did add a test though for it and the test seems to do what I expect 
(all methods are failing with the JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
- 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch

  - I still need to figure out what to do about the multi-agent vs single-agent 
issue

  - As far as measurements, it seems I still need to look at:
- Why we do the 20 random calls first, are they necessary?
- Look at the mean of the sampling rate that the random generator does and 
also what is actually sampled
- What is the overhead in terms of memory/performance when on?

I have inlined my answers, I think I got them all in the new webrev, let me 
know your thoughts.

Thanks again!
Jc


On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <thomas.scha...@oracle.com> wrote:

Hi,

On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
> Hi all,
>
> First off: Thanks again to Robbin and Thomas for their reviews :)
>
> Next, I've uploaded a new webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/ 

>
> Here is an update:
>
> - @Robbin, I forgot to say that yes I need to look at implementing
> this for the other architectures and testing it before it is all
> ready to go. Is it common to have it working on all possible
> combinations or is there a subset that I should be doing first and we
> can do the others later?
> - I've tested slowdebug, built and ran the JTreg tests I wrote with
> slowdebug and fixed a few more issues
> - I've refactored a bit of the code following Thomas' comments
>- I think I've handled all the comments from Thomas (I put
> comments inline below for the specifics)

Thanks for handling all those.

> - Following Thomas' comments on statistics, I want to add some
> quality assurance tests and find that the easiest way would be to
> have a few counters of what is happening in the sampler and expose
> that to the user.
>- I'll be adding that in the next version if no one sees any
> objections to that.
>- This will allow me to add a sanity test in JTreg about number of
> samples and average of sampling rate
>
> @Thomas: I had a few questions that I inlined below but I will
> summarize the "bigger ones" here:
>- You mentioned constants are not using the right conventions, I
> looked around and didn't see any convention except normal naming
> for static constants. Is that right?

I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
and the rule is to "follow an existing pattern and must have a
distinct appearance from other names". Which does not help a lot I
guess :/ The GC team started using upper camel case, e.g.
SomeOtherConstant, but very likely this is not applied
consistently throughout. So I am fine with not adding another style
(like kMaxStackDepth with the "k" in front with some unknown meaning).

(Chances are you will find that style somewhere used anyway too,
apologies if so :/)

Re: Low-Overhead Heap Profiling

2017-06-28 Thread Robbin Ehn

Hi,

On 06/21/2017 10:45 PM, JC Beyler wrote:

Hi all,

First off: Thanks again to Robbin and Thomas for their reviews :)

Next, I've uploaded a new webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

Here is an update:

- @Robbin, I forgot to say that yes I need to look at implementing this for the other architectures and testing it before it is all ready to go. Is it common to have it 
working on all possible combinations or is there a subset that I should be doing first and we can do the others later?


I will change my mind here and propose a different approach.
Right now the only compiler platform change you have is for the FastTLABRefill.
FastTLABRefill works for all gc except G1 (CMS is deprecated) but only in C1 
compiled code.
The performance angle of this is not really relevant, since then you will have 
C2 compiled code.

So I suggested that we should remove FastTLABRefill completely, I will try to 
make this happen but most likely not before my long vacation.

That means you can avoid all platform-specific code (as the patch is now), correct?

If so just keep the x86 support in there and we remove that if I succeed, plus 
you can support all platforms when FastTLABRefill=false.

Sounds good?

If you had any other questions I missed please let me know.
I'll look at your latest webrev, and respond to that also.

Thanks!

/Robbin



- I've tested slowdebug, built and ran the JTreg tests I wrote with slowdebug 
and fixed a few more issues
- I've refactored a bit of the code following Thomas' comments
- I think I've handled all the comments from Thomas (I put comments inline 
below for the specifics)

The biggest addition to this webrev is that there is more testing, I've added 
two tests for looking specifically at the two garbage sampling data Recent vs 
Frequent:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorRecentTest.java.patch
and
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorFrequentTest.java.patch

I've also refactored the JNI library a bit: 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c.patch.
- Side question: I was looking at trying to make this a multi-file library 
so you would not have all the code in one spot. It seems this is not really 
done?
   - Is there a solution to this? What I really wanted was:
 - The core library that will help testing be easier
 - The JNI code for each Java class separate and calling into the core 
library

- Following Thomas' comments on statistics, I want to add some quality assurance tests and find that the easiest way would be to have a few counters of what is happening in 
the sampler and expose that to the user.

- I'll be adding that in the next version if no one sees any objections to 
that.
- This will allow me to add a sanity test in JTreg about number of samples 
and average of sampling rate

@Thomas: I had a few questions that I inlined below but I will summarize the "bigger 
ones" here:
- You mentioned constants are not using the right conventions, I looked
around and didn't see any convention except normal naming for static
constants. Is that right?
- You mentioned putting the weak_oops_do in a separate method and logging, 
I inlined my answer below and would appreciate your feedback there too.

Thanks again for your reviews and patience!
Jc

PS: I've also inlined my answers to Thomas below:

On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.scha...@oracle.com> wrote:

Hi all,

On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
> Dear all,
>
> I've continued working on this and have done the following webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/ 


[...]
> Things I still need to do:
>- Have to fix that TLAB case for the FastTLABRefill
>- Have to start looking at the data to see that it is consistent
> and does gather the right samples, right frequency, etc.
>- Have to check the GC elements and what that produces
>- Run a slowdebug run and ensure I fixed all those issues you saw
> Robbin
>
> Thanks for looking at the webrev and have a great week!

   scratching a bit on the surface of this change, so apologies for
rather shallow comments:

- macroAssembler_x86.cpp:5604: while this is compiler code, and I am
not sure this is final, please avoid littering the code with TODO
remarks :) They tend to be candidates for later wtf moments only.

Just file a CR for that.


Newcomer question: what is a CR? And I'm not sure I have the rights to do that yet :)

- calling HeapMonitoring::do_weak_oops() (which should probably be
called weak_oops_do() like other similar methods) only if string
deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.

Re: Low-Overhead Heap Profiling

2017-06-28 Thread serguei.spit...@oracle.com

  
  
One more comment on the JVMTI spec update (jvmti.xml).

In the Heap Monitoring section new typedefs are defined:

Call Frame

typedef struct {
    jint bci;
    jmethodID method_id;
} jvmtiCallFrame;

Stack Trace

typedef struct {
    JNIEnv* env_id;
    jvmtiCallFrame* frames;
    jint frame_count;
    jint size;
    jlong thread_id;
} jvmtiStackTrace;

I think it'd make sense to use the similar typedefs defined in the
Stack Frame section: jvmtiFrameInfo and jvmtiStackInfo.
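
For reference, those existing typedefs look like this (copied from the
JVMTI specification, shown only for comparison with the proposal above):

typedef struct {
    jmethodID method;     /* the method executing */
    jlocation location;   /* the index of the instruction executing */
} jvmtiFrameInfo;

typedef struct {
    jthread thread;               /* the thread */
    jint state;                   /* its thread state */
    jvmtiFrameInfo* frame_buffer; /* the stack frames */
    jint frame_count;             /* number of frames returned */
} jvmtiStackInfo;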
  
  
Thanks,
Serguei
  
  
  On 6/27/17 23:58, serguei.spit...@oracle.com wrote:


  
Hi JC,

Please, find more comments below.

On 6/27/17 21:26, serguei.spit...@oracle.com wrote:

Hi JC,

Sorry for the latency in reply.
I'm going to a 4-week vacation.
But I'll try to help you occasionally when there will be access to the
Internet.

On 6/23/17 09:58, JC Beyler wrote:

Hi Serguei and all,

Thanks Serguei for your review. I believe I have not answered many
questions but raised more, but I hope this helps understand what I am
trying to do.

I have inlined my answers to make it easier to answer everything (and,
to be honest, ask questions).

@Thomas: Thank you for your thorough answer as well, I'll be working on
it and will answer when I get the fixes and comments in a handled state,
most likely beginning of next week :)

On Fri, Jun 23, 2017 at 3:01 AM, serguei.spit...@oracle.com wrote:

> Hi JC,
>
> Some initial JVMTI-specific questions/comments.

. . .

> I think, it would make sense to introduce new optional capability,
> something like:
>    can_sample_heap_allocations

I will work on adding that for the next webrev, Robbin had mentioned I
should and I dropped the ball on that one.

I see a couple of more emails from you.
Will try to help when you have some problems or questions.

> Do you consider this API to be used by a single agent or is it a
> multi-agent feature?
> What if one agent calls the StartHeapSampling() but another calls one
> of the others?
> The API specs need to clarify this.

> I'm not sure you had a chance to carefully think on what JVMTI phases
> are allowed for the new functions.

- I am pretty sure I have NOT really thought about this as carefully as
you have, because this is the first time I am doing this and these
questions already make me go "hmmm" :)

Expectable. :)

I have a tendency to work through issues and try to keep the design
simple first. Your question seems to imply I can make this work for a
single-agent world and have a means to explicitly disable it for
multi-agent for now.

Right.
I agree, this will allow to make progress and get more experience first.

I looked quickly in the prims folder for anything mentioning single
agent vs multi agent but did not find anything. Does this get handled,
or is this more a documentation thing and tough luck for the user?

Re: Low-Overhead Heap Profiling

2017-06-27 Thread serguei.spit...@oracle.com

  
  
Hi JC,

Please, find more comments below.

On 6/27/17 21:26, serguei.spit...@oracle.com wrote:

Hi JC,

Sorry for the latency in reply.
I'm going to a 4-week vacation.
But I'll try to help you occasionally when there will be access to the
Internet.

On 6/23/17 09:58, JC Beyler wrote:

Hi Serguei and all,

Thanks Serguei for your review. I believe I have not answered many
questions but raised more, but I hope this helps understand what I am
trying to do.

I have inlined my answers to make it easier to answer everything (and,
to be honest, ask questions).

@Thomas: Thank you for your thorough answer as well, I'll be working on
it and will answer when I get the fixes and comments in a handled state,
most likely beginning of next week :)

On Fri, Jun 23, 2017 at 3:01 AM, serguei.spit...@oracle.com wrote:

> Hi JC,
>
> Some initial JVMTI-specific questions/comments.

. . .

> I think, it would make sense to introduce new optional capability,
> something like:
>    can_sample_heap_allocations

I will work on adding that for the next webrev, Robbin had mentioned I
should and I dropped the ball on that one.

I see a couple of more emails from you.
Will try to help when you have some problems or questions.

> Do you consider this API to be used by a single agent or is it a
> multi-agent feature?
> What if one agent calls the StartHeapSampling() but another calls one
> of the others?
> The API specs need to clarify this.

> I'm not sure you had a chance to carefully think on what JVMTI phases
> are allowed for the new functions.

- I am pretty sure I have NOT really thought about this as carefully as
you have, because this is the first time I am doing this and these
questions already make me go "hmmm" :)

Expectable. :)

I have a tendency to work through issues and try to keep the design
simple first. Your question seems to imply I can make this work for a
single-agent world and have a means to explicitly disable it for
multi-agent for now.

Right.
I agree, this will allow to make progress and get more experience first.

I looked quickly in the prims folder for anything mentioning single
agent vs multi agent but did not find anything. Does this get handled,
or is this more a documentation thing and tough luck for the user?

This page: https://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html#environments
seems to say the latter. That the documentation should say that this is
not a multiple agent support (yet?) and there is nothing we can do. Is
that right?

You can look at these two functions in jvmtiManageCapabilities.cpp:

 144 jvmtiCapabilities JvmtiManageCapabilities::init_always_solo_capabilities() {
 145   jvmtiCapabilities jc;
 146
 147   memset(&jc, 0, sizeof(jc));
 148   jc.can_suspend = 1;
 149   return jc;
 150 }
 151
 152
 153 jvmtiCapabilities JvmtiManageCapabilities::init_onload_solo_capabilities() {
 154   jvmtiCapabilities jc;
 155
 156   memset(&jc, 0, sizeof(jc));
 157   jc.can_generate_field_modification_events = 1;
 158   jc.can_genera

Re: Low-Overhead Heap Profiling

2017-06-27 Thread serguei.spit...@oracle.com

  
  
Hi JC,

Sorry for the latency in reply.
I'm going to a 4-week vacation.
But I'll try to help you occasionally when there will be access to the
Internet.

On 6/23/17 09:58, JC Beyler wrote:


  
Hi Serguei and all,

Thanks Serguei for your review. I believe I have not answered many
questions but raised more, but I hope this helps understand what I am
trying to do.

I have inlined my answers to make it easier to answer everything (and,
to be honest, ask questions).

@Thomas: Thank you for your thorough answer as well, I'll be working on
it and will answer when I get the fixes and comments in a handled state,
most likely beginning of next week :)

On Fri, Jun 23, 2017 at 3:01 AM, serguei.spit...@oracle.com wrote:

> Hi JC,
>
> Some initial JVMTI-specific questions/comments.
>
> I was not able to apply your patch as there are many merge conflicts.
> Could you, please, re-base it on the latest JDK 10 sources?

This is weird, I did a pull right now and it worked. I was not sure I
was using hg correctly so I just did:

hg clone http://hg.openjdk.java.net/jdk10/hs jdk10hs
bash ./get_source.sh
wget http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/hotspot.patch
patch -p1 < hotspot.patch

  

  


Thanks.
The following seems to work:
 % cd hotspot; patch -p1 < ../hotspot.patch

Hopefully, I'll be able to build and play with it in order to
understand better.

(Maybe the patch is not the right way of doing things, there might be
an hg equivalent?)

Yes.
I normally do this:

cd /hotspot
hg import ../hotspot.patch

It returns the following errors:

applying ../hotspot.patch
patching file src/cpu/x86/vm/macroAssembler_x86.cpp
Hunk #1 succeeded at 5662 with fuzz 2 (offset 60 lines).
patching file src/share/vm/gc/g1/g1CollectedHeap.cpp
Hunk #1 FAILED at 74
Hunk #2 succeeded at 4316 with fuzz 1 (offset 33 lines).
Hunk #3 FAILED at 4506
2 out of 3 hunks FAILED -- saving rejects to file
src/share/vm/gc/g1/g1CollectedHeap.cpp.rej
patching file src/share/vm/gc/g1/g1CollectedHeap.hpp
Hunk #1 FAILED at 302
1 out of 1 hunks FAILED -- saving rejects to file
src/share/vm/gc/g1/g1CollectedHeap.hpp.rej
patching file src/share/vm/gc/g1/g1MarkSweep.cpp
Hunk #1 FAILED at 47
Hunk #2 FAILED at 249
2 out of 2 hunks FAILED -- saving rejects to file
src/share/vm/gc/g1/g1MarkSweep.cpp.rej
patching file src/share/vm/gc/parallel/psMarkSweep.cpp
Hunk #1 FAILED at 50
Hunk #2 FAILED at 609
2 out of 2 hunks FAILED -- saving rejects to file
src/share/vm/gc/parallel/psMarkSweep.cpp.rej
patching file src/share/vm/gc/parallel/psParallelCompact.cpp
Hunk #1 FAILED at 59
Hunk #2 FAILED at 2167
2 out of 2 hunks FAILED -- saving rejects to file
src/share/vm/gc/parallel/psParallelCompact.cpp.rej
patching file src/share/vm/gc/shared/collectedHeap.cpp
Hunk #1 FAILED at 37
Hunk #2 FAILED at 294
Hunk #3 FAILED at 314
Hunk #4 FAILED at 335
4 out of 4 hunks FAILED -- saving rejects to file
src/share/vm/gc/shared/collectedHeap.cpp.rej
patching file src/share/vm/gc/shared/collectedHeap.hpp
Hunk #1 succeeded at 146 with fuzz 2 (offset 3 lines).
patching file src/share/vm/gc/shared/collectedHeap.inline.hpp
Hunk #1 FAILED at 156
1 out of 1 hunks FAILED -- saving rejects to file
src/share/vm/gc/shared/collectedHeap.inline.hpp.rej
patching file src/share/vm/gc/shared/genCollectedHeap.cpp
Hunk #1 FAILED at 48
Hunk #2 FAILED at 721
2 out of 2 hunks FAILED -- saving rejects to file
src/share/vm/gc/shared/genCollectedHeap.cpp.rej
patching file src/share/vm/gc/shared/referenceProcessor.cpp
Hunk #1 FAILED at 34
Hunk #2 FAILED at 256
2 out of 2 hunks FAILED -- saving rejects to file
src/share/vm/gc/shared/referenceProcessor.cpp.rej
patching file src/share/vm/gc/shared/threadLocalAllocBuffer.cpp
Hunk #1 FAILED at 28
Hunk #2 FAILED at 120
Hunk #3 FAILED at 182
Hunk #4 FAILED at 305
4 out of 4 hunks FAILED -- saving rejects to file
src/share/vm/gc/shared/threadLocalAllocBuffer.cpp.rej
patching file src/share

Re: Low-Overhead Heap Profiling

2017-06-23 Thread Thomas Schatzl
Hi,

On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
> Hi all,
> 
> First off: Thanks again to Robbin and Thomas for their reviews :)
> 
> Next, I've uploaded a new webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/
> 
> Here is an update:
> 
> - @Robbin, I forgot to say that yes I need to look at implementing
> this for the other architectures and testing it before it is all
> ready to go. Is it common to have it working on all possible
> combinations or is there a subset that I should be doing first and we
> can do the others later?
> - I've tested slowdebug, built and ran the JTreg tests I wrote with
> slowdebug and fixed a few more issues
> - I've refactored a bit of the code following Thomas' comments
>    - I think I've handled all the comments from Thomas (I put
> comments inline below for the specifics)

Thanks for handling all those.

> - Following Thomas' comments on statistics, I want to add some
> quality assurance tests and find that the easiest way would be to
> have a few counters of what is happening in the sampler and expose
> that to the user.
>    - I'll be adding that in the next version if no one sees any
> objections to that.
>    - This will allow me to add a sanity test in JTreg about number of
> samples and average of sampling rate
> 
> @Thomas: I had a few questions that I inlined below but I will
> summarize the "bigger ones" here:
>    - You mentioned constants are not using the right conventions, I
> looked around and didn't see any convention except normal naming
> for static constants. Is that right?

I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
and the rule is to "follow an existing pattern and must have a
distinct appearance from other names". Which does not help a lot I
guess :/ The GC team started using upper camel case, e.g.
SomeOtherConstant, but very likely this is not applied
consistently throughout. So I am fine with not adding another style
(like kMaxStackDepth with the "k" in front with some unknown meaning).

(Chances are you will find that style somewhere used anyway too,
apologies if so :/)

> PS: I've also inlined my answers to Thomas below:
> 
> On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.scha...@oracle.com> wrote:
> > Hi all,
> > 
> > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
> > > Dear all,
> > >
> > > I've continued working on this and have done the following
> > webrev:
> > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/
> > 
> > [...]
> > > Things I still need to do:
> > >    - Have to fix that TLAB case for the FastTLABRefill
> > >    - Have to start looking at the data to see that it is
> > consistent and does gather the right samples, right frequency, etc.
> > >    - Have to check the GC elements and what that produces
> > >    - Run a slowdebug run and ensure I fixed all those issues you
> > saw > Robbin
> > >
> > > Thanks for looking at the webrev and have a great week!
> > 
> >   scratching a bit on the surface of this change, so apologies for
> > rather shallow comments:
> > 
> > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
> > am not sure this is final, please avoid littering the code with
> > TODO remarks :) They tend to be candidates for later wtf moments
> > only.
> > 
> > Just file a CR for that.
> > 
> Newcomer question: what is a CR? And I'm not sure I have the rights to
> do that yet :)

Apologies. CR is a change request; this means filing a bug in the
bug tracker. And you are right, you can't just create a new account in
the OpenJDK JIRA yourself. :(

I was mostly referring to the "... but it is a TODO" part of that
comment in macroassembler_x86.cpp. Comments about the why of the code
are appreciated.

[Note that I now understand that this is to some degree still work in
progress. As long as the final changeset does not contain TODO's I am
fine (and it's not a hard objection, rather their use in "final" code
is typically limited in my experience)]

5603   // Currently, if this happens, just set back the actual end to 
where it was.
5604   // We miss a chance to sample here.

Would be okay, if explaining "this" and the "why" of missing a chance
to sample here would be best.

Like maybe:

// If we needed to refill TLABs, just set the actual end point to 
// the end of the TLAB again. We do not sample here although we could.

I am not sure whether "miss a chance to sample" meant "we could, but
consciously don't because it's not that useful" or "it would be
necessary but don't because it's too complicated to do.".

Looking at the original comment once more, I am also not sure if that
comment shouldn't be referring to the "end" variable (not actual_end)
because that's the variable that is responsible for taking the sampling
path? (Going from the member description of ThreadLocalAllocBuffer).

But maybe I am only confused and it's best to just leave the comment
away. :)

Thinking about it some more, doesn't this not-sampling in this case
mean that sampling does not work in any collector that does inline TLAB
allocation at the moment? (Or is inline TLAB alloc automatically
disabled in that case?)

Re: Low-Overhead Heap Profiling

2017-06-23 Thread serguei.spit...@oracle.com

  
  
Hi JC,

Some initial JVMTI-specific questions/comments.

I was not able to apply your patch as there are many merge conflicts.
Could you, please, re-base it on the latest JDK 10 sources?

I think, it would make sense to introduce new optional capability,
something like:
   can_sample_heap_allocations

Do you consider this API to be used by a single agent or is it a
multi-agent feature?
What if one agent calls the StartHeapSampling() but another calls one of
the others?
The API specs need to clarify this.

I'm not sure you had a chance to carefully think on what JVMTI phases
are allowed for the new functions.

http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/src/share/vm/runtime/heapMonitoring.cpp.html

I do not see that the variable HeapMonitoring::_enabled is checked in
the functions:
    HeapMonitoring::get_live_traces(),
    HeapMonitoring::get_garbage_traces(),
    HeapMonitoring::get_frequent_garbage_traces(),
    HeapMonitoring::release_traces()
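
A minimal sketch of the guard (the exact signature and the error code
returned here are assumptions, not the patch's actual code):

jvmtiError HeapMonitoring::get_live_traces(jvmtiStackTraces* traces) {
  if (!_enabled) {
    // Bail out before touching any sampling state.
    return JVMTI_ERROR_NOT_AVAILABLE;
  }
  // ... existing code copying out the live traces ...
  return JVMTI_ERROR_NONE;
}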
  
A question about a multi-agent feature from above applies here as well.

Thanks,
Serguei

On 6/21/17 13:45, JC Beyler wrote:


  
Hi all,

First off: Thanks again to Robbin and Thomas for their reviews :)

Next, I've uploaded a new webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

Here is an update:

- @Robbin, I forgot to say that yes I need to look at implementing this
for the other architectures and testing it before it is all ready to go.
Is it common to have it working on all possible combinations or is there
a subset that I should be doing first and we can do the others later?
- I've tested slowdebug, built and ran the JTreg tests I wrote with
slowdebug and fixed a few more issues
- I've refactored a bit of the code following Thomas' comments
   - I think I've handled all the comments from Thomas (I put comments
inline below for the specifics)

The biggest addition to this webrev is that there is more testing, I've
added two tests for looking specifically at the two garbage sampling
data, Recent vs Frequent:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorRecentTest.java.patch
and
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorFrequentTest.java.patch

I've also refactored the JNI library a bit:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c.patch.
   - Side question: I was looking at trying to make this a multi-file
library so you would not have all the code in one spot. It seems this is
not really done?
      - Is there a solution to this? What I really wanted was:
        - The core library that will help testing be easier
        - The JNI code for each Java class separate and calling into the
core library

- Following Thomas' comments on statistics, I want to add some quality
assurance tests and find that the easiest way would be to have a few
counters of what is happening in the sampler and expose that to the user.
   - I'll be adding that in the next version if no one sees any
objections to that.
   - This will allow me to add a sanity test in JTreg about number of
samples and average of sampling rate
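
The counters could be as simple as the following sketch (names invented
for illustration; the observed mean sampling interval is then just the
ratio of the two):

struct HeapSamplerStats {
  uint64_t samples_taken;    // allocations actually sampled
  uint64_t bytes_allocated;  // total bytes seen by the sampler
};
// observed mean sampling interval = bytes_allocated / samples_taken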


@Thomas: I had a few questions that I inlined below but I will
summarize the "bigger ones" here:
   - You mentioned constants are not using the right conventions, I
looked around and didn't see any convention except normal naming for
static constants. Is that right?
   - You mentioned putting the weak_oops_do in a separate method and
logging, I inlined my answer below and would appreciate your feedback
there too.

Thanks again for your reviews and patience!

Jc

PS: I've also inlined my answers to Thomas below:



  
On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.scha...@oracle.com> wrote:

Hi all,

On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
> Dear all,
>
  

Re: Low-Overhead Heap Profiling

2017-06-13 Thread Thomas Schatzl
Hi all,

On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
> Dear all,
> 
> I've continued working on this and have done the following webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

[...]
> Things I still need to do:
>    - Have to fix that TLAB case for the FastTLABRefill
>    - Have to start looking at the data to see that it is consistent
> and does gather the right samples, right frequency, etc.
>    - Have to check the GC elements and what that produces
>    - Run a slowdebug run and ensure I fixed all those issues you saw
> Robbin
> 
> Thanks for looking at the webrev and have a great week!

  scratching a bit on the surface of this change, so apologies for
rather shallow comments:

- macroAssembler_x86.cpp:5604: while this is compiler code, and I am
not sure this is final, please avoid littering the code with TODO
remarks :) They tend to be candidates for later wtf moments only.

Just file a CR for that.

- calling HeapMonitoring::do_weak_oops() (which should probably be
called weak_oops_do() like other similar methods) only if string
deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.

The call should be at least around 6 lines up outside the if.

Preferentially in a method like process_weak_jni_handles(), including
additional logging. (No new (G1) gc phase without minimal logging :)).
A rough sketch of what I mean follows at the end of these comments.

Seeing the changes in referenceProcessor.cpp, you need to add the call to
HeapMonitoring::do_weak_oops() exactly similar to
process_weak_jni_handles() in case there is no reference processing
done.

- psParallelCompact.cpp:2172 similar to other places where the change
adds the call to HeapMonitoring::do_weak_oops(), remove the empty line.

- the change doubles the size of
CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
threshold. Maybe it could be refactored a bit.

- referenceProcessor.cpp:40: please make sure that the includes are
sorted alphabetically (I did not check the other files yet).

- referenceProcessor.cpp:261: the change should add logging about the
number of references encountered, maybe after the corresponding "JNI
weak reference count" log message.

- threadLocalAllocBuffer.cpp:331: one more "TODO"

- threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
documentation should be updated about the sampling additions. I would
have no clue what the difference between "actual_end" and "end" would
be from the given information.

- threadLocalAllocBuffer.hpp:130: extra whitespace ;)

- some copyrights :)

- in heapMonitoring.hpp: there are some random comments about some code
that has been grabbed from "util/math/fastmath.[h|cc]". I can't tell
whether this is code that can be used but I assume that Noam Shazeer is
okay with that (i.e. that's all Google code).

- heapMonitoring.hpp/cpp static constant naming does not correspond to
Hotspot's. Additionally, in Hotspot static methods are cased like other
methods.

- in heapMonitoring.cpp there are a few cryptic comments at the top
that seem to refer to internal stuff that should probably be removed.
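
As to the process_weak_jni_handles()-style method with logging mentioned
above, I mean something along these lines (a rough sketch only; the
method name and log tags are illustrative, not taken from the patch):

void G1CollectedHeap::process_heap_monitoring_weak_oops(BoolObjectClosure* is_alive,
                                                        OopClosure* keep_alive) {
  double start = os::elapsedTime();
  // The weak_oops_do() naming suggested earlier.
  HeapMonitoring::weak_oops_do(is_alive, keep_alive);
  log_debug(gc, phases)("Heap monitoring weak oops (ms): %1.3f",
                        (os::elapsedTime() - start) * 1000.0);
}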

I did not think through the impact of the TLAB changes on collector
behavior yet (if there are). Also I did not check for problems with
concurrent mark and SATB/G1 (if there are).

Thanks,
  Thomas



Re: Low-Overhead Heap Profiling

2017-06-13 Thread Robbin Ehn

Hi again,

Forgot to say I added Dan, he might be interested in the JVMTI changes at least.

On 06/13/2017 11:19 AM, Robbin Ehn wrote:


To get more potential users it's nice with a Java API, and as you say it
simplifies tests, good.


Obviously this is not a public API, but the code for doing it from java will at
least be out there.

Thanks!

/Robbin





   - This has allowed me to add a nice test here to test the wipe out 
of the data:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorOnOffTest.java.patch

   - Finally, I've done initial overhead testing and, though my data was a bit noisy, I saw no real overhead using the Tlab approach while off. I will try to get better 
data and more importantly, more stable data when turning it on.


I think it's key to have numbers showing ~0% performance impact here.



Things I still need to do:
- Have to fix that TLAB case for the FastTLABRefill
- Have to start looking at the data to see that it is consistent and does 
gather the right samples, right frequency, etc.
- Have to check the GC elements and what that produces
- Run a slowdebug run and ensure I fixed all those issues you saw Robbin


Also we will need support for some more platforms; as the patch stands now, it's
only the src/cpu/x86/vm/macroAssembler_x86.cpp part that needs to be done for 
other platforms?

Thanks!

/Robbin



Thanks for looking at the webrev and have a great week!
Jc

On Fri, Jun 2, 2017 at 8:57 AM, JC Beyler <jcbey...@google.com> wrote:

Dear all,

Thanks Robbin for the comments. I have left the MuxLocker for now and will
look at removing it, or push it as a future enhancement as you say. I did make the
classes extend the suggested base classes and
added a TODO for myself to test this in slowdebug.

I have also put a new webrev up that is still a work in progress but should 
allow us to talk about TLAB vs C1/C2 modifications.

TLAB implementation: http://cr.openjdk.java.net/~rasbold/8171119/webrev.04/ 

C1/C2 implementation: http://cr.openjdk.java.net/~rasbold/8171119/webrev.03/ 


What this webrev has is:
- I have put in a TLAB implementation, it is a WIP, and I have not yet 
done the qualitative/quantitative study on it vs the implementation using 
compilation changes
but the big parts are in place
- Caveats:
- There is a TODO: 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.04/src/cpu/x86/vm/macroAssembler_x86.cpp.patch


- I have not fixed the calculation in the case of a 
FastTLABRefill case
- This is always on right now, there is no way to turn it off, 
that's also a TODO to be directed by the JVMTI API

 - I also have circumvented the AsyncGetCallTrace using the snippet of 
code you showed Robbin, it works for here/now
   - But we might have to revisit this one day because it then does 
not try to get some of the native stacks and jumps directly to the Java stacks 
(I see cases
where this could be an issue)
   - However, this has cleaned up quite a bit of the code and I 
have removed all mention of ASGCT and its structures now and use directly the 
JVMTI structures

- GC is handled now, I have not yet done the qualitative/quantitative 
study on it but the big parts are in place

- Due to the way the TLAB is called, the stack walker is now correct 
and produces the right stacks it seems (this is a bold sentence from my ONE 
JTREG test :))

Final notes on this webrev:
- Have to fix that TLAB case for the FastTLABRefill
- Implement the turn on/off system for the TLAB implementation
- Have to start looking at the data to see that it is consistent and 
does gather the right samples, right frequency, etc.
- Have to check the GC elements and what that produces
- Run a slowdebug run and ensure I fixed all those issues you saw Robbin

As always, your comments and feedback are greatly appreciated! Happy Friday!
Jc


On Tue, May 30, 2017 at 5:33 AM, Robbin Ehn <robbin@oracle.com> wrote:

Hi Jc,

On 05/22/2017 08:47 PM, JC Beyler wrote:

Dear all,

I have a new webrev up:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.03/ 



I liked this!

Two small things:

heapMonitoring.hpp
class HeapMonitoring should extend AllStatic

heapMonitoring.cpp
class MuxLocker should extend StackObj
But I think you should skip MuxLocker or push it separate generic 
enhancement.

Great with the jtreg test, thanks a lot!


This webrev has

Re: Low-Overhead Heap Profiling

2017-06-13 Thread Robbin Ehn

Hi JC,

This version really makes me happy, awesome work!

Since we got no attention from compiler and you managed to get rid of most of
the compiler changes, I dropped compiler but added runtime instead.
Serguei, when you have time, please chime in.

On 06/12/2017 08:11 PM, JC Beyler wrote:

Dear all,

I've continued working on this and have done the following webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

What it does:
- Implement the turn on/off system for the TLAB implementation
   - Basically you can turn the system on/off using the JVMTI API (StartHeapProfiling and StopHeapProfiling here 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/src/share/vm/prims/jvmti.xml.patch)
  - I've currently implemented that the system resets the profiling data at the StartHeapProfiling, this allows you to start profiling, stop profiling, and then the 
user can query what happened during profiling as a post-processing step :)


Nice addition.



- I've currently, for the sake of simplicity (and Robbin hinted it would be better for now), removed the mutex code for the data, but think that will have to come back in a
next patch or in this one at some point


29 // Keep muxlock for now

237   // Protects the traces currently sampled (below).
238   volatile intptr_t _allocated_traces_lock[1];

This seems unused now, and I do not see any locks at all?



- I've also really worked on the testing code to make it more expandable in 
this case
http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c.patch is now a bit more generic and allows to turn on/off the 
sampling; it allows to query the existence or not of frames passed from Java world to JNI, which allows the test to define what should be seen and have the code in one place.

   - That is done using the helper class 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/test/serviceability/jvmti/HeapMonitor/MyPackage/Frame.java.patch
  - Basically each test can now provide an array of Frames that the native library can check the internal data for a match. This allows to have various tests 
have their own signatures, etc.


To get more potential users it's nice with a Java API, and as you say it
simplifies tests, good.



   - This has allowed me to add a nice test here to test the wipe out 
of the data:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorOnOffTest.java.patch

   - Finally, I've done initial overhead testing and, though my data was a bit noisy, I saw no real overhead using the Tlab approach while off. I will try to get better 
data and more importantly, more stable data when turning it on.


I think it's key to have numbers showing ~0% performance impact here.



Things I still need to do:
- Have to fix that TLAB case for the FastTLABRefill
- Have to start looking at the data to see that it is consistent and does 
gather the right samples, right frequency, etc.
- Have to check the GC elements and what that produces
- Run a slowdebug run and ensure I fixed all those issues you saw Robbin


Also we will need support for some more platforms; as the patch stands now, it's
only the src/cpu/x86/vm/macroAssembler_x86.cpp part that needs to be done for 
other platforms?

Thanks!

/Robbin



Thanks for looking at the webrev and have a great week!
Jc

On Fri, Jun 2, 2017 at 8:57 AM, JC Beyler <jcbey...@google.com> wrote:

Dear all,

Thanks Robbin for the comments. I have left the MuxLocker for now and will
look at removing it, or push it as a future enhancement as you say. I did make the
classes extend the suggested base classes and
added a TODO for myself to test this in slowdebug.

I have also put a new webrev up that is still a work in progress but should 
allow us to talk about TLAB vs C1/C2 modifications.

TLAB implementation: http://cr.openjdk.java.net/~rasbold/8171119/webrev.04/ 

C1/C2 implementation: http://cr.openjdk.java.net/~rasbold/8171119/webrev.03/ 


What this webrev has is:
- I have put in a TLAB implementation, it is a WIP, and I have not yet 
done the qualitative/quantitative study on it vs the implementation using 
compilation changes
but the big parts are in place
- Caveats:
- There is a TODO: 
http://cr.openjdk.java.net/~rasbold/8171119/webrev.04/src/cpu/x86/vm/macroAssembler_x86.cpp.patch


- I have not fixed the calculation in the case of a 
FastTLABRefill case
- This is always on right now, there is no way to turn it off, 
that's also a TODO to be directed by the JVMTI API

 - I also have circumvented the AsyncGetCallTrace using the snippet of
code you showed Robbin, it works for here/now

Re: Low-Overhead Heap Profiling

2017-05-30 Thread Robbin Ehn

Hi Jc,

On 05/22/2017 08:47 PM, JC Beyler wrote:

Dear all,

I have a new webrev up:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.03/


I liked this!

Two small things:

heapMonitoring.hpp
class HeapMonitoring should extend AllStatic

heapMonitoring.cpp
class MuxLocker should extend StackObj
But I think you should skip MuxLocker or push it as a separate generic enhancement.

Great with the jtreg test, thanks a lot!



This webrev has, I hope, fixed a lot of the comments from Robbin:
   - The casts normally are all C++ style
   - Moved this to jdk10-hs
  - I have not tested slowdebug yet, hopefully it does not break there
   - Added the garbage collection system:
  - Now live sampled allocations are tracked throughout their lifetime
 - When GC happens, it moves the sampled allocation information to two 
lists: recent and frequent GC lists
 - Those lists use the array system that the live objects were 
using before but have different re-use strategies
  - Added the JVMTI API for them via a GetFrequentGarbageTraces and 
GetGarbageTraces
- Both use the same JVMTI structures
  - Added the calls to them for the test, though I've kept that test simple 
for now:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.03/raw_files/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
 - As I write this, I notice my webrev is missing a final change I made to the test that calls the same ReleaseTraces on each live/garbage/frequent structure. This
is updated in my local repo and will get in the next webrev.


Next steps for this work are:
- Putting the TLAB implementation (yes not forgotten ;-))
- Adding more testing and separate the current test system to check things 
a bit more thoroughly
- Have not tried to circumvent AsyncGetCallTrace yet
- Still have to double check the stack walker a bit more


Looking forward to this.

Could someone from compiler take a look please?

Thanks!

/Robbin



Happy webrev perusal!
Jc


On Tue, May 16, 2017 at 5:20 AM, Robbin Ehn <robbin@oracle.com> wrote:

Just a few answers,

On 05/15/2017 06:48 PM, JC Beyler wrote:

Dear all,

I've updated the webrev to:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/


I'll look at this later, thanks!


Robbin,
I believe I have addressed most of your items with webrev 02:
- I added a JTreg test to show how it works:

http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/raw_files/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c



   - I've modified the code to use its own data structures both 
internally and externally, this will make it easier to move out of 
AsyncGetCallTrace as we move
forward, that is still on my TODOs
   - I cleaned up the JVMTI API by passing a structure that handles the 
num_traces and put in a ReleaseTraces as well
   - I cleaned up other issues as well.

However, I have three questions, which are probably because I'm new in 
this community:
   1) My previous webrevs were based off of JDK9 by mistake. When I 
took JDK10 via: hg clone http://hg.openjdk.java.net/jdk10/jdk10 jdk10
   - I don't see code compatible with what you were showing (ie 
your patches don't make sense for that code base; ex: klass is still accessed 
via klass() for
example in collectedHeap.inline.hpp)
   - Would you know what is the right hg clone command so we are 
working on the same code base?


We use jdk10-hs, e.g.
hg tclone http://hg.openjdk.java.net/jdk10/hs 10-hs

There are sporadic big merges going from jdk9->jdk10->jdk10-hs and
jdk10-hs->jdk10, so 10 is moving...


   2) You mentioned I was using os::malloc, new, NEW_C_HEAP_ARRAY; I 
cleaned out the os::malloc but which of the new vs NEW_C_HEAP_ARRAY should I 
use. It might be
that I don't understand when one uses one or the other but I see both 
used around the code base?
 - Is it that new is to be used for anything internal and 
NEW_C_HEAP_ARRAY anything provided to the JVMTI users outside of the JVM?


We overload the new operator when you extend the correct base class, e.g.
CHeapObj, so use 'new'. But for arrays you will need the macro
NEW_C_HEAP_ARRAY.

Re: Low-Overhead Heap Profiling

2017-05-22 Thread JC Beyler
Dear all,

I have a new webrev up:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.03/

This webrev has, I hope, fixed a lot of the comments from Robbin:
  - The casts normally are all C++ style
  - Moved this to jdk10-hs
 - I have not tested slowdebug yet, hopefully it does not break there
  - Added the garbage collection system:
 - Now live sampled allocations are tracked throughout their lifetime
- When GC happens, it moves the sampled allocation information to
two lists: recent and frequent GC lists
- Those lists use the array system that the live objects were
using before but have different re-use strategies (see the sketch after
this list)
 - Added the JVMTI API for them via a GetFrequentGarbageTraces and
GetGarbageTraces
- Both use the same JVMTI structures
 - Added the calls to them for the test, though I've kept that test
simple for now:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.03/raw_files/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
- As I write this, I notice my webrev is missing a final change I
made to the test that calls the same ReleaseTraces on each
live/garbage/frequent structure. This is updated in my local repo and will
get in the next webrev.
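
As a sketch of the two re-use strategies mentioned above (a conceptual
reading only; the actual data structures and replacement policies in the
webrev may differ):

// Hypothetical fixed-size buffers for garbage traces.
static const int GarbageBufferSize = 256;  // illustrative capacity

jvmtiStackTrace* _recent_garbage[GarbageBufferSize];
int _recent_pos = 0;
jvmtiStackTrace* _frequent_garbage[GarbageBufferSize];

// "Recent" list: plain round-robin, so it always holds the last
// GarbageBufferSize collected traces.
void store_recent(jvmtiStackTrace* trace) {
  _recent_garbage[_recent_pos] = trace;
  _recent_pos = (_recent_pos + 1) % GarbageBufferSize;
}

// "Frequent" list: overwrite a uniformly random slot, so traces from
// frequently collected sites tend to dominate the buffer over time.
void store_frequent(jvmtiStackTrace* trace) {
  _frequent_garbage[os::random() % GarbageBufferSize] = trace;
}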

Next steps for this work are:
   - Putting the TLAB implementation (yes not forgotten ;-))
   - Adding more testing and separate the current test system to check
things a bit more thoroughly
   - Have not tried to circumvent AsyncGetCallTrace yet
   - Still have to double check the stack walker a bit more

Happy webrev perusal!
Jc


On Tue, May 16, 2017 at 5:20 AM, Robbin Ehn  wrote:

> Just a few answers,
>
> On 05/15/2017 06:48 PM, JC Beyler wrote:
>
>> Dear all,
>>
>> I've updated the webrev to:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/
>>
>
> I'll look at this later, thanks!
>
>
>> Robbin,
>> I believe I have addressed most of your items with webrev 02:
>>- I added a JTreg test to show how it works:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/raw_files/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
>>   - I've modified the code to use its own data structures both internally
>> and externally, this will make it easier to move out of AsyncGetCallTrace
>> as we move forward, that is still on my TODOs
>>   - I cleaned up the JVMTI API by passing a structure that handles the
>> num_traces and put in a ReleaseTraces as well
>>   - I cleaned up other issues as well.
>>
>> However, I have three questions, which are probably because I'm new in
>> this community:
>>   1) My previous webrevs were based off of JDK9 by mistake. When I took
>> JDK10 via: hg clone http://hg.openjdk.java.net/jdk10/jdk10 jdk10
>>   - I don't see code compatible with what you were showing (ie your
>> patches don't make sense for that code base; ex: klass is still accessed
>> via klass() for example in collectedHeap.inline.hpp)
>>   - Would you know what is the right hg clone command so we are
>> working on the same code base?
>>
>
> We use jdk10-hs, e.g.
> hg tclone http://hg.openjdk.java.net/jdk10/hs 10-hs
>
> There are sporadic big merges going from jdk9->jdk10->jdk10-hs and
> jdk10-hs->jdk10, so 10 is moving...
>
>
>>   2) You mentioned I was using os::malloc, new, NEW_C_HEAP_ARRAY; I
>> cleaned out the os::malloc but which of the new vs NEW_C_HEAP_ARRAY should
>> I use. It might be that I don't understand when one uses one or the other
>> but I see both used around the code base?
>> - Is it that new is to be used for anything internal and
>> NEW_C_HEAP_ARRAY anything provided to the JVMTI users outside of the JVM?
>>
>
> We overload the new operator when you extend the correct base class, e.g.
> CHeapObj, so use 'new'.
> But for arrays you will need the macro NEW_C_HEAP_ARRAY.
>
>
>>   3) Casts: same kind question: which should I use. The code was using a
>> bit of everything, I'll refactor it entirely but I was not clear if I
>> should go to C casts or C++ casts as I see both in the codebase. What is
>> the convention I should use?
>>
>
> Just be consistent, use what suits you; C++ casts might be preferable if we
> are moving towards C++11.
> And use the 'right' cast, e.g. going from Thread* to JavaThread* you should
> use a C cast or static_cast, not reinterpret_cast, I would say.
>
>
>> Final notes on this webrev:
>>- I am still missing:
>>  - Putting a TLAB implementation so that we can compare both webrevs
>>  - Have not tried to circumvent AsyncGetCallTrace
>>  - Putting in the handling of GC'd objects
>>  - Fix a stack walker issue I have seen, I think I know the problem
>> and will test that theory out for the next webrev
>>
>> I will work on integrating those items for the next webrev!
>>
>
> Thanks!
>
>
>> Thanks for your help,
>

Re: Low-Overhead Heap Profiling

2017-05-16 Thread JC Beyler
Dear Robbin,

Thank you for the answers, much appreciated :)
Jc

On Tue, May 16, 2017 at 5:20 AM, Robbin Ehn  wrote:

> Just a few answers,
>
> On 05/15/2017 06:48 PM, JC Beyler wrote:
>
>> Dear all,
>>
>> I've updated the webrev to:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/ <
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/>
>>
>
> I'll look at this later, thanks!
>
>
>> Robbin,
>> I believe I have addressed most of your items with webrev 02:
>>- I added a JTreg test to show how it works:
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/raw_fi
>> les/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c <
>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/raw_f
>> iles/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c>
>>   - I've modified the code to use its own data structures both internally
>> and externally, this will make it easier to move out of AsyncGetCallTrace
>> as we move forward, that is still on my TODOs
>>   - I cleaned up the JVMTI API by passing a structure that handles the
>> num_traces and put in a ReleaseTraces as well
>>   - I cleaned up other issues as well.
>>
>> However, I have three questions, which are probably because I'm new in
>> this community:
>>   1) My previous webrevs were based off of JDK9 by mistake. When I took
>> JDK10 via : hg clone http://hg.openjdk.java.net/jdk10/jdk10 <
>> http://hg.openjdk.java.net/jdk10/jdk10> jdk10
>>   - I don't see code compatible with what you were showing (ie your
>> patches don't make sense for that code base; ex: klass is still accessed
>> via klass() for example in collectedHeap.inline.hpp)
>>   - Would you know what is the right hg clone command so we are
>> working on the same code base?
>>
>
> We use jdk10-hs, e.g.
> hg tclone http://hg.openjdk.java.net/jdk10/hs 10-hs
>
> There are sporadic big merges going from jdk9->jdk10->jdk10-hs and
> jdk10-hs->jdk10, so 10 is moving...
>
>
>>   2) You mentioned I was using os::malloc, new, NEW_C_HEAP_ARRAY; I
>> cleaned out the os::malloc but which of the new vs NEW_C_HEAP_ARRAY should
>> I use. It might be that I don't understand when one uses one or the other
>> but I see both used around the code base?
>> - Is it that new is to be used for anything internal and
>> NEW_C_HEAP_ARRAY anything provided to the JVMTI users outside of the JVM?
>>
>
> We overload the new operator when you extend the correct base class, e.g.
> CHeapObj, so use 'new'.
> But for arrays you will need the macro NEW_C_HEAP_ARRAY.
>
>
>>   3) Casts: same kind question: which should I use. The code was using a
>> bit of everything, I'll refactor it entirely but I was not clear if I
>> should go to C casts or C++ casts as I see both in the codebase. What is
>> the convention I should use?
>>
>
> Just be consistent, use what suits you; C++ casts might be preferable if we
> are moving towards C++11.
> And use the 'right' cast, e.g. going from Thread* to JavaThread* you should
> use a C cast or static_cast, not reinterpret_cast, I would say.
>
>
>> Final notes on this webrev:
>>- I am still missing:
>>  - Putting a TLAB implementation so that we can compare both webrevs
>>  - Have not tried to circumvent AsyncGetCallTrace
>>  - Putting in the handling of GC'd objects
>>  - Fix a stack walker issue I have seen, I think I know the problem
>> and will test that theory out for the next webrev
>>
>> I will work on integrating those items for the next webrev!
>>
>
> Thanks!
>
>
>> Thanks for your help,
>> Jc
>>
>> Ps:  I tested this on a new repo:
>>
>> hg clone http://hg.openjdk.java.net/jdk10/jdk10 <
>> http://hg.openjdk.java.net/jdk10/jdk10> jdk10
>> ... building it
>> cd test
>> jtreg -nativepath:/build/linux-x86_64-normal-server
>> -release/support/test/hotspot/jtreg/native/lib/ -jdk
>> /linux-x86_64-normal-server-release/images/jdk
>> ../hotspot/test/serviceability/jvmti/HeapMonitor/
>>
>>
> I'll test it out!
>
> /Robbin
>
>
>>
>> On Thu, May 4, 2017 at 11:21 PM, serguei.spit...@oracle.com wrote:
>>
>> Robbin,
>>
>> Thank you for forwarding!
>> I will review it.
>>
>> Thanks,
>> Serguei
>>
>>
>>
>> On 5/4/17 02:13, Robbin Ehn wrote:
>>
>> Hi,
>>
>> To me the compiler changes look like what is expected.
>> It would be good if someone from the compiler team could take a look at
>> that.
>> Added compiler to mail thread.
>>
>> Also adding Serguei; it would be good to have his view also.
>>
>> My initial take on it: I read through most of the code and took it
>> for a ride.
>>
>> ##
>> - Regarding the compiler changes: I think we need the 'TLAB end'
>> trickery (mentioned by Tony P)
>> instead of a separate check for sampling in fast path for the
>> final version.
>>
>> ##
>> - This patch I had to apply to get it to compile on JDK 10:
>>
>> diff -r ac3

Re: Low-Overhead Heap Profiling

2017-05-16 Thread Robbin Ehn

Just a few answers,

On 05/15/2017 06:48 PM, JC Beyler wrote:

Dear all,

I've updated the webrev to:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/ 



I'll look at this later, thanks!



Robbin,
I believe I have addressed most of your items with webrev 02:
   - I added a JTreg test to show how it works:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/raw_files/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c 

  - I've modified the code to use its own data structures both internally and externally, this will make it easier to move out of AsyncGetCallTrace as we move forward, that 
is still on my TODOs

  - I cleaned up the JVMTI API by passing a structure that handles the 
num_traces and put in a ReleaseTraces as well
  - I cleaned up other issues as well.

However, I have three questions, which are probably because I'm new in this 
community:
  1) My previous webrevs were based off of JDK9 by mistake. When I took JDK10 via : hg clone http://hg.openjdk.java.net/jdk10/jdk10  
jdk10
  - I don't see code compatible with what you were showing (ie your patches don't make sense for that code base; ex: klass is still accessed via klass() for example in 
collectedHeap.inline.hpp)

  - Would you know what is the right hg clone command so we are working on 
the same code base?


We use jdk10-hs, e.g.
hg tclone http://hg.openjdk.java.net/jdk10/hs 10-hs

There are sporadic big merges going from jdk9->jdk10->jdk10-hs and 
jdk10-hs->jdk10, so 10 is moving...



  2) You mentioned I was using os::malloc, new, NEW_C_HEAP_ARRAY; I cleaned out the os::malloc but which of the new vs NEW_C_HEAP_ARRAY should I use. It might be that I 
don't understand when one uses one or the other but I see both used around the code base?

- Is it that new is to be used for anything internal and NEW_C_HEAP_ARRAY 
anything provided to the JVMTI users outside of the JVM?


We overload the new operator when you extend the correct base class, e.g. 
CHeapObj, so use 'new'.
But for arrays you will need the macro NEW_C_HEAP_ARRAY.



  3) Casts: same kind question: which should I use. The code was using a bit of everything, I'll refactor it entirely but I was not clear if I should go to C casts or C++ 
casts as I see both in the codebase. What is the convention I should use?


Just be consistent, use what suits you; C++ casts might be preferable if we are 
moving towards C++11.
And use the 'right' cast, e.g. going from Thread* to JavaThread* you should use a C 
cast or static_cast, not reinterpret_cast, I would say.
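
As a concrete illustration of that advice (a sketch, not code from the
webrev; cast_example() is hypothetical, the rest is the usual
Thread/JavaThread hierarchy):

// JavaThread derives from Thread, so a checked downcast needs only
// static_cast or a C cast; no bit-level reinterpretation is involved.
void cast_example() {
  Thread* t = Thread::current();
  assert(t->is_Java_thread(), "must be a JavaThread");
  JavaThread* jt  = static_cast<JavaThread*>(t);  // preferred C++ style
  JavaThread* jt2 = (JavaThread*) t;              // equivalent C-style cast
  // reinterpret_cast would compile here too, but it signals the wrong
  // intent: reserve it for unrelated pointer/integer reinterpretation.
}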



Final notes on this webrev:
   - I am still missing:
 - Putting a TLAB implementation so that we can compare both webrevs
 - Have not tried to circumvent AsyncGetCallTrace
 - Putting in the handling of GC'd objects
 - Fix a stack walker issue I have seen, I think I know the problem and 
will test that theory out for the next webrev

I will work on integrating those items for the next webrev!


Thanks!



Thanks for your help,
Jc

Ps:  I tested this on a new repo:

hg clone http://hg.openjdk.java.net/jdk10/jdk10 
 jdk10
... building it
cd test
jtreg -nativepath:/build/linux-x86_64-normal-server-release/support/test/hotspot/jtreg/native/lib/ -jdk 
/linux-x86_64-normal-server-release/images/jdk ../hotspot/test/serviceability/jvmti/HeapMonitor/




I'll test it out!

/Robbin




On Thu, May 4, 2017 at 11:21 PM, serguei.spit...@oracle.com wrote:

Robbin,

Thank you for forwarding!
I will review it.

Thanks,
Serguei



On 5/4/17 02:13, Robbin Ehn wrote:

Hi,

To me the compiler changes look like what is expected.
It would be good if someone from the compiler team could take a look at that.
Added compiler to mail thread.

Also adding Serguei; it would be good to have his view also.

My initial take on it: I read through most of the code and took it for a 
ride.

##
- Regarding the compiler changes: I think we need the 'TLAB end' 
trickery (mentioned by Tony P)
instead of a separate check for sampling in fast path for the final 
version.

##
- This patch I had to apply to get it to compile on JDK 10:

diff -r ac3ded340b35 src/share/vm/gc/shared/collectedHeap.inline.hpp
--- a/src/share/vm/gc/shared/collectedHeap.inline.hpp   Fri Apr 28 
14:31:38 2017 +0200
+++ b/src/share/vm/gc/shared/collectedHeap.inline.hpp   Thu May 04 
10:22:56 2017 +0200
@@ -87,3 +87,3 @@
  // support for object alloc event (no-op most of the time)
-if (klass() != NULL && klass()->name() != NULL) {
+if (klass != NULL && klass->name() != NULL) {

Re: Low-Overhead Heap Profiling

2017-05-04 Thread serguei.spit...@oracle.com

Robbin,

Thank you for forwarding!
I will review it.

Thanks,
Serguei


On 5/4/17 02:13, Robbin Ehn wrote:

Hi,

To me the compiler changes look like what is expected.
It would be good if someone from the compiler team could take a look at that.
Added compiler to mail thread.

Also adding Serguei; it would be good to have his view also.

My initial take on it: I read through most of the code and took it for a 
ride.


##
- Regarding the compiler changes: I think we need the 'TLAB end' 
trickery (mentioned by Tony P)
instead of a separate check for sampling in fast path for the final 
version.


##
- This patch I had to apply to get it to compile on JDK 10:

diff -r ac3ded340b35 src/share/vm/gc/shared/collectedHeap.inline.hpp
--- a/src/share/vm/gc/shared/collectedHeap.inline.hpp   Fri Apr 28 
14:31:38 2017 +0200
+++ b/src/share/vm/gc/shared/collectedHeap.inline.hpp   Thu May 04 
10:22:56 2017 +0200

@@ -87,3 +87,3 @@
 // support for object alloc event (no-op most of the time)
-if (klass() != NULL && klass()->name() != NULL) {
+if (klass != NULL && klass->name() != NULL) {
   Thread *base_thread = Thread::current();
diff -r ac3ded340b35 src/share/vm/runtime/heapMonitoring.cpp
--- a/src/share/vm/runtime/heapMonitoring.cpp   Fri Apr 28 14:31:38 
2017 +0200
+++ b/src/share/vm/runtime/heapMonitoring.cpp   Thu May 04 10:22:56 
2017 +0200

@@ -316,3 +316,3 @@
   JavaThread *thread = reinterpret_cast<JavaThread *>(Thread::current());

-  assert(o->size() << LogHeapWordSize == byte_size,
+  assert(o->size() << LogHeapWordSize == (long)byte_size,
  "Object size is incorrect.");

##
- This patch I had to apply to get it not asserting during slowdebug:

--- a/src/share/vm/runtime/heapMonitoring.cpp   Fri Apr 28 15:15:16 
2017 +0200
+++ b/src/share/vm/runtime/heapMonitoring.cpp   Thu May 04 10:24:25 
2017 +0200

@@ -32,3 +32,3 @@
 // TODO(jcbeyler): should we make this into a JVMTI structure?
-struct StackTraceData {
+struct StackTraceData : CHeapObj<mtInternal> {
   ASGCT_CallTrace *trace;
@@ -143,3 +143,2 @@
 StackTraceStorage::StackTraceStorage() :
-_allocated_traces(new StackTraceData*[MaxHeapTraces]),
 _allocated_traces_size(MaxHeapTraces),
@@ -147,2 +146,3 @@
 _allocated_count(0) {
+  _allocated_traces = NEW_C_HEAP_ARRAY(StackTraceData*, 
MaxHeapTraces, mtInternal);
   memset(_allocated_traces, 0, sizeof(*_allocated_traces) * 
MaxHeapTraces);

@@ -152,3 +152,3 @@
 StackTraceStorage::~StackTraceStorage() {
-  delete[] _allocated_traces;
+  FREE_C_HEAP_ARRAY(StackTraceData*, _allocated_traces);
 }

- Classes should extend correct base class for which type of memory is 
used for it e.g.: CHeapObj or StackObj or AllStatic
- The style in heapMonitoring.cpp is a bit different from normal 
vm-style, e.g. using C++ casts instead of C. You mix NEW_C_HEAP_ARRAY, 
os::malloc and new.

- In jvmtiHeapTransition.hpp you use C cast instead.

##
- This patch I had to apply to get traces without setting an ‘unrelated’ 
capability

- Should this not be a new capability?

diff -r c02a5d8785bf src/share/vm/prims/forte.cpp
--- a/src/share/vm/prims/forte.cpp   Fri Apr 28 15:15:16 2017 +0200
+++ b/src/share/vm/prims/forte.cpp   Thu May 04 10:24:25 2017 +0200
@@ -530,6 +530,6 @@

-  if (!JvmtiExport::should_post_class_load()) {
+/*  if (!JvmtiExport::should_post_class_load()) {
 trace->num_frames = ticks_no_class_load; // -1
 return;
-  }
+  }*/

##
- forte.cpp: (I know this is not part of your changes but)
find_jmethod_id_or_null gives me NULL for my test.
It looks like we actually want the regular jmethod_id() ?

Since we are the thread we are talking about (and in the same ucontext), 
and the thread is in the VM and has a last Java frame,
I think most of the checks done in AsyncGetCallTrace are irrelevant, so 
you should be able to call forte_fill_call_trace_given_top directly.
But since we might need jmethod_id() if possible, to avoid getting a 
method id of NULL,
we need some fixes in the forte code, or just do the vframeStream loop 
inside heapMonitoring.cpp and not use forte.cpp.


Something like:

  if (jthread->has_last_Java_frame()) { // just to be safe
    vframeStream vfst(jthread);
    while (!vfst.at_end()) {
      Method* m = vfst.method();
      m->jmethod_id();
      m->line_number_from_bci(vfst.bci());
      vfst.next();
    }
  }

- This is a bit confusing in forte.cpp, trace->frames[count].lineno = 
bci.

Line number should be m->line_number_from_bci(bci);
Is heapMonitoring supposed to trace with bci or line number?
I would say bci, meaning we should either rename 
ASGCT_CallFrame→lineno or use another data structure which says bci.


##
- // TODO(jcbeyler): remove this extra code handling the extra trace for
Please fix all these TODO's :)

##
- heapMonitoring.hpp:
// TODO(jcbeyler): is this algorithm acceptable in open source?

Why is this comment here? What is the implication?
Have you tested any simpler algorithm?

Re: Low-Overhead Heap Profiling

2017-05-04 Thread Robbin Ehn

Hi,

To me the compiler changes look like what is expected.
It would be good if someone from the compiler team could take a look at that.
Added compiler to mail thread.

Also adding Serguei; it would be good to have his view also.

My initial take on it: I read through most of the code and took it for a ride.

##
- Regarding the compiler changes: I think we need the 'TLAB end' trickery 
(mentioned by Tony P)
instead of a separate check for sampling in fast path for the final version.
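
For readers new to the thread: the 'TLAB end' trickery means leaving the
inlined bump-pointer fast path untouched and instead pulling the TLAB's
visible end in to the next sample point, so the ordinary limit check fails
there and the allocation drops into the existing slow path, where the
sample can be taken. A rough sketch with hypothetical names, not the
actual patch:

// Fast path stays exactly: if (top + size <= end) { bump pointer; done; }
// Sampling is induced purely by where _end points. _real_end and
// arm_sample_point() are invented for this sketch.
void ThreadLocalAllocBuffer::arm_sample_point(size_t bytes_until_sample) {
  _real_end = _end;                                  // remember real limit
  HeapWord* sample_end = _top + bytes_until_sample / HeapWordSize;
  if (sample_end < _real_end) {
    _end = sample_end;            // ordinary limit check now fails here
  }
}
// Slow path, already taken whenever the limit check fails:
//   if (_end != _real_end) { take_sample(); _end = _real_end; retry; }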

##
- This patch I had to apply to get it to compile on JDK 10:

diff -r ac3ded340b35 src/share/vm/gc/shared/collectedHeap.inline.hpp
--- a/src/share/vm/gc/shared/collectedHeap.inline.hpp   Fri Apr 28 14:31:38 
2017 +0200
+++ b/src/share/vm/gc/shared/collectedHeap.inline.hpp   Thu May 04 10:22:56 
2017 +0200
@@ -87,3 +87,3 @@
 // support for object alloc event (no-op most of the time)
-if (klass() != NULL && klass()->name() != NULL) {
+if (klass != NULL && klass->name() != NULL) {
   Thread *base_thread = Thread::current();
diff -r ac3ded340b35 src/share/vm/runtime/heapMonitoring.cpp
--- a/src/share/vm/runtime/heapMonitoring.cpp   Fri Apr 28 14:31:38 2017 +0200
+++ b/src/share/vm/runtime/heapMonitoring.cpp   Thu May 04 10:22:56 2017 +0200
@@ -316,3 +316,3 @@
   JavaThread *thread = reinterpret_cast<JavaThread *>(Thread::current());
-  assert(o->size() << LogHeapWordSize == byte_size,
+  assert(o->size() << LogHeapWordSize == (long)byte_size,
  "Object size is incorrect.");

##
- This patch I had to apply to get it not asserting during slowdebug:

--- a/src/share/vm/runtime/heapMonitoring.cpp   Fri Apr 28 15:15:16 2017 +0200
+++ b/src/share/vm/runtime/heapMonitoring.cpp   Thu May 04 10:24:25 2017 +0200
@@ -32,3 +32,3 @@
 // TODO(jcbeyler): should we make this into a JVMTI structure?
-struct StackTraceData {
+struct StackTraceData : CHeapObj<mtInternal> {
   ASGCT_CallTrace *trace;
@@ -143,3 +143,2 @@
 StackTraceStorage::StackTraceStorage() :
-_allocated_traces(new StackTraceData*[MaxHeapTraces]),
 _allocated_traces_size(MaxHeapTraces),
@@ -147,2 +146,3 @@
 _allocated_count(0) {
+  _allocated_traces = NEW_C_HEAP_ARRAY(StackTraceData*, MaxHeapTraces, 
mtInternal);
   memset(_allocated_traces, 0, sizeof(*_allocated_traces) * MaxHeapTraces);
@@ -152,3 +152,3 @@
 StackTraceStorage::~StackTraceStorage() {
-  delete[] _allocated_traces;
+  FREE_C_HEAP_ARRAY(StackTraceData*, _allocated_traces);
 }

- Classes should extend correct base class for which type of memory is used for it 
e.g.: CHeapObj or StackObj or AllStatic
- The style in heapMonitoring.cpp is a bit different from normal vm-style, e.g. 
using C++ casts instead of C. You mix NEW_C_HEAP_ARRAY, os::malloc and new.
- In jvmtiHeapTransition.hpp you use C cast instead.

##
- This patch I had to apply to get traces without setting an ‘unrelated’ capability
- Should this not be a new capability?

diff -r c02a5d8785bf src/share/vm/prims/forte.cpp
--- a/src/share/vm/prims/forte.cpp  Fri Apr 28 15:15:16 2017 +0200
+++ b/src/share/vm/prims/forte.cpp  Thu May 04 10:24:25 2017 +0200
@@ -530,6 +530,6 @@

-  if (!JvmtiExport::should_post_class_load()) {
+/*  if (!JvmtiExport::should_post_class_load()) {
 trace->num_frames = ticks_no_class_load; // -1
 return;
-  }
+  }*/

##
- forte.cpp: (I know this is not part of your changes but)
find_jmethod_id_or_null gives me NULL for my test.
It looks like we actually want the regular jmethod_id() ?

Since we are the thread we are talking about (and in the same ucontext), and the 
thread is in the VM and has a last Java frame,
I think most of the checks done in AsyncGetCallTrace are irrelevant, so you 
should be able to call forte_fill_call_trace_given_top directly.
But since we might need jmethod_id() if possible, to avoid getting a method id 
of NULL,
we need some fixes in the forte code, or just do the vframeStream loop inside 
heapMonitoring.cpp and not use forte.cpp.

Something like:

  if (jthread->has_last_Java_frame()) { // just to be safe
    vframeStream vfst(jthread);
    while (!vfst.at_end()) {
      Method* m = vfst.method();
      m->jmethod_id();
      m->line_number_from_bci(vfst.bci());
      vfst.next();
    }
  }

- This is a bit confusing in forte.cpp, trace->frames[count].lineno = bci.
Line number should be m->line_number_from_bci(bci);
Is heapMonitoring supposed to trace with bci or line number?
I would say bci, meaning we should either rename ASGCT_CallFrame→lineno or use 
another data structure which says bci.

##
- // TODO(jcbeyler): remove this extra code handling the extra trace for
Please fix all these TODO's :)

##
- heapMonitoring.hpp:
// TODO(jcbeyler): is this algorithm acceptable in open source?

Why is this comment here? What is the implication?
Have you tested any simpler algorithm?

##
- Create a sanity jtreg test.

Re: Low-Overhead Heap Profiling

2017-04-21 Thread JC Beyler
Hi all,

I've added size information to the allocation sampling system. This allows
the callback to remember the size of each sampled allocation.
http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/
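
For readers skimming the thread, a hypothetical shape for such a sampled
record and its callback; this is illustrative only, not the webrev's
actual API (only ASGCT_CallTrace, from forte, and jlong are real types):

/* Illustrative only -- names and layout are made up for this sketch. */
typedef struct {
  ASGCT_CallTrace *trace;   /* stack at the allocation site */
  jlong size_bytes;         /* size of the sampled allocation */
} SampledAllocation;

typedef void (*HeapSampleCallback)(const SampledAllocation *sample);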

The new webrev.01 also adds the actual heap monitoring sampling system in
files:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/src/share/vm/runtime/heapMonitoring.cpp.patch
and
http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/src/share/vm/runtime/heapMonitoring.hpp.patch

My next step is to add the GC part to the webrev, which will allow users to
determine what objects are live and what are garbage.

Thanks for your attention and let me know if there are any questions!

Have a wonderful Friday!
Jc

On Mon, Apr 17, 2017 at 12:37 PM, JC Beyler  wrote:

> Hi all,
>
> I worked on getting a few numbers for overhead and accuracy for my
> feature. I'm unsure if here is the right place to provide the full data, so
> I am just summarizing here for now.
>
> - Overhead of the feature
>
> Using the Dacapo benchmark (http://dacapobench.org/), my initial results
> are that sampling adds about 2.4% overhead with a 512k sampling interval,
> 512k being our default setting.
>
> - Note: this was without the tradesoap, tradebeans and tomcat benchmarks
> since they did not work with my JDK9 (issue between Dacapo and JDK9 it
> seems)
> - I want to rerun next week to ensure number stability
>
> - Accuracy of the feature
>
> I wrote a small microbenchmark that allocates from two different
> stacktraces at a given ratio. For example, 10% of stacktrace S1 and 90%
> from stacktrace S2. The microbenchmark was run 20 times, I averaged the
> results and looked for accuracy. It seems that statistically it is sound
> since if I allocated 10% S1 and 90% S2, with a sampling rate of 512k, I
> obtained 9.61% S1 and 90.49% S2.
>
> Let me know if there are any questions on the numbers and if you'd like to
> see some more data.
>
> Note: this was done using our internal JDK8 implementation since the
> webrev provided by http://cr.openjdk.java.net/
> ~rasbold/heapz/webrev.00/index.html does not yet contain the whole
> implementation and therefore would have been misleading.
>
> Thanks,
> Jc
>
>
> On Tue, Apr 4, 2017 at 3:55 PM, JC Beyler  wrote:
>
>> Hi all,
>>
>> To move the discussion forward, with Chuck Rasbold's help to make a
>> webrev, we pushed this:
>> http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html
>> 415 lines changed: 399 ins; 13 del; 3 mod; 51122 unchg
>>
>> This is not a final change that does the whole proposition from the JBS
>> entry: https://bugs.openjdk.java.net/browse/JDK-8177374; what it does
>> show is parts of the implementation that is proposed and hopefully can
>> start the conversation going as I work through the details.
>>
>> For example, the changes to C2 are done here for the allocations:
>> http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/src/share/vm/opto/
>> macro.cpp.patch
>>
>> Hopefully this all makes sense and thank you for all your future comments!
>> Jc
>>
>>
>> On Tue, Dec 13, 2016 at 1:11 PM, JC Beyler  wrote:
>>
>>> Hello all,
>>>
>>> This is a follow-up from Jeremy's initial email from last year:
>>> http://mail.openjdk.java.net/pipermail/serviceability-dev/20
>>> 15-June/017543.html
>>>
>>> I've gone ahead and started working on preparing this and Jeremy and I
>>> went down the route of actually writing it up in JEP form:
>>> https://bugs.openjdk.java.net/browse/JDK-8171119
>>>
>>> I think the original conversation that happened last year in that thread
>>> still holds true:
>>>
>>>  - We have a patch at Google that we think others might be interested in
>>> - It provides a means to understand where the allocation hotspots
>>> are at a very low overhead
>>> - Since it is at a low overhead, we can leave it on by default
>>>
>>> So I come to the mailing list with Jeremy's initial question:
>>> "I thought I would ask if there is any interest / if I should write a
>>> JEP / if I should just forget it."
>>>
>>> A year ago, it seemed some thought it was a good idea, is this still
>>> true?
>>>
>>> Thanks,
>>> Jc
>>>
>>>
>>>
>>
>


Re: Low-Overhead Heap Profiling

2017-04-17 Thread JC Beyler
Hi all,

I worked on getting a few numbers for overhead and accuracy for my feature.
I'm unsure if here is the right place to provide the full data, so I am
just summarizing here for now.

- Overhead of the feature

Using the Dacapo benchmark (http://dacapobench.org/), my initial results
are that sampling adds about 2.4% overhead with a 512k sampling interval,
512k being our default setting.

- Note: this was without the tradesoap, tradebeans and tomcat benchmarks
since they did not work with my JDK9 (issue between Dacapo and JDK9 it
seems)
- I want to rerun next week to ensure number stability

- Accuracy of the feature

I wrote a small microbenchmark that allocates from two different
stacktraces at a given ratio. For example, 10% of stacktrace S1 and 90%
from stacktrace S2. The microbenchmark was run 20 times, I averaged the
results and looked for accuracy. It seems that statistically it is sound
since if I allocated 10% S1 and 90% S2, with a sampling rate of 512k, I
obtained 9.61% S1 and 90.49% S2.

Let me know if there are any questions on the numbers and if you'd like to
see some more data.

Note: this was done using our internal JDK8 implementation since the webrev
provided by http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html does
not yet contain the whole implementation and therefore would have been
misleading.

Thanks,
Jc


On Tue, Apr 4, 2017 at 3:55 PM, JC Beyler  wrote:

> Hi all,
>
> To move the discussion forward, with Chuck Rasbold's help to make a
> webrev, we pushed this:
> http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html
> 415 lines changed: 399 ins; 13 del; 3 mod; 51122 unchg
>
> This is not a final change that does the whole proposition from the JBS
> entry: https://bugs.openjdk.java.net/browse/JDK-8177374; what it does
> show is parts of the implementation that is proposed and hopefully can
> start the conversation going as I work through the details.
>
> For example, the changes to C2 are done here for the allocations:
> http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/src/share/vm/
> opto/macro.cpp.patch
>
> Hopefully this all makes sense and thank you for all your future comments!
> Jc
>
>
> On Tue, Dec 13, 2016 at 1:11 PM, JC Beyler  wrote:
>
>> Hello all,
>>
>> This is a follow-up from Jeremy's initial email from last year:
>> http://mail.openjdk.java.net/pipermail/serviceability-dev/20
>> 15-June/017543.html
>>
>> I've gone ahead and started working on preparing this and Jeremy and I
>> went down the route of actually writing it up in JEP form:
>> https://bugs.openjdk.java.net/browse/JDK-8171119
>>
>> I think the original conversation that happened last year in that thread
>> still holds true:
>>
>>  - We have a patch at Google that we think others might be interested in
>> - It provides a means to understand where the allocation hotspots are
>> at a very low overhead
>> - Since it is at a low overhead, we can leave it on by default
>>
>> So I come to the mailing list with Jeremy's initial question:
>> "I thought I would ask if there is any interest / if I should write a
>> JEP / if I should just forget it."
>>
>> A year ago, it seemed some thought it was a good idea, is this still true?
>>
>> Thanks,
>> Jc
>>
>>
>>
>


Re: Low-Overhead Heap Profiling

2015-08-16 Thread Jeremy Manson
Working on it.  I hope to have something to share in the next couple of
weeks.

Jeremy

On Wed, Aug 12, 2015 at 4:28 AM, Kees Jan Koster  wrote:

> Guys,
>
> This piqued my interest. Where can I find the draft JEP and the sources
> for this, please?
>
> Kees Jan
>
>
> > On 4 Aug 2015, at 23:22, Jeremy Manson  wrote:
> >
> > Thanks, Staffan.  I've been tinkering with a draft JEP, but it has been
> going slowly because of other craziness in my life.  Some responses:
> >
> > 1) It was a fair bit of work to do it, mostly because it was the first
> time I had tinkered with c2.  This was 5 years ago.  Most of the work since
> then has been dealing with merge conflicts.
> >
> > 2) I did envision a JVMTI interface.  More in the JEP.
> >
> > 3) We have some internal tests, but obviously, we'd have to figure out
> how to externalize them.  For native code, it's much easier only to have to
> worry about building and running on Linux.  Presumably, there's already
> code in there for JVMTI.
> >
> > I'll try to get a JEP going for real, although it might not happen in
> the next week or so, because I'm moving next week (+the JVMLS).
> >
> > Jeremy
> >
> > On Tue, Aug 4, 2015 at 4:04 AM, Staffan Larsen <
> staffan.lar...@oracle.com> wrote:
> > Hi,
> >
> > Sorry for being away for so long.
> >
> > If memory serves me right then the current AllocObjectInNewTLAB JFR
> event was written as a way to quickly get some allocation profiling
> information with a minimum of VM changes. It probably also carried over to
> Hotspot from JRockit. I agree that we can and should collect better
> information than what that event does.
> >
> > I like the approach of instrumenting the interpreter and compiler to do
> the sampling. I had thought it would be a lot of work to do it and was
> reluctant to go down that path. If the work is already done and Jeremy has
> maintained it for a few years without great problems, I think we should
> look at bringing it in. I am not worried about the overhead of a decrement
> and a compare in the allocation path, but of course we should benchmark
> that.
> >
> > It wasn’t clear to me from the discussion how (or if) this was being
> exposed to end users. Was the proposal to just have the native entry points
> in the VM and have these be used by various agents? Would this be part of
> JVMTI?
> >
> > I think a draft JEP would be the logical next step and make it easier
> for us all to discuss what exactly the proposal should contain. Also, let’s
> not forget the need for tests for the feature.
> >
> > Thanks,
> > /Staffan
> >
> >
> >
>
>
> --
> Kees Jan
>
> http://java-monitor.com/
> kjkos...@kjkoster.org
> +31651838192
>
> The secret of success lies in the stability of the goal. -- Benjamin
> Disraeli
>
>


Re: Low-Overhead Heap Profiling

2015-08-12 Thread Kees Jan Koster
Guys,

This piqued my interest. Where can I find the draft JEP and the sources for 
this, please?

Kees Jan


> On 4 Aug 2015, at 23:22, Jeremy Manson  wrote:
> 
> Thanks, Staffan.  I've been tinkering with a draft JEP, but it has been going 
> slowly because of other craziness in my life.  Some responses:
> 
> 1) It was a fair bit of work to do it, mostly because it was the first time I 
> had tinkered with c2.  This was 5 years ago.  Most of the work since then has 
> been dealing with merge conflicts.
> 
> 2) I did envision a JVMTI interface.  More in the JEP.
> 
> 3) We have some internal tests, but obviously, we'd have to figure out how to 
> externalize them.  For native code, it's much easier only to have to worry 
> about building and running on Linux.  Presumably, there's already code in 
> there for JVMTI.
> 
> I'll try to get a JEP going for real, although it might not happen in the 
> next week or so, because I'm moving next week (+the JVMLS).
> 
> Jeremy
> 
> On Tue, Aug 4, 2015 at 4:04 AM, Staffan Larsen  
> wrote:
> Hi,
> 
> Sorry for being away for so long.
> 
> If memory serves me right then the current AllocObjectInNewTLAB JFR event was 
> written as a way to quickly get some allocation profiling information with a 
> minimum of VM changes. It probably also carried over to Hotspot from JRockit. 
> I agree that we can and should collect better information than what that 
> event does.
> 
> I like the approach of instrumenting the interpreter and compiler to do the 
> sampling. I had thought it would be a lot of work to do it and was reluctant 
> to go down that path. If the work is already done and Jeremy has maintained 
> it for a few years without great problems, I think we should look at bringing 
> it in. I am not worried about the overhead of a decrement and a compare in 
> the allocation path, but of course we should benchmark that.
> 
> It wasn’t clear to me from the discussion how (or if) this was being exposed 
> to end users. Was the proposal to just have the native entry points in the VM 
> and have these be used by various agents? Would this be part of JVMTI?
> 
> I think a draft JEP would be the logical next step and make it easier for us 
> all to discuss what exactly the proposal should contain. Also, let’s not 
> forget the need for tests for the feature.
> 
> Thanks,
> /Staffan
> 
> 
> 


--
Kees Jan

http://java-monitor.com/
kjkos...@kjkoster.org
+31651838192

The secret of success lies in the stability of the goal. -- Benjamin Disraeli





Re: Low-Overhead Heap Profiling

2015-08-04 Thread Staffan Larsen
Hi,

Sorry for being away for so long.

If memory serves me right then the current AllocObjectInNewTLAB JFR event was 
written as a way to quickly get some allocation profiling information with a 
minimum of VM changes. It probably also carried over to Hotspot from JRockit. I 
agree that we can and should collect better information than what that event 
does. 

I like the approach of instrumenting the interpreter and compiler to do the 
sampling. I had thought it would be a lot of work to do it and was reluctant to 
go down that path. If the work is already done and Jeremy has maintained it for 
a few years without great problems, I think we should look at bringing it in. I 
am not worried about the overhead of a decrement and a compare in the 
allocation path, but of course we should benchmark that.

It wasn’t clear to me from the discussion how (or if) this was being exposed to 
end users. Was the proposal to just have the native entry points in the VM and 
have these be used by various agents? Would this be part of JVMTI?

I think a draft JEP would be the logical next step and make it easier for us 
all to discuss what exactly the proposal should contain. Also, let’s not forget 
the need for tests for the feature.

Thanks,
/Staffan




Re: Low-Overhead Heap Profiling

2015-06-28 Thread Jeremy Manson
>
>
>
> On Thu, Jun 25, 2015 at 2:23 PM, Tony Printezis 
>  wrote:
>
>> BTW, Could we get a reaction from the Oracle folks on this?
>>
>
I contacted Staffan (who is probably the right person) offline, and he is,
apparently, on vacation until the beginning of August.

Unless there is someone else at Oracle who might be the right person, we
might have to hold off on a final decision until then.  In the meantime, we
can probably continue the conversation - I might draft a JEP.

Jeremy


Re: Low-Overhead Heap Profiling

2015-06-27 Thread Kirk Pepperdine
Hi Jeremy,

> 
> The answer to that is not to use safepoint-biased execution profilers, I 
> think.  

Thank you for the advice. I’ve been advocating for a number of years that 
people understand the sampling bias that exists in all profilers. I also would 
consider safepoint bias to be the most egregious form of sampling bias possible. 
This is why I’m very happy to see Richard’s Honest profiler (using 
AsyncGetCallTrace) show up and also to have inspired Pierre Laporte to start 
work on a lock profiler project at our (un)conference (jCrete). As an aside, we 
hope jCrete can help inspire and support more efforts like this.

My desire here is to minimize the effects of sample bias on the 
results. In this regard, I like the approach you are proposing. My concern 
with Tony’s approach was that it would introduce a size-based sampling bias. 
As I mentioned before, GC cost is one thing to consider; however, I’m more 
concerned with frequency-related allocation costs and their effect on 
application throughput.

Also, I think we have other techniques that can infer where an allocation has 
taken place, be it in a TLAB or even in tenured. I’m not sure that we have to 
profile for this type of information, though, if we can get it for almost free…..

Kind regards,
Kirk



Re: Low-Overhead Heap Profiling

2015-06-26 Thread Karen Kinnear
Thanks Jeremy - that helps -  not such a big deal in terms of maintaining - so 
back to the bigger discussion.

thanks,
Karen

On Jun 26, 2015, at 1:34 AM, Jeremy Manson wrote:

> Karen,
> 
> I understand your concerns.  For reference, this is the additional code in 
> the x86 assembler.  There are very small stubs in C1 and the template 
> interpreter to call out to this macro (forgive the gratuitous use of the 
> string "google" - we sprinkle it around a bit because it makes it a little 
> easier to distinguish our code from upstream code).
> 
> #define GOOGLE_HEAP_MONITORING(ma, thread, var_size_in_bytes, 
> con_size_in_bytes, object, t1, t2, sample_invocation) \
> do {\
>   { \
> SkipIfEqual skip_if(ma, &GoogleHeapMonitor, 0); \
> Label skip_sample;  \
> Register thr = thread;  \
> if (!thr->is_valid()) { \
>   NOT_LP64(assert(t1 != noreg,  \
>   "Need temporary register for constants"));\
>   thr = NOT_LP64(t1) LP64_ONLY(r15_thread); \
>   NOT_LP64(ma -> get_thread(thr);)  \
> }   \
> /* Trigger heap monitoring event */ \
> Address bus(thr,\
> JavaThread::google_bytes_until_sample_offset());\
> \
> if (var_size_in_bytes->is_valid()) {\
>   ma -> NOT_LP64(subl) LP64_ONLY(subq)(bus, var_size_in_bytes); \
> } else {\
>   int csib = (con_size_in_bytes);   \
>   assert(t2 != noreg,   \
>  "Need temporary register for constants");  \
>   ma -> NOT_LP64(movl) LP64_ONLY(mov64)(t2, csib);  \
>   ma -> NOT_LP64(subl) LP64_ONLY(subq)(bus, t2);\
> }   \
> \
> ma -> jcc(Assembler::positive, skip_sample);\
> \
> {   \
>   sample_invocation \
> }   \
> ma -> bind(skip_sample);\
>   } \
> } while(0)
> 
> It's not all that hard to port to additional architectures, but we'll have to 
> think about it.
> 
> Jeremy
> 
> On Thu, Jun 25, 2015 at 1:41 PM, Karen Kinnear  
> wrote:
> Jeremy,
> 
> Did I follow this correctly - that your approach modifies the compilers and 
> interpreters and Tony's modifies the
> common allocation code?
> 
> Given that the number of compilers and interpreters and interpreter platforms 
> keeps expanding - I'd like to
> add a vote to have heap allocation profiling in common allocation code.
> 
> thanks,
> Karen
> 
> On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:
> 
>> Hi Jeremy,
>> 
>> Inline.
>> 
>> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com) 
>> wrote:
>> 
>>> 
>>> 
>>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis  
>>> wrote:
>>> Hi Jeremy,
>>> 
>>> Please see inline.
>>> 
>>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) 
>>> wrote:
>>> 
 I don't want the size of the TLAB, which is ergonomically adjusted, to be 
 tied to the sampling rate.  There is no reason to do that.  I want 
 reasonable statistical sampling of the allocations.  
>>> 
>>> 
>>> As I said explicitly in my e-mail, I totally agree with this. Which is why 
>>> I never suggested to resize TLABs in order to vary the sampling rate. 
>>> (Apologies if my e-mail was not clear.)
>>> 
>>> 
>>> My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs 
>>> entirely
>> 
>> 
>> This is correct: We’ll also have to intercept the outside-TLAB allocs. But, 
>> IMHO, this is a feature as it’s helpful to know how many (and which) 
>> allocations happen outside TLABs. These are generally very infrequent (and 
>> slow anyway), so sampling all of those, instead of only sampling some of 
>> them, does not have much of an overhead. But, you could also do sampling for 
>> the outside-TLAB allocs too, if you want: just accumulate their size on a 
>> separate per-thread counter and sample the one that bumps that counter 
>> goes over a limit.

Re: Low-Overhead Heap Profiling

2015-06-26 Thread Kirk Pepperdine
> 
> Do you mean “sample every X ms, say”?
> 
> 
> This is not impossible, but a little weird. 

Yes, you are right, this is weird and I don’t think it is necessary for this type 
of profiling. Please ignore.

Regards,
Kirk



Re: Low-Overhead Heap Profiling

2015-06-26 Thread Kirk Pepperdine
Hi Tony,

>> 
>> I would think that the size based sampling would create a size based bias in 
>> your sampling. 
> 
> 
> That’s actually true. And this could be good (if you’re interested in what’s 
> filling up your eden, the larger objects might be of more interest) or bad 
> (if you want to get a general idea of what’s being allocated, the size bias 
> might make you miss some types of objects / allocation sites).
> 
> 

IME, if I see allocation rates > ~500MB/sec I know there are some significant 
wins I can make by tuning allocation hotspots. So, I’m interested in frequent 
allocations no matter what the size.
> 
>> Since, IME, it’s allocation frequency that is more damaging to performance, I’d 
>> prefer to see time-boxed sampling
> 
> 
> Do you mean “sample every X ms, say”?
> 
> 

I don’t find that sampling rate has to be that high to find hot allocation 
sites.

Kind regards,
Kirk Pepperdine

> 
> Tony
> 
> 
> 
>> 
>> Kind regards,
>> Kirk Pepperdine
>> 
> 
> 
> -
> 
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
> 
> @TonyPrintezis
> tprinte...@twitter.com



Re: Low-Overhead Heap Profiling

2015-06-26 Thread Kirk Pepperdine
Hi Jeremy,

Sorry I wasn’t so clear: it’s not about collection, it’s about allocation. In 
this regard it’s not about size, it’s about the frequency. People tend to 
allocate small objects frequently and they will avoid allocating large objects 
frequently. The assumption is, large is expensive but small isn’t. These events 
will show up using execution profilers but, given the safe-point bias of 
execution profilers and other factors, it’s often clearer to view this problem 
using memory profilers.

Kind regards,
Kirk

On Jun 25, 2015, at 7:34 PM, Jeremy Manson  wrote:

> Why would allocation frequency be more damaging to performance?  Allocation 
> is cheap, and as long as they become dead before the YG collection, it costs 
> the same to collect one 1MB object as it does to collect 1000 1K objects.
> 
> Jeremy
> 
> On Wed, Jun 24, 2015 at 11:54 PM, Kirk Pepperdine  
> wrote:
>> 
>> But, seriously, why didn’t you like my proposal? It can do anything your 
>> scheme can with fewer and simpler code changes. The only thing that it 
>> cannot do is to sample based on object count (i.e., every 100 objects) 
>> instead of based on object size (i.e., every 1MB of allocations). But I 
>> think doing sampling based on size is the right approach here (IMHO).
>> 
>> 
> 
> I would think that the size based sampling would create a size based bias in 
> your sampling. Since, IME, it’s allocation frequency that is more damaging to 
> performance, I’d prefer to see time-boxed sampling
> 
> Kind regards,
> Kirk Pepperdine
> 
> 



Re: Low-Overhead Heap Profiling

2015-06-25 Thread Jeremy Manson
Another thought.  Since:

- It would be kind of surprising for Thread->allocated_bytes() to be
different from the number used as the interval for tracking (e.g., if your
interval is, say, 512K, you check allocated bytes, it says 0, you allocate
512K, you check allocated bytes, it says 512K, but no sample was taken),
AND
- We're already taking the maintenance hit to maintain the allocated bytes
counter everywhere,

Maybe a good compromise would be to piggyback on the allocated bytes
counter?  If the allocated bytes counter is at N, and we pick a next
sampling interval of K, we set a per-thread variable to N+K, and everywhere
we increment the allocated bytes counter, we just test to see if it is
greater than N+K?

This would add an additional load and another easily predicted branch, but
no additional subtraction.  Also, it would have very obvious and tractable
modifications to make in existing places that already have logic for the
counter, so there wouldn't be much of an additional maintenance burden.
Finally, it would more-or-less address my concerns, because the non-TLAB
fast paths I'm worried about are already instrumented for it.
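
In code, that compromise would look roughly like this (a sketch:
_next_sample_point, HeapMonitoring::take_sample and
next_sampling_interval() are hypothetical names, not existing HotSpot
code; only _allocated_bytes and incr_allocated_bytes are real):

// Per-thread: _allocated_bytes already exists; the hypothetical
// _next_sample_point is re-armed to _allocated_bytes + K whenever a
// fresh interval K is chosen.
void Thread::incr_allocated_bytes(jlong size) {
  _allocated_bytes += size;                      // existing counter bump
  if (_allocated_bytes > _next_sample_point) {   // one load, one branch
    HeapMonitoring::take_sample(this);           // slow path
    _next_sample_point = _allocated_bytes + next_sampling_interval();
  }
}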

Jeremy

On Thu, Jun 25, 2015 at 10:27 PM, Jeremy Manson 
wrote:

>
>
> On Thu, Jun 25, 2015 at 1:28 PM, Tony Printezis 
> wrote:
>
>> Hi Jeremy,
>>
>> Inline.
>>
>> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com)
>> wrote:
>>
>>
>>
>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis 
>>  wrote:
>>
>>> Hi Jeremy,
>>>
>>> Please see inline.
>>>
>>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com)
>>> wrote:
>>>
>>> I don't want the size of the TLAB, which is ergonomically adjusted, to
>>> be tied to the sampling rate.  There is no reason to do that.  I want
>>> reasonable statistical sampling of the allocations.
>>>
>>>
>>> As I said explicitly in my e-mail, I totally agree with this. Which is
>>> why I never suggested to resize TLABs in order to vary the sampling rate.
>>> (Apologies if my e-mail was not clear.)
>>>
>>
>> My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs
>> entirely
>>
>>
>> This is correct: We’ll also have to intercept the outside-TLAB allocs.
>> But, IMHO, this is a feature as it’s helpful to know how many (and which)
>> allocations happen outside TLABs. These are generally very infrequent (and
>> slow anyway), so sampling all of those, instead of only sampling some of
>> them, does not have much of an overhead. But, you could also do sampling
>> for the outside-TLAB allocs too, if you want: just accumulate their size on
>> a separate per-thread counter and sample the one that bumps that counter
>> goes over a limit.
>>
>>
> The outside-TLAB allocations generally get caught anyway, because they
> tend to be large enough to jump over the sample size immediately.
>
>
>> An additional observation (orthogonal to the main point, but I thought
>> I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also
>> know which generation the object ended up in (e.g., young gen or
>> direct-to-old-gen). This is very helpful in some situations when you’re
>> trying to work out which allocation(s) grew the old gen occupancy between
>> two young GCs.
>>
>
> True.  We don't have this implemented, but it would be reasonably
> straightforward to glean it from the oop.
>
>
>> FWIW, the existing JFR events follow the approach I described above:
>>
>> * one event for each new TLAB + first alloc in that TLAB (my proposal
>> basically generalizes this and removes the 1-1 relationship between object
>> alloc sampling and new TLAB operation)
>>
>> * one event for all allocs outside a TLAB
>>
>> I think the above separation is helpful. But if you think it could
>> confuse users, you can of course easily just combine the information (but I
>> strongly believe it’s better to report the information separately).
>>
>
> I do think it would make a confusing API.  It might make more sense to
> have a reporting mechanism that had a set number of fields with very
> concrete information (size, class, stacktrace), but allowed for
> platform-specific metadata.  We end up with a very long list of things we
> want in the sample: generation (how do you describe a generation?), object
> age (by number of GCs survived?  What kind of GC?), was it a TLAB
> allocation, etc.
>
>
> (and, in fact, not work if TLAB support is turned off)?
>>
>>
>> Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a
>> genuine question.)
>>
>
> I don't think they do.  I have turned them off for various reasons
> (usually, I'm trying to instrument allocations and I don't want to muck
> about with thinking about TLABs), and the code paths seem a little crufty.
> ISTR at some point finding something that clearly only worked by mistake,
> but I can't remember now what it was.
>
> [snip]
>
>
>
>>   However, you can do pretty much anything from the VM itself.  Crucially
>>> (for us), we don't just log the stack traces, we also keep track of which
>>> are live and which aren't.

Re: Low-Overhead Heap Profiling

2015-06-25 Thread Jeremy Manson
On Thu, Jun 25, 2015 at 1:55 PM, Tony Printezis 
wrote:

> Bernd,
>
> I like the idea of buffering up the sampled objects in some data
> structure. But I assume it’d have to be a per-thread data structure to
> avoid contention issues. So, we’ll also need a periodic task that collects
> all such data structures and makes them available (somehow) to whoever
> wants to consume them?
>

This is easily done.  But, per my last message, I don't think it should be
the default.  It can just be available as another callback, if you want it.

Jeremy


Re: Low-Overhead Heap Profiling

2015-06-25 Thread Jeremy Manson
Karen,

I understand your concerns.  For reference, this is the additional code in
the x86 assembler.  There are very small stubs in C1 and the template
interpreter to call out to this macro (forgive the gratuitous use of the
string "google" - we sprinkle it around a bit because it makes it a little
easier to distinguish our code from upstream code).

#define GOOGLE_HEAP_MONITORING(ma, thread, var_size_in_bytes,
con_size_in_bytes, object, t1, t2, sample_invocation) \
do {\
  { \
SkipIfEqual skip_if(ma, &GoogleHeapMonitor, 0); \
Label skip_sample;  \
Register thr = thread;  \
if (!thr->is_valid()) { \
  NOT_LP64(assert(t1 != noreg,  \
  "Need temporary register for constants"));\
  thr = NOT_LP64(t1) LP64_ONLY(r15_thread); \
  NOT_LP64(ma -> get_thread(thr);)  \
}   \
/* Trigger heap monitoring event */ \
Address bus(thr,\
JavaThread::google_bytes_until_sample_offset());\
\
if (var_size_in_bytes->is_valid()) {\
  ma -> NOT_LP64(subl) LP64_ONLY(subq)(bus, var_size_in_bytes); \
} else {\
  int csib = (con_size_in_bytes);   \
  assert(t2 != noreg,   \
 "Need temporary register for constants");  \
  ma -> NOT_LP64(movl) LP64_ONLY(mov64)(t2, csib);  \
  ma -> NOT_LP64(subl) LP64_ONLY(subq)(bus, t2);\
}   \
\
ma -> jcc(Assembler::positive, skip_sample);\
\
{   \
  sample_invocation \
}   \
ma -> bind(skip_sample);\
  } \
} while(0)

It's not all that hard to port to additional architectures, but we'll have
to think about it.

Jeremy

On Thu, Jun 25, 2015 at 1:41 PM, Karen Kinnear 
wrote:

> Jeremy,
>
> Did I follow this correctly - that your approach modifies the compilers
> and interpreters and Tony's modifies the
> common allocation code?
>
> Given that the number of compilers and interpreters and interpreter
> platforms keeps expanding - I'd like to
> add a vote to have heap allocation profiling in common allocation code.
>
> thanks,
> Karen
>
> On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:
>
> Hi Jeremy,
>
> Inline.
>
> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com)
> wrote:
>
>
>
> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis 
> wrote:
>
>> Hi Jeremy,
>>
>> Please see inline.
>>
>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com)
>> wrote:
>>
>> I don't want the size of the TLAB, which is ergonomically adjusted, to be
>> tied to the sampling rate.  There is no reason to do that.  I want
>> reasonable statistical sampling of the allocations.
>>
>>
>> As I said explicitly in my e-mail, I totally agree with this. Which is
>> why I never suggested to resize TLABs in order to vary the sampling rate.
>> (Apologies if my e-mail was not clear.)
>>
>
> My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs
> entirely
>
>
> This is correct: We’ll also have to intercept the outside-TLAB allocs.
> But, IMHO, this is a feature as it’s helpful to know how many (and which)
> allocations happen outside TLABs. These are generally very infrequent (and
> slow anyway), so sampling all of those, instead of only sampling some of
> them, does not have much of an overhead. But, you could also do sampling
> for the outside-TLAB allocs too, if you want: just accumulate their size on
> a separate per-thread counter and sample the one that bumps that counter
> goes over a limit.
>
> An additional observation (orthogonal to the main point, but I thought I’d
> mention it anyway): For the outside-TLAB allocs it’d be helpful to also
> know which generation the object ended up in (e.g., young gen or
> direct-to-old-gen). This is very helpful in some situations when you’re
> trying to work out which allocation(s) grew the old gen occupancy between
> two young GCs.

Re: Low-Overhead Heap Profiling

2015-06-25 Thread Jeremy Manson
On Thu, Jun 25, 2015 at 1:28 PM, Tony Printezis 
wrote:

> Hi Jeremy,
>
> Inline.
>
> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com)
> wrote:
>
>
>
> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis 
> wrote:
>
>> Hi Jeremy,
>>
>> Please see inline.
>>
>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com)
>> wrote:
>>
>> I don't want the size of the TLAB, which is ergonomically adjusted, to be
>> tied to the sampling rate.  There is no reason to do that.  I want
>> reasonable statistical sampling of the allocations.
>>
>>
>> As I said explicitly in my e-mail, I totally agree with this. Which is
>> why I never suggested to resize TLABs in order to vary the sampling rate.
>> (Apologies if my e-mail was not clear.)
>>
>
> My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs
> entirely
>
>
> This is correct: We’ll also have to intercept the outside-TLAB allocs.
> But, IMHO, this is a feature as it’s helpful to know how many (and which)
> allocations happen outside TLABs. These are generally very infrequent (and
> slow anyway), so sampling all of those, instead of only sampling some of
> them, does not have much of an overhead. But, you could also do sampling
> for the outside-TLAB allocs too, if you want: just accumulate their size on
> a separate per-thread counter and sample the one that bumps that counter
> goes over a limit.
>
>
The outside-TLAB allocations generally get caught anyway, because they tend
to be large enough to jump over the sample size immediately.


> An additional observation (orthogonal to the main point, but I thought I’d
> mention it anyway): For the outside-TLAB allocs it’d be helpful to also
> know which generation the object ended up in (e.g., young gen or
> direct-to-old-gen). This is very helpful in some situations when you’re
> trying to work out which allocation(s) grew the old gen occupancy between
> two young GCs.
>

True.  We don't have this implemented, but it would be reasonably
straightforward to glean it from the oop.


> FWIW, the existing JFR events follow the approach I described above:
>
> * one event for each new TLAB + first alloc in that TLAB (my proposal
> basically generalizes this and removes the 1-1 relationship between object
> alloc sampling and new TLAB operation)
>
> * one event for all allocs outside a TLAB
>
> I think the above separation is helpful. But if you think it could confuse
> users, you can of course easily just combine the information (but I
> strongly believe it’s better to report the information separately).
>

I do think it would make a confusing API.  It might make more sense to have
a reporting mechanism that had a set number of fields with very concrete
information (size, class, stacktrace), but allowed for platform-specific
metadata.  We end up with a very long list of things we want in the sample:
generation (how do you describe a generation?), object age (by number of
GCs survived?  What kind of GC?), was it a TLAB allocation, etc.


(and, in fact, not work if TLAB support is turned off)?
>
>
> Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine
> question.)
>

I don't think they do.  I have turned them off for various reasons
(usually, I'm trying to instrument allocations and I don't want to muck
about with thinking about TLABs), and the code paths seem a little crufty.
ISTR at some point finding something that clearly only worked by mistake,
but I can't remember now what it was.

[snip]



>   However, you can do pretty much anything from the VM itself.  Crucially
>> (for us), we don't just log the stack traces, we also keep track of which
>> are live and which aren't.  We can't do this in a callback, if the callback
>> can't create weak refs to the object.
>>
>> What we do at Google is to have two methods: one that you pass a callback
>> to (the callback gets invoked with a StackTraceData object, as I've defined
>> above), and another that just tells you which sampled objects are still
>> live.  We could also add a third, which allowed a callback to set the
>> sampling interval (basically, the VM would call it to get the integer
>> number of bytes to be allocated before the next sample).
>>
>> Would people be amenable to that?  It makes the code more complex, but,
>> as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB
>> object come from?").
>>
>>
>> Well, that 1GB object would have most likely been allocated outside a
>> TLAB and you could have identified it by instrumenting the “outside-of-TLAB
>> allocation path” (just saying…).
>>
>
> That's orthogonal to the point I was making in the quote above - the point
> I was making there was that we want to be able to detect what sampled
> objects are live.  We can do that regardless of how we implement the
> sampling (although it did involve my making a new kind of weak oop
> processing mechanism inside the VM).
>
>
> Yeah, I was thinking of doing something similar (tracking obj

Re: Low-Overhead Heap Profiling

2015-06-25 Thread Tony Printezis
BTW, could we get a reaction from the Oracle folks on this? Even though Jeremy 
and I are proposing different implementation approaches, we both agree 
(and Jeremy, please correct me on this) that having an allocation sampling 
mechanism that’s more flexible than what’s already in HotSpot (in particular: the 
sampling frequency not being tied to the TLAB size) would be a very helpful 
profiling feature. Is this something we should pursue contributing?

Tony

On June 24, 2015 at 1:57:44 PM, Tony Printezis (tprinte...@twitter.com) wrote:

Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied 
to the sampling rate.  There is no reason to do that.  I want reasonable 
statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I 
never suggested to resize TLABs in order to vary the sampling rate. (Apologies 
if my e-mail was not clear.)




All this requires is a separate counter that is set to the next sampling 
interval, and decremented when an allocation happens, which goes into a slow 
path when the decrement hits 0.  Doing a subtraction and a pointer bump in 
allocation instead of just a pointer bump is basically free.  


Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other folks 
care about.



Note that it has been doing an additional addition (to keep track of per thread 
allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size 
allocated per thread or number of allocated objects per thread? If it’s the 
former, why isn’t it possible to calculate that from the TLABs information?



and no one has complained.

I'm not worried about the ease of implementation here, because we've already 
implemented it.  


Yeah, but someone will have to maintain it moving forward.



It hasn't even been hard for us to do the forward port, except when the 
relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 
2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned 
off (i.e., you have the tax without any benefit).



In fact, we could just have the sampling code do nothing, and no one would ever 
notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made 
it more expensive to grab a stack trace (the cost became proportional to the 
number of loaded classes), but we have a patch that mitigates that, which we 
would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism 
is that there is quite a lot you can't do from user code during an allocation, 
because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them 
using JNI from the slow path you call directly from the allocation code?



  However, you can do pretty much anything from the VM itself.  Crucially (for 
us), we don't just log the stack traces, we also keep track of which are live 
and which aren't.  We can't do this in a callback, if the callback can't create 
weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to 
(the callback gets invoked with a StackTraceData object, as I've defined 
above), and another that just tells you which sampled objects are still live.  
We could also add a third, which allowed a callback to set the sampling 
interval (basically, the VM would call it to get the integer number of bytes to 
be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I 
say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object 
come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and 
you could have identified it by instrumenting the “outside-of-TLAB allocation 
path” (just saying…).

But, seriously, why didn’t you like my proposal? It can do anything your scheme 
can with fewer and simpler code changes. The only thing that it cannot do is to 
sample based on object count (i.e., every 100 objects) instead of based on 
object size (i.e., every 1MB of allocations). But I think doing sampling based 
on size is the right approach here (IMHO).

Tony




Jeremy


On Tue, Jun 23, 2015 at 1:06 PM, Tony Printezis  wrote:
Jeremy (and all),

I’m not on the serviceability list so I won’t include the messages so far. :-) 
Also CCing the hotspot GC list, in case they have some feedback on this.

Could I suggest a (much) simpler but at least as powerful and flexible way to 
do this? (This is something we’ve been meaning to do for a while now for 
TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force 
allocations to go into the slow path periodically by artificially setting the 
TLAB top to a lower value.

Re: Low-Overhead Heap Profiling

2015-06-25 Thread Tony Printezis
Hi Kirk,

(long time!) See inline.

On June 25, 2015 at 2:54:04 AM, Kirk Pepperdine (kirk.pepperd...@gmail.com) 
wrote:


But, seriously, why didn’t you like my proposal? It can do anything your scheme 
can with fewer and simpler code changes. The only thing that it cannot do is to 
sample based on object count (i.e., every 100 objects) instead of based on 
object size (i.e., every 1MB of allocations). But I think doing sampling based 
on size is the right approach here (IMHO).



I would think that size-based sampling would create a size-based bias in 
your sampling. 


That’s actually true. And this could be good (if you’re interested in what’s 
filling up your eden, the larger objects might be of more interest) or bad (if 
you want to get a general idea of what’s being allocated, the size bias might 
make you miss some types of objects / allocation sites).



Since, IME, it’s allocation frequency that is more damaging to performance, I’d 
prefer to see time-boxed sampling


Do you mean “sample every X ms, say”?



Tony




Kind regards,
Kirk Pepperdine



-

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprinte...@twitter.com



Re: Low-Overhead Heap Profiling

2015-06-25 Thread Tony Printezis
Bernd,

I like the idea of buffering up the sampled objects in some data structure. 
But I assume it’d have to be a per-thread data structure to avoid contention 
issues. So, we’ll also need a periodic task that collects all such data 
structures and makes them available (somehow) to whoever wants to consume them?

Tony

On June 24, 2015 at 7:49:06 PM, Bernd Eckenfels (e...@zusammenkunft.net) wrote:

On Wed, 24 Jun 2015 16:26:35 -0700,  
Jeremy Manson wrote:  
> > As for the other concern: my concern about *just* having the  
> > callback mechanism is that there is quite a lot you can't do from  
> > user code during an allocation, because of lack of access to JNI.  
> >  
> >  
> > Maybe I missed something. Are the callbacks in Java? I.e., do you  
> > call them using JNI from the slow path you call directly from the  
> > allocation code?  
> >  
> > (For context: this referred to the hypothetical feature where we can  
> provide a callback that invokes some code from allocation.)  

What about a hypothetical queueing feature, so you can process the  
events asynchronously (perhaps with some backpressure control)? This  
would work well for statistics processing.  

(Your other use case, the throwing of OOM, would not work, I guess.)  

But it's an elegant solution to provide a code environment generic enough  
for all kinds of instrumentation and independent of the "allocation  
recursion".  

Greetings  
Bernd  
-

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprinte...@twitter.com



Re: Low-Overhead Heap Profiling

2015-06-25 Thread Karen Kinnear
Jeremy,

Did I follow this correctly - that your approach modifies the compilers and 
interpreters and Tony's modifies the
common allocation code?

Given that the number of compilers and interpreters and interpreter platforms 
keeps expanding - I'd like to
add a vote to have heap allocation profiling in common allocation code.

thanks,
Karen

On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:

> Hi Jeremy,
> 
> Inline.
> 
> On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com) wrote:
> 
>> 
>> 
>> On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis  
>> wrote:
>> Hi Jeremy,
>> 
>> Please see inline.
>> 
>> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) 
>> wrote:
>> 
>>> I don't want the size of the TLAB, which is ergonomically adjusted, to be 
>>> tied to the sampling rate.  There is no reason to do that.  I want 
>>> reasonable statistical sampling of the allocations.  
>> 
>> 
>> As I said explicitly in my e-mail, I totally agree with this. Which is why I 
>> never suggested to resize TLABs in order to vary the sampling rate. 
>> (Apologies if my e-mail was not clear.)
>> 
>> 
>> My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs 
>> entirely
> 
> 
> This is correct: We’ll also have to intercept the outside-TLAB allocs. But, 
> IMHO, this is a feature as it’s helpful to know how many (and which) 
> allocations happen outside TLABs. These are generally very infrequent (and 
> slow anyway), so sampling all of those, instead of only sampling some of 
> them, does not have much of an overhead. But, you could also do sampling for 
> the outside-TLAB allocs too, if you want: just accumulate their size on a 
> separate per-thread counter and sample the one that bumps that counter 
> over the limit.
> 
> An additional observation (orthogonal to the main point, but I thought I’d 
> mention it anyway): For the outside-TLAB allocs it’d be helpful to also know 
> which generation the object ended up in (e.g., young gen or 
> direct-to-old-gen). This is very helpful in some situations when you’re 
> trying to work out which allocation(s) grew the old gen occupancy between two 
> young GCs.
> 
> FWIW, the existing JFR events follow the approach I described above:
> 
> * one event for each new TLAB + first alloc in that TLAB (my proposal 
> basically generalizes this and removes the 1-1 relationship between object 
> alloc sampling and new TLAB operation)
> 
> * one event for all allocs outside a TLAB
> 
> I think the above separation is helpful. But if you think it could confuse 
> users, you can of course easily just combine the information (but I strongly 
> believe it’s better to report the information separately).
> 
> 
> 
>> (and, in fact, not work if TLAB support is turned off)? 
> 
> 
> Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine 
> question.)
> 
> 
> 
>>  I might be missing something obvious (and see my response below).
> 
> 
>>  
>>> All this requires is a separate counter that is set to the next sampling 
>>> interval, and decremented when an allocation happens, which goes into a 
>>> slow path when the decrement hits 0.  Doing a subtraction and a pointer 
>>> bump in allocation instead of just a pointer bump is basically free.  
>> 
>> 
>> Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other 
>> folks care about.
>> 
>> Really?  A memory read and a subtraction?  Which architectures care about 
>> that?
> 
> 
> I was not concerned with the read and subtraction; I was more concerned with 
> the conditional that follows them (Intel has great branch prediction).
> 
> And a personal pet peeve (based on past experience): How many “free” 
> instructions do you have to add before they are not free any more?
> 
> 
> 
>> 
>> Again, notice that no one has complained about the addition that was added 
>> for total bytes allocated per thread.  I note that was actually added in the 
>> 6u20 timeframe.
>> 
>>> Note that it has been doing an additional addition (to keep track of per 
>>> thread allocation) as part of allocation since Java 7, 
>> 
>> 
>> Interesting. I hadn’t realized that. Does that keep track of total size 
>> allocated per thread or number of allocated objects per thread? If it’s the 
>> former, why isn’t it possible to calculate that from the TLABs information?
>> 
>> 
>> Total size allocated per thread.  It isn't possible to calculate that from 
>> the TLAB because of out-of-TLAB allocation 
> 
> 
> The allocating Thread is passed to the slow (outside-TLAB) alloc path so it 
> would be trivial to update the per-thread allocation stats from there too (in 
> fact, it does; see below).
> 
> 
> 
>> (and hypothetically disabled TLABs).
> 
> 
> Does anyone care? :-)
> 
> 
> 
>> 
>> For some reason, they never included it in the ThreadMXBean interface, but 
>> it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean 
>> to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.

Re: Low-Overhead Heap Profiling

2015-06-25 Thread Tony Printezis
Hi Jeremy,

Inline.

On June 24, 2015 at 7:26:55 PM, Jeremy Manson (jeremyman...@google.com) wrote:



On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis  wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied 
to the sampling rate.  There is no reason to do that.  I want reasonable 
statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I 
never suggested to resize TLABs in order to vary the sampling rate. (Apologies 
if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs entirely


This is correct: We’ll also have to intercept the outside-TLAB allocs. But, 
IMHO, this is a feature as it’s helpful to know how many (and which) 
allocations happen outside TLABs. These are generally very infrequent (and slow 
anyway), so sampling all of those, instead of only sampling some of them, does 
not have much of an overhead. But, you could also do sampling for the 
outside-TLAB allocs too, if you want: just accumulate their size on a separate 
per-thread counter and sample the one that bumps that counter over the limit.

An additional observation (orthogonal to the main point, but I thought I’d 
mention it anyway): For the outside-TLAB allocs it’d be helpful to also know 
which generation the object ended up in (e.g., young gen or direct-to-old-gen). 
This is very helpful in some situations when you’re trying to work out which 
allocation(s) grew the old gen occupancy between two young GCs.

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically 
generalizes this and removes the 1-1 relationship between object alloc sampling 
and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse 
users, you can of course easily just combine the information (but I strongly 
believe it’s better to report the information separately).



(and, in fact, not work if TLAB support is turned off)? 


Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine 
question.)



 I might be missing something obvious (and see my response below).


 
All this requires is a separate counter that is set to the next sampling 
interval, and decremented when an allocation happens, which goes into a slow 
path when the decrement hits 0.  Doing a subtraction and a pointer bump in 
allocation instead of just a pointer bump is basically free.  


Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other folks 
care about.

Really?  A memory read and a subtraction?  Which architectures care about that?


I was not concerned with the read and subtraction; I was more concerned with 
the conditional that follows them (Intel has great branch prediction).

And a personal pet peeve (based on past experience): How many “free” 
instructions do you have to add before they are not free any more?




Again, notice that no one has complained about the addition that was added for 
total bytes allocated per thread.  I note that was actually added in the 6u20 
timeframe.

Note that it has been doing an additional addition (to keep track of per thread 
allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size 
allocated per thread or number of allocated objects per thread? If it’s the 
former, why isn’t it possible to calculate that from the TLABs information?


Total size allocated per thread.  It isn't possible to calculate that from the 
TLAB because of out-of-TLAB allocation 


The allocating Thread is passed to the slow (outside-TLAB) alloc path so it 
would be trivial to update the per-thread allocation stats from there too (in 
fact, it does; see below).



(and hypothetically disabled TLABs).


Does anyone care? :-)




For some reason, they never included it in the ThreadMXBean interface, but it 
is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a 
com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.


Thanks for the tip. I’ll look into this...

and no one has complained.

I'm not worried about the ease of implementation here, because we've already 
implemented it.  


Yeah, but someone will have to maintain it moving forward.


I've been maintaining it internally to Google for 5 years.  It's actually 
pretty self-contained.  The only work involved is when they refactor something 
(so I've had to move it), or when a bug in the existing implementation is 
discovered.  It is very closely parallel to the TLAB code, which doesn't change 
much / at all.


The TLAB code has really not changed much for a while. ;-) (though I haven’t 
looked at the JDK 9 source very closely…)

It hasn't even been hard for us to do the forward port, except when the 
relevant Hotspot code is significantly refactored.

Re: Low-Overhead Heap Profiling

2015-06-24 Thread Kirk Pepperdine
> 
> But, seriously, why didn’t you like my proposal? It can do anything your 
> scheme can with fewer and simpler code changes. The only thing that it cannot 
> do is to sample based on object count (i.e., every 100 objects) instead of 
> based on object size (i.e., every 1MB of allocations). But I think doing 
> sampling based on size is the right approach here (IMHO).
> 
> 

I would think that size-based sampling would create a size-based bias in 
your sampling. Since, IME, it’s allocation frequency that is more damaging to 
performance, I’d prefer to see time-boxed sampling

Kind regards,
Kirk Pepperdine



Re: Low-Overhead Heap Profiling

2015-06-24 Thread Bernd Eckenfels
On Wed, 24 Jun 2015 16:26:35 -0700,
Jeremy Manson wrote:
> > As for the other concern: my concern about *just* having the
> > callback mechanism is that there is quite a lot you can't do from
> > user code during an allocation, because of lack of access to JNI.
> >
> >
> > Maybe I missed something. Are the callbacks in Java? I.e., do you
> > call them using JNI from the slow path you call directly from the
> > allocation code?
> >
> > (For context: this referred to the hypothetical feature where we can
> provide a callback that invokes some code from allocation.)

What about a hypothetical queueing feature, so you can process the
events asynchronously (perhaps with some backpressure control)? This
would work well for statistics processing.

(Your other use case, the throwing of OOM, would not work, I guess.)

But it's an elegant solution to provide a code environment generic enough
for all kinds of instrumentation and independent of the "allocation
recursion".

Greetings
Bernd
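
One way Bernd's queueing idea could combine with Tony's per-thread buffering
question above: a bounded single-producer/single-consumer ring per allocating
thread, dropping events when full as a crude form of backpressure, drained
periodically by a collector thread. All names here are invented for the
sketch; this is not code from the thread:

#include <array>
#include <atomic>
#include <cstddef>

struct SampleEvent { void* obj; size_t size; };

class SampleRing {  // one per thread: the owner produces, a collector consumes
  static const size_t N = 1024;
  std::array<SampleEvent, N> buf_;
  std::atomic<size_t> head_{0};  // consumer position
  std::atomic<size_t> tail_{0};  // producer position
public:
  bool try_push(const SampleEvent& e) {  // called by the allocating thread
    size_t t = tail_.load(std::memory_order_relaxed);
    if (t - head_.load(std::memory_order_acquire) == N) {
      return false;  // full: drop the event rather than stall allocation
    }
    buf_[t % N] = e;
    tail_.store(t + 1, std::memory_order_release);
    return true;
  }
  bool try_pop(SampleEvent* out) {  // called by the collector thread
    size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire)) {
      return false;  // empty
    }
    *out = buf_[h % N];
    head_.store(h + 1, std::memory_order_release);
    return true;
  }
};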


Re: Low-Overhead Heap Profiling

2015-06-24 Thread Jeremy Manson
On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis 
wrote:

> Hi Jeremy,
>
> Please see inline.
>
> On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com)
> wrote:
>
> I don't want the size of the TLAB, which is ergonomically adjusted, to be
> tied to the sampling rate.  There is no reason to do that.  I want
> reasonable statistical sampling of the allocations.
>
>
> As I said explicitly in my e-mail, I totally agree with this. Which is why
> I never suggested to resize TLABs in order to vary the sampling rate.
> (Apologies if my e-mail was not clear.)
>

My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs
entirely (and, in fact, not work if TLAB support is turned off)?  I might
be missing something obvious (and see my response below).


> All this requires is a separate counter that is set to the next sampling
> interval, and decremented when an allocation happens, which goes into a
> slow path when the decrement hits 0.  Doing a subtraction and a pointer
> bump in allocation instead of just a pointer bump is basically free.
>
>
> Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other
> folks care about.
>
Really?  A memory read and a subtraction?  Which architectures care about
that?

Again, notice that no one has complained about the addition that was added
for total bytes allocated per thread.  I note that was actually added in
the 6u20 timeframe.

Note that it has been doing an additional addition (to keep track of per
> thread allocation) as part of allocation since Java 7,
>
>
> Interesting. I hadn’t realized that. Does that keep track of total size
> allocated per thread or number of allocated objects per thread? If it’s the
> former, why isn’t it possible to calculate that from the TLABs information?
>

Total size allocated per thread.  It isn't possible to calculate that from
the TLAB because of out-of-TLAB allocation (and hypothetically disabled
TLABs).

For some reason, they never included it in the ThreadMXBean interface, but
it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean
to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on
it.


> and no one has complained.
>
> I'm not worried about the ease of implementation here, because we've
> already implemented it.
>
>
> Yeah, but someone will have to maintain it moving forward.
>

I've been maintaining it internally to Google for 5 years.  It's actually
pretty self-contained.  The only work involved is when they refactor
something (so I've had to move it), or when a bug in the existing
implementation is discovered.  It is very closely parallel to the TLAB
code, which doesn't change much / at all.


> It hasn't even been hard for us to do the forward port, except when the
> relevant Hotspot code is significantly refactored.
>
> We can also turn the sampling off, if we want.  We can set the sampling
> rate to 2^32, have the sampling code do nothing, and no one will ever
> notice.
>
>
> You still have extra instructions in the allocation path, so it’s not
> turned off (i.e., you have the tax without any benefit).
>
>
Hey, you have a counter in your allocation path you've never noticed, which
none of your code uses.  Pipelining is a wonderful thing.  :)

In fact, we could just have the sampling code do nothing, and no one would
> ever notice.
>
> Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8
> made it more expensive to grab a stack trace (the cost became proportional
> to the number of loaded classes), but we have a patch that mitigates that,
> which we would also be happy to upstream.
>
> As for the other concern: my concern about *just* having the callback
> mechanism is that there is quite a lot you can't do from user code during
> an allocation, because of lack of access to JNI.
>
>
> Maybe I missed something. Are the callbacks in Java? I.e., do you call
> them using JNI from the slow path you call directly from the allocation
> code?
>
> (For context: this referred to the hypothetical feature where we can
provide a callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but
let's call it hypothetical for the moment.)

We invoke native code.  You can't invoke any Java code during allocation,
including calling JNI methods, because that would make allocation
potentially reentrant, which doesn't work for all sorts of reasons.  The
native code doesn't even get passed a JNIEnv * - there is nothing it can do
with it without making the VM crash a lot.

Or, rather, you might be able to do that, but it would take a lot of
Hotspot rearchitecting.  When I tried to do it, I realized it would be an
extremely deep dive.

  However, you can do pretty much anything from the VM itself.  Crucially
> (for us), we don't just log the stack traces, we also keep track of which
> are live and which aren't.  We can't do this in a callback, if the callback
> can't create weak refs to the object.

Re: Low-Overhead Heap Profiling

2015-06-24 Thread Tony Printezis
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson (jeremyman...@google.com) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied 
to the sampling rate.  There is no reason to do that.  I want reasonable 
statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I 
never suggested to resize TLABs in order to vary the sampling rate. (Apologies 
if my e-mail was not clear.)




All this requires is a separate counter that is set to the next sampling 
interval, and decremented when an allocation happens, which goes into a slow 
path when the decrement hits 0.  Doing a subtraction and a pointer bump in 
allocation instead of just a pointer bump is basically free.  


Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other folks 
care about.



Note that it has been doing an additional addition (to keep track of per thread 
allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size 
allocated per thread or number of allocated objects per thread? If it’s the 
former, why isn’t it possible to calculate that from the TLABs information?



and no one has complained.

I'm not worried about the ease of implementation here, because we've already 
implemented it.  


Yeah, but someone will have to maintain it moving forward.



It hasn't even been hard for us to do the forward port, except when the 
relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 
2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned 
off (i.e., you have the tax without any benefit).



In fact, we could just have the sampling code do nothing, and no one would ever 
notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made 
it more expensive to grab a stack trace (the cost became proportional to the 
number of loaded classes), but we have a patch that mitigates that, which we 
would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism 
is that there is quite a lot you can't do from user code during an allocation, 
because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them 
using JNI from the slow path you call directly from the allocation code?



  However, you can do pretty much anything from the VM itself.  Crucially (for 
us), we don't just log the stack traces, we also keep track of which are live 
and which aren't.  We can't do this in a callback, if the callback can't create 
weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to 
(the callback gets invoked with a StackTraceData object, as I've defined 
above), and another that just tells you which sampled objects are still live.  
We could also add a third, which allowed a callback to set the sampling 
interval (basically, the VM would call it to get the integer number of bytes to 
be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I 
say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object 
come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and 
you could have identified it by instrumenting the “outside-of-TLAB allocation 
path” (just saying…).

But, seriously, why didn’t you like my proposal? It can do anything your scheme 
can with fewer and simpler code changes. The only thing that it cannot do is to 
sample based on object count (i.e., every 100 objects) instead of based on 
object size (i.e., every 1MB of allocations). But I think doing sampling based 
on size is the right approach here (IMHO).

Tony




Jeremy


On Tue, Jun 23, 2015 at 1:06 PM, Tony Printezis  wrote:
Jeremy (and all),

I’m not on the serviceability list so I won’t include the messages so far. :-) 
Also CCing the hotspot GC list, in case they have some feedback on this.

Could I suggest a (much) simpler but at least as powerful and flexible way to 
do this? (This is something we’ve been meaning to do for a while now for 
TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force 
allocations to go into the slow path periodically by artificially setting the 
TLAB top to a lower value. So, imagine a TLAB is 4M. You can set top to 
(bottom+1M). When an allocation thinks the TLAB is full (in this case, the 
first 1MB is full) it will call the allocation slow path. There, you can 
intercept it, sample the allocation (and, like in your case, you’ll also have 
the correct stack trace), notice that the TLAB is not actually full, extend its 
top to, say, (bottom+2M), and you’re done.

Advantages of this approach:

* This is a much smaller, simpler, and self-contained change (no compiler 
changes necessary to maintain...).

Re: Low-Overhead Heap Profiling

2015-06-23 Thread Jeremy Manson
I don't want the size of the TLAB, which is ergonomically adjusted, to be
tied to the sampling rate.  There is no reason to do that.  I want
reasonable statistical sampling of the allocations.

All this requires is a separate counter that is set to the next sampling
interval, and decremented when an allocation happens, which goes into a
slow path when the decrement hits 0.  Doing a subtraction and a pointer
bump in allocation instead of just a pointer bump is basically free.  Note
that it has been doing an additional addition (to keep track of per thread
allocation) as part of allocation since Java 7, and no one has complained.
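
A minimal sketch of that counter scheme, with invented names (not actual
HotSpot code): the TLAB fast path does one extra subtraction next to the
pointer bump and drops into a slow path when the counter crosses zero.

#include <cstddef>
#include <cstdint>

struct ThreadAlloc {
  char*   tlab_top;
  char*   tlab_end;
  int64_t bytes_until_sample;  // initialized to the next sampling interval

  void* allocate(size_t size) {
    char* obj = tlab_top;
    if (obj + size > tlab_end) {
      return allocate_slow(size);  // refill the TLAB or allocate outside it
    }
    tlab_top = obj + size;         // the usual pointer bump
    bytes_until_sample -= size;    // the extra subtraction
    if (bytes_until_sample <= 0) {
      take_sample(obj, size);      // record a stack trace, reset the counter
    }
    return obj;
  }

  void* allocate_slow(size_t size);
  void  take_sample(void* obj, size_t size);
};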

I'm not worried about the ease of implementation here, because we've
already implemented it.  It hasn't even been hard for us to do the forward
port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling
rate to 2^32, have the sampling code do nothing, and no one will ever
notice.  In fact, we could just have the sampling code do nothing, and no
one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8
made it more expensive to grab a stack trace (the cost became proportional
to the number of loaded classes), but we have a patch that mitigates that,
which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback
mechanism is that there is quite a lot you can't do from user code during
an allocation, because of lack of access to JNI.  However, you can do
pretty much anything from the VM itself.  Crucially (for us), we don't just
log the stack traces, we also keep track of which are live and which
aren't.  We can't do this in a callback, if the callback can't create weak
refs to the object.

What we do at Google is to have two methods: one that you pass a callback
to (the callback gets invoked with a StackTraceData object, as I've defined
above), and another that just tells you which sampled objects are still
live.  We could also add a third, which allowed a callback to set the
sampling interval (basically, the VM would call it to get the integer
number of bytes to be allocated before the next sample).
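
Purely illustrative signatures for those three methods; these are not a real
or proposed JVMTI API, just a sketch of the shape being described:

#include <jvmti.h>

struct StackTraceData;  // as defined elsewhere in this thread

// 1. Register a callback invoked with each sample.
typedef void (*SampleCallback)(const StackTraceData* sample);
jvmtiError RegisterAllocationSampleCallback(jvmtiEnv* env, SampleCallback cb);

// 2. Ask which sampled objects are still live.
jvmtiError GetLiveSampledTraces(jvmtiEnv* env, jint* count_ptr,
                                StackTraceData*** traces_ptr);

// 3. Let a callback choose the number of bytes until the next sample.
typedef jlong (*NextIntervalCallback)(void);
jvmtiError SetSamplingIntervalCallback(jvmtiEnv* env, NextIntervalCallback cb);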

Would people be amenable to that?  It makes the code more complex, but, as
I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB
object come from?").

Jeremy


On Tue, Jun 23, 2015 at 1:06 PM, Tony Printezis 
wrote:

> Jeremy (and all),
>
> I’m not on the serviceability list so I won’t include the messages so far.
> :-) Also CCing the hotspot GC list, in case they have some feedback on this.
>
> Could I suggest a (much) simpler but at least as powerful and flexible way
> to do this? (This is something we’ve been meaning to do for a while now for
> TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force
> allocations to go into the slow path periodically by artificially setting
> the TLAB top to a lower value. So, imagine a TLAB is 4M. You can set top to
> (bottom+1M). When an allocation thinks the TLAB is full (in this case, the
> first 1MB is full) it will call the allocation slow path. There, you can
> intercept it, sample the allocation (and, like in your case, you’ll also
> have the correct stack trace), notice that the TLAB is not actually full,
> extend its top to, say, (bottom+2M), and you’re done.
>
> Advantages of this approach:
>
> * This is a much smaller, simpler, and self-contained change (no compiler
> changes necessary to maintain...).
>
> * When it’s off, the overhead is only one extra test at the slow path TLAB
> allocation (i.e., negligible; we do some sampling on TLABs in TwitterJDK
> using a similar mechanism and, when it’s off, I’ve observed no performance
> overhead).
>
> * (most importantly) You can turn this on and off, and adjust the sampling
> rate, dynamically. If you do the sampling based on JITed code, you’ll have
> to recompile all methods with allocation sites to turn the sampling on or
> off. (You can of course have it always on and just discard the output; it’d
> be nice not to have to do that though. IMHO, at least.)
>
> * You can also very cheaply turn this on and off (or adjust the sampling
> frequency) per thread, if that’d be helpful in some way (just add the
> appropriate info on the thread’s TLAB).
>
> A few extra comments on the previous discussion:
>
> * "JFR samples per new TLAB allocation. It provides really very good
> picture and I haven't seen overhead more than 2” : When TLABs get very
> large, I don’t think sampling one object per TLAB is enough to get a good
> sample (IMHO, at least). It’s probably OK for something like jbb which
> mostly allocates instances of a handful of classes and has very few
> allocation sites. But, a lot of the code we run at Twitter is a lot more
> elaborate than that and, in our experience, sampling one object per TLAB is
> not enough. You can, of course, decrease the TLAB size to increase the
> sampling size.

Re: Low-Overhead Heap Profiling

2015-06-23 Thread Tony Printezis
Jeremy (and all),

I’m not on the serviceability list so I won’t include the messages so far. :-) 
Also CCing the hotspot GC list, in case they have some feedback on this.

Could I suggest a (much) simpler but at least as powerful and flexible way to 
do this? (This is something we’ve been meaning to do for a while now for 
TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force 
allocations to go into the slow path periodically by artificially setting the 
TLAB top to a lower value. So, imagine a TLAB is 4M. You can set top to 
(bottom+1M). When an allocation thinks the TLAB is full (in this case, the 
first 1MB is full) it will call the allocation slow path. There, you can 
intercept it, sample the allocation (and, like in your case, you’ll also have 
the correct stack trace), notice that the TLAB is not actually full, extend its 
top to, say, (bottom+2M), and you’re done.
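
A sketch of that TLAB-top trick with invented names; the real TLAB
bookkeeping has more moving parts, so treat this as the idea only:

#include <cstddef>

struct Tlab {
  char*  bottom;
  char*  top;              // next allocation position (bumped by the fast path)
  char*  end;              // what the fast path believes is the end
  char*  actual_end;       // the real end of the buffer
  size_t sample_interval;  // e.g. 1M in the example above

  // Pull "end" below the real end so the fast path traps into the slow
  // path after roughly sample_interval more bytes.
  void arm_sample_point() {
    char* next = top + sample_interval;
    end = (next < actual_end) ? next : actual_end;
  }

  // Slow path, reached when the fast path thinks the TLAB is full. Returns
  // true if this was only a sample point and the caller should retry the
  // fast-path bump allocation; false if the TLAB is genuinely full.
  bool hit_sample_point(size_t size) {
    if (end == actual_end) {
      return false;          // really full: refill, or allocate outside TLABs
    }
    take_sample(top, size);  // the stack trace is correct here, as noted
    arm_sample_point();      // move the artificial end forward again
    return true;
  }

  void take_sample(void* at, size_t size);
};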

Advantages of this approach:

* This is a much smaller, simpler, and self-contained change (no compiler 
changes necessary to maintain...).

* When it’s off, the overhead is only one extra test at the slow path TLAB 
allocation (i.e., negligible; we do some sampling on TLABs in TwitterJDK using 
a similar mechanism and, when it’s off, I’ve observed no performance overhead).

* (most importantly) You can turn this on and off, and adjust the sampling 
rate, dynamically. If you do the sampling based on JITed code, you’ll have to 
recompile all methods with allocation sites to turn the sampling on or off. 
(You can of course have it always on and just discard the output; it’d be nice 
not to have to do that though. IMHO, at least.)

* You can also very cheaply turn this on and off (or adjust the sampling 
frequency) per thread, if that’d be helpful in some way (just add the 
appropriate info on the thread’s TLAB).

A few extra comments on the previous discussion:

* "JFR samples per new TLAB allocation. It provides really very good picture 
and I haven't seen overhead more than 2” : When TLABs get very large, I don’t 
think sampling one object per TLAB is enough to get a good sample (IMHO, at 
least). It’s probably OK for something like jbb which mostly allocates 
instances of a handful of classes and has very few allocation sites. But, a lot 
of the code we run at Twitter is a lot more elaborate than that and, in our 
experience, sampling one object per TLAB is not enough. You can, of course, 
decrease the TLAB size to increase the sampling size. But it’d be good not to 
have to do that given a smaller TLAB size could increase contention across 
threads.

* "Should it *just* take a stack trace, or should the behavior be 
configurable?” : I think we’d have to separate the allocation sampling 
mechanism from the consumption of the allocation samples. Once the sampling 
mechanism is in, different JVMs can take advantage of it in different ways. I 
assume that the Oracle folks would like at least a JFR event for every such 
sample. But in your build you can add extra code to collect the information in 
the way you have now.

* Talking of JFR, it’s a bit unfortunate that the AllocObjectInNewTLAB event 
has both the new TLAB information and the allocation information. It would have 
been nice if that event was split into two, say NewTLAB and AllocObjectInTLAB, 
and we’d be able to fire the latter for each sample.

* "Should the interval between samples be configurable?” : Totally. In fact, 
it’d be helpful if it was configurable dynamically. Imagine if a JVM starts 
misbehaving after 2-3 weeks of running. You can dynamically increase the 
sampling rate to get a better profile if the default is not giving fine-grained 
enough information.

* "As long of these features don’t contribute to sampling bias” : If the 
sampling interval is fixed, sampling bias would be a very real concern. In the 
above example, I’d increment top by 1M (the sampling frequency) + p% (a fudge 
factor). 

* "Yes, a perhaps optional callbacks would be nice too.” : Oh, no. :-) But, as 
I said, we should definitely separate the sampling mechanism from the mechanism 
that consumes the samples.

* "Another problem with our submitting things is that we can't really test on 
anything other than Linux.” : Another reason to go with as 
platform-independent a solution as possible. :-)
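
Following up the sampling-bias bullet above: the mail only says "a fudge
factor", so this is an assumption, but one common de-biasing choice is to
draw each interval from an exponential distribution with the configured
mean, making the samples a Poisson process over allocated bytes:

#include <cstdint>
#include <random>

// Bytes until the next sample, e.g. mean_bytes = 1M in the example above.
int64_t next_sample_interval(int64_t mean_bytes, std::mt19937_64& rng) {
  std::exponential_distribution<double> dist(1.0 / double(mean_bytes));
  return int64_t(dist(rng)) + 1;  // at least one byte
}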

Regards,

Tony

-

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprinte...@twitter.com



Re: Low-Overhead Heap Profiling

2015-06-23 Thread Mario Torre
2015-06-23 19:01 GMT+02:00 Jeremy Manson :
> If it is a blocker for this work, we can find a way to test on OS X and
> Windows.

Yeah, I think if this is going to be a general interface, we need to
test on all the official platforms.

Cheers,
Mario

> Jeremy
>
> On Tue, Jun 23, 2015 at 9:14 AM, Jeremy Manson 
> wrote:
>>
>> I haven't tried!  Another problem with our submitting things is that we
>> can't really test on anything other than Linux.
>>
>> The reason we use ASGCT is that our modified version of ASGCT gathers
>> native frames *and* Java frames, getting a mixed mode stack trace that
>> crosses JNI boundaries.  I haven't checked on the details, but we could
>> probably use a more standard stack trace gathering mechanism for general
>> consumption.
>>
>> As I said, I think we'd have to change that to jvmtiStackInfo.  Would
>> folks like to see some proposed JVMTI interfaces for this?
>>
>> Jeremy
>>
>> On Tue, Jun 23, 2015 at 6:14 AM, Daniel D. Daugherty
>>  wrote:
>>>
>>> > ASGCT_CallTrace *trace;
>>>
>>> So it looks like this uses the AsyncGetTrace() infrastructure.
>>> Does your tool work on Windows and MacOS X?
>>>
>>> Dan
>>>
>>>
>>>
>>> On 6/22/15 10:58 PM, Jeremy Manson wrote:

 While I'm glad to hear that our colleagues at RedHat would love it, it
 will still need a sponsor from Oracle.  Any volunteers?

 It will also need a little polish to be available to the world.  Open
 design questions for upstreaming are things like:

 - Should the interval between samples be configurable?

 - Should it *just* take a stack trace, or should the behavior be
 configurable?  For example, we have a variant that allows it to invoke a
 callback on allocation. Do you want this?  Because it is being called 
 during
 allocation, the callback can't invoke JNI (because of the potential for a
 safepoint), so it might be somewhat confusing to the user.

 - If the answer to the above is yes, should it be able to invoke
 *multiple* callbacks with multiple intervals?  That could get very 
 expensive
 and hairy.

 - Other than stack trace, what kind of information should the sampled
 data contain?  Right now, the structure is:

 struct StackTraceData {
   ASGCT_CallTrace *trace;
   jint byte_size;
   jlong thread_id;
   const jbyte *name;
   jint name_length;
   jlong uid;
 };

 (where name is the class name, and uid is just a unique identifier.)
 For general consumption, we'll probably have to change the ASGCT_CallTrace
 to a jvmtiStackInfo, I suppose.

 Jeremy
>>>
>>>
>>
>



-- 
pgp key: http://subkeys.pgp.net/ PGP Key ID: 80F240CF
Fingerprint: BA39 9666 94EC 8B73 27FA  FC7C 4086 63E3 80F2 40CF

Java Champion - Blog: http://neugens.wordpress.com - Twitter: @neugens
Proud GNU Classpath developer: http://www.classpath.org/
OpenJDK: http://openjdk.java.net/projects/caciocavallo/

Please, support open standards:
http://endsoftpatents.org/


Re: Low-Overhead Heap Profiling

2015-06-23 Thread Jeremy Manson
If it is a blocker for this work, we can find a way to test on OS X and
Windows.

Jeremy

On Tue, Jun 23, 2015 at 9:14 AM, Jeremy Manson 
wrote:

> I haven't tried!  Another problem with our submitting things is that we
> can't really test on anything other than Linux.
>
> The reason we use ASGCT is that our modified version of ASGCT gathers
> native frames *and* Java frames, getting a mixed mode stack trace that
> crosses JNI boundaries.  I haven't checked on the details, but we could
> probably use a more standard stack trace gathering mechanism for general
> consumption.
>
> As I said, I think we'd have to change that to jvmtiStackInfo.  Would
> folks like to see some proposed JVMTI interfaces for this?
>
> Jeremy
>
> On Tue, Jun 23, 2015 at 6:14 AM, Daniel D. Daugherty <
> daniel.daughe...@oracle.com> wrote:
>
>> > ASGCT_CallTrace *trace;
>>
>> So it looks like this uses the AsyncGetTrace() infrastructure.
>> Does your tool work on Windows and MacOS X?
>>
>> Dan
>>
>>
>>
>> On 6/22/15 10:58 PM, Jeremy Manson wrote:
>>
>>> While I'm glad to hear that our colleagues at RedHat would love it, it
>>> will still need a sponsor from Oracle.  Any volunteers?
>>>
>>> It will also need a little polish to be available to the world.  Open
>>> design questions for upstreaming are things like:
>>>
>>> - Should the interval between samples be configurable?
>>>
>>> - Should it *just* take a stack trace, or should the behavior be
>>> configurable?  For example, we have a variant that allows it to invoke a
>>> callback on allocation. Do you want this?  Because it is being called
>>> during allocation, the callback can't invoke JNI (because of the potential
>>> for a safepoint), so it might be somewhat confusing to the user.
>>>
>>> - If the answer to the above is yes, should it be able to invoke
>>> *multiple* callbacks with multiple intervals?  That could get very
>>> expensive and hairy.
>>>
>>> - Other than stack trace, what kind of information should the sampled
>>> data contain?  Right now, the structure is:
>>>
>>> struct StackTraceData {
>>>   ASGCT_CallTrace *trace;
>>>   jint byte_size;
>>>   jlong thread_id;
>>>   const jbyte *name;
>>>   jint name_length;
>>>   jlong uid;
>>> };
>>>
>>> (where name is the class name, and uid is just a unique identifier.)
>>> For general consumption, we'll probably have to change the ASGCT_CallTrace
>>> to a jvmtiStackInfo, I suppose.
>>>
>>> Jeremy
>>>
>>
>>
>


Re: Low-Overhead Heap Profiling

2015-06-23 Thread Jeremy Manson
I haven't tried!  Another problem with our submitting things is that we
can't really test on anything other than Linux.

The reason we use ASGCT is that our modified version of ASGCT gathers
native frames *and* Java frames, getting a mixed mode stack trace that
crosses JNI boundaries.  I haven't checked on the details, but we could
probably use a more standard stack trace gathering mechanism for general
consumption.

As I said, I think we'd have to change that to jvmtiStackInfo.  Would folks
like to see some proposed JVMTI interfaces for this?

Jeremy

On Tue, Jun 23, 2015 at 6:14 AM, Daniel D. Daugherty <
daniel.daughe...@oracle.com> wrote:

> > ASGCT_CallTrace *trace;
>
> So it looks like this uses the AsyncGetTrace() infrastructure.
> Does your tool work on Windows and MacOS X?
>
> Dan
>
>
>
> On 6/22/15 10:58 PM, Jeremy Manson wrote:
>
>> While I'm glad to hear that our colleagues at RedHat would love it, it
>> will still need a sponsor from Oracle.  Any volunteers?
>>
>> It will also need a little polish to be available to the world.  Open
>> design questions for upstreaming are things like:
>>
>> - Should the interval between samples be configurable?
>>
>> - Should it *just* take a stack trace, or should the behavior be
>> configurable?  For example, we have a variant that allows it to invoke a
>> callback on allocation. Do you want this?  Because it is being called
>> during allocation, the callback can't invoke JNI (because of the potential
>> for a safepoint), so it might be somewhat confusing to the user.
>>
>> - If the answer to the above is yes, should it be able to invoke
>> *multiple* callbacks with multiple intervals?  That could get very
>> expensive and hairy.
>>
>> - Other than stack trace, what kind of information should the sampled
>> data contain?  Right now, the structure is:
>>
>> struct StackTraceData {
>>   ASGCT_CallTrace *trace;
>>   jint byte_size;
>>   jlong thread_id;
>>   const jbyte *name;
>>   jint name_length;
>>   jlong uid;
>> };
>>
>> (where name is the class name, and uid is just a unique identifier.)  For
>> general consumption, we'll probably have to change the ASGCT_CallTrace to a
>> jvmtiStackInfo, I suppose.
>>
>> Jeremy
>>
>
>


Re: Low-Overhead Heap Profiling

2015-06-23 Thread Daniel D. Daugherty

> ASGCT_CallTrace *trace;

So it looks like this uses the AsyncGetTrace() infrastructure.
Does your tool work on Windows and MacOS X?

Dan


On 6/22/15 10:58 PM, Jeremy Manson wrote:
While I'm glad to hear that our colleagues at RedHat would love it, it 
will still need a sponsor from Oracle.  Any volunteers?


It will also need a little polish to be available to the world.  Open 
design questions for upstreaming are things like:


- Should the interval between samples be configurable?

- Should it *just* take a stack trace, or should the behavior be 
configurable?  For example, we have a variant that allows it to invoke 
a callback on allocation. Do you want this?  Because it is being 
called during allocation, the callback can't invoke JNI (because of 
the potential for a safepoint), so it might be somewhat confusing to 
the user.


- If the answer to the above is yes, should it be able to invoke 
*multiple* callbacks with multiple intervals?  That could get very 
expensive and hairy.


- Other than stack trace, what kind of information should the sampled 
data contain?  Right now, the structure is:


struct StackTraceData {
  ASGCT_CallTrace *trace;
  jint byte_size;
  jlong thread_id;
  const jbyte *name;
  jint name_length;
  jlong uid;
};

(where name is the class name, and uid is just a unique identifier.) 
 For general consumption, we'll probably have to change the 
ASGCT_CallTrace to a jvmtiStackInfo, I suppose.


Jeremy




Re: Low-Overhead Heap Profiling

2015-06-23 Thread Jeremy Manson
Mario and Kirk -

Random followup to the comments:

Mario: Internally, we have both a callback mechanism (which is off by
default) and a sampling mechanism, which is always on.  We can certainly
contribute both.  The callback mechanism has a strict rule that you don't
call JNI in it.  The one thing you can do is register to throw an OOME
after you return from the allocation - some folks wanted to use this to
rate limit how much allocation a given request could do.

Kirk: We have had a bug open for a long time for an extension that says how
many GC cycles a given sample survived.  That might be useful, and would
actually be pretty easy to implement.

I don't want to do too much work on this (or, more likely, tell one of the
folks on my team to do it) until we have a way forward to get this
interface into some JDK distribution at some point.

Jeremy

On Mon, Jun 22, 2015 at 11:46 PM, Mario Torre <
neugens.limasoftw...@gmail.com> wrote:

> 2015-06-23 6:58 GMT+02:00 Jeremy Manson :
> > While I'm glad to hear that our colleagues at RedHat would love it, it
> will
> > still need a sponsor from Oracle.  Any volunteers?
> >
> > It will also need a little polish to be available to the world.  Open
> design
> > questions for upstreaming are things like:
> >
> > - Should the interval between samples be configurable?
>
> Yeah, this would be nice.
>
> > - Should it *just* take a stack trace, or should the behavior be
> > configurable?  For example, we have a variant that allows it to invoke a
> > callback on allocation.  Do you want this?  Because it is being called
> > during allocation, the callback can't invoke JNI (because of the
> potential
> > for a safe point) , so it might be somewhat confusing to the user.
>
> Yes, perhaps optional callbacks would be nice too. I don't think the
> "no JNI here" is a problem given the scope of the tool, although I can
> see how it could be a bit confusing or kind of limiting in some cases,
> but probably the benefit outweighs this issue. I can see, for example,
> how a callback here could provide a plug for things like system tap
> probes etc...
>
> > - If the answer to the above is yes, should it be able to invoke
> *multiple*
> > callbacks with multiple intervals?  That could get very expensive and
> hairy.
>
> I don't think you need multiple callbacks: since the callback would be
> user code, users can call additional callbacks themselves. But it also
> depends on how this user code is installed. For instance, if the
> outside-world interface is some kind of JVMTI interface, then we
> probably need to support multiple callbacks (as in multiple agents
> installing their custom code). I suspect that to keep the overhead low
> a higher-level interface is needed to collect the allocation events,
> while the callback interface mechanisms would be most useful to
> implement/extend this interface, but not really meant to be used by
> default.
>
> > - Other than stack trace, what kind of information should the sampled
> data
> > contain?  Right now, the structure is:
> >
> > struct StackTraceData {
> >   ASGCT_CallTrace *trace;
> >   jint byte_size;
> >   jlong thread_id;
> >   const jbyte *name;
> >   jint name_length;
> >   jlong uid;
> > };
> >
> > (where name is the class name, and uid is just a unique identifier.)  For
> > general consumption, we'll probably have to change the ASGCT_CallTrace
> to a
> > jvmtiStackInfo, I suppose.
>
> It looks to me that most of the interesting stuff is here. The
> questions that remain to me are how the outside interface would look
> and how it would work; we can probably take a more abstract, high-level
> look at it and then refine the specifics step by step.
>
> Cheers,
> Mario
>
> --
> pgp key: http://subkeys.pgp.net/ PGP Key ID: 80F240CF
> Fingerprint: BA39 9666 94EC 8B73 27FA  FC7C 4086 63E3 80F2 40CF
>
> Java Champion - Blog: http://neugens.wordpress.com - Twitter: @neugens
> Proud GNU Classpath developer: http://www.classpath.org/
> OpenJDK: http://openjdk.java.net/projects/caciocavallo/
>
> Please, support open standards:
> http://endsoftpatents.org/
>


Re: Low-Overhead Heap Profiling

2015-06-22 Thread Mario Torre
2015-06-23 6:58 GMT+02:00 Jeremy Manson :
> While I'm glad to hear that our colleagues at RedHat would love it, it will
> still need a sponsor from Oracle.  Any volunteers?
>
> It will also need a little polish to be available to the world.  Open design
> questions for upstreaming are things like:
>
> - Should the interval between samples be configurable?

Yeah, this would be nice.

> - Should it *just* take a stack trace, or should the behavior be
> configurable?  For example, we have a variant that allows it to invoke a
> callback on allocation.  Do you want this?  Because it is being called
> during allocation, the callback can't invoke JNI (because of the potential
> for a safepoint), so it might be somewhat confusing to the user.

Yes, perhaps optional callbacks would be nice too. I don't think the
"no JNI here" is a problem given the scope of the tool, although I can
see how it could be a bit confusing or kind of limiting in some cases,
but probably the benefit outweighs this issue. I can see, for example,
how a callback here could provide a plug for things like system tap
probes etc...

> - If the answer to the above is yes, should it be able to invoke *multiple*
> callbacks with multiple intervals?  That could get very expensive and hairy.

I don't think you need multiple callbacks: since the callback would be
user code, users can call additional callbacks themselves. But it also
depends on how this user code is installed. For instance, if the
outside-world interface is some kind of JVMTI interface, then we
probably need to support multiple callbacks (as in multiple agents
installing their custom code). I suspect that to keep the overhead low
a higher-level interface is needed to collect the allocation events,
while the callback interface mechanisms would be most useful to
implement/extend this interface, but not really meant to be used by
default.

> - Other than stack trace, what kind of information should the sampled data
> contain?  Right now, the structure is:
>
> struct StackTraceData {
>   ASGCT_CallTrace *trace;
>   jint byte_size;
>   jlong thread_id;
>   const jbyte *name;
>   jint name_length;
>   jlong uid;
> };
>
> (where name is the class name, and uid is just a unique identifier.)  For
> general consumption, we'll probably have to change the ASGCT_CallTrace to a
> jvmtiStackInfo, I suppose.

It looks to me that most of the interesting stuff is here. The
questions that remain to me are how the outside interface would look
and how it would work; we can probably take a more abstract, high-level
look at it and then refine the specifics step by step.

Cheers,
Mario

-- 
pgp key: http://subkeys.pgp.net/ PGP Key ID: 80F240CF
Fingerprint: BA39 9666 94EC 8B73 27FA  FC7C 4086 63E3 80F2 40CF

Java Champion - Blog: http://neugens.wordpress.com - Twitter: @neugens
Proud GNU Classpath developer: http://www.classpath.org/
OpenJDK: http://openjdk.java.net/projects/caciocavallo/

Please, support open standards:
http://endsoftpatents.org/


Re: Low-Overhead Heap Profiling

2015-06-22 Thread Kirk Pepperdine
Hi Jeremy,

> 
> It will also need a little polish to be available to the world.  Open design 
> questions for upstreaming are things like:
> 
> - Should the interval between samples be configurable?

This could be helpful.

> 
> - Should it *just* take a stack trace, or should the behavior be 
> configurable?  For example, we have a variant that allows it to invoke a 
> callback on allocation.  Do you want this?  Because it is being called during 
> allocation, the callback can't invoke JNI (because of the potential for a 
> safepoint), so it might be somewhat confusing to the user.

As long as these features don’t contribute to sampling bias, I think
they’d all (at times) be useful.


> 
> - If the answer to the above is yes, should it be able to invoke *multiple* 
> callbacks with multiple intervals?  That could get very expensive and hairy.
> 
> - Other than stack trace, what kind of information should the sampled data 
> contain?  Right now, the structure is:
> 
> struct StackTraceData {
>   ASGCT_CallTrace *trace;
>   jint byte_size;
>   jlong thread_id;
>   const jbyte *name;
>   jint name_length;
>   jlong uid;
> };

If you could get age data, that could be interesting as well.
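
One hypothetical way to carry such data would be to extend the
structure quoted above; the two added fields here are invented purely
for illustration:

struct StackTraceData {
  ASGCT_CallTrace *trace;
  jint byte_size;
  jlong thread_id;
  const jbyte *name;
  jint name_length;
  jlong uid;
  jlong alloc_millis;  /* hypothetical: timestamp of the sampled allocation */
  jint gc_age;         /* hypothetical: GC survivals observed for the object */
};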

Kind regards,
Kirk Pepperdine



Re: Low-Overhead Heap Profiling

2015-06-22 Thread Kirk Pepperdine
Hi Vladimir,

The difference would be in the licensing.

Kind regards,
Kirk Pepperdine

On Jun 23, 2015, at 7:31 AM, Vladimir Voskresensky wrote:

> Hello Jeremy,
> 
> If this is sampling, not tracing, then how is it different from the
> low-overhead memory profiling provided by JFR [1]?
> JFR samples on each new TLAB allocation. It provides a really good
> picture, and I haven't seen overhead of more than 2%.
> Btw, JFR also does not have the false positives reported by instrumented
> approaches in cases where the JIT was able to eliminate a heap allocation.
> 
> Thanks,
> Vladimir.
> [1] http://hirt.se/blog/?p=381



Re: Low-Overhead Heap Profiling

2015-06-22 Thread Vladimir Voskresensky

Hello Jeremy,

If this is sampling, not tracing, then how is it different from the
low-overhead memory profiling provided by JFR [1]?
JFR samples on each new TLAB allocation. It provides a really good
picture, and I haven't seen overhead of more than 2%.
Btw, JFR also does not have the false positives reported by instrumented
approaches in cases where the JIT was able to eliminate a heap allocation.


Thanks,
Vladimir.
[1] http://hirt.se/blog/?p=381
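
For readers who want to try the JFR route: on the Oracle JDKs of this
era, a recording could be captured with something like the following
(commercial-feature flags as they existed in Oracle JDK 7u40/8; check
the documentation for your exact release):

java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
     -XX:StartFlightRecording=duration=60s,filename=alloc.jfr \
     MyApp

The resulting alloc.jfr file can then be opened in Java Mission Control
to inspect the TLAB allocation events Vladimir describes.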

On 22.06.2015 11:48, Jeremy Manson wrote:

Hey folks,

(cc'ing Aleksey and John, because I mentioned this to them at the
JVMLS last year, but I never followed up.)

We have a patch at Google I've always wanted to contribute to OpenJDK,
but I always figured it would be unwanted.  I've recently been
thinking that might not be as true, though.  I thought I would ask if
there is any interest / if I should write a JEP / if I should just
forget it.

The basic problem is that there is no low-overhead supported way to
figure out where allocation hotspots are. That is, sets of stack
traces where lots of allocation / large allocations took place.

What I had originally done (this was in ~2007) was use bytecode
rewriting to instrument allocation sites.  The instrumentation would
call a Java method, and the method would capture a stack trace.  To
reduce overhead, there was a per-thread counter that only took a stack
trace once every N bytes allocated, where N is a randomly chosen point
in a probability distribution that centered around ~512K.
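
A minimal sketch of that per-thread counting scheme, with all names
invented for illustration (this is not the actual patch). The common
case is a single subtraction; when the counter crosses zero, a sample
is taken and the next interval is drawn from an exponential
distribution with a ~512K mean, so the probability that a given
allocation is sampled stays proportional to its size:

#include <cstddef>
#include <cstdio>
#include <random>

// Illustrative sketch only; names and structure are not from the patch.
class AllocationSampler {
 public:
  AllocationSampler()
      : _rng(std::random_device{}()),
        _bytes_until_sample(next_interval()) {}

  // Called on every allocation; the fast path is one subtraction.
  void on_allocation(size_t size) {
    _bytes_until_sample -= static_cast<long long>(size);
    if (_bytes_until_sample <= 0) {
      take_sample(size);                    // rare slow path
      _bytes_until_sample = next_interval();
    }
  }

 private:
  // Exponentially distributed interval with a ~512K mean.
  long long next_interval() {
    std::exponential_distribution<double> d(1.0 / (512.0 * 1024.0));
    return static_cast<long long>(d(_rng)) + 1;
  }

  void take_sample(size_t size) {
    // A real implementation would capture and store a stack trace here.
    std::printf("sampled allocation of %zu bytes\n", size);
  }

  std::mt19937_64 _rng;
  long long _bytes_until_sample;
};

The randomized interval is what keeps this kind of sampling free of the
periodicity bias that a fixed every-N-bytes counter would introduce.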

This was *way* too slow, and it didn't pick up allocations through
JNI, so I instrumented allocations at the VM level, and the overhead
went away.  The sampling is always turned on in our internal VMs, and
a user can just query an interface for a list of sampled stack
traces.  The allocated stack traces are held with weak refs, so you
only get live samples.

The additional overhead for allocations amounts to a subtraction, and
an occasional stack trace, which is usually a very, very small amount
of our CPU (although I had to do some jiggering in JDK8 to fix some
performance regressions).

There really isn't another good way to do this with low overhead.  I
was wondering how the group would feel about our contributing it?

Thoughts?

Jeremy




Re: Low-Overhead Heap Profiling

2015-06-22 Thread Jeremy Manson
While I'm glad to hear that our colleagues at RedHat would love it, it will
still need a sponsor from Oracle.  Any volunteers?

It will also need a little polish to be available to the world.  Open
design questions for upstreaming are things like:

- Should the interval between samples be configurable?

- Should it *just* take a stack trace, or should the behavior be
configurable?  For example, we have a variant that allows it to invoke a
callback on allocation.  Do you want this?  Because it is being called
during allocation, the callback can't invoke JNI (because of the potential
for a safepoint), so it might be somewhat confusing to the user.

- If the answer to the above is yes, should it be able to invoke *multiple*
callbacks with multiple intervals?  That could get very expensive and hairy.

- Other than stack trace, what kind of information should the sampled data
contain?  Right now, the structure is:

struct StackTraceData {
  ASGCT_CallTrace *trace;
  jint byte_size;
  jlong thread_id;
  const jbyte *name;
  jint name_length;
  jlong uid;
};

(where name is the class name, and uid is just a unique identifier.)  For
general consumption, we'll probably have to change the ASGCT_CallTrace to a
jvmtiStackInfo, I suppose.
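
A hedged guess at what that reshaping might look like. Only
jvmtiStackInfo and its fields are real JVMTI types from jvmti.h; the
surrounding structure mirrors the fields above and is illustrative
only:

#include <jvmti.h>

/* Hypothetical reshaping around JVMTI types; jvmtiStackInfo is real
 * (thread, state, frame_buffer, frame_count), the rest is invented
 * to mirror the StackTraceData fields above. */
typedef struct SampledStackTrace {
  jvmtiStackInfo stack_info;  /* replaces ASGCT_CallTrace *trace */
  jint byte_size;
  jlong thread_id;
  const jbyte *name;          /* class name */
  jint name_length;
  jlong uid;                  /* unique identifier */
} SampledStackTrace;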

Jeremy


Re: Low-Overhead Heap Profiling

2015-06-22 Thread Mario Torre
I for one would love it.

Cheers,
Mario

2015-06-22 20:48 GMT+02:00 Jeremy Manson :
> Hey folks,
>
> (cc'ing Aleksey and John, because I mentioned this to them at the JVMLS last
> year, but I never followed up.)
>
> We have a patch at Google I've always wanted to contribute to OpenJDK, but I
> always figured it would be unwanted.  I've recently been thinking that might
> not be as true, though.  I thought I would ask if there is any interest / if
> I should write a JEP / if I should just forget it.
>
> The basic problem is that there is no low-overhead supported way to figure
> out where allocation hotspots are.  That is, sets of stack traces where lots
> of allocation / large allocations took place.
>
> What I had originally done (this was in ~2007) was use bytecode rewriting to
> instrument allocation sites.  The instrumentation would call a Java method,
> and the method would capture a stack trace.  To reduce overhead, there was a
> per-thread counter that only took a stack trace once every N bytes
> allocated, where N is a randomly chosen point in a probability distribution
> that centered around ~512K.
>
> This was *way* too slow, and it didn't pick up allocations through JNI, so I
> instrumented allocations at the VM level, and the overhead went away.  The
> sampling is always turned on in our internal VMs, and a user can just query
> an interface for a list of sampled stack traces.  The allocated stack traces
> are held with weak refs, so you only get live samples.
>
> The additional overhead for allocations amounts to a subtraction, and an
> occasional stack trace, which is usually a very, very small amount of our
> CPU (although I had to do some jiggering in JDK8 to fix some performance
> regressions).
>
> There really isn't another good way to do this with low overhead.  I was
> wondering how the group would feel about our contributing it?
>
> Thoughts?
>
> Jeremy



-- 
pgp key: http://subkeys.pgp.net/ PGP Key ID: 80F240CF
Fingerprint: BA39 9666 94EC 8B73 27FA  FC7C 4086 63E3 80F2 40CF

Java Champion - Blog: http://neugens.wordpress.com - Twitter: @neugens
Proud GNU Classpath developer: http://www.classpath.org/
OpenJDK: http://openjdk.java.net/projects/caciocavallo/

Please, support open standards:
http://endsoftpatents.org/