Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-09-09 Thread Liang, Kan




On 8/31/2019 5:19 AM, Peter Zijlstra wrote:

Then there is no mucking about with that odd counter/metrics msr pair
reset nonsense. Because that really stinks.

You have to write them to reset the internal counters.

But not for every read, only on METRIC_OVF.


Precision is lost if the counters have been running much longer than the 
measured delta. We have to reset the counters after every read.

For example, the worst case is when the previous reading was rounded up 
and the current reading is rounded down. The error in the METRICS fraction 
is then 1/256 - 1/SLOTS, which depends on SLOTS. The error for each TopDown 
event is (1/256 - 1/SLOTS) * SLOTS.

As SLOTS increases, the precision gets worse and worse.
Therefore, we have to reset the counters periodically.
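
A rough illustration of that formula, with made-up numbers: the worst-case
absolute error in a reconstructed TopDown count is about SLOTS/256, i.e.
roughly 3,900 slots after 1,000,000 SLOTS but roughly 3,900,000 slots after
1,000,000,000 SLOTS. The absolute error grows linearly with SLOTS, which is
why frequent resets keep it bounded.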

Thanks,
Kan




Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-31 Thread Andi Kleen
On Sat, Aug 31, 2019 at 02:13:05AM -0700, Stephane Eranian wrote:
> Andi,
> 
> On Fri, Aug 30, 2019 at 5:31 PM Andi Kleen  wrote:
> >
> > > the same manner. It would greatly simplify the kernel implementation.
> >
> > I tried that originally. It was actually more complicated.
> >
> > You can't really do deltas on raw metrics, and a lot of the perf
> > infrastructure is built around deltas.
> >
> How is RAPL handled? No deltas there either. It uses the snapshot model.

RAPL doesn't support any context switch or CPU context.
Also it has no concept of "accumulate with clear on read"

-Andi


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-31 Thread Peter Zijlstra
On Sat, Aug 31, 2019 at 02:13:05AM -0700, Stephane Eranian wrote:

> With PERF_METRICS, the delta is always since previous read. If you read
> frequently enough you do not lose precision.

You always lose precision; the fraction of 255 loses the
fractional part. Consider:

255 * 1 / 8   = 31.8750
255 * 7 / 8   = 223.1250

31 * 8 / 255  = 0.97254901960784313725
223 * 8 / 255 = 6.99607843137254901960

Now, we can make reconstruction use normal 'round-nearest' and then we'd
get 1 and 7 back, but there are always cases where you lose things.

Also, with 255 being an odd number, round-nearest is actually 'hard' :/
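
For what it's worth, a round-to-nearest reconstruction could look like the
sketch below (an illustration only, not code from the patch). Since 255 is
odd, a remainder of exactly one half can never occur, so adding 127 before
the divide rounds to the nearest integer:

	/* frac is the 8-bit PERF_METRICS field; slots fits in 48 bits,
	 * so the product fits in a u64 */
	slots_in_metric = (slots * frac + 0x7f) / 0xff;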


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-31 Thread Peter Zijlstra
On Wed, Aug 28, 2019 at 12:04:45PM -0700, Andi Kleen wrote:

> > > NMI
> > > ==
> > > 
> > > The METRICS register may overflow. Bit 48 of the STATUS register
> > > will be set. If so, update all active slots and metrics events.
> > 
> > that happen? It would be useful to get that METRIC_OVF (can we please
> 
> This happens when the internal counters that feed the metrics
> overflow.
> 
> > If this is so; then we can use this to update/reset PERF_METRICS and
> > nothing else.
> 
> It has to be handled in the PMI.

That's what I wrote; Overflow is always NMI.

> > Then there is no mucking about with that odd counter/metrics msr pair
> > reset nonsense. Because that really stinks.
> 
> You have to write them to reset the internal counters.

But not for every read, only on METRIC_OVF.


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-31 Thread Peter Zijlstra
On Thu, Aug 29, 2019 at 12:56:02PM -0400, Liang, Kan wrote:
> On 8/29/2019 9:52 AM, Peter Zijlstra wrote:

> > What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!
> 
> Yes.
> 
> In the current implementation, the PERF_METRIC MSR is composed of four fields:
> backend bound, frontend bound, bad speculation and retiring.
> Each field is populated using the formula below, e.g.:
> PERF_METRIC[RETIRING] =
> (0xFF * PERF_METRICS_RETIRING_INTERNAL_48bit_COUNTER) / FIXCTR3

So it really depends on the actual exposed FIXCTR3 _value_ to compute
the PERF_METRIC field? *mind boggles*, that's really unspeakable crap.
And this isn't documented anywhere afaict.

I was thinking they'd have an internal counter for the SLOTS value too,
so the PERF_METRIC fields are independent; which would be like 'sane'.

Exposing the internal counters would've been _so_ much better, just
add 4 more fixed counters and call it a day.

> The METRICS_OVF indicates the overflow of any internal counters.

OK, but I'm thinking that by that time the fraction in PERF_METRIC will
be too coarse and we're losing precision. Reconstruction will be
inaccurate.

> The internal counters only start counting from 0, which cannot be programmed
> by SW. But resetting the PERF_METRIC implicitly resets the internal
> counters.

The only possible option given the choices.


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-31 Thread Stephane Eranian
Andi,

On Fri, Aug 30, 2019 at 5:31 PM Andi Kleen  wrote:
>
> > the same manner. It would greatly simplify the kernel implementation.
>
> I tried that originally. It was actually more complicated.
>
> You can't really do deltas on raw metrics, and a lot of the perf
> infrastructure is built around deltas.
>
How is RAPL handled? No deltas there either. It uses the snapshot model.
At each interval, perf stat just reads the current count, and does not compute
a delta since previous read.
With PERF_METRICS, the delta is always since previous read. If you read
frequently enough you do not lose precision.

>
> To do the regular reset and not lose precision over time internally
> you have to keep expanded counters anyways. And if you do that
> you can just expose them to user space too, and have everything
> in user space just work without any changes (except for the final
> output)
>
> -Andi
>


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-30 Thread Andi Kleen
> the same manner. It would greatly simplify the kernel implementation.

I tried that originally. It was actually more complicated.

You can't really do deltas on raw metrics, and a lot of the perf
infrastructure is built around deltas.

To do the regular reset and not lose precision over time internally
you have to keep expanded counters anyways. And if you do that
you can just expose them to user space too, and have everything
in user space just work without any changes (except for the final
output)

-Andi



Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-30 Thread Stephane Eranian
Hi,

On Mon, Aug 26, 2019 at 7:48 AM  wrote:
>
> From: Kan Liang 
>
> Intro
> =
>
> Icelake has support for measuring the four top level TopDown metrics
> directly in hardware. This is implemented by an additional "metrics"
> register, and a new Fixed Counter 3 that measures pipeline "slots".
>
> Events
> ==
>
> > We export four metric events as separate perf events, which map to the
> > internal "metrics" counter register. Those events do not exist in
> hardware, but can be allocated by the scheduler.
>
There is another approach possible for supporting Topdown-style counters.
Instead of trying to abstract them as separate events to the user and then
trying to put them back together in the kernel and then using slots to scale
them as counts, we could just expose them as is, i.e., structured counter
values. The kernel already handles structured counter configs and exports
the fields on the config via sysfs and the perf tool picks them up and can
encode any event. We could have a similar approach for a counter
value. It could have fields, unit, types. Perf stat would pick them up in
the same manner. It would greatly simplify the kernel implementation.
You would need to publish a pseudo-event code for each group of
metrics. Note that I am not advocating exposing the raw counter value.
That way you would maintain one event code -> one "counter" on hw.
The reset on read would also work. It would generate only one rdmsr
per read without forcing any grouping. You would not need any grouping,
or to use slots under the hood to scale. If PERF_METRICS gets extended, you
can just add another pseudo event code or umask.

The PERF_METRICS events do not make real sense in isolation. The SLOTS
scaling is hard to interpret. You can never profile on a PERF_METRICS event,
so keeping them grouped is okay.


> For the event mapping we use a special 0x00 event code, which is
> reserved for fake events. The metric events start from umask 0x10.
>
> When setting up such events they point to the slots counter, and a
> special callback, update_topdown_event(), reads the additional metrics
> msr to generate the metrics. Then the metric is reported by multiplying
> the metric (percentage) with slots.
>
> This multiplication makes it easy to keep a running count, for example
> when the slots counter overflows, and makes all the standard tools, such
> as perf stat, work. They can do deltas of the values without needing
> to know about percentages. This also simplifies accumulating the counts
> of child events, which otherwise would need to know how to average
> percent values.
>
> All four metric events don't support sampling. Since they will be
> handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
> introduced to indicate this case.
>
> The slots event can support both sampling and counting.
> For counting, the flag is also applied.
> For sampling, it will be handled like any other normal event.
>
> Groups
> ==
>
> To avoid reading the METRICS register multiple times, the metrics and
> slots value can only be updated by the first slots/metrics event in a
> group. All active slots and metrics events will be updated one time.
>
> Reset
> ==
>
> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
> because:
> - The 8bit metrics ratio values lose precision when the measurement
>   period gets longer.
> - The PERF_METRICS may report wrong value if its delta was less than
>   1/255 of SLOTS (Fixed counter 3).
>
> Also, for counting, the -max_period is the initial value of the SLOTS.
> The huge initial value will definitely trigger the issue mentioned
> above. Force initial value as 0 for topdown and slots event counting.
>
> NMI
> ==
>
> The METRICS register may overflow. Bit 48 of the STATUS register
> will be set. If so, update all active slots and metrics events.
>
> The update_topdown_event() has to read two registers separately. The
> values may be modified by an NMI. The PMU has to be disabled before calling
> the function.
>
> RDPMC
> ==
>
> RDPMC is temporarily disabled. The following patch will enable it.
>
> Originally-by: Andi Kleen 
> Signed-off-by: Kan Liang 
> ---
>  arch/x86/events/core.c   |  10 ++
>  arch/x86/events/intel/core.c | 230 ++-
>  arch/x86/events/perf_event.h |  17 +++
>  arch/x86/include/asm/msr-index.h |   2 +
>  4 files changed, 255 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 54534ff00940..1ae23db5c2d7 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
> if (idx == INTEL_PMC_IDX_FIXED_BTS)
> return 0;
>
> +   if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> +   return x86_pmu.update_topdown_event(event);
> /*
>  * Careful: an NMI might modify the previous event value.
>  *
> @@ -1003,6 +1005,10 

Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-29 Thread Liang, Kan




On 8/29/2019 9:52 AM, Peter Zijlstra wrote:

On Thu, Aug 29, 2019 at 09:31:37AM -0400, Liang, Kan wrote:

On 8/28/2019 11:19 AM, Peter Zijlstra wrote:

+static int icl_set_topdown_event_period(struct perf_event *event)
+{
+   struct hw_perf_event *hwc = &event->hw;
+   s64 left = local64_read(&hwc->period_left);
+
+   /*
+* Clear PERF_METRICS and Fixed counter 3 in initialization.
+* After that, both MSRs will be cleared for each read.
+* Don't need to clear them again.
+*/
+   if (left == x86_pmu.max_period) {
+   wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+   wrmsrl(MSR_PERF_METRICS, 0);
+   local64_set(&hwc->period_left, 0);
+   }

This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
never trigger the overflow there; this then seems to suggest the actual
counter value is irrelevant. Therefore you don't actually need this.



Could you please elaborate on why initialization to 0 never triggers an
overflow?


Well, 'never' as in a 'long' time.


As I understand it, initialization to 0 only means that it will take
more time than initialization to -max_period (0x8000 0000 0001) to trigger
an overflow.


Only twice as long. And why do we care about that?

The problem with it is though that you get the overflow at the end of
the whole period, instead of halfway through, so reconstruction is
trickier.


Maybe 0 is too tricky. We can set the initial value to 1.


That's even worse. I'm still not understanding why we can't use the
normal code.


I think the bottom line is that we need a small initial value for FIXED_CTR3
here.


But why?!


PERF_METRICS reports an 8bit integer fraction which is something like 0xff *
internal counters / FIXCTR3.
The internal counters only start counting from 0. (SW cannot set an
arbitrary initial value for internal counters.)
If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always
remain constant, e.g. 0.


What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!


Yes.

In the current implementation, the PERF_METRIC MSR is composed of four 
fields: backend bound, frontend bound, bad speculation and retiring.

Each field is populated using the formula below, e.g.:

PERF_METRIC[RETIRING] =
	(0xFF * PERF_METRICS_RETIRING_INTERNAL_48bit_COUNTER) / FIXCTR3

The METRICS_OVF indicates the overflow of any internal counters.

The internal counters only start counting from 0, which cannot be 
programmed by SW. But resetting PERF_METRIC implicitly resets the 
internal counters.
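
A minimal sketch of the decoding that formula implies -- essentially what
icl_get_metrics_event_value() in the patch does; the helper name and layout
here are illustrative only, not taken from the patch:

	/* Each PERF_METRICS byte is a fraction of SLOTS; the four sum to 0xff. */
	static u64 metric_to_slots(u64 perf_metrics, int field, u64 slots)
	{
		u64 frac = (perf_metrics >> (field * 8)) & 0xff;	/* field 0..3 */

		/* slots is at most 48 bits, so slots * frac fits in a u64 */
		return slots * frac / 0xff;
	}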


Thanks,
Kan



That's bloody insane. /me goes find the SDM. The SDM is bloody useless
:-(.

Please give a complete and coherent description of all of this. I can't
very well review any of this until I know how the hardware works, now
can I.

In this write-up, include the exact condition for METRICS_OVF (the SDM
states: 'it indicates that PERF_METRIC counter has overflowed', which is
grammatically incorrect and makes no sense even with the missing article
injected).








Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-29 Thread Peter Zijlstra
On Thu, Aug 29, 2019 at 09:31:37AM -0400, Liang, Kan wrote:
> On 8/28/2019 11:19 AM, Peter Zijlstra wrote:
> > > +static int icl_set_topdown_event_period(struct perf_event *event)
> > > +{
> > > + struct hw_perf_event *hwc = &event->hw;
> > > + s64 left = local64_read(&hwc->period_left);
> > > +
> > > + /*
> > > +  * Clear PERF_METRICS and Fixed counter 3 in initialization.
> > > +  * After that, both MSRs will be cleared for each read.
> > > +  * Don't need to clear them again.
> > > +  */
> > > + if (left == x86_pmu.max_period) {
> > > + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> > > + wrmsrl(MSR_PERF_METRICS, 0);
> > > + local64_set(&hwc->period_left, 0);
> > > + }
> > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > never trigger the overflow there; this then seems to suggest the actual
> > counter value is irrelevant. Therefore you don't actually need this.
> > 
> 
> Could you please elaborate on why initialization to 0 never triggers an
> overflow?

Well, 'never' as in a 'long' time.

> As I understand it, initialization to 0 only means that it will take
> more time than initialization to -max_period (0x8000 0000 0001) to trigger
> an overflow.

Only twice as long. And why do we care about that?

The problem with it is though that you get the overflow at the end of
the whole period, instead of halfway through, so reconstruction is
trickier.

> Maybe 0 is too tricky. We can set the initial value to 1.

That's even worse. I'm still not understanding why we can't use the
normal code.

> I think the bottom line is that we need a small initial value for FIXED_CTR3
> here.

But why?!

> PERF_METRICS reports an 8bit integer fraction which is something like 0xff *
> internal counters / FIXCTR3.
> The internal counters only start counting from 0. (SW cannot set an
> arbitrary initial value for internal counters.)
> If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always
> remain constant, e.g. 0.

What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!
That's bloody insane. /me goes find the SDM. The SDM is bloody useless
:-(.

Please give a complete and coherent description of all of this. I can't
very well review any of this until I know how the hardware works, now
can I.

In this write-up, include the exact condition for METRICS_OVF (the SDM
states: 'it indicates that PERF_METRIC counter has overflowed', which is
grammatically incorrect and makes no sense even with the missing article
injected).


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-29 Thread Liang, Kan




On 8/28/2019 11:19 AM, Peter Zijlstra wrote:

+static int icl_set_topdown_event_period(struct perf_event *event)
+{
+   struct hw_perf_event *hwc = &event->hw;
+   s64 left = local64_read(&hwc->period_left);
+
+   /*
+* Clear PERF_METRICS and Fixed counter 3 in initialization.
+* After that, both MSRs will be cleared for each read.
+* Don't need to clear them again.
+*/
+   if (left == x86_pmu.max_period) {
+   wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+   wrmsrl(MSR_PERF_METRICS, 0);
+   local64_set(&hwc->period_left, 0);
+   }

This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
never trigger the overflow there; this then seems to suggest the actual
counter value is irrelevant. Therefore you don't actually need this.



Could you please elaborate on why initialization to 0 never triggers an 
overflow?
As I understand it, initialization to 0 only means that it will take 
more time than initialization to -max_period (0x8000 0000 0001) to 
trigger an overflow.


Maybe 0 is too tricky. We can set the initial value to 1.
I think the bottom line is that we need a small initial value for 
FIXED_CTR3 here.
PERF_METRICS reports an 8bit integer fraction which is something like 
0xff * internal counters / FIXCTR3.
The internal counters only start counting from 0. (SW cannot set an 
arbitrary initial value for internal counters.)
If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always 
remain constant, e.g. 0.


Thanks,
Kan


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-29 Thread Peter Zijlstra
On Wed, Aug 28, 2019 at 08:11:51PM -0700, Andi Kleen wrote:
> On Wed, Aug 28, 2019 at 06:28:57PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > > > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > > > never trigger the overflow there; this then seems to suggest the actual
> > > 
> > > The 48bit counter might overflow in a few hours.
> > 
> > Sure; the point is? Kan said it should not be too big; a full 48bit wrap
> > around must be too big or nothing is.
> 
> We expect users to avoid making it too big, but we cannot rule it out.
> 
> Actually for the common perf stat for a long time case you can make it fairly
> big because the percentages still work out over the complete run time.
> 
> The problem with letting it accumulate too much is mainly if you
> want to use a continuously running metrics counter to time smaller
> regions by taking deltas, for example using RDPMC. 
> 
> In this case the small regions would be too inaccurate after some time.
> 
> But to make the first case work we absolutely need to handle the overflow
> case. Otherwise the metrics would just randomly stop at some
> point.

But then you need -max_period, not 0. By using half the period, there is
no ambiguity on overflow.


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-28 Thread Andi Kleen
On Wed, Aug 28, 2019 at 06:28:57PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > > never trigger the overflow there; this then seems to suggest the actual
> > 
> > The 48bit counter might overflow in a few hours.
> 
> Sure; the point is? Kan said it should not be too big; a full 48bit wrap
> around must be too big or nothing is.

We expect users to avoid making it too big, but we cannot rule it out.

Actually for the common perf stat for a long time case you can make it fairly
big because the percentages still work out over the complete run time.

The problem with letting it accumulate too much is mainly if you
want to use a continuously running metrics counter to time smaller
regions by taking deltas, for example using RDPMC. 

In this case the small regions would be too inaccurate after some time.

But to make the first case work we absolutely need to handle the overflow
case. Otherwise the metrics would just randomly stop at some
point.


-Andi





Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-28 Thread Liang, Kan




On 8/28/2019 11:02 AM, Peter Zijlstra wrote:

Reset
==

The PERF_METRICS and Fixed counter 3 have to be reset for each read,
because:
- The 8bit metrics ratio values lose precision when the measurement
   period gets longer.

So it mustn't be too hot,


- The PERF_METRICS may report wrong value if its delta was less than
   1/255 of SLOTS (Fixed counter 3).


The "delta" is actually for the internal counters. Sorry for the
confusion.


it also mustn't be too cold. But that leaves it unspecified what exactly
is the right range.

IOW, you want a Goldilocks number of SLOTS.


Also, for counting, the -max_period is the initial value of the SLOTS.
The huge initial value will definitely trigger the issue mentioned
above. Force initial value as 0 for topdown and slots event counting.

But you just told us that 0 is wrong too (too cold).


We set -max_period (0x8000 0000 0001) as the initial value of FIXCTR3 for 
counting. But the internal counters probably start counting from 0.

PERF_METRICS is the fraction internal counters / FIXCTR3,
so PERF_METRICS will never work.
We have to make FIXCTR3 count from 0 as well.
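
To illustrate that with made-up numbers: with FIXCTR3 preloaded to roughly
0x8000 0000 0001 (~1.4e14) and an internal counter that has only reached,
say, 1e6, the reported field is 0xff * 1e6 / 1.4e14, which truncates to 0 --
and it stays 0 until FIXCTR3's value is no longer dominated by that huge
initial offset.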

Thanks,
Kan


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-28 Thread Andi Kleen
On Wed, Aug 28, 2019 at 05:02:38PM +0200, Peter Zijlstra wrote:
> > 
> > To avoid reading the METRICS register multiple times, the metrics and
> > slots value can only be updated by the first slots/metrics event in a
> > group. All active slots and metrics events will be updated one time.
> 
> Can't we require SLOTS to be the group leader for any Metric group?

Metrics are really useful to collect with any other sampling event
to give you more context in the program behavior.

> Is there ever a case where we want to add other events to a metric
> group?

Yes. Any normal leader sampling case. You just collect metrics too.

> it also mustn't be too cold. But that leaves it unspecified what exactly
> is the right range.
> 
> IOW, you want a Goldilocks number of SLOTS.

That's too strong. You probably don't want minutes, but seconds
worth of measurement should be ok.

> > NMI
> > ==
> > 
> > The METRICS register may overflow. Bit 48 of the STATUS register
> > will be set. If so, update all active slots and metrics events.
> 
> that happen? It would be useful to get that METRIC_OVF (can we please

This happens when the internal counters that feed the metrics
overflow.

> If this is so; then we can use this to update/reset PERF_METRICS and
> nothing else.

It has to be handled in the PMI.

> Then there is no mucking about with that odd counter/metrics msr pair
> reset nonsense. Because that really stinks.

You have to write them to reset the internal counters.

-Andi



Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-28 Thread Peter Zijlstra
On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > never trigger the overflow there; this then seems to suggest the actual
> 
> The 48bit counter might overflow in a few hours.

Sure; the point is? Kan said it should not be too big; a full 48bit wrap
around must be too big or nothing is.


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-28 Thread Andi Kleen
> This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> never trigger the overflow there; this then seems to suggest the actual

The 48bit counter might overflow in a few hours.

-Andi


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-28 Thread Peter Zijlstra
On Mon, Aug 26, 2019 at 07:47:35AM -0700, kan.li...@linux.intel.com wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 54534ff00940..1ae23db5c2d7 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
>   if (idx == INTEL_PMC_IDX_FIXED_BTS)
>   return 0;
>  
> + if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> + return x86_pmu.update_topdown_event(event);

If is_topdown_count() is true; it had better bloody well have that
function. But I really hate this.

>   /*
>* Careful: an NMI might modify the previous event value.
>*
> @@ -1003,6 +1005,10 @@ static int collect_events(struct cpu_hw_events *cpuc, 
> struct perf_event *leader,
>  
>   max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
>  
> + /* There are 4 TopDown metrics events. */
> + if (x86_pmu.intel_cap.perf_metrics)
> + max_count += 4;

I'm thinking this is wrong.. this unconditionally allows collecting 4
extra events on every part that has this metrics crud on.

>   /* current number of events already accepted */
>   n = cpuc->n_events;
>  
> @@ -1184,6 +1190,10 @@ int x86_perf_event_set_period(struct perf_event *event)
>   if (idx == INTEL_PMC_IDX_FIXED_BTS)
>   return 0;
>  
> + if (unlikely(is_topdown_count(event)) &&
> + x86_pmu.set_topdown_event_period)
> + return x86_pmu.set_topdown_event_period(event);

Same as with the other method; and I similarly hates it.

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index f4d6335a18e2..616313d7f3d7 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c

> +static int icl_set_topdown_event_period(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + s64 left = local64_read(&hwc->period_left);
> +
> + /*
> +  * Clear PERF_METRICS and Fixed counter 3 in initialization.
> +  * After that, both MSRs will be cleared for each read.
> +  * Don't need to clear them again.
> +  */
> + if (left == x86_pmu.max_period) {
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> + wrmsrl(MSR_PERF_METRICS, 0);
> + local64_set(&hwc->period_left, 0);
> + }

This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
never trigger the overflow there; this then seems to suggest the actual
counter value is irrelevant. Therefore you don't actually need this.

> +
> + perf_event_update_userpage(event);
> +
> + return 0;
> +}
> +
> +static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
> +{
> + u32 val;
> +
> + /*
> +  * The metric is reported as an 8bit integer percentage

s/percentage/fraction/

percentage is 1/100, 8bit is 256.

> +  * summing up to 0xff.
> +  * slots-in-metric = (Metric / 0xff) * slots
> +  */
> + val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
> + return  mul_u64_u32_div(slots, val, 0xff);

This really utterly blows.. I'd have wasted range to be able to do a
power-of-two fraction here. That is, use 8 bits with a range [0,128].
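
(With a power-of-two denominator the reconstruction would reduce to a shift,
something like: slots_in_metric = (slots * val) >> 7; -- just an illustration
of the point, not a proposal from the thread.)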

But also; x86_64 seems to lack a sane implementation of that function,
and it currently compiles into utter crap (it can be 2 instructions).

> +}

> +/*
> + * Update all active Topdown events.
> + * PMU has to be disabled before calling this function.

Forgets to explain why...

> + */


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-28 Thread Peter Zijlstra
On Mon, Aug 26, 2019 at 07:47:35AM -0700, kan.li...@linux.intel.com wrote:

> Groups
> ==
> 
> To avoid reading the METRICS register multiple times, the metrics and
> slots value can only be updated by the first slots/metrics event in a
> group. All active slots and metrics events will be updated one time.

Can't we require SLOTS to be the group leader for any Metric group?

Is there ever a case where we want to add other events to a metric
group?

> Reset
> ==
> 
> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
> because:
> - The 8bit metrics ratio values lose precision when the measurement
>   period gets longer.

So it mustn't be too hot,

> - The PERF_METRICS may report wrong value if its delta was less than
>   1/255 of SLOTS (Fixed counter 3).

it also mustn't be too cold. But that leaves it unspecified what exactly
is the right range.

IOW, you want a Goldilocks number of SLOTS.

> Also, for counting, the -max_period is the initial value of the SLOTS.
> The huge initial value will definitely trigger the issue mentioned
> above. Force initial value as 0 for topdown and slots event counting.

But you just told us that 0 is wrong too (too cold).

I'm still confused by all this; when exactly does:

> NMI
> ==
> 
> The METRICS register may overflow. Bit 48 of the STATUS register
> will be set. If so, update all active slots and metrics events.

that happen? It would be useful to get that METRIC_OVF (can we please
start naming them; 62,55,48 is past silly) at the exact point
where PERF_METRICS is saturated.

If this is so; then we can use this to update/reset PERF_METRICS and
nothing else.

That is; leave the SLOTS programming alone; use -max_period as usual.

Then on METRIC_OVF, read PERF_METRICS and clear it, and update all the
metric events by adding slots_delta * frac / 256 -- where slots_delta is
the SLOTS count since the last METRIC_OVF.

On read; read PERF_METRICS -- BUT DO NOT RESET -- and compute an
intermediate delta and add that to whatever stable count we had.

Maybe something like:

do {
	count1 = local64_read(&event->count);
	barrier();
	metrics = read_perf_metrics();
	barrier();
	count2 = local64_read(&event->count);
} while (count1 != count2);

/* no METRIC_OVF happened and {count,metrics} is consistent */

return count1 + (slots_delta * frac / 256);

> The update_topdown_event() has to read two registers separately. The
> values may be modified by an NMI. The PMU has to be disabled before calling
> the function.

Then there is no mucking about with that odd counter/metrics msr pair
reset nonsense. Because that really stinks.


[RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-26 Thread kan . liang
From: Kan Liang 

Intro
=

Icelake has support for measuring the four top level TopDown metrics
directly in hardware. This is implemented by an additional "metrics"
register, and a new Fixed Counter 3 that measures pipeline "slots".

Events
==

We export four metric events as separate perf events, which map to the
internal "metrics" counter register. Those events do not exist in
hardware, but can be allocated by the scheduler.

For the event mapping we use a special 0x00 event code, which is
reserved for fake events. The metric events start from umask 0x10.
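
For reference, an illustrative (and unofficial) sketch of the resulting raw
config values, derived from the event constraints later in this patch; the
macro names and the fourth metric's name are assumptions, not part of the
patch:

	/* config = (umask << 8) | event, with event code 0x00 */
	#define TOPDOWN_RETIRING	0x1000	/* Retiring metric */
	#define TOPDOWN_BAD_SPEC	0x1100	/* Bad speculation metric */
	#define TOPDOWN_FE_BOUND	0x1200	/* Frontend bound metric */
	#define TOPDOWN_BE_BOUND	0x1300	/* Backend bound metric (inferred) */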

When setting up such events they point to the slots counter, and a
special callback, update_topdown_event(), reads the additional metrics
msr to generate the metrics. Then the metric is reported by multiplying
the metric (percentage) with slots.

This multiplication makes it easy to keep a running count, for example
when the slots counter overflows, and makes all the standard tools, such
as perf stat, work. They can do deltas of the values without needing
to know about percentages. This also simplifies accumulating the counts
of child events, which otherwise would need to know how to average
percent values.

All four metric events don't support sampling. Since they will be
handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
introduced to indicate this case.

The slots event can support both sampling and counting.
For counting, the flag is also applied.
For sampling, it will be handled like any other normal event.

Groups
==

To avoid reading the METRICS register multiple times, the metrics and
slots value can only be updated by the first slots/metrics event in a
group. All active slots and metrics events will be updated one time.

Reset
==

The PERF_METRICS and Fixed counter 3 have to be reset for each read,
because:
- The 8bit metrics ratio values lose precision when the measurement
  period gets longer.
- The PERF_METRICS may report wrong value if its delta was less than
  1/255 of SLOTS (Fixed counter 3).

Also, for counting, the -max_period is the initial value of the SLOTS.
The huge initial value will definitely trigger the issue mentioned
above. Force initial value as 0 for topdown and slots event counting.

NMI
==

The METRICS register may overflow. Bit 48 of the STATUS register
will be set. If so, update all active slots and metrics events.

The update_topdown_event() has to read two registers separately. The
values may be modified by an NMI. The PMU has to be disabled before calling
the function.

RDPMC
==

RDPMC is temporarily disabled. The following patch will enable it.

Originally-by: Andi Kleen 
Signed-off-by: Kan Liang 
---
 arch/x86/events/core.c   |  10 ++
 arch/x86/events/intel/core.c | 230 ++-
 arch/x86/events/perf_event.h |  17 +++
 arch/x86/include/asm/msr-index.h |   2 +
 4 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 54534ff00940..1ae23db5c2d7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;
 
+   if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+   return x86_pmu.update_topdown_event(event);
/*
 * Careful: an NMI might modify the previous event value.
 *
@@ -1003,6 +1005,10 @@ static int collect_events(struct cpu_hw_events *cpuc, 
struct perf_event *leader,
 
max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
 
+   /* There are 4 TopDown metrics events. */
+   if (x86_pmu.intel_cap.perf_metrics)
+   max_count += 4;
+
/* current number of events already accepted */
n = cpuc->n_events;
 
@@ -1184,6 +1190,10 @@ int x86_perf_event_set_period(struct perf_event *event)
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;
 
+   if (unlikely(is_topdown_count(event)) &&
+   x86_pmu.set_topdown_event_period)
+   return x86_pmu.set_topdown_event_period(event);
+
/*
 * If we are way outside a reasonable range then just skip forward:
 */
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f4d6335a18e2..616313d7f3d7 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -247,6 +247,10 @@ static struct event_constraint 
intel_icl_event_constraints[] = {
FIXED_EVENT_CONSTRAINT(0x003c, 1),  /* CPU_CLK_UNHALTED.CORE */
FIXED_EVENT_CONSTRAINT(0x0300, 2),  /* CPU_CLK_UNHALTED.REF */
FIXED_EVENT_CONSTRAINT(0x0400, 3),  /* SLOTS */
+   METRIC_EVENT_CONSTRAINT(0x1000, 0), /* Retiring metric */
+   METRIC_EVENT_CONSTRAINT(0x1100, 1), /* Bad speculation metric */
+   METRIC_EVENT_CONSTRAINT(0x1200, 2), /* FE bound metric */
+   METRIC_EVENT_CONSTRAINT(0x1300, 3),