Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-20 Thread Stephane Eranian
Hi Peter,

On Thu, Apr 15, 2021 at 7:51 AM Peter Zijlstra  wrote:
>
> On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
>
> git grep "This patch" Documentation/
>
> > cgroup event counting (i.e. perf stat).
> >
> >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> >  64-bit array to attach given cgroups.  The first element is a
> >  number of cgroups in the buffer, and the rest is a list of cgroup
> >  ids to add a cgroup info to the given event.
>
> WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> different ways?
>
> >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> >  array to get the event counter values.  The first element is size
> >  of the array in byte, and the second element is a cgroup id to
> >  read.  The rest is to save the counter value and timings.
>
> :-(
>
> So basically you're doing a whole seconds cgroup interface, one that
> violates the one counter per file premise and lives off of ioctl()s.
>
> *IF* we're going to do something like this, I feel we should explore the
> whole vector-per-fd concept before proceeding. Can we make it less yuck
> (less special ioctl() and more regular file ops. Can we apply the
> concept to more things?
>
> The second patch extends the ioctl() to be more read() like, instead of
> doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> or whatever. In fact, this whole second ioctl() doesn't make sense to
> have if we do indeed want to do vector-per-fd.
>
> Also, I suppose you can already fake this, by having a
> SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event
> with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
> group with a bunch of events. Then the buffer will fill with the values
> you use here.
>
> Yes, I suppose it has higher overhead, but you get the data you want
> without having to do terrible things like this.
>
The sampling approach will certainly incur more overhead and risks losing the
ability to reconstruct the total count per cgroup, unless you set the period
for SW_CGROUP_SWITCHES to 1. But then you run the risk of losing samples if
the buffer is full or sampling is throttled. In some scenarios, we believe the
number of context switches between cgroups could be quite high (>> 1000/s).
On top of that, you would have to add the processing of the samples to extract
the counts per cgroup. That would require synthesizing cgroup information in
perf record and some post-processing in perf report. We are interested in
using the data live to make policy decisions, so a counting approach with
perf stat will always be best.
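
For reference, the group Peter describes would look roughly like this on each
CPU (a sketch only; PERF_COUNT_SW_CGROUP_SWITCHES is the software event being
added elsewhere in this series, and the helper name is made up):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sampling leader: every cgroup switch emits a sample carrying the
 * cgroup id plus a group read of the hardware counters. */
static int open_cgroup_switch_leader(int cpu)
{
        struct perf_event_attr attr = {
                .type          = PERF_TYPE_SOFTWARE,
                .config        = PERF_COUNT_SW_CGROUP_SWITCHES,
                .size          = sizeof(attr),
                .sample_period = 1,   /* period 1, as discussed above */
                .sample_type   = PERF_SAMPLE_READ | PERF_SAMPLE_CGROUP,
                .read_format   = PERF_FORMAT_GROUP,
                .disabled      = 1,
        };

        /* the hardware events are then opened with group_fd = this fd */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

Every switch then produces a record that user space has to consume and
post-process, which is exactly the overhead described above.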

The fundamental problem Namhyung is trying to solve is the following:

num_fds = num_cpus x num_events x num_cgroups

On a 256-CPU AMD server running 200 cgroups with 6 events/cgroup (as an
example):

num_fds = 256 x 200 x 6 = 307,200 fds (with all the kernel memory associated
with them).

On each CPU, that implies 200 x 6 = 1200 events to schedule, and 6 events to
find on each cgroup switch.

This does not scale for us:
   - we run against the fd limit, and also against kernel memory consumption
     per struct file, struct inode, and struct perf_event
   - the number of events per CPU is still large
   - it requires event scheduling on cgroup switches, which, even with the
     RB-tree improvements, is still heavy
   - it requires event scheduling even when measuring the same events across
     all cgroups

One factor in the equation above needs to disappear. The one counter per file
descriptor premise is respected with Namhyung's patch because he is operating
in plain per-cpu mode; what changes is just how and where the count is
accumulated in perf_events. The resulting programming of the hardware is the
same as before.

What is needed is a way to accumulate counts per cgroup without incurring all
this overhead. That will inevitably introduce another way of specifying
cgroups. The current mode offers maximum flexibility.
You can specify any 

Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-04-01 Thread Stephane Eranian
Hi,

I would like to re-emphasize why this patch is important. As Namhyung
outlined in his cover message, cgroup monitoring builds on top of per-cpu
monitoring and offers maximum flexibility by allowing each event to be
attached to a single cgroup. Although this was fine when machines were much
smaller and the number of simultaneous cgroups was also small, it does not
work anymore with today's machines, and even less with future machines. Over
the last couple of years, we have tried to make cgroup monitoring more
scalable. Ian Rogers' patch series addressed the RB-tree handling of events
to avoid walking the whole tree to find the events of the cgroup being
scheduled in. This helped reduce some of the overhead we are seeing, which
causes serious problems for our end users, forcing them to tone down
monitoring and to slice collection across cgroups over time, which is far
from ideal.

Namhyung's series goes a lot further by addressing two key overhead factors:
  1- the file descriptor consumption explosion
  2- the context switch overhead

Again, these are a major cause of problems for us and needed to be addressed
in a way that maintains backward compatibility. We are interested in the case
where the same events are measured across all cgroups, and I believe this is
a common usage model.

1/ File descriptor issue

With the current interface, if you want to monitor 10 events on a 112-CPU
server across 200 cgroups, you need:

num_fds = num_events x num_cpus x num_cgroups = 10 x 112 x 200 = 224,000 descriptors

A typical Linux distribution allows around 1024 file descriptors per process.
Although you could raise the limit as root, this has other impacts on the
system: the kernel memory footprint needed to back these file descriptors and
struct perf_event is large (see our presentation at LPC2019).
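
To put the number in perspective with plain shell (values illustrative, the
limit varies by distribution):

  # default per-process file descriptor limit
  $ ulimit -n
  1024

  # what the current per-cgroup interface needs for this example
  $ echo $((10 * 112 * 200))
  224000

i.e., more than 200x over the default limit before the tool has opened
anything else.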

2/ Context switch overhead

Each time you have a cgroup switch, i.e., a context switch where you switch
cgroups, you incur a PMU event reschedule. A cgroup sched-in is more
expensive than a per-process sched-in because you have to find the events
which are relevant to the next cgroup, i.e., you may have to walk more
entries in the RB-tree. If the events are identical across cgroups, you may
end up paying that cost just to reinstall the same events (ignoring
multiplexing). Furthermore, event scheduling is an expensive operation
because of memory accesses and PMU register accesses, so it is always best
avoided if it can be. From our experience, this has caused significant
overhead on our systems, to the point where we have to reduce the interval
at which we collect the data and the number of cgroups we can monitor at
once.


3/ Namhyung's solution

I personally like Namhyung's solution to the problem because it fits within
the existing interface and does not break the existing per-cgroup mode. The
implementation is fairly simple and non-invasive. It provides a very
significant reduction of overhead on BOTH the file descriptor pressure and
the context switch overhead. It matches perfectly with the common usage model
of monitoring the same events across multiple cgroups simultaneously. The
patch does not disrupt the existing perf_event_open() or read()/close()
syscalls. Everything is handled via a pair of new ioctl() commands.
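
For illustration, a tool would drive the two new ioctls roughly as follows
(a sketch based on the buffer layouts described in the changelog; the exact
ABI is whatever the final patch defines):

#include <stdint.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* ATTACH: first u64 is the number of cgroup ids, followed by the ids. */
static int attach_cgroups(int event_fd, const uint64_t *ids, uint64_t nr)
{
        uint64_t buf[1 + nr];

        buf[0] = nr;
        for (uint64_t i = 0; i < nr; i++)
                buf[1 + i] = ids[i];
        return ioctl(event_fd, PERF_EVENT_IOC_ATTACH_CGROUP, buf);
}

/* READ: first u64 is the buffer size in bytes, second the cgroup id;
 * the counter value and timings are written back into the buffer. */
static int read_cgroup(int event_fd, uint64_t cgrp_id,
                       uint64_t *buf, size_t bytes)
{
        buf[0] = bytes;
        buf[1] = cgrp_id;
        return ioctl(event_fd, PERF_EVENT_IOC_READ_CGROUP, buf);
}

One event fd per CPU then serves all 200 cgroups, which is where the numbers
below come from.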

It eliminates the file descriptor overhead as follows, using the same example
as before:

Before:
num_fds = num_events x num_cpus x num_cgroups = 10 x 112 x 200 = 224,000 descriptors

After:
num_fds = num_events x num_cpus = 10 x 112 = 1,120 descriptors

(a 200x reduction in fds, and the kernel memory savings that go with it)

In other words, it reduces the file descriptor consumption to what is
necessary for plain system-wide monitoring.

On a cgroup switch, the kernel computes the event delta and stores it into a
hash table, i.e., a single PMU register access instead of a full PMU
reschedule. The delta is propagated up the proper cgroup hierarchy if needed.
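
Conceptually, the switch path becomes something like this (pseudo-code of the
scheme described above; names are made up, only the structure matters):

static void account_cgroup_switch(struct perf_event *event, u64 prev_cgrp_id)
{
        u64 now = read_pmu_counter(event);       /* one counter read */
        u64 delta = now - event->last_count;
        struct cgrp_node *node;

        /* attribute the delta to the cgroup being switched out */
        node = hash_lookup(event->cgrp_hash, prev_cgrp_id);
        if (node)
                node->count += delta;

        event->last_count = now;                 /* no PMU reprogramming */
}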

The change is generic and benefits ALL processor architectures in the
same manner.

We have tested the patch on our servers with large configurations, and it has
delivered significant savings and enabled monitoring more cgroups
simultaneously, instead of monitoring in batches, which never yielded a
consistent view of the system.

Furthermore, the patches could be extended to add, as Song Liu suggested, the
ability to delete cgroups attached to an event, to allow continuous
monitoring without having to restart the monitoring tool. I believe the
extension could be further improved by allowing a vector read of the counts
as well. That would eliminate a significant number of ioctl(READ) syscalls.

Overall, I think this patch series delivers significant value-add to
the perf_events interface and should be committed ASAP.

Thanks.




On Tue, Mar 30, 2021 at 8:11 AM Namhyung Kim  wrote:
>
> On Tue, Mar 30, 2021 at 3:33 PM Song Liu  wrote:
> > > On Mar 29, 2021, at 4:33 AM, Namhyung Kim  wrote:
> > >
> > > On Mon, Mar 29, 2021 at 2:17 AM Song Liu  wrote:
> > >>> On Mar 23, 2021, at 9:21 

Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"

2021-03-16 Thread Stephane Eranian
On Tue, Mar 16, 2021 at 5:28 AM Liang, Kan  wrote:
>
>
>
> On 3/16/2021 3:22 AM, Namhyung Kim wrote:
> > Hi Peter and Kan,
> >
> > On Thu, Mar 4, 2021 at 5:22 AM Peter Zijlstra  wrote:
> >>
> >> On Wed, Mar 03, 2021 at 02:53:00PM -0500, Liang, Kan wrote:
> >>> On 3/3/2021 1:59 PM, Peter Zijlstra wrote:
>  On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com 
>  wrote:
> >>
> > +++ b/arch/x86/events/intel/ds.c
> > @@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct 
> > pt_regs *iregs, struct perf_sample_d
> >continue;
> >}
> > - /*
> > -  * On some CPUs the PEBS status can be zero when PEBS is
> > -  * racing with clearing of GLOBAL_STATUS.
> > -  *
> > -  * Normally we would drop that record, but in the
> > -  * case when there is only a single active PEBS event
> > -  * we can assume it's for that event.
> > -  */
> > - if (!pebs_status && cpuc->pebs_enabled &&
> > - !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
> > - pebs_status = cpuc->pebs_enabled;
> 
>  Wouldn't something like:
> 
>   pebs_status = p->status = cpus->pebs_enabled;
> 
> >>>
> >>> I didn't consider it as a potential solution in this patch because I don't
> >>> think it's a proper way that SW modifies the buffer, which is supposed to 
> >>> be
> >>> manipulated by the HW.
> >>
> >> Right, but then HW was supposed to write sane values and it doesn't do
> >> that either ;-)
> >>
> >>> It's just a personal preference. I don't see any issue here. We may try 
> >>> it.
> >>
> >> So I mostly agree with you, but I think it's a shame to unsupport such
> >> chips, HSW is still a plenty useable chip today.
> >
> > I got a similar issue on ivybridge machines which caused kernel crash.
> > My case it's related to the branch stack with PEBS events but I think
> > it's the same issue.  And I can confirm that the above approach of
> > updating p->status fixed the problem.
> >
> > I've talked to Stephane about this, and he wants to make it more
> > robust when we see stale (or invalid) PEBS records.  I'll send the
> > patch soon.
> >
>
> Hi Namhyung,
>
> In case you didn't see it, I've already submitted a patch to fix the
> issue last Friday.
> https://lore.kernel.org/lkml/161298-140216-1-git-send-email-kan.li...@linux.intel.com/
> But if you have a more robust proposal, please feel free to submit it.
>
This fixes the problem on the older systems. The other problem we identified
in the PEBS sample processing code is that you can end up with an
uninitialized perf_sample_data struct passed to perf_event_overflow():

setup_pebs_fixed_sample_data(pebs, data)
{
        if (!pebs)
                return;
        perf_sample_data_init(data);  <<< must be moved before the if (!pebs)
        ...
}

__intel_pmu_pebs_event(pebs, data)
{
        setup_sample(pebs, data);
        perf_event_overflow(data);
        ...
}

If there is any other reason to get pebs == NULL in
setup_pebs_fixed_sample_data() or setup_pebs_adaptive_sample_data(), then you
must call perf_sample_data_init(data) BEFORE you return; otherwise you end up
in perf_event_overflow() with uninitialized data and you may die as follows:

[] ? perf_output_copy+0x4d/0xb0
[] perf_output_sample+0x561/0xab0
[] ? __perf_event_header__init_id+0x112/0x130
[] ? perf_prepare_sample+0x1b1/0x730
[] perf_event_output_forward+0x59/0x80
[] ? perf_event_update_userpage+0xf4/0x110
[] perf_event_overflow+0x88/0xe0
[] __intel_pmu_pebs_event+0x328/0x380

This all stems from get_next_pebs_record_by_bit() potentially returning NULL,
with the NULL not being handled correctly by the callers. This is what I'd
like to see cleaned up in __intel_pmu_pebs_event() to avoid future problems.

I have a patch that moves the perf_sample_data_init() and I can send it to
LKML, but it would also need the get_next_pebs_record_by_bit() cleanup to be
complete.
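
Concretely, the reordering I have in mind is simply (a sketch against the
current shape of the function, not a formal patch):

static void setup_pebs_fixed_sample_data(struct perf_event *event,
                                         struct pt_regs *iregs, void *__pebs,
                                         struct perf_sample_data *data,
                                         struct pt_regs *regs)
{
        /* initialize unconditionally so callers never see stale data ... */
        perf_sample_data_init(data, 0, event->hw.last_period);

        /* ... even when the record pointer is NULL */
        if (!__pebs)
                return;
        ...
}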

Thanks.


Re: [perf] perf_fuzzer causes unchecked MSR access error

2021-03-03 Thread Stephane Eranian
On Wed, Mar 3, 2021 at 10:16 AM Vince Weaver  wrote:
>
> Hello
>
> on my Haswell machine the perf_fuzzer managed to trigger this message:
>
> [117248.075892] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 
> 0x0400) at rIP: 0x8106e4f4 (native_write_msr+0x4/0x20)
> [117248.089957] Call Trace:
> [117248.092685]  intel_pmu_pebs_enable_all+0x31/0x40
> [117248.097737]  intel_pmu_enable_all+0xa/0x10
> [117248.102210]  __perf_event_task_sched_in+0x2df/0x2f0
> [117248.107511]  finish_task_switch.isra.0+0x15f/0x280
> [117248.112765]  schedule_tail+0xc/0x40
> [117248.116562]  ret_from_fork+0x8/0x30
>
> that shouldn't be possible, should it?  MSR 0x3f1 is MSR_IA32_PEBS_ENABLE
>
Not possible, bit 58 is not defined in PEBS_ENABLE, AFAIK.

>
> this is on recent-git with the patch causing the pebs-related crash
> reverted.
>
> Vince


Re: [PATCH v3 1/3] perf core: Factor out __perf_sw_event_sched

2021-02-25 Thread Stephane Eranian
Hi Peter,

Any comments on this patch series?

It is quite useful to be able to count the number of cgroup switches
simply using perf stat/record.
Not all context switches (cs) are necessarily cgroup switches.
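Once the tooling side knows about the new software event, the intent is to be
able to do something along these lines (event name illustrative, it depends
on the final tool patches):

  # perf stat -a -e context-switches,cgroup-switches -- sleep 10

and compare the two counts directly.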
Thanks.

On Wed, Feb 10, 2021 at 12:33 AM Namhyung Kim  wrote:
>
> In some cases, we need to check more than whether the software event
> is enabled.  So split the condition check and the actual event
> handling.  This is a preparation for the next change.
>
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Namhyung Kim 
> ---
>  include/linux/perf_event.h | 33 -
>  1 file changed, 12 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fab42cfbd350..2a1be6026a2f 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1174,30 +1174,24 @@ DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]);
>   * which is guaranteed by us not actually scheduling inside other swevents
>   * because those disable preemption.
>   */
> -static __always_inline void
> -perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
> +static __always_inline void __perf_sw_event_sched(u32 event_id, u64 nr, u64 
> addr)
>  {
> -   if (static_key_false(_swevent_enabled[event_id])) {
> -   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
> +   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
>
> -   perf_fetch_caller_regs(regs);
> -   ___perf_sw_event(event_id, nr, regs, addr);
> -   }
> +   perf_fetch_caller_regs(regs);
> +   ___perf_sw_event(event_id, nr, regs, addr);
>  }
>
>  extern struct static_key_false perf_sched_events;
>
> -static __always_inline bool
> -perf_sw_migrate_enabled(void)
> +static __always_inline bool __perf_sw_enabled(int swevt)
>  {
> -   if 
> (static_key_false(_swevent_enabled[PERF_COUNT_SW_CPU_MIGRATIONS]))
> -   return true;
> -   return false;
> +   return static_key_false(_swevent_enabled[swevt]);
>  }
>
>  static inline void perf_event_task_migrate(struct task_struct *task)
>  {
> -   if (perf_sw_migrate_enabled())
> +   if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS))
> task->sched_migrated = 1;
>  }
>
> @@ -1207,11 +1201,9 @@ static inline void perf_event_task_sched_in(struct 
> task_struct *prev,
> if (static_branch_unlikely(_sched_events))
> __perf_event_task_sched_in(prev, task);
>
> -   if (perf_sw_migrate_enabled() && task->sched_migrated) {
> -   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
> -
> -   perf_fetch_caller_regs(regs);
> -   ___perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, regs, 0);
> +   if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS) &&
> +   task->sched_migrated) {
> +   __perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
> task->sched_migrated = 0;
> }
>  }
> @@ -1219,7 +1211,8 @@ static inline void perf_event_task_sched_in(struct 
> task_struct *prev,
>  static inline void perf_event_task_sched_out(struct task_struct *prev,
>  struct task_struct *next)
>  {
> -   perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
> +   if (__perf_sw_enabled(PERF_COUNT_SW_CONTEXT_SWITCHES))
> +   __perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
>
> if (static_branch_unlikely(_sched_events))
> __perf_event_task_sched_out(prev, next);
> @@ -1475,8 +1468,6 @@ static inline int perf_event_refresh(struct perf_event 
> *event, int refresh)
>  static inline void
>  perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr){ }
>  static inline void
> -perf_sw_event_sched(u32 event_id, u64 nr, u64 addr){ }
> -static inline void
>  perf_bp_event(struct perf_event *event, void *data){ }
>
>  static inline int perf_register_guest_info_callbacks
> --
> 2.30.0.478.g8a0d178c01-goog
>


Re: [PATCHv2] perf tools: Detect when pipe is passed as perf data

2021-01-10 Thread Stephane Eranian
On Wed, Jan 6, 2021 at 1:49 AM Jiri Olsa  wrote:
>
> On Tue, Jan 05, 2021 at 05:33:38PM -0800, Stephane Eranian wrote:
> > Hi,
> >
> > On Wed, Dec 30, 2020 at 3:09 AM Jiri Olsa  wrote:
> > >
> > > Currently we allow pipe input/output only through '-' string
> > > being passed to '-o' or '-i' options, like:
> > >
> > It seems to me it would be useful to auto-detect that the perf.data
> > file is in pipe vs. file mode format.
> > Your patch detects the type of the file which is something different
> > from the format of its content.
>
> hi,
> it goes together with the format, once the output file
> is pipe, the format is pipe as well
>
What I was saying is: if I do

  $ perf record -o - -a sleep 10 > perf.data
  $ perf report -i perf.data

it should autodetect that perf.data is a pipe-mode file.
Does it do that today?

> jirka
>
> > Thanks.
> >
> > >   # mkfifo perf.pipe
> > >   # perf record --no-buffering -e 'sched:sched_switch' -o - > perf.pipe &
> > >   [1] 354406
> > >   # cat perf.pipe | ./perf --no-pager script -i - | head -3
> > > perf 354406 [000] 168190.164921: sched:sched_switch: 
> > > perf:354406..
> > >  migration/012 [000] 168190.164928: sched:sched_switch: 
> > > migration/0:..
> > > perf 354406 [001] 168190.164981: sched:sched_switch: 
> > > perf:354406..
> > >   ...
> > >
> > > This patch detects if given path is pipe and set the perf data
> > > object accordingly, so it's possible now to do above with:
> > >
> > >   # mkfifo perf.pipe
> > >   # perf record --no-buffering -e 'sched:sched_switch' -o perf.pipe &
> > >   [1] 360188
> > >   # perf --no-pager script -i ./perf.pipe | head -3
> > > perf 354442 [000] 168275.464895: sched:sched_switch: 
> > > perf:354442..
> > >  migration/012 [000] 168275.464902: sched:sched_switch: 
> > > migration/0:..
> > > perf 354442 [001] 168275.464953: sched:sched_switch: 
> > > perf:354442..
> > >
> > > It's of course possible to combine any of above ways.
> > >
> > > Signed-off-by: Jiri Olsa 
> > > ---
> > > v2:
> > >   - removed O_CREAT|O_TRUNC flags from pipe's write end
> > >
> > >  tools/perf/util/data.c | 27 +--
> > >  1 file changed, 21 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> > > index f29af4fc3d09..4dfa9e0f2fec 100644
> > > --- a/tools/perf/util/data.c
> > > +++ b/tools/perf/util/data.c
> > > @@ -159,7 +159,7 @@ int perf_data__update_dir(struct perf_data *data)
> > > return 0;
> > >  }
> > >
> > > -static bool check_pipe(struct perf_data *data)
> > > +static int check_pipe(struct perf_data *data)
> > >  {
> > > struct stat st;
> > > bool is_pipe = false;
> > > @@ -172,6 +172,15 @@ static bool check_pipe(struct perf_data *data)
> > > } else {
> > > if (!strcmp(data->path, "-"))
> > > is_pipe = true;
> > > +   else if (!stat(data->path, ) && S_ISFIFO(st.st_mode)) {
> > > +   int flags = perf_data__is_read(data) ?
> > > +   O_RDONLY : O_WRONLY;
> > > +
> > > +   fd = open(data->path, flags);
> > > +   if (fd < 0)
> > > +   return -EINVAL;
> > > +   is_pipe = true;
> > > +   }
> > > }
> > >
> > > if (is_pipe) {
> > > @@ -190,7 +199,8 @@ static bool check_pipe(struct perf_data *data)
> > > }
> > > }
> > >
> > > -   return data->is_pipe = is_pipe;
> > > +   data->is_pipe = is_pipe;
> > > +   return 0;
> > >  }
> > >
> > >  static int check_backup(struct perf_data *data)
> > > @@ -344,8 +354,11 @@ static int open_dir(struct perf_data *data)
> > >
> > >  int perf_data__open(struct perf_data *data)
> > >  {
> > > -   if (check_pipe(data))
> > > -   return 0;
> > > +   int err;
> > > +
> > > +   err = check_pipe(data);
> > > +   if (err || data->is_pipe)
> > > +   return err;
> > >
> > > /* currently it allows stdio for pipe only */
> > > data->use_stdio = false;
> > > @@ -410,8 +423,10 @@ int perf_data__switch(struct perf_data *data,
> > >  {
> > > int ret;
> > >
> > > -   if (check_pipe(data))
> > > -   return -EINVAL;
> > > +   ret = check_pipe(data);
> > > +   if (ret || data->is_pipe)
> > > +   return ret;
> > > +
> > > if (perf_data__is_read(data))
> > > return -EINVAL;
> > >
> > > --
> > > 2.26.2
> > >
> >
>


Re: [PATCHv2] perf tools: Detect when pipe is passed as perf data

2021-01-05 Thread Stephane Eranian
Hi,

On Wed, Dec 30, 2020 at 3:09 AM Jiri Olsa  wrote:
>
> Currently we allow pipe input/output only through '-' string
> being passed to '-o' or '-i' options, like:
>
It seems to me it would be useful to auto-detect that the perf.data
file is in pipe vs. file mode format.
Your patch detects the type of the file which is something different
from the format of its content.
Thanks.

>   # mkfifo perf.pipe
>   # perf record --no-buffering -e 'sched:sched_switch' -o - > perf.pipe &
>   [1] 354406
>   # cat perf.pipe | ./perf --no-pager script -i - | head -3
> perf 354406 [000] 168190.164921: sched:sched_switch: perf:354406..
>  migration/012 [000] 168190.164928: sched:sched_switch: migration/0:..
> perf 354406 [001] 168190.164981: sched:sched_switch: perf:354406..
>   ...
>
> This patch detects if given path is pipe and set the perf data
> object accordingly, so it's possible now to do above with:
>
>   # mkfifo perf.pipe
>   # perf record --no-buffering -e 'sched:sched_switch' -o perf.pipe &
>   [1] 360188
>   # perf --no-pager script -i ./perf.pipe | head -3
> perf 354442 [000] 168275.464895: sched:sched_switch: perf:354442..
>  migration/012 [000] 168275.464902: sched:sched_switch: migration/0:..
> perf 354442 [001] 168275.464953: sched:sched_switch: perf:354442..
>
> It's of course possible to combine any of above ways.
>
> Signed-off-by: Jiri Olsa 
> ---
> v2:
>   - removed O_CREAT|O_TRUNC flags from pipe's write end
>
>  tools/perf/util/data.c | 27 +--
>  1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> index f29af4fc3d09..4dfa9e0f2fec 100644
> --- a/tools/perf/util/data.c
> +++ b/tools/perf/util/data.c
> @@ -159,7 +159,7 @@ int perf_data__update_dir(struct perf_data *data)
> return 0;
>  }
>
> -static bool check_pipe(struct perf_data *data)
> +static int check_pipe(struct perf_data *data)
>  {
> struct stat st;
> bool is_pipe = false;
> @@ -172,6 +172,15 @@ static bool check_pipe(struct perf_data *data)
> } else {
> if (!strcmp(data->path, "-"))
> is_pipe = true;
> +   else if (!stat(data->path, ) && S_ISFIFO(st.st_mode)) {
> +   int flags = perf_data__is_read(data) ?
> +   O_RDONLY : O_WRONLY;
> +
> +   fd = open(data->path, flags);
> +   if (fd < 0)
> +   return -EINVAL;
> +   is_pipe = true;
> +   }
> }
>
> if (is_pipe) {
> @@ -190,7 +199,8 @@ static bool check_pipe(struct perf_data *data)
> }
> }
>
> -   return data->is_pipe = is_pipe;
> +   data->is_pipe = is_pipe;
> +   return 0;
>  }
>
>  static int check_backup(struct perf_data *data)
> @@ -344,8 +354,11 @@ static int open_dir(struct perf_data *data)
>
>  int perf_data__open(struct perf_data *data)
>  {
> -   if (check_pipe(data))
> -   return 0;
> +   int err;
> +
> +   err = check_pipe(data);
> +   if (err || data->is_pipe)
> +   return err;
>
> /* currently it allows stdio for pipe only */
> data->use_stdio = false;
> @@ -410,8 +423,10 @@ int perf_data__switch(struct perf_data *data,
>  {
> int ret;
>
> -   if (check_pipe(data))
> -   return -EINVAL;
> +   ret = check_pipe(data);
> +   if (ret || data->is_pipe)
> +   return ret;
> +
> if (perf_data__is_read(data))
> return -EINVAL;
>
> --
> 2.26.2
>


[tip: perf/urgent] perf/x86/intel: Check PEBS status correctly

2020-12-03 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/urgent branch of tip:

Commit-ID: fc17db8aa4c53cbd2d5469bb0521ea0f0a6dbb27
Gitweb:
https://git.kernel.org/tip/fc17db8aa4c53cbd2d5469bb0521ea0f0a6dbb27
Author:Stephane Eranian 
AuthorDate:Thu, 26 Nov 2020 20:09:22 +09:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 03 Dec 2020 10:00:26 +01:00

perf/x86/intel: Check PEBS status correctly

The kernel cannot disambiguate when 2+ PEBS counters overflow at the
same time. This is what the comment for this code suggests.  However,
I see the comparison is done with the unfiltered p->status which is a
copy of IA32_PERF_GLOBAL_STATUS at the time of the sample. This
register contains more than the PEBS counter overflow bits. It also
includes many other bits which could also be set.

Signed-off-by: Namhyung Kim 
Signed-off-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201126110922.317681-2-namhy...@kernel.org
---
 arch/x86/events/intel/ds.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 89dba58..485c506 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1916,7 +1916,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
*iregs, struct perf_sample_d
 * that caused the PEBS record. It's called collision.
 * If collision happened, the record will be dropped.
 */
-   if (p->status != (1ULL << bit)) {
+   if (pebs_status != (1ULL << bit)) {
for_each_set_bit(i, (unsigned long *)&pebs_status, size)
error[i]++;
continue;


Re: [RFC 1/2] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

2020-12-02 Thread Stephane Eranian
On Wed, Dec 2, 2020 at 2:42 PM Andi Kleen  wrote:
>
> On Wed, Dec 02, 2020 at 11:47:25AM -0800, Stephane Eranian wrote:
> > On Wed, Dec 2, 2020 at 11:28 AM Andi Kleen  wrote:
> > >
> > > > + prev_cgrp = task_css_check(prev, perf_event_cgrp_id, 1)->cgroup;
> > > > + next_cgrp = task_css_check(next, perf_event_cgrp_id, 1)->cgroup;
> > > > +
> > > > + if (prev_cgrp != next_cgrp)
> > > > + perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);
> > >
> > > Seems to be the perf cgroup only, not all cgroups.
> > > That's a big difference and needs to be documented properly.
> > >
> > We care about the all-cgroup case.
>
> Then it's not correct I think. You need a different hook point.
>
I realize that ;-(


Re: [RFC 1/2] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

2020-12-02 Thread Stephane Eranian
On Wed, Dec 2, 2020 at 11:28 AM Andi Kleen  wrote:
>
> > + prev_cgrp = task_css_check(prev, perf_event_cgrp_id, 1)->cgroup;
> > + next_cgrp = task_css_check(next, perf_event_cgrp_id, 1)->cgroup;
> > +
> > + if (prev_cgrp != next_cgrp)
> > + perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);
>
> Seems to be the perf cgroup only, not all cgroups.
> That's a big difference and needs to be documented properly.
>
We care about the all-cgroup case.

> Probably would make sense to have two events for both, one for
> all cgroups and one for perf only.
>
>
>
> -Andi


Re: [RFC] perf/x86: Fix a warning on x86_pmu_stop()

2020-11-24 Thread Stephane Eranian
Hi,

Another remark on the PEBS drainage code: one test seems to me not quite
correct, in intel_pmu_drain_pebs_nhm():

intel_pmu_drain_pebs_nhm()
{
        ...
        if (p->status != (1ULL << bit)) {
                for_each_set_bit(i, (unsigned long *)&pebs_status, size)
                        error[i]++;
                continue;
        }

The kernel cannot disambiguate when 2+ PEBS counters overflow at the same
time. This is what the comment for this code suggests. However, I see the
comparison is done with the unfiltered p->status, which is a copy of
IA32_PERF_GLOBAL_STATUS at the time of the sample. This register contains
more than the PEBS counter overflow bits; it also includes many other bits
which could be set.

Shouldn't this test use pebs_status instead (which covers only the PEBS
counters)?

        if (pebs_status != (1ULL << bit)) {
        }

Or am I missing something?
Thanks.


On Tue, Nov 24, 2020 at 12:09 AM Peter Zijlstra  wrote:
>
> On Tue, Nov 24, 2020 at 02:01:39PM +0900, Namhyung Kim wrote:
>
> > Yes, it's not about __intel_pmu_pebs_event().  I'm looking at
> > intel_pmu_drain_pebs_nhm() specifically.  There's code like
> >
> > /* log dropped samples number */
> > if (error[bit]) {
> > perf_log_lost_samples(event, error[bit]);
> >
> > if (perf_event_account_interrupt(event))
> > x86_pmu_stop(event, 0);
> > }
> >
> > if (counts[bit]) {
> > __intel_pmu_pebs_event(event, iregs, base,
> >top, bit, counts[bit],
> >setup_pebs_fixed_sample_data);
> > }
> >
> > There's a path to x86_pmu_stop() when an error bit is on.
>
> That would seem to suggest you try something like this:
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 31b9e58b03fe..8c6ee8be8b6e 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1945,7 +1945,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
> *iregs, struct perf_sample_d
> if (error[bit]) {
> perf_log_lost_samples(event, error[bit]);
>
> -   if (perf_event_account_interrupt(event))
> +   if (iregs && perf_event_account_interrupt(event))
> x86_pmu_stop(event, 0);
> }
>


Re: [PATCH] perf/intel: Remove Perfmon-v4 counter_freezing support

2020-11-10 Thread Stephane Eranian
On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra  wrote:
>
> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> > > The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
> > > platforms can provide an architectural state of the instruction executed
> > > after the instruction that caused the event. This patch set enables the
> > > the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
> > > The Linux guest can use PEBS feature like native:
> > >
> > >   # perf record -e instructions:ppp ./br_instr a
> > >   # perf record -c 10 -e instructions:pp ./br_instr a
> > >
> > > If the counter_freezing is not enabled on the host, the guest PEBS will
> > > be disabled on purpose when host is using PEBS facility. By default,
> > > KVM disables the co-existence of guest PEBS and host PEBS.
> >
> > Uuhh, what?!? counter_freezing should never be enabled, its broken. Let
> > me go delete all that code.
>
> ---
> Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
>
> Perfmon-v4 counter freezing is fundamentally broken; remove this default
> disabled code to make sure nobody uses it.
>
> The feature is called Freeze-on-PMI in the SDM, and if it would do that,
> there wouldn't actually be a problem, *however* it does something subtly
> different. It globally disables the whole PMU when it raises the PMI,
> not when the PMI hits.
>
> This means there's a window between the PMI getting raised and the PMI
> actually getting served where we loose events and this violates the
> perf counter independence. That is, a counting event should not result
> in a different event count when there is a sampling event co-scheduled.
>

What is implemented is Freeze-on-Overflow, yet it is described as
Freeze-on-PMI. That, in itself, is a problem. I agree with you on that point.

However, there are use cases for both modes.

I can sample on event A and count on B and C, and when A overflows, I want to
snapshot B and C. For that I want B and C at the moment of the overflow, not
at the moment the PMI is delivered, so you would want the Freeze-on-Overflow
behavior. You can collect in this mode with the perf tool,
IIRC: perf record -e '{cycles,instructions,branches:S}'

The other usage model is that of the replay debugger (rr) which you are
alluding to, which needs a precise count of an event, including during the
skid window. For that, you need Freeze-on-PMI (on delivery). Note that this
tool likely only cares about user-level occurrences of events.

As for counter independence, I am not sure it holds in all cases. If the
events are set up for user+kernel, then as soon as you co-schedule a sampling
event you will likely get more counts on the counting event, due to the
additional kernel entries/exits caused by interrupt-based profiling. Even if
you were to restrict to user level only, I would expect to see a few more
counts.
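
A rough way to observe this (commands and workload purely illustrative): pin
a workload to one CPU and compare its user-level count with and without a
high-frequency sampling session on the same CPU:

  # baseline
  $ perf stat -e instructions:u -- taskset -c 0 ./workload

  # same measurement with a sampling event co-scheduled on CPU 0
  $ perf record -C 0 -F 20000 -e cycles -o /dev/null -- sleep 30 &
  $ perf stat -e instructions:u -- taskset -c 0 ./workload

Even with the :u modifier I would expect the second count to drift a little.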


> This is known to break existing software.
>
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/events/intel/core.c | 152 
> ---
>  arch/x86/events/perf_event.h |   3 +-
>  2 files changed, 1 insertion(+), 154 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index c79748f6921d..9909dfa6fb12 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2121,18 +2121,6 @@ static void intel_tfa_pmu_enable_all(int added)
> intel_pmu_enable_all(added);
>  }
>
> -static void enable_counter_freeze(void)
> -{
> -   update_debugctlmsr(get_debugctlmsr() |
> -   DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
> -static void disable_counter_freeze(void)
> -{
> -   update_debugctlmsr(get_debugctlmsr() &
> -   ~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
>  static inline u64 intel_pmu_get_status(void)
>  {
> u64 status;
> @@ -2696,95 +2684,6 @@ static int handle_pmi_common(struct pt_regs *regs, u64 
> status)
> return handled;
>  }
>
> -static bool disable_counter_freezing = true;
> -static int __init intel_perf_counter_freezing_setup(char *s)
> -{
> -   bool res;
> -
> -   if (kstrtobool(s, ))
> -   return -EINVAL;
> -
> -   disable_counter_freezing = !res;
> -   return 1;
> -}
> -__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);
> -
> -/*
> - * Simplified handler for Arch Perfmon v4:
> - * - We rely on counter freezing/unfreezing to enable/disable the PMU.
> - * This is done automatically on PMU ack.
> - * - Ack the PMU only after the APIC.
> - */
> -
> -static int intel_pmu_handle_irq_v4(struct pt_regs *regs)
> -{
> -   struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events);
> -   int handled = 0;
> -   bool bts = false;
> -   u64 status;
> -   int pmu_enabled = cpuc->enabled;
> -   int loops = 0;
> -
> -   /* PMU has been disabled because of counter freezing */
> -   

[tip: perf/urgent] perf/x86/intel: Make anythread filter support conditional

2020-11-10 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/urgent branch of tip:

Commit-ID: cadbaa039b99a6d5c26ce1c7f2fc0325943e605a
Gitweb:
https://git.kernel.org/tip/cadbaa039b99a6d5c26ce1c7f2fc0325943e605a
Author:Stephane Eranian 
AuthorDate:Wed, 28 Oct 2020 12:42:47 -07:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 09 Nov 2020 18:12:36 +01:00

perf/x86/intel: Make anythread filter support conditional

Starting with Arch Perfmon v5, the anythread filter on generic counters may be
deprecated. The current kernel was exporting the any filter without checking.
On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter
does not exist. This patch corrects the problem by relying on the CPUID 0xa leaf
function to determine if anythread is supported or not as described in the
Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section.

Signed-off-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201028194247.3160610-1-eran...@google.com
---
 arch/x86/events/intel/core.c  | 10 ++
 arch/x86/events/perf_event.h  |  1 +
 arch/x86/include/asm/perf_event.h |  4 +++-
 arch/x86/kvm/cpuid.c  |  4 +++-
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c37387c..af457f8 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4987,6 +4987,12 @@ __init int intel_pmu_init(void)
 
x86_add_quirk(intel_arch_events_quirk); /* Install first, so it runs 
last */
 
+   if (version >= 5) {
+   x86_pmu.intel_cap.anythread_deprecated = 
edx.split.anythread_deprecated;
+   if (x86_pmu.intel_cap.anythread_deprecated)
+   pr_cont(" AnyThread deprecated, ");
+   }
+
/*
 * Install the hw-cache-events table:
 */
@@ -5512,6 +5518,10 @@ __init int intel_pmu_init(void)
x86_pmu.intel_ctrl |=
((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
 
+   /* AnyThread may be deprecated on arch perfmon v5 or later */
+   if (x86_pmu.intel_cap.anythread_deprecated)
+   x86_pmu.format_attrs = intel_arch_formats_attr;
+
if (x86_pmu.event_constraints) {
/*
 * event on fixed counter2 (REF_CYCLES) only works on this
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1d1fe46..6a8edfe 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -585,6 +585,7 @@ union perf_capabilities {
u64 pebs_baseline:1;
u64 perf_metrics:1;
u64 pebs_output_pt_available:1;
+   u64 anythread_deprecated:1;
};
u64 capabilities;
 };
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 6960cd6..b9a7fd0 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -137,7 +137,9 @@ union cpuid10_edx {
struct {
unsigned int num_counters_fixed:5;
unsigned int bit_width_fixed:8;
-   unsigned int reserved:19;
+   unsigned int reserved1:2;
+   unsigned int anythread_deprecated:1;
+   unsigned int reserved2:16;
} split;
unsigned int full;
 };
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 06a278b..0752dec 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -672,7 +672,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
 
edx.split.num_counters_fixed = min(cap.num_counters_fixed, 
MAX_FIXED_COUNTERS);
edx.split.bit_width_fixed = cap.bit_width_fixed;
-   edx.split.reserved = 0;
+   edx.split.anythread_deprecated = 1;
+   edx.split.reserved1 = 0;
+   edx.split.reserved2 = 0;
 
entry->eax = eax.full;
entry->ebx = cap.events_mask;


Re: [RFC 2/2] perf/core: Invoke pmu::sched_task callback for per-cpu events

2020-11-05 Thread Stephane Eranian
On Thu, Nov 5, 2020 at 11:40 AM Liang, Kan  wrote:
>
>
>
> On 11/5/2020 10:54 AM, Namhyung Kim wrote:
> >> -void perf_sched_cb_inc(struct pmu *pmu)
> >> +void perf_sched_cb_inc(struct pmu *pmu, bool systemwide)
> >>{
> >>  struct perf_cpu_context *cpuctx = 
> >> this_cpu_ptr(pmu->pmu_cpu_context);
> >>
> >> -   if (!cpuctx->sched_cb_usage++)
> >> -   list_add(>sched_cb_entry, 
> >> this_cpu_ptr(_cb_list));
> >> +   cpuctx->sched_cb_usage++;
> >>
> >> -   this_cpu_inc(perf_sched_cb_usages);
> >> +   if (systemwide) {
> >> +   this_cpu_inc(perf_sched_cb_usages);
> >> +   list_add(>sched_cb_entry, 
> >> this_cpu_ptr(_cb_list));
> > You need to check the value and make sure it's added only once.
>
> Right, maybe we have to add a new variable for that.
>
Sure, I tend to agree here that we need a narrower-scope trigger, used only
when needed, i.e., for an event or PMU feature that requires context-switch
work. In fact, I am also interested in splitting the ctxsw-in and ctxsw-out
callbacks. The reason is that there is overhead in the way the callback is
invoked: you may end up calling sched_task on ctxsw-out when only ctxsw-in is
needed. In doing that you pay the cost of stopping/starting the PMU for
possibly nothing. Stopping the PMU can be expensive, like on AMD where you
need multiple wrmsr.

So splitting the callback, or adding a flag to convey whether CTXSW_IN or
CTXSW_OUT is needed, would help. I am suggesting this now given you are
adding a flag.
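
Something along these lines is what I have in mind (purely a sketch of the
idea; the flag names are invented):

/* let the PMU say which half of the switch it actually needs,
 * so the core can skip the other one entirely */
#define PERF_SCHED_CB_SW_IN    0x1
#define PERF_SCHED_CB_SW_OUT   0x2

extern void perf_sched_cb_inc(struct pmu *pmu, bool systemwide, int flags);

/* e.g. large PEBS only needs to flush on sched-out: */
perf_sched_cb_inc(event->ctx->pmu,
                  !(event->attach_state & PERF_ATTACH_TASK),
                  PERF_SCHED_CB_SW_OUT);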

>
> diff --git a/arch/powerpc/perf/core-book3s.c
> b/arch/powerpc/perf/core-book3s.c
> index 6586f7e71cfb..63c9b87cab5e 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -380,7 +380,7 @@ static void power_pmu_bhrb_enable(struct perf_event
> *event)
> cpuhw->bhrb_context = event->ctx;
> }
> cpuhw->bhrb_users++;
> -   perf_sched_cb_inc(event->ctx->pmu);
> +   perf_sched_cb_inc(event->ctx->pmu, !(event->attach_state &
> PERF_ATTACH_TASK));
>   }
>
>   static void power_pmu_bhrb_disable(struct perf_event *event)
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 444e5f061d04..a34b90c7fa6d 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1022,9 +1022,9 @@ pebs_update_state(bool needed_cb, struct
> cpu_hw_events *cpuc,
>
> if (needed_cb != pebs_needs_sched_cb(cpuc)) {
> if (!needed_cb)
> -   perf_sched_cb_inc(pmu);
> +   perf_sched_cb_inc(pmu, !(event->attach_state & 
> PERF_ATTACH_TASK));
> else
> -   perf_sched_cb_dec(pmu);
> +   perf_sched_cb_dec(pmu, !(event->attach_state & 
> PERF_ATTACH_TASK));
>
> update = true;
> }
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index 8961653c5dd2..8d4d02cde3d4 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -693,7 +693,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
>  */
> if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
> cpuc->lbr_pebs_users++;
> -   perf_sched_cb_inc(event->ctx->pmu);
> +   perf_sched_cb_inc(event->ctx->pmu, !(event->attach_state &
> PERF_ATTACH_TASK));
> if (!cpuc->lbr_users++ && !event->total_time_running)
> intel_pmu_lbr_reset();
>
> @@ -740,7 +740,7 @@ void intel_pmu_lbr_del(struct perf_event *event)
> cpuc->lbr_users--;
> WARN_ON_ONCE(cpuc->lbr_users < 0);
> WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
> -   perf_sched_cb_dec(event->ctx->pmu);
> +   perf_sched_cb_dec(event->ctx->pmu, !(event->attach_state &
> PERF_ATTACH_TASK));
>   }
>
>   static inline bool vlbr_exclude_host(void)
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a1b91f2de264..14f936385cc8 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -875,6 +875,7 @@ struct perf_cpu_context {
>
> struct list_headsched_cb_entry;
> int sched_cb_usage;
> +   int sched_cb_sw_usage;
>
> int online;
> /*
> @@ -967,8 +968,8 @@ extern const struct perf_event_attr
> *perf_event_attrs(struct perf_event *event);
>   extern void perf_event_print_debug(void);
>   extern void perf_pmu_disable(struct pmu *pmu);
>   extern void perf_pmu_enable(struct pmu *pmu);
> -extern void perf_sched_cb_dec(struct pmu *pmu);
> -extern void perf_sched_cb_inc(struct pmu *pmu);
> +extern void perf_sched_cb_dec(struct pmu *pmu, bool systemwide);
> +extern void perf_sched_cb_inc(struct pmu *pmu, bool systemwide);
>   extern int perf_event_task_disable(void);
>   extern int perf_event_task_enable(void);
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 66a9bd71f3da..af75859c9138 

Re: [RFC] perf evlist: Warn if event group has mixed sw/hw events

2020-11-05 Thread Stephane Eranian
On Mon, Oct 26, 2020 at 7:19 AM Namhyung Kim  wrote:
>
> I found that order of events in a group impacts performance during the
> open.  If a group has a software event as a leader and has other
> hardware events, the lead needs to be moved to a hardware context.
> This includes RCU synchronization which takes about 20 msec on my
> system.  And this is just for a single group, so total time increases
> in proportion to the number of event groups and the number of cpus.
>
> On my 36 cpu system, opening 3 groups system-wide takes more than 2
> seconds.  You can see and compare it easily with the following:
>
>   $ time ./perf stat -a -e '{cs,cycles},{cs,cycles},{cs,cycles}' sleep 1
>   ...
>1.006333430 seconds time elapsed
>
>   real  0m3.969s
>   user  0m0.089s
>   sys   0m0.074s
>
>   $ time ./perf stat -a -e '{cycles,cs},{cycles,cs},{cycles,cs}' sleep 1
>   ...
>1.006755292 seconds time elapsed
>
>   real  0m1.144s
>   user  0m0.067s
>   sys   0m0.083s
>
> This patch just added a warning before running it.  I'd really want to
> fix the kernel if possible but don't have a good idea.  Thoughts?
>
This is a problem for us. It has caused problems on our systems, with the
perf command taking much longer than expected and firing timeouts.

The cost of perf_event_open() should not be so dependent on the order of the
events in a group. The penalty incurred by synchronize_rcu() is very large
and likely does not scale well. Scalability may not only be impacted by the
number of CPUs in the machine. I am not an RCU expert, but it seems this
exposes perf_event_open() to penalties caused by other subsystem operations.
I am wondering if there is a different way of handling the change of group
type that would avoid the high cost of synchronize_rcu().


> Signed-off-by: Namhyung Kim 
> ---
>  tools/perf/builtin-record.c |  2 +
>  tools/perf/builtin-stat.c   |  2 +
>  tools/perf/builtin-top.c|  2 +
>  tools/perf/util/evlist.c| 78 +
>  tools/perf/util/evlist.h|  1 +
>  5 files changed, 85 insertions(+)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index adf311d15d3d..c0b08cacbae0 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -912,6 +912,8 @@ static int record__open(struct record *rec)
>
> perf_evlist__config(evlist, opts, _param);
>
> +   evlist__warn_mixed_group(evlist);
> +
> evlist__for_each_entry(evlist, pos) {
>  try_again:
> if (evsel__open(pos, pos->core.cpus, pos->core.threads) < 0) {
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index b01af171d94f..d5d4e02bda69 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -738,6 +738,8 @@ static int __run_perf_stat(int argc, const char **argv, 
> int run_idx)
> if (affinity__setup() < 0)
> return -1;
>
> +   evlist__warn_mixed_group(evsel_list);
> +
> evlist__for_each_cpu (evsel_list, i, cpu) {
> affinity__set(, cpu);
>
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 7c64134472c7..9ad319cea948 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -1027,6 +1027,8 @@ static int perf_top__start_counters(struct perf_top 
> *top)
>
> perf_evlist__config(evlist, opts, _param);
>
> +   evlist__warn_mixed_group(evlist);
> +
> evlist__for_each_entry(evlist, counter) {
>  try_again:
> if (evsel__open(counter, top->evlist->core.cpus,
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 8bdf3d2c907c..02cff39e509e 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "parse-events.h"
>  #include 
> @@ -1980,3 +1981,80 @@ struct evsel *evlist__find_evsel(struct evlist 
> *evlist, int idx)
> }
> return NULL;
>  }
> +
> +static int *sw_types;
> +static int nr_sw_types;
> +
> +static void collect_software_pmu_types(void)
> +{
> +   const char *known_sw_pmu[] = {
> +   "software", "tracepoint", "breakpoint", "kprobe", "uprobe", 
> "msr"
> +   };
> +   DIR *dir;
> +   struct dirent *d;
> +   char path[PATH_MAX];
> +   int i;
> +
> +   if (sw_types != NULL)
> +   return;
> +
> +   nr_sw_types = ARRAY_SIZE(known_sw_pmu);
> +   sw_types = calloc(nr_sw_types, sizeof(int));
> +   if (sw_types == NULL) {
> +   pr_err("Memory allocation failed!\n");
> +   return;
> +   }
> +
> +   dir = opendir("/sys/bus/event_source/devices");
> +   while ((d = readdir(dir)) != NULL) {
> +   for (i = 0; i < nr_sw_types; i++) {
> +   if (strcmp(d->d_name, known_sw_pmu[i]))
> +   continue;
> +
> +   snprintf(path, sizeof(path), 

Re: [RFC 0/2] perf/core: Invoke pmu::sched_task callback for cpu events

2020-11-05 Thread Stephane Eranian
On Mon, Nov 2, 2020 at 6:52 AM Namhyung Kim  wrote:
>
> Hello,
>
> It was reported that system-wide events with precise_ip set have a lot
> of unknown symbols on Intel machines.  Depending on the system load I
> can see more than 30% of total symbols are not resolved (actually
> don't have DSO mappings).
>
> I found that it's only large PEBS is enabled - using call-graph or the
> frequency mode will disable it and have valid results.  I've verified
> it by checking intel_pmu_pebs_sched_task() is called like below:
>
>   # perf probe -a intel_pmu_pebs_sched_task
>
>   # perf stat -a -e probe:intel_pmu_pebs_sched_task \
>   >   perf record -a -e cycles:ppp -c 11 sleep 1
>   [ perf record: Woken up 1 times to write data ]
>   [ perf record: Captured and wrote 2.625 MB perf.data (10345 samples) ]
>
>Performance counter stats for 'system wide':
>
>  0  probe:intel_pmu_pebs_sched_task
>
>2.157533991 seconds time elapsed
>
>
> Looking at the code, I found out that the pmu::sched_task callback was
> changed recently that it's called only for task events.  So cpu events
> with large PEBS didn't flush the buffer and they are attributed to
> unrelated tasks later resulted in unresolved symbols.
>
> This patch reverts it and keeps the optimization for task events.
> While at it, I also found the context switch callback was not enabled
> for cpu events from the beginning.  So I've added it too.  With this
> applied, I can see the above callbacks are hit as expected and perf
> report has valid symbols.
>
This is a serious bug that impacts many kernel versions, as soon as
multi-entry PEBS is activated by the kernel in system-wide mode. I remember
this working in the past, so it must have been broken by some code
refactoring, optimization, or extension of sched_task to other features. PEBS
must be flushed on context switch in per-cpu mode, otherwise you may report
samples in locations that do not belong to the process they are attributed
to; PEBS does not tag samples with PID/TID.


[tip: perf/core] perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE

2020-10-29 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 995f088efebe1eba0282a6ffa12411b37f8990c2
Gitweb:
https://git.kernel.org/tip/995f088efebe1eba0282a6ffa12411b37f8990c2
Author:Stephane Eranian 
AuthorDate:Thu, 01 Oct 2020 06:57:49 -07:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 29 Oct 2020 11:00:39 +01:00

perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE

When studying code layout, it is useful to capture the page size of the
sampled code address.

Add a new sample type for code page size.
The new sample type requires collecting the ip. The code page size can
be calculated from the NMI-safe perf_get_page_size().

For large PEBS, it's very unlikely that the mapping is gone for the
earlier PEBS records. Enable the feature for the large PEBS. The worst
case is that page-size '0' is returned.

Signed-off-by: Kan Liang 
Signed-off-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.li...@linux.intel.com
---
 arch/x86/events/perf_event.h|  2 +-
 include/linux/perf_event.h  |  1 +
 include/uapi/linux/perf_event.h |  4 +++-
 kernel/events/core.c| 11 ++-
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ee2b9b9..10032f0 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -132,7 +132,7 @@ struct amd_nb {
PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | \
-   PERF_SAMPLE_PERIOD)
+   PERF_SAMPLE_PERIOD | PERF_SAMPLE_CODE_PAGE_SIZE)
 
 #define PEBS_GP_REGS   \
((1ULL << PERF_REG_X86_AX)| \
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7e3785d..e533b03 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1035,6 +1035,7 @@ struct perf_sample_data {
u64 phys_addr;
u64 cgroup;
u64 data_page_size;
+   u64 code_page_size;
 } cacheline_aligned;
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index cc6ea34..c2f20ee 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -144,8 +144,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_AUX = 1U << 20,
PERF_SAMPLE_CGROUP  = 1U << 21,
PERF_SAMPLE_DATA_PAGE_SIZE  = 1U << 22,
+   PERF_SAMPLE_CODE_PAGE_SIZE  = 1U << 23,
 
-   PERF_SAMPLE_MAX = 1U << 23, /* non-ABI */
+   PERF_SAMPLE_MAX = 1U << 24, /* non-ABI */
 
__PERF_SAMPLE_CALLCHAIN_EARLY   = 1ULL << 63, /* non-ABI; 
internal use */
 };
@@ -898,6 +899,7 @@ enum perf_event_type {
 *  { u64   size;
 *char  data[size]; } && PERF_SAMPLE_AUX
 *  { u64   data_page_size;} && 
PERF_SAMPLE_DATA_PAGE_SIZE
+*  { u64   code_page_size;} && 
PERF_SAMPLE_CODE_PAGE_SIZE
 * };
 */
PERF_RECORD_SAMPLE  = 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a796db2..7f655d1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1898,6 +1898,9 @@ static void __perf_event_header_size(struct perf_event 
*event, u64 sample_type)
if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
size += sizeof(data->data_page_size);
 
+   if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
+   size += sizeof(data->code_page_size);
+
event->header_size = size;
 }
 
@@ -6945,6 +6948,9 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
perf_output_put(handle, data->data_page_size);
 
+   if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
+   perf_output_put(handle, data->code_page_size);
+
if (sample_type & PERF_SAMPLE_AUX) {
perf_output_put(handle, data->aux_size);
 
@@ -7125,7 +7131,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 
__perf_event_header__init_id(header, data, event);
 
-   if (sample_type & PERF_SAMPLE_IP)
+   if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
data->ip = perf_instruction_pointer(regs);
 
if (sample_type & PERF_SAMPLE_CALLCHAIN) {
@@ -7253,6 +7259,9 @@ void perf_prepare_sample(struct perf_event_header *header,
if (sample_type &

[PATCH v2] perf/x86/intel: make anythread filter support conditional

2020-10-28 Thread Stephane Eranian
Starting with Arch Perfmon v5, the anythread filter on generic counters may be
deprecated. The current kernel was exporting the any filter without checking.
On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter
does not exist. This patch corrects the problem by relying on the CPUID 0xa leaf
function to determine if anythread is supported or not as described in the
Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section.

In V2, we remove intel_arch_v4_format_attrs because it is a duplicate
of intel_arch_format_attrs.

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/intel/core.c  | 10 ++
 arch/x86/events/perf_event.h  |  1 +
 arch/x86/include/asm/perf_event.h |  4 +++-
 arch/x86/kvm/cpuid.c  |  4 +++-
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f1926e9f2143..7daab613052b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4987,6 +4987,12 @@ __init int intel_pmu_init(void)
 
x86_add_quirk(intel_arch_events_quirk); /* Install first, so it runs 
last */
 
+   if (version >= 5) {
+   x86_pmu.intel_cap.anythread_deprecated = 
edx.split.anythread_deprecated;
+   if (x86_pmu.intel_cap.anythread_deprecated)
+   pr_cont(" AnyThread deprecated, ");
+   }
+
/*
 * Install the hw-cache-events table:
 */
@@ -5512,6 +5518,10 @@ __init int intel_pmu_init(void)
x86_pmu.intel_ctrl |=
((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
 
+   /* AnyThread may be deprecated on arch perfmon v5 or later */
+   if (x86_pmu.intel_cap.anythread_deprecated)
+   x86_pmu.format_attrs = intel_arch_formats_attr;
+
if (x86_pmu.event_constraints) {
/*
 * event on fixed counter2 (REF_CYCLES) only works on this
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ee2b9b9fc2a5..906b494083a8 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -585,6 +585,7 @@ union perf_capabilities {
u64 pebs_baseline:1;
u64 perf_metrics:1;
u64 pebs_output_pt_available:1;
+   u64 anythread_deprecated:1;
};
u64 capabilities;
 };
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 6960cd6d1f23..b9a7fd0a27e2 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -137,7 +137,9 @@ union cpuid10_edx {
struct {
unsigned int num_counters_fixed:5;
unsigned int bit_width_fixed:8;
-   unsigned int reserved:19;
+   unsigned int reserved1:2;
+   unsigned int anythread_deprecated:1;
+   unsigned int reserved2:16;
} split;
unsigned int full;
 };
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7456f9ad424b..09097d430961 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -636,7 +636,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
 
edx.split.num_counters_fixed = min(cap.num_counters_fixed, 
MAX_FIXED_COUNTERS);
edx.split.bit_width_fixed = cap.bit_width_fixed;
-   edx.split.reserved = 0;
+   edx.split.anythread_deprecated = 1;
+   edx.split.reserved1 = 0;
+   edx.split.reserved2 = 0;
 
entry->eax = eax.full;
entry->ebx = cap.events_mask;
-- 
2.29.1.341.ge80a0c044ae-goog



Re: [PATCH] perf/x86/intel: make anythread filter support conditional

2020-10-22 Thread Stephane Eranian
On Thu, Oct 22, 2020 at 1:00 AM Peter Zijlstra  wrote:
>
> On Wed, Oct 21, 2020 at 02:16:12PM -0700, Stephane Eranian wrote:
> > Starting with Arch Perfmon v5, the anythread filter on generic counters may 
> > be
> > deprecated. The current kernel was exporting the any filter without 
> > checking.
> > On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter
> > does not exist. This patch corrects the problem by relying on the CPUID 0xa 
> > leaf
> > function to determine if anythread is supported or not as described in the
> > Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section.
> >
> > Signed-off-by: Stephane Eranian 
> > ---
> >  arch/x86/events/intel/core.c  | 20 
> >  arch/x86/events/perf_event.h  |  1 +
> >  arch/x86/include/asm/perf_event.h |  4 +++-
> >  arch/x86/kvm/cpuid.c  |  4 +++-
> >  4 files changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index f1926e9f2143..65bf649048a6 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -4220,6 +4220,16 @@ static struct attribute *intel_arch3_formats_attr[] 
> > = {
> >   NULL,
> >  };
> >
> > +static struct attribute *intel_arch5_formats_attr[] = {
> > + &format_attr_event.attr,
> > + &format_attr_umask.attr,
> > + &format_attr_edge.attr,
> > + &format_attr_pc.attr,
> > + &format_attr_inv.attr,
> > + &format_attr_cmask.attr,
> > + NULL,
> > +};
>
> Instead of adding yet another (which is an exact duplicate of the
> existing intel_arch_formats_attr BTW), can't we clean this up and use
> is_visible() as 'demanded' by GregKH and done by Jiri here:
>
>   3d5672735b23 ("perf/x86: Add is_visible attribute_group callback for base 
> events")
>   b7c9b3927337 ("perf/x86/intel: Use ->is_visible callback for default group")
>   baa0c83363c7 ("perf/x86: Use the new pmu::update_attrs attribute group")
>
> And only have "any" visible for v3,v4

Sure, let me resubmit with these changes.


[PATCH] perf/x86/intel: make anythread filter support conditional

2020-10-21 Thread Stephane Eranian
Starting with Arch Perfmon v5, the anythread filter on generic counters may be
deprecated. The current kernel was exporting the any filter without checking.
On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter
does not exist. This patch corrects the problem by relying on the CPUID 0xa leaf
function to determine if anythread is supported or not as described in the
Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section.

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/intel/core.c  | 20 
 arch/x86/events/perf_event.h  |  1 +
 arch/x86/include/asm/perf_event.h |  4 +++-
 arch/x86/kvm/cpuid.c  |  4 +++-
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f1926e9f2143..65bf649048a6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4220,6 +4220,16 @@ static struct attribute *intel_arch3_formats_attr[] = {
NULL,
 };
 
+static struct attribute *intel_arch5_formats_attr[] = {
+   &format_attr_event.attr,
+   &format_attr_umask.attr,
+   &format_attr_edge.attr,
+   &format_attr_pc.attr,
+   &format_attr_inv.attr,
+   &format_attr_cmask.attr,
+   NULL,
+};
+
 static struct attribute *hsw_format_attr[] = {
&format_attr_in_tx.attr,
&format_attr_in_tx_cp.attr,
@@ -4987,6 +4997,12 @@ __init int intel_pmu_init(void)
 
x86_add_quirk(intel_arch_events_quirk); /* Install first, so it runs 
last */
 
+   if (version >= 5) {
+   x86_pmu.intel_cap.anythread_deprecated = 
edx.split.anythread_deprecated;
+   if (x86_pmu.intel_cap.anythread_deprecated)
+   pr_cont(" AnyThread deprecated, ");
+   }
+
/*
 * Install the hw-cache-events table:
 */
@@ -5512,6 +5528,10 @@ __init int intel_pmu_init(void)
x86_pmu.intel_ctrl |=
((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
 
+   /* AnyThread may be deprecated on arch perfmon v5 or later */
+   if (x86_pmu.intel_cap.anythread_deprecated)
+   x86_pmu.format_attrs = intel_arch5_formats_attr;
+
if (x86_pmu.event_constraints) {
/*
 * event on fixed counter2 (REF_CYCLES) only works on this
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ee2b9b9fc2a5..906b494083a8 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -585,6 +585,7 @@ union perf_capabilities {
u64 pebs_baseline:1;
u64 perf_metrics:1;
u64 pebs_output_pt_available:1;
+   u64 anythread_deprecated:1;
};
u64 capabilities;
 };
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 6960cd6d1f23..b9a7fd0a27e2 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -137,7 +137,9 @@ union cpuid10_edx {
struct {
unsigned int num_counters_fixed:5;
unsigned int bit_width_fixed:8;
-   unsigned int reserved:19;
+   unsigned int reserved1:2;
+   unsigned int anythread_deprecated:1;
+   unsigned int reserved2:16;
} split;
unsigned int full;
 };
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7456f9ad424b..09097d430961 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -636,7 +636,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
 
edx.split.num_counters_fixed = min(cap.num_counters_fixed, 
MAX_FIXED_COUNTERS);
edx.split.bit_width_fixed = cap.bit_width_fixed;
-   edx.split.reserved = 0;
+   edx.split.anythread_deprecated = 1;
+   edx.split.reserved1 = 0;
+   edx.split.reserved2 = 0;
 
entry->eax = eax.full;
entry->ebx = cap.events_mask;
-- 
2.29.0.rc2.309.g374f81d7ae-goog



Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-30 Thread Stephane Eranian
On Wed, Sep 30, 2020 at 10:30 AM Peter Zijlstra  wrote:
>
> On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote:
> > On 9/30/20 7:42 AM, Liang, Kan wrote:
> > >> When I tested on my kernel, it panicked because I suspect
> > >> current->active_mm could be NULL. Adding a check for NULL avoided the
> > >> problem. But I suspect this is not the correct solution.
> > >
> > > I guess the NULL active_mm should be a rare case. If so, I think it's
> > > not bad to add a check and return 0 page size.
> >
> > I think it would be best to understand why ->active_mm is NULL instead
> > of just papering over the problem.  If it is papered over, and this is
> > common, you might end up effectively turning off your shiny new feature
> > inadvertently.
>
> context_switch() can set prev->active_mm to NULL when it transfers it to
> @next. It does this before @current is updated. So an NMI that comes in
> between this active_mm swizzling and updating @current will see
> !active_mm.
>
I think Peter is right. This code is called in NMI context, so even if
active_mm is only set to NULL inside a locked section, that is not
enough to protect the perf_events code from observing the NULL value.

> In general though; I think using ->active_mm is a mistake though. That
> code should be doing something like:
>
>
> mm = current->mm;
> if (!mm)
> mm = &init_mm;
>
>


Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-30 Thread Stephane Eranian
On Wed, Sep 30, 2020 at 7:48 AM Dave Hansen  wrote:
>
> On 9/30/20 7:42 AM, Liang, Kan wrote:
> >> When I tested on my kernel, it panicked because I suspect
> >> current->active_mm could be NULL. Adding a check for NULL avoided the
> >> problem. But I suspect this is not the correct solution.
> >
> > I guess the NULL active_mm should be a rare case. If so, I think it's
> > not bad to add a check and return 0 page size.
>
> I think it would be best to understand why ->active_mm is NULL instead
> of just papering over the problem.  If it is papered over, and this is
> common, you might end up effectively turning off your shiny new feature
> inadvertently.

I tried that on a backport of the patch to an older kernel. Maybe the
behavior of active_mm has changed compared to tip.git.
I will try again with tip.git.


Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-30 Thread Stephane Eranian
On Mon, Sep 21, 2020 at 8:29 AM  wrote:
>
> From: Kan Liang 
>
> Current perf can report both virtual addresses and physical addresses,
> but not the MMU page size. Without the MMU page size information of the
> utilized page, users cannot decide whether to promote/demote large pages
> to optimize memory usage.
>
> Add a new sample type for the data MMU page size.
>
> Current perf already has a facility to collect data virtual addresses.
> A page walker is required to walk the pages tables and calculate the
> MMU page size from a given virtual address.
>
> On some platforms, e.g., X86, the page walker is invoked in an NMI
> handler. So the page walker must be NMI-safe and low overhead. Besides,
> the page walker should work for both user and kernel virtual address.
> The existing generic page walker, e.g., walk_page_range_novma(), is a
> little bit complex and doesn't guarantee the NMI-safe. The follow_page()
> is only for user-virtual address.
>
> Add a new function perf_get_page_size() to walk the page tables and
> calculate the MMU page size. In the function:
> - Interrupts have to be disabled to prevent any teardown of the page
>   tables.
> - The active_mm is used for the page walker. Compared with mm, the
>   active_mm is a better choice. It's always non-NULL. For the user
>   thread, it always points to the real address space. For the kernel
>   thread, it "take over" the mm of the threads that switched to it,
>   so it's not using all of the page tables from the init_mm all the
>   time.
> - The MMU page size is calculated from the page table level.
>
> The method should work for all architectures, but it has only been
> verified on X86. Should there be some architectures, which support perf,
> where the method doesn't work, it can be fixed later separately.
> Reporting the wrong page size would not be fatal for the architecture.
>
> Some under discussion features may impact the method in the future.
> Quote from Dave Hansen,
>   "There are lots of weird things folks are trying to do with the page
>tables, like Address Space Isolation.  For instance, if you get a
>perf NMI when running userspace, current->mm->pgd is *different* than
>the PGD that was in use when userspace was running. It's close enough
>today, but it might not stay that way."
> If the case happens later, lots of consecutive page walk errors will
> happen. The worst case is that lots of page-size '0' are returned, which
> would not be fatal.
> In the perf tool, a check is implemented to detect this case. Once it
> happens, a kernel patch could be implemented accordingly then.
>
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Kan Liang 
> ---
>  include/linux/perf_event.h  |  1 +
>  include/uapi/linux/perf_event.h |  4 +-
>  kernel/events/core.c| 93 +
>  3 files changed, 97 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0c19d279b97f..7e3785dd27d9 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1034,6 +1034,7 @@ struct perf_sample_data {
>
> u64 phys_addr;
> u64 cgroup;
> +   u64 data_page_size;
>  } cacheline_aligned;
>
>  /* default value for data source */
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 077e7ee69e3d..cc6ea346e9f9 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -143,8 +143,9 @@ enum perf_event_sample_format {
> PERF_SAMPLE_PHYS_ADDR   = 1U << 19,
> PERF_SAMPLE_AUX = 1U << 20,
> PERF_SAMPLE_CGROUP  = 1U << 21,
> +   PERF_SAMPLE_DATA_PAGE_SIZE  = 1U << 22,
>
> -   PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */
> +   PERF_SAMPLE_MAX = 1U << 23, /* non-ABI */
>
> __PERF_SAMPLE_CALLCHAIN_EARLY   = 1ULL << 63, /* non-ABI; 
> internal use */
>  };
> @@ -896,6 +897,7 @@ enum perf_event_type {
>  *  { u64   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>  *  { u64   size;
>  *char  data[size]; } && PERF_SAMPLE_AUX
> +*  { u64   data_page_size;} && 
> PERF_SAMPLE_DATA_PAGE_SIZE
>  * };
>  */
> PERF_RECORD_SAMPLE  = 9,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 45edb85344a1..dd329a8f99f7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "internal.h"
>
> @@ -1894,6 +1895,9 @@ static void __perf_event_header_size(struct perf_event 
> *event, u64 sample_type)
> if (sample_type & PERF_SAMPLE_CGROUP)
> size += sizeof(data->cgroup);
>
> +   if 

Re: [PATCH 2/5] perf stat: Add --for-each-cgroup option

2020-09-22 Thread Stephane Eranian
Hi,

On Mon, Sep 21, 2020 at 2:46 AM Namhyung Kim  wrote:
>
> The --for-each-cgroup option is a syntax sugar to monitor large number
> of cgroups easily.  Current command line requires to list all the
> events and cgroups even if users want to monitor same events for each
> cgroup.  This patch addresses that usage by copying given events for
> each cgroup on user's behalf.
>
> For instance, if they want to monitor 6 events for 200 cgroups each
> they should write 1200 event names (with -e) AND 1200 cgroup names
> (with -G) on the command line.  But with this change, they can just
> specify 6 events and 200 cgroups with a new option.
>
> A simpler example below: It wants to measure 3 events for 2 cgroups
> ('A' and 'B').  The result is that total 6 events are counted like
> below.
>
>   $ ./perf stat -a -e cpu-clock,cycles,instructions --for-each-cgroup A,B 
> sleep 1
>
You could also do it by keeping the -G option and providing
--for-each-cgroup as a modifier of the behavior of -G:

$ ./perf stat -a -e cpu-clock,cycles,instructions --for-each-cgroup -G A,B sleep 1

That way, you do not have to handle the case where both options are
used together. It also makes transitioning to the new style simpler:
the -G option remains, and you just trim the cgroup list down to the
200 unique names in your example.

Just a suggestion.
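
For illustration only, the suggestion above could be wired up roughly like the
fragment below (a sketch using perf's option macros; the exact names and
wiring are illustrative, parse_cgroups stays as-is and the new flag only
changes how the -G list is expanded later):

/* sketch: keep -G, add --for-each-cgroup as a boolean modifier of -G */
static bool for_each_cgroup;

static struct option stat_options[] = {
        OPT_CALLBACK('G', "cgroup", &evsel_list, "name",
                     "monitor event in cgroup name only", parse_cgroups),
        OPT_BOOLEAN(0, "for-each-cgroup", &for_each_cgroup,
                    "expand all events for each cgroup given to -G"),
        /* ... remaining options unchanged ... */
        OPT_END()
};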

>Performance counter stats for 'system wide':
>
>   988.18 msec cpu-clock A #0.987 CPUs utilized
>3,153,761,702  cyclesA #3.200 GHz  
> (100.00%)
>8,067,769,847  instructions  A #2.57  insn per 
> cycle   (100.00%)
>   982.71 msec cpu-clock B #0.982 CPUs utilized
>3,136,093,298  cyclesB #3.182 GHz  
> (99.99%)
>8,109,619,327  instructions  B #2.58  insn per 
> cycle   (99.99%)
>
>  1.001228054 seconds time elapsed
>
> Signed-off-by: Namhyung Kim 
> ---
>  tools/perf/builtin-stat.c | 20 +-
>  tools/perf/util/cgroup.c  | 84 +++
>  tools/perf/util/cgroup.h  |  1 +
>  tools/perf/util/stat.h|  1 +
>  4 files changed, 105 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 7f8d756d9408..a43e58e0a088 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -1051,6 +1051,17 @@ static int parse_control_option(const struct option 
> *opt,
> return evlist__parse_control(str, &config->ctl_fd,
> &config->ctl_fd_ack, &config->ctl_fd_close);
>  }
>
> +static int parse_stat_cgroups(const struct option *opt,
> + const char *str, int unset)
> +{
> +   if (stat_config.cgroup_list) {
> +   pr_err("--cgroup and --for-each-cgroup cannot be used 
> together\n");
> +   return -1;
> +   }
> +
> +   return parse_cgroups(opt, str, unset);
> +}
> +
>  static struct option stat_options[] = {
> OPT_BOOLEAN('T', "transaction", &transaction_run,
> "hardware transaction statistics"),
> @@ -1094,7 +1105,9 @@ static struct option stat_options[] = {
> OPT_STRING('x', "field-separator", &stat_config.csv_sep, "separator",
>"print counts with custom separator"),
> OPT_CALLBACK('G', "cgroup", &evsel_list, "name",
> -"monitor event in cgroup name only", parse_cgroups),
> +"monitor event in cgroup name only", parse_stat_cgroups),
> +   OPT_STRING(0, "for-each-cgroup", &stat_config.cgroup_list, "name",
> +   "expand events for each cgroup"),
> OPT_STRING('o', "output", &output_name, "file", "output file name"),
> OPT_BOOLEAN(0, "append", &append_file, "append to the output file"),
> OPT_INTEGER(0, "log-fd", &output_fd,
> @@ -2234,6 +2247,11 @@ int cmd_stat(int argc, const char **argv)
> if (add_default_attributes())
> goto out;
>
> +   if (stat_config.cgroup_list) {
> +   if (evlist__expand_cgroup(evsel_list, 
> stat_config.cgroup_list) < 0)
> +   goto out;
> +   }
> +
> target__validate(&target);
>
> if ((stat_config.aggr_mode == AGGR_THREAD) && (target.system_wide))
> diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
> index 050dea9f1e88..e4916ed740ac 100644
> --- a/tools/perf/util/cgroup.c
> +++ b/tools/perf/util/cgroup.c
> @@ -12,6 +12,7 @@
>  #include 
>
>  int nr_cgroups;
> +bool multiply_cgroup;
>
>  static int open_cgroup(const char *name)
>  {
> @@ -156,6 +157,10 @@ int parse_cgroups(const struct option *opt, const char 
> *str,
> return -1;
> }
>
> +   /* delay processing cgroups after it sees all events */
> +   if (multiply_cgroup)
> +   return 0;
> +
> for (;;) {
> p = strchr(str, ',');
> e = p ? p : eos;
> @@ -193,6 +198,85 @@ int parse_cgroups(const 

Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event

2020-09-14 Thread Stephane Eranian
On Mon, Sep 14, 2020 at 2:08 AM  wrote:
>
> On Sun, Sep 13, 2020 at 11:41:00PM -0700, Stephane Eranian wrote:
> > On Sun, Sep 13, 2020 at 2:03 PM Jiri Olsa  wrote:
> > what happens if I set mmap3 and mmap2?
> >
> > I think using mmap3 for every mmap may be overkill as you add useless
> > 20 bytes to an mmap record.
> > I am not sure if your code handles the case where mmap3 is not needed
> > because there is no buildid, e.g., anonymous memory.
> > It seems to me you've written the patch in such a way that if the user
> > tool supports mmap3, then it supersedes mmap2, and thus
> > you need all the fields of mmap2. But if could be more interesting to
> > return either MMAP2 or MMAP3 depending on tool support
> > and type of mmap, that would certainly save 20 bytes on any anon mmap.
> > But maybe that logic is already in your patch and I missed it.
>
> That, and what if you don't want any of that buildid nonsense at all? I
> always kill that because it makes perf pointlessly slow and has
> absolutely no upsides for me.
>
I have seen situations where the perf tool takes a visibly significant
amount of time (many seconds) to inject the buildids at the end of a
perf record collection (same if using perf inject -b). That is because
it needs to go through all the records in the perf.data file to find
the MMAP records and then read the buildids from the filesystem. This
has caused some problems in our environment. Having the kernel add the
buildid to *relevant* mmaps would remove a lot of that penalty: the
tool no longer has to re-parse the perf.data file, and the kernel can
leverage the fact that the buildid may already be in memory. My one
concern is the interaction with large pages and the impact they have
on the alignment of sections in memory. I think Ian can comment better
on this.

I think this patch series is useful if it can demonstrate a speedup
during recording (perf record, or perf record | perf inject -b). But it
needs to be optimized to minimize the volume of useless info returned.
And Jiri needs to decide whether MMAP3 is a replacement for MMAP2, or a
different kind of record targeted at ELF images only, in which case
some of the fields may be removed. My tendency would be to go for the
latter.
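
To make the latter option concrete, the per-mapping selection could look
roughly like the sketch below. This is purely illustrative kernel-style code,
not from the patch: emit_mmap2()/emit_mmap3() and the build_id fields of the
event are hypothetical, while nr_mmap3_events and build_id_parse() are the
names used in the quoted patch.

/* sketch: emit MMAP3 only for file-backed mappings with a build id */
static void emit_mmap_record(struct perf_mmap_event *ev,
                             struct vm_area_struct *vma)
{
        if (atomic_read(&nr_mmap3_events) && vma->vm_file &&
            build_id_parse(vma, ev->build_id, &ev->build_id_size) == 0)
                emit_mmap3(ev);         /* 20 extra bytes, build id included */
        else
                emit_mmap2(ev);         /* anon or no build id: old format */
}

That would keep anonymous mappings at the MMAP2 size while still giving tools
the build id where it actually exists.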


Re: [PATCH 02/26] perf: Introduce mmap3 version of mmap event

2020-09-14 Thread Stephane Eranian
On Sun, Sep 13, 2020 at 2:03 PM Jiri Olsa  wrote:
>
> Add new version of mmap event. The MMAP3 record is an
> augmented version of MMAP2, it adds build id value to
> identify the exact binary object behind memory map:
>
>   struct {
> struct perf_event_header header;
>
> u32  pid, tid;
> u64  addr;
> u64  len;
> u64  pgoff;
> u32  maj;
> u32  min;
> u64  ino;
> u64  ino_generation;
> u32  prot, flags;
> u32  reserved;
> u8   buildid[20];
> char filename[];
> struct sample_id sample_id;
>   };
>
> Adding 4 bytes reserved field to align buildid data to 8 bytes,
> so sample_id data is properly aligned.
>
> The mmap3 event is enabled by new mmap3 bit in perf_event_attr
> struct.  When set for an event, it enables the build id retrieval
> and will use mmap3 format for the event.
>
> Keeping track of mmap3 events and calling build_id_parse
> in perf_event_mmap_event only if we have any defined.
>
> Having build id attached directly to the mmap event will help
> tool like perf to skip final search through perf data for
> binaries that are needed in the report time. Also it prevents
> possible race when the binary could be removed or replaced
> during profiling.
>
> Signed-off-by: Jiri Olsa 
> ---
>  include/uapi/linux/perf_event.h | 27 ++-
>  kernel/events/core.c| 38 +++--
>  2 files changed, 57 insertions(+), 8 deletions(-)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 077e7ee69e3d..facfc3c673ed 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -384,7 +384,8 @@ struct perf_event_attr {
> aux_output :  1, /* generate AUX records 
> instead of events */
> cgroup :  1, /* include cgroup events 
> */
> text_poke  :  1, /* include text poke 
> events */
> -   __reserved_1   : 30;
> +   mmap3  :  1, /* include bpf events */
> +   __reserved_1   : 29;
>
what happens if I set mmap3 and mmap2?

I think using mmap3 for every mmap may be overkill as you add useless
20 bytes to an mmap record.
I am not sure if your code handles the case where mmap3 is not needed
because there is no buildid, e.g., anonymous memory.
It seems to me you've written the patch in such a way that if the user
tool supports mmap3, then it supersedes mmap2, and thus
you need all the fields of mmap2. But it could be more interesting to
return either MMAP2 or MMAP3 depending on tool support
and type of mmap, that would certainly save 20 bytes on any anon mmap.
But maybe that logic is already in your patch and I missed it.


> union {
> __u32   wakeup_events;/* wakeup every n events */
> @@ -1060,6 +1061,30 @@ enum perf_event_type {
>  */
> PERF_RECORD_TEXT_POKE   = 20,
>
> +   /*
> +* The MMAP3 records are an augmented version of MMAP2, they add
> +* build id value to identify the exact binary behind map
> +*
> +* struct {
> +*  struct perf_event_headerheader;
> +*
> +*  u32 pid, tid;
> +*  u64 addr;
> +*  u64 len;
> +*  u64 pgoff;
> +*  u32 maj;
> +*  u32 min;
> +*  u64 ino;
> +*  u64 ino_generation;
> +*  u32 prot, flags;
> +*  u32 reserved;
> +*  u8  buildid[20];
> +*  charfilename[];
> +*  struct sample_idsample_id;
> +* };
> +*/
> +   PERF_RECORD_MMAP3   = 21,
> +
> PERF_RECORD_MAX,/* non-ABI */
>  };
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7ed5248f0445..719894492dac 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "internal.h"
>
> @@ -386,6 +387,7 @@ static DEFINE_PER_CPU(int, perf_sched_cb_usages);
>  static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
>
>  static atomic_t nr_mmap_events __read_mostly;
> +static atomic_t nr_mmap3_events __read_mostly;
>  static 

[PATCH v2] perf headers: fix processing of pmu_mappings

2020-06-09 Thread Stephane Eranian
This patch changes the handling of the env->pmu_mappings string.
It transforms the string from a \0 separated list of value:name pairs
into a space separated list of value:name pairs. This makes it much simpler
to parse when looking for a particular value or name.

This version also updates print_pmu_mappings() to handle the new space
separator.

Before: printf(env->pmu_mappings);
14:amd_iommu_1

After: printf(env->pmu_mappings);
14:amd_iommu_1 7:uprobe 5:breakpoint 10:amd_l3 19:amd_iommu_6 8:power 4:cpu 
17:amd_iommu_4 15:amd_iommu_2 1:software 6:kprobe 13:amd_iommu_0 9:amd_df 
20:amd_iommu_7 18:amd_iommu_5 2:tracepoint 21:msr 12:ibs_op 16:amd_iommu_3 
11:ibs_fetch
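
As an illustration of why the space-separated form is easier to consume, a
hypothetical user of env->pmu_mappings could now resolve a PMU type by name
with nothing more than strtok()/sscanf(). This sketch is not part of the
patch; the mappings string in main() is shortened from the example above:

/* sketch: look up a PMU type id by name in the space-separated string */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int pmu_type_by_name(const char *mappings, const char *name)
{
        char *copy = strdup(mappings);
        char *tok, *save = NULL;
        int type = -1;

        for (tok = strtok_r(copy, " ", &save); tok;
             tok = strtok_r(NULL, " ", &save)) {
                unsigned int val;
                char buf[64];

                if (sscanf(tok, "%u:%63s", &val, buf) == 2 &&
                    !strcmp(buf, name)) {
                        type = (int)val;
                        break;
                }
        }
        free(copy);
        return type;
}

int main(void)
{
        const char *m = "14:amd_iommu_1 12:ibs_op 11:ibs_fetch 4:cpu";

        printf("ibs_op -> %d\n", pmu_type_by_name(m, "ibs_op")); /* 12 */
        return 0;
}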

Signed-off-by: Stephane Eranian 
---
 tools/perf/util/header.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 31a7f278036c..3649c0e1740b 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1470,10 +1470,19 @@ static void print_pmu_mappings(struct feat_fd *ff, FILE 
*fp)
goto error;
 
str = tmp + 1;
+
+   tmp = strchr(str, ' ');
+   if (tmp)
+   *tmp = '\0';
+
fprintf(fp, "%s%s = %" PRIu32, delimiter, str, type);
 
delimiter = ", ";
-   str += strlen(str) + 1;
+
+   if (tmp) {
+   *tmp = ' ';
+   str = tmp + 1;
+   }
pmu_num--;
}
 
@@ -1956,13 +1965,15 @@ static int process_numa_topology(struct feat_fd *ff, 
void *data __maybe_unused)
 static int process_pmu_mappings(struct feat_fd *ff, void *data __maybe_unused)
 {
char *name;
-   u32 pmu_num;
+   u32 pmu_num, o_num;
u32 type;
struct strbuf sb;
 
if (do_read_u32(ff, &pmu_num))
return -1;
 
+   o_num = pmu_num;
+
if (!pmu_num) {
pr_debug("pmu mappings not available\n");
return 0;
@@ -1980,10 +1991,11 @@ static int process_pmu_mappings(struct feat_fd *ff, 
void *data __maybe_unused)
if (!name)
goto error;
 
-   if (strbuf_addf(&sb, "%u:%s", type, name) < 0)
+   /* add proper spacing between entries */
+   if (pmu_num < o_num && strbuf_add(&sb, " ", 1) < 0)
goto error;
-   /* include a NULL character at the end */
-   if (strbuf_add(&sb, "", 1) < 0)
+
+   if (strbuf_addf(&sb, "%u:%s", type, name) < 0)
goto error;
 
if (!strcmp(name, "msr"))
@@ -1992,6 +2004,9 @@ static int process_pmu_mappings(struct feat_fd *ff, void 
*data __maybe_unused)
free(name);
pmu_num--;
}
+   /* include a NULL character at the end */
+   if (strbuf_add(&sb, "", 1) < 0)
+   goto error;
ff->ph->env.pmu_mappings = strbuf_detach(&sb, NULL);
return 0;
 
-- 
2.27.0.278.ge193c7cf3a9-goog



[PATCH] perf headers: fix processing of pmu_mappings

2020-06-08 Thread Stephane Eranian
This patch fixes a bug in process_pmu_mappings() where the code
would not produce an env->pmu_mappings string that was easily parsable.
The function parses the PMU_MAPPING header information into a string
consisting of value:name pairs where value is the PMU type identifier
and name is the PMU name, e.g., 10:ibs_fetch. As it was, the code
was producing a seemingly truncated string with only the first pair
visible, even though the rest was present after the \0.
This patch fixes the problem by adding a proper white space between
pairs and moving the \0 termination to the end. With this patch applied,
all pairs appear and are easily parsed.

Before:
14:amd_iommu_1

After:
14:amd_iommu_1 7:uprobe 5:breakpoint 10:amd_l3 19:amd_iommu_6 8:power 4:cpu 
17:amd_iommu_4 15:amd_iommu_2 1:software 6:kprobe 13:amd_iommu_0 9:amd_df 
20:amd_iommu_7 18:amd_iommu_5 2:tracepoint 21:msr 12:ibs_op 16:amd_iommu_3 
11:ibs_fetch

Signed-off-by: Stephane Eranian 
---
 tools/perf/util/header.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 7a67d017d72c3..cf72124da9350 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -2462,13 +2462,15 @@ static int process_numa_topology(struct feat_fd *ff, 
void *data __maybe_unused)
 static int process_pmu_mappings(struct feat_fd *ff, void *data __maybe_unused)
 {
char *name;
-   u32 pmu_num;
+   u32 pmu_num, o_num;
u32 type;
struct strbuf sb;
 
if (do_read_u32(ff, &pmu_num))
return -1;
 
+   o_num = pmu_num;
+
if (!pmu_num) {
pr_debug("pmu mappings not available\n");
return 0;
@@ -2486,10 +2488,11 @@ static int process_pmu_mappings(struct feat_fd *ff, 
void *data __maybe_unused)
if (!name)
goto error;
 
-   if (strbuf_addf(&sb, "%u:%s", type, name) < 0)
+   /* add proper spacing between entries */
+   if (pmu_num < o_num && strbuf_add(&sb, " ", 1) < 0)
goto error;
-   /* include a NULL character at the end */
-   if (strbuf_add(&sb, "", 1) < 0)
+
+   if (strbuf_addf(&sb, "%u:%s", type, name) < 0)
goto error;
 
if (!strcmp(name, "msr"))
@@ -2498,6 +2501,9 @@ static int process_pmu_mappings(struct feat_fd *ff, void 
*data __maybe_unused)
free(name);
pmu_num--;
}
+   /* include a NULL character at the end */
+   if (strbuf_add(&sb, "", 1) < 0)
+   goto error;
ff->ph->env.pmu_mappings = strbuf_detach(&sb, NULL);
return 0;
 
-- 
2.27.0.278.ge193c7cf3a9-goog



Re: [PATCH v2 1/5] perf/x86/rapl: move RAPL support to common x86 code

2020-06-04 Thread Stephane Eranian
On Thu, Jun 4, 2020 at 6:11 AM Johannes Hirte
 wrote:
>
> On 2020 Jun 01, Stephane Eranian wrote:
> > On Mon, Jun 1, 2020 at 5:39 AM Johannes Hirte
> >  wrote:
> > >
> > > On 2020 Mai 27, Stephane Eranian wrote:
> > >
> > > ...
> > > > diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
> > > > index 6f1d1fde8b2de..12c42eba77ec3 100644
> > > > --- a/arch/x86/events/Makefile
> > > > +++ b/arch/x86/events/Makefile
> > > > @@ -1,5 +1,6 @@
> > > >  # SPDX-License-Identifier: GPL-2.0-only
> > > >  obj-y+= core.o probe.o
> > > > +obj-$(PERF_EVENTS_INTEL_RAPL)+= rapl.o
> > > >  obj-y+= amd/
> > > >  obj-$(CONFIG_X86_LOCAL_APIC)+= msr.o
> > > >  obj-$(CONFIG_CPU_SUP_INTEL)  += intel/
> > >
> > > With this change, rapl won't be build. Must be:
> > >
> > > obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)+= rapl.o
> > >
> > Correct. I posted a patch last week to fix that.
> > Thanks.
>
> Yes, it just wasn't in tip when I've tested. Sorry for the noise.
>
It is now. All is good.
Thanks.

>
> --
> Regards,
>   Johannes Hirte
>


[tip: perf/urgent] perf/x86/rapl: Fix RAPL config variable bug

2020-06-02 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/urgent branch of tip:

Commit-ID: 16accae3d97f97d7f61c4ee5d0002bccdef59088
Gitweb:
https://git.kernel.org/tip/16accae3d97f97d7f61c4ee5d0002bccdef59088
Author:Stephane Eranian 
AuthorDate:Thu, 28 May 2020 13:16:14 -07:00
Committer: Ingo Molnar 
CommitterDate: Tue, 02 Jun 2020 11:52:56 +02:00

perf/x86/rapl: Fix RAPL config variable bug

This patch fixes a bug introduced by:

  fd3ae1e1587d6 ("perf/x86/rapl: Move RAPL support to common x86 code")

The Kconfig variable name was wrong. It was missing the CONFIG_ prefix.

Signed-off-by: Stephane Eranian 
Signed-off-by: Ingo Molnar 
Tested-by: Kim Phillips 
Acked-by: Peter Zijlstra 
Link: https://lore.kernel.org/r/20200528201614.250182-1-eran...@google.com
---
 arch/x86/events/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index 12c42eb..9933c0e 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  += core.o probe.o
-obj-$(PERF_EVENTS_INTEL_RAPL)  += rapl.o
+obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)   += rapl.o
 obj-y  += amd/
 obj-$(CONFIG_X86_LOCAL_APIC)+= msr.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= intel/


Re: [PATCH v2 1/5] perf/x86/rapl: move RAPL support to common x86 code

2020-06-01 Thread Stephane Eranian
On Mon, Jun 1, 2020 at 5:39 AM Johannes Hirte
 wrote:
>
> On 2020 Mai 27, Stephane Eranian wrote:
>
> ...
> > diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
> > index 6f1d1fde8b2de..12c42eba77ec3 100644
> > --- a/arch/x86/events/Makefile
> > +++ b/arch/x86/events/Makefile
> > @@ -1,5 +1,6 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> >  obj-y+= core.o probe.o
> > +obj-$(PERF_EVENTS_INTEL_RAPL)+= rapl.o
> >  obj-y+= amd/
> >  obj-$(CONFIG_X86_LOCAL_APIC)+= msr.o
> >  obj-$(CONFIG_CPU_SUP_INTEL)  += intel/
>
> With this change, rapl won't be build. Must be:
>
> obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)+= rapl.o
>
Correct. I posted a patch last week to fix that.
Thanks.

> --
> Regards,
>   Johannes Hirte
>


Re: [PATCH] perf/x86/rapl: fix rapl config variable bug

2020-06-01 Thread Stephane Eranian
On Thu, May 28, 2020 at 2:30 PM Kim Phillips  wrote:
>
> On 5/28/20 3:16 PM, Stephane Eranian wrote:
> > This patch fixes a bug introduced by:
> >
> > commit fd3ae1e1587d6 ("perf/x86/rapl: Move RAPL support to common x86 code")
> >
> > The Kconfig variable name was wrong. It was missing the CONFIG_ prefix.
> >
> > Signed-off-by: Stephane Eranian 
> >
> > ---
>
> Tested-by: Kim Phillips 
>
Without this patch, the rapl.c module does not get compiled.
Please apply.
Thanks.

> Thanks,
>
> Kim


[PATCH] perf/x86/rapl: fix rapl config variable bug

2020-05-28 Thread Stephane Eranian
This patch fixes a bug introduced by:

commit fd3ae1e1587d6 ("perf/x86/rapl: Move RAPL support to common x86 code")

The Kconfig variable name was wrong. It was missing the CONFIG_ prefix.

Signed-off-by: Stephane Eranian 

---
 arch/x86/events/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index 12c42eba77ec3..9933c0e8e97a9 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  += core.o probe.o
-obj-$(PERF_EVENTS_INTEL_RAPL)  += rapl.o
+obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)   += rapl.o
 obj-y  += amd/
 obj-$(CONFIG_X86_LOCAL_APIC)+= msr.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= intel/
-- 
2.27.0.rc2.251.g90737beb825-goog



[tip: perf/core] perf/x86/rapl: Refactor to share the RAPL code between Intel and AMD CPUs

2020-05-28 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 5c95c68949880035b68e5c48fdf4899ec0989631
Gitweb:
https://git.kernel.org/tip/5c95c68949880035b68e5c48fdf4899ec0989631
Author:Stephane Eranian 
AuthorDate:Wed, 27 May 2020 15:46:56 -07:00
Committer: Ingo Molnar 
CommitterDate: Thu, 28 May 2020 07:58:55 +02:00

perf/x86/rapl: Refactor to share the RAPL code between Intel and AMD CPUs

This patch modifies the rapl_model struct to include architecture specific
knowledge in this previously Intel specific structure, and in particular
it adds the MSR for POWER_UNIT and the rapl_msrs array.

No functional changes.

Signed-off-by: Stephane Eranian 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20200527224659.206129-3-eran...@google.com
---
 arch/x86/events/rapl.c | 29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index 3e6c01b..f29935e 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -131,7 +131,9 @@ struct rapl_pmus {
 };
 
 struct rapl_model {
+   struct perf_msr *rapl_msrs;
unsigned long   events;
+   unsigned intmsr_power_unit;
boolapply_quirk;
 };
 
@@ -141,7 +143,7 @@ static struct rapl_pmus *rapl_pmus;
 static cpumask_t rapl_cpu_mask;
 static unsigned int rapl_cntr_mask;
 static u64 rapl_timer_ms;
-static struct perf_msr rapl_msrs[];
+static struct perf_msr *rapl_msrs;
 
 static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu)
 {
@@ -516,7 +518,7 @@ static bool test_msr(int idx, void *data)
return test_bit(idx, (unsigned long *) data);
 }
 
-static struct perf_msr rapl_msrs[] = {
+static struct perf_msr intel_rapl_msrs[] = {
[PERF_RAPL_PP0]  = { MSR_PP0_ENERGY_STATUS,  &rapl_events_cores_group, test_msr },
[PERF_RAPL_PKG]  = { MSR_PKG_ENERGY_STATUS,  &rapl_events_pkg_group,   test_msr },
[PERF_RAPL_RAM]  = { MSR_DRAM_ENERGY_STATUS, &rapl_events_ram_group,   test_msr },
@@ -578,13 +580,13 @@ static int rapl_cpu_online(unsigned int cpu)
return 0;
 }
 
-static int rapl_check_hw_unit(bool apply_quirk)
+static int rapl_check_hw_unit(struct rapl_model *rm)
 {
u64 msr_rapl_power_unit_bits;
int i;
 
/* protect rdmsrl() to handle virtualization */
-   if (rdmsrl_safe(MSR_RAPL_POWER_UNIT, &msr_rapl_power_unit_bits))
+   if (rdmsrl_safe(rm->msr_power_unit, &msr_rapl_power_unit_bits))
return -1;
for (i = 0; i < NR_RAPL_DOMAINS; i++)
rapl_hw_unit[i] = (msr_rapl_power_unit_bits >> 8) & 0x1FULL;
@@ -595,7 +597,7 @@ static int rapl_check_hw_unit(bool apply_quirk)
 * "Intel Xeon Processor E5-1600 and E5-2600 v3 Product Families, V2
 * of 2. Datasheet, September 2014, Reference Number: 330784-001 "
 */
-   if (apply_quirk)
+   if (rm->apply_quirk)
rapl_hw_unit[PERF_RAPL_RAM] = 16;
 
/*
@@ -676,6 +678,8 @@ static struct rapl_model model_snb = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_PP1),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_snbep = {
@@ -683,6 +687,8 @@ static struct rapl_model model_snbep = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_hsw = {
@@ -691,6 +697,8 @@ static struct rapl_model model_hsw = {
  BIT(PERF_RAPL_RAM) |
  BIT(PERF_RAPL_PP1),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_hsx = {
@@ -698,12 +706,16 @@ static struct rapl_model model_hsx = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= true,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_knl = {
.events = BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= true,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_skl = {
@@ -713,6 +725,8 @@ static struct rapl_model model_skl = {
  BIT(PERF_RAPL_PP1) |
  BIT(PERF_RAPL_PSYS),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static const struct x86_cpu_id rapl_model_match[] __initconst = {
@@ -760,10 +774,13 @@ static int __init rapl_pmu_init

[tip: perf/core] perf/x86/rapl: Flip logic on default events visibility

2020-05-28 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 2a3e3f73a23b4ff2c0065d3a42edc18ad94b7851
Gitweb:
https://git.kernel.org/tip/2a3e3f73a23b4ff2c0065d3a42edc18ad94b7851
Author:Stephane Eranian 
AuthorDate:Wed, 27 May 2020 15:46:57 -07:00
Committer: Ingo Molnar 
CommitterDate: Thu, 28 May 2020 07:58:55 +02:00

perf/x86/rapl: Flip logic on default events visibility

This patch modifies the default visibility of the attribute_group
for each RAPL event. By default if the grp.is_visible field is NULL,
sysfs considers that it must display the attribute group.
If the field is not NULL (callback function), then the return value
of the callback determines the visibility (0 = not visible). The RAPL
attribute groups had the field set to NULL, meaning that unless they
failed the probing from perf_msr_probe(), they would be visible. We want
to avoid having to specify attribute groups that are not supported by the HW
in the rapl_msrs[] array, since they don't have an MSR address to begin with.

Therefore, we initialize the visible field of all RAPL attribute groups
to a callback that returns 0. If the RAPL msr goes through probing
and succeeds, the is_visible field will be set back to NULL (visible).
If the probing fails, the field is set to a callback that returns 0 (not
visible).
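
In other words, visibility follows a simple two-state flow. A kernel-style
sketch of that flow is shown below; the helper is illustrative only and not
the actual probe code:

/* sketch: default to hidden, flip to visible only after a successful probe */
static umode_t rapl_not_visible(struct kobject *kobj, struct attribute *attr,
                                int i)
{
        return 0;                       /* 0 => sysfs hides the group */
}

static void rapl_set_group_visibility(struct attribute_group *grp,
                                      bool probed_ok)
{
        /* NULL means "use the default sysfs rule", i.e. the group is shown */
        grp->is_visible = probed_ok ? NULL : rapl_not_visible;
}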

Signed-off-by: Stephane Eranian 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20200527224659.206129-4-eran...@google.com
---
 arch/x86/events/rapl.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index f29935e..8d17af4 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -460,9 +460,16 @@ static struct attribute *rapl_events_cores[] = {
NULL,
 };
 
+static umode_t
+rapl_not_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+   return 0;
+}
+
 static struct attribute_group rapl_events_cores_group = {
.name  = "events",
.attrs = rapl_events_cores,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_pkg[] = {
@@ -475,6 +482,7 @@ static struct attribute *rapl_events_pkg[] = {
 static struct attribute_group rapl_events_pkg_group = {
.name  = "events",
.attrs = rapl_events_pkg,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_ram[] = {
@@ -487,6 +495,7 @@ static struct attribute *rapl_events_ram[] = {
 static struct attribute_group rapl_events_ram_group = {
.name  = "events",
.attrs = rapl_events_ram,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_gpu[] = {
@@ -499,6 +508,7 @@ static struct attribute *rapl_events_gpu[] = {
 static struct attribute_group rapl_events_gpu_group = {
.name  = "events",
.attrs = rapl_events_gpu,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_psys[] = {
@@ -511,6 +521,7 @@ static struct attribute *rapl_events_psys[] = {
 static struct attribute_group rapl_events_psys_group = {
.name  = "events",
.attrs = rapl_events_psys,
+   .is_visible = rapl_not_visible,
 };
 
 static bool test_msr(int idx, void *data)


[tip: perf/core] perf/x86/rapl: Move RAPL support to common x86 code

2020-05-28 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: fd3ae1e1587d64ef8cc8e361903d33625458073e
Gitweb:
https://git.kernel.org/tip/fd3ae1e1587d64ef8cc8e361903d33625458073e
Author:Stephane Eranian 
AuthorDate:Wed, 27 May 2020 15:46:55 -07:00
Committer: Ingo Molnar 
CommitterDate: Thu, 28 May 2020 07:58:55 +02:00

perf/x86/rapl: Move RAPL support to common x86 code

To prepare for support of both Intel and AMD RAPL.

As per the AMD PPR, Fam17h supports Package RAPL counters to monitor power usage.
The RAPL counter operates as with Intel RAPL, and as such it is beneficial
to share the code.

No change in functionality.

Signed-off-by: Stephane Eranian 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20200527224659.206129-2-eran...@google.com
---
 arch/x86/events/Kconfig|   6 +-
 arch/x86/events/Makefile   |   1 +-
 arch/x86/events/intel/Makefile |   2 +-
 arch/x86/events/intel/rapl.c   | 802 +
 arch/x86/events/rapl.c | 805 -
 5 files changed, 809 insertions(+), 807 deletions(-)
 delete mode 100644 arch/x86/events/intel/rapl.c
 create mode 100644 arch/x86/events/rapl.c

diff --git a/arch/x86/events/Kconfig b/arch/x86/events/Kconfig
index 9a7a144..4a809c6 100644
--- a/arch/x86/events/Kconfig
+++ b/arch/x86/events/Kconfig
@@ -10,11 +10,11 @@ config PERF_EVENTS_INTEL_UNCORE
available on NehalemEX and more modern processors.
 
 config PERF_EVENTS_INTEL_RAPL
-   tristate "Intel rapl performance events"
-   depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
+   tristate "Intel/AMD rapl performance events"
+   depends on PERF_EVENTS && (CPU_SUP_INTEL || CPU_SUP_AMD) && PCI
default y
---help---
-   Include support for Intel rapl performance events for power
+   Include support for Intel and AMD rapl performance events for power
monitoring on modern processors.
 
 config PERF_EVENTS_INTEL_CSTATE
diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index 6f1d1fd..12c42eb 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  += core.o probe.o
+obj-$(PERF_EVENTS_INTEL_RAPL)  += rapl.o
 obj-y  += amd/
 obj-$(CONFIG_X86_LOCAL_APIC)+= msr.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= intel/
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 3468b0c..e67a588 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -2,8 +2,6 @@
 obj-$(CONFIG_CPU_SUP_INTEL)+= core.o bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= lbr.o p4.o p6.o pt.o
-obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)   += intel-rapl-perf.o
-intel-rapl-perf-objs   := rapl.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += intel-uncore.o
 intel-uncore-objs  := uncore.o uncore_nhmex.o uncore_snb.o 
uncore_snbep.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o
diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
deleted file mode 100644
index 9e1e141..000
--- a/arch/x86/events/intel/rapl.c
+++ /dev/null
@@ -1,802 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Support Intel RAPL energy consumption counters
- * Copyright (C) 2013 Google, Inc., Stephane Eranian
- *
- * Intel RAPL interface is specified in the IA-32 Manual Vol3b
- * section 14.7.1 (September 2013)
- *
- * RAPL provides more controls than just reporting energy consumption
- * however here we only expose the 3 energy consumption free running
- * counters (pp0, pkg, dram).
- *
- * Each of those counters increments in a power unit defined by the
- * RAPL_POWER_UNIT MSR. On SandyBridge, this unit is 1/(2^16) Joules
- * but it can vary.
- *
- * Counter to rapl events mappings:
- *
- *  pp0 counter: consumption of all physical cores (power plane 0)
- *   event: rapl_energy_cores
- *perf code: 0x1
- *
- *  pkg counter: consumption of the whole processor package
- *   event: rapl_energy_pkg
- *perf code: 0x2
- *
- * dram counter: consumption of the dram domain (servers only)
- *   event: rapl_energy_dram
- *perf code: 0x3
- *
- * gpu counter: consumption of the builtin-gpu domain (client only)
- *   event: rapl_energy_gpu
- *perf code: 0x4
- *
- *  psys counter: consumption of the builtin-psys domain (client only)
- *   event: rapl_energy_psys
- *perf code: 0x5
- *
- * We manage those counters as free running (read-only). They may be
- * use simultaneously by other tools, such as turbostat.
- *
- * The events only support system-wide mode counting. There is no
- * sampling support because it does not make sense and is not
- * supported by the RAPL hardware.
- *
- 

[tip: perf/core] perf/x86/rapl: Add AMD Fam17h RAPL support

2020-05-28 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 5cde265384cad739b162cf08afba6da8857778bd
Gitweb:
https://git.kernel.org/tip/5cde265384cad739b162cf08afba6da8857778bd
Author:Stephane Eranian 
AuthorDate:Wed, 27 May 2020 15:46:59 -07:00
Committer: Ingo Molnar 
CommitterDate: Thu, 28 May 2020 07:58:56 +02:00

perf/x86/rapl: Add AMD Fam17h RAPL support

This patch enables AMD Fam17h RAPL support for the Package level metric.
The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR.

The same output is available via the energy-pkg pseudo event:

  $ perf stat -a -I 1000 --per-socket -e power/energy-pkg/
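
For reference, the same package-energy counter can be read directly from user
space through the msr driver. A minimal sketch follows (assumes root and the
msr module loaded; the MSR addresses and the ESU bit field, bits 12:8, come
from this patch):

/* sketch: read MSR_AMD_PKG_ENERGY_STATUS and scale it by the RAPL unit */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_AMD_RAPL_POWER_UNIT   0xc0010299
#define MSR_AMD_PKG_ENERGY_STATUS 0xc001029b

static uint64_t rdmsr(int fd, uint32_t msr)
{
        uint64_t val = 0;

        pread(fd, &val, sizeof(val), msr);
        return val;
}

int main(void)
{
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0) {
                perror("open /dev/cpu/0/msr");
                return 1;
        }
        /* energy status unit: bits 12:8, energy in 1/2^ESU joules */
        unsigned int esu = (rdmsr(fd, MSR_AMD_RAPL_POWER_UNIT) >> 8) & 0x1f;
        uint64_t raw = rdmsr(fd, MSR_AMD_PKG_ENERGY_STATUS) & 0xffffffff;

        printf("pkg energy: %.6f J (raw=%llu, 1/2^%u J/unit)\n",
               raw / (double)(1ULL << esu), (unsigned long long)raw, esu);
        close(fd);
        return 0;
}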

Signed-off-by: Stephane Eranian 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20200527224659.206129-6-eran...@google.com
---
 arch/x86/events/rapl.c   | 18 ++
 arch/x86/include/asm/msr-index.h |  3 +++
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index 8d17af4..0f2bf59 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -537,6 +537,16 @@ static struct perf_msr intel_rapl_msrs[] = {
[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, 
_events_psys_group,  test_msr },
 };
 
+/*
+ * Force to PERF_RAPL_MAX size due to:
+ * - perf_msr_probe(PERF_RAPL_MAX)
+ * - want to use same event codes across both architectures
+ */
+static struct perf_msr amd_rapl_msrs[PERF_RAPL_MAX] = {
+   [PERF_RAPL_PKG]  = { MSR_AMD_PKG_ENERGY_STATUS,  &rapl_events_pkg_group,   test_msr },
+};
+
+
 static int rapl_cpu_offline(unsigned int cpu)
 {
struct rapl_pmu *pmu = cpu_to_rapl_pmu(cpu);
@@ -740,6 +750,13 @@ static struct rapl_model model_skl = {
.rapl_msrs  = intel_rapl_msrs,
 };
 
+static struct rapl_model model_amd_fam17h = {
+   .events = BIT(PERF_RAPL_PKG),
+   .apply_quirk= false,
+   .msr_power_unit = MSR_AMD_RAPL_POWER_UNIT,
+   .rapl_msrs  = amd_rapl_msrs,
+};
+
 static const struct x86_cpu_id rapl_model_match[] __initconst = {
X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &model_snb),
X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X,   &model_snbep),
@@ -770,6 +787,7 @@ static const struct x86_cpu_id rapl_model_match[] 
__initconst = {
X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X,   &model_hsx),
X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE_L, &model_skl),
X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE,   &model_skl),
+   X86_MATCH_VENDOR_FAM(AMD, 0x17, &model_amd_fam17h),
{},
 };
 MODULE_DEVICE_TABLE(x86cpu, rapl_model_match);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 12c9684..ef452b8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -301,6 +301,9 @@
 #define MSR_PP1_ENERGY_STATUS  0x0641
 #define MSR_PP1_POLICY 0x0642
 
+#define MSR_AMD_PKG_ENERGY_STATUS  0xc001029b
+#define MSR_AMD_RAPL_POWER_UNIT0xc0010299
+
 /* Config TDP MSRs */
 #define MSR_CONFIG_TDP_NOMINAL 0x0648
 #define MSR_CONFIG_TDP_LEVEL_1 0x0649


[tip: perf/core] perf/x86/rapl: Make perf_probe_msr() more robust and flexible

2020-05-28 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 4c953f879460bf65ea3c119354026b126fe8ee57
Gitweb:
https://git.kernel.org/tip/4c953f879460bf65ea3c119354026b126fe8ee57
Author:Stephane Eranian 
AuthorDate:Wed, 27 May 2020 15:46:58 -07:00
Committer: Ingo Molnar 
CommitterDate: Thu, 28 May 2020 07:58:55 +02:00

perf/x86/rapl: Make perf_probe_msr() more robust and flexible

This patch modifies perf_probe_msr() by allowing passing of
struct perf_msr array where some entries are not populated, i.e.,
they have either an msr address of 0 or no attribute_group pointer.
This helps with certain call paths, e.g., RAPL.

In case the grp is NULL, the default sysfs visibility rule
applies which is to make the group visible. Without the patch,
you would get a kernel crash with a NULL group.

Signed-off-by: Stephane Eranian 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20200527224659.206129-5-eran...@google.com
---
 arch/x86/events/probe.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/events/probe.c b/arch/x86/events/probe.c
index c2ede2f..136a1e8 100644
--- a/arch/x86/events/probe.c
+++ b/arch/x86/events/probe.c
@@ -10,6 +10,11 @@ not_visible(struct kobject *kobj, struct attribute *attr, 
int i)
return 0;
 }
 
+/*
+ * Accepts msr[] array with non populated entries as long as either
+ * msr[i].msr is 0 or msr[i].grp is NULL. Note that the default sysfs
+ * visibility is visible when group->is_visible callback is set.
+ */
 unsigned long
 perf_msr_probe(struct perf_msr *msr, int cnt, bool zero, void *data)
 {
@@ -24,8 +29,16 @@ perf_msr_probe(struct perf_msr *msr, int cnt, bool zero, 
void *data)
if (!msr[bit].no_check) {
struct attribute_group *grp = msr[bit].grp;
 
+   /* skip entry with no group */
+   if (!grp)
+   continue;
+
grp->is_visible = not_visible;
 
+   /* skip unpopulated entry */
+   if (!msr[bit].msr)
+   continue;
+
if (msr[bit].test && !msr[bit].test(bit, data))
continue;
/* Virt sucks; you cannot tell if a R/O MSR is present 
:/ */


[PATCH v2 4/5] perf/x86/rapl: make perf_probe_msr() more robust and flexible

2020-05-27 Thread Stephane Eranian
This patch modifies perf_probe_msr() by allowing passing of
struct perf_msr array where some entries are not populated, i.e.,
they have either an msr address of 0 or no attribute_group pointer.
This helps with certain call paths, e.g., RAPL.

In case the grp is NULL, the default sysfs visibility rule
applies which is to make the group visible. Without the patch,
you would get a kernel crash with a NULL group.

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/probe.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/events/probe.c b/arch/x86/events/probe.c
index c2ede2f3b2770..34ee0ee60ace8 100644
--- a/arch/x86/events/probe.c
+++ b/arch/x86/events/probe.c
@@ -10,6 +10,11 @@ not_visible(struct kobject *kobj, struct attribute *attr, 
int i)
return 0;
 }
 
+/*
+ * accepts msr[] array with non populated entries as long as either
+ * msr[i].msr is 0 or msr[i].grp is NULL. Note that the default sysfs
+ * visibility is visible when group->is_visible callback is set.
+ */
 unsigned long
 perf_msr_probe(struct perf_msr *msr, int cnt, bool zero, void *data)
 {
@@ -24,8 +29,16 @@ perf_msr_probe(struct perf_msr *msr, int cnt, bool zero, 
void *data)
if (!msr[bit].no_check) {
struct attribute_group *grp = msr[bit].grp;
 
+   /* skip entry with no group */
+   if (!grp)
+   continue;
+
grp->is_visible = not_visible;
 
+   /* skip unpopulated entry */
+   if (!msr[bit].msr)
+   continue;
+
if (msr[bit].test && !msr[bit].test(bit, data))
continue;
/* Virt sucks; you cannot tell if a R/O MSR is present 
:/ */
-- 
2.27.0.rc0.183.gde8f92d652-goog



[PATCH v2 5/5] perf/x86/rapl: add AMD Fam17h RAPL support

2020-05-27 Thread Stephane Eranian
This patch enables AMD Fam17h RAPL support for the Package level metric.
The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR.

The same output is available via the energy-pkg pseudo event:

$ perf stat -a -I 1000 --per-socket -e power/energy-pkg/

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/rapl.c   | 18 ++
 arch/x86/include/asm/msr-index.h |  3 +++
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index fcb21fbcfe0d0..4ed95d03f2a74 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -537,6 +537,16 @@ static struct perf_msr intel_rapl_msrs[] = {
[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group,  test_msr },
 };
 
+/*
+ * force to PERF_RAPL_MAX size due to:
+ * - perf_msr_probe(PERF_RAPL_MAX)
+ * - want to use same event codes across both architectures
+ */
+static struct perf_msr amd_rapl_msrs[PERF_RAPL_MAX] = {
+   [PERF_RAPL_PKG]  = { MSR_AMD_PKG_ENERGY_STATUS,  &rapl_events_pkg_group,   test_msr },
+};
+
+
 static int rapl_cpu_offline(unsigned int cpu)
 {
struct rapl_pmu *pmu = cpu_to_rapl_pmu(cpu);
@@ -740,6 +750,13 @@ static struct rapl_model model_skl = {
.rapl_msrs  = intel_rapl_msrs,
 };
 
+static struct rapl_model model_amd_fam17h = {
+   .events = BIT(PERF_RAPL_PKG),
+   .apply_quirk= false,
+   .msr_power_unit = MSR_AMD_RAPL_POWER_UNIT,
+   .rapl_msrs  = amd_rapl_msrs,
+};
+
 static const struct x86_cpu_id rapl_model_match[] __initconst = {
X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &model_snb),
X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X,   &model_snbep),
@@ -768,6 +785,7 @@ static const struct x86_cpu_id rapl_model_match[] 
__initconst = {
X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &model_skl),
X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE_L, &model_skl),
X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE,   &model_skl),
+   X86_MATCH_VENDOR_FAM(AMD, 0x17, &model_amd_fam17h),
{},
 };
 MODULE_DEVICE_TABLE(x86cpu, rapl_model_match);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 12c9684d59ba6..ef452b817f44f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -301,6 +301,9 @@
 #define MSR_PP1_ENERGY_STATUS  0x0641
 #define MSR_PP1_POLICY 0x0642
 
+#define MSR_AMD_PKG_ENERGY_STATUS  0xc001029b
+#define MSR_AMD_RAPL_POWER_UNIT0xc0010299
+
 /* Config TDP MSRs */
 #define MSR_CONFIG_TDP_NOMINAL 0x0648
 #define MSR_CONFIG_TDP_LEVEL_1 0x0649
-- 
2.27.0.rc0.183.gde8f92d652-goog



[PATCH v2 2/5] perf/x86/rapl: refactor code for Intel/AMD sharing

2020-05-27 Thread Stephane Eranian
This patch modifies the rapl_model struct to include architecture-specific
knowledge in this previously Intel-specific structure, in particular the MSR
for POWER_UNIT and the rapl_msrs array.

No functional changes.

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/rapl.c | 29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index ece043fb7b494..72990e9a4e71f 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -131,7 +131,9 @@ struct rapl_pmus {
 };
 
 struct rapl_model {
+   struct perf_msr *rapl_msrs;
unsigned long   events;
+   unsigned intmsr_power_unit;
boolapply_quirk;
 };
 
@@ -141,7 +143,7 @@ static struct rapl_pmus *rapl_pmus;
 static cpumask_t rapl_cpu_mask;
 static unsigned int rapl_cntr_mask;
 static u64 rapl_timer_ms;
-static struct perf_msr rapl_msrs[];
+static struct perf_msr *rapl_msrs;
 
 static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu)
 {
@@ -516,7 +518,7 @@ static bool test_msr(int idx, void *data)
return test_bit(idx, (unsigned long *) data);
 }
 
-static struct perf_msr rapl_msrs[] = {
+static struct perf_msr intel_rapl_msrs[] = {
[PERF_RAPL_PP0]  = { MSR_PP0_ENERGY_STATUS,  
&rapl_events_cores_group, test_msr },
[PERF_RAPL_PKG]  = { MSR_PKG_ENERGY_STATUS,  
&rapl_events_pkg_group,   test_msr },
[PERF_RAPL_RAM]  = { MSR_DRAM_ENERGY_STATUS, 
&rapl_events_ram_group,   test_msr },
@@ -578,13 +580,13 @@ static int rapl_cpu_online(unsigned int cpu)
return 0;
 }
 
-static int rapl_check_hw_unit(bool apply_quirk)
+static int rapl_check_hw_unit(struct rapl_model *rm)
 {
u64 msr_rapl_power_unit_bits;
int i;
 
/* protect rdmsrl() to handle virtualization */
-   if (rdmsrl_safe(MSR_RAPL_POWER_UNIT, &msr_rapl_power_unit_bits))
+   if (rdmsrl_safe(rm->msr_power_unit, &msr_rapl_power_unit_bits))
return -1;
for (i = 0; i < NR_RAPL_DOMAINS; i++)
rapl_hw_unit[i] = (msr_rapl_power_unit_bits >> 8) & 0x1FULL;
@@ -595,7 +597,7 @@ static int rapl_check_hw_unit(bool apply_quirk)
 * "Intel Xeon Processor E5-1600 and E5-2600 v3 Product Families, V2
 * of 2. Datasheet, September 2014, Reference Number: 330784-001 "
 */
-   if (apply_quirk)
+   if (rm->apply_quirk)
rapl_hw_unit[PERF_RAPL_RAM] = 16;
 
/*
@@ -676,6 +678,8 @@ static struct rapl_model model_snb = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_PP1),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_snbep = {
@@ -683,6 +687,8 @@ static struct rapl_model model_snbep = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_hsw = {
@@ -691,6 +697,8 @@ static struct rapl_model model_hsw = {
  BIT(PERF_RAPL_RAM) |
  BIT(PERF_RAPL_PP1),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_hsx = {
@@ -698,12 +706,16 @@ static struct rapl_model model_hsx = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= true,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_knl = {
.events = BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= true,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_skl = {
@@ -713,6 +725,8 @@ static struct rapl_model model_skl = {
  BIT(PERF_RAPL_PP1) |
  BIT(PERF_RAPL_PSYS),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static const struct x86_cpu_id rapl_model_match[] __initconst = {
@@ -758,10 +772,13 @@ static int __init rapl_pmu_init(void)
return -ENODEV;
 
rm = (struct rapl_model *) id->driver_data;
+
+   rapl_msrs = rm->rapl_msrs;
+
rapl_cntr_mask = perf_msr_probe(rapl_msrs, PERF_RAPL_MAX,
false, (void *) &rm->events);
 
-   ret = rapl_check_hw_unit(rm->apply_quirk);
+   ret = rapl_check_hw_unit(rm);
if (ret)
return ret;
 
-- 
2.27.0.rc0.183.gde8f92d652-goog



[PATCH v2 0/5] perf/x86/rapl: Enable RAPL for AMD Fam17h

2020-05-27 Thread Stephane Eranian
This patch series adds support for AMD Fam17h RAPL counters. As per the
AMD PPR, Fam17h supports Package RAPL counters to monitor power usage.
The RAPL counters operate as they do on Intel. As such, it is beneficial
to share the code.

The series first moves the rapl.c file to the common perf_events x86 code and
then adds the support.
From the user's point of view, the interface is identical, under
/sys/devices/power. The energy-pkg event is the only one supported.

$ perf stat -a --per-socket -I 1000 -e power/energy-pkg/

In V2, we integrated Peter's comments:
- keep the same CONFIG_PERF_EVENTS_INTEL_RAPL for both Intel and AMD support
- msr is unsigned int
- cleanup initialization of the *_rapl_msrs[] arrays

In particular, we split the patch some more to clearly identify the changes.
We flip the visibility logic to work around the behavior of perf_msr_probe(),
and we improve that function to handle msrs[] arrays with unpopulated entries.
This helps RAPL on AMD because only one MSR (PKG) is defined, so the
amd_rapl_msrs[] array can be initialized with just that entry. But because we
prefer having the same encoding for the same RAPL event on AMD and Intel, we
need to handle unpopulated entries in the array and in perf_msr_probe(),
which is what patch 4 does.
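
Roughly, the result looks like this (a sketch only; the real code is in
patches 4 and 5 of this series):

	/*
	 * Sketch: on AMD only the PKG entry is populated, the others stay
	 * zero/NULL, and perf_msr_probe() now skips entries with a zero MSR
	 * or a NULL group instead of touching them.
	 */
	static struct perf_msr amd_rapl_msrs[PERF_RAPL_MAX] = {
		[PERF_RAPL_PKG] = { MSR_AMD_PKG_ENERGY_STATUS,
				    &rapl_events_pkg_group, test_msr },
	};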

Signed-off-by: Stephane Eranian 


Stephane Eranian (5):
  perf/x86/rapl: move RAPL support to common x86 code
  perf/x86/rapl: refactor code for Intel/AMD sharing
  perf/x86/rapl: flip logic on default events visibility
  perf/x86: make perf_probe_msr() more robust and flexible
  perf/x86/rapl: add AMD Fam17h RAPL support

 arch/x86/events/Kconfig|  6 +--
 arch/x86/events/Makefile   |  1 +
 arch/x86/events/intel/Makefile |  2 -
 arch/x86/events/probe.c| 13 ++
 arch/x86/events/{intel => }/rapl.c | 67 ++
 arch/x86/include/asm/msr-index.h   |  3 ++
 6 files changed, 78 insertions(+), 14 deletions(-)
 rename arch/x86/events/{intel => }/rapl.c (92%)

-- 
2.27.0.rc0.183.gde8f92d652-goog



[PATCH v2 1/5] perf/x86/rapl: move RAPL support to common x86 code

2020-05-27 Thread Stephane Eranian
To prepare for support of both Intel and AMD RAPL.

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/Kconfig| 6 +++---
 arch/x86/events/Makefile   | 1 +
 arch/x86/events/intel/Makefile | 2 --
 arch/x86/events/{intel => }/rapl.c | 9 ++---
 4 files changed, 10 insertions(+), 8 deletions(-)
 rename arch/x86/events/{intel => }/rapl.c (98%)

diff --git a/arch/x86/events/Kconfig b/arch/x86/events/Kconfig
index 9a7a1446cb3a0..4a809c6cbd2f5 100644
--- a/arch/x86/events/Kconfig
+++ b/arch/x86/events/Kconfig
@@ -10,11 +10,11 @@ config PERF_EVENTS_INTEL_UNCORE
available on NehalemEX and more modern processors.
 
 config PERF_EVENTS_INTEL_RAPL
-   tristate "Intel rapl performance events"
-   depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
+   tristate "Intel/AMD rapl performance events"
+   depends on PERF_EVENTS && (CPU_SUP_INTEL || CPU_SUP_AMD) && PCI
default y
---help---
-   Include support for Intel rapl performance events for power
+   Include support for Intel and AMD rapl performance events for power
monitoring on modern processors.
 
 config PERF_EVENTS_INTEL_CSTATE
diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index 6f1d1fde8b2de..12c42eba77ec3 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  += core.o probe.o
+obj-$(PERF_EVENTS_INTEL_RAPL)  += rapl.o
 obj-y  += amd/
 obj-$(CONFIG_X86_LOCAL_APIC)+= msr.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= intel/
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 3468b0c1dc7c9..e67a5886336c1 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -2,8 +2,6 @@
 obj-$(CONFIG_CPU_SUP_INTEL)+= core.o bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= lbr.o p4.o p6.o pt.o
-obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)   += intel-rapl-perf.o
-intel-rapl-perf-objs   := rapl.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += intel-uncore.o
 intel-uncore-objs  := uncore.o uncore_nhmex.o uncore_snb.o 
uncore_snbep.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o
diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/rapl.c
similarity index 98%
rename from arch/x86/events/intel/rapl.c
rename to arch/x86/events/rapl.c
index a5dbd25852cb7..ece043fb7b494 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -1,11 +1,14 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * Support Intel RAPL energy consumption counters
+ * Support Intel/AMD RAPL energy consumption counters
  * Copyright (C) 2013 Google, Inc., Stephane Eranian
  *
  * Intel RAPL interface is specified in the IA-32 Manual Vol3b
  * section 14.7.1 (September 2013)
  *
+ * AMD RAPL interface for Fam17h is described in the public PPR:
+ * https://bugzilla.kernel.org/show_bug.cgi?id=206537
+ *
  * RAPL provides more controls than just reporting energy consumption
  * however here we only expose the 3 energy consumption free running
  * counters (pp0, pkg, dram).
@@ -58,8 +61,8 @@
 #include 
 #include 
 #include 
-#include "../perf_event.h"
-#include "../probe.h"
+#include "perf_event.h"
+#include "probe.h"
 
 MODULE_LICENSE("GPL");
 
-- 
2.27.0.rc0.183.gde8f92d652-goog



[PATCH v2 3/5] perf/x86/rapl: flip logic on default events visibility

2020-05-27 Thread Stephane Eranian
This patch modifies the default visibility of the attribute_group
for each RAPL event. By default if the grp.is_visible field is NULL,
then sysfs considers that it must display the attribute group.
If the field is not NULL (callback function), then the return value
of the callback determines the visibility (0 = not visible). The RAPL
attribute groups had the field set to NULL, meaning that unless they
failed the probing from perf_msr_probe(), they would be visible. We want
to avoid having to specify attribute groups that are not supported by the
hardware in the rapl_msrs[] array, since they have no MSR address to begin
with. Therefore, we initialize the is_visible field of all RAPL attribute
groups to a callback that returns 0. If the RAPL MSR goes through probing
and succeeds, the is_visible field is set back to NULL (visible). If the
probing fails, the field is set to a callback that returns 0 (not visible).

Signed-off-by: Stephane Eranian 

---
 arch/x86/events/rapl.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index 72990e9a4e71f..fcb21fbcfe0d0 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -460,9 +460,16 @@ static struct attribute *rapl_events_cores[] = {
NULL,
 };
 
+static umode_t
+rapl_not_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+   return 0;
+}
+
 static struct attribute_group rapl_events_cores_group = {
.name  = "events",
.attrs = rapl_events_cores,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_pkg[] = {
@@ -475,6 +482,7 @@ static struct attribute *rapl_events_pkg[] = {
 static struct attribute_group rapl_events_pkg_group = {
.name  = "events",
.attrs = rapl_events_pkg,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_ram[] = {
@@ -487,6 +495,7 @@ static struct attribute *rapl_events_ram[] = {
 static struct attribute_group rapl_events_ram_group = {
.name  = "events",
.attrs = rapl_events_ram,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_gpu[] = {
@@ -499,6 +508,7 @@ static struct attribute *rapl_events_gpu[] = {
 static struct attribute_group rapl_events_gpu_group = {
.name  = "events",
.attrs = rapl_events_gpu,
+   .is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_psys[] = {
@@ -511,6 +521,7 @@ static struct attribute *rapl_events_psys[] = {
 static struct attribute_group rapl_events_psys_group = {
.name  = "events",
.attrs = rapl_events_psys,
+   .is_visible = rapl_not_visible,
 };
 
 static bool test_msr(int idx, void *data)
-- 
2.27.0.rc0.183.gde8f92d652-goog



Re: [PATCH 3/3] perf/x86/rapl: add AMD Fam17h RAPL support

2020-05-20 Thread Stephane Eranian
Hi,

On Mon, May 18, 2020 at 1:16 PM Stephane Eranian  wrote:
>
> On Mon, May 18, 2020 at 2:34 AM Peter Zijlstra  wrote:
> >
> > On Fri, May 15, 2020 at 02:57:33PM -0700, Stephane Eranian wrote:
> >
> > > +static struct perf_msr amd_rapl_msrs[] = {
> > > + [PERF_RAPL_PP0]  = { 0, &rapl_events_cores_group, NULL},
> > > + [PERF_RAPL_PKG]  = { MSR_AMD_PKG_ENERGY_STATUS,  
> > > &rapl_events_pkg_group,   test_msr },
> > > + [PERF_RAPL_RAM]  = { 0, &rapl_events_ram_group,   NULL},
> > > + [PERF_RAPL_PP1]  = { 0, &rapl_events_gpu_group,   NULL},
> > > + [PERF_RAPL_PSYS] = { 0, &rapl_events_psys_group,  NULL},
> > > +};
> >
> > Why have those !PKG things initialized? Wouldn't they default to 0
> > anyway? If not, surely { 0, } is sufficient.
>
> Yes, but that assumes that perf_msr_probe() is fixed to not expect a grp.
> I think it is best to fix perf_msr_probe(). I already fixed one
> problem, I'll fix this one as well.

Well, now I remember why I did it the way it is in the patch. The
grp goes to sysfs, i.e., it is the visible vs. not_visible callback.
Even if I fix the handling of a NULL grp in perf_msr_probe(), the rest
of the rapl code pushes every event to sysfs, and if the is_visible
callback is NULL the event is visible in sysfs! We can either fix that
in init_rapl_pmus(), which is not pretty, or leave it as is; your call.
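
For reference, the sysfs contract being discussed, as a minimal sketch: a NULL
->is_visible means the attribute group is always shown, while a callback
returning 0 hides it, e.g.:

	static umode_t rapl_not_visible(struct kobject *kobj,
					struct attribute *attr, int i)
	{
		return 0;	/* hide the events group by default */
	}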


Re: [PATCH 3/3] perf/x86/rapl: add AMD Fam17h RAPL support

2020-05-18 Thread Stephane Eranian
On Mon, May 18, 2020 at 2:34 AM Peter Zijlstra  wrote:
>
> On Fri, May 15, 2020 at 02:57:33PM -0700, Stephane Eranian wrote:
>
> > +static struct perf_msr amd_rapl_msrs[] = {
> > + [PERF_RAPL_PP0]  = { 0, &rapl_events_cores_group, NULL},
> > + [PERF_RAPL_PKG]  = { MSR_AMD_PKG_ENERGY_STATUS,  
> > &rapl_events_pkg_group,   test_msr },
> > + [PERF_RAPL_RAM]  = { 0, &rapl_events_ram_group,   NULL},
> > + [PERF_RAPL_PP1]  = { 0, &rapl_events_gpu_group,   NULL},
> > + [PERF_RAPL_PSYS] = { 0, &rapl_events_psys_group,  NULL},
> > +};
>
> Why have those !PKG things initialized? Wouldn't they default to 0
> anyway? If not, surely { 0, } is sufficient.

Yes, but that assumes that perf_msr_probe() is fixed to not expect a grp.
I think it is best to fix perf_msr_probe(). I already fixed one
problem, I'll fix this one as well.


Re: metric expressions including metrics?

2020-05-18 Thread Stephane Eranian
On Mon, May 18, 2020 at 12:21 PM Jiri Olsa  wrote:
>
> On Mon, May 18, 2020 at 02:12:42PM -0500, Paul A. Clarke wrote:
> > I'm curious how hard it would be to define metrics using other metrics,
> > in the metrics definition files.
> >
> > Currently, to my understanding, every metric definition must be an
> > expresssion based solely on arithmetic combinations of hardware events.
> >
> > Some metrics are hierarchical in nature such that a higher-level metric
> > can be defined as an arithmetic expression of two other metrics, e.g.
> >
> > cache_miss_cycles_per_instruction =
> >   data_cache_miss_cycles_per_instruction +
> >   instruction_cache_miss_cycles_per_instruction
> >
> > This would need to be defined something like:
> > dcache_miss_cpi = "dcache_miss_cycles / instructions"
> > icache_miss_cpi = "icache_miss_cycles / instructions"
> > cache_miss_cpi = "(dcache_miss_cycles + icache_miss_cycles) / instructions"
> >
> > Could the latter definition be simplified to:
> > cache_miss_cpi = "dcache_miss_cpi + icache_miss_cpi"
> >
> > With multi-level caches and NUMA hierarchies, some of these higher-level
> > metrics can involve a lot of hardware events.
> >
> > Given the recent activity in this area, I'm curious if this has been
> > considered and already on a wish/to-do list, or found onerous.
>
> hi,
> actually we were discussing this with Ian and Stephane and I plan on
> checking on that.. should be doable, I'll keep you in the loop
>
Yes, this is needed to minimize the number of events needed to compute
metric groups.
Then, across all metric groups, duplicate events must be eliminated
whenever possible, except when explicit event grouping is required.
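
To make the saving concrete (purely illustrative, reusing the hypothetical
metrics quoted above):

  dcache_miss_cpi = dcache_miss_cycles / instructions
  icache_miss_cpi = icache_miss_cycles / instructions
  cache_miss_cpi  = dcache_miss_cpi + icache_miss_cpi

so the whole group can be computed from three events (dcache_miss_cycles,
icache_miss_cycles, instructions) programmed once, instead of each metric
definition carrying its own copy of the shared events.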

>
> jirka
>
> >
> > Regards,
> > Paul Clarke
> >
>


[PATCH 2/3] perf/x86/rapl: refactor code for Intel/AMD sharing

2020-05-15 Thread Stephane Eranian
This patch modifies the rapl_model struct to include architecture-specific
knowledge in this previously Intel-specific structure, in particular the MSR
for POWER_UNIT and the rapl_msrs array.

No functional changes.

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/rapl.c | 29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index ece043fb7b494..e98f627a13fa8 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -131,7 +131,9 @@ struct rapl_pmus {
 };
 
 struct rapl_model {
+   struct perf_msr *rapl_msrs;
unsigned long   events;
+   int msr_power_unit;
boolapply_quirk;
 };
 
@@ -141,7 +143,7 @@ static struct rapl_pmus *rapl_pmus;
 static cpumask_t rapl_cpu_mask;
 static unsigned int rapl_cntr_mask;
 static u64 rapl_timer_ms;
-static struct perf_msr rapl_msrs[];
+static struct perf_msr *rapl_msrs;
 
 static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu)
 {
@@ -516,7 +518,7 @@ static bool test_msr(int idx, void *data)
return test_bit(idx, (unsigned long *) data);
 }
 
-static struct perf_msr rapl_msrs[] = {
+static struct perf_msr intel_rapl_msrs[] = {
[PERF_RAPL_PP0]  = { MSR_PP0_ENERGY_STATUS,  
&rapl_events_cores_group, test_msr },
[PERF_RAPL_PKG]  = { MSR_PKG_ENERGY_STATUS,  
&rapl_events_pkg_group,   test_msr },
[PERF_RAPL_RAM]  = { MSR_DRAM_ENERGY_STATUS, 
&rapl_events_ram_group,   test_msr },
@@ -578,13 +580,13 @@ static int rapl_cpu_online(unsigned int cpu)
return 0;
 }
 
-static int rapl_check_hw_unit(bool apply_quirk)
+static int rapl_check_hw_unit(struct rapl_model *rm)
 {
u64 msr_rapl_power_unit_bits;
int i;
 
/* protect rdmsrl() to handle virtualization */
-   if (rdmsrl_safe(MSR_RAPL_POWER_UNIT, &msr_rapl_power_unit_bits))
+   if (rdmsrl_safe(rm->msr_power_unit, &msr_rapl_power_unit_bits))
return -1;
for (i = 0; i < NR_RAPL_DOMAINS; i++)
rapl_hw_unit[i] = (msr_rapl_power_unit_bits >> 8) & 0x1FULL;
@@ -595,7 +597,7 @@ static int rapl_check_hw_unit(bool apply_quirk)
 * "Intel Xeon Processor E5-1600 and E5-2600 v3 Product Families, V2
 * of 2. Datasheet, September 2014, Reference Number: 330784-001 "
 */
-   if (apply_quirk)
+   if (rm->apply_quirk)
rapl_hw_unit[PERF_RAPL_RAM] = 16;
 
/*
@@ -676,6 +678,8 @@ static struct rapl_model model_snb = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_PP1),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_snbep = {
@@ -683,6 +687,8 @@ static struct rapl_model model_snbep = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_hsw = {
@@ -691,6 +697,8 @@ static struct rapl_model model_hsw = {
  BIT(PERF_RAPL_RAM) |
  BIT(PERF_RAPL_PP1),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_hsx = {
@@ -698,12 +706,16 @@ static struct rapl_model model_hsx = {
  BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= true,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_knl = {
.events = BIT(PERF_RAPL_PKG) |
  BIT(PERF_RAPL_RAM),
.apply_quirk= true,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static struct rapl_model model_skl = {
@@ -713,6 +725,8 @@ static struct rapl_model model_skl = {
  BIT(PERF_RAPL_PP1) |
  BIT(PERF_RAPL_PSYS),
.apply_quirk= false,
+   .msr_power_unit = MSR_RAPL_POWER_UNIT,
+   .rapl_msrs  = intel_rapl_msrs,
 };
 
 static const struct x86_cpu_id rapl_model_match[] __initconst = {
@@ -758,10 +772,13 @@ static int __init rapl_pmu_init(void)
return -ENODEV;
 
rm = (struct rapl_model *) id->driver_data;
+
+   rapl_msrs = rm->rapl_msrs;
+
rapl_cntr_mask = perf_msr_probe(rapl_msrs, PERF_RAPL_MAX,
false, (void *) &rm->events);
 
-   ret = rapl_check_hw_unit(rm->apply_quirk);
+   ret = rapl_check_hw_unit(rm);
if (ret)
return ret;
 
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH 3/3] perf/x86/rapl: add AMD Fam17h RAPL support

2020-05-15 Thread Stephane Eranian
This patch enables AMD Fam17h RAPL support for the Package level metric.
The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR.

The same output is available via the energy-pkg pseudo event:

$ perf stat -a -I 1000 --per-socket -e power/energy-pkg/

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/probe.c  |  4 
 arch/x86/events/rapl.c   | 17 +
 arch/x86/include/asm/msr-index.h |  3 +++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/events/probe.c b/arch/x86/events/probe.c
index c2ede2f3b2770..b3a9df2e48dfa 100644
--- a/arch/x86/events/probe.c
+++ b/arch/x86/events/probe.c
@@ -26,6 +26,10 @@ perf_msr_probe(struct perf_msr *msr, int cnt, bool zero, 
void *data)
 
grp->is_visible = not_visible;
 
+   /* avoid unpopulated entries */
+   if (!msr[bit].msr)
+   continue;
+
if (msr[bit].test && !msr[bit].test(bit, data))
continue;
/* Virt sucks; you cannot tell if a R/O MSR is present 
:/ */
diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index e98f627a13fa8..47ff20dfde889 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -526,6 +526,15 @@ static struct perf_msr intel_rapl_msrs[] = {
[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, 
&rapl_events_psys_group,  test_msr },
 };
 
+static struct perf_msr amd_rapl_msrs[] = {
+   [PERF_RAPL_PP0]  = { 0, &rapl_events_cores_group, NULL},
+   [PERF_RAPL_PKG]  = { MSR_AMD_PKG_ENERGY_STATUS,  
&rapl_events_pkg_group,   test_msr },
+   [PERF_RAPL_RAM]  = { 0, &rapl_events_ram_group,   NULL},
+   [PERF_RAPL_PP1]  = { 0, &rapl_events_gpu_group,   NULL},
+   [PERF_RAPL_PSYS] = { 0, &rapl_events_psys_group,  NULL},
+};
+
+
 static int rapl_cpu_offline(unsigned int cpu)
 {
struct rapl_pmu *pmu = cpu_to_rapl_pmu(cpu);
@@ -729,6 +738,13 @@ static struct rapl_model model_skl = {
.rapl_msrs  = intel_rapl_msrs,
 };
 
+static struct rapl_model model_amd_fam17h = {
+   .events = BIT(PERF_RAPL_PKG),
+   .apply_quirk= false,
+   .msr_power_unit = MSR_AMD_RAPL_POWER_UNIT,
+   .rapl_msrs  = amd_rapl_msrs,
+};
+
 static const struct x86_cpu_id rapl_model_match[] __initconst = {
X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &model_snb),
X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X,   &model_snbep),
@@ -757,6 +773,7 @@ static const struct x86_cpu_id rapl_model_match[] 
__initconst = {
X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &model_skl),
X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE_L, &model_skl),
X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE,   &model_skl),
+   X86_MATCH_VENDOR_FAM(AMD, 0x17, &model_amd_fam17h),
{},
 };
 MODULE_DEVICE_TABLE(x86cpu, rapl_model_match);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 12c9684d59ba6..ef452b817f44f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -301,6 +301,9 @@
 #define MSR_PP1_ENERGY_STATUS  0x0641
 #define MSR_PP1_POLICY 0x0642
 
+#define MSR_AMD_PKG_ENERGY_STATUS  0xc001029b
+#define MSR_AMD_RAPL_POWER_UNIT0xc0010299
+
 /* Config TDP MSRs */
 #define MSR_CONFIG_TDP_NOMINAL 0x0648
 #define MSR_CONFIG_TDP_LEVEL_1 0x0649
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH 1/3] perf/x86/rapl: move RAPL support to common x86 code

2020-05-15 Thread Stephane Eranian
To prepare for support of both Intel and AMD RAPL.
Move rapl.c to arch/x86/events. Rename config option.
Fixup header paths.

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/Kconfig| 8 
 arch/x86/events/Makefile   | 1 +
 arch/x86/events/intel/Makefile | 2 --
 arch/x86/events/{intel => }/rapl.c | 9 ++---
 4 files changed, 11 insertions(+), 9 deletions(-)
 rename arch/x86/events/{intel => }/rapl.c (98%)

diff --git a/arch/x86/events/Kconfig b/arch/x86/events/Kconfig
index 9a7a1446cb3a0..e542c32b0a55f 100644
--- a/arch/x86/events/Kconfig
+++ b/arch/x86/events/Kconfig
@@ -9,12 +9,12 @@ config PERF_EVENTS_INTEL_UNCORE
Include support for Intel uncore performance events. These are
available on NehalemEX and more modern processors.
 
-config PERF_EVENTS_INTEL_RAPL
-   tristate "Intel rapl performance events"
-   depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
+config PERF_EVENTS_X86_RAPL
+   tristate "RAPL performance events"
+   depends on PERF_EVENTS && (CPU_SUP_INTEL || CPU_SUP_AMD) && PCI
default y
---help---
-   Include support for Intel rapl performance events for power
+   Include support for Intel and AMD rapl performance events for power
monitoring on modern processors.
 
 config PERF_EVENTS_INTEL_CSTATE
diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index 6f1d1fde8b2de..d5087a5745108 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  += core.o probe.o
+obj-$(CONFIG_PERF_EVENTS_X86_RAPL) += rapl.o
 obj-y  += amd/
 obj-$(CONFIG_X86_LOCAL_APIC)+= msr.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= intel/
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 3468b0c1dc7c9..e67a5886336c1 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -2,8 +2,6 @@
 obj-$(CONFIG_CPU_SUP_INTEL)+= core.o bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= lbr.o p4.o p6.o pt.o
-obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)   += intel-rapl-perf.o
-intel-rapl-perf-objs   := rapl.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += intel-uncore.o
 intel-uncore-objs  := uncore.o uncore_nhmex.o uncore_snb.o 
uncore_snbep.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o
diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/rapl.c
similarity index 98%
rename from arch/x86/events/intel/rapl.c
rename to arch/x86/events/rapl.c
index a5dbd25852cb7..ece043fb7b494 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -1,11 +1,14 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * Support Intel RAPL energy consumption counters
+ * Support Intel/AMD RAPL energy consumption counters
  * Copyright (C) 2013 Google, Inc., Stephane Eranian
  *
  * Intel RAPL interface is specified in the IA-32 Manual Vol3b
  * section 14.7.1 (September 2013)
  *
+ * AMD RAPL interface for Fam17h is described in the public PPR:
+ * https://bugzilla.kernel.org/show_bug.cgi?id=206537
+ *
  * RAPL provides more controls than just reporting energy consumption
  * however here we only expose the 3 energy consumption free running
  * counters (pp0, pkg, dram).
@@ -58,8 +61,8 @@
 #include 
 #include 
 #include 
-#include "../perf_event.h"
-#include "../probe.h"
+#include "perf_event.h"
+#include "probe.h"
 
 MODULE_LICENSE("GPL");
 
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH 0/3] perf/x86/rapl: Enable RAPL for AMD Fam17h

2020-05-15 Thread Stephane Eranian
This patch series adds support for AMD Fam17h RAPL counters. As per the
AMD PPR, Fam17h supports Package RAPL counters to monitor power usage.
The RAPL counters operate as they do on Intel. As such, it is beneficial
to share the code.

The series first moves the rapl.c file to the common perf_events x86 code and
then adds the support.
From the user's point of view, the interface is identical, under
/sys/devices/power. The energy-pkg event is the only one supported.

$ perf stat -a --per-socket -I 1000 -e power/energy-pkg/

Signed-off-by: Stephane Eranian 

Stephane Eranian (3):
  perf/x86/rapl: move RAPL support to common x86 code
  perf/x86/rapl: refactor code for Intel/AMD sharing
  perf/x86/rapl: add AMD Fam17h RAPL support

 arch/x86/events/Kconfig|  8 ++---
 arch/x86/events/Makefile   |  1 +
 arch/x86/events/intel/Makefile |  2 --
 arch/x86/events/probe.c|  4 +++
 arch/x86/events/{intel => }/rapl.c | 55 +-
 arch/x86/include/asm/msr-index.h   |  3 ++
 6 files changed, 58 insertions(+), 15 deletions(-)
 rename arch/x86/events/{intel => }/rapl.c (92%)

-- 
2.26.2.761.g0e0b3e54be-goog



[tip: perf/core] tools feature: Add support for detecting libpfm4

2020-05-08 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 5ef86146de941f273d669a8e018036f549bf058c
Gitweb:
https://git.kernel.org/tip/5ef86146de941f273d669a8e018036f549bf058c
Author:Stephane Eranian 
AuthorDate:Wed, 29 Apr 2020 16:14:41 -07:00
Committer: Arnaldo Carvalho de Melo 
CommitterDate: Tue, 05 May 2020 16:35:31 -03:00

tools feature: Add support for detecting libpfm4

libpfm4 provides an alternate command line encoding of perf events.

Signed-off-by: Stephane Eranian 
Reviewed-by: Ian Rogers 
Acked-by: Jiri Olsa 
Cc: Adrian Hunter 
Cc: Alexander Shishkin 
Cc: Alexei Starovoitov 
Cc: Alexey Budankov 
Cc: Andi Kleen 
Cc: Andrii Nakryiko 
Cc: Daniel Borkmann 
Cc: Florian Fainelli 
Cc: Greg Kroah-Hartman 
Cc: Igor Lubashev 
Cc: Jin Yao 
Cc: Jiwei Sun 
Cc: John Garry 
Cc: Kan Liang 
Cc: Leo Yan 
Cc: Mark Rutland 
Cc: Martin KaFai Lau 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Yonghong Song 
Cc: b...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: yuzhoujian 
Link: http://lore.kernel.org/lkml/20200429231443.207201-3-irog...@google.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/build/Makefile.feature   |  3 ++-
 tools/build/feature/Makefile   |  6 +-
 tools/build/feature/test-libpfm4.c |  9 +
 3 files changed, 16 insertions(+), 2 deletions(-)
 create mode 100644 tools/build/feature/test-libpfm4.c

diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature
index 3e0c019..3abd431 100644
--- a/tools/build/Makefile.feature
+++ b/tools/build/Makefile.feature
@@ -98,7 +98,8 @@ FEATURE_TESTS_EXTRA :=  \
  llvm   \
  llvm-version   \
  clang  \
- libbpf
+ libbpf \
+ libpfm4
 
 FEATURE_TESTS ?= $(FEATURE_TESTS_BASIC)
 
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 9201238..84f845b 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -69,7 +69,8 @@ FILES=  \
  test-libaio.bin   \
  test-libzstd.bin  \
  test-clang-bpf-global-var.bin \
- test-file-handle.bin
+ test-file-handle.bin  \
+ test-libpfm4.bin
 
 FILES := $(addprefix $(OUTPUT),$(FILES))
 
@@ -331,6 +332,9 @@ $(OUTPUT)test-clang-bpf-global-var.bin:
 $(OUTPUT)test-file-handle.bin:
$(BUILD)
 
+$(OUTPUT)test-libpfm4.bin:
+   $(BUILD) -lpfm
+
 ###
 
 clean:
diff --git a/tools/build/feature/test-libpfm4.c 
b/tools/build/feature/test-libpfm4.c
new file mode 100644
index 000..af49b25
--- /dev/null
+++ b/tools/build/feature/test-libpfm4.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+
+int main(void)
+{
+   pfm_initialize();
+   return 0;
+}


[tip: perf/core] perf pmu: Add perf_pmu__find_by_type helper

2020-05-08 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 3a50dc76058d7cd8315f9c712b793d81a7ff4541
Gitweb:
https://git.kernel.org/tip/3a50dc76058d7cd8315f9c712b793d81a7ff4541
Author:Stephane Eranian 
AuthorDate:Wed, 29 Apr 2020 16:14:42 -07:00
Committer: Arnaldo Carvalho de Melo 
CommitterDate: Tue, 05 May 2020 16:35:31 -03:00

perf pmu: Add perf_pmu__find_by_type helper

This is used by libpfm4 during event parsing to locate the pmu for an
event.
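
A hedged usage sketch (illustrative only, not part of this patch; 'attr' is
assumed to be the perf_event_attr produced by libpfm4 for the event being
parsed):

	struct perf_pmu *pmu = perf_pmu__find_by_type(attr->type);

	if (pmu == NULL)
		return -EINVAL;		/* unknown PMU type */
	pr_debug("event belongs to PMU '%s'\n", pmu->name);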

Signed-off-by: Stephane Eranian 
Reviewed-by: Ian Rogers 
Acked-by: Jiri Olsa 
Cc: Adrian Hunter 
Cc: Alexander Shishkin 
Cc: Alexei Starovoitov 
Cc: Alexey Budankov 
Cc: Andi Kleen 
Cc: Andrii Nakryiko 
Cc: Daniel Borkmann 
Cc: Florian Fainelli 
Cc: Greg Kroah-Hartman 
Cc: Igor Lubashev 
Cc: Jin Yao 
Cc: Jiwei Sun 
Cc: John Garry 
Cc: Kan Liang 
Cc: Leo Yan 
Cc: Mark Rutland 
Cc: Martin KaFai Lau 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Yonghong Song 
Cc: b...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: yuzhoujian 
Link: http://lore.kernel.org/lkml/20200429231443.207201-4-irog...@google.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/pmu.c | 11 +++
 tools/perf/util/pmu.h |  1 +
 2 files changed, 12 insertions(+)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 5642de7..92bd7fa 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -871,6 +871,17 @@ static struct perf_pmu *pmu_find(const char *name)
return NULL;
 }
 
+struct perf_pmu *perf_pmu__find_by_type(unsigned int type)
+{
+   struct perf_pmu *pmu;
+
+   list_for_each_entry(pmu, &pmus, list)
+   if (pmu->type == type)
+   return pmu;
+
+   return NULL;
+}
+
 struct perf_pmu *perf_pmu__scan(struct perf_pmu *pmu)
 {
/*
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 1edd214..cb6fbec 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -72,6 +72,7 @@ struct perf_pmu_alias {
 };
 
 struct perf_pmu *perf_pmu__find(const char *name);
+struct perf_pmu *perf_pmu__find_by_type(unsigned int type);
 int perf_pmu__config(struct perf_pmu *pmu, struct perf_event_attr *attr,
 struct list_head *head_terms,
 struct parse_events_error *error);


[tip: perf/core] perf script: Remove extraneous newline in perf_sample__fprintf_regs()

2020-05-08 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: fad1f1e7dedcd97593e8af36786b6bbdd093990d
Gitweb:
https://git.kernel.org/tip/fad1f1e7dedcd97593e8af36786b6bbdd093990d
Author:Stephane Eranian 
AuthorDate:Sat, 18 Apr 2020 16:19:08 -07:00
Committer: Arnaldo Carvalho de Melo 
CommitterDate: Thu, 30 Apr 2020 10:48:32 -03:00

perf script: Remove extraneous newline in perf_sample__fprintf_regs()

When printing iregs, there was a double newline printed because
perf_sample__fprintf_regs() was printing its own and then, at the end of
all fields, perf script was adding one. This was causing a blank line in
the output:

Before:

  $ perf script -Fip,iregs
 401b8d ABI:2DX:0x100SI:0x4a8340DI:0x4a9340

 401b8d ABI:2DX:0x100SI:0x4a9340DI:0x4a8340

 401b8d ABI:2DX:0x100SI:0x4a8340DI:0x4a9340

 401b8d ABI:2DX:0x100SI:0x4a9340DI:0x4a8340

After:

  $ perf script -Fip,iregs
 401b8d ABI:2DX:0x100SI:0x4a8340DI:0x4a9340
 401b8d ABI:2DX:0x100SI:0x4a9340DI:0x4a8340
 401b8d ABI:2DX:0x100SI:0x4a8340DI:0x4a9340

Committer testing:

First we need to figure out how to request that registers be recorded,
so we use:

  # perf record -h reg

   Usage: perf record [] []
  or: perf record [] --  []

  -I, --intr-regs[=]
sample selected machine registers on interrupt, use 
'-I?' to list register names
  --buildid-all Record build-id of all DSOs regardless of hits
  --user-regs[=]
sample selected machine registers on interrupt, use 
'--user-regs=?' to list register names

  #

Ok, now lets ask for them all:

  # perf record -a --intr-regs --user-regs sleep 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 4.105 MB perf.data (2760 samples) ]
  #

Lets look at the first 6 output lines:

  # perf script -Fip,iregs | head -6
   8a06f2f4 ABI:2AX:0xd168fee0a980BX:0x8a23b087f000
CX:0xfffeb69aaeb25d73DX:0x8a253e8310f0SI:0xfff9bafe7359
DI:0xb1690204fb10BP:0xd168fee0a950SP:0xb1690204fb88
IP:0x8a06f2f4 FLAGS:0x4eCS:0x10SS:0x18R8:0x1495f0a91129a
R9:0x8a23b087f000   R10:0x1   R11:0x   R12:0x0   
R13:0x8a253e827e00   R14:0xd168fee0aa5c   R15:0xd168fee0a980

   8a06f2f4 ABI:2AX:0x0BX:0xd168fee0a950
CX:0x5684cc1118491900DX:0x0SI:0xd168fee0a9d0DI:0x202
BP:0xb1690204fd70SP:0xb1690204fd20IP:0x8a06f2f4 
FLAGS:0x24eCS:0x10SS:0x18R8:0x0R9:0xd168fee0a9d0   R10:0x1  
 R11:0x   R12:0x8a23e480   R13:0x8a23b087f240   
R14:0x8a23b087f000   R15:0xd168fee0a950

   8a06f2f4 ABI:2AX:0x0BX:0x0CX:0x7f25f334335bDX:0x0
SI:0x2400DI:0x4BP:0x7fff5f264570SP:0x7fff5f264538
IP:0x8a06f2f4 FLAGS:0x24eCS:0x10SS:0x2bR8:0x0
R9:0x2312d20   R10:0x0   R11:0x246   R12:0x22cc0e0   R13:0x0   R14:0x0   
R15:0x22d0780

  #

Reproduced, apply the patch and:

[root@five ~]# perf script -Fip,iregs | head -6
 8a06f2f4 ABI:2AX:0xd168fee0a980BX:0x8a23b087f000
CX:0xfffeb69aaeb25d73DX:0x8a253e8310f0SI:0xfff9bafe7359
DI:0xb1690204fb10BP:0xd168fee0a950SP:0xb1690204fb88
IP:0x8a06f2f4 FLAGS:0x4eCS:0x10SS:0x18R8:0x1495f0a91129a
R9:0x8a23b087f000   R10:0x1   R11:0x   R12:0x0   
R13:0x8a253e827e00   R14:0xd168fee0aa5c   R15:0xd168fee0a980
 8a06f2f4 ABI:2AX:0x0BX:0xd168fee0a950
CX:0x5684cc1118491900DX:0x0SI:0xd168fee0a9d0DI:0x202
BP:0xb1690204fd70SP:0xb1690204fd20IP:0x8a06f2f4 
FLAGS:0x24eCS:0x10SS:0x18R8:0x0R9:0xd168fee0a9d0   R10:0x1  
 R11:0x   R12:0x8a23e480   R13:0x8a23b087f240   
R14:0x8a23b087f000   R15:0xd168fee0a950
 8a06f2f4 ABI:2AX:0x0BX:0x0CX:0x7f25f334335bDX:0x0
SI:0x2400DI:0x4BP:0x7fff5f264570SP:0x7fff5f264538
IP:0x8a06f2f4 FLAGS:0x24eCS:0x10SS:0x2bR8:0x0
R9:0x2312d20   R10:0x0   R11:0x246   R12:0x22cc0e0   R13:0x0   R14:0x0   
R15:0x22d0780
 8a24074b ABI:2AX:0xcbBX:0xcbCX:0x0DX:0x0
SI:0xb1690204ff58DI:0xcbBP:0xb1690204ff58
SP:0xb1690204ff40IP:0x8a24074b FLAGS:0x24eCS:0x10
SS:0x18R8:0x0R9:0x0   R10:0x0   R11:0x0   R12:0x0   R13:0x0   R14:0x0   
R15:0x0
 8a310600 ABI:2AX:0x0BX:0x8b8c39a0CX:0x0
DX:0x8a2503890300SI:0xb1690204ff20DI:0x8a23e408
BP:0x8a23e408SP:0xb1690204fec0IP:0x8a310600 
FLAGS

[tip: perf/core] perf record: Add num-synthesize-threads option

2020-05-08 Thread tip-bot2 for Stephane Eranian
The following commit has been merged into the perf/core branch of tip:

Commit-ID: d99c22eabee45f40ca44b877a1adde028f14b6b4
Gitweb:
https://git.kernel.org/tip/d99c22eabee45f40ca44b877a1adde028f14b6b4
Author:Stephane Eranian 
AuthorDate:Wed, 22 Apr 2020 08:50:38 -07:00
Committer: Arnaldo Carvalho de Melo 
CommitterDate: Thu, 23 Apr 2020 11:10:41 -03:00

perf record: Add num-synthesize-threads option

To control the degree of parallelism of the synthesize_mmap() code, which
scans /proc/PID/task/PID/maps and can be time consuming.
Mimic the perf top way of handling the option.
If not specified, it defaults to 1 thread, i.e. the default behavior before
this option.

On a desktop computer the processing of /proc/PID/task/PID/maps isn't
slow enough to warrant parallel processing and the thread creation has
some cost - hence the default of 1. On a loaded server with
>100 cores it is possible to see synthesis times in the order of
seconds and in this case having the option is desirable.

As the processing is a synchronization point, it is legitimate to worry if
Amdahl's law will apply to this patch. Profiling with this patch in
place:
https://lore.kernel.org/lkml/20200415054050.31645-4-irog...@google.com/
shows:
...
  - 32.59% __perf_event__synthesize_threads
 - 32.54% __event__synthesize_thread
+ 22.13% perf_event__synthesize_mmap_events
+ 6.68% perf_event__get_comm_ids.constprop.0
+ 1.49% process_synthesized_event
+ 1.29% __GI___readdir64
+ 0.60% __opendir
...
That is the processing is 1.49% of execution time and there is plenty to
make parallel. This is shown in the benchmark in this patch:

https://lore.kernel.org/lkml/20200415054050.31645-2-irog...@google.com/

  Computing performance of multi threaded perf event synthesis by
  synthesizing events on CPU 0:
   Number of synthesis threads: 1
 Average synthesis took: 127729.000 usec (+- 3372.880 usec)
 Average num. events: 21548.600 (+- 0.306)
 Average time per event 5.927 usec
   Number of synthesis threads: 2
 Average synthesis took: 88863.500 usec (+- 385.168 usec)
 Average num. events: 21552.800 (+- 0.327)
 Average time per event 4.123 usec
   Number of synthesis threads: 3
 Average synthesis took: 83257.400 usec (+- 348.617 usec)
 Average num. events: 21553.200 (+- 0.327)
 Average time per event 3.863 usec
   Number of synthesis threads: 4
 Average synthesis took: 75093.000 usec (+- 422.978 usec)
 Average num. events: 21554.200 (+- 0.200)
 Average time per event 3.484 usec
   Number of synthesis threads: 5
 Average synthesis took: 64896.600 usec (+- 353.348 usec)
 Average num. events: 21558.000 (+- 0.000)
 Average time per event 3.010 usec
   Number of synthesis threads: 6
 Average synthesis took: 59210.200 usec (+- 342.890 usec)
 Average num. events: 21560.000 (+- 0.000)
 Average time per event 2.746 usec
   Number of synthesis threads: 7
 Average synthesis took: 54093.900 usec (+- 306.247 usec)
 Average num. events: 21562.000 (+- 0.000)
 Average time per event 2.509 usec
   Number of synthesis threads: 8
 Average synthesis took: 48938.700 usec (+- 341.732 usec)
 Average num. events: 21564.000 (+- 0.000)
 Average time per event 2.269 usec

Where the average time per synthesized event goes from 5.927 usec with 1
thread to 2.269 usec with 8. This isn't a linear speed up as not all of the
synthesis code has been made parallel. If the synthesis time was about
10 seconds then using 8 threads may bring this down to less than 4.
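
An illustrative invocation (the thread count is just an example):

  $ perf record --num-thread-synthesize=8 -a -- sleep 1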

Signed-off-by: Stephane Eranian 
Reviewed-by: Ian Rogers 
Acked-by: Jiri Olsa 
Cc: Adrian Hunter 
Cc: Alexander Shishkin 
Cc: Alexey Budankov 
Cc: Kan Liang 
Cc: Mark Rutland 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Tony Jones 
Cc: yuzhoujian 
Link: http://lore.kernel.org/lkml/20200422155038.9380-1-irog...@google.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/Documentation/perf-record.txt |  4 +++-
 tools/perf/builtin-record.c  | 34 +--
 tools/perf/util/record.h |  1 +-
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index b3f3b3f..6e8b464 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -596,6 +596,10 @@ Make a copy of /proc/kcore and place it into a directory 
with the perf data file
 Limit the sample data max size,  is expected to be a number with
 appended unit character - B/K/M/G
 
+--num-thread-synthesize::
+   The number of threads to run when synthesizing events for existing 
processes.
+   By default, the number of threads equals 1.
+
 SEE ALSO
 
 linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 1ab349a..2e80

Re: callchain ABI change with commit 6cbc304f2f360

2020-05-06 Thread Stephane Eranian
On Wed, May 6, 2020 at 4:37 AM Peter Zijlstra  wrote:
>
> On Tue, May 05, 2020 at 08:37:40PM -0700, Stephane Eranian wrote:
> > Hi,
> >
> > I have received reports from users who have noticed a change of
> > behaviour caused by
> > commit:
> >
> > 6cbc304f2f360 ("perf/x86/intel: Fix unwind errors from PEBS entries 
> > (mk-II)")
> >
> > When using PEBS sampling on Intel processors.
> >
> > Doing simple profiling with:
> > $ perf record -g -e cycles:pp ...
> >
> > Before:
> >
> > 1 1595951041120856 0x7f77f8 [0xe8]: PERF_RECORD_SAMPLE(IP, 0x4002):
> > 795385/690513: 0x558aa66a9607 period: 1019 addr: 0
> > ... FP chain: nr:22
> > .  0: fe00
> > .  1: 558aa66a9607
> > .  2: 558aa66a8751
> > .  3: 558a984a3d4f
> >
> > Entry 1: matches sampled IP 0x558aa66a9607.
> >
> > After:
> >
> > 3 487420973381085 0x2f797c0 [0x90]: PERF_RECORD_SAMPLE(IP, 0x4002):
> > 349591/146458: 0x559dcd2ef889 period: 1019 addr: 0
> > ... FP chain: nr:11
> > .  0: fe00
> > .  1: 559dcd2ef88b
> > .  2: 559dcd19787d
> > .  3: 559dcd1cf1be
> >
> > entry 1 does not match sampled IP anymore.
> >
> > Before the patch the kernel was stashing the sampled IP from PEBS into
> > the callchain. After the patch it is stashing the interrupted IP, thus
> > with the skid.
> >
> > I am trying to understand whether this is an intentional change or not
> > for the IP.
> >
> > It seems that stashing the interrupted IP would be more consistent across 
> > all
> > sampling modes, i.e., with and without PEBS. Entry 1: would always be
> > the interrupted IP.
> > The changelog talks about ORC unwinder being more happy this the
> > interrupted machine
> > state, but not about the ABI expectation here.
> > Could you clarify?
>
> Intentional; fundamentally, we cannot unwind a stack that no longer
> exists.
>
Ok, thanks for clarifying this.

> The PEBS record comes in after the fact, the stack at the time of record
> is irretrievably gone. The only (and best) thing we can do is provide
> the unwind at the interrupt.
>
The PEBS record is always at an IP BEFORE or EQUAL to the interrupted IP.
The stack at the PEBS sample may be gone in case the sample was captured in
the epilogue of the function, where sp/rbp are modified.

> Adding a previous IP on top of a later unwind gives a completely
> insane/broken call-stacks.

I agree that using the interrupted IP is the most reliable thing to do.

You can get the callstack at the PEBS sample with LBR callstack on Icelake
because PEBS can record LBR. I am hoping this works with the existing code.
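
For reference, that mode would be selected with something along these lines
(illustrative command only):

  $ perf record --call-graph lbr -e cycles:pp ...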


callchain ABI change with commit 6cbc304f2f360

2020-05-05 Thread Stephane Eranian
Hi,

I have received reports from users who have noticed a change of behaviour
caused by commit:

6cbc304f2f360 ("perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)")

when using PEBS sampling on Intel processors.

Doing simple profiling with:
$ perf record -g -e cycles:pp ...

Before:

1 1595951041120856 0x7f77f8 [0xe8]: PERF_RECORD_SAMPLE(IP, 0x4002):
795385/690513: 0x558aa66a9607 period: 1019 addr: 0
... FP chain: nr:22
.  0: fe00
.  1: 558aa66a9607
.  2: 558aa66a8751
.  3: 558a984a3d4f

Entry 1: matches sampled IP 0x558aa66a9607.

After:

3 487420973381085 0x2f797c0 [0x90]: PERF_RECORD_SAMPLE(IP, 0x4002):
349591/146458: 0x559dcd2ef889 period: 1019 addr: 0
... FP chain: nr:11
.  0: fe00
.  1: 559dcd2ef88b
.  2: 559dcd19787d
.  3: 559dcd1cf1be

entry 1 does not match sampled IP anymore.

Before the patch the kernel was stashing the sampled IP from PEBS into
the callchain. After the patch it is stashing the interrupted IP, thus
with the skid.

I am trying to understand whether this is an intentional change or not
for the IP.

It seems that stashing the interrupted IP would be more consistent across all
sampling modes, i.e., with and without PEBS: entry 1 would always be the
interrupted IP.
The changelog talks about the ORC unwinder being happier with the interrupted
machine state, but not about the ABI expectation here.
Could you clarify?
Thanks.


Re: [PATCH] perf/script: remove extraneous newline in perf_sample__fprintf_regs()

2020-04-29 Thread Stephane Eranian
On Wed, Apr 29, 2020 at 7:09 PM Andi Kleen  wrote:
>
> > I was under the impression that perf script was generating one line
> > per sample. Otherwise, seems hard to parse.
>
> That's only true for simple cases. A lot of the extended output options
> have long generated multiple lines. And of course call stacks always did.
>
> > Could you give me the cmdline options of perf script that justify the 
> > newline.
>
> e.g.  perf script -F iregs,uregs
>
But then it should only use the \n when needed.
It is a bit like perf stat printing the cgroup as "" when not in cgroup
mode, adding a bunch of white space/tabs at the end of the line for
nothing.

I also suggest that we improve the perf stat/script output with a format
that is easier to parse, such as JSON with name: value pairs. That would
avoid all these \n and the flaky parsing regexps and scripts I have seen,
even internally.

> -Andi


Re: [PATCH] perf/script: remove extraneous newline in perf_sample__fprintf_regs()

2020-04-29 Thread Stephane Eranian
On Mon, Apr 27, 2020 at 7:47 PM Andi Kleen  wrote:
>
> On Sat, Apr 18, 2020 at 04:19:08PM -0700, Stephane Eranian wrote:
> > When printing iregs, there was a double newline printed because
> > perf_sample__fprintf_regs() was printing its own and then at the
> > end of all fields, perf script was adding one.
> > This was causing blank line in the output:
>
> I don't think the patch is quite correct because there could be
> other fields after it, and they need to be separated by a
> new line too.
>
> e.g. i suspect if someone prints iregs,uregs or iregs,brstack
> or something else that is printed in process_event after *regs
> you would get garbled output.
>
I was under the impression that perf script was generating one line
per sample. Otherwise, it seems hard to parse.
Could you give me the perf script command-line options that justify the newline?
Thanks.

> So you have to track if the newline is needed or not.
>
> -Andi


Re: [PATCH] perf/core: fix multiplexing event scheduling issue

2019-10-23 Thread Stephane Eranian
On Wed, Oct 23, 2019 at 4:02 AM Peter Zijlstra  wrote:
>
> On Wed, Oct 23, 2019 at 12:30:03AM -0700, Stephane Eranian wrote:
> > On Mon, Oct 21, 2019 at 3:21 AM Peter Zijlstra  wrote:
> > >
> > > On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > > > This patch complements the following commit:
> > > > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> > > >
> > > > The fix from Song addresses the consequences of the problem but
> > > > not the cause. This patch fixes the causes and can sit on top of
> > > > Song's patch.
> > >
> > > I'm tempted to say the other way around.
> > >
> > > Consider the case where you claim fixed2 with a pinned event and then
> > > have another fixed2 in the flexible list. At that point you're _never_
> > > going to run any other flexible events (without Song's patch).
> > >
> > In that case, there is no deactivation or removal of events, so yes, my 
> > patch
> > will not help that case. I said his patch is still useful. You gave one 
> > example,
> > even though in this case the rotate will not yield a reschedule of that 
> > flexible
> > event because fixed2 is used by a pinned event. So checking for it, will not
> > really help.
>
> Stick 10 cycle events after the fixed2 flexible event. Without Song's
> patch you'll never see those 10 cycle events get scheduled.
>
> > > This patch isn't going to help with that. Similarly, Songs patch helps
> > > with your situation where it will allow rotation to resume after you
> > > disable/remove all active events (while you still have pending events).
> > >
> > Yes, it will unblock the case where active events are deactivated or
> > removed. But it will delay the unblocking until the next mux timer
> > expires. And I am saying this is too far away in many cases. For instance,
> > we do not run with the 1ms timer for uncore, this is way too much overhead.
> > Imagine this timer is set to 10ms or event 100ms, just with Song's patch, 
> > the
> > inactive events would have to wait for up to 100ms to be scheduled again.
> > This is not acceptable for us.
>
> Then how was it acceptible to mux in the first place? And if
> multiplexing wasn't acceptible, then why were you doing it?
>
> > > > However, the cause is not addressed. The kernel should not rely on
> > > > the multiplexing hrtimer to unblock inactive events. That timer
> > > > can have abitrary duration in the milliseconds. Until the timer
> > > > fires, counters are available, but no measurable events are using
> > > > them. We do not want to introduce blind spots of arbitrary durations.
> > >
> > > This I disagree with -- you don't get a guarantee other than
> > > timer_period/n when you multiplex, and idling the counters until the
> > > next tick doesn't violate that at all.
> >
> > My take is that if you have free counters and "idling" events, the kernel
> > should take every effort to schedule them as soon as they become available.
> > In the situation I described in the patch, once I remove the active
> > events, there
> > is no more reasons for multiplexing, all the counters are free (ignore
> > watchdog).
>
> That's fine; all I'm arguing is that the current behaviour doesn't
> violate the guarantees given. Now you want to improve counter
> utilization (at a cost) and that is fine. Just don't argue that there's
> something broken -- there is not.
>
> Your patch also does not fix something more fundamental than Song's
> patch did. Quite the reverse. Yours is purely a utilization efficiency
> thing, while Song's addressed a correctness issue.
>
Going back to Song's patch and his test case: it exposes a problem that was
introduced with the RB tree and multiple event list changes. In the event
scheduler, there was a guarantee that each event would get a chance to be
scheduled, because each would eventually get to the head of the list and thus
be scheduled as the first event of its priority class, assuming there was
still at least one compatible counter left over from the higher priority
classes. The scheduler also still stops at the first error. In Song's case,
ref-cycles:D,ref-cycles,cycles, the pinned event is commandeering fixed2. But
I believe the rotation code was not rotating the list *even* if it could not
schedule the first event. That explained why the cycles event could never be
scheduled, and that is a violation of the guarantee: at each timer, the list
must rotate. I think his patch somehow addresses this.

> > Now

Re: [PATCH] perf/core: fix multiplexing event scheduling issue

2019-10-23 Thread Stephane Eranian
On Mon, Oct 21, 2019 at 3:21 AM Peter Zijlstra  wrote:
>
> On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > This patch complements the following commit:
> > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> >
> > The fix from Song addresses the consequences of the problem but
> > not the cause. This patch fixes the causes and can sit on top of
> > Song's patch.
>
> I'm tempted to say the other way around.
>
> Consider the case where you claim fixed2 with a pinned event and then
> have another fixed2 in the flexible list. At that point you're _never_
> going to run any other flexible events (without Song's patch).
>
In that case, there is no deactivation or removal of events, so yes, my patch
will not help that case. I said his patch is still useful. You gave one example,
even though in this case the rotate will not yield a reschedule of that flexible
event because fixed2 is used by a pinned event. So checking for it, will not
really help.

> This patch isn't going to help with that. Similarly, Songs patch helps
> with your situation where it will allow rotation to resume after you
> disable/remove all active events (while you still have pending events).
>
Yes, it will unblock the case where active events are deactivated or
removed. But it will delay the unblocking until the next mux timer
expires. And I am saying this is too far away in many cases. For instance,
we do not run with the 1ms timer for uncore; that would be way too much overhead.
Imagine this timer is set to 10ms or even 100ms: just with Song's patch, the
inactive events would have to wait for up to 100ms to be scheduled again.
This is not acceptable for us.

> > This patch fixes a scheduling problem in the core functions of
> > perf_events. Under certain conditions, some events would not be
> > scheduled even though many counters would be available. This
> > is related to multiplexing and is architecture agnostic and
> > PMU agnostic (i.e., core or uncore).
> >
> > This problem can easily be reproduced when you have two perf
> > stat sessions. The first session does not cause multiplexing,
> > let's say it is measuring 1 event, E1. While it is measuring,
> > a second session starts and causes multiplexing. Let's say it
> > adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
> > When the second session terminates, all 6 (B1-B6) events are
> > removed. Normally, you'd expect the E1 event to continue to run
> > with no multiplexing. However, the problem is that depending on
> > the state Of E1 when B1-B6 are removed, it may never be scheduled
> > again. If E1 was inactive at the time of removal, despite the
> > multiplexing hrtimer still firing, it will not find any active
> > events and will not try to reschedule. This is what Song's patch
> > fixes. It forces the multiplexing code to consider non-active events.
>
> This; so Song's patch fixes the fundamental problem of the rotation not
> working right under certain conditions.
>
> > However, the cause is not addressed. The kernel should not rely on
> > the multiplexing hrtimer to unblock inactive events. That timer
> > can have abitrary duration in the milliseconds. Until the timer
> > fires, counters are available, but no measurable events are using
> > them. We do not want to introduce blind spots of arbitrary durations.
>
> This I disagree with -- you don't get a guarantee other than
> timer_period/n when you multiplex, and idling the counters until the
> next tick doesn't violate that at all.

My take is that if you have free counters and "idling" events, the kernel
should make every effort to schedule them as soon as counters become available.
In the situation I described in the patch, once I remove the active events,
there is no more reason for multiplexing; all the counters are free (ignoring
the watchdog).
Now you may argue that it may take more time to ctx_resched() than to wait
for the timer to expire, but I am not sure I buy that. Similarly, I am not
sure there is code to cancel an active mux hrtimer when we clear
rotate_necessary. Maybe we just let it lapse and clear itself via a
ctx_sched_out() in the rotation code.

>
> > This patch addresses the cause of the problem, by checking that,
> > when an event is disabled or removed and the context was multiplexing
> > events, inactive events immediately get a chance to be scheduled by
> > calling ctx_resched(). The rescheduling is done on events of equal
> > or lower priority types.  With that in place, as soon as a counter
> > is freed, schedulable inactive events may run, thereby eliminating
> > a blind spot.
>
> Disagreed, Song's patch removed the fundamental blind spot of rot

Re: [PATCH] perf/core: fix multiplexing event scheduling issue

2019-10-23 Thread Stephane Eranian
On Mon, Oct 21, 2019 at 3:06 AM Peter Zijlstra  wrote:
>
> On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > @@ -2153,6 +2157,7 @@ __perf_remove_from_context(struct perf_event *event,
> >  void *info)
> >  {
> >   unsigned long flags = (unsigned long)info;
> > + int was_necessary = ctx->rotate_necessary;
> >
> >   if (ctx->is_active & EVENT_TIME) {
> >   update_context_time(ctx);
> > @@ -2171,6 +2176,37 @@ __perf_remove_from_context(struct perf_event *event,
> >   cpuctx->task_ctx = NULL;
> >   }
> >   }
> > +
> > + /*
> > +  * sanity check that event_sched_out() does not and will not
> > +  * change the state of ctx->rotate_necessary
> > +  */
> > + WARN_ON(was_necessary != event->ctx->rotate_necessary);
>
> It doesn't... why is this important to check?
>
I can remove that. It is leftover from debugging. It is okay to look at the
situation after event_sched_out(). Today, it does not change rotate_necessary.

> > + /*
> > +  * if we remove an event AND we were multiplexing then, that means
> > +  * we had more events than we have counters for, and thus, at least,
> > +  * one event was in INACTIVE state. Now, that we removed an event,
> > +  * we need to resched to give a chance to all events to get scheduled,
> > +  * otherwise some may get stuck.
> > +  *
> > +  * By the time this function is called the event is usually in the OFF
> > +  * state.
> > +  * Note that this is not a duplicate of the same code in 
> > _perf_event_disable()
> > +  * because the call paths are different. Some events may be simply 
> > disabled
>
> It is the exact same code twice though; IIRC this C language has a
> feature to help with that.

Sure! I will make a function to check on the condition.
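
Something along these lines, factoring the hunk below into one helper that
both __perf_remove_from_context() and __perf_event_disable() can call (only
a sketch, the helper name is made up):

static void perf_ctx_resched_if_multiplexing(struct perf_cpu_context *cpuctx,
                                             struct perf_event_context *ctx,
                                             struct perf_event *event)
{
        int type;

        /* only act if we were multiplexing and still have events */
        if (!ctx->rotate_necessary || !ctx->nr_events)
                return;

        type = get_event_type(event);
        /*
         * Freeing a counter held by a pinned event may unblock both
         * pinned and flexible events. The opposite is not true: a
         * pinned event can never be inactive due to multiplexing.
         */
        if (type & EVENT_PINNED)
                type |= EVENT_FLEXIBLE;

        ctx_resched(cpuctx, cpuctx->task_ctx, type);
}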

>
> > +  * others removed. There is a way to get removed and not be disabled 
> > first.
> > +  */
> > + if (ctx->rotate_necessary && ctx->nr_events) {
> > + int type = get_event_type(event);
> > + /*
> > +  * In case we removed a pinned event, then we need to
> > +  * resched for both pinned and flexible events. The
> > +  * opposite is not true. A pinned event can never be
> > +  * inactive due to multiplexing.
> > +  */
> > + if (type & EVENT_PINNED)
> > + type |= EVENT_FLEXIBLE;
> > + ctx_resched(cpuctx, cpuctx->task_ctx, type);
> > + }
>
> What you're relying on is that ->rotate_necessary implies ->is_active
> and there's pending events. And if we tighten ->rotate_necessary you can
> remove the && ->nr_events.
>
Imagine I have 6 events and 4 counters, and I delete them all before the
timer expires. Then I can be in a situation where rotate_necessary is still
true and yet there are no more events in the context. That is because only
ctx_sched_out() clears rotate_necessary, IIRC. That is why there is the
&& nr_events. Now, calling ctx_resched() with no events probably wouldn't
cause any harm, just wasted work. So by tightening, I am guessing you mean
clearing rotate_necessary earlier. But that would be tricky because the only
reliable way of clearing it is when you know you are about to reschedule
everything. Removing an event by itself may not be enough to eliminate
multiplexing.


> > @@ -2232,6 +2270,35 @@ static void __perf_event_disable(struct perf_event 
> > *event,
> >   event_sched_out(event, cpuctx, ctx);
> >
> >   perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> > + /*
> > +  * sanity check that event_sched_out() does not and will not
> > +  * change the state of ctx->rotate_necessary
> > +  */
> > + WARN_ON_ONCE(was_necessary != event->ctx->rotate_necessary);
> > +
> > + /*
> > +  * if we disable an event AND we were multiplexing then, that means
> > +  * we had more events than we have counters for, and thus, at least,
> > +  * one event was in INACTIVE state. Now, that we disabled an event,
> > +  * we need to resched to give a chance to all events to be scheduled,
> > +  * otherwise some may get stuck.
> > +  *
> > +  * Note that this is not a duplicate of the same code in
> > +  * __perf_remove_from_context()
> > +  * because events can be disabled without being removed.
>
>

Re: [PATCH] perf/core: fix multiplexing event scheduling issue

2019-10-18 Thread Stephane Eranian
On Thu, Oct 17, 2019 at 11:13 PM Song Liu  wrote:
>
>
>
> > On Oct 17, 2019, at 5:27 PM, Stephane Eranian  wrote:
> >
> > This patch complements the following commit:
> > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> >
> > The fix from Song addresses the consequences of the problem but
> > not the cause. This patch fixes the causes and can sit on top of
> > Song's patch.
> >
> > This patch fixes a scheduling problem in the core functions of
> > perf_events. Under certain conditions, some events would not be
> > scheduled even though many counters would be available. This
> > is related to multiplexing and is architecture agnostic and
> > PMU agnostic (i.e., core or uncore).
> >
> > This problem can easily be reproduced when you have two perf
> > stat sessions. The first session does not cause multiplexing,
> > let's say it is measuring 1 event, E1. While it is measuring,
> > a second session starts and causes multiplexing. Let's say it
> > adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
> > When the second session terminates, all 6 (B1-B6) events are
> > removed. Normally, you'd expect the E1 event to continue to run
> > with no multiplexing. However, the problem is that depending on
> > the state of E1 when B1-B6 are removed, it may never be scheduled
> > again. If E1 was inactive at the time of removal, despite the
> > multiplexing hrtimer still firing, it will not find any active
> > events and will not try to reschedule. This is what Song's patch
> > fixes. It forces the multiplexing code to consider non-active events.
>
> Good analysis! I kinda knew the example I had (with pinned event)
> was not the only way to trigger this issue. But I didn't think
> about event remove path.
>
I was pursuing this bug without knowledge of your patch. Your patch makes it
harder to see: in my test case it clearly disappears, but that is just
because of the multiplexing interval. If we get to the rotate code and we
have no active events yet some inactive ones, something is wrong because we
are wasting counters. When I tracked the bug, I started from the
remove_context code, then realized there was also the disable case. I fixed
both and then discovered your patch, which fixes it at the receiving end.
Hopefully there aren't any other code paths that can lead to this situation.


> > However, the cause is not addressed. The kernel should not rely on
> > the multiplexing hrtimer to unblock inactive events. That timer
> > can have arbitrary duration in the milliseconds. Until the timer
> > fires, counters are available, but no measurable events are using
> > them. We do not want to introduce blind spots of arbitrary durations.
> >
> > This patch addresses the cause of the problem, by checking that,
> > when an event is disabled or removed and the context was multiplexing
> > events, inactive events immediately get a chance to be scheduled by
> > calling ctx_resched(). The rescheduling is done on events of equal
> > or lower priority types.  With that in place, as soon as a counter
> > is freed, schedulable inactive events may run, thereby eliminating
> > a blind spot.
> >
> > This can be illustrated as follows using Skylake uncore CHA here:
> >
> > $ perf stat --no-merge -a -I 1000 -C 28 -e uncore_cha_0/event=0x0/
> >42.007856531  2,000,291,322  uncore_cha_0/event=0x0/
> >43.008062166  2,000,399,526  uncore_cha_0/event=0x0/
> >44.008293244  2,000,473,720  uncore_cha_0/event=0x0/
> >45.008501847  2,000,423,420  uncore_cha_0/event=0x0/
> >46.008706558  2,000,411,132  uncore_cha_0/event=0x0/
> >47.008928543  2,000,441,660  uncore_cha_0/event=0x0/
> >
> > Adding a second session with 4 events for 4s
> >
> > $ perf stat -a -I 1000 -C 28 --no-merge -e 
> > uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/
> >  sleep 4
> >48.009114643  1,983,129,830  uncore_cha_0/event=0x0/ 
> >   (99.71%)
> >49.009279616  1,976,067,751  uncore_cha_0/event=0x0/ 
> >   (99.30%)
> >50.009428660  1,974,448,006  uncore_cha_0/event=0x0/ 
> >   (98.92%)
> >51.009524309  1,973,083,237  uncore_cha_0/event=0x0/ 
> >   (98.55%)
> >52.009673467  1,972,097,678  uncore_cha_0/event=0x0/ 
> >   (97.11%)
> >
> > End of 4s, second 

[PATCH] perf/core: fix multiplexing event scheduling issue

2019-10-17 Thread Stephane Eranian
This patch complements the following commit:
7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")

The fix from Song addresses the consequences of the problem but
not the cause. This patch fixes the cause and can sit on top of
Song's patch.

This patch fixes a scheduling problem in the core functions of
perf_events. Under certain conditions, some events would not be
scheduled even though many counters would be available. This
is related to multiplexing and is architecture agnostic and
PMU agnostic (i.e., core or uncore).

This problem can easily be reproduced when you have two perf
stat sessions. The first session does not cause multiplexing,
let's say it is measuring 1 event, E1. While it is measuring,
a second session starts and causes multiplexing. Let's say it
adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
When the second session terminates, all 6 (B1-B6) events are
removed. Normally, you'd expect the E1 event to continue to run
with no multiplexing. However, the problem is that depending on
the state of E1 when B1-B6 are removed, it may never be scheduled
again. If E1 was inactive at the time of removal, despite the
multiplexing hrtimer still firing, it will not find any active
events and will not try to reschedule. This is what Song's patch
fixes. It forces the multiplexing code to consider non-active events.
However, the cause is not addressed. The kernel should not rely on
the multiplexing hrtimer to unblock inactive events. That timer
can have arbitrary duration in the milliseconds. Until the timer
fires, counters are available, but no measurable events are using
them. We do not want to introduce blind spots of arbitrary durations.

This patch addresses the cause of the problem, by checking that,
when an event is disabled or removed and the context was multiplexing
events, inactive events immediately get a chance to be scheduled by
calling ctx_resched(). The rescheduling is done on events of equal
or lower priority types.  With that in place, as soon as a counter
is freed, schedulable inactive events may run, thereby eliminating
a blind spot.

This can be illustrated as follows using Skylake uncore CHA here:

$ perf stat --no-merge -a -I 1000 -C 28 -e uncore_cha_0/event=0x0/
42.007856531  2,000,291,322  uncore_cha_0/event=0x0/
43.008062166  2,000,399,526  uncore_cha_0/event=0x0/
44.008293244  2,000,473,720  uncore_cha_0/event=0x0/
45.008501847  2,000,423,420  uncore_cha_0/event=0x0/
46.008706558  2,000,411,132  uncore_cha_0/event=0x0/
47.008928543  2,000,441,660  uncore_cha_0/event=0x0/

Adding a second session with 4 events for 4s

$ perf stat -a -I 1000 -C 28 --no-merge -e 
uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/
 sleep 4
48.009114643  1,983,129,830  uncore_cha_0/event=0x0/
   (99.71%)
49.009279616  1,976,067,751  uncore_cha_0/event=0x0/
   (99.30%)
50.009428660  1,974,448,006  uncore_cha_0/event=0x0/
   (98.92%)
51.009524309  1,973,083,237  uncore_cha_0/event=0x0/
   (98.55%)
52.009673467  1,972,097,678  uncore_cha_0/event=0x0/
   (97.11%)

At the end of the 4s, the second session is removed. But the first event does
not get scheduled, and never will, unless new events force multiplexing again.

53.009815999uncore_cha_0/event=0x0/
   (95.28%)
54.009961809uncore_cha_0/event=0x0/
   (93.52%)
55.010110972uncore_cha_0/event=0x0/
   (91.82%)
56.010217579uncore_cha_0/event=0x0/
   (90.18%)

Signed-off-by: Stephane Eranian 
---
 kernel/events/core.c | 67 
 1 file changed, 67 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ec0b0bfddbd..578587246ffb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2140,6 +2140,10 @@ group_sched_out(struct perf_event *group_event,
 
 #define DETACH_GROUP   0x01UL
 
+static void ctx_resched(struct perf_cpu_context *cpuctx,
+   struct perf_event_context *task_ctx,
+   enum event_type_t event_type);
+
 /*
  * Cross CPU call to remove a performance event
  *
@@ -2153,6 +2157,7 @@ __perf_remove_from_context(struct perf_event *event,
   void *info)
 {
unsigned long flags = (unsigned long)info;
+   int was_necessary = ctx->rotate_necessary;
 
if (ctx->is_active & EVENT_TIME) {
update_context_time(ctx);
@@ -2171,6 +2176,37 @@ __perf_remove_from_context(struct perf_event *event,
  

Re: [PATCH 4/4] perf docs: Correct and clarify jitdump spec

2019-09-27 Thread Stephane Eranian
On Fri, Sep 27, 2019 at 6:53 PM Steve MacLean
 wrote:
>
> Specification claims latest version of jitdump file format is 2. Current
> jit dump reading code treats 1 as the latest version.
>
> Correct spec to match code.
>
> The original language made it unclear which value should be written in the
> magic field.
>
> Revise the language so that the writer always writes the same value. Specify that
> the reader uses the value to detect endian mismatches.
>
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Mark Rutland 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Stephane Eranian 
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Steve MacLean 

Corrections are valid.

Acked-by: Stephane Eranian 

> ---
>  tools/perf/Documentation/jitdump-specification.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/Documentation/jitdump-specification.txt 
> b/tools/perf/Documentation/jitdump-specification.txt
> index 4c62b07..52152d1 100644
> --- a/tools/perf/Documentation/jitdump-specification.txt
> +++ b/tools/perf/Documentation/jitdump-specification.txt
> @@ -36,8 +36,8 @@ III/ Jitdump file header format
>  Each jitdump file starts with a fixed size header containing the following 
> fields in order:
>
>
> -* uint32_t magic : a magic number tagging the file type. The value is 
> 4-byte long and represents the string "JiTD" in ASCII form. It is 0x4A695444 
> or 0x4454694a depending on the endianness. The field can be used to detect 
> the endianness of the file
> -* uint32_t version   : a 4-byte value representing the format version. It is 
> currently set to 2
> +* uint32_t magic : a magic number tagging the file type. The value is 
> 4-byte long and represents the string "JiTD" in ASCII form. It is written as 
> 0x4A695444. The reader will detect an endian mismatch when it reads 
> 0x4454694a.
> +* uint32_t version   : a 4-byte value representing the format version. It is 
> currently set to 1
>  * uint32_t total_size: size in bytes of file header
>  * uint32_t elf_mach  : ELF architecture encoding (ELF e_machine value as 
> specified in /usr/include/elf.h)
>  * uint32_t pad1  : padding. Reserved for future use
> --
> 2.7.4
>
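
For readers of the thread, a minimal consumer-side sketch of what the revised
wording implies. This is not the perf tool's actual reader; the struct and
names below are just an illustration of the header fields listed in the spec:

#include <stdint.h>
#include <stdio.h>
#include <byteswap.h>

#define JITDUMP_MAGIC         0x4A695444u  /* "JiTD", always written as-is  */
#define JITDUMP_MAGIC_SWAPPED 0x4454694Au  /* seen on an endian mismatch    */

struct jitdump_prefix {
        uint32_t magic;
        uint32_t version;
        uint32_t total_size;
        uint32_t elf_mach;
};

static int read_jitdump_prefix(FILE *fp, struct jitdump_prefix *h,
                               int *needs_swap)
{
        if (fread(h, sizeof(*h), 1, fp) != 1)
                return -1;

        if (h->magic == JITDUMP_MAGIC) {
                *needs_swap = 0;
        } else if (h->magic == JITDUMP_MAGIC_SWAPPED) {
                /* writer had the opposite byte order: swap every field */
                *needs_swap = 1;
                h->version    = bswap_32(h->version);
                h->total_size = bswap_32(h->total_size);
                h->elf_mach   = bswap_32(h->elf_mach);
        } else {
                return -1; /* not a jitdump file */
        }
        return 0;
}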


Re: [PATCH] perf record: fix priv level with branch sampling for paranoid=2

2019-09-20 Thread Stephane Eranian
On Fri, Sep 20, 2019 at 12:12 PM Jiri Olsa  wrote:
>
> On Tue, Sep 03, 2019 at 11:26:03PM -0700, Stephane Eranian wrote:
> > Now that the default perf_events paranoid level is set to 2, a regular user
> > cannot monitor kernel level activity anymore. As such, with the following
> > cmdline:
> >
> > $ perf record -e cycles date
> >
> > The perf tool first tries cycles:uk but then falls back to cycles:u
> > as can be seen in the perf report --header-only output:
> >
> >   cmdline : /export/hda3/tmp/perf.tip record -e cycles ls
> >   event : name = cycles:u, , id = { 436186, ... }
> >
> > This is okay as long as there is a way to learn the priv level was changed
> > internally by the tool.
> >
> > But consider a similar example:
> >
> > $ perf record -b -e cycles date
> > Error:
> > You may not have permission to collect stats.
> >
> > Consider tweaking /proc/sys/kernel/perf_event_paranoid,
> > which controls use of the performance events system by
> > unprivileged users (without CAP_SYS_ADMIN).
> > ...
> >
> > Why is that treated differently given that the branch sampling inherits the
> > priv level of the first event in this case, i.e., cycles:u? It turns out
> > that the branch sampling code is more picky and also checks exclude_hv.
> >
> > In the fallback path, perf record is setting exclude_kernel = 1, but it
> > does not change exclude_hv. This does not seem to match the restriction
> > imposed by paranoid = 2.
> >
> > This patch fixes the problem by forcing exclude_hv = 1 in the fallback
> > for paranoid=2. With this in place:
> >
> > $ perf record -b -e cycles date
> >   cmdline : /export/hda3/tmp/perf.tip record -b -e cycles ls
> >   event : name = cycles:u, , id = { 436847, ... }
> >
> > And the command succeeds as expected.
> >
> > Signed-off-by: Stephane Eranian 
> > ---
> >  tools/perf/util/evsel.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index 85825384f9e8..3cbe06fdf7f7 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -2811,9 +2811,11 @@ bool perf_evsel__fallback(struct evsel *evsel, int 
> > err,
> >   if (evsel->name)
> >   free(evsel->name);
> >   evsel->name = new_name;
> > - scnprintf(msg, msgsize,
> > -"kernel.perf_event_paranoid=%d, trying to fall back to excluding kernel 
> > samples", paranoid);
> > + scnprintf(msg, msgsize, "kernel.perf_event_paranoid=%d, 
> > trying "
> > +   "to fall back to excluding kernel and hypervisor "
> > +   " samples", paranoid);
>
> extra space in here^
>
> Warning:
> kernel.perf_event_paranoid=2, trying to fall back to excluding kernel 
> and hypervisor  samples
>
> other than that it looks good to me
>
Fixed in v2.

> Acked-by: Jiri Olsa 
>
> thanks,
> jirka


[PATCH v2] perf record: fix priv level with branch sampling for paranoid=2

2019-09-20 Thread Stephane Eranian
Now that the default perf_events paranoid level is set to 2, a regular user
cannot monitor kernel level activity anymore. As such, with the following
cmdline:

$ perf record -e cycles date

The perf tool first tries cycles:uk but then falls back to cycles:u
as can be seen in the perf report --header-only output:

  cmdline : /export/hda3/tmp/perf.tip record -e cycles ls
  event : name = cycles:u, , id = { 436186, ... }

This is okay as long as there is a way to learn the priv level was changed
internally by the tool.

But consider a similar example:

$ perf record -b -e cycles date
Error:
You may not have permission to collect stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_SYS_ADMIN).
...

Why is that treated differently given that the branch sampling inherits the
priv level of the first event in this case, i.e., cycles:u? It turns out
that the branch sampling code is more picky and also checks exclude_hv.

In the fallback path, perf record is setting exclude_kernel = 1, but it
does not change exclude_hv. This does not seem to match the restriction
imposed by paranoid = 2.

This patch fixes the problem by forcing exclude_hv = 1 in the fallback
for paranoid=2. With this in place:

$ perf record -b -e cycles date
  cmdline : /export/hda3/tmp/perf.tip record -b -e cycles ls
  event : name = cycles:u, , id = { 436847, ... }

And the command succeeds as expected.

V2 fixes a whitespace issue.

Signed-off-by: Stephane Eranian 
---
 tools/perf/util/evsel.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 85825384f9e8..3cbe06fdf7f7 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2811,9 +2811,11 @@ bool perf_evsel__fallback(struct evsel *evsel, int err,
if (evsel->name)
free(evsel->name);
evsel->name = new_name;
-   scnprintf(msg, msgsize,
-"kernel.perf_event_paranoid=%d, trying to fall back to excluding kernel 
samples", paranoid);
+   scnprintf(msg, msgsize, "kernel.perf_event_paranoid=%d, trying "
+ "to fall back to excluding kernel and hypervisor "
+ " samples", paranoid);
evsel->core.attr.exclude_kernel = 1;
+   evsel->core.attr.exclude_hv = 1;

return true;
}
-- 
2.23.0.187.g17f5b7556c-goog



Re: [PATCH] perf record: fix priv level with branch sampling for paranoid=2

2019-09-13 Thread Stephane Eranian
On Tue, Sep 3, 2019 at 11:26 PM Stephane Eranian  wrote:
>
> Now that the default perf_events paranoid level is set to 2, a regular user
> cannot monitor kernel level activity anymore. As such, with the following
> cmdline:
>
> $ perf record -e cycles date
>
> The perf tool first tries cycles:uk but then falls back to cycles:u
> as can be seen in the perf report --header-only output:
>
>   cmdline : /export/hda3/tmp/perf.tip record -e cycles ls
>   event : name = cycles:u, , id = { 436186, ... }
>
> This is okay as long as there is a way to learn the priv level was changed
> internally by the tool.
>
> But consider a similar example:
>
> $ perf record -b -e cycles date
> Error:
> You may not have permission to collect stats.
>
> Consider tweaking /proc/sys/kernel/perf_event_paranoid,
> which controls use of the performance events system by
> unprivileged users (without CAP_SYS_ADMIN).
> ...
>
> Why is that treated differently given that the branch sampling inherits the
> priv level of the first event in this case, i.e., cycles:u? It turns out
> that the branch sampling code is more picky and also checks exclude_hv.
>
> In the fallback path, perf record is setting exclude_kernel = 1, but it
> does not change exclude_hv. This does not seem to match the restriction
> imposed by paranoid = 2.
>
> This patch fixes the problem by forcing exclude_hv = 1 in the fallback
> for paranoid=2. With this in place:
>
> $ perf record -b -e cycles date
>   cmdline : /export/hda3/tmp/perf.tip record -b -e cycles ls
>   event : name = cycles:u, , id = { 436847, ... }
>
> And the command succeeds as expected.
>
Any comment on this patch?

> Signed-off-by: Stephane Eranian 
> ---
>  tools/perf/util/evsel.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 85825384f9e8..3cbe06fdf7f7 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2811,9 +2811,11 @@ bool perf_evsel__fallback(struct evsel *evsel, int err,
> if (evsel->name)
> free(evsel->name);
> evsel->name = new_name;
> -   scnprintf(msg, msgsize,
> -"kernel.perf_event_paranoid=%d, trying to fall back to excluding kernel 
> samples", paranoid);
> +   scnprintf(msg, msgsize, "kernel.perf_event_paranoid=%d, 
> trying "
> + "to fall back to excluding kernel and hypervisor "
> + " samples", paranoid);
> evsel->core.attr.exclude_kernel = 1;
> +   evsel->core.attr.exclude_hv = 1;
>
> return true;
> }
> --
> 2.23.0.187.g17f5b7556c-goog
>


[PATCH] perf record: fix priv level with branch sampling for paranoid=2

2019-09-04 Thread Stephane Eranian
Now that the default perf_events paranoid level is set to 2, a regular user
cannot monitor kernel level activity anymore. As such, with the following
cmdline:

$ perf record -e cycles date

The perf tool first tries cycles:uk but then falls back to cycles:u
as can be seen in the perf report --header-only output:

  cmdline : /export/hda3/tmp/perf.tip record -e cycles ls
  event : name = cycles:u, , id = { 436186, ... }

This is okay as long as there is a way to learn the priv level was changed
internally by the tool.

But consider a similar example:

$ perf record -b -e cycles date
Error:
You may not have permission to collect stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_SYS_ADMIN).
...

Why is that treated differently given that the branch sampling inherits the
priv level of the first event in this case, i.e., cycles:u? It turns out
that the branch sampling code is more picky and also checks exclude_hv.

In the fallback path, perf record is setting exclude_kernel = 1, but it
does not change exclude_hv. This does not seem to match the restriction
imposed by paranoid = 2.

This patch fixes the problem by forcing exclude_hv = 1 in the fallback
for paranoid=2. With this in place:

$ perf record -b -e cycles date
  cmdline : /export/hda3/tmp/perf.tip record -b -e cycles ls
  event : name = cycles:u, , id = { 436847, ... }

And the command succeeds as expected.

Signed-off-by: Stephane Eranian 
---
 tools/perf/util/evsel.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 85825384f9e8..3cbe06fdf7f7 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2811,9 +2811,11 @@ bool perf_evsel__fallback(struct evsel *evsel, int err,
if (evsel->name)
free(evsel->name);
evsel->name = new_name;
-   scnprintf(msg, msgsize,
-"kernel.perf_event_paranoid=%d, trying to fall back to excluding kernel 
samples", paranoid);
+   scnprintf(msg, msgsize, "kernel.perf_event_paranoid=%d, trying "
+ "to fall back to excluding kernel and hypervisor "
+ " samples", paranoid);
evsel->core.attr.exclude_kernel = 1;
+   evsel->core.attr.exclude_hv = 1;
 
return true;
}
-- 
2.23.0.187.g17f5b7556c-goog



Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-31 Thread Stephane Eranian
Andi,

On Fri, Aug 30, 2019 at 5:31 PM Andi Kleen  wrote:
>
> > the same manner. It would greatly simplify the kernel implementation.
>
> I tried that originally. It was actually more complicated.
>
> You can't really do deltas on raw metrics, and a lot of the perf
> infrastructure is built around deltas.
>
How is RAPL handled? No deltas there either. It uses the snapshot model.
At each interval, perf stat just reads the current count, and does not compute
a delta since previous read.
With PERF_METRICS, the delta is always since previous read. If you read
frequently enough you do not lose precision.

>
> To do the regular reset and not lose precision over time internally
> you have to keep expanded counters anyways. And if you do that
> you can just expose them to user space too, and have everything
> in user space just work without any changes (except for the final
> output)
>
> -Andi
>


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-30 Thread Stephane Eranian
On Fri, Aug 30, 2019 at 3:49 PM Namhyung Kim  wrote:
>
> On Fri, Aug 30, 2019 at 4:35 PM Peter Zijlstra  wrote:
> >
> > On Fri, Aug 30, 2019 at 12:46:51PM +0900, Namhyung Kim wrote:
> > > Hi Peter,
> > >
> > > On Wed, Aug 28, 2019 at 6:45 PM Peter Zijlstra  
> > > wrote:
> > > >
> > > > On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> > > > > To support cgroup tracking, add CGROUP event to save a link between
> > > > > cgroup path and inode number.  The attr.cgroup bit was also added to
> > > > > enable cgroup tracking from userspace.
> > > > >
> > > > > This event will be generated when a new cgroup becomes active.
> > > > > Userspace might need to synthesize those events for existing cgroups.
> > > > >
> > > > > As aux_output change is also going on, I just added the bit here as
> > > > > well to remove possible conflicts later.
> > > >
> > > > Why do we want this?
> > >
> > > I saw below [1] and thought you have the patch introduced aux_output
> > > and it's gonna to be merged soon.
> > > Also the tooling patches are already in the acme/perf/core
> > > so I just wanted to avoid conflicts.
> > >
> > > Anyway, I'm ok with changing it.  Will remove in v2.
> >
> > I seem to have confused both you and Arnaldo with this. This email
> > questions the Changelog as a whole, not just the aux thing (I send a
> > separate email for that).
> >
> > This Changelog utterly fails to explain to me _why_ we need/want cgroup
> > tracking. So why do I want to review and possibly merge this? Changelog
> > needs to answer this.
>
> OK.  How about this?
>
> Systems running a large number of jobs in different cgroups want to
> profile such jobs precisely.  This includes container hosting systems
> widely used today.  Currently perf supports namespace tracking but
> the systems may not use (cgroup) namespace for their jobs.  Also
> it'd be more intuitive to see cgroup names (as they're given by user
> or sysadmin) rather than numeric cgroup/namespace id even if they
> use the namespaces.
>

In data centers, you care about attributing samples to a job, not so much to
a process. A job may have multiple processes which may come and go. The
cgroup, on the other hand, stays around for the entire lifetime of the job.
It is much easier to map a cgroup name to a particular job than it is to map
a pid back to a job name, especially for offline post-processing.

Hope this clarifies why we would like this feature upstream.


>
> Thanks,
> Namhyung


Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

2019-08-30 Thread Stephane Eranian
Hi,

On Mon, Aug 26, 2019 at 7:48 AM  wrote:
>
> From: Kan Liang 
>
> Intro
> =
>
> Icelake has support for measuring the four top level TopDown metrics
> directly in hardware. This is implemented by an additional "metrics"
> register, and a new Fixed Counter 3 that measures pipeline "slots".
>
> Events
> ==
>
> We export four metric events as separate perf events, which map to
> internal "metrics" counter register. Those events do not exist in
> hardware, but can be allocated by the scheduler.
>
There is another approach possible for supporting Topdown-style counters.
Instead of trying to abstract them as separate events to the user, then
trying to put them back together in the kernel and using slots to scale
them into counts, we could just expose them as is, i.e., structured counter
values. The kernel already handles structured counter configs and exports
the fields of the config via sysfs, and the perf tool picks them up and can
encode any event. We could have a similar approach for a counter value: it
could have fields, units, types. Perf stat would pick them up in the same
manner. It would greatly simplify the kernel implementation. You would need
to publish a pseudo-event code for each group of metrics. Note that I am not
advocating exposing the raw counter value. That way you would maintain one
event code -> one "counter" on hw. The reset on read would also work. It
would generate only one rdmsr per read without forcing any grouping. You
would not need any grouping, or to use slots under the hood to scale. If
PERF_METRICS gets extended, you can just add another pseudo event code or
umask.

The PERF_METRICS events do not make real sense in isolation. The SLOTS
scaling is hard to interpret. You can never profile on a PERF_METRICS event,
so keeping them grouped is okay.
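
To make the scaling concrete, here is a rough user-space sketch of how such a
structured value could be decoded, assuming the Icelake layout of four 8-bit
PERF_METRICS fields (retiring, bad speculation, frontend bound, backend
bound), each a fraction of 255 of SLOTS. The names below are purely
illustrative:

#include <stdint.h>

enum { TD_RETIRING = 0, TD_BAD_SPEC = 1, TD_FE_BOUND = 2, TD_BE_BOUND = 3 };

/* scale one 8-bit fraction-of-255 field into a slot count, the same
 * multiplication the patch above performs in the kernel */
static inline uint64_t td_metric_slots(uint64_t metrics, int idx,
                                       uint64_t slots)
{
        uint64_t frac = (metrics >> (idx * 8)) & 0xff;

        return frac * slots / 0xff;
}

/* e.g. slots attributed to the backend-bound category:
 *   td_metric_slots(perf_metrics_value, TD_BE_BOUND, slots_value);
 */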


> For the event mapping we use a special 0x00 event code, which is
> reserved for fake events. The metric events start from umask 0x10.
>
> When setting up such events they point to the slots counter, and a
> special callback, update_topdown_event(), reads the additional metrics
> msr to generate the metrics. Then the metric is reported by multiplying
> the metric (percentage) with slots.
>
> This multiplication allows to easily keep a running count, for example
> when the slots counter overflows, and makes all the standard tools, such
> as a perf stat, work. They can do deltas of the values without needing
> to know about percentages. This also simplifies accumulating the counts
> of child events, which otherwise would need to know how to average
> percent values.
>
> All four metric events don't support sampling. Since they will be
> handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
> introduced to indicate this case.
>
> The slots event can support both sampling and counting.
> For counting, the flag is also applied.
> For sampling, it will be handled normally as other normal events.
>
> Groups
> ==
>
> To avoid reading the METRICS register multiple times, the metrics and
> slots value can only be updated by the first slots/metrics event in a
> group. All active slots and metrics events will be updated one time.
>
> Reset
> ==
>
> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
> because:
> - The 8bit metrics ratio values lose precision when the measurement
>   period gets longer.
> - The PERF_METRICS may report wrong value if its delta was less than
>   1/255 of SLOTS (Fixed counter 3).
>
> Also, for counting, the -max_period is the initial value of the SLOTS.
> The huge initial value will definitely trigger the issue mentioned
> above. Force initial value as 0 for topdown and slots event counting.
>
> NMI
> ==
>
> The METRICS register may be overflow. The bit 48 of STATUS register
> will be set. If so, update all active slots and metrics events.
>
> The update_topdown_event() has to read two registers separately. The
> values may be modify by a NMI. PMU has to be disabled before calling the
> function.
>
> RDPMC
> ==
>
> RDPMC is temporarily disabled. The following patch will enable it.
>
> Originally-by: Andi Kleen 
> Signed-off-by: Kan Liang 
> ---
>  arch/x86/events/core.c   |  10 ++
>  arch/x86/events/intel/core.c | 230 ++-
>  arch/x86/events/perf_event.h |  17 +++
>  arch/x86/include/asm/msr-index.h |   2 +
>  4 files changed, 255 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 54534ff00940..1ae23db5c2d7 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
> if (idx == INTEL_PMC_IDX_FIXED_BTS)
> return 0;
>
> +   if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> +   return x86_pmu.update_topdown_event(event);
> /*
>  * Careful: an NMI might modify the previous event value.
>  *
> @@ -1003,6 +1005,10 

Re: [RFC] perf/x86/amd: add support for Large Increment per Cycle Events

2019-08-28 Thread Stephane Eranian
On Wed, Aug 28, 2019 at 5:47 AM Peter Zijlstra  wrote:
>
> On Mon, Aug 26, 2019 at 02:59:15PM -0500, Kim Phillips wrote:
> > The core AMD PMU has a 4-bit wide per-cycle increment for each
> > performance monitor counter.  That works for most counters, but
> > now with AMD Family 17h and above processors, for some, more than 15
> > events can occur in a cycle.  Those events are called "Large
> > Increment per Cycle" events, and one example is the number of
> > SSE/AVX FLOPs retired (event code 0x003).  In order to count these
> > events, two adjacent h/w PMCs get their count signals merged
> > to form 8 bits per cycle total.
>
> *groan*
>
> > In addition, the PERF_CTR count
> > registers are merged to be able to count up to 64 bits.
>
> That is daft; why can't you extend the existing MSR to 64bit?
>
My understanding is that the problem is not the width of the counter
but its ability to increment by more than 15 per cycle. They need two
counters to swallow 16+ events/cycle.
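
To illustrate the pairing, here is a sketch of the write split described in
the quoted text: the even counter takes the low 48 bits, its odd partner
carries the upper 16 bits in its low 16 bits, while reads of the even counter
return the full 64-bit count. wrmsrl() is the usual kernel MSR write helper,
and the MSR addresses are the PERF_CTR0/1 pair from the example:

static void amd_lic_write_count(u64 value)
{
        /* even counter: low 48 bits of the 64-bit count */
        wrmsrl(0xc0010201, value & ((1ULL << 48) - 1));

        /* odd partner: upper 16 bits go into its low 16 bits */
        wrmsrl(0xc0010203, (value >> 48) & 0xffff);
}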


> > Normally, events like instructions retired, get programmed on a single
> > counter like so:
> >
> > PERF_CTL0 (MSR 0xc0010200) 0x0053ff0c # event 0x0c, umask 0xff
> > PERF_CTR0 (MSR 0xc0010201) 0x8001 # r/w 48-bit count
> >
> > The next counter at MSRs 0xc0010202-3 remains unused, or can be used
> > independently to count something else.
> >
> > When counting Large Increment per Cycle events, such as FLOPs,
> > however, we now have to reserve the next counter and program the
> > PERF_CTL (config) register with the Merge event (0xFFF), like so:
> >
> > PERF_CTL0 (msr 0xc0010200) 0x0053ff03 # FLOPs event, umask 0xff
> > PERF_CTR0 (msr 0xc0010201) 0x8001 # read 64-bit count, wr low 
> > 48b
> > PERF_CTL1 (msr 0xc0010202) 0x000f004000ff # Merge event, enable bit
> > PERF_CTR1 (msr 0xc0010203) 0x # write higher 16-bits of 
> > count
> >
> > The count is widened from the normal 48-bits to 64 bits by having the
> > second counter carry the higher 16 bits of the count in its lower 16
> > bits of its counter register.  Support for mixed 48- and 64-bit counting
> > is not supported in this version.
>
> This is diguisting.. please talk to your hardware people. I sort of
> understand the pairing, but that upper 16 bit split for writes is just
> woeful crap.
>
> > For more details, search a Family 17h PPR for the "Large Increment per
> > Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this version:
> >
> > https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip
>
> My mama told me not to open random zip files of the interweb :-)
>
> Also; afaict the only additional information there is that it works in
> odd/even pairs and you have to program the odd one before the even one.
> Surely you could've included that here.
>
> > In order to support reserving the extra counter for a single Large
> > Increment per Cycle event in the perf core, we:
> >
> > 1. Add a f17h get_event_constraints() that returns only an even counter
> > bitmask, since Large Increment events can only be placed on counters 0,
> > 2, and 4 out of the currently available 0-5.
>
> So hereby you promise that all LI events are unconstrained, right?
> Also, what marks the paired counter in the used mask? Aaah, you modify
> __perf_sched_find_counter(). Comments below.
>
> > 2. We add a commit_scheduler hook that adds the Merge event (0xFFF) to
> > any Large Increment event being scheduled.  If the event being scheduled
> > is not a Large Increment event, we check for, and remove any
> > pre-existing Large Increment events on the next counter.
>
> That is weird at best; the scheduling hooks shouldn't be the one doing
> the programming; that should be done in x86_pmu_enable(). Can't you do
> this by changing amd_pmu::{en,dis}able() ?
>
> (also; we really should rename some of those x86_pmu::ops :/)
>
> > 3. In the main x86 scheduler, we reduce the number of available
> > counters by the number of Large Increment per Cycle events being added.
> > This improves the counter scheduler success rate.
> >
> > 4. In perf_assign_events(), if a counter is assigned to a Large
> > Increment event, we increment the current counter variable, so the
> > counter used for the Merge event is skipped.
> >
> > 5. In find_counter(), if a counter has been found for the
> > Large Increment event, we set the next counter as used, to
> > prevent other events from using it.
> >
> > A side-effect of assigning a new get_constraints function for f17h
> > disables calling the old (prior to f15h) amd_get_event_constraints
> > implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
> > support for AMD family-17h processors"), which is no longer
> > necessary since those North Bridge events are obsolete.
>
> > RFC because I'd like input on the approach, including how to add support
> > for mixed-width (48- and 64-bit) counting for a single PMU.
>
> Ideally I'd tell you to wait for sane hardware :/

[BUG] perf report: segfault with --no-group in pipe mode

2019-08-02 Thread Stephane Eranian
Hi,

When trying the following command line with perf from tip.git, I got:

$ perf record --group -c 10 -e '{branch-misses,branches}' -a -o -
sleep 1| perf report --no-group -F sample,cpu,period -i -
# To display the perf.data header info, please use
--header/--header-only options.
#
Segmentation fault (core dumped)

(gdb) r report --no-group -F sample,cpu,period -i - < tt
Starting program: /export/hda3/perftest/perf.tip report --no-group -F
sample,cpu,period -i - < tt
# To display the perf.data header info, please use
--header/--header-only options.
#

Program received signal SIGSEGV, Segmentation fault.
hlist_add_head (h=0xeb9ed8, n=0xebdfd0) at
/usr/local/google/home/eranian/G/bnw.tip/tools/include/linux/list.h:644
644 /usr/local/google/home/eranian/G/bnw.tip/tools/include/linux/list.h:
No such file or directory.
(gdb)

Can you reproduce this?
Thanks.


Re: [PATCH] Fix perf stat repeat segfault

2019-07-15 Thread Stephane Eranian
On Mon, Jul 15, 2019 at 12:59 AM Jiri Olsa  wrote:
>
> On Sun, Jul 14, 2019 at 02:36:42PM -0700, Stephane Eranian wrote:
> > On Sun, Jul 14, 2019 at 1:55 PM Jiri Olsa  wrote:
> > >
> > > On Sun, Jul 14, 2019 at 10:44:36PM +0200, Jiri Olsa wrote:
> > > > On Wed, Jul 10, 2019 at 01:45:40PM -0700, Numfor Mbiziwo-Tiapo wrote:
> > > > > When perf stat is called with event groups and the repeat option,
> > > > > a segfault occurs because the cpu ids are stored on each iteration
> > > > > of the repeat, when they should only be stored on the first iteration,
> > > > > which causes a buffer overflow.
> > > > >
> > > > > This can be replicated by running (from the tip directory):
> > > > >
> > > > > make -C tools/perf
> > > > >
> > > > > then running:
> > > > >
> > > > > tools/perf/perf stat -e '{cycles,instructions}' -r 10 ls
> > > > >
> > > > > Since run_idx keeps track of the current iteration of the repeat,
> > > > > only storing the cpu ids on the first iteration (when run_idx < 1)
> > > > > fixes this issue.
> > > > >
> > > > > Signed-off-by: Numfor Mbiziwo-Tiapo 
> > > > > ---
> > > > >  tools/perf/builtin-stat.c | 7 ---
> > > > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > > > > index 63a3afc7f32b..92d6694367e4 100644
> > > > > --- a/tools/perf/builtin-stat.c
> > > > > +++ b/tools/perf/builtin-stat.c
> > > > > @@ -378,9 +378,10 @@ static void workload_exec_failed_signal(int 
> > > > > signo __maybe_unused, siginfo_t *inf
> > > > > workload_exec_errno = info->si_value.sival_int;
> > > > >  }
> > > > >
> > > > > -static bool perf_evsel__should_store_id(struct perf_evsel *counter)
> > > > > +static bool perf_evsel__should_store_id(struct perf_evsel *counter, 
> > > > > int run_idx)
> > > > >  {
> > > > > -   return STAT_RECORD || counter->attr.read_format & PERF_FORMAT_ID;
> > > > > +   return STAT_RECORD || counter->attr.read_format & PERF_FORMAT_ID
> > > > > +   && run_idx < 1;
> > > >
> > > > we create counters for every iteration, so this can't be
> > > > based on iteration
> > > >
> > > > I think that's just a workaround for memory corruption,
> > > > that's happening for repeating groupped events stats,
> > > > I'll check on this
> > >
> > > how about something like this? we did not cleanup
> > > ids on evlist close, so it kept on raising and
> > > causing corruption in next iterations
> > >
> > not sure, that would realloc on each iteration of the repeats.
>
> well, we need new ids, because we create new events every iteration
>
If you recreate them, then agreed.
It is not clear to me why you need ids when not running in STAT_RECORD mode.

> jirka
>
> >
> > >
> > > jirka
> > >
> > >
> > > ---
> > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > > index ebb46da4dfe5..52459dd5ad0c 100644
> > > --- a/tools/perf/util/evsel.c
> > > +++ b/tools/perf/util/evsel.c
> > > @@ -1291,6 +1291,7 @@ static void perf_evsel__free_id(struct perf_evsel 
> > > *evsel)
> > > xyarray__delete(evsel->sample_id);
> > > evsel->sample_id = NULL;
> > > zfree(>id);
> > > +   evsel->ids = 0;
> > >  }
> > >
> > >  static void perf_evsel__free_config_terms(struct perf_evsel *evsel)
> > > @@ -2077,6 +2078,7 @@ void perf_evsel__close(struct perf_evsel *evsel)
> > >
> > > perf_evsel__close_fd(evsel);
> > > perf_evsel__free_fd(evsel);
> > > +   perf_evsel__free_id(evsel);
> > >  }
> > >
> > >  int perf_evsel__open_per_cpu(struct perf_evsel *evsel,


Re: [PATCH] Fix perf stat repeat segfault

2019-07-14 Thread Stephane Eranian
On Sun, Jul 14, 2019 at 1:55 PM Jiri Olsa  wrote:
>
> On Sun, Jul 14, 2019 at 10:44:36PM +0200, Jiri Olsa wrote:
> > On Wed, Jul 10, 2019 at 01:45:40PM -0700, Numfor Mbiziwo-Tiapo wrote:
> > > When perf stat is called with event groups and the repeat option,
> > > a segfault occurs because the cpu ids are stored on each iteration
> > > of the repeat, when they should only be stored on the first iteration,
> > > which causes a buffer overflow.
> > >
> > > This can be replicated by running (from the tip directory):
> > >
> > > make -C tools/perf
> > >
> > > then running:
> > >
> > > tools/perf/perf stat -e '{cycles,instructions}' -r 10 ls
> > >
> > > Since run_idx keeps track of the current iteration of the repeat,
> > > only storing the cpu ids on the first iteration (when run_idx < 1)
> > > fixes this issue.
> > >
> > > Signed-off-by: Numfor Mbiziwo-Tiapo 
> > > ---
> > >  tools/perf/builtin-stat.c | 7 ---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > > index 63a3afc7f32b..92d6694367e4 100644
> > > --- a/tools/perf/builtin-stat.c
> > > +++ b/tools/perf/builtin-stat.c
> > > @@ -378,9 +378,10 @@ static void workload_exec_failed_signal(int signo 
> > > __maybe_unused, siginfo_t *inf
> > > workload_exec_errno = info->si_value.sival_int;
> > >  }
> > >
> > > -static bool perf_evsel__should_store_id(struct perf_evsel *counter)
> > > +static bool perf_evsel__should_store_id(struct perf_evsel *counter, int 
> > > run_idx)
> > >  {
> > > -   return STAT_RECORD || counter->attr.read_format & PERF_FORMAT_ID;
> > > +   return STAT_RECORD || counter->attr.read_format & PERF_FORMAT_ID
> > > +   && run_idx < 1;
> >
> > we create counters for every iteration, so this can't be
> > based on iteration
> >
> > I think that's just a workaround for memory corruption,
> > that's happening for repeating groupped events stats,
> > I'll check on this
>
> how about something like this? we did not cleanup
> ids on evlist close, so it kept on raising and
> causing corruption in next iterations
>
not sure, that would realloc on each iteration of the repeats.

>
> jirka
>
>
> ---
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index ebb46da4dfe5..52459dd5ad0c 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1291,6 +1291,7 @@ static void perf_evsel__free_id(struct perf_evsel 
> *evsel)
> xyarray__delete(evsel->sample_id);
> evsel->sample_id = NULL;
> zfree(>id);
> +   evsel->ids = 0;
>  }
>
>  static void perf_evsel__free_config_terms(struct perf_evsel *evsel)
> @@ -2077,6 +2078,7 @@ void perf_evsel__close(struct perf_evsel *evsel)
>
> perf_evsel__close_fd(evsel);
> perf_evsel__free_fd(evsel);
> +   perf_evsel__free_id(evsel);
>  }
>
>  int perf_evsel__open_per_cpu(struct perf_evsel *evsel,


Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog

2019-06-17 Thread Stephane Eranian
Hi,

On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner  wrote:
>
> On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > When the hardlockup detector is enabled, the function
> > > hld_hpet_intremapactivate_irq() activates the recently created entry
> > > in the interrupt remapping table via the modify_irte() functions. While
> > > doing this, it specifies which CPU the interrupt must target via its APIC
> > > ID. This function can be called every time the destination iD of the
> > > interrupt needs to be updated; there is no need to allocate or remove
> > > entries in the interrupt remapping table.
> >
> > Brilliant.
> >
> > > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > > +{
> > > +   u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > > +   struct intel_ir_data *data;
> > > +
> > > +   data = (struct intel_ir_data *)hdata->intremap_data;
> > > +   data->irte_entry.dest_id = IRTE_DEST(destid);
> > > +   return modify_irte(>irq_2_iommu, >irte_entry);
> >
> > This calls modify_irte() which does at the very beginning:
> >
> >raw_spin_lock_irqsave(_2_ir_lock, flags);
> >
> > How is that supposed to work from NMI context? Not to talk about the
> > other spinlocks which are taken in the subsequent call chain.
> >
> > You cannot call in any of that code from NMI context.
> >
> > The only reason why this never deadlocked in your testing is that nothing
> > else touched that particular iommu where the HPET hangs off concurrently.
> >
> > But that's just pure luck and not design.
>
> And just for the record. I warned you about that problem during the review
> of an earlier version and told you to talk to IOMMU folks whether there is
> a way to update the entry w/o running into that lock problem.
>
> Can you tell my why am I actually reviewing patches and spending time on
> this when the result is ignored anyway?
>
> I also tried to figure out why you went away from the IPI broadcast
> design. The only information I found is:
>
> Changes vs. v1:
>
>  * Brought back the round-robin mechanism proposed in v1 (this time not
>using the interrupt subsystem). This also requires to compute
>expiration times as in v1 (Andi Kleen, Stephane Eranian).
>
> Great that there is no trace of any mail from Andi or Stephane about this
> on LKML. There is no problem with talking offlist about this stuff, but
> then you should at least provide a rationale for those who were not part of
> the private conversation.
>
Let me add some context to this whole patch series. The pressure on the core
PMU counters is increasing as more people want to use them to measure ever
more events. When the PMU is overcommitted, i.e., there are more events than
counters for them, there is multiplexing. It comes with an overhead that is
too high for certain applications. One way to avoid this is to lower the
multiplexing frequency, which is by default 1ms, but that comes with a loss
of accuracy. Another approach is to measure only a small number of events at
a time and use multiple runs, but then you lose a consistent event view.
Another approach is to push for increasing the number of counters. But
getting new hardware counters takes time. Short term, we can investigate
what it would take to free the one cycle-capable counter which is
commandeered by the hard lockup detector on all X86 processors today. The
functionality of the watchdog, being able to get a crash dump on kernel
deadlocks, is important and we cannot simply disable it. At scale, many bugs
are exposed and thus machines deadlock. Therefore, we want to investigate
what it would take to move the detector to another NMI-capable source, such
as the HPET, because the detector does not need a high-granularity timer and
interrupts only every 2s.

Furthermore, recent Intel errata, e.g., the TSX issue forcing the TFA code
in perf_events, have increased the pressure even more, with only 3 generic
counters left. Thus, it is time to look at alternative ways of getting a
hard lockup detector (NMI watchdog) from another NMI source than the PMU. To
that extent, I have been discussing alternatives. Intel suggested using the
HPET and Ricardo has been working on producing this patch series. It is
clear from your review that the patches have issues, but I am hoping that
they can be resolved with constructive feedback, knowing what the end goal
is.

As for the round-robin changes, yes, we discussed this as an alternative to
avoid overloading CPU0 with handling all of the work of broadcasting IPIs to
100+ other CPUs.

Thanks.


Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

2019-06-14 Thread Stephane Eranian
On Thu, Jun 13, 2019 at 9:13 AM Liang, Kan  wrote:
>
>
>
> On 6/1/2019 4:27 AM, Ian Rogers wrote:
> > Currently perf_rotate_context assumes that if the context's nr_events !=
> > nr_active a rotation is necessary for perf event multiplexing. With
> > cgroups, nr_events is the total count of events for all cgroups and
> > nr_active will not include events in a cgroup other than the current
> > task's. This makes rotation appear necessary for cgroups when it is not.
> >
> > Add a perf_event_context flag that is set when rotation is necessary.
> > Clear the flag during sched_out and set it when a flexible sched_in
> > fails due to resources.
> >
> > Signed-off-by: Ian Rogers 
> > ---
> >   include/linux/perf_event.h |  5 +
> >   kernel/events/core.c   | 42 +++---
> >   2 files changed, 30 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 15a82ff0aefe..7ab6c251aa3d 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -747,6 +747,11 @@ struct perf_event_context {
> >   int nr_stat;
> >   int nr_freq;
> >   int rotate_disable;
> > + /*
> > +  * Set when nr_events != nr_active, except tolerant to events not
> > +  * needing to be active due to scheduling constraints, such as 
> > cgroups.
> > +  */
> > + int rotate_necessary;
>
> It looks like the rotate_necessary is only useful for cgroup and cpuctx.
> Why not move it to struct perf_cpu_context and under #ifdef
> CONFIG_CGROUP_PERF?
> And rename it cgrp_rotate_necessary?
>
I am not sure I see the point here. What I'd like to see is a uniform
signal that rotation is needed in per-task, per-cpu, or per-cgroup mode.
Ian's patch does that. It makes cgroup mode a lot more efficient by
avoiding unnecessary rotations, and it does not alter or improve on the
other two modes.

> Thanks,
> Kan
>
> >   refcount_t  refcount;
> >   struct task_struct  *task;
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index abbd4b3b96c2..41ae424b9f1d 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2952,6 +2952,12 @@ static void ctx_sched_out(struct perf_event_context 
> > *ctx,
> >   if (!ctx->nr_active || !(is_active & EVENT_ALL))
> >   return;
> >
> > + /*
> > +  * If we had been multiplexing, no rotations are necessary now no 
> > events
> > +  * are active.
> > +  */
> > + ctx->rotate_necessary = 0;
> > +
> >   perf_pmu_disable(ctx->pmu);
> >   if (is_active & EVENT_PINNED) {
> >   list_for_each_entry_safe(event, tmp, >pinned_active, 
> > active_list)
> > @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event 
> > *event, void *data)
> >   sid->can_add_hw = 0;
> >   }
> >
> > + /*
> > +  * If the group wasn't scheduled then set that multiplexing is 
> > necessary
> > +  * for the context. Note, this won't be set if the event wasn't
> > +  * scheduled due to event_filter_match failing due to the earlier
> > +  * return.
> > +  */
> > + if (event->state == PERF_EVENT_STATE_INACTIVE)
> > + sid->ctx->rotate_necessary = 1;
> > +
> >   return 0;
> >   }
> >
> > @@ -3690,24 +3705,17 @@ ctx_first_active(struct perf_event_context *ctx)
> >   static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> >   {
> >   struct perf_event *cpu_event = NULL, *task_event = NULL;
> > - bool cpu_rotate = false, task_rotate = false;
> > - struct perf_event_context *ctx = NULL;
> > + struct perf_event_context *task_ctx = NULL;
> > + int cpu_rotate, task_rotate;
> >
> >   /*
> >* Since we run this from IRQ context, nobody can install new
> >* events, thus the event count values are stable.
> >*/
> >
> > - if (cpuctx->ctx.nr_events) {
> > - if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
> > - cpu_rotate = true;
> > - }
> > -
> > - ctx = cpuctx->task_ctx;
> > - if (ctx && ctx->nr_events) {
> > - if (ctx->nr_events != ctx->nr_active)
> > - task_rotate = true;
> > - }
> > + cpu_rotate = cpuctx->ctx.rotate_necessary;
> > + task_ctx = cpuctx->task_ctx;
> > + task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
> >
> >   if (!(cpu_rotate || task_rotate))
> >   return false;
> > @@ -3716,7 +3724,7 @@ static bool perf_rotate_context(struct 
> > perf_cpu_context *cpuctx)
> >   perf_pmu_disable(cpuctx->ctx.pmu);
> >
> >   if (task_rotate)
> > - task_event = ctx_first_active(ctx);
> > + task_event = ctx_first_active(task_ctx);
> >   if (cpu_rotate)
> >   cpu_event = 

[tip:perf/urgent] perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints

2019-05-21 Thread tip-bot for Stephane Eranian
Commit-ID:  23e3983a466cd540ffdd2bbc6e0c51e31934f941
Gitweb: https://git.kernel.org/tip/23e3983a466cd540ffdd2bbc6e0c51e31934f941
Author: Stephane Eranian 
AuthorDate: Mon, 20 May 2019 17:52:46 -0700
Committer:  Ingo Molnar 
CommitDate: Tue, 21 May 2019 10:25:29 +0200

perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints

This patch fixes a bug revealed by the following commit:

  6b89d4c1ae85 ("perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking")

That patch modified INTEL_FLAGS_EVENT_CONSTRAINT() to only look at the event
code when matching a constraint. If code+umask were needed, then the
INTEL_FLAGS_UEVENT_CONSTRAINT() macro was needed instead.
This broke some of the constraints for PEBS events.

Several of them, including the ones used for cycles:p, cycles:pp and cycles:ppp,
fell into that category and caused the event to be rejected in PEBS mode.
In other words, on some platforms a cmdline such as:

  $ perf top -e cycles:pp

would fail with -EINVAL.

This patch fixes this bug by properly using INTEL_FLAGS_UEVENT_CONSTRAINT()
when needed in the PEBS constraint tables.

Reported-by: Ingo Molnar 
Signed-off-by: Stephane Eranian 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: kan.li...@intel.com
Link: http://lkml.kernel.org/r/20190521005246.423-1-eran...@google.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/intel/ds.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 7a9f5dac5abe..7acc526b4ad2 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -684,7 +684,7 @@ struct event_constraint 
intel_core2_pebs_event_constraints[] = {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x1fc7, 0x1), /* SIMD_INST_RETURED.ANY */
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0x1),/* MEM_LOAD_RETIRED.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x01),
EVENT_CONSTRAINT_END
 };
 
@@ -693,7 +693,7 @@ struct event_constraint intel_atom_pebs_event_constraints[] 
= {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x00c5, 0x1), /* 
MISPREDICTED_BRANCH_RETIRED */
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0x1),/* MEM_LOAD_RETIRED.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x01),
/* Allow all events as PEBS with no flags */
INTEL_ALL_EVENT_CONSTRAINT(0, 0x1),
EVENT_CONSTRAINT_END
@@ -701,7 +701,7 @@ struct event_constraint intel_atom_pebs_event_constraints[] 
= {
 
 struct event_constraint intel_slm_pebs_event_constraints[] = {
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x1),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x1),
/* Allow all events as PEBS with no flags */
INTEL_ALL_EVENT_CONSTRAINT(0, 0x1),
EVENT_CONSTRAINT_END
@@ -726,7 +726,7 @@ struct event_constraint 
intel_nehalem_pebs_event_constraints[] = {
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0xf),/* MEM_LOAD_RETIRED.* */
INTEL_FLAGS_EVENT_CONSTRAINT(0xf7, 0xf),/* FP_ASSIST.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x0f),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x0f),
EVENT_CONSTRAINT_END
 };
 
@@ -743,7 +743,7 @@ struct event_constraint 
intel_westmere_pebs_event_constraints[] = {
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0xf),/* MEM_LOAD_RETIRED.* */
INTEL_FLAGS_EVENT_CONSTRAINT(0xf7, 0xf),/* FP_ASSIST.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x0f),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x0f),
EVENT_CONSTRAINT_END
 };
 
@@ -752,7 +752,7 @@ struct event_constraint intel_snb_pebs_event_constraints[] 
= {
INTEL_PLD_CONSTRAINT(0x01cd, 0x8),/* 
MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8),/* 
MEM_TRANS_RETIRED.PRECISE_STORES */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c2, 0xf),
 INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),/* MEM_UOP_RETIRED.* */
 INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),/* MEM_LOAD_UOPS_RETIRED.* */
 INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),/* 
MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -767,9 +767,9 @@ struct event_constraint intel_ivb_pebs_event_constraints[] 
= {
 INTEL_PLD_CONSTRAINT(0x01cd, 0x8),/* 
MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8),/* 
MEM_TRANS_RETIRED.PREC

[PATCH v2] perf/x86/intel/ds: fix EVENT vs. UEVENT PEBS constraints

2019-05-20 Thread Stephane Eranian
This patch fixes an issue revealed by the following commit:
Commit 6b89d4c1ae85 ("perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* 
masking")

That patch modified INTEL_FLAGS_EVENT_CONSTRAINT() to only look at the event
code when matching a constraint. If code+umask were needed, then the
INTEL_FLAGS_UEVENT_CONSTRAINT() macro was needed instead.
This broke some of the constraints for PEBS events.
Several of them, including the ones used for cycles:p, cycles:pp and cycles:ppp,
fell into that category and caused the event to be rejected in PEBS mode.
In other words, on some platforms a cmdline such as:

  $ perf top -e cycles:pp

  would fail with EINVAL.

This patch fixes this issue by properly using INTEL_FLAGS_UEVENT_CONSTRAINT()
when needed in the PEBS constraint tables.

In v2:
  - add fixes for Core2, Nehalem, Silvermont, and Atom

Reported-by: Ingo Molnar 
Signed-off-by: Stephane Eranian 
---
 arch/x86/events/intel/ds.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index ea2cb6b7e456..5e9bb246b3a6 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -684,7 +684,7 @@ struct event_constraint 
intel_core2_pebs_event_constraints[] = {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x1fc7, 0x1), /* SIMD_INST_RETURED.ANY */
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0x1),/* MEM_LOAD_RETIRED.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x01),
EVENT_CONSTRAINT_END
 };
 
@@ -693,7 +693,7 @@ struct event_constraint intel_atom_pebs_event_constraints[] 
= {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x00c5, 0x1), /* 
MISPREDICTED_BRANCH_RETIRED */
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0x1),/* MEM_LOAD_RETIRED.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x01),
/* Allow all events as PEBS with no flags */
INTEL_ALL_EVENT_CONSTRAINT(0, 0x1),
EVENT_CONSTRAINT_END
@@ -701,7 +701,7 @@ struct event_constraint intel_atom_pebs_event_constraints[] 
= {
 
 struct event_constraint intel_slm_pebs_event_constraints[] = {
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x1),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x1),
/* Allow all events as PEBS with no flags */
INTEL_ALL_EVENT_CONSTRAINT(0, 0x1),
EVENT_CONSTRAINT_END
@@ -726,7 +726,7 @@ struct event_constraint 
intel_nehalem_pebs_event_constraints[] = {
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0xf),/* MEM_LOAD_RETIRED.* */
INTEL_FLAGS_EVENT_CONSTRAINT(0xf7, 0xf),/* FP_ASSIST.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x0f),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x0f),
EVENT_CONSTRAINT_END
 };
 
@@ -743,7 +743,7 @@ struct event_constraint 
intel_westmere_pebs_event_constraints[] = {
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0xf),/* MEM_LOAD_RETIRED.* */
INTEL_FLAGS_EVENT_CONSTRAINT(0xf7, 0xf),/* FP_ASSIST.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x0f),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x0f),
EVENT_CONSTRAINT_END
 };
 
@@ -752,7 +752,7 @@ struct event_constraint intel_snb_pebs_event_constraints[] 
= {
INTEL_PLD_CONSTRAINT(0x01cd, 0x8),/* 
MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8),/* 
MEM_TRANS_RETIRED.PRECISE_STORES */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c2, 0xf),
 INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),/* MEM_UOP_RETIRED.* */
 INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),/* MEM_LOAD_UOPS_RETIRED.* */
 INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),/* 
MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -767,9 +767,9 @@ struct event_constraint intel_ivb_pebs_event_constraints[] 
= {
 INTEL_PLD_CONSTRAINT(0x01cd, 0x8),/* 
MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8),/* 
MEM_TRANS_RETIRED.PRECISE_STORES */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c2, 0xf),
/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c0, 0x2),
INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),/* MEM_UOP_RETIRED.* */
INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),/* MEM_LOAD_UOPS_RETIRED.* */
INTEL_EXCLEVT_

[PATCH] perf/x86/intel/ds: fix EVENT vs. UEVENT PEBS constraints

2019-05-20 Thread Stephane Eranian
This patch fixes an issue revealed by the following commit:
Commit 6b89d4c1ae85 ("perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* 
masking")

That patch modified INTEL_FLAGS_EVENT_CONSTRAINT() to only look at the event
code when matching a constraint. If code+umask were needed, then the
INTEL_FLAGS_UEVENT_CONSTRAINT() macro was needed instead.
This broke some of the constraints for PEBS events.
Several of them, including the ones used for cycles:p, cycles:pp and cycles:ppp,
fell into that category and caused the event to be rejected in PEBS mode.
In other words, on some platforms a cmdline such as:

  $ perf top -e cycles:pp

  would fail with EINVAL.

This patch fixes this issue by properly using INTEL_FLAGS_UEVENT_CONSTRAINT()
when needed in the PEBS constraint tables.

Reported-by: Ingo Molnar 
Signed-off-by: Stephane Eranian 
---
 arch/x86/events/intel/ds.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index ea2cb6b7e456..88e73652a10c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -743,7 +743,7 @@ struct event_constraint 
intel_westmere_pebs_event_constraints[] = {
INTEL_FLAGS_EVENT_CONSTRAINT(0xcb, 0xf),/* MEM_LOAD_RETIRED.* */
INTEL_FLAGS_EVENT_CONSTRAINT(0xf7, 0xf),/* FP_ASSIST.* */
/* INST_RETIRED.ANY_P, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x0f),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108000c0, 0x0f),
EVENT_CONSTRAINT_END
 };
 
@@ -752,7 +752,7 @@ struct event_constraint intel_snb_pebs_event_constraints[] 
= {
INTEL_PLD_CONSTRAINT(0x01cd, 0x8),/* 
MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8),/* 
MEM_TRANS_RETIRED.PRECISE_STORES */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c2, 0xf),
 INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),/* MEM_UOP_RETIRED.* */
 INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),/* MEM_LOAD_UOPS_RETIRED.* */
 INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),/* 
MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -767,9 +767,9 @@ struct event_constraint intel_ivb_pebs_event_constraints[] 
= {
 INTEL_PLD_CONSTRAINT(0x01cd, 0x8),/* 
MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8),/* 
MEM_TRANS_RETIRED.PRECISE_STORES */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c2, 0xf),
/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c0, 0x2),
INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),/* MEM_UOP_RETIRED.* */
INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),/* MEM_LOAD_UOPS_RETIRED.* */
INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),/* 
MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -783,9 +783,9 @@ struct event_constraint intel_hsw_pebs_event_constraints[] 
= {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PRECDIST */
INTEL_PLD_CONSTRAINT(0x01cd, 0xf),/* MEM_TRANS_RETIRED.* */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c2, 0xf),
/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c0, 0x2),
INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* 
UOPS_RETIRED.ALL */
INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x11d0, 0xf), /* 
MEM_UOPS_RETIRED.STLB_MISS_LOADS */
INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x21d0, 0xf), /* 
MEM_UOPS_RETIRED.LOCK_LOADS */
@@ -806,9 +806,9 @@ struct event_constraint intel_bdw_pebs_event_constraints[] 
= {
INTEL_FLAGS_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PRECDIST */
INTEL_PLD_CONSTRAINT(0x01cd, 0xf),/* MEM_TRANS_RETIRED.* */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c2, 0xf),
/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
-   INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
+   INTEL_FLAGS_UEVENT_CONSTRAINT(0x108001c0, 0x2),
INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* 
UOPS_RETIRED.ALL */
INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* 
MEM_UOPS_RETIRED.STLB_MISS_LOADS */
INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x21d0, 0xf), /* 
MEM_UOPS_RETIRED.LOCK_LOADS */
@@ -829,9 +829,9 @@ struct event_constraint intel_bdw_pebs_event_constraints[] 
= {
 struct event_constraint intel_skl_pebs_event_c

Re: [tip:perf/urgent] perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking

2019-05-18 Thread Stephane Eranian
On Sat, May 18, 2019 at 2:16 PM Ingo Molnar  wrote:
>
>
> * tip-bot for Stephane Eranian  wrote:
>
> > Commit-ID:  6b89d4c1ae8596a8c9240f169ef108704de373f2
> > Gitweb: 
> > https://git.kernel.org/tip/6b89d4c1ae8596a8c9240f169ef108704de373f2
> > Author: Stephane Eranian 
> > AuthorDate: Thu, 9 May 2019 14:45:56 -0700
> > Committer:  Ingo Molnar 
> > CommitDate: Fri, 10 May 2019 08:04:17 +0200
> >
> > perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking
> >
> > On Intel Westmere, a cmdline as follows:
> >
> >   $ perf record -e 
> > cpu/event=0xc4,umask=0x2,name=br_inst_retired.near_call/p 
> >
> > was failing. Yet the event+ umask support PEBS.
> >
> > It turns out this is due to a bug in the the PEBS event constraint table for
> > westmere. All forms of BR_INST_RETIRED.* support PEBS. Therefore the 
> > constraint
> > mask should ignore the umask. The name of the macro 
> > INTEL_FLAGS_EVENT_CONSTRAINT()
> > hint that this is the case but it was not. That macros was checking both the
> > event code and event umask. Therefore, it was only matching on 0x00c4.
> > There are code+umask macros, they all have *UEVENT*.
> >
> > This bug fixes the issue by checking only the event code in the mask.
> > Both single and range version are modified.
> >
> > Signed-off-by: Stephane Eranian 
> > Cc: Alexander Shishkin 
> > Cc: Arnaldo Carvalho de Melo 
> > Cc: Jiri Olsa 
> > Cc: Linus Torvalds 
> > Cc: Peter Zijlstra 
> > Cc: Thomas Gleixner 
> > Cc: Vince Weaver 
> > Cc: kan.li...@intel.com
> > Link: http://lkml.kernel.org/r/20190509214556.123493-1-eran...@google.com
> > Signed-off-by: Ingo Molnar 
> > ---
> >  arch/x86/events/perf_event.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > index 07fc84bb85c1..a6ac2f4f76fc 100644
> > --- a/arch/x86/events/perf_event.h
> > +++ b/arch/x86/events/perf_event.h
> > @@ -394,10 +394,10 @@ struct cpu_hw_events {
> >
> >  /* Event constraint, but match on all event flags too. */
> >  #define INTEL_FLAGS_EVENT_CONSTRAINT(c, n) \
> > - EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
> > + EVENT_CONSTRAINT(c, n, 
> > ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS)
> >
> >  #define INTEL_FLAGS_EVENT_CONSTRAINT_RANGE(c, e, n)  \
> > - EVENT_CONSTRAINT_RANGE(c, e, n, 
> > INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
> > + EVENT_CONSTRAINT_RANGE(c, e, n, 
> > ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS)
>
> This commit broke one of my testboxes - and unfortunately I noticed this
> too late and the commit is now upstream.
>
> The breakage is that 'perf top' stops working altogether, it errors out
> in the event creation:
>
>  $ perf top --stdio
>  Error:
>  The sys_perf_event_open() syscall returned with 22 (Invalid argument) for 
> event (cycles).
>
> I bisected it back to this commit:
>
>  6b89d4c1ae8596a8c9240f169ef108704de373f2 is the first bad commit
>  commit 6b89d4c1ae8596a8c9240f169ef108704de373f2
>  Author: Stephane Eranian 
>  Date:   Thu May 9 14:45:56 2019 -0700
>
> perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking
>
> The system is IvyBridge model 62, running a defconfig-ish kernel, and
> with perf_event_paranoid set to -1:
>
>  [3.756600] Performance Events: PEBS fmt1+, IvyBridge events, 16-deep 
> LBR, full-width counters, Intel PMU driver.
>
>  processor  : 39
>  vendor_id  : GenuineIntel
>  cpu family : 6
>  model  : 62
>  model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>  stepping   : 4
>  microcode  : 0x428
>
> If I revert the commit 'perf top' starts working again.
>
I have some ivybridge systems, let me debug this. This is likely
related to cycles:ppp stuff given what perf top does.
I think my patch is right, but there may be assumptions or bugs
elsewhere exposed by the fix.

>
> Thanks,
>
> Ingo


[tip:perf/urgent] perf/x86/intel: Allow PEBS multi-entry in watermark mode

2019-05-14 Thread tip-bot for Stephane Eranian
Commit-ID:  c7a286577d7592720c2f179aadfb325a1ff48c95
Gitweb: https://git.kernel.org/tip/c7a286577d7592720c2f179aadfb325a1ff48c95
Author: Stephane Eranian 
AuthorDate: Mon, 13 May 2019 17:34:00 -0700
Committer:  Ingo Molnar 
CommitDate: Tue, 14 May 2019 09:07:58 +0200

perf/x86/intel: Allow PEBS multi-entry in watermark mode

This patch fixes a restriction/bug introduced by:

   583feb08e7f7 ("perf/x86/intel: Fix handling of wakeup_events for multi-entry 
PEBS")

The original patch prevented using multi-entry PEBS when wakeup_events != 0.
However, given that wakeup_events is part of a union with wakeup_watermark, this
means that in watermark mode PEBS multi-entry is also disabled, which is not the
intent. This patch fixes this by checking whether watermark mode is enabled.
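
For context (editor's sketch, not part of the commit): wakeup_watermark shares
its storage with wakeup_events in the perf_event_attr ABI, so only the
attr.watermark bit tells the kernel which meaning applies. A minimal user-space
sketch of a caller selecting watermark mode for a PEBS event (the helper name
is hypothetical):

#include <linux/perf_event.h>
#include <string.h>

/* Sketch only: fill a perf_event_attr for a PEBS event in watermark mode. */
static void setup_watermark_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100003;
	attr->sample_type = PERF_SAMPLE_IP;
	attr->precise_ip = 2;			/* request PEBS */
	attr->watermark = 1;			/* the union below holds bytes, not events */
	attr->wakeup_watermark = 64 * 1024;	/* same storage as wakeup_events */
}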

Signed-off-by: Stephane Eranian 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: jo...@redhat.com
Cc: kan.li...@intel.com
Cc: vincent.wea...@maine.edu
Fixes: 583feb08e7f7 ("perf/x86/intel: Fix handling of wakeup_events for 
multi-entry PEBS")
Link: http://lkml.kernel.org/r/20190514003400.224340-1-eran...@google.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/intel/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ef763f535e3a..12ec402f4114 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3265,7 +3265,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;
 
if (event->attr.precise_ip) {
-   if (!(event->attr.freq || event->attr.wakeup_events)) {
+   if (!(event->attr.freq || (event->attr.wakeup_events && 
!event->attr.watermark))) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
if (!(event->attr.sample_type &
  ~intel_pmu_large_pebs_flags(event)))


[PATCH] perf/x86/intel: allow PEBS multi-entry in watermark mode

2019-05-13 Thread Stephane Eranian
This patch fixes an issue introduced with:

   583feb08e7f7 ("perf/x86/intel: Fix handling of wakeup_events for multi-entry 
PEBS")

The original patch prevented using multi-entry PEBS when wakeup_events != 0.
However, given that wakeup_events is part of a union with wakeup_watermark, this
means that in watermark mode PEBS multi-entry is also disabled, which is not the
intent. This patch fixes this by checking whether watermark mode is enabled.

Signed-off-by: Stephane Eranian 
Change-Id: I8362bbcf9035c860b64b4c2e8faf310ebd74c234
---
 arch/x86/events/intel/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 416233f92b3c..613fabba2c99 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3280,7 +3280,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;
 
if (event->attr.precise_ip) {
-   if (!(event->attr.freq || event->attr.wakeup_events)) {
+   if (!(event->attr.freq || (event->attr.wakeup_events && 
!event->attr.watermark))) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
if (!(event->attr.sample_type &
  ~intel_pmu_large_pebs_flags(event)))


[tip:perf/urgent] perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking

2019-05-10 Thread tip-bot for Stephane Eranian
Commit-ID:  6b89d4c1ae8596a8c9240f169ef108704de373f2
Gitweb: https://git.kernel.org/tip/6b89d4c1ae8596a8c9240f169ef108704de373f2
Author: Stephane Eranian 
AuthorDate: Thu, 9 May 2019 14:45:56 -0700
Committer:  Ingo Molnar 
CommitDate: Fri, 10 May 2019 08:04:17 +0200

perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking

On Intel Westmere, a cmdline as follows:

  $ perf record -e cpu/event=0xc4,umask=0x2,name=br_inst_retired.near_call/p 


was failing. Yet the event+umask combination supports PEBS.

It turns out this is due to a bug in the PEBS event constraint table for
Westmere. All forms of BR_INST_RETIRED.* support PEBS. Therefore the constraint
mask should ignore the umask. The name of the macro
INTEL_FLAGS_EVENT_CONSTRAINT() hints that this is the case, but it was not.
That macro was checking both the event code and the event umask; therefore, it
was only matching on 0x00c4. The code+umask macros all have *UEVENT* in their
names.

This patch fixes the issue by checking only the event code in the mask.
Both the single and range versions are modified.

Signed-off-by: Stephane Eranian 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: kan.li...@intel.com
Link: http://lkml.kernel.org/r/20190509214556.123493-1-eran...@google.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 07fc84bb85c1..a6ac2f4f76fc 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -394,10 +394,10 @@ struct cpu_hw_events {
 
 /* Event constraint, but match on all event flags too. */
 #define INTEL_FLAGS_EVENT_CONSTRAINT(c, n) \
-   EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+   EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS)
 
 #define INTEL_FLAGS_EVENT_CONSTRAINT_RANGE(c, e, n)\
-   EVENT_CONSTRAINT_RANGE(c, e, n, 
INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+   EVENT_CONSTRAINT_RANGE(c, e, n, 
ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS)
 
 /* Check only flags, but allow all event/umask */
 #define INTEL_ALL_EVENT_CONSTRAINT(code, n)\


[PATCH] perf/x86: fix INTEL_FLAGS_EVENT_CONSTRAINT* masking

2019-05-09 Thread Stephane Eranian
On Intel Westmere, a cmdline as follows:
$ perf record -e cpu/event=0xc4,umask=0x2,name=br_inst_retired.near_call/p 

was failing. Yet the event+umask combination supports PEBS.

It turns out this is due to a bug in the PEBS event constraint table for
Westmere. All forms of BR_INST_RETIRED.* support PEBS. Therefore the constraint
mask should ignore the umask. The name of the macro
INTEL_FLAGS_EVENT_CONSTRAINT() hints that this is the case, but it was not.
That macro was checking both the event code and the event umask; therefore, it
was only matching on 0x00c4. The code+umask macros all have *UEVENT* in their
names.

This patch fixes the issue by checking only the event code in the mask.
Both the single and range versions are modified.
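
A small stand-alone illustration of the before/after matching (editor's sketch,
not part of the patch; the mask values follow arch/x86/include/asm/perf_event.h,
and the X86_ALL_EVENT_FLAGS bits are omitted to keep the arithmetic visible):

#include <stdio.h>

#define ARCH_PERFMON_EVENTSEL_EVENT 0x000000FFULL /* event-select byte */
#define ARCH_PERFMON_EVENTSEL_UMASK 0x0000FF00ULL /* unit-mask byte    */
#define INTEL_ARCH_EVENT_MASK \
	(ARCH_PERFMON_EVENTSEL_UMASK | ARCH_PERFMON_EVENTSEL_EVENT)

int main(void)
{
	unsigned long long config = 0x02c4; /* BR_INST_RETIRED.NEAR_CALL (event=0xc4, umask=0x2) */
	unsigned long long code   = 0x00c4; /* constraint declared for any BR_INST_RETIRED.*     */

	/* before the fix: the umask byte is compared too -> no PEBS constraint found */
	printf("old mask: %s\n",
	       (config & INTEL_ARCH_EVENT_MASK) == code ? "match" : "no match");
	/* after the fix: only the event code is compared -> constraint matches */
	printf("new mask: %s\n",
	       (config & ARCH_PERFMON_EVENTSEL_EVENT) == code ? "match" : "no match");
	return 0;
}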

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 07fc84bb85c1..a6ac2f4f76fc 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -394,10 +394,10 @@ struct cpu_hw_events {
 
 /* Event constraint, but match on all event flags too. */
 #define INTEL_FLAGS_EVENT_CONSTRAINT(c, n) \
-   EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+   EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS)
 
 #define INTEL_FLAGS_EVENT_CONSTRAINT_RANGE(c, e, n)\
-   EVENT_CONSTRAINT_RANGE(c, e, n, 
INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+   EVENT_CONSTRAINT_RANGE(c, e, n, 
ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS)
 
 /* Check only flags, but allow all event/umask */
 #define INTEL_ALL_EVENT_CONSTRAINT(code, n)\
-- 
2.21.0.1020.gf2820cf01a-goog



Re: [tip:perf/core] perf/x86/intel: Force resched when TFA sysctl is modified

2019-04-18 Thread Stephane Eranian
Vince

On Tue, Apr 16, 2019 at 11:06 PM Ingo Molnar  wrote:
>
>
> * Vince Weaver  wrote:
>
> > On Tue, 16 Apr 2019, tip-bot for Stephane Eranian wrote:
> >
> > > Commit-ID:  f447e4eb3ad1e60d173ca997fcb2ef2a66f12574
> > > Gitweb: 
> > > https://git.kernel.org/tip/f447e4eb3ad1e60d173ca997fcb2ef2a66f12574
> > > Author: Stephane Eranian 
> > > AuthorDate: Mon, 8 Apr 2019 10:32:52 -0700
> > > Committer:  Ingo Molnar 
> > > CommitDate: Tue, 16 Apr 2019 12:19:35 +0200
> > >
> > > perf/x86/intel: Force resched when TFA sysctl is modified
> >
> > What's TFA?  Tuna-fish-alarm?
>
> Heh, I wish! :-)
>
Sorry about the confusion. I was just trying to mimic the function
names that Peter used
in the code. Hard to fit the whole sysctl name in the title, anyway.

> > [...] Nowhere in the commit or in the code does it ever say what a TFA
> > is or why we'd want to resched when it is modified.
>
> Yeah, it's the TSX-Force-Abort acronym - Intel has numbed our general
> dislike to random acrynyms ...
>
> Peter and me usually fix such changelog context omissions, but this one
> slipped through. :-/
>
> The commit is too deep down perf/core already to rebase it just for the
> changelog, but if we are going to rebase it for some functional reason
> I'll take care of it next time around.
>
> TFA. (Thanks For your Assistance. :-)
>
> Ingo


[tip:perf/core] perf/x86/intel: Force resched when TFA sysctl is modified

2019-04-16 Thread tip-bot for Stephane Eranian
Commit-ID:  f447e4eb3ad1e60d173ca997fcb2ef2a66f12574
Gitweb: https://git.kernel.org/tip/f447e4eb3ad1e60d173ca997fcb2ef2a66f12574
Author: Stephane Eranian 
AuthorDate: Mon, 8 Apr 2019 10:32:52 -0700
Committer:  Ingo Molnar 
CommitDate: Tue, 16 Apr 2019 12:19:35 +0200

perf/x86/intel: Force resched when TFA sysctl is modified

This patch provides a guarantee to the sysadmin that when TFA is disabled, no PMU
event is using PMC3 when the echo command returns. Vice versa, when TFA
is enabled, the PMU can use PMC3 immediately (to eliminate possible multiplexing).

  $ perf stat -a -I 1000 --no-merge -e branches,branches,branches,branches
 1.000123979125,768,725,208  branches
 1.000562520125,631,000,456  branches
 1.000942898125,487,114,291  branches
 1.00116125,323,363,620  branches
 2.004721306125,514,968,546  branches
 2.005114560125,511,110,861  branches
 2.005482722125,510,132,724  branches
 2.005851245125,508,967,086  branches
 3.006323475125,166,570,648  branches
 3.006709247125,165,650,056  branches
 3.007086605125,164,639,142  branches
 3.007459298125,164,402,912  branches
 4.007922698125,045,577,140  branches
 4.008310775125,046,804,324  branches
 4.008670814125,048,265,111  branches
 4.009039251125,048,677,611  branches
 5.009503373125,122,240,217  branches
 5.009897067125,122,450,517  branches

Then on another connection, sysadmin does:

  $ echo  1 >/sys/devices/cpu/allow_tsx_force_abort

Then perf stat adjusts the events immediately:

 5.010286029125,121,393,483  branches
 5.010646308125,120,556,786  branches
 6.03588124,963,351,832  branches
 6.011510331124,964,267,566  branches
 6.011889913124,964,829,130  branches
 6.012262996124,965,841,156  branches
 7.012708299124,419,832,234  branches [79.69%]
 7.012847908124,416,363,853  branches [79.73%]
 7.013225462124,400,723,712  branches [79.73%]
 7.013598191124,376,154,434  branches [79.70%]
 8.014089834124,250,862,693  branches [74.98%]
 8.014481363124,267,539,139  branches [74.94%]
 8.014856006124,259,519,786  branches [74.98%]
 8.014980848124,225,457,969  branches [75.04%]
 9.015464576124,204,235,423  branches [75.03%]
 9.015858587124,204,988,490  branches [75.04%]
 9.016243680124,220,092,486  branches [74.99%]
 9.016620104124,231,260,146  branches [74.94%]

And vice-versa, if the sysadmin does:

  $ echo  0 >/sys/devices/cpu/allow_tsx_force_abort

Events are again spread over the 4 counters:

10.017096277124,276,230,565  branches [74.96%]
10.017237209124,228,062,171  branches [75.03%]
10.017478637124,178,780,626  branches [75.03%]
10.017853402124,198,316,177  branches [75.03%]
11.018334423124,602,418,933  branches [85.40%]
11.018722584124,602,921,320  branches [85.42%]
11.019095621124,603,956,093  branches [85.42%]
11.019467742124,595,273,783  branches [85.42%]
12.019945736125,110,114,864  branches
12.020330764125,109,334,472  branches
12.020688740125,109,818,865  branches
12.021054020125,108,594,014  branches
13.021516774125,109,164,018  branches
13.021903640125,108,794,510  branches
13.022270770125,107,756,978  branches
13.022630819125,109,380,471  branches
14.023114989125,133,140,817  branches
14.023501880125,133,785,858  branches
14.023868339125,133,852,700  branches

Signed-off-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: kan.li...@intel.com
Cc: nelson.dso...@intel.com
Cc: to...@suse.com
Link: https://lkml.kernel.org/r/20190408173252.37932-3-eran...@google.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/core.c   |  4 
 arch/x86/events/intel/core.c | 50 ++--
 arch/x86/events/perf_event.h |  1 +
 3 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 87b50f4be201..fdd106267fd2 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -661,6 +661,10 @@ static inline int is_x86_event(struct perf_event *event)
return event->pmu == &pmu;
 }
 
+struct pmu *x86_get_pmu(void)
+{
+   return &pmu;
+}
 /*
  * Event scheduler state:
  *
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1bb59c4c59f2..8265b5026a19 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/eve

[tip:perf/core] perf/core: Add perf_pmu_resched() as global function

2019-04-16 Thread tip-bot for Stephane Eranian
Commit-ID:  c68d224e5ed15605e651e2482c6ffd95915ddf58
Gitweb: https://git.kernel.org/tip/c68d224e5ed15605e651e2482c6ffd95915ddf58
Author: Stephane Eranian 
AuthorDate: Mon, 8 Apr 2019 10:32:51 -0700
Committer:  Ingo Molnar 
CommitDate: Tue, 16 Apr 2019 12:19:34 +0200

perf/core: Add perf_pmu_resched() as global function

This patch adds perf_pmu_resched(), a global function that can be called
to force rescheduling of events for a given PMU. The function locks
both cpuctx and task_ctx internally. This will be used by a subsequent
patch.
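
For reference, the intended call site in the follow-up patch boils down to
something like this sketch (paraphrased from the companion x86 change, not a
literal excerpt):

static void update_tfa_sched(void *ignored)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

	/* if PMC3 is currently in use, force every context to reschedule */
	if (test_bit(3, cpuc->active_mask))
		perf_pmu_resched(x86_get_pmu());
}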

Signed-off-by: Stephane Eranian 
[ Simplified the calling convention. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: kan.li...@intel.com
Cc: nelson.dso...@intel.com
Cc: to...@suse.com
Link: https://lkml.kernel.org/r/20190408173252.37932-2-eran...@google.com
Signed-off-by: Ingo Molnar 
---
 include/linux/perf_event.h |  3 +++
 kernel/events/core.c   | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 085a95e2582a..f3864e1c5569 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -888,6 +888,9 @@ extern void perf_sched_cb_dec(struct pmu *pmu);
 extern void perf_sched_cb_inc(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
+
+extern void perf_pmu_resched(struct pmu *pmu);
+
 extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 30a572e4c6f1..abbd4b3b96c2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2478,6 +2478,16 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
perf_pmu_enable(cpuctx->ctx.pmu);
 }
 
+void perf_pmu_resched(struct pmu *pmu)
+{
+   struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+   struct perf_event_context *task_ctx = cpuctx->task_ctx;
+
+   perf_ctx_lock(cpuctx, task_ctx);
+   ctx_resched(cpuctx, task_ctx, EVENT_ALL|EVENT_CPU);
+   perf_ctx_unlock(cpuctx, task_ctx);
+}
+
 /*
  * Cross CPU call to install and enable a performance event
  *


Re: [PATCH v2 2/2] perf/x86/intel: force resched when TFA sysctl is modified

2019-04-15 Thread Stephane Eranian
On Mon, Apr 15, 2019 at 8:57 AM Peter Zijlstra  wrote:
>
> On Mon, Apr 08, 2019 at 10:32:52AM -0700, Stephane Eranian wrote:
> > +static ssize_t set_sysctl_tfa(struct device *cdev,
> > +   struct device_attribute *attr,
> > +   const char *buf, size_t count)
> > +{
> > + bool val;
> > + ssize_t ret;
> > +
> > + ret = kstrtobool(buf, &val);
> > + if (ret)
> > + return ret;
> > +
> > + /* no change */
> > + if (val == allow_tsx_force_abort)
> > + return count;
> > +
> > + allow_tsx_force_abort = val;
> > +
> > + get_online_cpus();
> > + on_each_cpu(update_tfa_sched, NULL, 1);
> > + put_online_cpus();
> > +
> > + return count;
> > +}
>
> So we care about concurrent writing to that file?
Not likely, but we care about seeing the effects on event scheduling
before the sysctl write returns.


[PATCH v2 2/2] perf/x86/intel: force resched when TFA sysctl is modified

2019-04-08 Thread Stephane Eranian
This patch provides a guarantee to the sysadmin that when TFA is disabled, no PMU
event is using PMC3 when the echo command returns. Vice versa, when TFA
is enabled, the PMU can use PMC3 immediately (to eliminate possible multiplexing).

$ perf stat -a -I 1000 --no-merge -e branches,branches,branches,branches
 1.000123979125,768,725,208  branches
 1.000562520125,631,000,456  branches
 1.000942898125,487,114,291  branches
 1.00116125,323,363,620  branches
 2.004721306125,514,968,546  branches
 2.005114560125,511,110,861  branches
 2.005482722125,510,132,724  branches
 2.005851245125,508,967,086  branches
 3.006323475125,166,570,648  branches
 3.006709247125,165,650,056  branches
 3.007086605125,164,639,142  branches
 3.007459298125,164,402,912  branches
 4.007922698125,045,577,140  branches
 4.008310775125,046,804,324  branches
 4.008670814125,048,265,111  branches
 4.009039251125,048,677,611  branches
 5.009503373125,122,240,217  branches
 5.009897067125,122,450,517  branches

Then on another connection, sysadmin does:
$ echo  1 >/sys/devices/cpu/allow_tsx_force_abort

Then perf stat adjusts the events immediately:

 5.010286029125,121,393,483  branches
 5.010646308125,120,556,786  branches
 6.03588124,963,351,832  branches
 6.011510331124,964,267,566  branches
 6.011889913124,964,829,130  branches
 6.012262996124,965,841,156  branches
 7.012708299124,419,832,234  branches [79.69%]
 7.012847908124,416,363,853  branches [79.73%]
 7.013225462124,400,723,712  branches [79.73%]
 7.013598191124,376,154,434  branches [79.70%]
 8.014089834124,250,862,693  branches [74.98%]
 8.014481363124,267,539,139  branches [74.94%]
 8.014856006124,259,519,786  branches [74.98%]
 8.014980848124,225,457,969  branches [75.04%]
 9.015464576124,204,235,423  branches [75.03%]
 9.015858587124,204,988,490  branches [75.04%]
 9.016243680124,220,092,486  branches [74.99%]
 9.016620104124,231,260,146  branches [74.94%]

And vice-versa, if the sysadmin does:
$ echo  0 >/sys/devices/cpu/allow_tsx_force_abort

Events are again spread over the 4 counters:

10.017096277124,276,230,565  branches [74.96%]
10.017237209124,228,062,171  branches [75.03%]
10.017478637124,178,780,626  branches [75.03%]
10.017853402124,198,316,177  branches [75.03%]
11.018334423124,602,418,933  branches [85.40%]
11.018722584124,602,921,320  branches [85.42%]
11.019095621124,603,956,093  branches [85.42%]
11.019467742124,595,273,783  branches [85.42%]
12.019945736125,110,114,864  branches
12.020330764125,109,334,472  branches
12.020688740125,109,818,865  branches
12.021054020125,108,594,014  branches
13.021516774125,109,164,018  branches
13.021903640125,108,794,510  branches
13.022270770125,107,756,978  branches
13.022630819125,109,380,471  branches
14.023114989125,133,140,817  branches
14.023501880125,133,785,858  branches
14.023868339125,133,852,700  branches

Signed-off-by: Stephane Eranian 
Change-Id: Ib443265edce31b93ca4d10fe7695c05d00a7178e
---
 arch/x86/events/core.c   |  4 +++
 arch/x86/events/intel/core.c | 53 ++--
 arch/x86/events/perf_event.h |  1 +
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 87b50f4be201..fdd106267fd2 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -661,6 +661,10 @@ static inline int is_x86_event(struct perf_event *event)
return event->pmu == &pmu;
 }
 
+struct pmu *x86_get_pmu(void)
+{
+   return &pmu;
+}
 /*
  * Event scheduler state:
  *
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1403c05e25e2..20698e84c388 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4156,6 +4156,53 @@ static ssize_t freeze_on_smi_store(struct device *cdev,
return count;
 }
 
+static void update_tfa_sched(void *ignored)
+{
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   struct pmu *pmu = x86_get_pmu();
+   struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+   struct perf_event_context *task_ctx = cpuctx->task_ctx;
+
+   /*
+* check if PMC3 is used
+* and if so force schedule out for all event types all contexts
+*/
+   if (test_bit(3, cpuc->active_mask))
+   perf_ctx_resched(cpuctx, task_ctx, E

[PATCH v2 1/2] perf/core: add perf_ctx_resched() as global function

2019-04-08 Thread Stephane Eranian
This patch adds perf_ctx_resched(), a global function that can be called
to force rescheduling of events based on event types. The function locks
both cpuctx and task_ctx internally. This will be used by a subsequent patch.

Signed-off-by: Stephane Eranian 
Change-Id: Icbc05e5f461fd6e091b46778fe62b23f308e2be7
---
 include/linux/perf_event.h | 14 ++
 kernel/events/core.c   | 18 +-
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 085a95e2582a..ee8a275df0ed 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -822,6 +822,15 @@ struct bpf_perf_event_data_kern {
struct perf_event *event;
 };
 
+enum event_type_t {
+   EVENT_FLEXIBLE = 0x1,
+   EVENT_PINNED = 0x2,
+   EVENT_TIME = 0x4,
+   /* see ctx_resched() for details */
+   EVENT_CPU = 0x8,
+   EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
+};
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
@@ -888,6 +897,11 @@ extern void perf_sched_cb_dec(struct pmu *pmu);
 extern void perf_sched_cb_inc(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
+
+extern void perf_ctx_resched(struct perf_cpu_context *cpuctx,
+struct perf_event_context *task_ctx,
+enum event_type_t event_type);
+
 extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dfc4bab0b02b..30474064ec22 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -354,15 +354,6 @@ static void event_function_local(struct perf_event *event, 
event_f func, void *d
(PERF_SAMPLE_BRANCH_KERNEL |\
 PERF_SAMPLE_BRANCH_HV)
 
-enum event_type_t {
-   EVENT_FLEXIBLE = 0x1,
-   EVENT_PINNED = 0x2,
-   EVENT_TIME = 0x4,
-   /* see ctx_resched() for details */
-   EVENT_CPU = 0x8,
-   EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
-};
-
 /*
  * perf_sched_events : >0 events exist
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
@@ -2477,6 +2468,15 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
perf_pmu_enable(cpuctx->ctx.pmu);
 }
 
+void perf_ctx_resched(struct perf_cpu_context *cpuctx,
+ struct perf_event_context *task_ctx,
+ enum event_type_t event_type)
+{
+   perf_ctx_lock(cpuctx, task_ctx);
+   ctx_resched(cpuctx, task_ctx, event_type);
+   perf_ctx_unlock(cpuctx, task_ctx);
+}
+
 /*
  * Cross CPU call to install and enable a performance event
  *
-- 
2.21.0.392.gf8f6787159e-goog



[PATCH v2 0/3] perf/x86/intel: force reschedule on TFA changes

2019-04-08 Thread Stephane Eranian
This short patch series improves the TFA patch series by adding a
guarantee to users each time the allow_tsx_force_abort (TFA) sysctl
control knob is modified. 

The current TFA support in perf_events operates as follow:
 - TFA=1
   The PMU has priority over TSX, if PMC3 is needed, then TSX transactions
   are forced to abort. PMU has access to PMC3 and can schedule events on it.

 - TFA=0
   TSX has priority over PMU. If PMC3 is needed for an event, then the event
   must be scheduled on another counter. PMC3 is not available.

When a sysadmin modifies TFA, the current code base does not change anything
about the events being measured at the time, nor the actual MSR controlling TFA. If the
kernel transitions from TFA=1 to TFA=0, nothing happens until the events are
descheduled on context switch, multiplexing or termination of measurement.
That means the TSX transactions still fail until then. There is no easy way
to evaluate how long this can take.

This patch series addresses this issue by rescheduling the events as part of the
sysctl changes. That way, there is the guarantee that no more perf_events events
are running on PMC3 by the time the write() syscall (from the echo) returns, and
that TSX transactions may succeed from then on. Similarly, when transitioning
from TFA=0 to TFA=1, the events are rescheduled and can use PMC3 immediately if
needed and TSX transactions systematically abort, by the time the write() 
syscall
returns.

To make this work, the patch uses an existing reschedule function in the generic
code, ctx_resched(). In V2, we export a new function called perf_ctx_resched()
which takes care of locking the contexts and invoking ctx_resched().

The patch adds an x86_get_pmu() call, which is less than ideal, but I am open to
suggestions here.

In V2, we also switched from kstrtoul() to kstrtobool() and added the proper
get_online_cpus()/put_online_cpus().
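
As a quick map for reviewers, the flow implemented by this series is roughly
the following (editor's sketch condensed from the two patches, not a literal
diff):

/*
 * set_sysctl_tfa()                       sysfs write handler (patch 2/2)
 *     allow_tsx_force_abort = val;
 *     get_online_cpus();
 *     on_each_cpu(update_tfa_sched, NULL, 1);
 *     put_online_cpus();
 *
 * update_tfa_sched()                     runs on every online CPU
 *     if (test_bit(3, cpuc->active_mask))        // PMC3 in use?
 *         perf_ctx_resched(cpuctx, task_ctx, EVENT_ALL | EVENT_CPU);
 *
 * perf_ctx_resched()                     new global helper (patch 1/2)
 *     perf_ctx_lock(); ctx_resched(); perf_ctx_unlock();
 */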

Signed-off-by: Stephane Eranian 


Stephane Eranian (2):
  perf/core: add perf_ctx_resched() as global function
  perf/x86/intel: force resched when TFA sysctl is modified

 arch/x86/events/core.c   |  4 +++
 arch/x86/events/intel/core.c | 53 ++--
 arch/x86/events/perf_event.h |  1 +
 include/linux/perf_event.h   | 14 ++
 kernel/events/core.c | 18 ++--
 5 files changed, 79 insertions(+), 11 deletions(-)

-- 
2.21.0.392.gf8f6787159e-goog



Re: [PATCH 3/3] perf/x86/intel: force resched when TFA sysctl is modified

2019-04-05 Thread Stephane Eranian
On Fri, Apr 5, 2019 at 1:26 PM Peter Zijlstra  wrote:
>
> On Fri, Apr 05, 2019 at 10:00:03AM -0700, Stephane Eranian wrote:
>
> > > > +static void update_tfa_sched(void *ignored)
> > > > +{
> > > > + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > > > + struct pmu *pmu = x86_get_pmu();
> > > > + struct perf_cpu_context *cpuctx = 
> > > > this_cpu_ptr(pmu->pmu_cpu_context);
> > > > + struct perf_event_context *task_ctx = cpuctx->task_ctx;
> > > > +
> > > > + /* prevent any changes to the two contexts */
> > > > + perf_ctx_lock(cpuctx, task_ctx);
> > > > +
> > > > + /*
> > > > +  * check if PMC3 is used
> > > > +  * and if so force schedule out for all event types all contexts
> > > > +  */
> > > > + if (test_bit(3, cpuc->active_mask))
> > > > + perf_ctx_resched(cpuctx, task_ctx, EVENT_ALL|EVENT_CPU);
> > > > +
> > > > + perf_ctx_unlock(cpuctx, task_ctx);
> > >
> > > I'm not particularly happy with exporting all that. Can't we create this
> > > new perf_ctx_resched() to include the locking and everything. Then the
> > > above reduces to:
> > >
> > > if (test_bit(3, cpuc->active_mask))
> > > perf_ctx_resched(cpuctx);
> > >
> > > And we don't get to export the tricky bits.
> > >
> > The only reason I exported the locking is to protect
> > cpuc->active_mask. But if you
> > think there is no race, then sure,  we can just export a new
> > perf_ctx_resched() that
> > does the locking and invokes the ctx_resched() function.
>
> It doesn't matter if it races, if it was used and isn't anymore, it's
> a pointless reschedule, if it isn't used and we don't reschedule, it
> cannot be used because we've already set the flag.

True. I will post V2 shortly.


Re: [PATCH 3/3] perf/x86/intel: force resched when TFA sysctl is modified

2019-04-05 Thread Stephane Eranian
On Fri, Apr 5, 2019 at 12:13 AM Peter Zijlstra  wrote:
>
> On Thu, Apr 04, 2019 at 11:32:19AM -0700, Stephane Eranian wrote:
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index a4b7711ef0ee..8d356c2096bc 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -4483,6 +4483,60 @@ static ssize_t freeze_on_smi_store(struct device 
> > *cdev,
> >   return count;
> >  }
> >
> > +static void update_tfa_sched(void *ignored)
> > +{
> > > + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > + struct pmu *pmu = x86_get_pmu();
> > + struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> > + struct perf_event_context *task_ctx = cpuctx->task_ctx;
> > +
> > + /* prevent any changes to the two contexts */
> > + perf_ctx_lock(cpuctx, task_ctx);
> > +
> > + /*
> > +  * check if PMC3 is used
> > +  * and if so force schedule out for all event types all contexts
> > +  */
> > + if (test_bit(3, cpuc->active_mask))
> > + perf_ctx_resched(cpuctx, task_ctx, EVENT_ALL|EVENT_CPU);
> > +
> > + perf_ctx_unlock(cpuctx, task_ctx);
>
> I'm not particularly happy with exporting all that. Can't we create this
> new perf_ctx_resched() to include the locking and everything. Then the
> above reduces to:
>
> if (test_bit(3, cpuc->active_mask))
> perf_ctx_resched(cpuctx);
>
> And we don't get to export the tricky bits.
>
The only reason I exported the locking is to protect
cpuc->active_mask. But if you
think there is no race, then sure,  we can just export a new
perf_ctx_resched() that
does the locking and invokes the ctx_resched() function.

> > +}
> > +
> > +static ssize_t show_sysctl_tfa(struct device *cdev,
> > +   struct device_attribute *attr,
> > +   char *buf)
> > +{
> > + return snprintf(buf, 40, "%d\n", allow_tsx_force_abort);
> > +}
> > +
> > +static ssize_t set_sysctl_tfa(struct device *cdev,
> > +   struct device_attribute *attr,
> > +   const char *buf, size_t count)
> > +{
> > + unsigned long val;
> > + ssize_t ret;
> > +
> > + ret = kstrtoul(buf, 0, &val);
>
> You want kstrtobool()
>
ok.

> > + if (ret)
> > + return ret;
> > +
> > + /* looking for boolean value */
> > + if (val > 2)
> > + return -EINVAL;
> > +
> > + /* no change */
> > + if (val == allow_tsx_force_abort)
> > + return count;
> > +
> > + allow_tsx_force_abort ^= 1;
>
> allow_tsx_force_abort = val;
>
> is simpler
>
ok.

> > +
> > + on_each_cpu(update_tfa_sched, NULL, 1);
> > +
> > + return count;
> > +}


[PATCH 3/3] perf/x86/intel: force resched when TFA sysctl is modified

2019-04-04 Thread Stephane Eranian
This patch provides a guarantee to the sysadmin that when TFA is disabled, no PMU
event is using PMC3 when the echo command returns. Vice versa, when TFA
is enabled, the PMU can use PMC3 immediately (to eliminate possible multiplexing).

$ perf stat -a -I 1000 --no-merge -e branches,branches,branches,branches
 1.000123979125,768,725,208  branches
 1.000562520125,631,000,456  branches
 1.000942898125,487,114,291  branches
 1.00116125,323,363,620  branches
 2.004721306125,514,968,546  branches
 2.005114560125,511,110,861  branches
 2.005482722125,510,132,724  branches
 2.005851245125,508,967,086  branches
 3.006323475125,166,570,648  branches
 3.006709247125,165,650,056  branches
 3.007086605125,164,639,142  branches
 3.007459298125,164,402,912  branches
 4.007922698125,045,577,140  branches
 4.008310775125,046,804,324  branches
 4.008670814125,048,265,111  branches
 4.009039251125,048,677,611  branches
 5.009503373125,122,240,217  branches
 5.009897067125,122,450,517  branches

Then on another connection, sysadmin does:
$ echo  1 >/sys/devices/cpu/allow_tsx_force_abort

Then perf stat adjusts the events immediately:

 5.010286029125,121,393,483  branches
 5.010646308125,120,556,786  branches
 6.03588124,963,351,832  branches
 6.011510331124,964,267,566  branches
 6.011889913124,964,829,130  branches
 6.012262996124,965,841,156  branches
 7.012708299124,419,832,234  branches [79.69%]
 7.012847908124,416,363,853  branches [79.73%]
 7.013225462124,400,723,712  branches [79.73%]
 7.013598191124,376,154,434  branches [79.70%]
 8.014089834124,250,862,693  branches [74.98%]
 8.014481363124,267,539,139  branches [74.94%]
 8.014856006124,259,519,786  branches [74.98%]
 8.014980848124,225,457,969  branches [75.04%]
 9.015464576124,204,235,423  branches [75.03%]
 9.015858587124,204,988,490  branches [75.04%]
 9.016243680124,220,092,486  branches [74.99%]
 9.016620104124,231,260,146  branches [74.94%]

And vice-versa, if the sysadmin does:
$ echo  0 >/sys/devices/cpu/allow_tsx_force_abort

Events are again spread over the 4 counters:

10.017096277124,276,230,565  branches [74.96%]
10.017237209124,228,062,171  branches [75.03%]
10.017478637124,178,780,626  branches [75.03%]
10.017853402124,198,316,177  branches [75.03%]
11.018334423124,602,418,933  branches [85.40%]
11.018722584124,602,921,320  branches [85.42%]
11.019095621124,603,956,093  branches [85.42%]
11.019467742124,595,273,783  branches [85.42%]
12.019945736125,110,114,864  branches
12.020330764125,109,334,472  branches
12.020688740125,109,818,865  branches
12.021054020125,108,594,014  branches
13.021516774125,109,164,018  branches
13.021903640125,108,794,510  branches
13.022270770125,107,756,978  branches
13.022630819125,109,380,471  branches
14.023114989125,133,140,817  branches
14.023501880125,133,785,858  branches
14.023868339125,133,852,700  branches

Signed-off-by: Stephane Eranian 
---
 arch/x86/events/core.c   |  4 +++
 arch/x86/events/intel/core.c | 60 ++--
 arch/x86/events/perf_event.h |  1 +
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 12d7d591843e..314173f89cc8 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -677,6 +677,10 @@ static inline int is_x86_event(struct perf_event *event)
return event->pmu == &pmu;
 }
 
+struct pmu *x86_get_pmu(void)
+{
+   return &pmu;
+}
 /*
  * Event scheduler state:
  *
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a4b7711ef0ee..8d356c2096bc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4483,6 +4483,60 @@ static ssize_t freeze_on_smi_store(struct device *cdev,
return count;
 }
 
+static void update_tfa_sched(void *ignored)
+{
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   struct pmu *pmu = x86_get_pmu();
+   struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+   struct perf_event_context *task_ctx = cpuctx->task_ctx;
+
+   /* prevent any changes to the two contexts */
+   perf_ctx_lock(cpuctx, task_ctx);
+
+   /*
+* check if PMC3 is used
+* and if so force schedule out for all event types all contexts
+*/
+   if (test_bit(3, cpuc->active_mask))
+ 

[PATCH 2/3] perf/core: make ctx_resched() a global function

2019-04-04 Thread Stephane Eranian
This patch renames ctx_resched() to perf_ctx_resched() and makes
the function globally accessible. This is to prepare for the next
patch which needs to call this function from arch specific code.

Signed-off-by: Stephane Eranian 
---
 include/linux/perf_event.h | 12 
 kernel/events/core.c   | 21 ++---
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 514de997696b..150cfd493ad2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -829,6 +829,15 @@ struct bpf_perf_event_data_kern {
struct perf_event *event;
 };
 
+enum event_type_t {
+   EVENT_FLEXIBLE = 0x1,
+   EVENT_PINNED = 0x2,
+   EVENT_TIME = 0x4,
+   /* see ctx_resched() for details */
+   EVENT_CPU = 0x8,
+   EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
+};
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
@@ -895,6 +904,9 @@ extern void perf_sched_cb_dec(struct pmu *pmu);
 extern void perf_sched_cb_inc(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
+extern void perf_ctx_resched(struct perf_cpu_context *cpuctx,
+   struct perf_event_context *task_ctx,
+   enum event_type_t event_type);
 extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 429bf6d8be95..48b955a2b7f1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -338,15 +338,6 @@ static void event_function_local(struct perf_event *event, 
event_f func, void *d
(PERF_SAMPLE_BRANCH_KERNEL |\
 PERF_SAMPLE_BRANCH_HV)
 
-enum event_type_t {
-   EVENT_FLEXIBLE = 0x1,
-   EVENT_PINNED = 0x2,
-   EVENT_TIME = 0x4,
-   /* see ctx_resched() for details */
-   EVENT_CPU = 0x8,
-   EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
-};
-
 /*
  * perf_sched_events : >0 events exist
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
@@ -2430,9 +2421,9 @@ static void perf_event_sched_in(struct perf_cpu_context 
*cpuctx,
  * event_type is a bit mask of the types of events involved. For CPU events,
  * event_type is only either EVENT_PINNED or EVENT_FLEXIBLE.
  */
-static void ctx_resched(struct perf_cpu_context *cpuctx,
-   struct perf_event_context *task_ctx,
-   enum event_type_t event_type)
+void perf_ctx_resched(struct perf_cpu_context *cpuctx,
+ struct perf_event_context *task_ctx,
+ enum event_type_t event_type)
 {
enum event_type_t ctx_event_type;
bool cpu_event = !!(event_type & EVENT_CPU);
@@ -2520,7 +2511,7 @@ static int  __perf_install_in_context(void *info)
if (reprogram) {
ctx_sched_out(ctx, cpuctx, EVENT_TIME);
add_event_to_ctx(event, ctx);
-   ctx_resched(cpuctx, task_ctx, get_event_type(event));
+   perf_ctx_resched(cpuctx, task_ctx, get_event_type(event));
} else {
add_event_to_ctx(event, ctx);
}
@@ -2664,7 +2655,7 @@ static void __perf_event_enable(struct perf_event *event,
if (ctx->task)
WARN_ON_ONCE(task_ctx != ctx);
 
-   ctx_resched(cpuctx, task_ctx, get_event_type(event));
+   perf_ctx_resched(cpuctx, task_ctx, get_event_type(event));
 }
 
 /*
@@ -3782,7 +3773,7 @@ static void perf_event_enable_on_exec(int ctxn)
 */
if (enabled) {
clone_ctx = unclone_ctx(ctx);
-   ctx_resched(cpuctx, ctx, event_type);
+   perf_ctx_resched(cpuctx, ctx, event_type);
} else {
ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
}
-- 
2.21.0.392.gf8f6787159e-goog



[PATCH 0/3] perf/x86/intel: force reschedule on TFA changes

2019-04-04 Thread Stephane Eranian
This short patch series improves the TFA patch series by adding a
guarantee to users each time the allow_tsx_force_abort (TFA) sysctl
control knob is modified. 

The current TFA support in perf_events operates as follow:
 - TFA=1
   The PMU has priority over TSX, if PMC3 is needed, then TSX transactions
   are forced to abort. PMU has access to PMC3 and can schedule events on it.

 - TFA=0
   TSX has priority over PMU. If PMC3 is needed for an event, then the event
   must be scheduled on another counter. PMC3 is not available.

When a sysadmin modifies TFA, the current code base does not change anything
about the events being measured at the time, nor the actual MSR controlling TFA. If the
kernel transitions from TFA=1 to TFA=0, nothing happens until the events are
descheduled on context switch, multiplexing or termination of measurement.
That means the TSX transactions still fail until then. There is no easy way
to evaluate how long this can take.

This patch series addresses this issue by rescheduling the events as part of the
sysctl change. That way, there is a guarantee that no more perf_events events
are running on PMC3 by the time the write() syscall (from the echo) returns, and
that TSX transactions may succeed from then on. Similarly, when transitioning
from TFA=0 to TFA=1, the events are rescheduled and can use PMC3 immediately if
needed, and TSX transactions systematically abort by the time the write()
syscall returns.

To make this work, the patch series reuses an existing reschedule function in
the generic code, and makes it, along with the context locking helpers, visible
outside the generic code to avoid code duplication. Given there is no good way
to find the struct pmu if you do not already have it, the patch adds an
x86_get_pmu() call, which is less than ideal, but I am open to suggestions here.
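
For illustration, here is a rough sketch of how the exported helpers could be
combined from an on_each_cpu() callback when the knob is flipped. This only
approximates what patch 3/3 does; the x86-specific names (x86_get_pmu(),
cpu_hw_events, active_mask) are assumptions in this sketch:

/* Sketch only: approximates the per-CPU callback run when TFA is toggled. */
static void update_tfa_sched(void *ignored)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
	struct pmu *pmu = x86_get_pmu();
	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
	struct perf_event_context *task_ctx = cpuctx->task_ctx;

	/* Only bother if PMC3 is currently in use on this CPU. */
	if (!test_bit(3, cpuc->active_mask))
		return;

	/* Force a reschedule of all events, CPU and task, on this CPU. */
	perf_ctx_lock(cpuctx, task_ctx);
	perf_ctx_resched(cpuctx, task_ctx, EVENT_ALL | EVENT_CPU);
	perf_ctx_unlock(cpuctx, task_ctx);
}

The actual patch may differ in detail; the point is simply that, with
perf_ctx_lock()/perf_ctx_unlock() and perf_ctx_resched() exported, the x86
driver can guarantee PMC3 has been vacated (or made usable) by the time the
write returns.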

Signed-off-by: Stephane Eranian 

Stephane Eranian (3):
  perf/core: make perf_ctx_*lock() global inline functions
  perf/core: make ctx_resched() a global function
  perf/x86/intel: force resched when TFA sysctl is modified

 arch/x86/events/core.c   |  4 +++
 arch/x86/events/intel/core.c | 60 ++--
 arch/x86/events/perf_event.h |  1 +
 include/linux/perf_event.h   | 28 +
 kernel/events/core.c | 37 --
 5 files changed, 97 insertions(+), 33 deletions(-)

-- 
2.21.0.392.gf8f6787159e-goog



[PATCH 1/3] perf/core: make perf_ctx_*lock() global inline functions

2019-04-04 Thread Stephane Eranian
This patch makes the perf_ctx_lock()/perf_ctx_unlock() inlined functions
available throughout the perf_events code and not just in kernel/events/core.c.
This will help with the next patch.

Signed-off-by: Stephane Eranian 
---
 include/linux/perf_event.h | 16 
 kernel/events/core.c   | 16 
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2a1405e907ec..514de997696b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1283,6 +1283,22 @@ perf_event_addr_filters(struct perf_event *event)
return ifh;
 }
 
+static inline void perf_ctx_lock(struct perf_cpu_context *cpuctx,
+ struct perf_event_context *ctx)
+{
+   raw_spin_lock(&cpuctx->ctx.lock);
+   if (ctx)
+           raw_spin_lock(&ctx->lock);
+}
+
+static inline void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
+   struct perf_event_context *ctx)
+{
+   if (ctx)
+           raw_spin_unlock(&ctx->lock);
+   raw_spin_unlock(&cpuctx->ctx.lock);
+}
+
 extern void perf_event_addr_filters_sync(struct perf_event *event);
 
 extern int perf_output_begin(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 833f1bccf25a..429bf6d8be95 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -148,22 +148,6 @@ __get_cpu_context(struct perf_event_context *ctx)
return this_cpu_ptr(ctx->pmu->pmu_cpu_context);
 }
 
-static void perf_ctx_lock(struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
-{
-   raw_spin_lock(&cpuctx->ctx.lock);
-   if (ctx)
-           raw_spin_lock(&ctx->lock);
-}
-
-static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
-   struct perf_event_context *ctx)
-{
-   if (ctx)
-           raw_spin_unlock(&ctx->lock);
-   raw_spin_unlock(&cpuctx->ctx.lock);
-}
-
 #define TASK_TOMBSTONE ((void *)-1L)
 
 static bool is_kernel_event(struct perf_event *event)
-- 
2.21.0.392.gf8f6787159e-goog



[tip:perf/urgent] perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS

2019-04-03 Thread tip-bot for Stephane Eranian
Commit-ID:  583feb08e7f7ac9d533b446882eb3a54737a6dbb
Gitweb: https://git.kernel.org/tip/583feb08e7f7ac9d533b446882eb3a54737a6dbb
Author: Stephane Eranian 
AuthorDate: Wed, 6 Mar 2019 11:50:48 -0800
Committer:  Ingo Molnar 
CommitDate: Wed, 3 Apr 2019 09:57:43 +0200

perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS

When an event is programmed with attr.wakeup_events=N (N>0), it means
the caller is interested in getting a user level notification after
N samples have been recorded in the kernel sampling buffer.

With precise events on Intel processors, the kernel uses PEBS.
The kernel tries to minimize sampling overhead by verifying
whether the event configuration is compatible with multi-entry PEBS mode.
If so, the kernel is notified only when the buffer has reached its threshold.
Otherwise PEBS operates in single-entry mode, and the kernel is notified for
each PEBS sample.

The problem is that the current implementation looks at the frequency
mode and the event sample_type but ignores the wakeup_events field. Thus,
it may not be possible to receive a notification after each precise event.

This patch fixes this problem by disabling multi-entry PEBS if wakeup_events
is non-zero.
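
For reference, a minimal user-space sketch of the configuration this fix
affects: a precise (PEBS) event opened with wakeup_events=1, which should now
notify its consumer after every sample instead of only when the PEBS buffer
fills. The event choice, period, and lack of ring-buffer handling are
illustrative only, and precise_ip requires PEBS-capable hardware:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP;
	attr.precise_ip = 2;		/* request PEBS */
	attr.wakeup_events = 1;		/* notify after every recorded sample */
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	fd = syscall(__NR_perf_event_open, &attr, 0 /* self */, -1 /* any CPU */,
		     -1 /* no group */, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	/* ... mmap the ring buffer, enable the event, poll(fd) per wakeup ... */
	close(fd);
	return 0;
}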

Signed-off-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Andi Kleen 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: kan.li...@intel.com
Link: https://lkml.kernel.org/r/20190306195048.189514-1-eran...@google.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/intel/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8baa441d8000..1539647ea39d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3185,7 +3185,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;
 
if (event->attr.precise_ip) {
-   if (!event->attr.freq) {
+   if (!(event->attr.freq || event->attr.wakeup_events)) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
if (!(event->attr.sample_type &
  ~intel_pmu_large_pebs_flags(event)))

