Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-17 Thread Will Deacon
On Fri, Nov 11, 2016 at 11:30:29AM +0100, Jan Glauber wrote:
> On Tue, Nov 08, 2016 at 11:50:10PM +, Will Deacon wrote:
> > On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > > +/* node attribute depending on number of NUMA nodes */
> > > +static ssize_t node_show(struct device *dev, struct device_attribute 
> > > *attr,
> > > +  char *page)
> > > +{
> > > + if (NODES_SHIFT)
> > > + return sprintf(page, "config:16-%d\n", 16 + NODES_SHIFT - 1);
> > 
> > If NODES_SHIFT is 1, you'll end up with "config:16-16", which might confuse
> > userspace.
> 
> So should I use "config:16" in that case? Is it OK to use this also for
> NODES_SHIFT=0 ?

If you only need one bit, then "config:16" is the right thing to do.

Will


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-17 Thread Will Deacon
On Fri, Nov 11, 2016 at 11:30:29AM +0100, Jan Glauber wrote:
> On Tue, Nov 08, 2016 at 11:50:10PM +, Will Deacon wrote:
> > On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > > +/* node attribute depending on number of NUMA nodes */
> > > +static ssize_t node_show(struct device *dev, struct device_attribute 
> > > *attr,
> > > +  char *page)
> > > +{
> > > + if (NODES_SHIFT)
> > > + return sprintf(page, "config:16-%d\n", 16 + NODES_SHIFT - 1);
> > 
> > If NODES_SHIFT is 1, you'll end up with "config:16-16", which might confuse
> > userspace.
> 
> So should I use "config:16" in that case? Is it OK to use this also for
> NODES_SHIFT=0 ?

If you only need one bit, then "config:16" is the right thing to do.

Will


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-11 Thread Mark Rutland
On Fri, Nov 11, 2016 at 11:39:21AM +0100, Jan Glauber wrote:
> Hi Mark,
> 
> thanks for reviewing. One question below,

> On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:

> > On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:

> > > +#include 
> > > +#include 
> > > +#include 
> > 
> > I believe the following includes are necessary for APIs and/or data
> > explicitly referenced by the driver code:

[...]

> Should I also add includes that are already in the included by 
> uncore_cavium.h?

Please do.

> I usually avoid includes that come through the "local" header file.

Generally, when you explcitly use some macro/function/data in a file,
that file should have the relevant include. 

If something's only used in the header (e.g. hidden in a macro or inline
function), then we only need that include in the header.

For example: uncore_cavium.h uses container_of(), and should include
. Also, uncore_cavium.c also uses container_of()
directly for something unrelated, and should also include
.

Thanks,
Mark.


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-11 Thread Mark Rutland
On Fri, Nov 11, 2016 at 11:39:21AM +0100, Jan Glauber wrote:
> Hi Mark,
> 
> thanks for reviewing. One question below,

> On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:

> > On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:

> > > +#include 
> > > +#include 
> > > +#include 
> > 
> > I believe the following includes are necessary for APIs and/or data
> > explicitly referenced by the driver code:

[...]

> Should I also add includes that are already in the included by 
> uncore_cavium.h?

Please do.

> I usually avoid includes that come through the "local" header file.

Generally, when you explcitly use some macro/function/data in a file,
that file should have the relevant include. 

If something's only used in the header (e.g. hidden in a macro or inline
function), then we only need that include in the header.

For example: uncore_cavium.h uses container_of(), and should include
. Also, uncore_cavium.c also uses container_of()
directly for something unrelated, and should also include
.

Thanks,
Mark.


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-11 Thread Jan Glauber
Hi Will,

thanks for the review!

On Tue, Nov 08, 2016 at 11:50:10PM +, Will Deacon wrote:
> Hi Jan,
> 
> Thanks for posting an updated series. I have a few minor comments, which
> we can hopefully address in time for 4.10.
> 
> Also, have you run the perf fuzzer with this series applied?

No, that's new to me. Will try it.

> https://github.com/deater/perf_event_tests
> 
> (build the tests and look under the fuzzer/ directory for the tool)
> 
> On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > Provide "uncore" facilities for different non-CPU performance
> > counter units.
> > 
> > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > All counters are exported via sysfs in the corresponding events
> > files under the PMU directory so the perf tool can list the event names.
> > 
> > There are some points that are special in this implementation:
> > 
> > 1) The PMU detection relies on PCI device detection. If a
> >matching PCI device is found the PMU is created. The code can deal
> >with multiple units of the same type, e.g. more than one memory
> >controller.
> > 
> > 2) Counters are summarized across different units of the same type
> >on one NUMA node but not across NUMA nodes.
> >For instance L2C TAD 0..7 are presented as a single counter
> >(adding the values from TAD 0 to 7). Although losing the ability
> >to read a single value the merged values are easier to use.
> > 
> > 3) The counters are not CPU related. A random CPU is picked regardless
> >of the NUMA node. There is a small performance penalty for accessing
> >counters on a remote note but reading a performance counter is a
> >slow operation anyway.
> > 
> > Signed-off-by: Jan Glauber 
> > ---
> >  drivers/perf/Kconfig|  13 ++
> >  drivers/perf/Makefile   |   1 +
> >  drivers/perf/uncore/Makefile|   1 +
> >  drivers/perf/uncore/uncore_cavium.c | 351 
> > 
> >  drivers/perf/uncore/uncore_cavium.h |  71 
> 
> We already have "uncore" PMUs under drivers/perf, so I'd prefer that we
> renamed this a bit to reflect better what's going on. How about:
> 
>   drivers/perf/cavium/
> 
> and then
> 
>   drivers/perf/cavium/uncore_thunder.[ch]
> 
> ?

OK, agreed.

> >  include/linux/cpuhotplug.h  |   1 +
> >  6 files changed, 438 insertions(+)
> >  create mode 100644 drivers/perf/uncore/Makefile
> >  create mode 100644 drivers/perf/uncore/uncore_cavium.c
> >  create mode 100644 drivers/perf/uncore/uncore_cavium.h
> > 
> > diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> > index 4d5c5f9..3266c87 100644
> > --- a/drivers/perf/Kconfig
> > +++ b/drivers/perf/Kconfig
> > @@ -19,4 +19,17 @@ config XGENE_PMU
> >  help
> >Say y if you want to use APM X-Gene SoC performance monitors.
> >  
> > +config UNCORE_PMU
> > +   bool
> 
> This isn't needed.

I thought about a symbol for all uncore pmus. But when drivers/perf is
already that it can be removed.

> > +
> > +config UNCORE_PMU_CAVIUM
> > +   depends on PERF_EVENTS && NUMA && ARM64
> > +   bool "Cavium uncore PMU support"
> 
> Please mentioned Thunder somewhere, since that's the SoC being supported.

OK.

> > +   select UNCORE_PMU
> > +   default y
> > +   help
> > + Say y if you want to access performance counters of subsystems
> > + on a Cavium SOC like cache controller, memory controller or
> > + processor interconnect.
> > +
> >  endmenu
> > diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> > index b116e98..d6c02c9 100644
> > --- a/drivers/perf/Makefile
> > +++ b/drivers/perf/Makefile
> > @@ -1,2 +1,3 @@
> >  obj-$(CONFIG_ARM_PMU) += arm_pmu.o
> >  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
> > +obj-y += uncore/
> > diff --git a/drivers/perf/uncore/Makefile b/drivers/perf/uncore/Makefile
> > new file mode 100644
> > index 000..6130e18
> > --- /dev/null
> > +++ b/drivers/perf/uncore/Makefile
> > @@ -0,0 +1 @@
> > +obj-$(CONFIG_UNCORE_PMU_CAVIUM) += uncore_cavium.o
> > diff --git a/drivers/perf/uncore/uncore_cavium.c 
> > b/drivers/perf/uncore/uncore_cavium.c
> > new file mode 100644
> > index 000..a7b4277
> > --- /dev/null
> > +++ b/drivers/perf/uncore/uncore_cavium.c
> > @@ -0,0 +1,351 @@
> > +/*
> > + * Cavium Thunder uncore PMU support.
> > + *
> > + * Copyright (C) 2015,2016 Cavium Inc.
> > + * Author: Jan Glauber 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include "uncore_cavium.h"
> > +
> > +/*
> > + * Some notes about the various counters supported by this "uncore" PMU
> > + * and the design:
> > + *
> > + * All counters are 64 bit long.
> > + * There are no overflow interrupts.
> > + * Counters are summarized per node/socket.
> > + * Most devices appear as separate PCI devices per socket with the 
> > exception
> > + * of OCX TLK which appears as one PCI device per socket and contains 
> > several
> > + * units 

Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-11 Thread Jan Glauber
Hi Will,

thanks for the review!

On Tue, Nov 08, 2016 at 11:50:10PM +, Will Deacon wrote:
> Hi Jan,
> 
> Thanks for posting an updated series. I have a few minor comments, which
> we can hopefully address in time for 4.10.
> 
> Also, have you run the perf fuzzer with this series applied?

No, that's new to me. Will try it.

> https://github.com/deater/perf_event_tests
> 
> (build the tests and look under the fuzzer/ directory for the tool)
> 
> On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > Provide "uncore" facilities for different non-CPU performance
> > counter units.
> > 
> > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > All counters are exported via sysfs in the corresponding events
> > files under the PMU directory so the perf tool can list the event names.
> > 
> > There are some points that are special in this implementation:
> > 
> > 1) The PMU detection relies on PCI device detection. If a
> >matching PCI device is found the PMU is created. The code can deal
> >with multiple units of the same type, e.g. more than one memory
> >controller.
> > 
> > 2) Counters are summarized across different units of the same type
> >on one NUMA node but not across NUMA nodes.
> >For instance L2C TAD 0..7 are presented as a single counter
> >(adding the values from TAD 0 to 7). Although losing the ability
> >to read a single value the merged values are easier to use.
> > 
> > 3) The counters are not CPU related. A random CPU is picked regardless
> >of the NUMA node. There is a small performance penalty for accessing
> >counters on a remote note but reading a performance counter is a
> >slow operation anyway.
> > 
> > Signed-off-by: Jan Glauber 
> > ---
> >  drivers/perf/Kconfig|  13 ++
> >  drivers/perf/Makefile   |   1 +
> >  drivers/perf/uncore/Makefile|   1 +
> >  drivers/perf/uncore/uncore_cavium.c | 351 
> > 
> >  drivers/perf/uncore/uncore_cavium.h |  71 
> 
> We already have "uncore" PMUs under drivers/perf, so I'd prefer that we
> renamed this a bit to reflect better what's going on. How about:
> 
>   drivers/perf/cavium/
> 
> and then
> 
>   drivers/perf/cavium/uncore_thunder.[ch]
> 
> ?

OK, agreed.

> >  include/linux/cpuhotplug.h  |   1 +
> >  6 files changed, 438 insertions(+)
> >  create mode 100644 drivers/perf/uncore/Makefile
> >  create mode 100644 drivers/perf/uncore/uncore_cavium.c
> >  create mode 100644 drivers/perf/uncore/uncore_cavium.h
> > 
> > diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> > index 4d5c5f9..3266c87 100644
> > --- a/drivers/perf/Kconfig
> > +++ b/drivers/perf/Kconfig
> > @@ -19,4 +19,17 @@ config XGENE_PMU
> >  help
> >Say y if you want to use APM X-Gene SoC performance monitors.
> >  
> > +config UNCORE_PMU
> > +   bool
> 
> This isn't needed.

I thought about a symbol for all uncore pmus. But when drivers/perf is
already that it can be removed.

> > +
> > +config UNCORE_PMU_CAVIUM
> > +   depends on PERF_EVENTS && NUMA && ARM64
> > +   bool "Cavium uncore PMU support"
> 
> Please mentioned Thunder somewhere, since that's the SoC being supported.

OK.

> > +   select UNCORE_PMU
> > +   default y
> > +   help
> > + Say y if you want to access performance counters of subsystems
> > + on a Cavium SOC like cache controller, memory controller or
> > + processor interconnect.
> > +
> >  endmenu
> > diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> > index b116e98..d6c02c9 100644
> > --- a/drivers/perf/Makefile
> > +++ b/drivers/perf/Makefile
> > @@ -1,2 +1,3 @@
> >  obj-$(CONFIG_ARM_PMU) += arm_pmu.o
> >  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
> > +obj-y += uncore/
> > diff --git a/drivers/perf/uncore/Makefile b/drivers/perf/uncore/Makefile
> > new file mode 100644
> > index 000..6130e18
> > --- /dev/null
> > +++ b/drivers/perf/uncore/Makefile
> > @@ -0,0 +1 @@
> > +obj-$(CONFIG_UNCORE_PMU_CAVIUM) += uncore_cavium.o
> > diff --git a/drivers/perf/uncore/uncore_cavium.c 
> > b/drivers/perf/uncore/uncore_cavium.c
> > new file mode 100644
> > index 000..a7b4277
> > --- /dev/null
> > +++ b/drivers/perf/uncore/uncore_cavium.c
> > @@ -0,0 +1,351 @@
> > +/*
> > + * Cavium Thunder uncore PMU support.
> > + *
> > + * Copyright (C) 2015,2016 Cavium Inc.
> > + * Author: Jan Glauber 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include "uncore_cavium.h"
> > +
> > +/*
> > + * Some notes about the various counters supported by this "uncore" PMU
> > + * and the design:
> > + *
> > + * All counters are 64 bit long.
> > + * There are no overflow interrupts.
> > + * Counters are summarized per node/socket.
> > + * Most devices appear as separate PCI devices per socket with the 
> > exception
> > + * of OCX TLK which appears as one PCI device per socket and contains 
> > several
> > + * units with counters that are merged.
> > + * Some 

Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-11 Thread Jan Glauber
Hi Mark,

thanks for reviewing. One question below, for most of your other comments
I think we need to come to a conclusion about the aggregation first.

On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:
> Hi Jan,
> 
> Apologies for the delay in getting to this.
> 
> On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > diff --git a/drivers/perf/uncore/uncore_cavium.c 
> > b/drivers/perf/uncore/uncore_cavium.c
> > new file mode 100644
> > index 000..a7b4277
> > --- /dev/null
> > +++ b/drivers/perf/uncore/uncore_cavium.c
> > @@ -0,0 +1,351 @@
> > +/*
> > + * Cavium Thunder uncore PMU support.
> > + *
> > + * Copyright (C) 2015,2016 Cavium Inc.
> > + * Author: Jan Glauber 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> 
> I believe the following includes are necessary for APIs and/or data
> explicitly referenced by the driver code:
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #include 
> 
> ... please add those here.

Should I also add includes that are already in the included by uncore_cavium.h?
I usually avoid includes that come through the "local" header file.


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-11 Thread Jan Glauber
Hi Mark,

thanks for reviewing. One question below, for most of your other comments
I think we need to come to a conclusion about the aggregation first.

On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:
> Hi Jan,
> 
> Apologies for the delay in getting to this.
> 
> On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > diff --git a/drivers/perf/uncore/uncore_cavium.c 
> > b/drivers/perf/uncore/uncore_cavium.c
> > new file mode 100644
> > index 000..a7b4277
> > --- /dev/null
> > +++ b/drivers/perf/uncore/uncore_cavium.c
> > @@ -0,0 +1,351 @@
> > +/*
> > + * Cavium Thunder uncore PMU support.
> > + *
> > + * Copyright (C) 2015,2016 Cavium Inc.
> > + * Author: Jan Glauber 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> 
> I believe the following includes are necessary for APIs and/or data
> explicitly referenced by the driver code:
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #include 
> 
> ... please add those here.

Should I also add includes that are already in the included by uncore_cavium.h?
I usually avoid includes that come through the "local" header file.


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-10 Thread Jan Glauber
On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:
> > +/*
> > + * Some notes about the various counters supported by this "uncore" PMU
> > + * and the design:
> > + *
> > + * All counters are 64 bit long.
> > + * There are no overflow interrupts.
> > + * Counters are summarized per node/socket.
> > + * Most devices appear as separate PCI devices per socket with the 
> > exception
> > + * of OCX TLK which appears as one PCI device per socket and contains 
> > several
> > + * units with counters that are merged.
> 
> As a general note, as I commented on the QC L2 PMU driver [1,2], we need
> to figure out if we should be aggregating physical PMUs or not.

As said before, although it would be possible to create separate PMUs
for each unit, the individual counters are not interesting. For example
we are not interested in individual counters of Tag-and-data unit 0..7,
we just want the global view.

> Judging by subsequent patches, each unit has individual counters and
> controls, and thus we cannot atomically read/write counters or controls
> across them. As such, I do not think we should aggregate them, and
> should expose them separately to userspace.

That sounds like just moving the problem of aggregating the counters to
user-space. And would make the results even worse, if the user needs
several calls to summarize the counters, given how slow a perf counter
read is.


> That will simplify a number of things (e.g. the CPU migration code no
> longer has to iterate over a list of units).

Sure, it simplifies the kernel part, but it moves the cost to the user.


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-10 Thread Jan Glauber
On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:
> > +/*
> > + * Some notes about the various counters supported by this "uncore" PMU
> > + * and the design:
> > + *
> > + * All counters are 64 bit long.
> > + * There are no overflow interrupts.
> > + * Counters are summarized per node/socket.
> > + * Most devices appear as separate PCI devices per socket with the 
> > exception
> > + * of OCX TLK which appears as one PCI device per socket and contains 
> > several
> > + * units with counters that are merged.
> 
> As a general note, as I commented on the QC L2 PMU driver [1,2], we need
> to figure out if we should be aggregating physical PMUs or not.

As said before, although it would be possible to create separate PMUs
for each unit, the individual counters are not interesting. For example
we are not interested in individual counters of Tag-and-data unit 0..7,
we just want the global view.

> Judging by subsequent patches, each unit has individual counters and
> controls, and thus we cannot atomically read/write counters or controls
> across them. As such, I do not think we should aggregate them, and
> should expose them separately to userspace.

That sounds like just moving the problem of aggregating the counters to
user-space. And would make the results even worse, if the user needs
several calls to summarize the counters, given how slow a perf counter
read is.


> That will simplify a number of things (e.g. the CPU migration code no
> longer has to iterate over a list of units).

Sure, it simplifies the kernel part, but it moves the cost to the user.


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-10 Thread Will Deacon
On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:
> On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > diff --git a/drivers/perf/uncore/uncore_cavium.c 
> > b/drivers/perf/uncore/uncore_cavium.c
> > new file mode 100644
> > index 000..a7b4277
> > --- /dev/null
> > +++ b/drivers/perf/uncore/uncore_cavium.c
> > + * Some notes about the various counters supported by this "uncore" PMU
> > + * and the design:
> > + *
> > + * All counters are 64 bit long.
> > + * There are no overflow interrupts.
> > + * Counters are summarized per node/socket.
> > + * Most devices appear as separate PCI devices per socket with the 
> > exception
> > + * of OCX TLK which appears as one PCI device per socket and contains 
> > several
> > + * units with counters that are merged.
> 
> As a general note, as I commented on the QC L2 PMU driver [1,2], we need
> to figure out if we should be aggregating physical PMUs or not.
> 
> Judging by subsequent patches, each unit has individual counters and
> controls, and thus we cannot atomically read/write counters or controls
> across them. As such, I do not think we should aggregate them, and
> should expose them separately to userspace.

I thought each unit was registered as a separate PMU to perf? Or are you
specifically commenting on the OCX TLK? The comment there suggests that
the units cannot be individually enabled/disabled and, without docs, I
trust that's the case.

Will


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-10 Thread Will Deacon
On Thu, Nov 10, 2016 at 04:54:06PM +, Mark Rutland wrote:
> On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> > diff --git a/drivers/perf/uncore/uncore_cavium.c 
> > b/drivers/perf/uncore/uncore_cavium.c
> > new file mode 100644
> > index 000..a7b4277
> > --- /dev/null
> > +++ b/drivers/perf/uncore/uncore_cavium.c
> > + * Some notes about the various counters supported by this "uncore" PMU
> > + * and the design:
> > + *
> > + * All counters are 64 bit long.
> > + * There are no overflow interrupts.
> > + * Counters are summarized per node/socket.
> > + * Most devices appear as separate PCI devices per socket with the 
> > exception
> > + * of OCX TLK which appears as one PCI device per socket and contains 
> > several
> > + * units with counters that are merged.
> 
> As a general note, as I commented on the QC L2 PMU driver [1,2], we need
> to figure out if we should be aggregating physical PMUs or not.
> 
> Judging by subsequent patches, each unit has individual counters and
> controls, and thus we cannot atomically read/write counters or controls
> across them. As such, I do not think we should aggregate them, and
> should expose them separately to userspace.

I thought each unit was registered as a separate PMU to perf? Or are you
specifically commenting on the OCX TLK? The comment there suggests that
the units cannot be individually enabled/disabled and, without docs, I
trust that's the case.

Will


Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-10 Thread Mark Rutland
Hi Jan,

Apologies for the delay in getting to this.

On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> diff --git a/drivers/perf/uncore/uncore_cavium.c 
> b/drivers/perf/uncore/uncore_cavium.c
> new file mode 100644
> index 000..a7b4277
> --- /dev/null
> +++ b/drivers/perf/uncore/uncore_cavium.c
> @@ -0,0 +1,351 @@
> +/*
> + * Cavium Thunder uncore PMU support.
> + *
> + * Copyright (C) 2015,2016 Cavium Inc.
> + * Author: Jan Glauber 
> + */
> +
> +#include 
> +#include 
> +#include 

I believe the following includes are necessary for APIs and/or data
explicitly referenced by the driver code:

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 

... please add those here.

[...]

> +/*
> + * Some notes about the various counters supported by this "uncore" PMU
> + * and the design:
> + *
> + * All counters are 64 bit long.
> + * There are no overflow interrupts.
> + * Counters are summarized per node/socket.
> + * Most devices appear as separate PCI devices per socket with the exception
> + * of OCX TLK which appears as one PCI device per socket and contains several
> + * units with counters that are merged.

As a general note, as I commented on the QC L2 PMU driver [1,2], we need
to figure out if we should be aggregating physical PMUs or not.

Judging by subsequent patches, each unit has individual counters and
controls, and thus we cannot atomically read/write counters or controls
across them. As such, I do not think we should aggregate them, and
should expose them separately to userspace.

That will simplify a number of things (e.g. the CPU migration code no
longer has to iterate over a list of units).

> + * Some counters are selected via a control register (L2C TAD) and read by
> + * a number of counter registers, others (L2C CBC, LMC & OCX TLK) have
> + * one dedicated counter per event.
> + * Some counters are not stoppable (L2C CBC & LMC).
> + * Some counters are read-only (LMC).
> + * All counters belong to PCI devices, the devices may have additional
> + * drivers but we assume we are the only user of the counter registers.
> + * We map the whole PCI BAR so we must be careful to forbid access to
> + * addresses that contain neither counters nor counter control registers.
> + */
> +
> +void thunder_uncore_read(struct perf_event *event)
> +{
> + struct thunder_uncore *uncore = to_uncore(event->pmu);
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore_node *node;
> + struct thunder_uncore_unit *unit;
> + u64 prev, delta, new = 0;
> +
> + node = get_node(hwc->config, uncore);
> +
> + /* read counter values from all units on the node */
> + list_for_each_entry(unit, >unit_list, entry)
> + new += readq(hwc->event_base + unit->map);
> +
> + prev = local64_read(>prev_count);
> + local64_set(>prev_count, new);
> + delta = new - prev;
> + local64_add(delta, >count);
> +}
> +
> +int thunder_uncore_add(struct perf_event *event, int flags, u64 config_base,
> +u64 event_base)
> +{
> + struct thunder_uncore *uncore = to_uncore(event->pmu);
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore_node *node;
> + int id;
> +
> + node = get_node(hwc->config, uncore);
> + id = get_id(hwc->config);
> +
> + if (!cmpxchg(>events[id], NULL, event))
> + hwc->idx = id;

Judging by thunder_uncore_event_init() and get_id(), the specific
counter to use is chosen by the user, rather than allocated as
necessary. Yet the block comment before thunder_uncore_read() said only
some events have a dedicated counter.

This does not seem right. Why are we not choosing a relevant counter
dynamically?

As Will commented, we shouldn't need a full-barrier cmpxchg() here; the
pmu::{add,del} are serialised by the core perf code as ctx->lock has to
be held (and we have no interrupt to worry about). If we want to use
cmpxchg() for convenience, it can be a cmpxchg_relaxed().

> + if (hwc->idx == -1)
> + return -EBUSY;
> +
> + hwc->config_base = config_base;
> + hwc->event_base = event_base;
> + hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
> +
> + if (flags & PERF_EF_START)
> + uncore->pmu.start(event, PERF_EF_RELOAD);
> +
> + return 0;
> +}
> +
> +void thunder_uncore_del(struct perf_event *event, int flags)
> +{
> + struct thunder_uncore *uncore = to_uncore(event->pmu);
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore_node *node;
> + int i;
> +
> + event->pmu->stop(event, PERF_EF_UPDATE);
> +
> + /*
> +  * For programmable counters we need to check where we installed it.
> +  * To keep this function generic always test the more complicated
> +  * case (free running counters won't need the loop).
> +  */
> + node = get_node(hwc->config, uncore);
> + for (i = 

Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-10 Thread Mark Rutland
Hi Jan,

Apologies for the delay in getting to this.

On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> diff --git a/drivers/perf/uncore/uncore_cavium.c 
> b/drivers/perf/uncore/uncore_cavium.c
> new file mode 100644
> index 000..a7b4277
> --- /dev/null
> +++ b/drivers/perf/uncore/uncore_cavium.c
> @@ -0,0 +1,351 @@
> +/*
> + * Cavium Thunder uncore PMU support.
> + *
> + * Copyright (C) 2015,2016 Cavium Inc.
> + * Author: Jan Glauber 
> + */
> +
> +#include 
> +#include 
> +#include 

I believe the following includes are necessary for APIs and/or data
explicitly referenced by the driver code:

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 

... please add those here.

[...]

> +/*
> + * Some notes about the various counters supported by this "uncore" PMU
> + * and the design:
> + *
> + * All counters are 64 bit long.
> + * There are no overflow interrupts.
> + * Counters are summarized per node/socket.
> + * Most devices appear as separate PCI devices per socket with the exception
> + * of OCX TLK which appears as one PCI device per socket and contains several
> + * units with counters that are merged.

As a general note, as I commented on the QC L2 PMU driver [1,2], we need
to figure out if we should be aggregating physical PMUs or not.

Judging by subsequent patches, each unit has individual counters and
controls, and thus we cannot atomically read/write counters or controls
across them. As such, I do not think we should aggregate them, and
should expose them separately to userspace.

That will simplify a number of things (e.g. the CPU migration code no
longer has to iterate over a list of units).

> + * Some counters are selected via a control register (L2C TAD) and read by
> + * a number of counter registers, others (L2C CBC, LMC & OCX TLK) have
> + * one dedicated counter per event.
> + * Some counters are not stoppable (L2C CBC & LMC).
> + * Some counters are read-only (LMC).
> + * All counters belong to PCI devices, the devices may have additional
> + * drivers but we assume we are the only user of the counter registers.
> + * We map the whole PCI BAR so we must be careful to forbid access to
> + * addresses that contain neither counters nor counter control registers.
> + */
> +
> +void thunder_uncore_read(struct perf_event *event)
> +{
> + struct thunder_uncore *uncore = to_uncore(event->pmu);
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore_node *node;
> + struct thunder_uncore_unit *unit;
> + u64 prev, delta, new = 0;
> +
> + node = get_node(hwc->config, uncore);
> +
> + /* read counter values from all units on the node */
> + list_for_each_entry(unit, >unit_list, entry)
> + new += readq(hwc->event_base + unit->map);
> +
> + prev = local64_read(>prev_count);
> + local64_set(>prev_count, new);
> + delta = new - prev;
> + local64_add(delta, >count);
> +}
> +
> +int thunder_uncore_add(struct perf_event *event, int flags, u64 config_base,
> +u64 event_base)
> +{
> + struct thunder_uncore *uncore = to_uncore(event->pmu);
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore_node *node;
> + int id;
> +
> + node = get_node(hwc->config, uncore);
> + id = get_id(hwc->config);
> +
> + if (!cmpxchg(>events[id], NULL, event))
> + hwc->idx = id;

Judging by thunder_uncore_event_init() and get_id(), the specific
counter to use is chosen by the user, rather than allocated as
necessary. Yet the block comment before thunder_uncore_read() said only
some events have a dedicated counter.

This does not seem right. Why are we not choosing a relevant counter
dynamically?

As Will commented, we shouldn't need a full-barrier cmpxchg() here; the
pmu::{add,del} are serialised by the core perf code as ctx->lock has to
be held (and we have no interrupt to worry about). If we want to use
cmpxchg() for convenience, it can be a cmpxchg_relaxed().

> + if (hwc->idx == -1)
> + return -EBUSY;
> +
> + hwc->config_base = config_base;
> + hwc->event_base = event_base;
> + hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
> +
> + if (flags & PERF_EF_START)
> + uncore->pmu.start(event, PERF_EF_RELOAD);
> +
> + return 0;
> +}
> +
> +void thunder_uncore_del(struct perf_event *event, int flags)
> +{
> + struct thunder_uncore *uncore = to_uncore(event->pmu);
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore_node *node;
> + int i;
> +
> + event->pmu->stop(event, PERF_EF_UPDATE);
> +
> + /*
> +  * For programmable counters we need to check where we installed it.
> +  * To keep this function generic always test the more complicated
> +  * case (free running counters won't need the loop).
> +  */
> + node = get_node(hwc->config, uncore);
> + for (i = 0; i < 

Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-08 Thread Will Deacon
Hi Jan,

Thanks for posting an updated series. I have a few minor comments, which
we can hopefully address in time for 4.10.

Also, have you run the perf fuzzer with this series applied?

https://github.com/deater/perf_event_tests

(build the tests and look under the fuzzer/ directory for the tool)

On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> Provide "uncore" facilities for different non-CPU performance
> counter units.
> 
> The uncore PMUs can be found under /sys/bus/event_source/devices.
> All counters are exported via sysfs in the corresponding events
> files under the PMU directory so the perf tool can list the event names.
> 
> There are some points that are special in this implementation:
> 
> 1) The PMU detection relies on PCI device detection. If a
>matching PCI device is found the PMU is created. The code can deal
>with multiple units of the same type, e.g. more than one memory
>controller.
> 
> 2) Counters are summarized across different units of the same type
>on one NUMA node but not across NUMA nodes.
>For instance L2C TAD 0..7 are presented as a single counter
>(adding the values from TAD 0 to 7). Although losing the ability
>to read a single value the merged values are easier to use.
> 
> 3) The counters are not CPU related. A random CPU is picked regardless
>of the NUMA node. There is a small performance penalty for accessing
>counters on a remote note but reading a performance counter is a
>slow operation anyway.
> 
> Signed-off-by: Jan Glauber 
> ---
>  drivers/perf/Kconfig|  13 ++
>  drivers/perf/Makefile   |   1 +
>  drivers/perf/uncore/Makefile|   1 +
>  drivers/perf/uncore/uncore_cavium.c | 351 
> 
>  drivers/perf/uncore/uncore_cavium.h |  71 

We already have "uncore" PMUs under drivers/perf, so I'd prefer that we
renamed this a bit to reflect better what's going on. How about:

  drivers/perf/cavium/

and then

  drivers/perf/cavium/uncore_thunder.[ch]

?

>  include/linux/cpuhotplug.h  |   1 +
>  6 files changed, 438 insertions(+)
>  create mode 100644 drivers/perf/uncore/Makefile
>  create mode 100644 drivers/perf/uncore/uncore_cavium.c
>  create mode 100644 drivers/perf/uncore/uncore_cavium.h
> 
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index 4d5c5f9..3266c87 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -19,4 +19,17 @@ config XGENE_PMU
>  help
>Say y if you want to use APM X-Gene SoC performance monitors.
>  
> +config UNCORE_PMU
> + bool

This isn't needed.

> +
> +config UNCORE_PMU_CAVIUM
> + depends on PERF_EVENTS && NUMA && ARM64
> + bool "Cavium uncore PMU support"

Please mentioned Thunder somewhere, since that's the SoC being supported.

> + select UNCORE_PMU
> + default y
> + help
> +   Say y if you want to access performance counters of subsystems
> +   on a Cavium SOC like cache controller, memory controller or
> +   processor interconnect.
> +
>  endmenu
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index b116e98..d6c02c9 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -1,2 +1,3 @@
>  obj-$(CONFIG_ARM_PMU) += arm_pmu.o
>  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
> +obj-y += uncore/
> diff --git a/drivers/perf/uncore/Makefile b/drivers/perf/uncore/Makefile
> new file mode 100644
> index 000..6130e18
> --- /dev/null
> +++ b/drivers/perf/uncore/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_UNCORE_PMU_CAVIUM) += uncore_cavium.o
> diff --git a/drivers/perf/uncore/uncore_cavium.c 
> b/drivers/perf/uncore/uncore_cavium.c
> new file mode 100644
> index 000..a7b4277
> --- /dev/null
> +++ b/drivers/perf/uncore/uncore_cavium.c
> @@ -0,0 +1,351 @@
> +/*
> + * Cavium Thunder uncore PMU support.
> + *
> + * Copyright (C) 2015,2016 Cavium Inc.
> + * Author: Jan Glauber 
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "uncore_cavium.h"
> +
> +/*
> + * Some notes about the various counters supported by this "uncore" PMU
> + * and the design:
> + *
> + * All counters are 64 bit long.
> + * There are no overflow interrupts.
> + * Counters are summarized per node/socket.
> + * Most devices appear as separate PCI devices per socket with the exception
> + * of OCX TLK which appears as one PCI device per socket and contains several
> + * units with counters that are merged.
> + * Some counters are selected via a control register (L2C TAD) and read by
> + * a number of counter registers, others (L2C CBC, LMC & OCX TLK) have
> + * one dedicated counter per event.
> + * Some counters are not stoppable (L2C CBC & LMC).
> + * Some counters are read-only (LMC).
> + * All counters belong to PCI devices, the devices may have additional
> + * drivers but we assume we are the only user of the counter registers.
> + * We map the whole PCI BAR so we must be 

Re: [PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-11-08 Thread Will Deacon
Hi Jan,

Thanks for posting an updated series. I have a few minor comments, which
we can hopefully address in time for 4.10.

Also, have you run the perf fuzzer with this series applied?

https://github.com/deater/perf_event_tests

(build the tests and look under the fuzzer/ directory for the tool)

On Sat, Oct 29, 2016 at 01:55:29PM +0200, Jan Glauber wrote:
> Provide "uncore" facilities for different non-CPU performance
> counter units.
> 
> The uncore PMUs can be found under /sys/bus/event_source/devices.
> All counters are exported via sysfs in the corresponding events
> files under the PMU directory so the perf tool can list the event names.
> 
> There are some points that are special in this implementation:
> 
> 1) The PMU detection relies on PCI device detection. If a
>matching PCI device is found the PMU is created. The code can deal
>with multiple units of the same type, e.g. more than one memory
>controller.
> 
> 2) Counters are summarized across different units of the same type
>on one NUMA node but not across NUMA nodes.
>For instance L2C TAD 0..7 are presented as a single counter
>(adding the values from TAD 0 to 7). Although losing the ability
>to read a single value the merged values are easier to use.
> 
> 3) The counters are not CPU related. A random CPU is picked regardless
>of the NUMA node. There is a small performance penalty for accessing
>counters on a remote note but reading a performance counter is a
>slow operation anyway.
> 
> Signed-off-by: Jan Glauber 
> ---
>  drivers/perf/Kconfig|  13 ++
>  drivers/perf/Makefile   |   1 +
>  drivers/perf/uncore/Makefile|   1 +
>  drivers/perf/uncore/uncore_cavium.c | 351 
> 
>  drivers/perf/uncore/uncore_cavium.h |  71 

We already have "uncore" PMUs under drivers/perf, so I'd prefer that we
renamed this a bit to reflect better what's going on. How about:

  drivers/perf/cavium/

and then

  drivers/perf/cavium/uncore_thunder.[ch]

?

>  include/linux/cpuhotplug.h  |   1 +
>  6 files changed, 438 insertions(+)
>  create mode 100644 drivers/perf/uncore/Makefile
>  create mode 100644 drivers/perf/uncore/uncore_cavium.c
>  create mode 100644 drivers/perf/uncore/uncore_cavium.h
> 
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index 4d5c5f9..3266c87 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -19,4 +19,17 @@ config XGENE_PMU
>  help
>Say y if you want to use APM X-Gene SoC performance monitors.
>  
> +config UNCORE_PMU
> + bool

This isn't needed.

> +
> +config UNCORE_PMU_CAVIUM
> + depends on PERF_EVENTS && NUMA && ARM64
> + bool "Cavium uncore PMU support"

Please mentioned Thunder somewhere, since that's the SoC being supported.

> + select UNCORE_PMU
> + default y
> + help
> +   Say y if you want to access performance counters of subsystems
> +   on a Cavium SOC like cache controller, memory controller or
> +   processor interconnect.
> +
>  endmenu
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index b116e98..d6c02c9 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -1,2 +1,3 @@
>  obj-$(CONFIG_ARM_PMU) += arm_pmu.o
>  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
> +obj-y += uncore/
> diff --git a/drivers/perf/uncore/Makefile b/drivers/perf/uncore/Makefile
> new file mode 100644
> index 000..6130e18
> --- /dev/null
> +++ b/drivers/perf/uncore/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_UNCORE_PMU_CAVIUM) += uncore_cavium.o
> diff --git a/drivers/perf/uncore/uncore_cavium.c 
> b/drivers/perf/uncore/uncore_cavium.c
> new file mode 100644
> index 000..a7b4277
> --- /dev/null
> +++ b/drivers/perf/uncore/uncore_cavium.c
> @@ -0,0 +1,351 @@
> +/*
> + * Cavium Thunder uncore PMU support.
> + *
> + * Copyright (C) 2015,2016 Cavium Inc.
> + * Author: Jan Glauber 
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "uncore_cavium.h"
> +
> +/*
> + * Some notes about the various counters supported by this "uncore" PMU
> + * and the design:
> + *
> + * All counters are 64 bit long.
> + * There are no overflow interrupts.
> + * Counters are summarized per node/socket.
> + * Most devices appear as separate PCI devices per socket with the exception
> + * of OCX TLK which appears as one PCI device per socket and contains several
> + * units with counters that are merged.
> + * Some counters are selected via a control register (L2C TAD) and read by
> + * a number of counter registers, others (L2C CBC, LMC & OCX TLK) have
> + * one dedicated counter per event.
> + * Some counters are not stoppable (L2C CBC & LMC).
> + * Some counters are read-only (LMC).
> + * All counters belong to PCI devices, the devices may have additional
> + * drivers but we assume we are the only user of the counter registers.
> + * We map the whole PCI BAR so we must be careful to forbid access to
> + * addresses 

[PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-10-29 Thread Jan Glauber
Provide "uncore" facilities for different non-CPU performance
counter units.

The uncore PMUs can be found under /sys/bus/event_source/devices.
All counters are exported via sysfs in the corresponding events
files under the PMU directory so the perf tool can list the event names.

There are some points that are special in this implementation:

1) The PMU detection relies on PCI device detection. If a
   matching PCI device is found the PMU is created. The code can deal
   with multiple units of the same type, e.g. more than one memory
   controller.

2) Counters are summarized across different units of the same type
   on one NUMA node but not across NUMA nodes.
   For instance L2C TAD 0..7 are presented as a single counter
   (adding the values from TAD 0 to 7). Although losing the ability
   to read a single value the merged values are easier to use.

3) The counters are not CPU related. A random CPU is picked regardless
   of the NUMA node. There is a small performance penalty for accessing
   counters on a remote note but reading a performance counter is a
   slow operation anyway.

Signed-off-by: Jan Glauber 
---
 drivers/perf/Kconfig|  13 ++
 drivers/perf/Makefile   |   1 +
 drivers/perf/uncore/Makefile|   1 +
 drivers/perf/uncore/uncore_cavium.c | 351 
 drivers/perf/uncore/uncore_cavium.h |  71 
 include/linux/cpuhotplug.h  |   1 +
 6 files changed, 438 insertions(+)
 create mode 100644 drivers/perf/uncore/Makefile
 create mode 100644 drivers/perf/uncore/uncore_cavium.c
 create mode 100644 drivers/perf/uncore/uncore_cavium.h

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 4d5c5f9..3266c87 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -19,4 +19,17 @@ config XGENE_PMU
 help
   Say y if you want to use APM X-Gene SoC performance monitors.
 
+config UNCORE_PMU
+   bool
+
+config UNCORE_PMU_CAVIUM
+   depends on PERF_EVENTS && NUMA && ARM64
+   bool "Cavium uncore PMU support"
+   select UNCORE_PMU
+   default y
+   help
+ Say y if you want to access performance counters of subsystems
+ on a Cavium SOC like cache controller, memory controller or
+ processor interconnect.
+
 endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index b116e98..d6c02c9 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -1,2 +1,3 @@
 obj-$(CONFIG_ARM_PMU) += arm_pmu.o
 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
+obj-y += uncore/
diff --git a/drivers/perf/uncore/Makefile b/drivers/perf/uncore/Makefile
new file mode 100644
index 000..6130e18
--- /dev/null
+++ b/drivers/perf/uncore/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UNCORE_PMU_CAVIUM) += uncore_cavium.o
diff --git a/drivers/perf/uncore/uncore_cavium.c 
b/drivers/perf/uncore/uncore_cavium.c
new file mode 100644
index 000..a7b4277
--- /dev/null
+++ b/drivers/perf/uncore/uncore_cavium.c
@@ -0,0 +1,351 @@
+/*
+ * Cavium Thunder uncore PMU support.
+ *
+ * Copyright (C) 2015,2016 Cavium Inc.
+ * Author: Jan Glauber 
+ */
+
+#include 
+#include 
+#include 
+
+#include "uncore_cavium.h"
+
+/*
+ * Some notes about the various counters supported by this "uncore" PMU
+ * and the design:
+ *
+ * All counters are 64 bit long.
+ * There are no overflow interrupts.
+ * Counters are summarized per node/socket.
+ * Most devices appear as separate PCI devices per socket with the exception
+ * of OCX TLK which appears as one PCI device per socket and contains several
+ * units with counters that are merged.
+ * Some counters are selected via a control register (L2C TAD) and read by
+ * a number of counter registers, others (L2C CBC, LMC & OCX TLK) have
+ * one dedicated counter per event.
+ * Some counters are not stoppable (L2C CBC & LMC).
+ * Some counters are read-only (LMC).
+ * All counters belong to PCI devices, the devices may have additional
+ * drivers but we assume we are the only user of the counter registers.
+ * We map the whole PCI BAR so we must be careful to forbid access to
+ * addresses that contain neither counters nor counter control registers.
+ */
+
+void thunder_uncore_read(struct perf_event *event)
+{
+   struct thunder_uncore *uncore = to_uncore(event->pmu);
+   struct hw_perf_event *hwc = >hw;
+   struct thunder_uncore_node *node;
+   struct thunder_uncore_unit *unit;
+   u64 prev, delta, new = 0;
+
+   node = get_node(hwc->config, uncore);
+
+   /* read counter values from all units on the node */
+   list_for_each_entry(unit, >unit_list, entry)
+   new += readq(hwc->event_base + unit->map);
+
+   prev = local64_read(>prev_count);
+   local64_set(>prev_count, new);
+   delta = new - prev;
+   local64_add(delta, >count);
+}
+
+int thunder_uncore_add(struct perf_event *event, int flags, u64 config_base,
+  u64 event_base)
+{
+  

[PATCH v4 1/5] arm64: perf: Basic uncore counter support for Cavium ThunderX SOC

2016-10-29 Thread Jan Glauber
Provide "uncore" facilities for different non-CPU performance
counter units.

The uncore PMUs can be found under /sys/bus/event_source/devices.
All counters are exported via sysfs in the corresponding events
files under the PMU directory so the perf tool can list the event names.

There are some points that are special in this implementation:

1) The PMU detection relies on PCI device detection. If a
   matching PCI device is found the PMU is created. The code can deal
   with multiple units of the same type, e.g. more than one memory
   controller.

2) Counters are summarized across different units of the same type
   on one NUMA node but not across NUMA nodes.
   For instance L2C TAD 0..7 are presented as a single counter
   (adding the values from TAD 0 to 7). Although losing the ability
   to read a single value the merged values are easier to use.

3) The counters are not CPU related. A random CPU is picked regardless
   of the NUMA node. There is a small performance penalty for accessing
   counters on a remote note but reading a performance counter is a
   slow operation anyway.

Signed-off-by: Jan Glauber 
---
 drivers/perf/Kconfig|  13 ++
 drivers/perf/Makefile   |   1 +
 drivers/perf/uncore/Makefile|   1 +
 drivers/perf/uncore/uncore_cavium.c | 351 
 drivers/perf/uncore/uncore_cavium.h |  71 
 include/linux/cpuhotplug.h  |   1 +
 6 files changed, 438 insertions(+)
 create mode 100644 drivers/perf/uncore/Makefile
 create mode 100644 drivers/perf/uncore/uncore_cavium.c
 create mode 100644 drivers/perf/uncore/uncore_cavium.h

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 4d5c5f9..3266c87 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -19,4 +19,17 @@ config XGENE_PMU
 help
   Say y if you want to use APM X-Gene SoC performance monitors.
 
+config UNCORE_PMU
+   bool
+
+config UNCORE_PMU_CAVIUM
+   depends on PERF_EVENTS && NUMA && ARM64
+   bool "Cavium uncore PMU support"
+   select UNCORE_PMU
+   default y
+   help
+ Say y if you want to access performance counters of subsystems
+ on a Cavium SOC like cache controller, memory controller or
+ processor interconnect.
+
 endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index b116e98..d6c02c9 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -1,2 +1,3 @@
 obj-$(CONFIG_ARM_PMU) += arm_pmu.o
 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
+obj-y += uncore/
diff --git a/drivers/perf/uncore/Makefile b/drivers/perf/uncore/Makefile
new file mode 100644
index 000..6130e18
--- /dev/null
+++ b/drivers/perf/uncore/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UNCORE_PMU_CAVIUM) += uncore_cavium.o
diff --git a/drivers/perf/uncore/uncore_cavium.c 
b/drivers/perf/uncore/uncore_cavium.c
new file mode 100644
index 000..a7b4277
--- /dev/null
+++ b/drivers/perf/uncore/uncore_cavium.c
@@ -0,0 +1,351 @@
+/*
+ * Cavium Thunder uncore PMU support.
+ *
+ * Copyright (C) 2015,2016 Cavium Inc.
+ * Author: Jan Glauber 
+ */
+
+#include 
+#include 
+#include 
+
+#include "uncore_cavium.h"
+
+/*
+ * Some notes about the various counters supported by this "uncore" PMU
+ * and the design:
+ *
+ * All counters are 64 bit long.
+ * There are no overflow interrupts.
+ * Counters are summarized per node/socket.
+ * Most devices appear as separate PCI devices per socket with the exception
+ * of OCX TLK which appears as one PCI device per socket and contains several
+ * units with counters that are merged.
+ * Some counters are selected via a control register (L2C TAD) and read by
+ * a number of counter registers, others (L2C CBC, LMC & OCX TLK) have
+ * one dedicated counter per event.
+ * Some counters are not stoppable (L2C CBC & LMC).
+ * Some counters are read-only (LMC).
+ * All counters belong to PCI devices, the devices may have additional
+ * drivers but we assume we are the only user of the counter registers.
+ * We map the whole PCI BAR so we must be careful to forbid access to
+ * addresses that contain neither counters nor counter control registers.
+ */
+
+void thunder_uncore_read(struct perf_event *event)
+{
+   struct thunder_uncore *uncore = to_uncore(event->pmu);
+   struct hw_perf_event *hwc = >hw;
+   struct thunder_uncore_node *node;
+   struct thunder_uncore_unit *unit;
+   u64 prev, delta, new = 0;
+
+   node = get_node(hwc->config, uncore);
+
+   /* read counter values from all units on the node */
+   list_for_each_entry(unit, >unit_list, entry)
+   new += readq(hwc->event_base + unit->map);
+
+   prev = local64_read(>prev_count);
+   local64_set(>prev_count, new);
+   delta = new - prev;
+   local64_add(delta, >count);
+}
+
+int thunder_uncore_add(struct perf_event *event, int flags, u64 config_base,
+  u64 event_base)
+{
+   struct thunder_uncore *uncore =