Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-04-02 Thread Mark Rutland
> >>   struct arm_pmu {
> >> @@ -117,6 +125,10 @@ struct arm_pmu {
> >>  struct platform_device  *plat_device;
> >>  struct pmu_hw_events__percpu *hw_events;
> >>  struct notifier_block   hotplug_nb;
> >> +#ifdef CONFIG_SMP
> >> +   int muxed_spi_workaround_irq;
> >
> > There's nothing workaround specific about this IRQ; it's just the only
> > IRQ.
> >
> > I think we should just pre-parse all the IRQs into a list at probe time,
> > regardless of SMP or the workaround. Then we can just grab the first
> > (and only) interrupt from the list in the workaround paths, and
> > otherwise just iterate over the list.
> 
> What other uses do you anticipate for such a list? I don't really
> understand why we would create a list when we only ever consume the
> first entry.

It would go parallel to the affinity property Will recently added, and
would allow us to separate the DT parsing from the request/free paths,
making those far easier to understand (and likely a little faster too).

> If there is a use for such information then why keep it as a list.
> Wouldn't the code be simpler if we add it as a field in the percpu data
> structure?
> 
> For PPIs all values would be the same, for SPIs all different, for
> broken SoCs all after [0] would be 0.

That's a nice idea. That would make the affinity info implicit, and
could be far neater.

> >> +   atomic_tremaining_irq_work;
> >
> > Perhaps remaining_work_irqs? That would make it clear that this is a
> > counter rather than a boolean or enumeration. We could s/work/fake/ or
> > something to that effect.
> 
> I guess the thing we are actually counting is CPUs so
> remaining_cpus_with_irq_work would therefore be extremely descriptive.

It would be nice if we could come up with something that was a little
more concise. Perhaps cpus_with_work would be good enough?

> >> +}
> >> +
> >> +/*
> >> + * Workaround for systems where all PMU interrupts are targeting a
> >> + * single SPI.
> >> + *
> >> + * The workaround will disable the interrupt and distribute irqwork to all
> >> + * the other processors in the system. Hopefully one of them will clear 
> >> the
> >> + * interrupt...
> >> + *
> >> + * The workaround is only deployed when all PMU interrupts are aimed
> >> + * at a single core. As a result the workaround is never re-entered
> >> + * making it safe for us to use static data to maintain state.
> >> + */
> >> +static void cpu_pmu_deploy_muxed_spi_workaround(struct arm_pmu *cpu_pmu)
> >
> > Perhaps distribute_muxed_irq?
> 
> Could do. Personally I prefer to be clear that this is deploying a
> workaround (i.e. implying it does something odd on a small number of
> SoCs) rather than this code being part of the normal case.

I'd prefer the more succinct name. We already have the comment stating
that this is a workaround.

[...]

> >> +/*
> >> + * Called when the main interrupt handler cannot determine the source
> >> + * of interrupt. It will deploy a workaround if we are running on an SMP
> >> + * platform with only a single muxed SPI.
> >> + */
> >> +static irqreturn_t cpu_pmu_handle_irq_none(int irq_num, struct arm_pmu 
> >> *cpu_pmu)
> >> +{
> >> +   if (irq_num != cpu_pmu->muxed_spi_workaround_irq)
> >> +   return IRQ_NONE;
> >
> > This is somewhat opaque.
> >
> > I'd rather just have a flag to determine when we need to do any special
> > handling for the muxed case (or better, swizzle the irq handler to a
> > wrapper that pings the other CPUs and calls the usual handler).
> 
> Using a different handler for the workaround makes sense.
> 
> IIRC I avoided that previously because the way the code is currently
> split between perf_event.c and perf_event_cpu.c didn't make that very
> natural.
> 
> I'll look at this again.

The split between the two is an unfortunate legacy from when it looked
like system PMUs could reuse the code. We recently decoupled the CCI PMU
from arm_pmu so that we could fuse the two layers and get rid of a load
of indirection.

I have some patches doing that. I'll try to rework them next week given
it sounds like they could be a useful base for this.


> > [...]
> >
> >> diff --git a/arch/arm/kernel/perf_event_v7.c 
> >> b/arch/arm/kernel/perf_event_v7.c
> >> index 8993770c47de..0dd914c10803 100644
> >> --- a/arch/arm/kernel/perf_event_v7.c
> >> +++ b/arch/arm/kernel/perf_event_v7.c
> >> @@ -792,7 +792,7 @@ static irqreturn_t armv7pmu_handle_irq(int irq_num, 
> >> void *dev)
> >>   * Did an overflow occur?
> >>   */
> >>  if (!armv7_pmnc_has_overflowed(pmnc))
> >> -   return IRQ_NONE;
> >> +   return cpu_pmu_handle_irq_none(irq_num, cpu_pmu);
> >
> > Won't this leave samples skewed towards the CPU the interrupt is affine
> > to? If you're counting something like cycles with a short enough period
> > (and therefore effectively always have something to handle on the local
> > CPU), we might never ping the other CPUs.

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-04-02 Thread Daniel Thompson

On 31/03/15 17:20, Will Deacon wrote:

Hi Daniel,

On Wed, Mar 04, 2015 at 01:25:45PM +, Daniel Thompson wrote:

Some ARM platforms mux the PMU interrupt of every core into a single
SPI. On such platforms if the PMU of any core except 0 raises an interrupt
then it cannot be serviced and eventually, if you are lucky, the spurious
irq detection might forcefully disable the interrupt.

On these SoCs it is not possible to determine which core raised the
interrupt so workaround this issue by queuing irqwork on the other
cores whenever the primary interrupt handler is unable to service the
interrupt.

The u8500 platform has an alternative workaround that dynamically alters
the affinity of the PMU interrupt. This workaround logic is no longer
required so the original code is removed as is the hook it relied upon.

Tested on imx6q (which has fours cores/PMUs all muxed to a single SPI)
using a simple soak, combined perf and CPU hotplug soak and using
perf fuzzer's fast_repro.sh.


[...]


diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b1596bd59129..dfef7904b790 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -87,6 +87,14 @@ struct pmu_hw_events {
  * already have to allocate this struct per cpu.
  */
 struct arm_pmu  *percpu_pmu;
+
+#ifdef CONFIG_SMP
+   /*
+* This is used to schedule workaround logic on platforms where all
+* the PMUs are attached to a single SPI.
+*/
+   struct irq_work work;
+#endif
  };

  struct arm_pmu {
@@ -117,6 +125,10 @@ struct arm_pmu {
 struct platform_device  *plat_device;
 struct pmu_hw_events__percpu *hw_events;
 struct notifier_block   hotplug_nb;
+#ifdef CONFIG_SMP
+   int muxed_spi_workaround_irq;
+   atomic_tremaining_irq_work;
+#endif
  };

  #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 557e128e4df0..e3fc1a0ce969 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -305,8 +305,6 @@ validate_group(struct perf_event *event)
  static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
  {
 struct arm_pmu *armpmu;
-   struct platform_device *plat_device;
-   struct arm_pmu_platdata *plat;
 int ret;
 u64 start_clock, finish_clock;

@@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
  * dereference.
  */
 armpmu = *(void **)dev;
-   plat_device = armpmu->plat_device;
-   plat = dev_get_platdata(_device->dev);

 start_clock = sched_clock();
-   if (plat && plat->handle_irq)
-   ret = plat->handle_irq(irq, armpmu, armpmu->handle_irq);
-   else
-   ret = armpmu->handle_irq(irq, armpmu);
+   ret = armpmu->handle_irq(irq, armpmu);
 finish_clock = sched_clock();

 perf_sample_event_took(finish_clock - start_clock);
diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index 61b53c46edfa..d5bbd79abd4c 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -59,6 +59,116 @@ int perf_num_counters(void)
  }
  EXPORT_SYMBOL_GPL(perf_num_counters);

+#ifdef CONFIG_SMP
+
+static cpumask_t down_prepare_cpu_mask;
+static DEFINE_SPINLOCK(down_prepare_cpu_lock);
+
+/*
+ * Workaround logic that is distributed to all cores if the PMU has only
+ * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
+ * job is to try to service the interrupt on the current CPU. It will
+ * also enable the IRQ again if all the other CPUs have already tried to
+ * service it.
+ */
+static void cpu_pmu_do_percpu_work(struct irq_work *w)
+{
+   struct pmu_hw_events *hw_events =
+   container_of(w, struct pmu_hw_events, work);
+   struct arm_pmu *cpu_pmu = hw_events->percpu_pmu;
+
+   /* Ignore the return code, we can do nothing useful with it */
+   (void) cpu_pmu->handle_irq(0, cpu_pmu);
+
+   if (atomic_dec_and_test(_pmu->remaining_irq_work))
+   enable_irq(cpu_pmu->muxed_spi_workaround_irq);
+}
+
+/*
+ * Workaround for systems where all PMU interrupts are targeting a
+ * single SPI.
+ *
+ * The workaround will disable the interrupt and distribute irqwork to all
+ * the other processors in the system. Hopefully one of them will clear the
+ * interrupt...
+ *
+ * The workaround is only deployed when all PMU interrupts are aimed
+ * at a single core. As a result the workaround is never re-entered
+ * making it safe for us to use static data to maintain state.
+ */
+static void cpu_pmu_deploy_muxed_spi_workaround(struct arm_pmu *cpu_pmu)
+{
+   static cpumask_t irqwork_mask;
+   int cpu;
+
+   disable_irq_nosync(cpu_pmu->muxed_spi_workaround_irq);
+   spin_lock(_prepare_cpu_lock);
+
+   /*
+* Combining cpu_online_mask and 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-04-02 Thread Daniel Thompson

On 31/03/15 18:08, Mark Rutland wrote:

Hi Daniel,

I'd very much like to see us converge on a solution for this soon. The
existing hack is getting in the way of other rework of the arm/arm64
perf code.


I'd quite like to see this patch sorted out too (mostly because one o my 
"go to" devices still crashes horribly whenever I try to do any 
profiling on a mainline kernel).




I think the approach this patch takes should work, but there are some
parts that can be cleaned up (hopefully mostly cosmetic). Unfortunately
I don't seem to have a relevant platform for testing on.

[...]


diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b1596bd59129..dfef7904b790 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -87,6 +87,14 @@ struct pmu_hw_events {
  * already have to allocate this struct per cpu.
  */
 struct arm_pmu  *percpu_pmu;
+
+#ifdef CONFIG_SMP
+   /*
+* This is used to schedule workaround logic on platforms where all
+* the PMUs are attached to a single SPI.
+*/
+   struct irq_work work;
+#endif
  };

  struct arm_pmu {
@@ -117,6 +125,10 @@ struct arm_pmu {
 struct platform_device  *plat_device;
 struct pmu_hw_events__percpu *hw_events;
 struct notifier_block   hotplug_nb;
+#ifdef CONFIG_SMP
+   int muxed_spi_workaround_irq;


There's nothing workaround specific about this IRQ; it's just the only
IRQ.

I think we should just pre-parse all the IRQs into a list at probe time,
regardless of SMP or the workaround. Then we can just grab the first
(and only) interrupt from the list in the workaround paths, and
otherwise just iterate over the list.


What other uses do you anticipate for such a list? I don't really 
understand why we would create a list when we only ever consume the 
first entry.


If there is a use for such information then why keep it as a list. 
Wouldn't the code be simpler if we add it as a field in the percpu data 
structure?


For PPIs all values would be the same, for SPIs all different, for 
broken SoCs all after [0] would be 0.




+   atomic_tremaining_irq_work;


Perhaps remaining_work_irqs? That would make it clear that this is a
counter rather than a boolean or enumeration. We could s/work/fake/ or
something to that effect.


I guess the thing we are actually counting is CPUs so 
remaining_cpus_with_irq_work would therefore be extremely descriptive.




@@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
  * dereference.
  */
 armpmu = *(void **)dev;
-   plat_device = armpmu->plat_device;
-   plat = dev_get_platdata(_device->dev);

 start_clock = sched_clock();
-   if (plat && plat->handle_irq)
-   ret = plat->handle_irq(irq, armpmu, armpmu->handle_irq);
-   else
-   ret = armpmu->handle_irq(irq, armpmu);
+   ret = armpmu->handle_irq(irq, armpmu);
 finish_clock = sched_clock();

 perf_sample_event_took(finish_clock - start_clock);


It's nice to see the plat stuff disappearing!


diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index 61b53c46edfa..d5bbd79abd4c 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -59,6 +59,116 @@ int perf_num_counters(void)
  }
  EXPORT_SYMBOL_GPL(perf_num_counters);

+#ifdef CONFIG_SMP
+
+static cpumask_t down_prepare_cpu_mask;
+static DEFINE_SPINLOCK(down_prepare_cpu_lock);


I think the names here are a little misleading, because we care about
the whole window from CPU_DOWN_PREPARE to CPU_DEAD (or DOWN_FAILED). I
think these would be clearer with s/down_prepare_cpu/dying_cpu/ (though
admittedly that could also be confused with CPU_DYING).


Fine. I'll change this.



+
+/*
+ * Workaround logic that is distributed to all cores if the PMU has only
+ * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
+ * job is to try to service the interrupt on the current CPU. It will
+ * also enable the IRQ again if all the other CPUs have already tried to
+ * service it.
+ */
+static void cpu_pmu_do_percpu_work(struct irq_work *w)


Perhaps do_muxed_irq_work?


Ok.



+{
+   struct pmu_hw_events *hw_events =
+   container_of(w, struct pmu_hw_events, work);
+   struct arm_pmu *cpu_pmu = hw_events->percpu_pmu;
+
+   /* Ignore the return code, we can do nothing useful with it */
+   (void) cpu_pmu->handle_irq(0, cpu_pmu);


Nit: no space after a cast please.

Do we need the void cast here? Does your toolchain complain?


I've always considered casting to void to be an executable comment... In 
this case I added it out of habit rather than because the toolchain 
complained.




+
+   if (atomic_dec_and_test(_pmu->remaining_irq_work))
+   enable_irq(cpu_pmu->muxed_spi_workaround_irq);


I'm a little uneasy about calling enable_irq here, 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-04-02 Thread Daniel Thompson

On 31/03/15 18:08, Mark Rutland wrote:

Hi Daniel,

I'd very much like to see us converge on a solution for this soon. The
existing hack is getting in the way of other rework of the arm/arm64
perf code.


I'd quite like to see this patch sorted out too (mostly because one o my 
go to devices still crashes horribly whenever I try to do any 
profiling on a mainline kernel).




I think the approach this patch takes should work, but there are some
parts that can be cleaned up (hopefully mostly cosmetic). Unfortunately
I don't seem to have a relevant platform for testing on.

[...]


diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b1596bd59129..dfef7904b790 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -87,6 +87,14 @@ struct pmu_hw_events {
  * already have to allocate this struct per cpu.
  */
 struct arm_pmu  *percpu_pmu;
+
+#ifdef CONFIG_SMP
+   /*
+* This is used to schedule workaround logic on platforms where all
+* the PMUs are attached to a single SPI.
+*/
+   struct irq_work work;
+#endif
  };

  struct arm_pmu {
@@ -117,6 +125,10 @@ struct arm_pmu {
 struct platform_device  *plat_device;
 struct pmu_hw_events__percpu *hw_events;
 struct notifier_block   hotplug_nb;
+#ifdef CONFIG_SMP
+   int muxed_spi_workaround_irq;


There's nothing workaround specific about this IRQ; it's just the only
IRQ.

I think we should just pre-parse all the IRQs into a list at probe time,
regardless of SMP or the workaround. Then we can just grab the first
(and only) interrupt from the list in the workaround paths, and
otherwise just iterate over the list.


What other uses do you anticipate for such a list? I don't really 
understand why we would create a list when we only ever consume the 
first entry.


If there is a use for such information then why keep it as a list. 
Wouldn't the code be simpler if we add it as a field in the percpu data 
structure?


For PPIs all values would be the same, for SPIs all different, for 
broken SoCs all after [0] would be 0.




+   atomic_tremaining_irq_work;


Perhaps remaining_work_irqs? That would make it clear that this is a
counter rather than a boolean or enumeration. We could s/work/fake/ or
something to that effect.


I guess the thing we are actually counting is CPUs so 
remaining_cpus_with_irq_work would therefore be extremely descriptive.




@@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
  * dereference.
  */
 armpmu = *(void **)dev;
-   plat_device = armpmu-plat_device;
-   plat = dev_get_platdata(plat_device-dev);

 start_clock = sched_clock();
-   if (plat  plat-handle_irq)
-   ret = plat-handle_irq(irq, armpmu, armpmu-handle_irq);
-   else
-   ret = armpmu-handle_irq(irq, armpmu);
+   ret = armpmu-handle_irq(irq, armpmu);
 finish_clock = sched_clock();

 perf_sample_event_took(finish_clock - start_clock);


It's nice to see the plat stuff disappearing!


diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index 61b53c46edfa..d5bbd79abd4c 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -59,6 +59,116 @@ int perf_num_counters(void)
  }
  EXPORT_SYMBOL_GPL(perf_num_counters);

+#ifdef CONFIG_SMP
+
+static cpumask_t down_prepare_cpu_mask;
+static DEFINE_SPINLOCK(down_prepare_cpu_lock);


I think the names here are a little misleading, because we care about
the whole window from CPU_DOWN_PREPARE to CPU_DEAD (or DOWN_FAILED). I
think these would be clearer with s/down_prepare_cpu/dying_cpu/ (though
admittedly that could also be confused with CPU_DYING).


Fine. I'll change this.



+
+/*
+ * Workaround logic that is distributed to all cores if the PMU has only
+ * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
+ * job is to try to service the interrupt on the current CPU. It will
+ * also enable the IRQ again if all the other CPUs have already tried to
+ * service it.
+ */
+static void cpu_pmu_do_percpu_work(struct irq_work *w)


Perhaps do_muxed_irq_work?


Ok.



+{
+   struct pmu_hw_events *hw_events =
+   container_of(w, struct pmu_hw_events, work);
+   struct arm_pmu *cpu_pmu = hw_events-percpu_pmu;
+
+   /* Ignore the return code, we can do nothing useful with it */
+   (void) cpu_pmu-handle_irq(0, cpu_pmu);


Nit: no space after a cast please.

Do we need the void cast here? Does your toolchain complain?


I've always considered casting to void to be an executable comment... In 
this case I added it out of habit rather than because the toolchain 
complained.




+
+   if (atomic_dec_and_test(cpu_pmu-remaining_irq_work))
+   enable_irq(cpu_pmu-muxed_spi_workaround_irq);


I'm a little uneasy about calling enable_irq here, given 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-04-02 Thread Daniel Thompson

On 31/03/15 17:20, Will Deacon wrote:

Hi Daniel,

On Wed, Mar 04, 2015 at 01:25:45PM +, Daniel Thompson wrote:

Some ARM platforms mux the PMU interrupt of every core into a single
SPI. On such platforms if the PMU of any core except 0 raises an interrupt
then it cannot be serviced and eventually, if you are lucky, the spurious
irq detection might forcefully disable the interrupt.

On these SoCs it is not possible to determine which core raised the
interrupt so workaround this issue by queuing irqwork on the other
cores whenever the primary interrupt handler is unable to service the
interrupt.

The u8500 platform has an alternative workaround that dynamically alters
the affinity of the PMU interrupt. This workaround logic is no longer
required so the original code is removed as is the hook it relied upon.

Tested on imx6q (which has fours cores/PMUs all muxed to a single SPI)
using a simple soak, combined perf and CPU hotplug soak and using
perf fuzzer's fast_repro.sh.


[...]


diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b1596bd59129..dfef7904b790 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -87,6 +87,14 @@ struct pmu_hw_events {
  * already have to allocate this struct per cpu.
  */
 struct arm_pmu  *percpu_pmu;
+
+#ifdef CONFIG_SMP
+   /*
+* This is used to schedule workaround logic on platforms where all
+* the PMUs are attached to a single SPI.
+*/
+   struct irq_work work;
+#endif
  };

  struct arm_pmu {
@@ -117,6 +125,10 @@ struct arm_pmu {
 struct platform_device  *plat_device;
 struct pmu_hw_events__percpu *hw_events;
 struct notifier_block   hotplug_nb;
+#ifdef CONFIG_SMP
+   int muxed_spi_workaround_irq;
+   atomic_tremaining_irq_work;
+#endif
  };

  #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 557e128e4df0..e3fc1a0ce969 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -305,8 +305,6 @@ validate_group(struct perf_event *event)
  static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
  {
 struct arm_pmu *armpmu;
-   struct platform_device *plat_device;
-   struct arm_pmu_platdata *plat;
 int ret;
 u64 start_clock, finish_clock;

@@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
  * dereference.
  */
 armpmu = *(void **)dev;
-   plat_device = armpmu-plat_device;
-   plat = dev_get_platdata(plat_device-dev);

 start_clock = sched_clock();
-   if (plat  plat-handle_irq)
-   ret = plat-handle_irq(irq, armpmu, armpmu-handle_irq);
-   else
-   ret = armpmu-handle_irq(irq, armpmu);
+   ret = armpmu-handle_irq(irq, armpmu);
 finish_clock = sched_clock();

 perf_sample_event_took(finish_clock - start_clock);
diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index 61b53c46edfa..d5bbd79abd4c 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -59,6 +59,116 @@ int perf_num_counters(void)
  }
  EXPORT_SYMBOL_GPL(perf_num_counters);

+#ifdef CONFIG_SMP
+
+static cpumask_t down_prepare_cpu_mask;
+static DEFINE_SPINLOCK(down_prepare_cpu_lock);
+
+/*
+ * Workaround logic that is distributed to all cores if the PMU has only
+ * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
+ * job is to try to service the interrupt on the current CPU. It will
+ * also enable the IRQ again if all the other CPUs have already tried to
+ * service it.
+ */
+static void cpu_pmu_do_percpu_work(struct irq_work *w)
+{
+   struct pmu_hw_events *hw_events =
+   container_of(w, struct pmu_hw_events, work);
+   struct arm_pmu *cpu_pmu = hw_events-percpu_pmu;
+
+   /* Ignore the return code, we can do nothing useful with it */
+   (void) cpu_pmu-handle_irq(0, cpu_pmu);
+
+   if (atomic_dec_and_test(cpu_pmu-remaining_irq_work))
+   enable_irq(cpu_pmu-muxed_spi_workaround_irq);
+}
+
+/*
+ * Workaround for systems where all PMU interrupts are targeting a
+ * single SPI.
+ *
+ * The workaround will disable the interrupt and distribute irqwork to all
+ * the other processors in the system. Hopefully one of them will clear the
+ * interrupt...
+ *
+ * The workaround is only deployed when all PMU interrupts are aimed
+ * at a single core. As a result the workaround is never re-entered
+ * making it safe for us to use static data to maintain state.
+ */
+static void cpu_pmu_deploy_muxed_spi_workaround(struct arm_pmu *cpu_pmu)
+{
+   static cpumask_t irqwork_mask;
+   int cpu;
+
+   disable_irq_nosync(cpu_pmu-muxed_spi_workaround_irq);
+   spin_lock(down_prepare_cpu_lock);
+
+   /*
+* Combining cpu_online_mask and 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-04-02 Thread Mark Rutland
struct arm_pmu {
  @@ -117,6 +125,10 @@ struct arm_pmu {
   struct platform_device  *plat_device;
   struct pmu_hw_events__percpu *hw_events;
   struct notifier_block   hotplug_nb;
  +#ifdef CONFIG_SMP
  +   int muxed_spi_workaround_irq;
 
  There's nothing workaround specific about this IRQ; it's just the only
  IRQ.
 
  I think we should just pre-parse all the IRQs into a list at probe time,
  regardless of SMP or the workaround. Then we can just grab the first
  (and only) interrupt from the list in the workaround paths, and
  otherwise just iterate over the list.
 
 What other uses do you anticipate for such a list? I don't really
 understand why we would create a list when we only ever consume the
 first entry.

It would go parallel to the affinity property Will recently added, and
would allow us to separate the DT parsing from the request/free paths,
making those far easier to understand (and likely a little faster too).

 If there is a use for such information then why keep it as a list.
 Wouldn't the code be simpler if we add it as a field in the percpu data
 structure?
 
 For PPIs all values would be the same, for SPIs all different, for
 broken SoCs all after [0] would be 0.

That's a nice idea. That would make the affinity info implicit, and
could be far neater.

  +   atomic_tremaining_irq_work;
 
  Perhaps remaining_work_irqs? That would make it clear that this is a
  counter rather than a boolean or enumeration. We could s/work/fake/ or
  something to that effect.
 
 I guess the thing we are actually counting is CPUs so
 remaining_cpus_with_irq_work would therefore be extremely descriptive.

It would be nice if we could come up with something that was a little
more concise. Perhaps cpus_with_work would be good enough?

  +}
  +
  +/*
  + * Workaround for systems where all PMU interrupts are targeting a
  + * single SPI.
  + *
  + * The workaround will disable the interrupt and distribute irqwork to all
  + * the other processors in the system. Hopefully one of them will clear 
  the
  + * interrupt...
  + *
  + * The workaround is only deployed when all PMU interrupts are aimed
  + * at a single core. As a result the workaround is never re-entered
  + * making it safe for us to use static data to maintain state.
  + */
  +static void cpu_pmu_deploy_muxed_spi_workaround(struct arm_pmu *cpu_pmu)
 
  Perhaps distribute_muxed_irq?
 
 Could do. Personally I prefer to be clear that this is deploying a
 workaround (i.e. implying it does something odd on a small number of
 SoCs) rather than this code being part of the normal case.

I'd prefer the more succinct name. We already have the comment stating
that this is a workaround.

[...]

  +/*
  + * Called when the main interrupt handler cannot determine the source
  + * of interrupt. It will deploy a workaround if we are running on an SMP
  + * platform with only a single muxed SPI.
  + */
  +static irqreturn_t cpu_pmu_handle_irq_none(int irq_num, struct arm_pmu 
  *cpu_pmu)
  +{
  +   if (irq_num != cpu_pmu-muxed_spi_workaround_irq)
  +   return IRQ_NONE;
 
  This is somewhat opaque.
 
  I'd rather just have a flag to determine when we need to do any special
  handling for the muxed case (or better, swizzle the irq handler to a
  wrapper that pings the other CPUs and calls the usual handler).
 
 Using a different handler for the workaround makes sense.
 
 IIRC I avoided that previously because the way the code is currently
 split between perf_event.c and perf_event_cpu.c didn't make that very
 natural.
 
 I'll look at this again.

The split between the two is an unfortunate legacy from when it looked
like system PMUs could reuse the code. We recently decoupled the CCI PMU
from arm_pmu so that we could fuse the two layers and get rid of a load
of indirection.

I have some patches doing that. I'll try to rework them next week given
it sounds like they could be a useful base for this.


  [...]
 
  diff --git a/arch/arm/kernel/perf_event_v7.c 
  b/arch/arm/kernel/perf_event_v7.c
  index 8993770c47de..0dd914c10803 100644
  --- a/arch/arm/kernel/perf_event_v7.c
  +++ b/arch/arm/kernel/perf_event_v7.c
  @@ -792,7 +792,7 @@ static irqreturn_t armv7pmu_handle_irq(int irq_num, 
  void *dev)
* Did an overflow occur?
*/
   if (!armv7_pmnc_has_overflowed(pmnc))
  -   return IRQ_NONE;
  +   return cpu_pmu_handle_irq_none(irq_num, cpu_pmu);
 
  Won't this leave samples skewed towards the CPU the interrupt is affine
  to? If you're counting something like cycles with a short enough period
  (and therefore effectively always have something to handle on the local
  CPU), we might never ping the other CPUs.
 
 Do we really care about fidelity of results when servicing a perf
 interrupt causes another perf interrupt to happen ;-) ? At this point
 perf has already stopped profiling anything useful.

Fair point. I'll have to 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-03-31 Thread Mark Rutland
Hi Daniel,

I'd very much like to see us converge on a solution for this soon. The
existing hack is getting in the way of other rework of the arm/arm64
perf code.

I think the approach this patch takes should work, but there are some
parts that can be cleaned up (hopefully mostly cosmetic). Unfortunately
I don't seem to have a relevant platform for testing on.

[...]

> diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
> index b1596bd59129..dfef7904b790 100644
> --- a/arch/arm/include/asm/pmu.h
> +++ b/arch/arm/include/asm/pmu.h
> @@ -87,6 +87,14 @@ struct pmu_hw_events {
>  * already have to allocate this struct per cpu.
>  */
> struct arm_pmu  *percpu_pmu;
> +
> +#ifdef CONFIG_SMP
> +   /*
> +* This is used to schedule workaround logic on platforms where all
> +* the PMUs are attached to a single SPI.
> +*/
> +   struct irq_work work;
> +#endif
>  };
> 
>  struct arm_pmu {
> @@ -117,6 +125,10 @@ struct arm_pmu {
> struct platform_device  *plat_device;
> struct pmu_hw_events__percpu *hw_events;
> struct notifier_block   hotplug_nb;
> +#ifdef CONFIG_SMP
> +   int muxed_spi_workaround_irq;

There's nothing workaround specific about this IRQ; it's just the only
IRQ.

I think we should just pre-parse all the IRQs into a list at probe time,
regardless of SMP or the workaround. Then we can just grab the first
(and only) interrupt from the list in the workaround paths, and
otherwise just iterate over the list.

> +   atomic_tremaining_irq_work;

Perhaps remaining_work_irqs? That would make it clear that this is a
counter rather than a boolean or enumeration. We could s/work/fake/ or
something to that effect.

> @@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void 
> *dev)
>  * dereference.
>  */
> armpmu = *(void **)dev;
> -   plat_device = armpmu->plat_device;
> -   plat = dev_get_platdata(_device->dev);
> 
> start_clock = sched_clock();
> -   if (plat && plat->handle_irq)
> -   ret = plat->handle_irq(irq, armpmu, armpmu->handle_irq);
> -   else
> -   ret = armpmu->handle_irq(irq, armpmu);
> +   ret = armpmu->handle_irq(irq, armpmu);
> finish_clock = sched_clock();
> 
> perf_sample_event_took(finish_clock - start_clock);

It's nice to see the plat stuff disappearing!

> diff --git a/arch/arm/kernel/perf_event_cpu.c 
> b/arch/arm/kernel/perf_event_cpu.c
> index 61b53c46edfa..d5bbd79abd4c 100644
> --- a/arch/arm/kernel/perf_event_cpu.c
> +++ b/arch/arm/kernel/perf_event_cpu.c
> @@ -59,6 +59,116 @@ int perf_num_counters(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_num_counters);
> 
> +#ifdef CONFIG_SMP
> +
> +static cpumask_t down_prepare_cpu_mask;
> +static DEFINE_SPINLOCK(down_prepare_cpu_lock);

I think the names here are a little misleading, because we care about
the whole window from CPU_DOWN_PREPARE to CPU_DEAD (or DOWN_FAILED). I
think these would be clearer with s/down_prepare_cpu/dying_cpu/ (though
admittedly that could also be confused with CPU_DYING).

> +
> +/*
> + * Workaround logic that is distributed to all cores if the PMU has only
> + * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
> + * job is to try to service the interrupt on the current CPU. It will
> + * also enable the IRQ again if all the other CPUs have already tried to
> + * service it.
> + */
> +static void cpu_pmu_do_percpu_work(struct irq_work *w)

Perhaps do_muxed_irq_work?

> +{
> +   struct pmu_hw_events *hw_events =
> +   container_of(w, struct pmu_hw_events, work);
> +   struct arm_pmu *cpu_pmu = hw_events->percpu_pmu;
> +
> +   /* Ignore the return code, we can do nothing useful with it */
> +   (void) cpu_pmu->handle_irq(0, cpu_pmu);

Nit: no space after a cast please.

Do we need the void cast here? Does your toolchain complain?

> +
> +   if (atomic_dec_and_test(_pmu->remaining_irq_work))
> +   enable_irq(cpu_pmu->muxed_spi_workaround_irq);

I'm a little uneasy about calling enable_irq here, given we're in IRQ
context. While it doesn't look we'd be wired up through an irqchip with
irq_bus_lock or irq_bus_sync_unlock, I'm not sure how safe it is to rely
on that being the only thing that matters.

It would be nice to hear from someone familiar with the IRQ code on that
respect.

> +}
> +
> +/*
> + * Workaround for systems where all PMU interrupts are targeting a
> + * single SPI.
> + *
> + * The workaround will disable the interrupt and distribute irqwork to all
> + * the other processors in the system. Hopefully one of them will clear the
> + * interrupt...
> + *
> + * The workaround is only deployed when all PMU interrupts are aimed
> + * at a single core. As a result the workaround is never re-entered
> + * making it safe for us to use static data to maintain state.
> + */
> +static void 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-03-31 Thread Will Deacon
Hi Daniel,

On Wed, Mar 04, 2015 at 01:25:45PM +, Daniel Thompson wrote:
> Some ARM platforms mux the PMU interrupt of every core into a single
> SPI. On such platforms if the PMU of any core except 0 raises an interrupt
> then it cannot be serviced and eventually, if you are lucky, the spurious
> irq detection might forcefully disable the interrupt.
> 
> On these SoCs it is not possible to determine which core raised the
> interrupt so workaround this issue by queuing irqwork on the other
> cores whenever the primary interrupt handler is unable to service the
> interrupt.
> 
> The u8500 platform has an alternative workaround that dynamically alters
> the affinity of the PMU interrupt. This workaround logic is no longer
> required so the original code is removed as is the hook it relied upon.
> 
> Tested on imx6q (which has fours cores/PMUs all muxed to a single SPI)
> using a simple soak, combined perf and CPU hotplug soak and using
> perf fuzzer's fast_repro.sh.

[...]

> diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
> index b1596bd59129..dfef7904b790 100644
> --- a/arch/arm/include/asm/pmu.h
> +++ b/arch/arm/include/asm/pmu.h
> @@ -87,6 +87,14 @@ struct pmu_hw_events {
>  * already have to allocate this struct per cpu.
>  */
> struct arm_pmu  *percpu_pmu;
> +
> +#ifdef CONFIG_SMP
> +   /*
> +* This is used to schedule workaround logic on platforms where all
> +* the PMUs are attached to a single SPI.
> +*/
> +   struct irq_work work;
> +#endif
>  };
> 
>  struct arm_pmu {
> @@ -117,6 +125,10 @@ struct arm_pmu {
> struct platform_device  *plat_device;
> struct pmu_hw_events__percpu *hw_events;
> struct notifier_block   hotplug_nb;
> +#ifdef CONFIG_SMP
> +   int muxed_spi_workaround_irq;
> +   atomic_tremaining_irq_work;
> +#endif
>  };
> 
>  #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
> diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
> index 557e128e4df0..e3fc1a0ce969 100644
> --- a/arch/arm/kernel/perf_event.c
> +++ b/arch/arm/kernel/perf_event.c
> @@ -305,8 +305,6 @@ validate_group(struct perf_event *event)
>  static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
>  {
> struct arm_pmu *armpmu;
> -   struct platform_device *plat_device;
> -   struct arm_pmu_platdata *plat;
> int ret;
> u64 start_clock, finish_clock;
> 
> @@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void 
> *dev)
>  * dereference.
>  */
> armpmu = *(void **)dev;
> -   plat_device = armpmu->plat_device;
> -   plat = dev_get_platdata(_device->dev);
> 
> start_clock = sched_clock();
> -   if (plat && plat->handle_irq)
> -   ret = plat->handle_irq(irq, armpmu, armpmu->handle_irq);
> -   else
> -   ret = armpmu->handle_irq(irq, armpmu);
> +   ret = armpmu->handle_irq(irq, armpmu);
> finish_clock = sched_clock();
> 
> perf_sample_event_took(finish_clock - start_clock);
> diff --git a/arch/arm/kernel/perf_event_cpu.c 
> b/arch/arm/kernel/perf_event_cpu.c
> index 61b53c46edfa..d5bbd79abd4c 100644
> --- a/arch/arm/kernel/perf_event_cpu.c
> +++ b/arch/arm/kernel/perf_event_cpu.c
> @@ -59,6 +59,116 @@ int perf_num_counters(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_num_counters);
> 
> +#ifdef CONFIG_SMP
> +
> +static cpumask_t down_prepare_cpu_mask;
> +static DEFINE_SPINLOCK(down_prepare_cpu_lock);
> +
> +/*
> + * Workaround logic that is distributed to all cores if the PMU has only
> + * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
> + * job is to try to service the interrupt on the current CPU. It will
> + * also enable the IRQ again if all the other CPUs have already tried to
> + * service it.
> + */
> +static void cpu_pmu_do_percpu_work(struct irq_work *w)
> +{
> +   struct pmu_hw_events *hw_events =
> +   container_of(w, struct pmu_hw_events, work);
> +   struct arm_pmu *cpu_pmu = hw_events->percpu_pmu;
> +
> +   /* Ignore the return code, we can do nothing useful with it */
> +   (void) cpu_pmu->handle_irq(0, cpu_pmu);
> +
> +   if (atomic_dec_and_test(_pmu->remaining_irq_work))
> +   enable_irq(cpu_pmu->muxed_spi_workaround_irq);
> +}
> +
> +/*
> + * Workaround for systems where all PMU interrupts are targeting a
> + * single SPI.
> + *
> + * The workaround will disable the interrupt and distribute irqwork to all
> + * the other processors in the system. Hopefully one of them will clear the
> + * interrupt...
> + *
> + * The workaround is only deployed when all PMU interrupts are aimed
> + * at a single core. As a result the workaround is never re-entered
> + * making it safe for us to use static data to maintain state.
> + */
> +static void cpu_pmu_deploy_muxed_spi_workaround(struct arm_pmu *cpu_pmu)
> +{
> +   static cpumask_t 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-03-31 Thread Mark Rutland
Hi Daniel,

I'd very much like to see us converge on a solution for this soon. The
existing hack is getting in the way of other rework of the arm/arm64
perf code.

I think the approach this patch takes should work, but there are some
parts that can be cleaned up (hopefully mostly cosmetic). Unfortunately
I don't seem to have a relevant platform for testing on.

[...]

 diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
 index b1596bd59129..dfef7904b790 100644
 --- a/arch/arm/include/asm/pmu.h
 +++ b/arch/arm/include/asm/pmu.h
 @@ -87,6 +87,14 @@ struct pmu_hw_events {
  * already have to allocate this struct per cpu.
  */
 struct arm_pmu  *percpu_pmu;
 +
 +#ifdef CONFIG_SMP
 +   /*
 +* This is used to schedule workaround logic on platforms where all
 +* the PMUs are attached to a single SPI.
 +*/
 +   struct irq_work work;
 +#endif
  };
 
  struct arm_pmu {
 @@ -117,6 +125,10 @@ struct arm_pmu {
 struct platform_device  *plat_device;
 struct pmu_hw_events__percpu *hw_events;
 struct notifier_block   hotplug_nb;
 +#ifdef CONFIG_SMP
 +   int muxed_spi_workaround_irq;

There's nothing workaround specific about this IRQ; it's just the only
IRQ.

I think we should just pre-parse all the IRQs into a list at probe time,
regardless of SMP or the workaround. Then we can just grab the first
(and only) interrupt from the list in the workaround paths, and
otherwise just iterate over the list.

 +   atomic_tremaining_irq_work;

Perhaps remaining_work_irqs? That would make it clear that this is a
counter rather than a boolean or enumeration. We could s/work/fake/ or
something to that effect.

 @@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void 
 *dev)
  * dereference.
  */
 armpmu = *(void **)dev;
 -   plat_device = armpmu-plat_device;
 -   plat = dev_get_platdata(plat_device-dev);
 
 start_clock = sched_clock();
 -   if (plat  plat-handle_irq)
 -   ret = plat-handle_irq(irq, armpmu, armpmu-handle_irq);
 -   else
 -   ret = armpmu-handle_irq(irq, armpmu);
 +   ret = armpmu-handle_irq(irq, armpmu);
 finish_clock = sched_clock();
 
 perf_sample_event_took(finish_clock - start_clock);

It's nice to see the plat stuff disappearing!

 diff --git a/arch/arm/kernel/perf_event_cpu.c 
 b/arch/arm/kernel/perf_event_cpu.c
 index 61b53c46edfa..d5bbd79abd4c 100644
 --- a/arch/arm/kernel/perf_event_cpu.c
 +++ b/arch/arm/kernel/perf_event_cpu.c
 @@ -59,6 +59,116 @@ int perf_num_counters(void)
  }
  EXPORT_SYMBOL_GPL(perf_num_counters);
 
 +#ifdef CONFIG_SMP
 +
 +static cpumask_t down_prepare_cpu_mask;
 +static DEFINE_SPINLOCK(down_prepare_cpu_lock);

I think the names here are a little misleading, because we care about
the whole window from CPU_DOWN_PREPARE to CPU_DEAD (or DOWN_FAILED). I
think these would be clearer with s/down_prepare_cpu/dying_cpu/ (though
admittedly that could also be confused with CPU_DYING).

 +
 +/*
 + * Workaround logic that is distributed to all cores if the PMU has only
 + * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
 + * job is to try to service the interrupt on the current CPU. It will
 + * also enable the IRQ again if all the other CPUs have already tried to
 + * service it.
 + */
 +static void cpu_pmu_do_percpu_work(struct irq_work *w)

Perhaps do_muxed_irq_work?

 +{
 +   struct pmu_hw_events *hw_events =
 +   container_of(w, struct pmu_hw_events, work);
 +   struct arm_pmu *cpu_pmu = hw_events-percpu_pmu;
 +
 +   /* Ignore the return code, we can do nothing useful with it */
 +   (void) cpu_pmu-handle_irq(0, cpu_pmu);

Nit: no space after a cast please.

Do we need the void cast here? Does your toolchain complain?

 +
 +   if (atomic_dec_and_test(cpu_pmu-remaining_irq_work))
 +   enable_irq(cpu_pmu-muxed_spi_workaround_irq);

I'm a little uneasy about calling enable_irq here, given we're in IRQ
context. While it doesn't look we'd be wired up through an irqchip with
irq_bus_lock or irq_bus_sync_unlock, I'm not sure how safe it is to rely
on that being the only thing that matters.

It would be nice to hear from someone familiar with the IRQ code on that
respect.

 +}
 +
 +/*
 + * Workaround for systems where all PMU interrupts are targeting a
 + * single SPI.
 + *
 + * The workaround will disable the interrupt and distribute irqwork to all
 + * the other processors in the system. Hopefully one of them will clear the
 + * interrupt...
 + *
 + * The workaround is only deployed when all PMU interrupts are aimed
 + * at a single core. As a result the workaround is never re-entered
 + * making it safe for us to use static data to maintain state.
 + */
 +static void cpu_pmu_deploy_muxed_spi_workaround(struct arm_pmu *cpu_pmu)

Perhaps distribute_muxed_irq?

 +{
 +   static 

Re: [PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-03-31 Thread Will Deacon
Hi Daniel,

On Wed, Mar 04, 2015 at 01:25:45PM +, Daniel Thompson wrote:
 Some ARM platforms mux the PMU interrupt of every core into a single
 SPI. On such platforms if the PMU of any core except 0 raises an interrupt
 then it cannot be serviced and eventually, if you are lucky, the spurious
 irq detection might forcefully disable the interrupt.
 
 On these SoCs it is not possible to determine which core raised the
 interrupt so workaround this issue by queuing irqwork on the other
 cores whenever the primary interrupt handler is unable to service the
 interrupt.
 
 The u8500 platform has an alternative workaround that dynamically alters
 the affinity of the PMU interrupt. This workaround logic is no longer
 required so the original code is removed as is the hook it relied upon.
 
 Tested on imx6q (which has fours cores/PMUs all muxed to a single SPI)
 using a simple soak, combined perf and CPU hotplug soak and using
 perf fuzzer's fast_repro.sh.

[...]

 diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
 index b1596bd59129..dfef7904b790 100644
 --- a/arch/arm/include/asm/pmu.h
 +++ b/arch/arm/include/asm/pmu.h
 @@ -87,6 +87,14 @@ struct pmu_hw_events {
  * already have to allocate this struct per cpu.
  */
 struct arm_pmu  *percpu_pmu;
 +
 +#ifdef CONFIG_SMP
 +   /*
 +* This is used to schedule workaround logic on platforms where all
 +* the PMUs are attached to a single SPI.
 +*/
 +   struct irq_work work;
 +#endif
  };
 
  struct arm_pmu {
 @@ -117,6 +125,10 @@ struct arm_pmu {
 struct platform_device  *plat_device;
 struct pmu_hw_events__percpu *hw_events;
 struct notifier_block   hotplug_nb;
 +#ifdef CONFIG_SMP
 +   int muxed_spi_workaround_irq;
 +   atomic_tremaining_irq_work;
 +#endif
  };
 
  #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
 diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
 index 557e128e4df0..e3fc1a0ce969 100644
 --- a/arch/arm/kernel/perf_event.c
 +++ b/arch/arm/kernel/perf_event.c
 @@ -305,8 +305,6 @@ validate_group(struct perf_event *event)
  static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
  {
 struct arm_pmu *armpmu;
 -   struct platform_device *plat_device;
 -   struct arm_pmu_platdata *plat;
 int ret;
 u64 start_clock, finish_clock;
 
 @@ -317,14 +315,9 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void 
 *dev)
  * dereference.
  */
 armpmu = *(void **)dev;
 -   plat_device = armpmu-plat_device;
 -   plat = dev_get_platdata(plat_device-dev);
 
 start_clock = sched_clock();
 -   if (plat  plat-handle_irq)
 -   ret = plat-handle_irq(irq, armpmu, armpmu-handle_irq);
 -   else
 -   ret = armpmu-handle_irq(irq, armpmu);
 +   ret = armpmu-handle_irq(irq, armpmu);
 finish_clock = sched_clock();
 
 perf_sample_event_took(finish_clock - start_clock);
 diff --git a/arch/arm/kernel/perf_event_cpu.c 
 b/arch/arm/kernel/perf_event_cpu.c
 index 61b53c46edfa..d5bbd79abd4c 100644
 --- a/arch/arm/kernel/perf_event_cpu.c
 +++ b/arch/arm/kernel/perf_event_cpu.c
 @@ -59,6 +59,116 @@ int perf_num_counters(void)
  }
  EXPORT_SYMBOL_GPL(perf_num_counters);
 
 +#ifdef CONFIG_SMP
 +
 +static cpumask_t down_prepare_cpu_mask;
 +static DEFINE_SPINLOCK(down_prepare_cpu_lock);
 +
 +/*
 + * Workaround logic that is distributed to all cores if the PMU has only
 + * a single IRQ and the CPU receiving that IRQ cannot handle it. Its
 + * job is to try to service the interrupt on the current CPU. It will
 + * also enable the IRQ again if all the other CPUs have already tried to
 + * service it.
 + */
 +static void cpu_pmu_do_percpu_work(struct irq_work *w)
 +{
 +   struct pmu_hw_events *hw_events =
 +   container_of(w, struct pmu_hw_events, work);
 +   struct arm_pmu *cpu_pmu = hw_events-percpu_pmu;
 +
 +   /* Ignore the return code, we can do nothing useful with it */
 +   (void) cpu_pmu-handle_irq(0, cpu_pmu);
 +
 +   if (atomic_dec_and_test(cpu_pmu-remaining_irq_work))
 +   enable_irq(cpu_pmu-muxed_spi_workaround_irq);
 +}
 +
 +/*
 + * Workaround for systems where all PMU interrupts are targeting a
 + * single SPI.
 + *
 + * The workaround will disable the interrupt and distribute irqwork to all
 + * the other processors in the system. Hopefully one of them will clear the
 + * interrupt...
 + *
 + * The workaround is only deployed when all PMU interrupts are aimed
 + * at a single core. As a result the workaround is never re-entered
 + * making it safe for us to use static data to maintain state.
 + */
 +static void cpu_pmu_deploy_muxed_spi_workaround(struct arm_pmu *cpu_pmu)
 +{
 +   static cpumask_t irqwork_mask;
 +   int cpu;
 +
 +   disable_irq_nosync(cpu_pmu-muxed_spi_workaround_irq);
 +   

[PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-03-04 Thread Daniel Thompson
Some ARM platforms mux the PMU interrupt of every core into a single
SPI. On such platforms if the PMU of any core except 0 raises an interrupt
then it cannot be serviced and eventually, if you are lucky, the spurious
irq detection might forcefully disable the interrupt.

On these SoCs it is not possible to determine which core raised the
interrupt so workaround this issue by queuing irqwork on the other
cores whenever the primary interrupt handler is unable to service the
interrupt.

The u8500 platform has an alternative workaround that dynamically alters
the affinity of the PMU interrupt. This workaround logic is no longer
required so the original code is removed as is the hook it relied upon.

Tested on imx6q (which has fours cores/PMUs all muxed to a single SPI)
using a simple soak, combined perf and CPU hotplug soak and using
perf fuzzer's fast_repro.sh.

Signed-off-by: Daniel Thompson 
---

Notes:
v2 was tested on u8500 (thanks to Linus Walleij). The latest patch
doesn't change the nature of the workaround itself but there has been
substantial churn in the logic to decide when it can safely be deployed.
Thus the changes were sufficient for me not to
preserve the Tested-By:.

v6:
 * Redesigned the code that decides if it is safe to deploy the
   workaround (acting on the review by Mark Rutland). Code should no
   longer race during hot unplug; previous patches sought to make the
   hot unplug race benign and the old approach had flaws and even if
   it could be made correct was tortuously hard to review.

v5:
 * Removed the work queue nonsense; being completely race-free requires
   us to take a mutex or avoid dispatch from interrupt (Will Deacon).
   Replacement code can potentially race with a CPU hot unplug however
   it is careful to minimise exposure, to mitigate harmful effects and
   has fairly prominent comments.

v4:
 * Ripped out the logic that tried to preserve the operation of the
   spurious interrupt detector. It was complex and not really needed
   (Will Deacon).
 * Removed a redundant memory barrier and added a comment explaining
   why it is not needed (Will Deacon).
 * Made fully safe w.r.t. hotplug by falling back to a work queue
   if there is a hotplug operation in flight when the PMU interrupt
   comes in (Will Deacon). The work queue code paths have been tested
   synthetically (by changing the if condition).
 * Posted the correct, as in compilable and tested, version of the code
   (Will Deacon).

v3:
 * Removed function pointer indirection when deploying workaround code
   and reorganise the code accordingly (Mark Rutland).
 * Move the workaround state tracking into the existing percpu data
   structure (Mark Rutland).
 * Renamed cret to percpu_ret and rewrote the comment describing the
   purpose of this variable (Mark Rutland).
 * Copy the cpu_online_mask and use that to act on a consistent set of
   cpus throughout the workaround (Mark Rutland).
 * Changed "single_irq" to "muxed_spi" to more explicitly describe
   the problem.

v2:
 * Fixed build problems on systems without SMP.

v1:
 * Thanks to Lucas Stach, Russell King and Thomas Gleixner for
   critiquing an older, completely different way to tackle the
   same problem.


 arch/arm/include/asm/pmu.h   |  12 +++
 arch/arm/kernel/perf_event.c |   9 +-
 arch/arm/kernel/perf_event_cpu.c | 179 +++
 arch/arm/kernel/perf_event_v7.c  |   2 +-
 arch/arm/mach-ux500/cpu-db8500.c |  29 ---
 5 files changed, 178 insertions(+), 53 deletions(-)

diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b1596bd59129..dfef7904b790 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -87,6 +87,14 @@ struct pmu_hw_events {
 * already have to allocate this struct per cpu.
 */
struct arm_pmu  *percpu_pmu;
+
+#ifdef CONFIG_SMP
+   /*
+* This is used to schedule workaround logic on platforms where all
+* the PMUs are attached to a single SPI.
+*/
+   struct irq_work work;
+#endif
 };

 struct arm_pmu {
@@ -117,6 +125,10 @@ struct arm_pmu {
struct platform_device  *plat_device;
struct pmu_hw_events__percpu *hw_events;
struct notifier_block   hotplug_nb;
+#ifdef CONFIG_SMP
+   int muxed_spi_workaround_irq;
+   atomic_tremaining_irq_work;
+#endif
 };

 #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 557e128e4df0..e3fc1a0ce969 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -305,8 +305,6 @@ validate_group(struct perf_event *event)
 static irqreturn_t armpmu_dispatch_irq(int irq, void *dev)
 {

[PATCH v6] arm: perf: Directly handle SMP platforms with one SPI

2015-03-04 Thread Daniel Thompson
Some ARM platforms mux the PMU interrupt of every core into a single
SPI. On such platforms if the PMU of any core except 0 raises an interrupt
then it cannot be serviced and eventually, if you are lucky, the spurious
irq detection might forcefully disable the interrupt.

On these SoCs it is not possible to determine which core raised the
interrupt so workaround this issue by queuing irqwork on the other
cores whenever the primary interrupt handler is unable to service the
interrupt.

The u8500 platform has an alternative workaround that dynamically alters
the affinity of the PMU interrupt. This workaround logic is no longer
required so the original code is removed as is the hook it relied upon.

Tested on imx6q (which has fours cores/PMUs all muxed to a single SPI)
using a simple soak, combined perf and CPU hotplug soak and using
perf fuzzer's fast_repro.sh.

Signed-off-by: Daniel Thompson daniel.thomp...@linaro.org
---

Notes:
v2 was tested on u8500 (thanks to Linus Walleij). The latest patch
doesn't change the nature of the workaround itself but there has been
substantial churn in the logic to decide when it can safely be deployed.
Thus the changes were sufficient for me not to
preserve the Tested-By:.

v6:
 * Redesigned the code that decides if it is safe to deploy the
   workaround (acting on the review by Mark Rutland). Code should no
   longer race during hot unplug; previous patches sought to make the
   hot unplug race benign and the old approach had flaws and even if
   it could be made correct was tortuously hard to review.

v5:
 * Removed the work queue nonsense; being completely race-free requires
   us to take a mutex or avoid dispatch from interrupt (Will Deacon).
   Replacement code can potentially race with a CPU hot unplug however
   it is careful to minimise exposure, to mitigate harmful effects and
   has fairly prominent comments.

v4:
 * Ripped out the logic that tried to preserve the operation of the
   spurious interrupt detector. It was complex and not really needed
   (Will Deacon).
 * Removed a redundant memory barrier and added a comment explaining
   why it is not needed (Will Deacon).
 * Made fully safe w.r.t. hotplug by falling back to a work queue
   if there is a hotplug operation in flight when the PMU interrupt
   comes in (Will Deacon). The work queue code paths have been tested
   synthetically (by changing the if condition).
 * Posted the correct, as in compilable and tested, version of the code
   (Will Deacon).

v3:
 * Removed function pointer indirection when deploying workaround code
   and reorganise the code accordingly (Mark Rutland).
 * Move the workaround state tracking into the existing percpu data
   structure (Mark Rutland).
 * Renamed cret to percpu_ret and rewrote the comment describing the
   purpose of this variable (Mark Rutland).
 * Copy the cpu_online_mask and use that to act on a consistent set of
   cpus throughout the workaround (Mark Rutland).
 * Changed single_irq to muxed_spi to more explicitly describe
   the problem.

v2:
 * Fixed build problems on systems without SMP.

v1:
 * Thanks to Lucas Stach, Russell King and Thomas Gleixner for
   critiquing an older, completely different way to tackle the
   same problem.


 arch/arm/include/asm/pmu.h   |  12 +++
 arch/arm/kernel/perf_event.c |   9 +-
 arch/arm/kernel/perf_event_cpu.c | 179 +++
 arch/arm/kernel/perf_event_v7.c  |   2 +-
 arch/arm/mach-ux500/cpu-db8500.c |  29 ---
 5 files changed, 178 insertions(+), 53 deletions(-)

diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b1596bd59129..dfef7904b790 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -87,6 +87,14 @@ struct pmu_hw_events {
 * already have to allocate this struct per cpu.
 */
struct arm_pmu  *percpu_pmu;
+
+#ifdef CONFIG_SMP
+   /*
+* This is used to schedule workaround logic on platforms where all
+* the PMUs are attached to a single SPI.
+*/
+   struct irq_work work;
+#endif
 };

 struct arm_pmu {
@@ -117,6 +125,10 @@ struct arm_pmu {
struct platform_device  *plat_device;
struct pmu_hw_events__percpu *hw_events;
struct notifier_block   hotplug_nb;
+#ifdef CONFIG_SMP
+   int muxed_spi_workaround_irq;
+   atomic_tremaining_irq_work;
+#endif
 };

 #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 557e128e4df0..e3fc1a0ce969 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -305,8 +305,6 @@ validate_group(struct perf_event *event)
 static irqreturn_t armpmu_dispatch_irq(int