Re: [PATCH v2 RESEND 0/5] sched/vtime: vtime.h headers cleanup

2024-04-17 Thread Frederic Weisbecker
On Wed, Apr 10, 2024 at 05:09:43PM +0200, Alexander Gordeev wrote:
> Hi All,
> 
> There are no changes since the last post, just a re-send.
> 
> v2:
> - patch 4: commit message reworded (Heiko)
> - patch 5: vtime.h is removed from Kbuild scripts (PowerPC only) (Heiko)
> 
> v1:
> Please find a small cleanup to vtime_task_switch() wiring.
> I split it into smaller patches to allow separate PowerPC
> vs s390 reviews. Otherwise patches 2+3 and 4+5 could have
> been merged.
> 
> I tested it on s390 and compile-tested it on 32- and 64-bit
> PowerPC and a few other major architectures, but it is
> only of concern for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable
> ones (AFAICT).
> 
> Thanks!

It probably makes sense to apply the whole series to the scheduler tree.
Does any powerpc or s390 maintainer oppose that?

Thanks.

> 
> 
> Alexander Gordeev (5):
>   sched/vtime: remove confusing arch_vtime_task_switch() declaration
>   sched/vtime: get rid of generic vtime_task_switch() implementation
>   s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
>   s390/irq,nmi: include <asm/vtime.h> header directly
>   sched/vtime: do not include <asm/vtime.h> header
> 
>  arch/powerpc/include/asm/Kbuild|  1 -
>  arch/powerpc/include/asm/cputime.h | 13 -
>  arch/powerpc/kernel/time.c | 22 ++
>  arch/s390/include/asm/vtime.h  |  2 --
>  arch/s390/kernel/irq.c |  1 +
>  arch/s390/kernel/nmi.c |  1 +
>  include/asm-generic/vtime.h|  1 -
>  include/linux/vtime.h  |  5 -
>  kernel/sched/cputime.c | 13 -
>  9 files changed, 24 insertions(+), 35 deletions(-)
>  delete mode 100644 include/asm-generic/vtime.h
> 
> -- 
> 2.40.1
> 


Re: [PATCH 5/5] sched/vtime: do not include <asm/vtime.h> header

2024-02-07 Thread Frederic Weisbecker
On Wed, Feb 07, 2024 at 03:12:57PM +0100, Alexander Gordeev wrote:
> On Wed, Feb 07, 2024 at 12:30:08AM +0100, Frederic Weisbecker wrote:
> > Reviewed-by: Frederic Weisbecker 
> 
> Thank you for the review, Frederic!
> 
> Heiko's comment is valid and I would add this chunk in v2:
> 
> --- a/arch/powerpc/include/asm/Kbuild
> +++ b/arch/powerpc/include/asm/Kbuild
> @@ -6,5 +6,4 @@ generic-y += agp.h
>  generic-y += kvm_types.h
>  generic-y += mcs_spinlock.h
>  generic-y += qrwlock.h
> -generic-y += vtime.h
>  generic-y += early_ioremap.h
> 
> Would you keep your Reviewed-by?

Sure!


Re: [PATCH 5/5] sched/vtime: do not include <asm/vtime.h> header

2024-02-06 Thread Frederic Weisbecker
On Sun, Jan 28, 2024 at 08:58:54PM +0100, Alexander Gordeev wrote:
> There is no architecture-specific code or data left
> that generic <linux/vtime.h> needs to know about.
> Thus, avoid the inclusion of <asm/vtime.h> header.
> 
> Signed-off-by: Alexander Gordeev 

Reviewed-by: Frederic Weisbecker 


Re: [PATCH 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover

2024-02-06 Thread Frederic Weisbecker
On Sun, Jan 28, 2024 at 08:58:52PM +0100, Alexander Gordeev wrote:
> __ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.
> 
> Signed-off-by: Alexander Gordeev 

Reviewed-by: Frederic Weisbecker 


> ---
>  arch/s390/include/asm/vtime.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
> index fe17e448c0c5..561c91c1a87c 100644
> --- a/arch/s390/include/asm/vtime.h
> +++ b/arch/s390/include/asm/vtime.h
> @@ -2,8 +2,6 @@
>  #ifndef _S390_VTIME_H
>  #define _S390_VTIME_H
>  
> -#define __ARCH_HAS_VTIME_TASK_SWITCH
> -
>  static inline void update_timer_sys(void)
>  {
>   S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
> -- 
> 2.40.1
> 


Re: [PATCH 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation

2024-02-06 Thread Frederic Weisbecker
On Sun, Jan 28, 2024 at 08:58:51PM +0100, Alexander Gordeev wrote:
> The generic vtime_task_switch() implementation gets built only
> if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
> architecture to implement arch_vtime_task_switch() callback at
> the same time, which is confusing.
> 
> Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
> architecture only and vtime_task_switch() generic variant is rather
> superfluous.
> 
> Simplify the whole vtime_task_switch() wiring by moving the existing
> generic implementation to PowerPC.
> 
> Signed-off-by: Alexander Gordeev 

Reviewed-by: Frederic Weisbecker 
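
For reference, the generic variant being relocated looks roughly like this
(a sketch from memory of kernel/sched/cputime.c; exact guards may differ):

	/* Built only when the arch doesn't provide its own vtime_task_switch() */
	#ifndef __ARCH_HAS_VTIME_TASK_SWITCH
	void vtime_task_switch(struct task_struct *prev)
	{
		if (is_idle_task(prev))
			vtime_account_idle(prev);
		else
			vtime_account_kernel(prev);

		vtime_flush(prev);
		arch_vtime_task_switch(prev);
	}
	#endif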


Re: [PATCH 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration

2024-02-06 Thread Frederic Weisbecker
On Sun, Jan 28, 2024 at 08:58:50PM +0100, Alexander Gordeev wrote:
> Callback arch_vtime_task_switch() is only defined when
> CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the
> function prototype forward declaration is present for
> CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it.
> 
> Signed-off-by: Alexander Gordeev 

Reviewed-by: Frederic Weisbecker 


Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Frederic Weisbecker
On Wed, Apr 05, 2023 at 02:05:13PM +0200, Frederic Weisbecker wrote:
> On Wed, Apr 05, 2023 at 01:41:48PM +0200, Peter Zijlstra wrote:
> 1) It has the advantage to check context tracking _after_ the llist_add(), so
>it really can't be misused ordering-wise.
> 
> 2) The IPI callback is always enqueued and then executed upon return
>from userland. The ordering makes sure it will either IPI or execute
>upon return to userspace.

*from userspace


Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Frederic Weisbecker
On Wed, Apr 05, 2023 at 01:41:48PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 01:10:07PM +0200, Frederic Weisbecker wrote:
> > On Wed, Apr 05, 2023 at 12:44:04PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > > > +   int state = atomic_read(&ct->state);
> > > > +   /* will return true only for cpus in kernel space */
> > > > +   return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > > > +}
> > > 
> > > Also note that this doesn't strictly prevent userspace from being interrupted.
> > > You may well observe the CPU in kernel but it may receive the IPI later after
> > > switching to userspace.
> > > 
> > > We could arrange for avoiding that with marking ct->state with a pending work bit
> > > to flush upon user entry/exit but that's a bit more overhead so I first need to
> > > know about your expectations here, ie: can you tolerate such an occasional
> > > interruption or not?
> > 
> > Bah, actually what can we do to prevent that racy IPI? Not much I fear...
> 
> Yeah, so I don't think that's actually a problem. The premise is that
> *IFF* NOHZ_FULL stays in userspace, then it will never observe the IPI.
> 
> If it violates this by doing syscalls or other kernel entries; it gets
> to keep the pieces.

Ok so how about the following (only build tested)?

Two things:

1) It has the advantage to check context tracking _after_ the llist_add(), so
   it really can't be misused ordering-wise.

2) The IPI callback is always enqueued and then executed upon return
   from userland. The ordering makes sure it will either IPI or execute
   upon return to userspace.

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 4a4d56f77180..dc4b56da1747 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -137,10 +137,23 @@ static __always_inline int ct_state(void)
return ret;
 }
 
+static __always_inline int ct_state_cpu(int cpu)
+{
+   struct context_tracking *ct;
+
+   if (!context_tracking_enabled())
+   return CONTEXT_DISABLED;
+
+   ct = per_cpu_ptr(&context_tracking, cpu);
+
+   return atomic_read(>state) & CT_STATE_MASK;
+}
+
 #else
 static __always_inline bool context_tracking_enabled(void) { return false; }
 static __always_inline bool context_tracking_enabled_cpu(int cpu) { return false; }
 static __always_inline bool context_tracking_enabled_this_cpu(void) { return false; }
+static inline int ct_state_cpu(int cpu) { return CONTEXT_DISABLED; }
 #endif /* CONFIG_CONTEXT_TRACKING_USER */
 
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 846add8394c4..cdc7e8a59acc 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 
+#include "../kernel/sched/smp.h"
 #include "common.h"
 
 #define CREATE_TRACE_POINTS
@@ -27,6 +28,10 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
instrumentation_begin();
kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
+
+   /* Flush delayed IPI queue on nohz_full */
+   if (context_tracking_enabled_this_cpu())
+   flush_smp_call_function_queue();
instrumentation_end();
 }
 
diff --git a/kernel/smp.c b/kernel/smp.c
index 06a413987a14..14b25d25ef3a 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -878,6 +878,8 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  */
 #define SCF_WAIT   (1U << 0)
 #define SCF_RUN_LOCAL  (1U << 1)
+#define SCF_NO_USER(1U << 2)
+
 
 static void smp_call_function_many_cond(const struct cpumask *mask,
smp_call_func_t func, void *info,
@@ -946,10 +948,13 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 #endif
cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
-   __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
-   nr_cpus++;
-   last_cpu = cpu;
-
+   if (!(scf_flags & SCF_NO_USER) ||
+   !IS_ENABLED(CONFIG_GENERIC_ENTRY) ||
+   ct_state_cpu(cpu) != CONTEXT_USER) {
+   __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+   nr_cpus++;
+   last_cpu = cpu;
+   }
cfd_seq_store(pcpu->seq_

Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Frederic Weisbecker
On Wed, Apr 05, 2023 at 12:44:04PM +0200, Frederic Weisbecker wrote:
> On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > +   int state = atomic_read(&ct->state);
> > +   /* will return true only for cpus in kernel space */
> > +   return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > +}
> 
> Also note that this doesn't strictly prevent userspace from being interrupted.
> You may well observe the CPU in kernel but it may receive the IPI later after
> switching to userspace.
> 
> We could arrange for avoiding that with marking ct->state with a pending work bit
> to flush upon user entry/exit but that's a bit more overhead so I first need to
> know about your expectations here, ie: can you tolerate such an occasional
> interruption or not?

Bah, actually what can we do to prevent that racy IPI? Not much I fear...



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Frederic Weisbecker
On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> @@ -191,6 +192,20 @@ static void tlb_remove_table_smp_sync(void *arg)
>   /* Simply deliver the interrupt */
>  }
>  
> +
> +#ifdef CONFIG_CONTEXT_TRACKING
> +static bool cpu_in_kernel(int cpu, void *info)
> +{
> + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);

Like Peter said, an smp_mb() is required here before the read (unless there is
already one between the page table modification and that ct->state read?).

So that you have this pairing:


   WRITE page_table                WRITE ct->state
   smp_mb()                        smp_mb() // implied by atomic_fetch_or
   READ ct->state                  READ page_table

> + int state = atomic_read(&ct->state);
> + /* will return true only for cpus in kernel space */
> + return state & CT_STATE_MASK == CONTEXT_KERNEL;
> +}

Also note that this doesn't strictly prevent userspace from being interrupted.
You may well observe the CPU in kernel but it may receive the IPI later after
switching to userspace.

We could arrange for avoiding that with marking ct->state with a pending work bit
to flush upon user entry/exit but that's a bit more overhead so I first need to
know about your expectations here, ie: can you tolerate such an occasional
interruption or not?

Thanks.
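
Putting both remarks together, a corrected helper would look roughly like
this (a sketch only, reusing the names from the quoted patch; note the added
smp_mb() and the parentheses -- in C, == binds tighter than &, so the quoted
test as written compares CT_STATE_MASK against CONTEXT_KERNEL first):

	static bool cpu_in_kernel(int cpu, void *info)
	{
		struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
		int state;

		/* Pairs with the barrier ordering the page table write
		 * against the ct->state write on the remote CPU. */
		smp_mb();
		state = atomic_read(&ct->state);
		/* True only for CPUs currently in kernel space */
		return (state & CT_STATE_MASK) == CONTEXT_KERNEL;
	}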



Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-23 Thread Frederic Weisbecker
On Mon, Nov 21, 2022 at 11:51:40AM +0800, Zhouyi Zhou wrote:
> During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to
> offline tick_do_timer_cpu, the operation will fail because in
> function tick_nohz_cpu_down:
> ```
> if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
>   return -EBUSY;
> ```
> Above bug was first discovered in torture tests performed in a PPC VM
> of the Open Source Lab of Oregon State University, and is reproducible on RISC-V
> and X86-64 (with the additional kernel command line cpu0_hotplug).
> 
> In this patch, we avoid offlining tick_do_timer_cpu by distributing
> the choice of CPU to offline among the remaining CPUs.
> 
> Signed-off-by: Zhouyi Zhou 
> ---
>  include/linux/tick.h|  1 +
>  kernel/time/tick-common.c   |  1 +
>  kernel/time/tick-internal.h |  1 -
>  kernel/torture.c| 10 ++
>  4 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index bfd571f18cfd..23cc0b205853 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -14,6 +14,7 @@
>  #include 
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
> +extern int tick_do_timer_cpu __read_mostly;
>  extern void __init tick_init(void);
>  /* Should be core only, but ARM BL switcher requires it */
>  extern void tick_suspend_local(void);
> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> index 46789356f856..87b9b9afa320 100644
> --- a/kernel/time/tick-common.c
> +++ b/kernel/time/tick-common.c
> @@ -48,6 +48,7 @@ ktime_t tick_next_period;
>   *procedure also covers cpu hotplug.
>   */
>  int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT;
> +EXPORT_SYMBOL_GPL(tick_do_timer_cpu);

Please rather make a function for this. This is an internal value
that we don't want to expose to modules.

This can be:

 int tick_nohz_full_timekeeper(void)
 {
 if (tick_nohz_full_enabled() && tick_do_timer_cpu >= 0)
 return tick_do_timer_cpu;
 else
 return nr_cpu_ids;
 }

And then just check if the value is below nr_cpu_ids.
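
One way torture_onoff() could then use it (a sketch only, assuming the
helper above gets exported to modules and reusing the random state already
present in kernel/torture.c):

	/* Pick a victim CPU, but never the nohz_full timekeeper. */
	cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
	if (cpu == tick_nohz_full_timekeeper())
		cpu = (cpu + 1) % (maxcpu + 1);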

Thanks.


Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

2022-11-23 Thread Frederic Weisbecker
On Mon, Nov 21, 2022 at 05:37:54PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 21, 2022 at 11:51:40AM +0800, Zhouyi Zhou wrote:
> > @@ -358,7 +359,16 @@ torture_onoff(void *arg)
> > schedule_timeout_interruptible(HZ / 10);
> > continue;
> > }
> > +#ifdef CONFIG_NO_HZ_FULL
> > +   /* do not offline tick do timer cpu */
> > +   if (tick_nohz_full_running) {
> > +   cpu = (torture_random(&rand) >> 4) % maxcpu;
> > +   if (cpu >= tick_do_timer_cpu)
> 
> Why is this ">=" instead of "=="?
> 
> > +   cpu = (cpu + 1) % (maxcpu + 1);
> > +   } else
> > +#else
> > cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
> > +#endif
> 
> What happens if the value of tick_do_timer_cpu changes between the time of
> the check above and the call to torture_offline() below?  Alternatively,
> how is such a change in value prevented?

It can't; currently tick_do_timer_cpu is fixed when nohz_full is running.
It can however have special values at early boot such as TICK_DO_TIMER_NONE.
But if rcutorture is initialized after smp, it should be ok.

Thanks.

> 
>   Thanx, Paul
> 
> > if (!torture_offline(cpu,
> >  &n_offline_attempts, &n_offline_successes,
> >  &sum_offline, &min_offline, &max_offline))
> > -- 
> > 2.34.1
> > 


Re: [PATCH v2 00/44] cpuidle,rcu: Clean up the mess

2022-09-20 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:39AM +0200, Peter Zijlstra wrote:
> Hi All!
> 
> At long last, a respin of the cpuidle vs rcu cleanup patches.
> 
> v1: https://lkml.kernel.org/r/20220608142723.103523...@infradead.org
> 
> These here patches clean up the mess that is cpuidle vs rcuidle.
> 
> At the end of the ride there's only one RCU_NONIDLE user left:
> 
>   arch/arm64/kernel/suspend.c:RCU_NONIDLE(__cpu_suspend_exit());
> 
> and 'one' trace_*_rcuidle() user:
> 
>   kernel/trace/trace_preemptirq.c: trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
>   kernel/trace/trace_preemptirq.c: trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
>   kernel/trace/trace_preemptirq.c: trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
>   kernel/trace/trace_preemptirq.c: trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
>   kernel/trace/trace_preemptirq.c: trace_preempt_enable_rcuidle(a0, a1);
>   kernel/trace/trace_preemptirq.c: trace_preempt_disable_rcuidle(a0, a1);
> 
> However this last is all in deprecated code that should be unused for GENERIC_ENTRY.
> 
> I've touched a lot of code that I can't test and I might've broken something by
> accident. In particular the whole ARM cpuidle stuff was quite involved.
> 
> Please all; have a look where you haven't already.
> 
> 
> New since v1:
> 
>  - rebase on top of Frederic's rcu-context-tracking rename fest
>  - more omap goodness as per the last discussion (thanks Tony!)
>  - removed one more RCU_NONIDLE() from arm64/risc-v perf code
>  - ubsan/kasan fixes
>  - intel_idle module-param for testing
>  - a bunch of extra __always_inline, because compilers are silly.

Except for those I have already tagged as Reviewed:

Acked-by: Frederic Weisbecker 

Thanks for the hard work!


Re: [PATCH v2 11/44] cpuidle,omap4: Push RCU-idle into driver

2022-09-20 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:50AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again, some *four* times, before going idle is daft.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Reviewed-by: Tony Lindgren 
> Tested-by: Tony Lindgren 

Reviewed-by: Frederic Weisbecker 


Re: [PATCH v2 03/44] cpuidle/poll: Ensure IRQ state is invariant

2022-09-20 Thread Frederic Weisbecker
On Tue, Sep 20, 2022 at 10:57:00AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 19, 2022 at 03:19:27PM +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 19, 2022 at 11:59:42AM +0200, Peter Zijlstra wrote:
> > > cpuidle_state::enter() methods should be IRQ invariant
> > 
> > Got a bit confused with the invariant thing since the first chunk I
> > see in this patch is a conversion to a non-traceable local_irq_enable().
> > 
> > Maybe just add a short mention about that and why?
> 
> Changelog now reads:
> 
> ---
> Subject: cpuidle/poll: Ensure IRQ state is invariant
> From: Peter Zijlstra 
> Date: Tue May 31 15:43:32 CEST 2022
> 
> cpuidle_state::enter() methods should be IRQ invariant.
> 
> Additionally make sure to use raw_local_irq_*() methods since this
> cpuidle callback will be called with RCU already disabled.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Reviewed-by: Rafael J. Wysocki 

Reviewed-by: Frederic Weisbecker 

Thanks!



Re: [PATCH v2 08/44] cpuidle,imx6: Push RCU-idle into driver

2022-09-20 Thread Frederic Weisbecker
On Tue, Sep 20, 2022 at 10:58:59AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 19, 2022 at 04:21:23PM +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 19, 2022 at 11:59:47AM +0200, Peter Zijlstra wrote:
> > > Doing RCU-idle outside the driver, only to then temporarily enable it
> > > again, at least twice, before going idle is daft.
> > 
> > Hmm, what ends up calling RCU_IDLE() here? Also what about
> > cpu_do_idle()?
> 
> I've ammended patches 5-12 with a comment like:
> 
> Notably both cpu_pm_enter() and cpu_cluster_pm_enter() implicitly
> re-enable RCU.
> 
> (each noting the specific sites for the relevant patch).

Thanks!


Re: [PATCH v2 09/44] cpuidle,omap3: Push RCU-idle into driver

2022-09-20 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 05:19:05PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 19, 2022 at 04:31:42PM +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 19, 2022 at 11:59:48AM +0200, Peter Zijlstra wrote:
> > > Doing RCU-idle outside the driver, only to then temporarily enable it
> > > again before going idle is daft.
> > 
> > That doesn't tell where those calls are.
> 
> cpu_pm_enter/exit and the power domain stuff, possibly also the clock
> domain stuff. It's all over :/
> 
> I suppose I can add a blurb and copy/paste it around the various patches
> if you want.

Yes please, sorry, I don't want to bother, but just for the sake of
git blame reporting something useful in 5 years.

Thanks.


Re: [PATCH v2 08/44] cpuidle,imx6: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 05:03:04PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 19, 2022 at 04:49:41PM +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 19, 2022 at 11:59:47AM +0200, Peter Zijlstra wrote:
> > > Doing RCU-idle outside the driver, only to then temporarily enable it
> > > again, at least twice, before going idle is daft.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > ---
> > >  arch/arm/mach-imx/cpuidle-imx6sx.c |5 -
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > --- a/arch/arm/mach-imx/cpuidle-imx6sx.c
> > > +++ b/arch/arm/mach-imx/cpuidle-imx6sx.c
> > > @@ -47,7 +47,9 @@ static int imx6sx_enter_wait(struct cpui
> > >   cpu_pm_enter();
> > >   cpu_cluster_pm_enter();
> > >  
> > > + ct_idle_enter();
> > >   cpu_suspend(0, imx6sx_idle_finish);
> > > + ct_idle_exit();
> > >  
> > >   cpu_cluster_pm_exit();
> > >   cpu_pm_exit();
> > > @@ -87,7 +89,8 @@ static struct cpuidle_driver imx6sx_cpui
> > >*/
> > >   .exit_latency = 300,
> > >   .target_residency = 500,
> > > - .flags = CPUIDLE_FLAG_TIMER_STOP,
> > > + .flags = CPUIDLE_FLAG_TIMER_STOP |
> > > +  CPUIDLE_FLAG_RCU_IDLE,
> > >   .enter = imx6sx_enter_wait,
> > 
> > There is a second one below that also uses imx6sx_enter_wait.
> 
> Oh, above you mean; but only @index==2 gets us into the whole PM crud.
> @index==1 is fine afaict.

Ah ok, got it, hence why you didn't touch cpu_do_idle()...
May need to comment that somewhere...

Reviewed-by: Frederic Weisbecker 

Thanks!


Re: [PATCH v2 08/44] cpuidle,imx6: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:47AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again, at least twice, before going idle is daft.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/arm/mach-imx/cpuidle-imx6sx.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- a/arch/arm/mach-imx/cpuidle-imx6sx.c
> +++ b/arch/arm/mach-imx/cpuidle-imx6sx.c
> @@ -47,7 +47,9 @@ static int imx6sx_enter_wait(struct cpui
>   cpu_pm_enter();
>   cpu_cluster_pm_enter();
>  
> + ct_idle_enter();
>   cpu_suspend(0, imx6sx_idle_finish);
> + ct_idle_exit();
>  
>   cpu_cluster_pm_exit();
>   cpu_pm_exit();
> @@ -87,7 +89,8 @@ static struct cpuidle_driver imx6sx_cpui
>*/
>   .exit_latency = 300,
>   .target_residency = 500,
> - .flags = CPUIDLE_FLAG_TIMER_STOP,
> + .flags = CPUIDLE_FLAG_TIMER_STOP |
> +  CPUIDLE_FLAG_RCU_IDLE,
>   .enter = imx6sx_enter_wait,

There is a second one below that also uses imx6sx_enter_wait.

Thanks.

>   .name = "LOW-POWER-IDLE",
>   .desc = "ARM power off",
> 
> 


Re: [PATCH v2 09/44] cpuidle,omap3: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:48AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again before going idle is daft.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Reviewed-by: Tony Lindgren 
> Tested-by: Tony Lindgren 

Ok now with the cpu_pm_*() informations that makes sense:

Reviewed-by: Frederic Weisbecker 


Re: [PATCH v2 10/44] cpuidle,armada: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:49AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again before going idle is daft.

Ah wait, now I see: cpu_pm_enter()/cpu_pm_exit() -> cpu_pm_notify*() are the culprits.
Might be worth adding a short note about that in your changelogs.

> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  drivers/cpuidle/cpuidle-mvebu-v7.c |7 +++
>  1 file changed, 7 insertions(+)
> 
> --- a/drivers/cpuidle/cpuidle-mvebu-v7.c
> +++ b/drivers/cpuidle/cpuidle-mvebu-v7.c
> @@ -36,7 +36,10 @@ static int mvebu_v7_enter_idle(struct cp
>   if (drv->states[index].flags & MVEBU_V7_FLAG_DEEP_IDLE)
>   deepidle = true;
>  
> + ct_idle_enter();
>   ret = mvebu_v7_cpu_suspend(deepidle);
> + ct_idle_exit();

And then yes of course:

Reviewed-by: Frederic Weisbecker 


Re: [PATCH v2 09/44] cpuidle,omap3: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:48AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again before going idle is daft.

That doesn't tell where those calls are.

Thanks.


Re: [PATCH v2 08/44] cpuidle,imx6: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:47AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again, at least twice, before going idle is daft.

Hmm, what ends up calling RCU_IDLE() here? Also what about
cpu_do_idle()?

Thanks.

> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/arm/mach-imx/cpuidle-imx6sx.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- a/arch/arm/mach-imx/cpuidle-imx6sx.c
> +++ b/arch/arm/mach-imx/cpuidle-imx6sx.c
> @@ -47,7 +47,9 @@ static int imx6sx_enter_wait(struct cpui
>   cpu_pm_enter();
>   cpu_cluster_pm_enter();
>  
> + ct_idle_enter();
>   cpu_suspend(0, imx6sx_idle_finish);
> + ct_idle_exit();
>  
>   cpu_cluster_pm_exit();
>   cpu_pm_exit();
> @@ -87,7 +89,8 @@ static struct cpuidle_driver imx6sx_cpui
>*/
>   .exit_latency = 300,
>   .target_residency = 500,
> - .flags = CPUIDLE_FLAG_TIMER_STOP,
> + .flags = CPUIDLE_FLAG_TIMER_STOP |
> +  CPUIDLE_FLAG_RCU_IDLE,
>   .enter = imx6sx_enter_wait,
>   .name = "LOW-POWER-IDLE",
>   .desc = "ARM power off",
> 
> 


Re: [PATCH v2 07/44] cpuidle,psci: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:46AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again, at least twice, before going idle is daft.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Reviewed-by: Frederic Weisbecker 


Re: [PATCH v2 06/44] cpuidle,tegra: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:45AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again, at least twice, before going idle is daft.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Reviewed-by: Frederic Weisbecker 


Re: [PATCH v2 05/44] cpuidle,riscv: Push RCU-idle into driver

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:44AM +0200, Peter Zijlstra wrote:
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again, at least twice, before going idle is daft.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Reviewed-by: Frederic Weisbecker 


Re: [PATCH v2 03/44] cpuidle/poll: Ensure IRQ state is invariant

2022-09-19 Thread Frederic Weisbecker
On Mon, Sep 19, 2022 at 11:59:42AM +0200, Peter Zijlstra wrote:
> cpuidle_state::enter() methods should be IRQ invariant

Got a bit confused with the invariant thing since the first chunk I
see in this patch is a conversion to a non-traceable local_irq_enable().

Maybe just add a short mention about that and why?

Thanks.

> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Reviewed-by: Rafael J. Wysocki 
> ---
>  drivers/cpuidle/poll_state.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -17,7 +17,7 @@ static int __cpuidle poll_idle(struct cp
>  
>   dev->poll_time_limit = false;
>  
> - local_irq_enable();
> + raw_local_irq_enable();
>   if (!current_set_polling_and_test()) {
>   unsigned int loop_count = 0;
>   u64 limit;
> @@ -36,6 +36,8 @@ static int __cpuidle poll_idle(struct cp
>   }
>   }
>   }
> + raw_local_irq_disable();
> +
>   current_clr_polling();
>  
>   return index;
> 
> 


Re: ppc64le: `NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!` when turning off SMT

2022-02-08 Thread Frederic Weisbecker
On Tue, Feb 08, 2022 at 08:32:37AM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> On the POWER8 server IBM S822LC running Ubuntu 21.10, Linux 5.17-rc1+ built
> with
> 
> $ grep HZ /boot/config-5.17.0-rc1+
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_NO_HZ=y
> # CONFIG_HZ_100 is not set
> CONFIG_HZ_250=y
> # CONFIG_HZ_300 is not set
> # CONFIG_HZ_1000 is not set
> CONFIG_HZ=250
> 
> once warned about a NOHZ tick-stop error, when I executed `sudo
> /usr/sbin/ppc64_cpu --smt=off` (so that KVM would work).

I see, so I assume this sets some CPUs offline, right?

> 
> ```
> $ dmesg
> [0.00] Linux version 5.17.0-rc1+
> (pmen...@flughafenberlinbrandenburgwillybrandt.molgen.mpg.de) (Ubuntu clang
> version 13.0.0-2, LLD 13.0.0) #1 SMP Fri Jan 28 17:13:04 CET 2022
> […]
> [271272.030262] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271272.305726] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271272.549790] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271274.885167] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271275.113896] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271275.412902] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271275.625245] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271275.833107] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271276.041391] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> [271277.244880] NOHZ tick-stop error: Non-RCU local softirq work is pending,
> handler #20!!!
> ```

That's IRQ_POLL_SOFTIRQ (pending mask 0x20 is bit 5). The problem here is
probably that some of these softirqs are pending even though ksoftirqd has
been parked.

I see there is irq_poll_cpu_dead() that migrates the pending queue once
the CPU is finally dead, so this is well handled.
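
For reference, that hotplug handler is roughly (a sketch from memory of
lib/irq_poll.c):

	static int irq_poll_cpu_dead(unsigned int cpu)
	{
		/*
		 * If a CPU goes away, splice its entries to the current CPU
		 * and set the IRQ_POLL softirq bit to handle them.
		 */
		local_irq_disable();
		list_splice_init(&per_cpu(blk_cpu_iopoll, cpu),
				 this_cpu_ptr(&blk_cpu_iopoll));
		__raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
		local_irq_enable();
		return 0;
	}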

I'm preparing a patch to fix the warning.

Thanks.


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Frederic Weisbecker
On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> @@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  
>   if (llist_add(&p->wake_entry, &rq->wake_list)) {
>   if (!set_nr_if_polling(rq->idle))
> - smp_call_function_single_async(cpu, &rq->wake_csd);
> + smp_call_function_single_async(cpu, &p->wake_csd);
>   else
>   trace_sched_wake_idle_without_ipi(cpu);

Ok that's of course very unlikely but could it be possible to have the
following:

CPU 0                       CPU 1                            CPU 2
-----                       -----                            -----

//Wake up A
ttwu_queue(TASK A, CPU 1)   idle_loop {
                                ttwu_queue_pending {
                                    raw_spin_unlock_irqrestore(rq)
                            # VMEXIT (with IPI still pending)
                                                             //task A migrates here
                                                             wait_event()
                                                             //sleep

//Wake up A
ttwu_queue(TASK A, CPU 2) {
    //IPI on CPU 2 ignored
    // due to csd->flags == CSD_LOCK



Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Frederic Weisbecker
On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> > 
> > > This:
> > > 
> > > > smp_call_function_single_async() {          smp_call_function_single_async() {
> > > >     // verified csd->flags != CSD_LOCK          // verified csd->flags != CSD_LOCK
> > > >     csd->flags = CSD_LOCK                       csd->flags = CSD_LOCK
> > > 
> > > concurrent smp_call_function_single_async() using the same csd is what
> > > I'm looking at as well.
> > 
> > So something like this ought to cure the fundamental problem and make
> > smp_call_function_single_async() more user friendly, but also more
> > expensive.
> > 
> > The problem is that while the ILB case is easy to fix, I can't seem to
> > find an equally nice solution for the ttwu_remote_queue() case; that
> > would basically require sticking the wake_csd in task_struct, I'll also
> > post that.
> > 
> > So it's either this:
> 
> Or this:
> 
> ---
>  include/linux/sched.h | 4 
>  kernel/sched/core.c   | 7 ---
>  kernel/sched/fair.c   | 2 +-
>  kernel/sched/sched.h  | 1 -
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f38d62c4632c..136ee400b568 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>   struct uclamp_seuclamp[UCLAMP_CNT];
>  #endif
>  
> +#ifdef CONFIG_SMP
> + call_single_data_t  wake_csd;
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>   /* List of struct preempt_notifier: */
>   struct hlist_head   preempt_notifiers;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b286469e26e..a7129652e89b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  
>   if (llist_add(&p->wake_entry, &rq->wake_list)) {
>   if (!set_nr_if_polling(rq->idle))
> - smp_call_function_single_async(cpu, &rq->wake_csd);
> + smp_call_function_single_async(cpu, &p->wake_csd);
>   else
>   trace_sched_wake_idle_without_ipi(cpu);
>   }
> @@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>  #endif
>  #if defined(CONFIG_SMP)
>   p->on_cpu = 0;
> + p->wake_csd = (struct __call_single_data) {
> + .func = wake_csd_func,
> + };
>  #endif
>   init_task_preempt_count(p);
>  #ifdef CONFIG_SMP
> @@ -6723,8 +6726,6 @@ void __init sched_init(void)
>   rq->avg_idle = 2*sysctl_sched_migration_cost;
>   rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>  
> - rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
> -
>   INIT_LIST_HEAD(>cfs_tasks);
>  
>   rq_attach_root(rq, _root_domain);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
>* is idle. And the softirq performing nohz idle load balance
>* will be run before returning from the IPI.
>*/
> - smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> + smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);

My fear here is that if a previous call from the same CPU but to another
target is still pending, the new one will be spuriously ignored.

Namely this could happen:

CPU 0                                            CPU 1
-----                                            -----
                                                 local_irq_disable() or VMEXIT
kick_ilb() {
    smp_call_function_single_async(CPU 1,
                                   &this_rq()->nohz_csd);
}

kick_ilb() {
    smp_call_function_single_async(CPU 2,
                                   &this_rq()->nohz_csd) {
        // IPI to CPU 2 ignored
        if (csd->flags == CSD_LOCK)
            return -EBUSY;
    }
}

                                                 local_irq_enable();




But I believe we can still keep the remote csd if nohz_flags() are
strictly only set before the IPI and strictly only cleared from it.

And I still don't understand why trigger_load_balance() raises the
softirq without setting the current CPU as ilb. run_rebalance_domains()
thus ignores it most of the time in the end, or spuriously clears the
nohz_flags set by an IPI sender. Or there is something I misunderstood
there.

(Haven't checked the wake up case yet).


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-20 Thread Frederic Weisbecker
On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a head up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
> 
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
> 
> 
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
> 
> > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127 flush_smp_call_function_queue+0x1fa/0x2e0

So I've tried to think of a race that could produce that and here is
the only thing I could come up with. It's a bit complicated unfortunately:

CPU 0                                           CPU 1
-----                                           -----

tick {
    trigger_load_balance() {
        raise_softirq(SCHED_SOFTIRQ);
        //but nohz_flags(0) = 0
    }
                                                kick_ilb() {
                                                    atomic_fetch_or(..., nohz_flags(0))
    softirq() {                                 #VMEXIT or anything that could stop a CPU for a while
        run_rebalance_domain() {
            nohz_idle_balance() {
                atomic_andnot(NOHZ_KICK_MASK, nohz_flag(0))
            }
        }
    }
}

// schedule
nohz_newidle_balance() {
    kick_ilb() { // pick current CPU
        atomic_fetch_or(..., nohz_flags(0))     #VMENTER
        smp_call_function_single_async() {          smp_call_function_single_async() {
            // verified csd->flags != CSD_LOCK          // verified csd->flags != CSD_LOCK
            csd->flags = CSD_LOCK                       csd->flags = CSD_LOCK
            //execute in place                          //queue and send IPI
            csd->flags = 0
            nohz_csd_func()
        }
    }
}

IPI {
    flush_smp_call_function_queue() {
        csd_unlock() {
            WARN_ON(csd->flags != CSD_LOCK) <-!



The root cause here would be that trigger_load_balance() unconditionally raises
the softirq. And I have to confess I'm not clear why, since the softirq is
essentially a no-op when nohz_flags() is 0.
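
For reference, the function in question is roughly (a sketch from memory of
kernel/sched/fair.c at the time):

	void trigger_load_balance(struct rq *rq)
	{
		/* Don't need to rebalance while attached to NULL domain */
		if (unlikely(on_null_domain(rq)))
			return;

		/* Raised whenever next_balance is due, independently of
		 * nohz_flags() -- the unconditional raise discussed above. */
		if (time_after_eq(jiffies, rq->next_balance))
			raise_softirq(SCHED_SOFTIRQ);

		nohz_balancer_kick(rq);
	}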

Thanks.


Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-04-05 Thread Frederic Weisbecker
On Mon, Apr 01, 2019 at 10:38:27AM +0200, Peter Zijlstra wrote:
> 
> + fweisbec, who did the remote bits
> 
> On Sat, Mar 30, 2019 at 01:10:28PM +1000, Nicholas Piggin wrote:
> > diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> > index 6b7cdf17ccf8..f0e539d0f879 100644
> > --- a/kernel/irq_work.c
> > +++ b/kernel/irq_work.c
> > -/* Enqueue the irq work @work on the current CPU */
> > -bool irq_work_queue(struct irq_work *work)
> > +/*
> > + * Enqueue the irq_work @work on @cpu unless it's already pending
> > + * somewhere.
> > + *
> > + * Can be re-enqueued while the callback is still in progress.
> > + */
> > +bool irq_work_queue_on(struct irq_work *work, int cpu)
> >  {
> > +#ifndef CONFIG_SMP
> > +   return irq_work_queue(work);
> > +

I'd suggest using "if (!IS_ENABLED(CONFIG_SMP))" here to avoid the large
ifdeffery.

> > +#else /* #ifndef CONFIG_SMP */
> > +   /* All work should have been flushed before going offline */
> > +   WARN_ON_ONCE(cpu_is_offline(cpu));
> > +
> > /* Only queue if not already pending */
> > if (!irq_work_claim(work))
> > return false;
> >  
> > -   /* Queue the entry and raise the IPI if needed. */
> > preempt_disable();
> > -
> > -   /* If the work is "lazy", handle it from next tick if any */
> > -   if (work->flags & IRQ_WORK_LAZY) {
> > -   if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
> > -   tick_nohz_tick_stopped())
> > -   arch_irq_work_raise();
> > -   } else {
> > -   if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
> > -   arch_irq_work_raise();
> > -   }
> > -
> > +   if (cpu != smp_processor_id()) {
> > +   /* Arch remote IPI send/receive backend aren't NMI safe */
> > +   WARN_ON_ONCE(in_nmi());
> > +   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
> > +   arch_send_call_function_single_ipi(cpu);
> > +   } else
> > +   __irq_work_queue(work);

Also perhaps rename __irq_work_queue() to irq_work_queue_local() to make it
instantly clearer to reviewers.
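
Combining both suggestions, the function shape would be roughly (a sketch
only):

	bool irq_work_queue_on(struct irq_work *work, int cpu)
	{
		if (!IS_ENABLED(CONFIG_SMP))
			return irq_work_queue(work);

		/* All work should have been flushed before going offline */
		WARN_ON_ONCE(cpu_is_offline(cpu));

		/* Only queue if not already pending */
		if (!irq_work_claim(work))
			return false;

		preempt_disable();
		if (cpu != smp_processor_id()) {
			/* Arch remote IPI send/receive backends aren't NMI safe */
			WARN_ON_ONCE(in_nmi());
			if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
				arch_send_call_function_single_ipi(cpu);
		} else {
			irq_work_queue_local(work);
		}
		preempt_enable();

		return true;
	}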

Other than those cosmetic changes,

  Reviewed-by: Frederic Weisbecker 

Thanks.


Re: [BUG][next-20170606][bisected 411fe24e6b] WARNING: CPU: 10 PID: 0 at kernel/time/tick-sched.c:791

2017-06-09 Thread Frederic Weisbecker
On Wed, Jun 07, 2017 at 12:56:53PM +0530, Abdul Haleem wrote:
> Hi,
> 
> Test: Trinity (https://github.com/kernelslacker/trinity)
> Machine : Power 8 PowerVM LPAR
> Kernel : 4.12.0-rc4-next-20170606
> gcc : version 5.2.1
> config : attached
> 
> With commit (411fe24e6b : nohz: Fix collision between tick and other
> hrtimers), a WARNING is seen while running trinity syscall fuzzer
> 
> In file kernel/time/tick-sched.c at line 791, a WARN_ON_ONCE is being
> triggered from the tick_nohz_stop_sched_tick() function.
> 
> /* Skip reprogram of event if its not changed */
> if (ts->tick_stopped && (expires == ts->next_tick)) {
> /* Sanity check: make sure clockevent is actually programmed */
> if (likely(dev->next_event <= ts->next_tick))
> goto out;
> 
> WARN_ON_ONCE(1);
> printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
> basemono, ts->next_tick, dev->next_event,
> hrtimer_active(&ts->sched_timer), hrtimer_get_expires(&ts->sched_timer));
> }
> 
> Trace logs:
> [22934.302780] [ cut here ]
> [22934.302787] WARNING: CPU: 10 PID: 0 at kernel/time/tick-sched.c:791 __tick_nohz_idle_enter+0x2e8/0x570

Hi Abdul,

Thanks for reporting. I've cooked a fix, any chance you could test it?

--
From f80041b5209aaf9d02ac25a29a248d0f214ba19f Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <fweis...@gmail.com>
Date: Thu, 8 Jun 2017 16:32:58 +0200
Subject: [PATCH] nohz: Fix spurious warning when hrtimer and clocksource get
 out of sync

The sanity check ensuring that the tick expiry cache (ts->next_tick)
is actually in sync with the hardware clock (dev->next_event) makes the
wrong assumption that the clock can't be programmed later than the
hrtimer deadline.

In fact the clock hardware can be programmed later on some conditions
such as:

* The hrtimer deadline is already in the past.
* The hrtimer deadline is earlier than the minimum delay supported
  by the hardware.

Such conditions can be met when we program the tick, for example if the
last jiffies update hasn't been seen by the current CPU yet, we may
program the hrtimer to a deadline that is earlier than ktime_get()
because last_jiffies_update is our timestamp base to compute the next
tick.

As a result, we can randomly observe such a warning:

WARNING: CPU: 5 PID: 0 at kernel/time/tick-sched.c:794 tick_nohz_stop_sched_tick kernel/time/tick-sched.c:791 [inline]
Call Trace:
 tick_nohz_irq_exit
 tick_irq_exit
 irq_exit
 exiting_irq
 smp_call_function_interrupt
 smp_call_function_single_interrupt
 call_function_single_interrupt

Therefore, let's rather make sure that the tick expiry cache is sync'ed
with the tick hrtimer deadline, against which it is not supposed to
drift away. The clock hardware instead has its own will and can't be
used as a reliable comparison point.

Reported-by: Sasha Levin <alexander.le...@verizon.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Rik van Riel <r...@redhat.com>
Cc: James Hartsock <hart...@redhat.com>
Cc: Tim Wright <t...@binbash.co.uk>
Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>
---
 kernel/time/tick-sched.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9d31f1e..83c788e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -768,7 +768,8 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
/* Skip reprogram of event if its not changed */
if (ts->tick_stopped && (expires == ts->next_tick)) {
/* Sanity check: make sure clockevent is actually programmed */
-   if (likely(dev->next_event <= ts->next_tick))
+   if (tick == KTIME_MAX ||
+   ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
goto out;
 
WARN_ON_ONCE(1);
@@ -806,8 +807,10 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
goto out;
}
 
+   hrtimer_set_expires(&ts->sched_timer, tick);
+
if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
-   hrtimer_start(&ts->sched_timer, tick, HRTIMER_MODE_ABS_PINNED);
+   hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
else
tick_program_event(tick, 1);
 out:
-- 
2.7.4



Re: [PATCH] powerpc: Remove leftover cputime_to_nsecs call causing build error

2017-02-22 Thread Frederic Weisbecker
On Thu, Feb 23, 2017 at 08:34:15AM +1100, Michael Ellerman wrote:
> Frederic Weisbecker <fweis...@gmail.com> writes:
> 
> > This type conversion is a leftover that got ignored during the kcpustat
> > conversion to nanosecs, resulting in build breakage with config having
> > CONFIG_NO_HZ_FULL=y.
> >
> > arch/powerpc/kernel/time.c: In function 'running_clock':
> > arch/powerpc/kernel/time.c:712:2: error: implicit declaration of function 'cputime_to_nsecs' [-Werror=implicit-function-declaration]
> >   return local_clock() - cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);
> >
> > All we need is to remove it.
> >
> > Reported-by: Abdul Haleem <abdha...@linux.vnet.ibm.com>
> > Cc: Michael Ellerman <m...@ellerman.id.au>
> > Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> > Cc: Paul Mackerras <pau...@samba.org>
> > Cc: Oliver O'Halloran <ooh...@gmail.com>
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>
> 
> LGTM.
> 
> Acked-by: Michael Ellerman <m...@ellerman.id.au>
> 
> 
> Are you going to send this to Linus, or should I?

I would prefer you send it to Linus if you don't mind.

The fate of a powerpc patch sent by me directly to Linus is
anything but predictable :o)

Thanks.


Re: [PowerPC] 4.10.0 fails to build on BE config

2017-02-21 Thread Frederic Weisbecker
On Tue, Feb 21, 2017 at 12:55:35PM +0530, abdul wrote:
> Hi,
> 
> Today's mainline build, breaks on Power6 and Power7 (all BE config) with
> these build errors
> 
> arch/powerpc/kernel/time.c: In function ‘running_clock’:
> arch/powerpc/kernel/time.c:712:2: error: implicit declaration of function
> ‘cputime_to_nsecs’ [-Werror=implicit-function-declaration]
> return local_clock() -
> cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);
> ^
> cc1: some warnings being treated as errors
> make[1]: *** [arch/powerpc/kernel/time.o] Error 1

Thanks for reporting this!
I just sent a fix.


[PATCH] powerpc: Remove leftover cputime_to_nsecs call causing build error

2017-02-21 Thread Frederic Weisbecker
This type conversion is a leftover that got ignored during the kcpustat
conversion to nanosecs, resulting in build breakage with config having
CONFIG_NO_HZ_FULL=y.

arch/powerpc/kernel/time.c: In function 'running_clock':
arch/powerpc/kernel/time.c:712:2: error: implicit declaration of function 'cputime_to_nsecs' [-Werror=implicit-function-declaration]
  return local_clock() - cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);

All we need is to remove it.

Reported-by: Abdul Haleem <abdha...@linux.vnet.ibm.com>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Oliver O'Halloran <ooh...@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Frederic Weisbecker <fweis...@gmail.com>
---
 arch/powerpc/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 14e4855..bc84a8d 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -709,7 +709,7 @@ unsigned long long running_clock(void)
 * time and on a host which doesn't do any virtualisation TB *should* equal
 * VTB so it makes no difference anyway.
 */
-   return local_clock() - cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);
+   return local_clock() - kcpustat_this_cpu->cpustat[CPUTIME_STEAL];
 }
 #endif
 
-- 
2.7.4



Re: [PATCH 0/4] cputime: some optimizations and cleanups

2016-11-09 Thread Frederic Weisbecker
2016-10-31 12:36 GMT+00:00 Stanislaw Gruszka <sgrus...@redhat.com>:
> Patches remove accounting of utimescaled/stimescaled on architectures
> that do not provide those values (scaled cputimes are equal to normal
> cputimes) what is every architecture except powerpc and s390.
>
> Patches do not change user visible behaviour.
>
> There is very little documentation how scaled cputimes should be
> interpreted and what useful information they provide (at least
> I could not find that). Moreover they are exported to user-space
> only via not commonly used taskstats and delayacct interfaces (i.e.
> interfaces that are not used by standard linux tools AFICT). However
> I assume they can not be removed by whole (i.e. replaced by normal
> utimes/stimes on powerpc and s390).
>
> Stanislaw Gruszka (4):
>   cputime/powerpc: remove cputime_last_delta global variable
>   cputime/powerpc: remove cputime_to_scaled()
>   cputime/powerpc/s390: make scaled cputime arch specific
>   cputime: simplify task_cputime()
>
>  arch/ia64/kernel/time.c   |4 +-
>  arch/powerpc/Kconfig  |1 +
>  arch/powerpc/include/asm/cputime.h|   14 
>  arch/powerpc/kernel/time.c|8 +-
>  arch/s390/Kconfig |1 +
>  arch/s390/kernel/vtime.c  |9 ++-
>  arch/x86/kernel/apm_32.c  |4 +-
>  include/asm-generic/cputime_jiffies.h |1 -
>  include/asm-generic/cputime_nsecs.h   |1 -
>  include/linux/kernel_stat.h   |4 +-
>  include/linux/sched.h |   29 +---
>  kernel/fork.c |2 +
>  kernel/sched/cputime.c|  124 +
>  kernel/time/posix-cpu-timers.c|4 +-
>  14 files changed, 73 insertions(+), 133 deletions(-)
>

Excellent patchset!
Thanks a lot Stanislaw!

Acked-by: Frederic Weisbecker <fweis...@gmail.com>


Re: [PATCH] powerpc: re-enable dynticks

2015-02-22 Thread Frederic Weisbecker
Hi Ben,

2015-02-16 5:06 GMT+01:00 Benjamin Herrenschmidt b...@kernel.crashing.org:
> On Mon, 2015-02-16 at 11:08 +1100, Michael Ellerman wrote:
> > On Fri, 2015-02-13 at 13:38 -0600, Paul Clarke wrote:
> > > implement arch_irq_work_has_interrupt() for powerpc
> > >
> > > Commit 9b01f5bf3 introduced a dependency on IRQ work self-IPIs for
> > > full dynamic ticks to be enabled, by expecting architectures to
> > > implement a suitable arch_irq_work_has_interrupt() routine.
> > >
> > > Several arches have implemented this routine, including x86 (3010279f)
> > > and arm (09f6edd4), but powerpc was omitted.
> > >
> > > This patch implements this routine for powerpc.
> > >
> > > .../...
>
> > It makes the message change, but is that correct? ie. do we actually implement
> > IRQ work self-IPIs?
>
> I think so... Fred, do you think what we do will work? We hijack our
> decrementer (local timer) by making it shoot almost immediately (1 tick
> away) and run the irq work at the beginning of __timer_interrupt().
>
> At that point we are on our irq stack and have done irq_enter but that's
> about it.

Yes that should work. After all self-IPI is an oxymoron. One would
expect an IPI to be triggered by an irq controller but if such
operation isn't supported with the current CPU being both source and
destination, anything triggering the desired callback in an interrupt
context in a reasonable amount of time ahead does the job here.

I thought well that's what powerpc was doing for irq work but I wasn't
sure I understood the code correctly. I should have pinged people
about that, sorry.

Thanks.
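
For reference, the powerpc raise path being described is roughly (a sketch
from memory of arch/powerpc/kernel/time.c):

	void arch_irq_work_raise(void)
	{
		preempt_disable();
		set_irq_work_pending_flag();	/* noted in per-CPU state */
		set_dec(1);			/* decrementer fires ~one tick away */
		preempt_enable();
	}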


> Cheers,
> Ben.
>
> > > diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
> > > new file mode 100644
> > > index 000..18365ec
> > > --- /dev/null
> > > +++ b/arch/powerpc/include/asm/irq_work.h
> > > @@ -0,0 +1,11 @@
> > > +#ifndef _ASM_IRQ_WORK_H
> > > +#define _ASM_IRQ_WORK_H
> > > +
> > > +#include <asm/processor.h>
> > > +
> > > +static inline bool arch_irq_work_has_interrupt(void)
> > > +{
> > > +   return 1;
>
> > Should be true;
>
> > > +}
>
> > cheers



Re: perf events ring buffer memory barrier on powerpc

2013-10-28 Thread Frederic Weisbecker
2013/10/25 Peter Zijlstra pet...@infradead.org:
> On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> I would argue for:
>
>   READ ->data_tail                 READ ->data_head
>   smp_rmb()  (A)                   smp_rmb()  (C)
>   WRITE $data                      READ $data
>   smp_wmb()  (B)                   smp_mb()   (D)
>   STORE ->data_head                WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.
>
> I don't think A needs to be a full barrier because we won't in fact
> write data until we see the store from userspace. So we simply don't
> issue the data WRITE until we observe it.
>
> OTOH, D needs to be a full barrier since it separates the data READ from
> the tail WRITE.
>
> For B a WMB is sufficient since it separates two WRITEs, and for C an
> RMB is sufficient since it separates two READs.

Hmm, I need to defer to you on that, I'm not yet comfortable with
picking specific barrier flavours when both write and read are
involved on the same side :)
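
For the archive, Peter's proposal maps onto the two sides roughly as follows
(a sketch with illustrative helper names, not the actual perf code):

	/* kernel side (producer) */
	tail = rb->user_page->data_tail;
	smp_rmb();			/* (A) pairs with (D) */
	write_rb_data(rb, head);	/* WRITE $data */
	smp_wmb();			/* (B) pairs with (C) */
	rb->user_page->data_head = head;

	/* user side (consumer) */
	head = rb->user_page->data_head;
	smp_rmb();			/* (C) orders head read vs data read */
	read_rb_data(rb, tail);		/* READ $data */
	smp_mb();			/* (D) full: data READ vs tail WRITE */
	rb->user_page->data_tail = tail;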


Re: perf events ring buffer memory barrier on powerpc

2013-10-23 Thread Frederic Weisbecker
On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
> Frederic,
> 
> In the perf ring buffer code we have this in perf_output_get_handle():
> 
>   if (!local_dec_and_test(&rb->nest))
>   goto out;
> 
>   /*
>* Publish the known good head. Rely on the full barrier implied
>* by atomic_dec_and_test() order the rb->head read and this
>* write.
>*/
>   rb->user_page->data_head = head;
> 
> The comment says atomic_dec_and_test() but the code is
> local_dec_and_test().
> 
> On powerpc, local_dec_and_test() doesn't have a memory barrier but
> atomic_dec_and_test() does.  Is the comment wrong, or is
> local_dec_and_test() supposed to imply a memory barrier too and we have
> it wrongly implemented in powerpc?
> 
> My guess is that local_dec_and_test() is correct but we need to add an
> explicit memory barrier like below:
> 
> (Kudos to Victor Kaplansky for finding this)
> 
> Mikey
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..95768c6 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,10 @@ again:
>   goto out;
>  
>   /*
> -  * Publish the known good head. Rely on the full barrier implied
> -  * by atomic_dec_and_test() order the rb->head read and this
> -  * write.
> +  * Publish the known good head. We need a memory barrier to order the
> +  * order the rb->head read and this write.
>*/
> + smp_mb ();
>   rb->user_page->data_head = head;
>  
>   /*


I'm adding Peter in Cc since he wrote that code.
I agree that local_dec_and_test() doesn't need to imply an smp barrier.
All it has to provide as a guarantee is the atomicity against local concurrent
operations (interrupts, preemption, ...).

Now I'm a bit confused about this barrier.

I think we want this ordering:

Kernel                                  User

READ rb->user_page->data_tail           READ rb->user_page->data_head
smp_mb()                                smp_mb()
WRITE rb data                           READ rb data
smp_mb()                                smp_mb()
rb->user_page->data_head                WRITE rb->user_page->data_tail

So yeah we want a barrier between the data published and the user data_head.
But this ordering concerns a wider layout than just rb->head and
rb->user_page->data_head.

And BTW I can see an smp_rmb() after we read rb->user_page->data_tail. This is
probably the first kernel barrier in my above example. (Not sure if rmb() alone
is enough though.)


Re: perf events ring buffer memory barrier on powerpc

2013-10-23 Thread Frederic Weisbecker
2013/10/23 Frederic Weisbecker fweis...@gmail.com:
 On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
 Frederic,

 In the perf ring buffer code we have this in perf_output_get_handle():

   if (!local_dec_and_test(rb-nest))
   goto out;

   /*
* Publish the known good head. Rely on the full barrier implied
* by atomic_dec_and_test() order the rb-head read and this
* write.
*/
   rb-user_page-data_head = head;

 The comment says atomic_dec_and_test() but the code is
 local_dec_and_test().

 On powerpc, local_dec_and_test() doesn't have a memory barrier but
 atomic_dec_and_test() does.  Is the comment wrong, or is
 local_dec_and_test() suppose to imply a memory barrier too and we have
 it wrongly implemented in powerpc?

 My guess is that local_dec_and_test() is correct but we to add an
 explicit memory barrier like below:

 (Kudos to Victor Kaplansky for finding this)

 Mikey

 diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
 index cd55144..95768c6 100644
 --- a/kernel/events/ring_buffer.c
 +++ b/kernel/events/ring_buffer.c
 @@ -87,10 +87,10 @@ again:
   goto out;

   /*
 -  * Publish the known good head. Rely on the full barrier implied
 -  * by atomic_dec_and_test() order the rb->head read and this
 -  * write.
 +  * Publish the known good head. We need a memory barrier to order
 +  * the rb->head read and this write.
    */
 + smp_mb();
   rb->user_page->data_head = head;

   /*


 I'm adding Peter in Cc since he wrote that code.
 I agree that local_dec_and_test() doesn't need to imply an smp barrier.
 All it has to provide as a guarantee is the atomicity against local concurrent
 operations (interrupts, preemption, ...).

 Now I'm a bit confused about this barrier.

 I think we want this ordering:

 Kernel                                 User

    READ rb->user_page->data_tail       READ rb->user_page->data_head
    smp_mb()                            smp_mb()
    WRITE rb data                       READ rb data
    smp_mb()                            smp_mb()
    rb->user_page->data_head            WRITE rb->user_page->data_tail
  ^^ I meant a write above for data_head.
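
For completeness, a sketch of the matching user-space consumer under the same
assumptions (read_rb_data() is a hypothetical helper; the perf mmap layout
details are omitted):

/*
 * Consumer (user) side -- sketch only, not real tooling code.
 */
static void rb_consume(struct ring_buffer *rb)
{
	unsigned long head = rb->user_page->data_head; /* producer position */
	unsigned long tail = rb->user_page->data_tail;

	smp_mb();                     /* order the head read vs. the data read */
	read_rb_data(rb, tail, head); /* hypothetical helper: consume records */
	smp_mb();                     /* order the data read vs. the tail write */
	rb->user_page->data_tail = head; /* tell the kernel we are done */
}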


Re: linux-next: build failure after merge of the akpm tree

2013-10-02 Thread Frederic Weisbecker
On Wed, Sep 25, 2013 at 02:43:28PM -0700, Andrew Morton wrote:
 On Wed, 25 Sep 2013 14:32:14 -0700 (PDT) Hugh Dickins hu...@google.com 
 wrote:
 
  On Wed, 25 Sep 2013, Andrew Morton wrote:
   On Wed, 25 Sep 2013 11:06:43 +1000 Stephen Rothwell 
   s...@canb.auug.org.au wrote:
Hi Andrew,

After merging the akpm tree, linux-next builds (powerpc allmodconfig)
fail like this:
   
   I can't get powerpc to build at all at present:
   
 CHK include/config/kernel.release
 CHK include/generated/uapi/linux/version.h
 CHK include/generated/utsrelease.h
 CC  arch/powerpc/kernel/asm-offsets.s
   In file included from include/linux/vtime.h:6,
from include/linux/hardirq.h:7,
from include/linux/memcontrol.h:24,
from include/linux/swap.h:8,
from include/linux/suspend.h:4,
from arch/powerpc/kernel/asm-offsets.c:24:
   arch/powerpc/include/generated/asm/vtime.h:1:31: error: 
   asm-generic/vtime.h: No such file or directory
  
  That caught me too: include/asm-generic/vtime.h is a patch-unfriendly
  0-length file in the git tree;
 
 hm, this?
 
 
 From: Andrew Morton a...@linux-foundation.org
 Subject: include/asm-generic/vtime.h: avoid zero-length file
 
 patch(1) can't handle zero-length files - it appears to simply not create
 the file, so my powerpc build fails.
 
 Put something in here to make life easier.
 
 Cc: Hugh Dickins hu...@google.com
 Signed-off-by: Andrew Morton a...@linux-foundation.org
 ---
 
  include/asm-generic/vtime.h |1 +
  1 file changed, 1 insertion(+)
 
 diff -puN /dev/null include/asm-generic/vtime.h
 --- /dev/null
 +++ a/include/asm-generic/vtime.h
 @@ -0,0 +1 @@
 +/* no content, but patch(1) dislikes empty files */
 _
 
 
 
  I wonder what use it's supposed to have.
 
 Frederic, can you please confirm that include/asm-generic/vtime.h is
 supposed to be empty?

Yep. I use asm/vtime.h to let archs override some CPP symbols. If they
don't override them, they simply get the generic vtime.h file, which is
empty and as such doesn't override anything.
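
As a sketch of the pattern (using __ARCH_HAS_VTIME_TASK_SWITCH as the example
symbol; the exact symbols differ per feature):

/* arch/xyz/include/asm/vtime.h -- an arch that overrides */
#define __ARCH_HAS_VTIME_TASK_SWITCH

/* include/asm-generic/vtime.h stays empty: nothing overridden */

/* generic code, simplified */
#include <asm/vtime.h>

#ifndef __ARCH_HAS_VTIME_TASK_SWITCH
void vtime_task_switch(struct task_struct *prev)
{
	/* generic implementation, used unless the arch overrode it */
}
#endif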

Maybe that's an ugly way to handle this kind of override scenario but I
couldn't find a better mechanism.

Actually, a Kconfig symbol would do the trick. It just seemed to me like
overkill at the time. But it may be better.

Thanks.

 
  (And I'm not very keen on the growing trend for symlinks in the git tree.)
 
 ooh, that explains why I lost my arch/microblaze/boot/dts/system.dts.


Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-25 Thread Frederic Weisbecker
On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
 In the current design of timer offload framework, the broadcast cpu should
 *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep 
 idle states.
 
 Since we prevent the CPUs entering deep idle states from programming the 
 lapic of the
 broadcast cpu for their respective next local events for reasons mentioned in
 PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up 
 during
 each of its timer interrupt programmed to its local events.
 
 With tickless idle, the broadcast CPU might not get a timer interrupt till 
 after
 many ticks which can result in missed wakeups on CPUs in deep idle states. By
 disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
 timer interrupt every period to check for broadcast.
 
 However the current setup of tickless idle does not let us make the choice
 of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless 
 idle,
 is a system wide setting. Hence resort to an arch specific call to check if a 
 cpu
 can go into tickless idle.

Hi Preeti,

I'm not exactly sure why the broadcast CPU can't enter dynticks idle mode.
I read in the previous patch that's because in dynticks idle mode the broadcast
CPU deactivates its lapic so it doesn't receive the IPI. But maybe I
misunderstood. Anyway that's not good for powersaving.

Also when an arch wants to prevent a CPU from entering dynticks idle mode, it
typically uses arch_needs_cpu(). Maybe that could fit for you as well?
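
For instance, a minimal sketch (hypothetical: tick_broadcast_cpu is an assumed
variable naming the broadcast CPU; the hook's exact signature may differ
across versions):

/*
 * Keep the tick alive on the broadcast CPU only. arch_needs_cpu()
 * is consulted by the nohz code before stopping the tick; returning
 * 1 prevents this CPU from entering dynticks idle mode.
 */
int arch_needs_cpu(int cpu)
{
	return cpu == tick_broadcast_cpu;
}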

Thanks.


Re: [RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries

2013-05-29 Thread Frederic Weisbecker
On Mon, May 13, 2013 at 04:03:13PM +0800, Li Zhong wrote:
 On Mon, 2013-05-13 at 15:51 +1000, Benjamin Herrenschmidt wrote:
  On Mon, 2013-05-13 at 13:21 +0800, Li Zhong wrote:
   These patches try to support context tracking for Power arch, beginning 
   with
   64-bit pSeries. The codes are ported from that of the x86_64, and in each
   patch, I listed the corresponding patch for x86.
  
  So that's yet another pile of bloat on all syscall entry/exit and
  exception entry/exit. What is it used for ? (I haven't followed on
  x86_64 side).
 
 To my understanding, it is used to enable RCU user extended quiescent
 state, so RCU on that cpu doesn't need scheduler ticks. And together
 with some other code(already in 3.10), we are able to remove the ticks
 in some cases (e.g. only 1 task running on the cpu, with some other
 limitations).
 
 Maybe Paul, or Frederic could give some better descriptions. 

That's pretty much it. It helps RCU and the cputime accounting infrastructure
to know when we enter/exit userspace. This way we can:

* Consider the CPU as idle from an RCU point of view when we run in userspace,
so RCU won't need the tick to stay alive.

* Account cputime (utime/stime/...) without using the tick. Ok, powerpc already
has CONFIG_VIRT_CPU_ACCOUNTING_NATIVE but making it work with full dynticks
would require some tweaks.

All in all, this is to support full dynticks.

Thanks.


Re: [RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries

2013-05-29 Thread Frederic Weisbecker
On Mon, May 13, 2013 at 06:59:23PM +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2013-05-13 at 16:03 +0800, Li Zhong wrote:
  
  To my understanding, it is used to enable RCU user extended quiescent
  state, so RCU on that cpu doesn't need scheduler ticks. And together
  with some other code(already in 3.10), we are able to remove the ticks
  in some cases (e.g. only 1 task running on the cpu, with some other
  limitations).
 
 Ok, sounds interesting. Once you fix the little cosmetic issue, I don't
 see any reason not to merge them as it's basically wiring up an existing
 feature (in that regard the patches are pretty straightforward) and I
 assume the overhead is only there when you enable it.

Yeah when CONFIG_RCU_USER_QS=n and CONFIG_VIRT_CPU_ACCOUNTING_GEN=n this
should be zero overhead. And those configs are only needed for full dynticks.

Thanks.

 
 Cheers,
 Ben.
 
 


Re: [RFC PATCH 1/5] powerpc: Syscall hooks for context tracking subsystem

2013-02-10 Thread Frederic Weisbecker
2013/2/1 Li Zhong zh...@linux.vnet.ibm.com:
 This is the syscall slow path hooks for context tracking subsystem,
 corresponding to
 [PATCH] x86: Syscall hooks for userspace RCU extended QS
   commit bf5a3c13b939813d28ce26c01425054c740d6731

 TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
 is no asm code using it. TIF_NOHZ is added to _TIF_SYSCALL_T_OR_A, so it is
 better for it to be in the same 16 bits with others in the group, so in the
 asm code, andi. with this group could work.

 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com

Looks good. Thanks.

Acked-by: Frederic Weisbecker fweis...@gmail.com

 ---
  arch/powerpc/include/asm/thread_info.h |7 +--
  arch/powerpc/kernel/ptrace.c   |5 +
  2 files changed, 10 insertions(+), 2 deletions(-)

 diff --git a/arch/powerpc/include/asm/thread_info.h 
 b/arch/powerpc/include/asm/thread_info.h
 index 406b7b9..414a261 100644
 --- a/arch/powerpc/include/asm/thread_info.h
 +++ b/arch/powerpc/include/asm/thread_info.h
 @@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void)
  #define TIF_PERFMON_CTXSW  6   /* perfmon needs ctxsw calls */
  #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
  #define TIF_SINGLESTEP 8   /* singlestepping active */
 -#define TIF_MEMDIE          9   /* is terminating due to OOM killer */
 +#define TIF_NOHZ            9   /* in adaptive nohz mode */
  #define TIF_SECCOMP         10  /* secure computing */
  #define TIF_RESTOREALL      11  /* Restore all regs (implies NOERROR) */
  #define TIF_NOERROR         12  /* Force successful syscall return */
 @@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void)
  #define TIF_SYSCALL_TRACEPOINT  15  /* syscall tracepoint instrumentation */
  #define TIF_EMULATE_STACK_STORE 16  /* Is an instruction emulation
                                         for stack store? */
 +#define TIF_MEMDIE          17  /* is terminating due to OOM killer */

  /* as above, but as bit values */
  #define _TIF_SYSCALL_TRACE       (1<<TIF_SYSCALL_TRACE)
 @@ -124,8 +125,10 @@ static inline struct thread_info *current_thread_info(void)
  #define _TIF_UPROBE              (1<<TIF_UPROBE)
  #define _TIF_SYSCALL_TRACEPOINT  (1<<TIF_SYSCALL_TRACEPOINT)
  #define _TIF_EMULATE_STACK_STORE (1<<TIF_EMULATE_STACK_STORE)
 +#define _TIF_NOHZ                (1<<TIF_NOHZ)
  #define _TIF_SYSCALL_T_OR_A      (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 -                                  _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
 +                                  _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
 +                                  _TIF_NOHZ)
 
  #define _TIF_USER_WORK_MASK      (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
                                    _TIF_NOTIFY_RESUME | _TIF_UPROBE)
 diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
 index c497000..62238dd 100644
 --- a/arch/powerpc/kernel/ptrace.c
 +++ b/arch/powerpc/kernel/ptrace.c
 @@ -32,6 +32,7 @@
  #include <trace/syscall.h>
  #include <linux/hw_breakpoint.h>
  #include <linux/perf_event.h>
 +#include <linux/context_tracking.h>
 
  #include <asm/uaccess.h>
  #include <asm/page.h>
 @@ -1745,6 +1746,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
  {
     long ret = 0;
 
 +   user_exit();
 +
     secure_computing_strict(regs->gpr[0]);
 
     if (test_thread_flag(TIF_SYSCALL_TRACE) &&
 @@ -1789,4 +1792,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
     step = test_thread_flag(TIF_SINGLESTEP);
     if (step || test_thread_flag(TIF_SYSCALL_TRACE))
         tracehook_report_syscall_exit(regs, step);
 +
 +   user_enter();
  }
 --
 1.7.9.5



Re: [RFC PATCH 2/5] powerpc: Exception hooks for context tracking subsystem

2013-02-10 Thread Frederic Weisbecker
2013/2/1 Li Zhong zh...@linux.vnet.ibm.com:
 This is the exception hooks for context tracking subsystem, including
 data access, program check, single step, instruction breakpoint, machine 
 check,
 alignment, fp unavailable, altivec assist, unknown exception, whose handlers
 might use RCU.

 This patch corresponds to
 [PATCH] x86: Exception hooks for userspace RCU extended QS
   commit 6ba3c97a38803883c2eee489505796cb0a727122

 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com

Looks good!

I guess we should move the exception_enter/exit definitions to the generic
code. They should be the same for all archs after all. Also we are
relying on user_mode(regs) but this may be buggy with some corner
cases. For example if an exception happens after a call to user_exit()
(on syscall exit) but before we actually resume in userspace, the
exception will exit in kernel mode from the context tracking POV.

So instead of relying on the regs, which are not in sync with the context
tracking state, we should use something like:

prev_state = exception_enter();
...
exception_exit(prev_state);

Also preempt_schedule_irq() is concerned by this problem as well. So I
should convert it to that scheme too. I'm going to prepare some
patches.
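
Roughly, such helpers could look like this (a sketch only; the state names
and per-CPU plumbing here are assumptions, not a final API):

static inline enum ctx_state exception_enter(void)
{
	enum ctx_state prev_state = this_cpu_read(context_tracking.state);

	user_exit();            /* exceptions always run in the kernel */
	return prev_state;
}

static inline void exception_exit(enum ctx_state prev_state)
{
	if (prev_state == IN_USER)
		user_enter();   /* restore the interrupted user state */
}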

Feel free to merge this patch in the powerpc tree, I'll do the
conversion along the way.

Thanks.


Re: [PATCH 4/8] cputime: Generic on-demand virtual cputime accounting

2013-02-08 Thread Frederic Weisbecker
2013/2/8 Stephen Rothwell s...@canb.auug.org.au:
 Hi Frederic,

 On Fri, 8 Feb 2013 14:07:49 +1100 Stephen Rothwell s...@canb.auug.org.au 
 wrote:

 This patch has the side effect of changing the default configurations:
 (This is PowerPC pseries_defconfig before/after this patch)

 @@ -119,8 +120,8 @@
  #
  # CPU/Task time and stats accounting
  #
 -# CONFIG_TICK_CPU_ACCOUNTING is not set
 -CONFIG_VIRT_CPU_ACCOUNTING=y
 +CONFIG_TICK_CPU_ACCOUNTING=y
 +# CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
  # CONFIG_BSD_PROCESS_ACCT is not set
  CONFIG_TASKSTATS=y
  CONFIG_TASK_DELAY_ACCT=y

 I don't know if that was deliberate, but it was surprising.  I noticed
 when this patch entered next-20130207.

 I suspect that this is caused by the changes to init/Kconfig:

 diff --git a/init/Kconfig b/init/Kconfig
 index be8b7f5..a05f843 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -326,6 +326,9 @@ source kernel/time/Kconfig

  menu CPU/Task time and stats accounting

 +config VIRT_CPU_ACCOUNTING
 +   bool
 +
  choice
 prompt Cputime accounting
 default TICK_CPU_ACCOUNTING if !PPC64

 The next line of context is:

 default VIRT_CPU_ACCOUNTING if PPC64

 Which may have needed changing as well?  Indeed, changing that to
 VIRT_CPU_ACCOUNTING_NATIVE restores the old defaults.

Indeed, thanks for your fix!


Re: [RFC PATCH 1/5] powerpc: Syscall hooks for context tracking subsystem

2013-02-07 Thread Frederic Weisbecker
2013/2/7 Li Zhong zh...@linux.vnet.ibm.com:
 On Thu, 2013-02-07 at 01:29 +0100, Frederic Weisbecker wrote:
 In x86-64, schedule_user() and do_notify_resume() can be called before
 syscall_trace_leave(). As a result we may be entering
 syscall_trace_leave() in user mode (from a context tracking POV). To
 fix this I added a call to user_exit() on the very beginning of that
 function.

 You can find the details in 2c5594df344cd1ff0cc9bf007dea3235582b3acf
 (rcu: Fix unrecovered RCU user mode in syscall_trace_leave()).

 Hi Frederic,

 Thank you very much for the reminding.


 Could this problem happen in ppc as well?

 I checked the code (64-bit) today, it seems to me that it won't happen.
 But fortunately, we are on the ppc mailing list, so please correct me if my
 understanding is wrong.

Ah indeed. Looking at the syscall_exit_work label in entry_64.S,
do_syscall_trace_leave is called before ret_from_except, which is where
we handle user preemption and do_notify_resume. So that looks fine.


 By the way, I enabled CONTEXT_TRACKING_FORCE and PROVE_RCU, so if it
 could happen, I think there should be some illegal RCU usage complaints
 reported.

Ok.

Thanks.


Re: [RFC PATCH 1/5] powerpc: Syscall hooks for context tracking subsystem

2013-02-06 Thread Frederic Weisbecker
2013/2/1 Li Zhong zh...@linux.vnet.ibm.com:
 This is the syscall slow path hooks for context tracking subsystem,
 corresponding to
 [PATCH] x86: Syscall hooks for userspace RCU extended QS
   commit bf5a3c13b939813d28ce26c01425054c740d6731

 TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
 is no asm code using it. TIF_NOHZ is added to _TIF_SYSCALL_T_OR_A, so it is
 better for it to be in the same 16 bits with others in the group, so in the
 asm code, andi. with this group could work.

 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/thread_info.h |7 +--
  arch/powerpc/kernel/ptrace.c   |5 +
  2 files changed, 10 insertions(+), 2 deletions(-)

 diff --git a/arch/powerpc/include/asm/thread_info.h 
 b/arch/powerpc/include/asm/thread_info.h
 index 406b7b9..414a261 100644
 --- a/arch/powerpc/include/asm/thread_info.h
 +++ b/arch/powerpc/include/asm/thread_info.h
 @@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void)
  #define TIF_PERFMON_CTXSW  6   /* perfmon needs ctxsw calls */
  #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
  #define TIF_SINGLESTEP 8   /* singlestepping active */
 -#define TIF_MEMDIE          9   /* is terminating due to OOM killer */
 +#define TIF_NOHZ            9   /* in adaptive nohz mode */
  #define TIF_SECCOMP         10  /* secure computing */
  #define TIF_RESTOREALL      11  /* Restore all regs (implies NOERROR) */
  #define TIF_NOERROR         12  /* Force successful syscall return */
 @@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void)
  #define TIF_SYSCALL_TRACEPOINT  15  /* syscall tracepoint instrumentation */
  #define TIF_EMULATE_STACK_STORE 16  /* Is an instruction emulation
                                         for stack store? */
 +#define TIF_MEMDIE          17  /* is terminating due to OOM killer */

  /* as above, but as bit values */
  #define _TIF_SYSCALL_TRACE       (1<<TIF_SYSCALL_TRACE)
 @@ -124,8 +125,10 @@ static inline struct thread_info *current_thread_info(void)
  #define _TIF_UPROBE              (1<<TIF_UPROBE)
  #define _TIF_SYSCALL_TRACEPOINT  (1<<TIF_SYSCALL_TRACEPOINT)
  #define _TIF_EMULATE_STACK_STORE (1<<TIF_EMULATE_STACK_STORE)
 +#define _TIF_NOHZ                (1<<TIF_NOHZ)
  #define _TIF_SYSCALL_T_OR_A      (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 -                                  _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
 +                                  _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
 +                                  _TIF_NOHZ)
 
  #define _TIF_USER_WORK_MASK      (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
                                    _TIF_NOTIFY_RESUME | _TIF_UPROBE)
 diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
 index c497000..62238dd 100644
 --- a/arch/powerpc/kernel/ptrace.c
 +++ b/arch/powerpc/kernel/ptrace.c
 @@ -32,6 +32,7 @@
  #include <trace/syscall.h>
  #include <linux/hw_breakpoint.h>
  #include <linux/perf_event.h>
 +#include <linux/context_tracking.h>
 
  #include <asm/uaccess.h>
  #include <asm/page.h>
 @@ -1745,6 +1746,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
  {
     long ret = 0;
 
 +   user_exit();
 +
     secure_computing_strict(regs->gpr[0]);
 
     if (test_thread_flag(TIF_SYSCALL_TRACE) &&
 @@ -1789,4 +1792,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
     step = test_thread_flag(TIF_SINGLESTEP);
     if (step || test_thread_flag(TIF_SYSCALL_TRACE))
         tracehook_report_syscall_exit(regs, step);
 +
 +   user_enter();

In x86-64, schedule_user() and do_notify_resume() can be called before
syscall_trace_leave(). As a result we may be entering
syscall_trace_leave() in user mode (from a context tracking POV). To
fix this I added a call to user_exit() on the very beginning of that
function.

You can find the details in 2c5594df344cd1ff0cc9bf007dea3235582b3acf
(rcu: Fix unrecovered RCU user mode in syscall_trace_leave()).

Could this problem happen in ppc as well?

Thanks.


Re: powerpc/perf: hw breakpoints return ENOSPC

2012-09-27 Thread Frederic Weisbecker
2012/9/25 Michael Neuling mi...@neuling.org:
 Michael Neuling mi...@neuling.org wrote:

 Frederic Weisbecker fweis...@gmail.com wrote:

  On Thu, Aug 16, 2012 at 02:23:54PM +1000, Michael Neuling wrote:
   Hi,
  
   I've been trying to get hardware breakpoints with perf to work on POWER7
   but I'm getting the following:
  
 % perf record -e mem:0x1000 true
  
   Error: sys_perf_event_open() syscall returned with 28 (No space left 
   on device).  /bin/dmesg may provide additional information.
  
   Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
  
 true: Terminated
  
   (FWIW adding -a and it works fine)
  
   Debugging it seems that __reserve_bp_slot() is returning ENOSPC because
   it thinks there are no free breakpoint slots on this CPU.
  
   I have 2 CPUs, so perf userspace is doing two perf_event_open syscalls
   to add a counter to each CPU [1].  The first syscall succeeds but the
   second is failing.
  
   On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
   despite there being no breakpoint on this CPU.  This is because the call
   to task_bp_pinned() checks all CPUs, rather than just the current CPU.
   POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
   return ENOSPC.
  
   The following patch fixes this by checking the associated CPU for each
   breakpoint in task_bp_pinned.  I'm not familiar with this code, so it's
   provided as a reference to the above issue.
  
   Mikey
  
   1. not sure why it doesn't just do one syscall and specify all CPUs, but
   that's another issue.  Using two syscalls should work.
 
  This patch seems to make sense. I'll try it and run some tests.
  Can I have your Signed-off-by ?

 Frederic,

 Did you ever get to testing or integrating this patch?

 Mikey

Sorry, I forgot this in my mailbox. I'm going to look at this in the
next few days.
Feel free to harass me by email or IRC if I don't give news on this soon.


Re: powerpc/perf: hw breakpoints return ENOSPC

2012-08-17 Thread Frederic Weisbecker
On Thu, Aug 16, 2012 at 02:23:54PM +1000, Michael Neuling wrote:
 Hi,
 
 I've been trying to get hardware breakpoints with perf to work on POWER7
 but I'm getting the following:
 
   % perf record -e mem:0x1000 true
 
 Error: sys_perf_event_open() syscall returned with 28 (No space left on 
 device).  /bin/dmesg may provide additional information.
 
 Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
 
   true: Terminated
 
 (FWIW adding -a and it works fine)
 
 Debugging it seems that __reserve_bp_slot() is returning ENOSPC because
 it thinks there are no free breakpoint slots on this CPU.
 
 I have 2 CPUs, so perf userspace is doing two perf_event_open syscalls
 to add a counter to each CPU [1].  The first syscall succeeds but the
 second is failing.
 
 On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
 despite there being no breakpoint on this CPU.  This is because the call
 to task_bp_pinned() checks all CPUs, rather than just the current CPU.
 POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
 return ENOSPC.
 
 The following patch fixes this by checking the associated CPU for each
 breakpoint in task_bp_pinned.  I'm not familiar with this code, so it's
 provided as a reference to the above issue.
 
 Mikey
 
 1. not sure why it doesn't just do one syscall and specify all CPUs, but
 that's another issue.  Using two syscalls should work.

This patch seems to make sense. I'll try it and run some tests.
Can I have your Signed-off-by ?

Thanks.

 
 diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
 index bb38c4d..e092daa 100644
 --- a/kernel/events/hw_breakpoint.c
 +++ b/kernel/events/hw_breakpoint.c
 @@ -111,14 +111,16 @@ static unsigned int max_task_bp_pinned(int cpu, enum 
 bp_type_idx type)
   * Count the number of breakpoints of the same type and same task.
   * The given event must be not on the list.
   */
 -static int task_bp_pinned(struct perf_event *bp, enum bp_type_idx type)
 +static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
  {
     struct task_struct *tsk = bp->hw.bp_target;
     struct perf_event *iter;
     int count = 0;
 
     list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
 -       if (iter->hw.bp_target == tsk && find_slot_idx(iter) == type)
 +       if (iter->hw.bp_target == tsk &&
 +           find_slot_idx(iter) == type &&
 +           cpu == iter->cpu)
             count += hw_breakpoint_weight(iter);
     }
  
 @@ -141,7 +143,7 @@ fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
     if (!tsk)
         slots->pinned += max_task_bp_pinned(cpu, type);
     else
 -       slots->pinned += task_bp_pinned(bp, type);
 +       slots->pinned += task_bp_pinned(cpu, bp, type);
     slots->flexible = per_cpu(nr_bp_flexible[type], cpu);
 
     return;
 @@ -154,7 +156,7 @@ fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
     if (!tsk)
         nr += max_task_bp_pinned(cpu, type);
     else
 -       nr += task_bp_pinned(bp, type);
 +       nr += task_bp_pinned(cpu, bp, type);
 
     if (nr > slots->pinned)
         slots->pinned = nr;
 @@ -188,7 +190,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu, bool enable,
     int old_idx = 0;
     int idx = 0;
 
 -   old_count = task_bp_pinned(bp, type);
 +   old_count = task_bp_pinned(cpu, bp, type);
     old_idx = old_count - 1;
     idx = old_idx + weight;
 


Re: [PATCH 2/6] hw_breakpoints: Migrate breakpoint conditional build under new config

2011-07-05 Thread Frederic Weisbecker
On Mon, Jul 04, 2011 at 11:14:16PM +0530, K.Prasad wrote:
 On Mon, Jul 04, 2011 at 03:29:14PM +0200, Frederic Weisbecker wrote:
  On Mon, Jul 04, 2011 at 06:57:46PM +0530, K.Prasad wrote:
   On Tue, May 24, 2011 at 11:52:23PM +0200, Frederic Weisbecker wrote:
Migrate conditional hw_breakpoint code compilation under
the new config to prepare for letting the user choose whether
or not to build this feature

   
   Making the hardware breakpoint patches modular has always been a goal.
   I've looked at the PowerPC parts of the code and they look harmless.
   
   Acked-by: K.Prasad pra...@linux.vnet.ibm.com
  
  Great!
  
  I'll push that soon, thanks guys for your acks!
 
 Meanwhile, I was testing hardware breakpoints through perf and found
 that monitoring a given address fails when using 'perf record' (returns
 -ENOSPC) while 'perf stat' and a watchpoint through gdb work fine (see
 logs below).
 
 Has this behaviour been reported for other perf counters?

Nope, I haven't seen anything like that. What I reported privately to you a
while ago was actually due to a mistake of mine. Otherwise I haven't seen
other problems.

-ENOSPC is likely related to the breakpoint slot reservation, in 
kernel/events/hw_breakpoint.c


Re: [PATCH 2/6] hw_breakpoints: Migrate breakpoint conditional build under new config

2011-07-04 Thread Frederic Weisbecker
On Mon, Jul 04, 2011 at 06:57:46PM +0530, K.Prasad wrote:
 On Tue, May 24, 2011 at 11:52:23PM +0200, Frederic Weisbecker wrote:
  Migrate conditional hw_breakpoint code compilation under
  the new config to prepare for letting the user choose whether
  or not to build this feature
  
  Signed-off-by: Frederic Weisbecker fweis...@gmail.com
  Acked-by: Will Deacon will.dea...@arm.com
  Cc: Ingo Molnar mi...@elte.hu
  Cc: Peter Zijlstra a.p.zijls...@chello.nl
  Cc: Prasad pra...@linux.vnet.ibm.com
  Cc: Paul Mundt let...@linux-sh.org
  ---
   arch/arm/include/asm/hw_breakpoint.h |4 ++--
   arch/arm/include/asm/processor.h |2 +-
   arch/arm/kernel/Makefile |2 +-
   arch/arm/kernel/entry-header.S   |2 +-
   arch/arm/kernel/ptrace.c |4 ++--
   arch/powerpc/include/asm/cputable.h  |4 ++--
   arch/powerpc/include/asm/hw_breakpoint.h |6 +++---
   arch/powerpc/include/asm/processor.h |4 ++--
   arch/powerpc/kernel/Makefile |2 +-
   arch/powerpc/kernel/process.c|   18 +-
   arch/powerpc/kernel/ptrace.c |   13 +++--
   arch/powerpc/lib/Makefile|2 +-
   arch/sh/kernel/Makefile  |2 +-
   arch/sh/kernel/cpu/sh4a/Makefile |2 +-
   include/linux/hw_breakpoint.h|6 +++---
   include/linux/perf_event.h   |4 ++--
   include/linux/ptrace.h   |6 +++---
   include/linux/sched.h|2 +-
   kernel/events/Makefile   |2 +-
   kernel/events/core.c |4 ++--
   kernel/ptrace.c  |4 ++--
   samples/Kconfig  |2 +-
   22 files changed, 49 insertions(+), 48 deletions(-)
  
 
 Making the hardware breakpoint patches modular has always been a goal.
 I've looked at the PowerPC parts of the code and they look harmless.
 
 Acked-by: K.Prasad pra...@linux.vnet.ibm.com

Great!

I'll push that soon, thanks guys for your acks!


Re: [PATCH 4/6] hw_breakpoints: Breakpoints arch ability don't need perf events

2011-07-04 Thread Frederic Weisbecker
On Mon, Jul 04, 2011 at 07:02:23PM +0530, K.Prasad wrote:
 On Tue, May 24, 2011 at 11:52:25PM +0200, Frederic Weisbecker wrote:
  The breakpoint support ability in an arch is not related
  to whether perf events is built or not. HAVE_HW_BREAKPOINT
  only shows an ability, so this dependency makes no sense
  anymore. Archs that select HAVE_HW_BREAKPOINT already
  ensure that perf events is built.
  
  Remove that dependency.
  
  Signed-off-by: Frederic Weisbecker fweis...@gmail.com
  Cc: Ingo Molnar mi...@elte.hu
  Cc: Peter Zijlstra a.p.zijls...@chello.nl
  Cc: Will Deacon will.dea...@arm.com
  Cc: Prasad pra...@linux.vnet.ibm.com
  Cc: Paul Mundt let...@linux-sh.org
  ---
   arch/Kconfig |1 -
   1 files changed, 0 insertions(+), 1 deletions(-)
  
  diff --git a/arch/Kconfig b/arch/Kconfig
  index f78c2be..ce4be89 100644
  --- a/arch/Kconfig
  +++ b/arch/Kconfig
  @@ -149,7 +149,6 @@ config HAVE_DEFAULT_NO_SPIN_MUTEXES
  
   config HAVE_HW_BREAKPOINT
  bool
  -   depends on PERF_EVENTS
  
   config HAVE_MIXED_BREAKPOINTS_REGS
  bool
  -- 
 
 Just a thought you might want to consider...
 
 The need to keep the ability (HAVE_HW_BREAKPOINT) and the user-choice to
 enable hardware breakpoints (through HW_BREAKPOINT) in separate config
 options isn't very clear to me (and is a bit confusing with very similar
 names).
 
 Why not make HAVE_HW_BREAKPOINT selectable by the user (which in turn
 would turn on PERF_EVENTS) for a given architecture?

But then how does Kconfig know whether the arch allows that? You need to know
if the arch has the ability to support breakpoints.

This is a common pattern in Linux Kconfig. Things are often separated between
the ability (some constant value provided by the arch) and the user choice
that depends on that ability.
The advantage of doing this is that you can centralize the generic dependencies,
help menu, etc... into a single place.
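
A sketch of the pattern, with hypothetical FOO names:

config HAVE_FOO
	bool                    # ability: silently selected by capable archs

config FOO
	bool "Foo support"      # user choice: visible prompt
	depends on HAVE_FOO
	default y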
 
 Thanks,
 K.Prasad
 


[PATCH v2] hw_breakpoint: Let the user choose not to build it (and perf too)

2011-05-24 Thread Frederic Weisbecker
Mostly just a rebase against the latest upstream updates, with acks
from Will Deacon added in this second version.

Please tell me if you are ok with this set.

Thanks.

---

Frederic Weisbecker (6):
  hw_breakpoints: Split hardware breakpoints config
  hw_breakpoints: Migrate breakpoint conditional build under new config
  x86: Allow the user not to build hw_breakpoints
  hw_breakpoints: Breakpoints arch ability don't need perf events
  hw_breakpoints: Only force perf events if breakpoints are selected
  hw_breakpoints: Drop remaining misplaced dependency on perf


 arch/Kconfig |1 -
 arch/arm/Kconfig |2 +-
 arch/arm/include/asm/hw_breakpoint.h |4 +-
 arch/arm/include/asm/processor.h |2 +-
 arch/arm/kernel/Makefile |2 +-
 arch/arm/kernel/entry-header.S   |2 +-
 arch/arm/kernel/ptrace.c |4 +-
 arch/powerpc/Kconfig |2 +-
 arch/powerpc/include/asm/cputable.h  |4 +-
 arch/powerpc/include/asm/hw_breakpoint.h |6 ++--
 arch/powerpc/include/asm/processor.h |4 +-
 arch/powerpc/kernel/Makefile |2 +-
 arch/powerpc/kernel/process.c|   18 
 arch/powerpc/kernel/ptrace.c |   13 ++-
 arch/powerpc/lib/Makefile|2 +-
 arch/sh/Kconfig  |2 +-
 arch/sh/kernel/Makefile  |2 +-
 arch/sh/kernel/cpu/sh4a/Makefile |2 +-
 arch/x86/Kconfig |3 +-
 arch/x86/include/asm/debugreg.h  |   33 -
 arch/x86/kernel/Makefile |3 +-
 arch/x86/kernel/process.c|1 +
 arch/x86/kernel/ptrace.c |   17 +++
 include/linux/hw_breakpoint.h|6 ++--
 include/linux/perf_event.h   |4 +-
 include/linux/ptrace.h   |6 ++--
 include/linux/sched.h|2 +-
 init/Kconfig |   11 ++
 kernel/events/Makefile   |2 +-
 kernel/events/core.c |4 +-
 kernel/ptrace.c  |4 +-
 samples/Kconfig  |2 +-
 32 files changed, 115 insertions(+), 57 deletions(-)


[PATCH 1/6] hw_breakpoints: Split hardware breakpoints config

2011-05-24 Thread Frederic Weisbecker
The hardware breakpoint config is currently only made of an ability. An
arch that supports this feature selects HAVE_HW_BREAKPOINT. If so,
the feature is definitely built-in; the user can't decide to turn
it off. As hw_breakpoints depend on perf, it also makes perf
a mandatory feature. The whole is quite a piece of code and
may not be desired on some embedded systems.

In order to prepare to make this selectable by the user, split the
config into the more traditional couple (ability, user choice) by
providing a new HW_BREAKPOINT config. It defaults to on and depends
on CONFIG_EXPERT because breakpoint ptrace requests are part of the
usual user ABI. The user must know what he's doing before turning
that off.

For now, only the archs that already implemented a conditional
HAVE_HW_BREAKPOINT can turn off HW_BREAKPOINT. x86 and sh have it
always selected because they need more background work to support
this new modularity.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Will Deacon will.dea...@arm.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
---
 arch/sh/Kconfig  |1 +
 arch/x86/Kconfig |1 +
 init/Kconfig |   10 ++
 3 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..0d4d124 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -44,6 +44,7 @@ config SUPERH32
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_ARCH_KGDB
select HAVE_HW_BREAKPOINT
+   select HW_BREAKPOINT
select HAVE_MIXED_BREAKPOINTS_REGS
select PERF_EVENTS
select ARCH_HIBERNATION_POSSIBLE if MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..8b49bff 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KERNEL_LZO
select HAVE_HW_BREAKPOINT
+   select HW_BREAKPOINT
select HAVE_MIXED_BREAKPOINTS_REGS
select PERF_EVENTS
select HAVE_PERF_EVENTS_NMI
diff --git a/init/Kconfig b/init/Kconfig
index d886b1e..76ae53e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -924,6 +924,16 @@ menuconfig EXPERT
   environments which can tolerate a non-standard kernel.
   Only use this if you really know what you are doing.
 
+config HW_BREAKPOINT
+   bool "Hardware breakpoints" if EXPERT
+   depends on HAVE_HW_BREAKPOINT
+   default y
+   help
+ Hardware breakpoints are a feature implemented by most CPUs
+ to trigger an event when an instruction or data fetch
+ matches a given pattern. This is typically used by ptrace
+ and perf events.
+
 config UID16
	bool "Enable 16-bit UID system calls" if EXPERT
	depends on ARM || BLACKFIN || CRIS || FRV || H8300 || X86_32 || M68K || (S390 && !64BIT) || SUPERH || SPARC32 || (SPARC64 && COMPAT) || UML || (X86_64 && IA32_EMULATION)
-- 
1.7.3.2



[PATCH 4/6] hw_breakpoints: Breakpoints arch ability don't need perf events

2011-05-24 Thread Frederic Weisbecker
The breakpoint support ability in an arch is not related
to whether perf events is built or not. HAVE_HW_BREAKPOINT
only shows an ability, so this dependency makes no sense
anymore. Archs that select HAVE_HW_BREAKPOINT already
ensure that perf events is built.

Remove that dependency.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Will Deacon will.dea...@arm.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
---
 arch/Kconfig |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index f78c2be..ce4be89 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -149,7 +149,6 @@ config HAVE_DEFAULT_NO_SPIN_MUTEXES
 
 config HAVE_HW_BREAKPOINT
bool
-   depends on PERF_EVENTS
 
 config HAVE_MIXED_BREAKPOINTS_REGS
bool
-- 
1.7.3.2



[PATCH 2/6] hw_breakpoints: Migrate breakpoint conditional build under new config

2011-05-24 Thread Frederic Weisbecker
Migrate conditional hw_breakpoint code compilation under
the new config to prepare for letting the user choose whether
or not to build this feature

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Will Deacon will.dea...@arm.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
---
 arch/arm/include/asm/hw_breakpoint.h |4 ++--
 arch/arm/include/asm/processor.h |2 +-
 arch/arm/kernel/Makefile |2 +-
 arch/arm/kernel/entry-header.S   |2 +-
 arch/arm/kernel/ptrace.c |4 ++--
 arch/powerpc/include/asm/cputable.h  |4 ++--
 arch/powerpc/include/asm/hw_breakpoint.h |6 +++---
 arch/powerpc/include/asm/processor.h |4 ++--
 arch/powerpc/kernel/Makefile |2 +-
 arch/powerpc/kernel/process.c|   18 +-
 arch/powerpc/kernel/ptrace.c |   13 +++--
 arch/powerpc/lib/Makefile|2 +-
 arch/sh/kernel/Makefile  |2 +-
 arch/sh/kernel/cpu/sh4a/Makefile |2 +-
 include/linux/hw_breakpoint.h|6 +++---
 include/linux/perf_event.h   |4 ++--
 include/linux/ptrace.h   |6 +++---
 include/linux/sched.h|2 +-
 kernel/events/Makefile   |2 +-
 kernel/events/core.c |4 ++--
 kernel/ptrace.c  |4 ++--
 samples/Kconfig  |2 +-
 22 files changed, 49 insertions(+), 48 deletions(-)

diff --git a/arch/arm/include/asm/hw_breakpoint.h 
b/arch/arm/include/asm/hw_breakpoint.h
index f389b27..fc6ba18 100644
--- a/arch/arm/include/asm/hw_breakpoint.h
+++ b/arch/arm/include/asm/hw_breakpoint.h
@@ -5,7 +5,7 @@
 
 struct task_struct;
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
 
 struct arch_hw_breakpoint_ctrl {
u32 __reserved  : 9,
@@ -128,6 +128,6 @@ int hw_breakpoint_slots(int type);
 #else
 static inline void clear_ptrace_hw_breakpoint(struct task_struct *tsk) {}
 
-#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+#endif /* CONFIG_HW_BREAKPOINT */
 #endif /* __KERNEL__ */
 #endif /* _ARM_HW_BREAKPOINT_H */
diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index b2d9df5..b86d135 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -30,7 +30,7 @@
 #endif
 
 struct debug_info {
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
struct perf_event   *hbp[ARM_MAX_HBP_SLOTS];
 #endif
 };
diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index 8d95446..e6c4b04 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -47,7 +47,7 @@ obj-$(CONFIG_HAVE_TCM)+= tcm.o
 obj-$(CONFIG_CRASH_DUMP)   += crash_dump.o
 obj-$(CONFIG_SWP_EMULATE)  += swp_emulate.o
 CFLAGS_swp_emulate.o   := -Wa,-march=armv7-a
-obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
+obj-$(CONFIG_HW_BREAKPOINT)    += hw_breakpoint.o
 
 obj-$(CONFIG_CRUNCH)   += crunch.o crunch-bits.o
 AFLAGS_crunch-bits.o   := -Wa,-mcpu=ep9312
diff --git a/arch/arm/kernel/entry-header.S b/arch/arm/kernel/entry-header.S
index 051166c..fbc7cc9 100644
--- a/arch/arm/kernel/entry-header.S
+++ b/arch/arm/kernel/entry-header.S
@@ -171,7 +171,7 @@
@ we can access the debug registers safely.
@
.macro  debug_entry, fsr
-#if defined(CONFIG_HAVE_HW_BREAKPOINT) && defined(CONFIG_PREEMPT)
+#if defined(CONFIG_HW_BREAKPOINT) && defined(CONFIG_PREEMPT)
ldr r4, =0x40f  @ mask out fsr.fs
and r5, r4, \fsr
cmp r5, #2  @ debug exception
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 8182f45..07be604 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -468,7 +468,7 @@ static int ptrace_setvfpregs(struct task_struct *tsk, void 
__user *data)
 }
 #endif
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
 /*
  * Convert a virtual register number into an index for a thread_info
  * breakpoint array. Breakpoints are identified using positive numbers
@@ -765,7 +765,7 @@ long arch_ptrace(struct task_struct *child, long request,
break;
 #endif
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
case PTRACE_GETHBPREGS:
if (ptrace_get_breakpoints(child)  0)
return -ESRCH;
diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 1833d1a..36b5568 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -538,9 +538,9 @@ static inline int cpu_has_feature(unsigned long feature)
 feature);
 }
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT

[PATCH 5/6] hw_breakpoints: Only force perf events if breakpoints are selected

2011-05-24 Thread Frederic Weisbecker
Previously, archs were forced to always build perf events if they
supported hw_breakpoints.

Now that the user can choose not to build hw_breakpoints, only
force perf events if hw_breakpoints are selected.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Will Deacon will.dea...@arm.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
---
 arch/sh/Kconfig  |1 -
 arch/x86/Kconfig |1 -
 init/Kconfig |1 +
 3 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 0d4d124..d59e6c2 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -46,7 +46,6 @@ config SUPERH32
select HAVE_HW_BREAKPOINT
select HW_BREAKPOINT
select HAVE_MIXED_BREAKPOINTS_REGS
-   select PERF_EVENTS
select ARCH_HIBERNATION_POSSIBLE if MMU
select SPARSE_IRQ
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fb28dd9..5317f42 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -55,7 +55,6 @@ config X86
select HAVE_KERNEL_LZO
select HAVE_HW_BREAKPOINT
select HAVE_MIXED_BREAKPOINTS_REGS
-   select PERF_EVENTS
select HAVE_PERF_EVENTS_NMI
select ANON_INODES
select HAVE_ARCH_KMEMCHECK
diff --git a/init/Kconfig b/init/Kconfig
index 76ae53e..9ae3555 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -927,6 +927,7 @@ menuconfig EXPERT
 config HW_BREAKPOINT
bool Hardware breakpoints if EXPERT
depends on HAVE_HW_BREAKPOINT
+   select PERF_EVENTS
default y
help
  Hardware breakpoints are a feature implemented by most CPUs
-- 
1.7.3.2



[PATCH 3/6] x86: Allow the user not to build hw_breakpoints

2011-05-24 Thread Frederic Weisbecker
So that hw_breakpoints and perf can be left out of the build on
specific embedded systems.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Jason Wessel jason.wes...@windriver.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Thomas Gleixner t...@linutronix.de
---
 arch/x86/Kconfig|3 +--
 arch/x86/include/asm/debugreg.h |   33 +++--
 arch/x86/kernel/Makefile|3 ++-
 arch/x86/kernel/process.c   |1 +
 arch/x86/kernel/ptrace.c|   17 +
 5 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8b49bff..fb28dd9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -41,7 +41,7 @@ config X86
select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_KVM
-   select HAVE_ARCH_KGDB
+   select HAVE_ARCH_KGDB if HW_BREAKPOINT
select HAVE_ARCH_TRACEHOOK
select HAVE_GENERIC_DMA_COHERENT if X86_32
select HAVE_EFFICIENT_UNALIGNED_ACCESS
@@ -54,7 +54,6 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KERNEL_LZO
select HAVE_HW_BREAKPOINT
-   select HW_BREAKPOINT
select HAVE_MIXED_BREAKPOINTS_REGS
select PERF_EVENTS
select HAVE_PERF_EVENTS_NMI
diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index 078ad0c..c40d6d4 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -78,8 +78,6 @@
  */
 #ifdef __KERNEL__
 
-DECLARE_PER_CPU(unsigned long, cpu_dr7);
-
 static inline void hw_breakpoint_disable(void)
 {
/* Zero the control register for HW Breakpoint */
@@ -92,6 +90,10 @@ static inline void hw_breakpoint_disable(void)
set_debugreg(0UL, 3);
 }
 
+#ifdef CONFIG_HW_BREAKPOINT
+
+DECLARE_PER_CPU(unsigned long, cpu_dr7);
+
 static inline int hw_breakpoint_active(void)
 {
	return __this_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;
@@ -100,6 +102,33 @@ static inline int hw_breakpoint_active(void)
 extern void aout_dump_debugregs(struct user *dump);
 
 extern void hw_breakpoint_restore(void);
+#else
+static inline int hw_breakpoint_active(void)
+{
+   return 0;
+}
+
+static inline void hw_breakpoint_restore(void)
+{
+   set_debugreg(0UL, 0);
+   set_debugreg(0UL, 1);
+   set_debugreg(0UL, 2);
+   set_debugreg(0UL, 3);
+   set_debugreg(current->thread.debugreg6, 6);
+   set_debugreg(0UL, 7);
+}
+
+static inline void aout_dump_debugregs(struct user *dump)
+{
+   int i;
+
+   for (i = 0; i < 6; i++)
+       dump->u_debugreg[i] = 0;
+
+   dump->u_debugreg[6] = current->thread.debugreg6;
+   dump->u_debugreg[7] = 0;
+}
+#endif /* CONFIG_HW_BREAKPOINT */
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7338ef2..270e439 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -42,7 +42,7 @@ obj-$(CONFIG_X86_64)  += sys_x86_64.o x8664_ksyms_64.o
 obj-$(CONFIG_X86_64)   += syscall_64.o vsyscall_64.o
 obj-y  += bootflag.o e820.o
 obj-y  += pci-dma.o quirks.o topology.o kdebugfs.o
-obj-y  += alternative.o i8253.o pci-nommu.o hw_breakpoint.o
+obj-y  += alternative.o i8253.o pci-nommu.o
 obj-y  += tsc.o io_delay.o rtc.o
 obj-y  += pci-iommu_table.o
 obj-y  += resource.o
@@ -51,6 +51,7 @@ obj-y += trampoline.o 
trampoline_$(BITS).o
 obj-y  += process.o
 obj-y  += i387.o xsave.o
 obj-y  += ptrace.o
+obj-$(CONFIG_HW_BREAKPOINT)    += hw_breakpoint.o
 obj-$(CONFIG_X86_32)   += tls.o
 obj-$(CONFIG_IA32_EMULATION)   += tls.o
 obj-y  += step.o
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index d46cbe4..d1adbd1 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -22,6 +22,7 @@
 #include asm/uaccess.h
 #include asm/i387.h
 #include asm/debugreg.h
+#include asm/kdebug.h
 
 struct kmem_cache *task_xstate_cachep;
 EXPORT_SYMBOL_GPL(task_xstate_cachep);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index f65e5b5..091f110c 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -528,6 +528,7 @@ static int genregs_set(struct task_struct *target,
return ret;
 }
 
+#ifdef CONFIG_HW_BREAKPOINT
 static void ptrace_triggered(struct perf_event *bp, int nmi,
 struct perf_sample_data *data,
 struct pt_regs *regs)
@@ -777,6 +778,22 @@ ret_path:
return rc;
 }
 
+#else /* !CONFIG_HW_BREAKPOINT */
+
+static inline unsigned long
+ptrace_get_debugreg(struct task_struct *tsk, int n)
+{
+   return -ENOSYS;
+}
+
+static inline
+int ptrace_set_debugreg(struct

[PATCH 6/6] hw_breakpoints: Drop remaining misplaced dependency on perf

2011-05-24 Thread Frederic Weisbecker
Powerpc and Arm select breakpoint support ability only if
Perf is built. This is not necessary anymore now that we
enable perf once breakpoints support is selected.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Will Deacon will.dea...@arm.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
---
 arch/arm/Kconfig |2 +-
 arch/powerpc/Kconfig |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..1d3a0b4 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -24,7 +24,7 @@ config ARM
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_REGS_AND_STACK_ACCESS_API
-   select HAVE_HW_BREAKPOINT if (PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7))
+   select HAVE_HW_BREAKPOINT if (CPU_V6 || CPU_V6K || CPU_V7)
select HAVE_C_RECORDMCOUNT
select HAVE_GENERIC_HARDIRQS
select HAVE_SPARSE_IRQ
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..e944eb8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -134,7 +134,7 @@ config PPC
select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_REGS_AND_STACK_ACCESS_API
-   select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
+   select HAVE_HW_BREAKPOINT if PPC_BOOK3S_64
select HAVE_GENERIC_HARDIRQS
select HAVE_SPARSE_IRQ
select IRQ_PER_CPU
-- 
1.7.3.2



Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread Frederic Weisbecker
On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
 To restrict execution to system calls.
 
 Two observations:
 
 1) We already have a specific ABI for this: you can set filters for events 
 via 
an event fd.
 
Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...
 
kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
mostly trivial code.
 
 2) Why should this concept not be made available wider, to allow the 
restriction of not just system calls but other security relevant 
 components 
of the kernel as well?
 
This too, if you approach the problem via the events code, will be a 
 natural 
end result, while if you approach it from the seccomp prctl angle it will 
 be
a limited hack only.
 
 Note, the end result will be the same - just using a different ABI.
 
 So i really think the ABI itself should be closer related to the event code. 
 What this seccomp code does is that it uses specific syscall events to 
 restrict execution of certain event generating codepaths, such as system 
 calls.
 
 Thanks,
 
   Ingo

What's positive with that approach is that the code is all there already.
Create a perf event for a given trace event, attach a filter to it.

What needs to be added is an override of the effect of the filter. By default
it's dropping the event, but there may be different flavours, including sending
a signal. All in all, extending the current code to allow that looks trivial.
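
For reference, a sketch of the existing user-facing path referred to above
(assuming the tracepoint id has already been read from
/sys/kernel/debug/tracing/events/.../id; error handling trimmed):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a tracepoint event and attach an ASCII filter to it */
static int open_filtered_event(__u64 tracepoint_id, pid_t pid, const char *filter)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.size = sizeof(attr);
	attr.config = tracepoint_id;   /* id of the trace event to open */

	fd = syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
	if (fd < 0)
		return -1;

	/* Events not matching the filter expression are dropped */
	if (ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}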

The negative points are that:

* trace events are supposed to stay passive and not act on the system, except
doing some endpoint things like writing to a buffer. We can't call do_exit()
from a tracepoint for example, preemption is disabled there.

* Also, is it actually relevant to extend that seccomp filtering to other
events than syscalls? Exposing kernel events to filtering actually sounds to
me like it brings a new potential security issue. But with fine restrictions
this can probably be dealt with, especially if by default only syscalls can
be filtered.

* I think Peter did not want to give such an active role to perf in the system.


[tip:perf/urgent] hw_breakpoints, powerpc: Fix CONFIG_HAVE_HW_BREAKPOINT off-case in ptrace_set_debugreg()

2011-05-06 Thread tip-bot for Frederic Weisbecker
Commit-ID:  925f83c085e1bb08435556c5b4844a60de002e31
Gitweb: http://git.kernel.org/tip/925f83c085e1bb08435556c5b4844a60de002e31
Author: Frederic Weisbecker fweis...@gmail.com
AuthorDate: Fri, 6 May 2011 01:53:18 +0200
Committer:  Ingo Molnar mi...@elte.hu
CommitDate: Fri, 6 May 2011 11:24:46 +0200

hw_breakpoints, powerpc: Fix CONFIG_HAVE_HW_BREAKPOINT off-case in 
ptrace_set_debugreg()

We make use of ptrace_get_breakpoints() / ptrace_put_breakpoints() to
protect ptrace_set_debugreg() even if CONFIG_HAVE_HW_BREAKPOINT is off.
However in this case, these APIs are not implemented.

To fix this, push the protection down inside the relevant ifdef.
Best would be to export the code inside
CONFIG_HAVE_HW_BREAKPOINT into a standalone function to clean up
the ifdefery there and call the breakpoint ref API inside. But
as it is more invasive, this should rather be made in an -rc1.

Fixes this build error:

  arch/powerpc/kernel/ptrace.c:1594: error: implicit declaration of function 
'ptrace_get_breakpoints' make[2]: ***

Reported-by: Ingo Molnar mi...@elte.hu
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: LPPC linuxppc-dev@lists.ozlabs.org
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: v2.6.33.. sta...@kernel.org
Link: 
http://lkml.kernel.org/r/1304639598-4707-1-git-send-email-fweis...@gmail.com
Signed-off-by: Ingo Molnar mi...@elte.hu
---
 arch/powerpc/kernel/ptrace.c |   15 +++
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 4edeeb3..a6ae1cf 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -933,12 +933,16 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 	if (data && !(data & DABR_TRANSLATION))
 		return -EIO;
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
+	if (ptrace_get_breakpoints(task) < 0)
+		return -ESRCH;
+
 	bp = thread->ptrace_bps[0];
 	if ((!data) || !(data & (DABR_DATA_WRITE | DABR_DATA_READ))) {
 		if (bp) {
 			unregister_hw_breakpoint(bp);
 			thread->ptrace_bps[0] = NULL;
 		}
+		ptrace_put_breakpoints(task);
 		return 0;
 	}
 	if (bp) {
@@ -948,9 +952,12 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 			(DABR_DATA_WRITE | DABR_DATA_READ),
 			&attr.bp_type);
 	ret = modify_user_hw_breakpoint(bp, &attr);
-	if (ret)
+	if (ret) {
+		ptrace_put_breakpoints(task);
 		return ret;
+	}
 	thread->ptrace_bps[0] = bp;
+	ptrace_put_breakpoints(task);
 	thread->dabr = data;
 	return 0;
 }
@@ -965,9 +972,12 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 					ptrace_triggered, task);
 	if (IS_ERR(bp)) {
 		thread->ptrace_bps[0] = NULL;
+		ptrace_put_breakpoints(task);
 		return PTR_ERR(bp);
 	}
 
+	ptrace_put_breakpoints(task);
+
 #endif /* CONFIG_HAVE_HW_BREAKPOINT */
 
/* Move contents to the DABR register */
@@ -1591,10 +1601,7 @@ long arch_ptrace(struct task_struct *child, long request,
}
 
case PTRACE_SET_DEBUGREG:
-	if (ptrace_get_breakpoints(child) < 0)
-   return -ESRCH;
ret = ptrace_set_debugreg(child, addr, data);
-   ptrace_put_breakpoints(child);
break;
 
 #ifdef CONFIG_PPC64


[PATCH] powerpc, hw_breakpoints: Fix CONFIG_HAVE_HW_BREAKPOINT off-case in ptrace_set_debugreg

2011-05-05 Thread Frederic Weisbecker
We make use of ptrace_get_breakpoints() / ptrace_put_breakpoints()
to protect ptrace_set_debugreg() even if CONFIG_HAVE_HW_BREAKPOINT
is off. However in this case, these APIs are not implemented.

To fix this, push the protection down inside the relevant ifdef.
Best would be to export the code inside CONFIG_HAVE_HW_BREAKPOINT
into a standalone function to clean up the ifdefery there and call
the breakpoint ref API inside. But as it is more invasive, this
should rather be made in an -rc1.

Fixes:

arch/powerpc/kernel/ptrace.c:1594: error: implicit declaration of function 
'ptrace_get_breakpoints'
make[2]: *** [arch/powerpc/kernel/ptrace.o] Error 1
make[1]: *** [arch/powerpc/kernel] Error 2
make: *** [sub-make] Error 2

Reported-by: Ingo Molnar mi...@elte.hu
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: v2.6.33.. sta...@kernel.org
---
 arch/powerpc/kernel/ptrace.c |   15 +++
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 4edeeb3..a6ae1cf 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -933,12 +933,16 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 	if (data && !(data & DABR_TRANSLATION))
 		return -EIO;
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
+	if (ptrace_get_breakpoints(task) < 0)
+		return -ESRCH;
+
 	bp = thread->ptrace_bps[0];
 	if ((!data) || !(data & (DABR_DATA_WRITE | DABR_DATA_READ))) {
 		if (bp) {
 			unregister_hw_breakpoint(bp);
 			thread->ptrace_bps[0] = NULL;
 		}
+		ptrace_put_breakpoints(task);
 		return 0;
 	}
 	if (bp) {
@@ -948,9 +952,12 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned 
long addr,
(DABR_DATA_WRITE | DABR_DATA_READ),
attr.bp_type);
ret =  modify_user_hw_breakpoint(bp, attr);
-   if (ret)
+   if (ret) {
+   ptrace_put_breakpoints(task);
return ret;
+   }
thread-ptrace_bps[0] = bp;
+   ptrace_put_breakpoints(task);
thread-dabr = data;
return 0;
}
@@ -965,9 +972,12 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned 
long addr,
ptrace_triggered, task);
if (IS_ERR(bp)) {
thread-ptrace_bps[0] = NULL;
+   ptrace_put_breakpoints(task);
return PTR_ERR(bp);
}
 
+   ptrace_put_breakpoints(task);
+
 #endif /* CONFIG_HAVE_HW_BREAKPOINT */
 
/* Move contents to the DABR register */
@@ -1591,10 +1601,7 @@ long arch_ptrace(struct task_struct *child, long request,
}
 
case PTRACE_SET_DEBUGREG:
-   if (ptrace_get_breakpoints(child)  0)
-   return -ESRCH;
ret = ptrace_set_debugreg(child, addr, data);
-   ptrace_put_breakpoints(child);
break;
 
 #ifdef CONFIG_PPC64
-- 
1.7.3.2



[PATCH 2/6] hw_breakpoints: Migrate breakpoint conditional build under new config

2011-04-27 Thread Frederic Weisbecker
Migrate conditional hw_breakpoint code compilation under
the new config to prepare for letting the user chose whether
or not to build this feature

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Will Deacon will.dea...@arm.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
---
 arch/arm/include/asm/hw_breakpoint.h |4 ++--
 arch/arm/include/asm/processor.h |2 +-
 arch/arm/kernel/Makefile |2 +-
 arch/arm/kernel/entry-header.S   |2 +-
 arch/arm/kernel/ptrace.c |4 ++--
 arch/powerpc/include/asm/cputable.h  |4 ++--
 arch/powerpc/include/asm/hw_breakpoint.h |6 +++---
 arch/powerpc/include/asm/processor.h |4 ++--
 arch/powerpc/kernel/Makefile |2 +-
 arch/powerpc/kernel/process.c|   18 +-
 arch/powerpc/kernel/ptrace.c |   12 ++--
 arch/powerpc/lib/Makefile|2 +-
 arch/sh/kernel/Makefile  |2 +-
 arch/sh/kernel/cpu/sh4a/Makefile |2 +-
 include/linux/hw_breakpoint.h|6 +++---
 include/linux/perf_event.h   |4 ++--
 kernel/Makefile  |2 +-
 kernel/perf_event.c  |4 ++--
 samples/Kconfig  |2 +-
 19 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/arch/arm/include/asm/hw_breakpoint.h 
b/arch/arm/include/asm/hw_breakpoint.h
index f389b27..fc6ba18 100644
--- a/arch/arm/include/asm/hw_breakpoint.h
+++ b/arch/arm/include/asm/hw_breakpoint.h
@@ -5,7 +5,7 @@
 
 struct task_struct;
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
 
 struct arch_hw_breakpoint_ctrl {
u32 __reserved  : 9,
@@ -128,6 +128,6 @@ int hw_breakpoint_slots(int type);
 #else
 static inline void clear_ptrace_hw_breakpoint(struct task_struct *tsk) {}
 
-#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+#endif /* CONFIG_HW_BREAKPOINT */
 #endif /* __KERNEL__ */
 #endif /* _ARM_HW_BREAKPOINT_H */
diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index b2d9df5..b86d135 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -30,7 +30,7 @@
 #endif
 
 struct debug_info {
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
struct perf_event   *hbp[ARM_MAX_HBP_SLOTS];
 #endif
 };
diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index 8d95446..e6c4b04 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -47,7 +47,7 @@ obj-$(CONFIG_HAVE_TCM)	+= tcm.o
 obj-$(CONFIG_CRASH_DUMP)   += crash_dump.o
 obj-$(CONFIG_SWP_EMULATE)  += swp_emulate.o
 CFLAGS_swp_emulate.o   := -Wa,-march=armv7-a
-obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
+obj-$(CONFIG_HW_BREAKPOINT)	+= hw_breakpoint.o
 
 obj-$(CONFIG_CRUNCH)   += crunch.o crunch-bits.o
 AFLAGS_crunch-bits.o   := -Wa,-mcpu=ep9312
diff --git a/arch/arm/kernel/entry-header.S b/arch/arm/kernel/entry-header.S
index 051166c..fbc7cc9 100644
--- a/arch/arm/kernel/entry-header.S
+++ b/arch/arm/kernel/entry-header.S
@@ -171,7 +171,7 @@
@ we can access the debug registers safely.
@
.macro  debug_entry, fsr
-#if defined(CONFIG_HAVE_HW_BREAKPOINT) && defined(CONFIG_PREEMPT)
+#if defined(CONFIG_HW_BREAKPOINT) && defined(CONFIG_PREEMPT)
ldr r4, =0x40f  @ mask out fsr.fs
and r5, r4, \fsr
cmp r5, #2  @ debug exception
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 2bf27f3..a3f8a0b 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -468,7 +468,7 @@ static int ptrace_setvfpregs(struct task_struct *tsk, void 
__user *data)
 }
 #endif
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
 /*
  * Convert a virtual register number into an index for a thread_info
  * breakpoint array. Breakpoints are identified using positive numbers
@@ -765,7 +765,7 @@ long arch_ptrace(struct task_struct *child, long request,
break;
 #endif
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
case PTRACE_GETHBPREGS:
ret = ptrace_gethbpregs(child, addr,
(unsigned long __user *)data);
diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 1833d1a..36b5568 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -538,9 +538,9 @@ static inline int cpu_has_feature(unsigned long feature)
 feature);
 }
 
-#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#ifdef CONFIG_HW_BREAKPOINT
 #define HBP_NUM 1
-#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+#endif /* CONFIG_HW_BREAKPOINT */
 
 #endif /* !__ASSEMBLY__ */

[PATCH 6/6] hw_breakpoints: Drop remaining misplaced dependency on perf

2011-04-27 Thread Frederic Weisbecker
Powerpc and Arm select breakpoint support ability only if
Perf is built. This is not necessary anymore now that we
enable perf once breakpoints support is selected.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Will Deacon will.dea...@arm.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
---
 arch/arm/Kconfig |2 +-
 arch/powerpc/Kconfig |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..1d3a0b4 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -24,7 +24,7 @@ config ARM
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_REGS_AND_STACK_ACCESS_API
-	select HAVE_HW_BREAKPOINT if (PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7))
+	select HAVE_HW_BREAKPOINT if (CPU_V6 || CPU_V6K || CPU_V7)
select HAVE_C_RECORDMCOUNT
select HAVE_GENERIC_HARDIRQS
select HAVE_SPARSE_IRQ
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..e944eb8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -134,7 +134,7 @@ config PPC
select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_REGS_AND_STACK_ACCESS_API
-	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
+   select HAVE_HW_BREAKPOINT if PPC_BOOK3S_64
select HAVE_GENERIC_HARDIRQS
select HAVE_SPARSE_IRQ
select IRQ_PER_CPU
-- 
1.7.3.2



[PATCH 3/5] powerpc, hw_breakpoints: Fix racy access to ptrace breakpoints

2011-04-22 Thread Frederic Weisbecker
(resend with ppc list in cc)

While the tracer accesses ptrace breakpoints, the child task may
concurrently exit due to a SIGKILL and thus release its breakpoints
at the same time. We can then dereference some freed pointers.

To fix this, hold a reference on the child breakpoints before
manipulating them.

Reported-by: Oleg Nesterov o...@redhat.com
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Will Deacon will.dea...@arm.com
Cc: Prasad pra...@linux.vnet.ibm.com
Cc: Paul Mundt let...@linux-sh.org
Cc: v2.6.33.. sta...@kernel.org
---
 arch/powerpc/kernel/ptrace.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 55613e3..4edeeb3 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -1591,7 +1591,10 @@ long arch_ptrace(struct task_struct *child, long request,
}
 
case PTRACE_SET_DEBUGREG:
+	if (ptrace_get_breakpoints(child) < 0)
+   return -ESRCH;
ret = ptrace_set_debugreg(child, addr, data);
+   ptrace_put_breakpoints(child);
break;
 
 #ifdef CONFIG_PPC64
-- 
1.7.3.2



Re: [PATCH 12/40] x86, compat: convert ia32 layer to use

2010-06-23 Thread Frederic Weisbecker
On Wed, Jun 23, 2010 at 12:14:02PM +0200, Christoph Hellwig wrote:
 Any reason we need to differentiate between COMPAT_SYSCALL_DEFINE and
 ARCH_COMPAT_SYSCALL_DEFINE?  We don't need this for native system calls,
 so I can't see the reason to do it for compat system calls.
 


I think we wanted that to keep the sys32_-prefixed naming, to avoid
collisions with generic compat handler names.



Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls

2010-06-23 Thread Frederic Weisbecker
On Wed, Jun 23, 2010 at 12:19:38PM +0200, Andi Kleen wrote:
 , Ian Munsie wrote:
 From: Ian Munsie imun...@au1.ibm.com

 This patch converts numerous trivial compat syscalls through the generic
 kernel code to use the COMPAT_SYSCALL_DEFINE family of macros.

 Why? This just makes the code look uglier and the functions harder
 to grep for.


Because it makes them usable with syscall tracing.



Re: [PATCH 12/40] x86, compat: convert ia32 layer to use

2010-06-23 Thread Frederic Weisbecker
On Wed, Jun 23, 2010 at 12:46:04PM +0200, Christoph Hellwig wrote:
 On Wed, Jun 23, 2010 at 12:36:21PM +0200, Frederic Weisbecker wrote:
  I think we wanted that to keep the sys32_-prefixed naming, to avoid
  collisions with generic compat handler names.
 
 For native syscalls we do this by adding a arch prefix inside the
 syscall name, e.g.:
 
 arch/s390/kernel/sys_s390.c:SYSCALL_DEFINE(s390_fallocate)(int fd, int mode, 
 loff_t offset,
 arch/sparc/kernel/sys_sparc_64.c:SYSCALL_DEFINE1(sparc_pipe_real, struct 
 pt_regs *, regs)


In fact we sort of wanted to standardize the names of arch-overridden compat
syscalls, so that userspace programs playing with syscall tracing won't have
to deal with arch naming differences.



Re: [PATCH 31/40] trace syscalls: Convert various generic compat syscalls

2010-06-23 Thread Frederic Weisbecker
On Wed, Jun 23, 2010 at 12:37:44PM +0200, Andi Kleen wrote:
 , Frederic Weisbecker wrote:
 On Wed, Jun 23, 2010 at 12:19:38PM +0200, Andi Kleen wrote:
 , Ian Munsie wrote:
  From: Ian Munsie imun...@au1.ibm.com

 This patch converts numerous trivial compat syscalls through the generic
 kernel code to use the COMPAT_SYSCALL_DEFINE family of macros.

 Why? This just makes the code look uglier and the functions harder
 to grep for.


 Because it makes them usable with syscall tracing.

 Ok that information is missing in the changelog then.


Agreed, the changelog fails to state the purpose of the patch.



 Also I hope the uglification-usefullness factor is really worth it.
 The patch is certainly no slouch on the uglification side.


It's worth it because the kernel's syscall tracing is not complete: we
lack the whole compat part.

These wrappers let us create TRACE_EVENT() for every syscall automatically.
If we had to create them manually, the uglification would be far
worse.

Most syscalls use the syscall wrappers already, so the uglification
is mostly there. We just forgot to uglify a bunch of them :)


 It also has maintenance costs, e.g. I doubt ctags and cscope
 will be able to deal with these kinds of macros, so it has a
 high cost for everyone using these tools. For those
 it would be actually better if you used separate annotation
 that does not confuse standard C parsers.


I haven't heard any complaints about existing syscall wrappers.

What kind of annotations could solve that?



Re: [PATCH 10/40] tracing: add tracing support for compat syscalls

2010-06-23 Thread Frederic Weisbecker
On Wed, Jun 23, 2010 at 11:26:26AM -0400, Steven Rostedt wrote:
 On Wed, 2010-06-23 at 20:02 +1000, Ian Munsie wrote:
  From: Jason Baron jba...@redhat.com
  
  Add core support to event tracing for compat syscalls. The basic idea is 
  that we
  check if we have a compat task via 'is_compat_task()'. If so, we lookup in 
  the
  new compat_syscalls_metadata table, the corresponding struct 
  syscall_metadata, to
  check syscall arguments and whether or not we are enabled.
  
  Signed-off-by: Jason Baron jba...@redhat.com
  Signed-off-by: Ian Munsie imun...@au1.ibm.com
  ---
   include/linux/compat.h|2 +
   include/trace/syscall.h   |4 ++
   kernel/trace/trace.h  |2 +
   kernel/trace/trace_syscalls.c |   86 
  +
   4 files changed, 86 insertions(+), 8 deletions(-)
  
  diff --git a/include/linux/compat.h b/include/linux/compat.h
  index ab638e9..a94f13d 100644
  --- a/include/linux/compat.h
  +++ b/include/linux/compat.h
  @@ -363,6 +363,8 @@ extern ssize_t compat_rw_copy_check_uvector(int type,
   
   #else /* CONFIG_COMPAT */
   
  +#define NR_syscalls_compat 0
  +
   static inline int is_compat_task(void)
   {
  return 0;
  diff --git a/include/trace/syscall.h b/include/trace/syscall.h
  index 75f3dce..67d4e64 100644
  --- a/include/trace/syscall.h
  +++ b/include/trace/syscall.h
  @@ -22,6 +22,7 @@
   struct syscall_metadata {
  const char  *name;
  int syscall_nr;
  +   int compat_syscall_nr;
 
 This is also bloating the kernel. I don't like to add fields to the
 syscall_metadata lightly.
 
 You're adding 4 more bytes to this structure to handle a few items. Find
 a better way to do this.


This is for cases where the compat and normal handlers are the same.
I haven't checked whether such cases happen often enough to make this a
benefit. I suspect they don't. Moreover, I guess this adds more
complexity to the code.

Maybe this should simply be dropped.



 
  int nb_args;
  const char  **types;
  const char  **args;
  @@ -38,6 +39,9 @@ struct syscall_metadata {
   
   #ifdef CONFIG_FTRACE_SYSCALLS
   extern unsigned long arch_syscall_addr(int nr);
  +#ifdef CONFIG_COMPAT
  +extern unsigned long arch_compat_syscall_addr(int nr);
  +#endif
   extern int init_syscall_trace(struct ftrace_event_call *call);
   
   extern int reg_event_syscall_enter(struct ftrace_event_call *call);
  diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
  index 01ce088..53ace4b 100644
  --- a/kernel/trace/trace.h
  +++ b/kernel/trace/trace.h
  @@ -79,12 +79,14 @@ enum trace_type {
   struct syscall_trace_enter {
  struct trace_entry  ent;
  int nr;
  +   int compat;
 
 You're adding 4 bytes to a trace (taking up precious buffer space) for a
 single flag. If anything, set the 31st bit of nr if it is compat.



That too can be dropped, I think: compat syscalls and normal syscalls are
different events, so the distinction can be made in the event id or name.
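
For illustration, a minimal sketch of the bit-flag encoding suggested
above (the macro and helper names are made up, not actual kernel code):

#define TRACE_SYSCALL_COMPAT_BIT	(1U << 31)

static inline unsigned int syscall_nr_encode(unsigned int nr, int compat)
{
	/* fold the compat flag into the top bit instead of a new field */
	return compat ? (nr | TRACE_SYSCALL_COMPAT_BIT) : nr;
}

static inline int syscall_nr_is_compat(unsigned int nr)
{
	return (nr & TRACE_SYSCALL_COMPAT_BIT) != 0;
}

static inline unsigned int syscall_nr_decode(unsigned int nr)
{
	return nr & ~TRACE_SYSCALL_COMPAT_BIT;
}

This assumes syscall numbers never reach bit 31, which holds for the
NR_syscalls ranges discussed here.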



Re: [Patch 1/4] Allow arch-specific cleanup before breakpoint unregistration

2010-05-26 Thread Frederic Weisbecker
On Wed, May 26, 2010 at 10:47:42PM +0530, K.Prasad wrote:
 On Wed, May 26, 2010 at 10:54:41AM +0100, David Howells wrote:
  K.Prasad pra...@linux.vnet.ibm.com wrote:
  
My understanding is weak function definitions must appear in a 
different C
file than their call sites to work on some toolchains.

   
   Atleast, there are quite a few precedents inside the Linux kernel for
   __weak functions being invoked from the file in which they are defined
   (arch_hwblk_init, arch_enable_nonboot_cpus_begin and hw_perf_disable to
   name a few).
   Moreover the online GCC docs haven't any such constraints mentioned.
  
  I've seen problems in this area.  gcc sometimes inlines a weak function 
  that's
  in the same file as the call point.
  
 
 We've seen such behaviour even otherwiseeven with noinline attribute
 in place. I'm not sure if this gcc fix
 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16922) helped correct the
 behaviour, but the lesson has been to not trust a function to be
 inlined/remain non-inline consistently.


If we can't put the call to the function in the same file as its weak
definition, then perf is totally screwed.

And in fact it would make __weak basically useless and unusable. I guess
that happened in old gcc versions and has been fixed since.
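
For reference, a two-file sketch of the pattern in question (file and
function names are illustrative, not actual kernel code):

/* kernel/generic.c: weak default with a call site in the same file.
 * The worry above is that an old gcc could inline this default into
 * release_bp(), bypassing a strong arch override at link time. */
void __attribute__((weak)) arch_bp_cleanup(void)
{
	/* default: nothing to do */
}

void release_bp(void)
{
	arch_bp_cleanup();
}

/* arch/foo/kernel/bp.c: strong definition that should win at link
 * time, provided the weak call site was not inlined away */
void arch_bp_cleanup(void)
{
	/* arch-specific teardown */
}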

Anyway, I'm personally fine with this patch (you can put my hack
if you want).

Thanks.



Re: [Patch 1/4] Allow arch-specific cleanup before breakpoint unregistration

2010-05-26 Thread Frederic Weisbecker
On Wed, May 26, 2010 at 11:01:24PM +0530, K.Prasad wrote:
 On Wed, May 26, 2010 at 07:23:15PM +0200, Frederic Weisbecker wrote:
  On Wed, May 26, 2010 at 10:47:42PM +0530, K.Prasad wrote:
   On Wed, May 26, 2010 at 10:54:41AM +0100, David Howells wrote:
K.Prasad pra...@linux.vnet.ibm.com wrote:

  My understanding is weak function definitions must appear in a 
  different C
  file than their call sites to work on some toolchains.
  
 
 Atleast, there are quite a few precedents inside the Linux kernel for
 __weak functions being invoked from the file in which they are defined
 (arch_hwblk_init, arch_enable_nonboot_cpus_begin and hw_perf_disable 
 to
 name a few).
 Moreover the online GCC docs haven't any such constraints mentioned.

I've seen problems in this area.  gcc sometimes inlines a weak function 
that's
in the same file as the call point.

   
   We've seen such behaviour even otherwiseeven with noinline attribute
   in place. I'm not sure if this gcc fix
   (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16922) helped correct the
   behaviour, but the lesson has been to not trust a function to be
   inlined/remain non-inline consistently.
  
  
  If we can't put the call to the function in the same file of its weak
  definition, then perf is totally screwed.
  
  And in fact it makes __weak basically useless and unusable. I guess
  that happened in old gcc versions that have been fixed now.
  
  Anyway, I'm personally fine with this patch (you can put my hack
  if you want).
 
 
 I guess you meant Acked-by: :-)



Oops, right :)



Re: ftrace syscalls, PowerPC: Fixes and PowerPC raw syscall tracepoint implementation

2010-05-13 Thread Frederic Weisbecker
On Thu, May 13, 2010 at 12:06:11PM -0400, Steven Rostedt wrote:
 Frederic,
 
 I'm fine with these patches, but since you mainly did the syscall work,
 I'll let you take them.


Sure.


 
 The patches that touch the PowerPC code needs an acked-by from Ben or
 Paul.
 
 -- Steve



Ok,

Thanks.



Re: linux-next: PowerPC WARN_ON_ONCE() after merge of the final tree (tip related)

2010-04-16 Thread Frederic Weisbecker
On Fri, Apr 16, 2010 at 12:38:43PM +0200, Peter Zijlstra wrote:
 On Thu, 2010-04-15 at 19:15 +0200, Frederic Weisbecker wrote:
   that looks rather ugly. Why not do a raw:
   
 this_cpu_inc(lockdep_stats.redundant_hardirqs_on);
   
   which basically open-codes debug_atomic_inc(), but without the warning?
  
  
  Because that would open a race against interrupts that might
  touch lockdep_stats.redundant_hardirqs_on too.
 
 
 How so, its a pure per-cpu variable right? so either the increment
 happens before the interrupts hits, or after, in either case there
 should not be a race with interrupts.


On x86, yeah, I guess the compiler simply loads the address
and does an inc directly, which is atomic wrt interrupts.

But what about another arch that would need an intermediate
load of the value:

load val, reg
add reg, 1
<--- interrupt here
store reg, val
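
The same race spelled out in C (a sketch, not kernel code):

static unsigned long counter;	/* stands for the per-cpu stat */

void irq_handler(void)
{
	counter++;			/* the interrupt's increment */
}

void trace_path(void)
{
	unsigned long tmp = counter;	/* load val, reg */
	tmp = tmp + 1;			/* add reg, 1    */
	/* <--- irq_handler() runs here and bumps counter ... */
	counter = tmp;			/* store reg, val: the irq's
					 * update is silently lost */
}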



Re: linux-next: PowerPC WARN_ON_ONCE() after merge of the final tree (tip related)

2010-04-15 Thread Frederic Weisbecker
On Thu, Apr 15, 2010 at 08:49:40AM +0200, Ingo Molnar wrote:
 
 * Stephen Rothwell s...@canb.auug.org.au wrote:
 
  Hi all,
  
  Yesterday's (and today's) linux-next boot (PowerPC) failed like this:
  
  [ cut here ]
  Badness at kernel/lockdep.c:2301
  NIP: c00a35c8 LR: c00084c4 CTR: 
  REGS: c0bf77e0 TRAP: 0700   Not tainted  (2.6.34-rc4-autokern1)
  MSR: 80021032 ME,CE,IR,DR  CR: 2444  XER: 0004
  TASK = c0aa3d30[0] 'swapper' THREAD: c0bf4000 CPU: 0
  GPR00: 0001 c0bf7a60 c0bf32f0 c00084c4 
  GPR04:  0a00  0068 
  GPR08: 0008 c0c4fabe  7265677368657265 
  GPR12: 80009032 c7691000 01c0 c0770bf8 
  GPR16: c076f390  0043 024876f0 
  GPR20: c0887480 02487480 c08876f0 01b5f8d0 
  GPR24: c0770478 0330 c0c1f1c8 c0884610 
  GPR28: c0c1b290 c00084c4 c0b45068 c0aa3d30 
  NIP [c00a35c8] .trace_hardirqs_on_caller+0xb0/0x224
  LR [c00084c4] system_call_common+0xc4/0x114
  Call Trace:
  [c0bf7a60] [c0bf7ba0] init_thread_union+0x3ba0/0x4000 
  (unreliable)
  [c0bf7af0] [c00084c4] system_call_common+0xc4/0x114
  --- Exception: c01 at .kernel_thread+0x28/0x70
  LR = .rest_init+0x34/0xf8
  [c0bf7de0] [c086916c] .proc_sys_init+0x20/0x64 (unreliable)
  [c0bf7e50] [c00099c0] .rest_init+0x20/0xf8
  [c0bf7ee0] [c0848af0] .start_kernel+0x484/0x4a8
  [c0bf7f90] [c00083c0] .start_here_common+0x1c/0x5c
  Instruction dump:
  409e0188 0fe0 48000180 801f08d8 2f80 41be0050 880d01da 2fa0 
  41be0028 e93e8538 8809 6801 0b00 2fa0 41be0010 e93e8538 
  [ cut here ]
  
  Caused by commit bd6d29c25bb1a24a4c160ec5de43e0004e01f72b (lockstat:
  Make lockstat counting per cpu).  This added a WARN_ON_ONCE to
  debug_atomic_inc() which is called from trace_hardirqs_on_caller() with
  irqs enabled.
  
  Line 2301 is:
  
   if (unlikely(curr->hardirqs_enabled)) {
   debug_atomic_inc(redundant_hardirqs_on);   <--- 2301
  return;
  }
  
  This is especially bad since on PowerPC, WARN_ON is a TRAP and the return
  path from the TRAP also calls trace_hardirqs_on_caller(), so the TRAP
  recurses ...
 
 Ok, we'll fix the warning.
 
 Btw., WARN_ON trapping on PowerPC is clearly a PowerPC bug - there's a good 
 reason we have WARN_ON versus BUG_ON - it should be fixed.


In this case, I guess the following fix should be sufficient?
I'm going to test it and provide a sane changelog.


diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 78325f8..65d4336 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2298,7 +2298,11 @@ void trace_hardirqs_on_caller(unsigned long ip)
return;
 
if (unlikely(curr->hardirqs_enabled)) {
+   unsigned long flags;
+
+   raw_local_irq_save(flags);
debug_atomic_inc(redundant_hardirqs_on);
+   raw_local_irq_restore(flags);
return;
}
/* we'll do an OFF -> ON transition: */



Re: linux-next: PowerPC WARN_ON_ONCE() after merge of the final tree (tip related)

2010-04-15 Thread Frederic Weisbecker
On Thu, Apr 15, 2010 at 04:03:58PM +0200, Ingo Molnar wrote:
  In this case, I guess the following fix should be sufficient?
  I'm going to test it and provide a sane changelog.
  
  
  diff --git a/kernel/lockdep.c b/kernel/lockdep.c
  index 78325f8..65d4336 100644
  --- a/kernel/lockdep.c
  +++ b/kernel/lockdep.c
  @@ -2298,7 +2298,11 @@ void trace_hardirqs_on_caller(unsigned long ip)
  return;
   
   if (unlikely(curr->hardirqs_enabled)) {
  +   unsigned long flags;
  +
  +   raw_local_irq_save(flags);
  debug_atomic_inc(redundant_hardirqs_on);
  +   raw_local_irq_restore(flags);
  return;
  }
   /* we'll do an OFF -> ON transition: */
 
 that looks rather ugly. Why not do a raw:
 
   this_cpu_inc(lockdep_stats.redundant_hardirqs_on);
 
 which basically open-codes debug_atomic_inc(), but without the warning?


Because that would open a race against interrupts that might
touch lockdep_stats.redundant_hardirqs_on too.

If you think it's not very important (this race must be pretty rare I guess),
I can use your solution.



 
 Btw., using the this_cpu() methods might result in faster code for all the 
 debug_atomic_inc() macros as well?


Indeed, will change that too.

Thanks.



Re: linux-next: PowerPC WARN_ON_ONCE() after merge of the final tree (tip related)

2010-04-15 Thread Frederic Weisbecker
On Thu, Apr 15, 2010 at 04:03:58PM +0200, Ingo Molnar wrote:
  
  
  diff --git a/kernel/lockdep.c b/kernel/lockdep.c
  index 78325f8..65d4336 100644
  --- a/kernel/lockdep.c
  +++ b/kernel/lockdep.c
  @@ -2298,7 +2298,11 @@ void trace_hardirqs_on_caller(unsigned long ip)
  return;
   
   if (unlikely(curr->hardirqs_enabled)) {
  +   unsigned long flags;
  +
  +   raw_local_irq_save(flags);
  debug_atomic_inc(redundant_hardirqs_on);
  +   raw_local_irq_restore(flags);
  return;
  }
   /* we'll do an OFF -> ON transition: */
 
 that looks rather ugly. Why not do a raw:
 
   this_cpu_inc(lockdep_stats.redundant_hardirqs_on);
 
 which basically open-codes debug_atomic_inc(), but without the warning?


There is also no guarantee we are in a non-preemptable section. We can then
also race against another cpu.

I'm not sure what to do.



Re: [PATCH v2] powerpc/perf_events: Implement perf_arch_fetch_caller_regs

2010-03-18 Thread Frederic Weisbecker
On Thu, Mar 18, 2010 at 04:05:13PM +1100, Paul Mackerras wrote:
 This implements a powerpc version of perf_arch_fetch_caller_regs.
 It's implemented in assembly because that way we can be sure there
 isn't a stack frame for perf_arch_fetch_caller_regs.  If it was in
 C, gcc might or might not create a stack frame for it, which would
 affect the number of levels we have to skip.
 
 With this, we see results from perf record -e lock:lock_acquire like
 this:
 
 # Samples: 24878
 #
 # Overhead Command  Shared Object  Symbol
 #   ..  .  ..
 #
 14.99%  perf  [kernel.kallsyms]  [k] ._raw_spin_lock
   |
   --- ._raw_spin_lock
  |
  |--25.00%-- .alloc_fd
  |  (nil)
  |  |
  |  |--50.00%-- .anon_inode_getfd
  |  |  .sys_perf_event_open
  |  |  syscall_exit
  |  |  syscall
  |  |  create_counter
  |  |  __cmd_record
  |  |  run_builtin
  |  |  main
  |  |  0xfd2e704
  |  |  0xfd2e8c0
  |  |  (nil)
 
 ... etc.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org
 ---
  arch/powerpc/include/asm/asm-compat.h |2 ++
  arch/powerpc/kernel/misc.S|   28 
  2 files changed, 30 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/asm-compat.h 
 b/arch/powerpc/include/asm/asm-compat.h
 index c1b475a..a9b91ed 100644
 --- a/arch/powerpc/include/asm/asm-compat.h
 +++ b/arch/powerpc/include/asm/asm-compat.h
 @@ -28,6 +28,7 @@
  #define PPC_LLARX(t, a, b, eh)   PPC_LDARX(t, a, b, eh)
 #define PPC_STLCX   stringify_in_c(stdcx.)
 #define PPC_CNTLZL   stringify_in_c(cntlzd)
 +#define PPC_LR_STKOFF   16
  
  /* Move to CR, single-entry optimized version. Only available
   * on POWER4 and later.
 @@ -51,6 +52,7 @@
 #define PPC_STLCX   stringify_in_c(stwcx.)
 #define PPC_CNTLZL   stringify_in_c(cntlzw)
 #define PPC_MTOCRF   stringify_in_c(mtcrf)
 +#define PPC_LR_STKOFF   4
  
  #endif
  
 diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
 index 2d29752..b485a87 100644
 --- a/arch/powerpc/kernel/misc.S
 +++ b/arch/powerpc/kernel/misc.S
 @@ -127,3 +127,31 @@ _GLOBAL(__setup_cpu_power7)
  _GLOBAL(__restore_cpu_power7)
   /* place holder */
   blr
 +
 +#ifdef CONFIG_EVENT_TRACING
 +/*
 + * Get a minimal set of registers for our caller's nth caller.
 + * r3 = regs pointer, r5 = n.
 + *
 + * We only get R1 (stack pointer), NIP (next instruction pointer)
 + * and LR (link register).  These are all we can get in the
 + * general case without doing complicated stack unwinding, but
 + * fortunately they are enough to do a stack backtrace, which
 + * is all we need them for.
 + */
 +_GLOBAL(perf_arch_fetch_caller_regs)
 + mr  r6,r1
 + cmpwi   r5,0
 + mflr    r4
 + ble 2f
 + mtctr   r5
 +1:   PPC_LL  r6,0(r6)
 + bdnz    1b
 + PPC_LL  r4,PPC_LR_STKOFF(r6)
 +2:   PPC_LL  r7,0(r6)
 + PPC_LL  r7,PPC_LR_STKOFF(r7)
 + PPC_STL r6,GPR1-STACK_FRAME_OVERHEAD(r3)
 + PPC_STL r4,_NIP-STACK_FRAME_OVERHEAD(r3)
 + PPC_STL r7,_LINK-STACK_FRAME_OVERHEAD(r3)
 + blr
 +#endif /* CONFIG_EVENT_TRACING */


Ingo has reported to me that context-switch software events
(not the trace event version) have crappy callchains.

So, while looking into it:

- PERF_COUNT_SW_CPU_MIGRATIONS provides no regs.
Heh, and the event doesn't even work, because perf_swevent_add()
gives up if regs are NULL.
Has PERF_COUNT_SW_CPU_MIGRATIONS ever worked?

- PERF_COUNT_SW_CONTEXT_SWITCHES uses task_pt_regs(). This
seems very wrong: we don't want the regs from where a user
task was interrupted, or from before a syscall.
That notwithstanding, task_pt_regs() on kernel threads
has insane effects.

What we want for both is a hot regs snapshot.
I'm going to fix this. But it means the CONFIG_EVENT_TRACING
dependency is not true anymore. So I can't keep the exported
symbol of perf_arch_fetch_caller_regs() in trace_event_perf.c.

The ideal would be to put the EXPORT_SYMBOL in perf_event.c,
but doing so in the same file where a weak symbol is defined has
unpredictable effects. So I'm going to make it a macro as I
don't have the choice. I'll update ppc at the same time but I
can't guarantee it will even build :)
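
Roughly, the macro idea looks like this (a sketch under the assumption
that the arch hook keeps its current signature; not necessarily what
will be merged):

/* weak default lives in one file, arch overrides elsewhere */
extern void perf_arch_fetch_caller_regs(struct pt_regs *regs,
					unsigned long ip, int skip);

/* generic entry point as a macro: no call to the weak symbol sits in
 * the file that carries the EXPORT_SYMBOL, so gcc can't inline the
 * weak default over an arch override */
#define perf_fetch_caller_regs(regs, skip) \
	perf_arch_fetch_caller_regs(regs, _THIS_IP_, skip)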

Thanks.



Re: [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc

2010-03-16 Thread Frederic Weisbecker
On Tue, Mar 16, 2010 at 02:22:13PM +1100, Paul Mackerras wrote:
 On Mon, Mar 15, 2010 at 10:04:54PM +0100, Frederic Weisbecker wrote:
  On Mon, Mar 15, 2010 at 04:46:15PM +1100, Paul Mackerras wrote:
 
   14.99%  perf  [kernel.kallsyms]  [k] ._raw_spin_lock
 |
 --- ._raw_spin_lock
|  
|--25.00%-- .alloc_fd
|  (nil)
|  |  
|  |--50.00%-- .anon_inode_getfd
|  |  .sys_perf_event_open
|  |  syscall_exit
|  |  syscall
|  |  create_counter
|  |  __cmd_record
|  |  run_builtin
|  |  main
|  |  0xfd2e704
|  |  0xfd2e8c0
|  |  (nil)
   
   ... etc.
   
   Signed-off-by: Paul Mackerras pau...@samba.org
  
  
  Cool!
 
 By the way, I notice that gcc tends to inline the tracing functions,
 which means that by going up 2 stack frames we miss some of the
 functions.  For example, for the lock:lock_acquire event, we have
 _raw_spin_lock() - lock_acquire() - trace_lock_acquire() -
 perf_trace_lock_acquire() - perf_trace_templ_lock_acquire() -
 perf_fetch_caller_regs() - perf_arch_fetch_caller_regs().
 
 But in the ppc64 kernel binary I just built, gcc inlined
 trace_lock_acquire in lock_acquire, and perf_trace_templ_lock_acquire
 in perf_trace_lock_acquire.  Given that perf_fetch_caller_regs is
 explicitly inlined, going up two levels from perf_fetch_caller_regs
 gets us to _raw_spin_lock, whereas I think you intended it to get us
 to trace_lock_acquire.  I'm not sure what to do about that - any
 thoughts?



Yeah, I've indeed seen this, and the problem is especially
the fact that perf_trace_templ_lock_acquire() may or may not be
inlined.

It is used for trace events that use the TRACE_EVENT_CLASS
thing. We define a pattern of event structure that is shared
among several events.

For example, events A and B share perf_trace_templ_foo().
Both will have a different perf_trace_blah(), but those
perf_trace_blah() wrappers will both call the same perf_trace_templ_foo();
in this case, it won't be inlined.

Events that don't share a pattern will have their
perf_trace_templ_* inlined, because there is an exclusive 1:1
relationship between the two.

The rewind of 2 is well suited to events sharing a pattern: the ip
will match the right event source, and not one of its callers.

Unfortunately, the others are more unlucky.
I didn't worry much about this yet because it had no bad effect
on lock events. Quite the opposite, actually. It's not very interesting
to have lock_acquire as the event source unless you have a callchain.

If you have no callchain, you'll see a lot of entries like this in
perf report:

sym1    lock_acquire
sym2    lock_acquire
sym3    lock_acquire

What you want here is the function that called lock_acquire.

But if you have a callchain it's fine, because you have the nature
of the event (lock_acquire) and the origin as well.

That said, lock events are an exception where the mistake
has a lucky result. Other inlined events are harmed as we lose
their most important caller. So I'm going to fix that.

I can just fetch the regs from perf_trace_foo() and pass them
to perf_trace_templ_foo() and here we are.
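
A sketch of that fix, with simplified signatures (perf_tp_event_sketch()
is a made-up stand-in for the real event submission):

static void perf_trace_templ_lock_acquire(struct pt_regs *regs)
{
	/* shared template: whether gcc inlines it no longer affects
	 * the number of frames to skip, the regs came from above */
	perf_tp_event_sketch(regs);
}

static void perf_trace_lock_acquire(void)
{
	struct pt_regs regs;

	/* capture in the 1:1 wrapper, a fixed distance from the
	 * tracepoint caller */
	perf_fetch_caller_regs(&regs, 1);
	perf_trace_templ_lock_acquire(&regs);
}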

Thanks.



Re: [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc

2010-03-15 Thread Frederic Weisbecker
On Mon, Mar 15, 2010 at 04:46:15PM +1100, Paul Mackerras wrote:
 This implements a powerpc version of perf_arch_fetch_caller_regs.
 It's implemented in assembly because that way we can be sure there
 isn't a stack frame for perf_arch_fetch_caller_regs.  If it was in
 C, gcc might or might not create a stack frame for it, which would
 affect the number of levels we have to skip.  It's not ifdef'd
 because it is only 14 instructions long.
 
 With this, we see results from perf record -e lock:lock_acquire like
 this:
 
 # Samples: 24878
 #
 # Overhead Command  Shared Object  Symbol
 #   ..  .  ..
 #
 14.99%  perf  [kernel.kallsyms]  [k] ._raw_spin_lock
   |
   --- ._raw_spin_lock
  |  
  |--25.00%-- .alloc_fd
  |  (nil)
  |  |  
  |  |--50.00%-- .anon_inode_getfd
  |  |  .sys_perf_event_open
  |  |  syscall_exit
  |  |  syscall
  |  |  create_counter
  |  |  __cmd_record
  |  |  run_builtin
  |  |  main
  |  |  0xfd2e704
  |  |  0xfd2e8c0
  |  |  (nil)
 
 ... etc.
 
 Signed-off-by: Paul Mackerras pau...@samba.org


Cool!



 ---
  arch/powerpc/include/asm/asm-compat.h |2 ++
  arch/powerpc/kernel/misc.S|   20 
  2 files changed, 22 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/asm-compat.h 
 b/arch/powerpc/include/asm/asm-compat.h
 index c1b475a..a9b91ed 100644
 --- a/arch/powerpc/include/asm/asm-compat.h
 +++ b/arch/powerpc/include/asm/asm-compat.h
 @@ -28,6 +28,7 @@
  #define PPC_LLARX(t, a, b, eh)   PPC_LDARX(t, a, b, eh)
  #define PPC_STLCX   stringify_in_c(stdcx.)
  #define PPC_CNTLZL   stringify_in_c(cntlzd)
 +#define PPC_LR_STKOFF   16
  
  /* Move to CR, single-entry optimized version. Only available
   * on POWER4 and later.
 @@ -51,6 +52,7 @@
  #define PPC_STLCX   stringify_in_c(stwcx.)
  #define PPC_CNTLZL   stringify_in_c(cntlzw)
  #define PPC_MTOCRF   stringify_in_c(mtcrf)
 +#define PPC_LR_STKOFF   4
  
  #endif
  
 diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
 index 2d29752..4459500 100644
 --- a/arch/powerpc/kernel/misc.S
 +++ b/arch/powerpc/kernel/misc.S
 @@ -127,3 +127,23 @@ _GLOBAL(__setup_cpu_power7)
  _GLOBAL(__restore_cpu_power7)
   /* place holder */
   blr
 +
 +/*
 + * Get a minimal set of registers for our caller's nth caller.
 + * r3 = regs pointer, r5 = n.
 + */
 +_GLOBAL(perf_arch_fetch_caller_regs)
 + mr  r6,r1
 + cmpwi   r5,0
 + mflr    r4
 + ble 2f
 + mtctr   r5
 +1:   PPC_LL  r6,0(r6)
 + bdnz    1b
 + PPC_LL  r4,PPC_LR_STKOFF(r6)
 +2:   PPC_LL  r7,0(r6)
 + PPC_LL  r7,PPC_LR_STKOFF(r7)
 + PPC_STL r6,GPR1-STACK_FRAME_OVERHEAD(r3)
 + PPC_STL r4,_NIP-STACK_FRAME_OVERHEAD(r3)
 + PPC_STL r7,_LINK-STACK_FRAME_OVERHEAD(r3)
 + blr



Re: [Patch 1/1] PPC64-HWBKPT: Implement hw-breakpoints for PPC64

2010-02-26 Thread Frederic Weisbecker
On Tue, Feb 23, 2010 at 04:27:15PM +0530, K.Prasad wrote:
 On Mon, Feb 22, 2010 at 06:47:46PM +0530, K.Prasad wrote:
  On Sun, Feb 21, 2010 at 02:01:37AM +0100, Frederic Weisbecker wrote:
   On Mon, Feb 15, 2010 at 11:29:14AM +0530, K.Prasad wrote:
 [snipped]
   Also, do you think addr/len/type is enough to abstract out
   any ppc breakpoints?
   
   This looks enough to me to express range breakpoints and
   simple breakpoints. But what about value comparison?
   (And still, there may be other trickier implementations
   I don't know in ppc).
   
  
  The above implementation is for PPC64 architecture that supports only
  'simple' breakpoints of fixed length (no range breakpoints, no value
  comparison). More on that below.
 
 
 Looks like I forgot the 'more on that below' part :-)here are some
 thoughts...
 
 Architectures like PPC Book-E have support for a variety of
 sophisticated debug features and our generic framework, in its present
 form, cannot easily port itself to these processors. In order to extend
 the framework for PPC Book-E, I intend the following to begin with:
 
 - Implement support for data breakpoints through DAC registers with all
   the 'bells and whistles'...support for instruction breakpoints through
   IAC can come in later (without precluding its use through ptrace).
 
 - Embed the flags/variables to store DVC, masked address mode, etc. in
   'struct arch_hw_breakpoint', which will be populated by the user of
   register_breakpoint interface.



Agreed.



 
 Apart from the above extensions to the framework, changes in the generic
 code would be required as described in an earlier LKML mail (ref:
 message-id: 20091127190705.gb18...@in.ibm.com); relevant contents
 pasted below:
 
 I think the register_ interfaces can become wrappers around functions
 that do the following:
 
 - arch_validate(): Validate request by invoking an arch-dependent
   routine. Proceed if returned valid.
 - arch-specific debugreg availability: Do something like
   if (arch_hw_breakpoint_available())
 bp = perf_event_create_kernel_counter();



This is already what does register_hw_break(), it fails
if a slot is not available:

perf_event_create_kernel_counter() -> perf_bp_init() -> reserve_bp_slot()

Having a:

if (arch_hw_breakpoint_available())
 bp = perf_event_create_kernel_counter();

would be racy.
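
A minimal sketch of why the test and the reservation must be one atomic
step inside the create path (the locking details here are illustrative):

static DEFINE_MUTEX(bp_slots_mutex);
static int bp_slots_used;

static int reserve_bp_slot_sketch(void)
{
	int ret = 0;

	mutex_lock(&bp_slots_mutex);
	if (bp_slots_used >= HBP_NUM)
		ret = -ENOSPC;		/* test ...                    */
	else
		bp_slots_used++;	/* ... and reserve, atomically */
	mutex_unlock(&bp_slots_mutex);

	return ret;
}

With a separate availability check, two concurrent callers can both see
the last slot free and both proceed to create.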



 
   perf_event_create_kernel_counter() --> arch_install_hw_breakpoint();
 
 This way, all book-keeping related work (no. of pinned/flexible/per-cpu)
 will be moved to arch-specific files (will be helpful for PPC Book-E
 implementation having two types of debug registers). Every new
 architecture that intends to port to the new hw-breakpoint
 implementation must define their arch_validate(),
 arch_hw_breakpoint_available() and an arch_install_hw_breakpoint(),
 while the hw-breakpoint code will be flexible enough to extend itself to
 each of these archs.
 
 Let me know what you think of the above.



We certainly need the slot reservation in arch (a part of it at least).
But we also need a kind of new interface for arch predefined attributes,
instead of generic attributes.

Probably we need a kind of perf_event_create_kernel_counter() that
can accept either a perf_event_attr (for the perf syscall or ftrace)
or an arch structure that can be passed to the breakpoint API,
so that we don't need the generic translation.
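
Something with the following hypothetical shape (not an existing API;
the name and parameters are illustrative only):

struct perf_event *
register_ptrace_breakpoint(struct task_struct *tsk,
			   struct perf_event_attr *attr,	  /* or NULL */
			   struct arch_hw_breakpoint *arch_info,  /* or NULL */
			   perf_overflow_handler_t triggered);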


 
 Thanks,
 K.Prasad
 



Re: [Patch 1/1] PPC64-HWBKPT: Implement hw-breakpoints for PPC64

2010-02-25 Thread Frederic Weisbecker
On Mon, Feb 22, 2010 at 06:47:46PM +0530, K.Prasad wrote:
 The 'name' field here is actually a legacy inherited from x86 code. It
 is part of x86's arch-specific hw-breakpoint structure since:
 - inspired by the kprobe implementation which accepts symbol name as
   input.
 - kallsyms_lookup_name() was 'unexported' and a module could not resolve
   symbol names externally, so the core-infrastructure had to provide
   such facilities.
 - I wasn't sure about the discussions behind 'unexporting' of
   kallsyms_lookup_name(), so did not venture to export them again (which
   you rightfully did :-)
 
 Having said that, I'll be glad to remove this field (along with that in
 x86),



Cool, I'll integrate the x86 name field removal to the .24 series



 provided we know that there's a way for the user to resolve symbol
 names on its own i.e. routines like kallsyms_lookup_name() will remain
 exported.


Yeah, I guess it's fine to keep kallsyms_lookup_name() exported.


 
  Also, do you think addr/len/type is enough to abstract out
  any ppc breakpoints?
  
  This looks enough to me to express range breakpoints and
  simple breakpoints. But what about value comparison?
  (And still, there may be other trickier implementations
  I don't know in ppc).
  
 
 The above implementation is for PPC64 architecture that supports only
 'simple' breakpoints of fixed length (no range breakpoints, no value
 comparison). More on that below.


Ok. I was just a bit confused in the middle of the several PPC breakpoint
implementations :)



   + /*
   +  * As a policy, the callback is invoked in a 'trigger-after-execute'
   +  * fashion
   +  */
   + (bp->overflow_handler)(bp, 0, NULL, regs);
  
  
  Why are you calling this explicitly instead of using the perf_bp_event()
  thing? This looks like it won't work with perf as the event won't
  be recorded by perf.
  
 
 Yes, should have invoked perf_bp_event() for perf to work well (on a
 side note, it makes me wonder at the amount of 'extra' code that each
 breakpoint exception would execute if it were not called through perf
 sys-call...well, the costs of integrating with a generic infrastructure!)


It has the benefit of not adding extra checks in the breakpoint handler,
like checking the callback. Every breakpoint is treated the same way, which
makes the code simpler.


 
   +void ptrace_triggered(struct perf_event *bp, int nmi,
   +   struct perf_sample_data *data, struct pt_regs *regs)
   +{
   + struct perf_event_attr attr;
   +
   + /*
   +  * Disable the breakpoint request here since ptrace has defined a
   +  * one-shot behaviour for breakpoint exceptions in PPC64.
   +  * The SIGTRAP signal is generated automatically for us in do_dabr().
   +  * We don't have to do anything about that here
   +  */
  
  
  Oh, why does ptrace use a one-shot behaviour in ppc? Breakpoints
  only trigger once?
  
 
 Yes, ptrace breakpoints on PPC64 are designed to trigger once and this
 patch retains that behaviour. It is very convenient to use one-shot
 behaviour on archs where exceptions are triggered-before-execute.


Ah, Why?



  This looks fine for basic breakpoints. And this can probably be
  improved to integrate ranges.
  
  But I think we did something wrong with the generic breakpoint
  interface. We are translating the arch values to generic
  attributes. Then this all will be translated back to arch
  values.
  
  Having generic attributes is necessary for any perf event
  use from userspace. But it looks like a waste for ptrace
  that already gives us arch values. And the problem
  is the same for x86.
  
  So I think we should implement a register_ptrace_breakpoint()
  that doesn't take perf_event_attr but specific arch informations,
  so that we don't need to pass through a generic conversion, which:
  
 
 I agree that the layers of conversion from generic to arch-specific
 breakpoint constants is wasteful.
 Can't the arch_bp_generic_fields() function be moved to
 arch/x86/kernel/ptrace.c instead of a new interface?


I'll answer in your subsequent mail :)


 
  - is wasteful
  - won't be able to express 100% of any arch capabilities. We
can certainly express most arch breakpoints features through
the generic interface, but not all of them (given how tricky
the data value comparison features can be)
  
  I will rework that during the next cycle.
  
  Thanks.
 
 
 Thank you for the comments. I will re-send a new version of the patch
 with the perf_bp_event() substitution.


Thanks.



Re: [Patch 1/1] PPC64-HWBKPT: Implement hw-breakpoints for PPC64

2010-02-20 Thread Frederic Weisbecker
On Mon, Feb 15, 2010 at 11:29:14AM +0530, K.Prasad wrote:
 +struct arch_hw_breakpoint {
 + u8  len; /* length of the target symbol */
 + int type;
 + char    *name; /* Contains name of the symbol to set bkpt */
 + unsigned long   address;
 +};




I don't think it's a good idea to integrate the name of
the target. This is something that should be done in a higher
level, not in an arch backend.
We don't even need to store it anywhere as we can resolve
back an address easily. Symbol awareness is not something
the hardware breakpoint should care about, neither in the
arch nor the generic level.

Also, do you think addr/len/type is enough to abstract out
any ppc breakpoints?

This looks enough to me to express range breakpoints and
simple breakpoints. But what about value comparison?
(And still, there may be other trickier implementations
I don't know in ppc).


 +
 +#include <linux/kdebug.h>
 +#include <asm/reg.h>
 +#include <asm/system.h>
 +
 +/* Total number of available HW breakpoint registers */
 +#define HBP_NUM 1


Looking at the G2 PowerPC implementation, DABR and DABR2 can either
express two different watchpoints or one range watchpoint.

There are also IABR and IABR2 for instruction breakpoints that
follow the same above scheme. I'm not sure we can abstract that
using a constant max linear number of resources.



 +static inline void hw_breakpoint_disable(void)
 +{
 + set_dabr(0);
 +}



So, this is only about data breakpoints?



 + /*
 +  * As a policy, the callback is invoked in a 'trigger-after-execute'
 +  * fashion
 +  */
 + (bp->overflow_handler)(bp, 0, NULL, regs);



Why are you calling this explicitly instead of using the perf_bp_event()
thing? This looks like it won't work with perf as the event won't
be recorded by perf.


 +void ptrace_triggered(struct perf_event *bp, int nmi,
 +   struct perf_sample_data *data, struct pt_regs *regs)
 +{
 + struct perf_event_attr attr;
 +
 + /*
 +  * Disable the breakpoint request here since ptrace has defined a
 +  * one-shot behaviour for breakpoint exceptions in PPC64.
 +  * The SIGTRAP signal is generated automatically for us in do_dabr().
 +  * We don't have to do anything about that here
 +  */



Oh, why does ptrace use a one-shot behaviour in ppc? Breakpoints
only trigger once?



 + if (bp) {
 + attr = bp->attr;
 + attr.bp_addr = data & ~HW_BREAKPOINT_ALIGN;
 +
 + switch (data & (DABR_DATA_WRITE | DABR_DATA_READ)) {
 + case DABR_DATA_READ:
 + attr.bp_type = HW_BREAKPOINT_R;
 + break;
 + case DABR_DATA_WRITE:
 + attr.bp_type = HW_BREAKPOINT_W;
 + break;
 + case (DABR_DATA_WRITE | DABR_DATA_READ):
 + attr.bp_type = HW_BREAKPOINT_R | HW_BREAKPOINT_W;
 + break;
 + }
 + ret = modify_user_hw_breakpoint(bp, &attr);
 + if (ret)
 + return ret;
 + thread->ptrace_bps[0] = bp;
 + thread->dabr = data;
 + return 0;
 + }
 +
 + /* Create a new breakpoint request if one doesn't exist already */
 + hw_breakpoint_init(&attr);
 + attr.bp_addr = data & ~HW_BREAKPOINT_ALIGN;
 + switch (data & (DABR_DATA_WRITE | DABR_DATA_READ)) {
 + case DABR_DATA_READ:
 + attr.bp_type = HW_BREAKPOINT_R;
 + break;
 + case DABR_DATA_WRITE:
 + attr.bp_type = HW_BREAKPOINT_W;
 + break;
 + case (DABR_DATA_WRITE | DABR_DATA_READ):
 + attr.bp_type = HW_BREAKPOINT_R | HW_BREAKPOINT_W;
 + break;
 + }
 + thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
 + ptrace_triggered, task);
 + if (IS_ERR(bp)) {
 + thread->ptrace_bps[0] = NULL;
 + return PTR_ERR(bp);
 + }
 +
 +#endif /* CONFIG_PPC64 */



This looks fine for basic breakpoints. And this can probably be
improved to integrate ranges.

But I think we did something wrong with the generic breakpoint
interface. We are translating the arch values to generic
attributes. Then this all will be translated back to arch
values.

Having generic attributes is necessary for any perf event
use from userspace. But it looks like a waste for ptrace
that already gives us arch values. And the problem
is the same for x86.

So I think we should implement a register_ptrace_breakpoint()
that doesn't take perf_event_attr but specific arch informations,
so that we don't need to pass through a generic conversion, which:

- is wasteful
- won't be able to express 100% of any arch capabilities. We
  can certainly express most arch breakpoints features through
  the generic interface, but not all of them (given how tricky
  the data value comparison features can be)

I will rework that during the next cycle.

Re: [RFC:PATCH 00/03] powerpc: Expose BookE debug registers through extended ptrace interface

2010-01-30 Thread Frederic Weisbecker
On Sun, Jan 31, 2010 at 08:33:25AM +1100, Benjamin Herrenschmidt wrote:
 
  We have one field for addr, one for len and one for the memory access
  type.
  
  I think that those three are enough to express breakpoint ranges.
  Basically a breakpoint range is a breakpoint that can have a high
  len.
  
 I've looked at the G2 PowerPC core breakpoint implementation, just to
  face one of such tricky examples.
 
 BookE has a richer semantic. We have watchpoints with data value compare
 for example, we also have instruction value compare for breakpoints, and
 a few other niceties. There's also subtle differences between what
 processor variants support.
 
  .../...



Ah indeed, I missed the data value compare thing. And the way
it is implemented won't make things easy.

This is basically a comparison against chosen bytes of the data,
with OR/AND patterns.

Not sure what the OR can be useful for.

That won't be easy to implement in the generic interface, looking
at how it is done in the BookE.

There is also the address comparison by mask.
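
For the masked address compare, the semantic is roughly this (a sketch;
the helper is made up and the exact BookE polarity may differ):

static int dac_masked_match(unsigned long access_addr,
			    unsigned long dac_addr,
			    unsigned long mask)
{
	/* hit when the access agrees with the programmed address
	 * on the bits selected by the mask */
	return (access_addr & mask) == (dac_addr & mask);
}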

Anyway, I think we can add fields in the interface to provide
such features, but we can't support all of them given, as you
said, the subtle differences between different cpus.

For example, I think it can be useful to implement support
for data comparison, by mask for example. But I don't imagine
a useful usecase to compare byte 4 and byte 1 and trigger an event
if one OR the other matches.

I think we are going to implement what has obvious usecases
(parts of such data comparisons, parts of address mask
comparison) in the generic interface: the fields in perf_attr
that can be filled by perf in userspace.
And the rest can be implemented from the hw_perf_event structure
which contains the arch structure and can then be filled by ptrace at
will.


 
   I'd rather have this more dedicated and more complete interface merged
   for gdb's sake, and in a second step look at unifying.
  
  
  Perhaps. Supporting ptrace breakpoints should be an easy first
  step as it's basically the responsibility of the user to fill
  the registers, but it's a pretty limited scope work, especially you
  won't have perf support.
 
 But we can add it later. 



Yeah, you're right. Having raw ptrace support is a useful first
step that won't be a barrier to enhancing it further
through the generic API.


   I believe that the generic breakpoint infrastructure should not be the
   mid-layer. IE. It cannot be made in any clean shape of form, to express
   all of the subtle features that a given architecture or platform can
   support and as such would always be inferior to a dedicated one.
  
  
  Actually I think the current interface already does, as I explained
  above.
  
  What is broken for any other architectures than x86 is the set
  of constraints, but I can easily move it to the arch, unless
  I find a good generic solution (or a cool combination between
  both).
 
 Again, this is all can do vs. already done and working. May I point
 you to Linus recent rant against magic infrastructures that try to do
 everything and do nothing right ? :-) I much prefer starting with
 something dedicated that does exactly what is expected, have that
 upstream (and in that regard the patches are good for the next merge
 window) and -then- maybe look at how some of it could be re-used for
 perf.


Sure I'm not against a first raw ptrace support. As I said,
this is not a barrier for what comes next.

Now for the rest, I don't think this is the same story as
utrace.

Trying to have a generic layer for a hardware feature implemented
differently across archs can't be called something that tries to do
everything. Otherwise you can oppose these arguments to everything
that is not in the arch/ directories. No generic irq layer, no
generic timer, etc...

We need this generic layer because we want the breakpoints
to be available for wider uses than ptrace. If there was only
ptrace, I would really agree with you that it's not worth the
generic layer.

It's just that breakpoints are a set of possible features, but each
archs implement its own subset among the possible features. x86
is one of the weakest there, and since this generic layer has been
first x86 oriented, it looks too weak to host the most interesting
possibilities.

Let it grow a bit, it's still young.


  Not at all. It's an attempt to make a generic interface that can
  exploit at best _each_ arch specific features. 
 
 That reminds me of the justifications for utrace :-) It might well be
 but I very much doubt that is possible. In any case, it doesn't appear
 to be there yet.


You are too pessimistic ;-)

I don't think we can express every possibility through the generic
interface. But we can express the most interesting ones for profiling
uses.

The rest (ptrace) can still be expressed through the arch part of the perf
events.


 So let's just get that stuff in so we have our
 interface finally working, and we can look at doing fancy things with
 perf in a 

Re: [PATCH] macintosh: Explicitly set llseek to no_llseek in ans-lcd

2009-10-21 Thread Frederic Weisbecker
On Wed, Oct 21, 2009 at 11:07:18PM +0200, John Kacur wrote:
 From 0c2b412cdccf73bdeb19bb866bfe556942eaeca2 Mon Sep 17 00:00:00 2001
 From: John Kacur jka...@redhat.com
 Date: Wed, 21 Oct 2009 23:01:12 +0200
 Subject: [PATCH] macintosh: Explicitly set llseek to no_llseek in ans-lcd
 
 Now that we've removed the BKL here, let's explicitly set lleek to no_llseek
 
 Signed-off-by: John Kacur jka...@redhat.com
 ---
  drivers/macintosh/ans-lcd.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/macintosh/ans-lcd.c b/drivers/macintosh/ans-lcd.c
 index 4ae8ec9..a1a1bde 100644
 --- a/drivers/macintosh/ans-lcd.c
 +++ b/drivers/macintosh/ans-lcd.c
 @@ -137,6 +137,7 @@ const struct file_operations anslcd_fops = {
   .write  = anslcd_write,
   .unlocked_ioctl = anslcd_ioctl,
   .open   = anslcd_open,
 + .llseedk= no_llseek,


llseedk? :)


Should we better push down default_llseek to every
file_operations that doesn't implement llseek?
I don't know how many of them don't implement llseek(), though.

That said, we can't continue with this default attribution
of default_llseek() on new fops.



Re: [PATCH] macintosh: Explicitly set llseek to no_llseek in ans-lcd

2009-10-21 Thread Frederic Weisbecker
On Wed, Oct 21, 2009 at 11:33:17PM +0200, John Kacur wrote:
  Should we better push down default_llseek to every
  file_operations that doesn't implement llseek?
  I don't know how many of them don't implement llseek(), though.
  
  That said, we can't continue with this default attribution
  of default_llseek() on new fops.
  
 
 If you don't explicitly set it to no_llseek, you automatically get the
 default_llseek, which uses the BKL. So if your driver doesn't need it, it 
 is best to explicitly set it to no_llseek.


Sure, that's the right thing to do.

 
 There is also a generic_file_llseek_unlocked, somewhat analogous to the 
 unlocked_ioctls that you can use if you don't need to provide a full 
 llseek yourself.


No problem with that. Setting no_llseek or generic_file_llseek_unlocked,
depending on the context, is the right thing to do.

What I'm wondering about concerns future code that will have
no llseek() implemented in its fops.

We can't continue to use default_llseek() for future code unless we
want to continue these post reviews and fixes forever.



Re: [PATCH] macintosh: Explicitly set llseek to no_llseek in ans-lcd

2009-10-21 Thread Frederic Weisbecker
On Wed, Oct 21, 2009 at 11:53:21PM +0200, John Kacur wrote:
  No problem with that. Setting no_llseek or generic_file_llseek_unlocked,
  depending on the context, is the right thing to do.
  
  What I'm wondering about is future code that implements no llseek()
  in its fops.
  
  We can't keep falling back to default_llseek() for future code unless
  we want to keep doing these after-the-fact reviews and fixes forever.
  
 
 I'm thinking that the simpler approach would be to make default_llseek
 the unlocked one. Then you only have to audit the drivers that take the
 BKL - i.e. the ones we are auditing anyway - and explicitly set them to
 a BKL-locked llseek.
 
 There might be a hidden interaction, though, between the non-unlocked
 variety of ioctls and default_llseek.


I fear that won't work, because the BKL in default_llseek() does not
only synchronize with other uses of the BKL in a driver, it also
serializes lseek() against itself.

As an example, the file offset update is not atomic: f_pos is a
long long (loff_t), so updating its value is not atomic on 32-bit
architectures.
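
A made-up illustration (not actual kernel code) of the torn update:

/* Illustration only - hypothetical types, just to show the problem */
struct file_like {
	long long pos;		/* like loff_t f_pos */
};

static void set_pos(struct file_like *f, long long new_pos)
{
	/*
	 * On a 32-bit CPU this single statement is typically emitted as
	 * two 32-bit stores, one per half of the value.  A concurrent
	 * lseek()/read() running between the two stores observes a torn
	 * offset - which is what the BKL in default_llseek() currently
	 * papers over by serializing all callers.
	 */
	f->pos = new_pos;
}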



Re: [PATCH 0/7][RFC] function graph tracer port to PowerPC

2009-02-12 Thread Frederic Weisbecker
On Thu, Feb 12, 2009 at 11:31:44AM -0500, Steven Rostedt wrote:
 
 On Thu, 12 Feb 2009, Frederic Weisbecker wrote:
 
  On Wed, Feb 11, 2009 at 08:10:51PM -0500, Steven Rostedt wrote:
   
   The following set of patches are RFC and not for inclusion
   (unless everyone is fine with them as is).
   
   This is the port to PowerPC of the function graph tracer that was written
   by Frederic Weisbecker for the x86 architecture.  It is broken up
   into a series of logical steps.
   
   1) get generic code ready for other archs
   2) get PowerPC 64-bit working with just static function tracing
   3) get PowerPC 64-bit working with dynamic function tracing
   4) get PowerPC 32-bit working with just static function tracing
   5) get PowerPC 32-bit working with dynamic function tracing
   
   (with some clean ups in between)
   
  
  
  Thanks a lot Steven!
  I'm sad not to have a PowerPC to test it...
 
 BTW, can I count that as an Acked-by: for the first patch? Since the first
 patch does modify your code.
 
 -- Steve
 


Yes, of course. I knew most of it was architecture-independent, but I
delayed this TODO for future ports, and you've done it.

Thanks.

Just a micro detail: ftrace_push/pop_return_trace are part of the
core of the entry/return probe, something that could be used by
users other than the function graph tracer itself.

Perhaps it would be better to put them in kernel/trace/ftrace.c.
What do you think?
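
For context, a simplified sketch of the entry-side hook those helpers
serve, loosely following the x86 prepare_ftrace_return(); the push
helper's signature here is trimmed for the example and does not match
the in-tree one exactly:

#include <linux/errno.h>

/* Simplified prototypes - the real ones carry more state */
extern int  ftrace_push_return_trace(unsigned long ret, unsigned long func);
extern void return_to_handler(void);	/* asm trampoline tracing the exit */

void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr)
{
	unsigned long old = *parent;	/* the function's real return address */

	/* Save the real return address on the task's return stack... */
	if (ftrace_push_return_trace(old, self_addr) == -EBUSY)
		return;			/* stack full: leave the return untouched */

	/*
	 * ...then divert the return into the trampoline, which traces
	 * the exit, pops the saved entry and jumps back to 'old'.
	 */
	*parent = (unsigned long)&return_to_handler;
}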

Anyway, Acked-by: Frederic Weisbecker fweis...@gmail.com



Re: [PATCH 0/7][RFC] function graph tracer port to PowerPC

2009-02-11 Thread Frederic Weisbecker
On Wed, Feb 11, 2009 at 08:10:51PM -0500, Steven Rostedt wrote:
 
 The following set of patches are RFC and not for inclusion
 (unless everyone is fine with them as is).
 
 This is the port to PowerPC of the function graph tracer that was written
 by Frederic Weisbecker for the x86 architecture.  It is broken up
 into a series of logical steps.
 
 1) get generic code ready for other archs
 2) get PowerPC 64-bit working with just static function tracing
 3) get PowerPC 64-bit working with dynamic function tracing
 4) get PowerPC 32-bit working with just static function tracing
 5) get PowerPC 32-bit working with dynamic function tracing
 
 (with some clean ups in between)
 


Thanks a lot Steven!
I'm sad not to have a PowerPC to test it...

 
 The function graph tracer not only traces the start of a function
 (uses the function tracer part for that) but also uses the kretprobes
 trick to replace the return address with a hook to trace the exit
 of the function. These hooks are generic in that other tracers
 can also use them. But the function graph tracer itself is very
 powerful. Simply doing the following:
 
 # echo function_graph > /debug/tracing/current_tracer
 # cat /debug/tracing/trace
 
 # tracer: function_graph
 #
 # CPU  OVERHEAD/DURATION      FUNCTION CALLS
 # |     |        |             |   |   |   |
  ------------------------------------------
  0)        less-2228  =>  cat-2229
  ------------------------------------------
 
  0)   |.__do_fault() {
  0)   |  .filemap_fault() {
  0)   |.find_lock_page() {
  0)   |  .find_get_page() {
  0)   3.168 us|.__rcu_read_lock();
  0)   2.704 us|.__rcu_read_unlock();
  0) + 14.640 us   |  }
  0) + 20.112 us   |}
  0) + 26.464 us   |  }
  0)   2.912 us|  ._spin_lock();
  0)   2.656 us|  .page_add_file_rmap();
  0)   |  .update_mmu_cache() {
  0)   |.hash_preload() {
  0)   2.368 us|  .get_slice_psize();
  0)   2.752 us|  .hash_page_do_lazy_icache();
  0)   3.568 us|  .native_hpte_insert();
  0) + 19.680 us   |}
  0) + 24.960 us   |  }
  0)   2.336 us|  ._spin_unlock();
  0)   |  .unlock_page() {
  0)   2.688 us|.page_waitqueue();
  0)   2.608 us|.__wake_up_bit();
  0) + 12.912 us   |  }
  0) ! 351.776 us  |}
  0) ! 357.392 us  |  }
  0)   3.040 us|  .up_read();
  0)   |  .compat_sys_ioctl() {
  0)   3.024 us|.fget_light();
  0)   |.tty_compat_ioctl() {
  0)   2.704 us|  .tty_paranoia_check();
  0)   |  .tty_ldisc_ref_wait() {
  0)   |.tty_ldisc_try() {
  0)   2.880 us|  ._spin_lock_irqsave();
  0)   2.928 us|  ._spin_unlock_irqrestore();
  0) + 13.776 us   |}
  0) + 19.424 us   |  }
 
 [...]
 
 As you can see, it gives a nice call trace of the functions being called
 at run time, as well as the time each function took to execute.
 
 Adding dynamic tracing to the mix, we can trace a single function:
 
 # echo .do_fork > /debug/tracing/set_graph_function
 # echo function_graph > /debug/tracing/current_tracer
 # cat /debug/tracing/trace
 
 # tracer: function_graph
 #
 # CPU  OVERHEAD/DURATION      FUNCTION CALLS
 # |     |        |             |   |   |   |
  1)   |  .do_fork() {
  1)   |.copy_process() {
  1)   |  .prepare_to_copy() {
  1)   2.944 us|.flush_fp_to_thread();
  1)   2.800 us|.flush_altivec_to_thread();
  1)   2.608 us|.flush_vsx_to_thread();
  1) + 19.184 us   |  }
  1)   |  .kmem_cache_alloc() {
  1)   2.464 us|.slab_should_failslab();
  1)   8.304 us|  }
  1)   |  .alloc_thread_info() {
  1)   |.kmem_cache_alloc() {
  1)   2.512 us|  .slab_should_failslab();
  1)   8.224 us|}
  1) + 13.344 us   |  }
  1)   7.584 us|  .arch_dup_task_struct();
  1)   |  .copy_creds() {
  1)   2.736 us|.__mutex_init();
  1)   |.prepare_creds() {
  1)   |  .kmem_cache_alloc() {
  1)   2.640 us|.slab_should_failslab();
  1)   8.368 us|  }
 
 [...]
 
 
 Note, the '.' in the '.do_fork' is a PowerPC64 thing. PowerPC32 and
 other archs just need to do 'do_fork', without the dot.
 
 
 The following patches are in:
 
   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
 
 branch: rfc/ppc/ftrace
 
 
 Steven Rostedt (7):
   tracing/function-graph-tracer: make arch generic push pop functions
   powerpc64: port of the function graph tracer
   powerpc64, tracing: add function graph tracer with dynamic tracing

Re: [PATCH 0/7][RFC] function graph tracer port to PowerPC

2009-02-11 Thread Frederic Weisbecker
On Wed, Feb 11, 2009 at 09:16:57PM -0500, Steven Rostedt wrote:
 
 On Thu, 12 Feb 2009, Frederic Weisbecker wrote:
 
  On Wed, Feb 11, 2009 at 08:10:51PM -0500, Steven Rostedt wrote:
   
   The following set of patches are RFC and not for inclusion
   (unless everyone is fine with them as is).
   
   This is the port to PowerPC of the function graph tracer that was written
   by Frederic Weisbecker for the x86 architecture.  It is broken up
   into a series of logical steps.
   
   1) get generic code ready for other archs
   2) get PowerPC 64-bit working with just static function tracing
   3) get PowerPC 64-bit working with dynamic function tracing
   4) get PowerPC 32-bit working with just static function tracing
   5) get PowerPC 32-bit working with dynamic function tracing
   
   (with some clean ups in between)
   
  
  
  Thanks a lot Steven!
  I'm sad not to have a PowerPC to test it...
 
 If you had a PowerPC, I doubt I would have been the one to port it ;-)


Especially since you already implemented ftrace on PowerPC :)
I will acquire an ARM board soon to adapt the function graph tracer to ARM...
 
 -- Steve
 
