Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL

2015-02-12 Thread Frederic Weisbecker
On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
 If exception_enter happens when already in IN_KERNEL state, the
 code still calls context_tracking_exit, which ends up in
 rcu_eqs_exit_common, which explodes with a WARN_ON when it is
 called in a situation where dynticks are not enabled.

Fortunately context_tracking_exit() already has a current_state == IN_KERNEL
check so this shouldn't be a problem.

Meanwhile I'll still take the patch, it's better to handle that
from the caller.

Thanks.

 
 This can be avoided by having exception_enter only switch to
 IN_KERNEL state if the current state is not already IN_KERNEL.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 Reported-by: Luiz Capitulino lcapitul...@redhat.com
 ---
 Frederic, you will want this bonus patch, too :)
 
 Thanks to Luiz for finding this one. Whatever I was running did not
 trigger this issue...
 
  include/linux/context_tracking.h | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/context_tracking.h 
 b/include/linux/context_tracking.h
 index b65fd1420e53..9da230406e8c 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -37,7 +37,8 @@ static inline enum ctx_state exception_enter(void)
   return 0;
  
   prev_ctx = this_cpu_read(context_tracking.state);
 - context_tracking_exit(prev_ctx);
 + if (prev_ctx != IN_KERNEL)
 + context_tracking_exit(prev_ctx);
  
   return prev_ctx;
  }
 


Re: [PATCH -v5 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-12 Thread Frederic Weisbecker
On Tue, Feb 10, 2015 at 03:27:49PM -0500, r...@redhat.com wrote:
 When running a KVM guest on a system with NOHZ_FULL enabled, and the
 KVM guest running with idle=poll mode, we still get wakeups of the
 rcuos/N threads.
 
 This problem has already been solved for user space by telling the
 RCU subsystem that the CPU is in an extended quiescent state while
 running user space code.
 
 This patch series extends that code a little bit to make it usable
 to track KVM guest space, too.
 
 I tested the code by booting a KVM guest with idle=poll, on a system
 with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
 CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
 up several times, while the rcuos thread on the CPU running the bound
 and always running VCPU thread never got woken up once.
 
 Thanks to Christian Borntraeger, Paul McKenney, Paulo Bonzini,
 Frederic Weisbecker, and Will Deacon for reviewing and improving
 earlier versions of this patch series.
 

So the patchset looks good, thanks everyone. I'm applying the series
and will see if anything explodes.

I'll wait until the end of the merge window to push it to Ingo.

Thanks.


Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL

2015-02-12 Thread Frederic Weisbecker
On Thu, Feb 12, 2015 at 10:47:10AM -0500, Rik van Riel wrote:
 
 On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
  On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
  If exception_enter happens when already in IN_KERNEL state, the 
  code still calls context_tracking_exit, which ends up in 
  rcu_eqs_exit_common, which explodes with a WARN_ON when it is 
  called in a situation where dynticks are not enabled.
  
  Fortunately context_tracking_exit() already has a current_state ==
  IN_KERNEL check so this shouldn't be a problem.
 
 No, it had a hard-coded current_state == IN_USER check,
 which is very close, but ...
 
 ... I replaced that with a state argument, and forgot to
 ensure that it never gets called with state == IN_KERNEL.
 This patch fixes that.

Ah that's right! Well I'm going to merge this patch into 1/5 then, to
avoid breaking bisection.
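
To make that concrete, here is a rough sketch (simplified, not the actual
kernel code) of why the generalized exit path now needs the caller-side guard:
once context_tracking_exit() takes the expected previous state, the old
hard-coded IN_USER comparison becomes a comparison against that argument, so
calling it with IN_KERNEL while already in IN_KERNEL matches the check and
spuriously ends up in rcu_user_exit() / rcu_eqs_exit_common().

void context_tracking_exit(enum ctx_state state)
{
        unsigned long flags;

        if (in_interrupt())
                return;

        local_irq_save(flags);
        if (__this_cpu_read(context_tracking.state) == state) {
                /* Leave the RCU extended quiescent state... */
                rcu_user_exit();
                /* ...and record that we are back in the kernel. */
                __this_cpu_write(context_tracking.state, IN_KERNEL);
        }
        local_irq_restore(flags);
}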

Thanks.


Re: [PATCH 1/6] rcu,nohz: add context_tracking_user_enter/exit wrapper functions

2015-02-10 Thread Frederic Weisbecker
On Tue, Feb 10, 2015 at 09:41:45AM -0500, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com
 
 These wrapper functions allow architecture code (eg. ARM) to keep
 calling context_tracking_user_enter & context_tracking_user_exit
 the same way it always has, without error prone tricks like duplicate
 defines of argument values in assembly code.
 
 Signed-off-by: Rik van Riel r...@redhat.com

This patch alone doesn't make much sense. The changelog says it's about keeping
the context_tracking_user_*() functions as wrappers but fails to explain what
they wrap, why, and what the new context_tracking_enter/exit functions are for.

Perhaps patches 1 and 2 should be merged together into something like:

context_tracking: Generalize context tracking APIs to support user and guest

Do that because we'll also track guest, etc. And keep the old user context
tracking APIs for now to avoid painful enum parameter support in ARM assembly.
 ---
  include/linux/context_tracking.h |  2 ++
  kernel/context_tracking.c| 37 +
  2 files changed, 27 insertions(+), 12 deletions(-)
 
 diff --git a/include/linux/context_tracking.h 
 b/include/linux/context_tracking.h
 index 37b81bd51ec0..03b9c733eae7 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -10,6 +10,8 @@
  #ifdef CONFIG_CONTEXT_TRACKING
  extern void context_tracking_cpu_set(int cpu);
  
 +extern void context_tracking_enter(void);
 +extern void context_tracking_exit(void);
  extern void context_tracking_user_enter(void);
  extern void context_tracking_user_exit(void);
  extern void __context_tracking_task_switch(struct task_struct *prev,
 diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
 index 937ecdfdf258..bbdc423936e6 100644
 --- a/kernel/context_tracking.c
 +++ b/kernel/context_tracking.c
 @@ -39,15 +39,15 @@ void context_tracking_cpu_set(int cpu)
  }
  
  /**
 - * context_tracking_user_enter - Inform the context tracking that the CPU is going to
 - *   enter userspace mode.
 + * context_tracking_enter - Inform the context tracking that the CPU is going
 + *  enter user or guest space mode.
   *
   * This function must be called right before we switch from the kernel
 - * to userspace, when it's guaranteed the remaining kernel instructions
 - * to execute won't use any RCU read side critical section because this
 - * function sets RCU in extended quiescent state.
 + * to user or guest space, when it's guaranteed the remaining kernel
 + * instructions to execute won't use any RCU read side critical section
 + * because this function sets RCU in extended quiescent state.
   */
 -void context_tracking_user_enter(void)
 +void context_tracking_enter(void)
  {
   unsigned long flags;
  
 @@ -105,20 +105,27 @@ void context_tracking_user_enter(void)
   }
   local_irq_restore(flags);
  }
 +NOKPROBE_SYMBOL(context_tracking_enter);
 +
 +void context_tracking_user_enter(void)
 +{
 + context_tracking_enter();
 +}
  NOKPROBE_SYMBOL(context_tracking_user_enter);
  
  /**
 - * context_tracking_user_exit - Inform the context tracking that the CPU is
 - *  exiting userspace mode and entering the kernel.
 + * context_tracking_exit - Inform the context tracking that the CPU is
 + * exiting user or guest mode and entering the kernel.
   *
 - * This function must be called after we entered the kernel from userspace
 - * before any use of RCU read side critical section. This potentially include
 - * any high level kernel code like syscalls, exceptions, signal handling, etc...
 + * This function must be called after we entered the kernel from user or
 + * guest space before any use of RCU read side critical section. This
 + * potentially include any high level kernel code like syscalls, exceptions,
 + * signal handling, etc...
   *
  * This call supports re-entrancy. This way it can be called from any exception
   * handler without needing to know if we came from userspace or not.
   */
 -void context_tracking_user_exit(void)
 +void context_tracking_exit(void)

Comment changes are good, thanks!

  {
   unsigned long flags;
  
 @@ -143,6 +150,12 @@ void context_tracking_user_exit(void)
   }
   local_irq_restore(flags);
  }
 +NOKPROBE_SYMBOL(context_tracking_exit);
 +
 +void context_tracking_user_exit(void)
 +{
 + context_tracking_exit();
 +}
  NOKPROBE_SYMBOL(context_tracking_user_exit);
  
  /**
 -- 
 1.9.3
 


Re: [PATCH 1/6] rcu,nohz: add context_tracking_user_enter/exit wrapper functions

2015-02-10 Thread Frederic Weisbecker
On Tue, Feb 10, 2015 at 09:25:26AM -0800, Paul E. McKenney wrote:
 On Tue, Feb 10, 2015 at 11:48:37AM -0500, Rik van Riel wrote:
  On 02/10/2015 10:28 AM, Frederic Weisbecker wrote:
   On Tue, Feb 10, 2015 at 09:41:45AM -0500, r...@redhat.com wrote:
   From: Rik van Riel r...@redhat.com
  
   These wrapper functions allow architecture code (eg. ARM) to keep
   calling context_tracking_user_enter & context_tracking_user_exit
   the same way it always has, without error prone tricks like duplicate
   defines of argument values in assembly code.
  
   Signed-off-by: Rik van Riel r...@redhat.com
   
   This patch alone doesn't make much sense. 
  
  Agreed, my goal was to keep things super easy to review,
  to reduce the chance of introducing bugs.
  
   The changelog says it's about keeping
   the context_tracking_user_*() functions as wrappers but fails to explain
   what they wrap, why, and what the new context_tracking_enter/exit functions are for.
   
   Perhaps patches 1 and 2 should be merged together into something like:
   
  context_tracking: Generalize context tracking APIs to support user and guest
   
   Do that because we'll also track guest, etc. And keep the old user context
   tracking APIs for now to avoid painful enum parameter support in ARM assembly.
  
  Can do...
  
  Paul, would you like me to resend the whole series, or just
  a merged patch that replaces patches 1 & 2?
 
 I prefer the whole series, as it reduces my opportunity to introduce
 human error when applying them.  ;-)
 
  That is assuming Paul prefers having the patches merged into
  one :)
 
 I am OK with that.  I will merge the next set with the first two patches
 merged, people have had sufficient opportunity to review.

BTW, I have a few patches to make on the next cycle to fix a few context
tracking related things. And since it's too late to push this series for the
current merge window, I now wonder if it may be easier if I take these patches.
Otherwise you might experience unpleasant rebase conflicts. Is that ok for you?

Thanks.


Re: [PATCH -v3 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-09 Thread Frederic Weisbecker
On Mon, Feb 09, 2015 at 08:22:59PM -0500, Rik van Riel wrote:
 
 On 02/09/2015 08:15 PM, Will Deacon wrote:
  Hi Rik,
  
  On Mon, Feb 09, 2015 at 04:04:38PM +, r...@redhat.com wrote:
  Apologies to Catalin and Will for not fixing up ARM. I am not 
  familiar with ARM assembly, and not sure how to pass a constant 
  argument to a function from assembly code on ARM :)
  
  It's a bit of a faff getting enum values into asm -- we actually
  have to duplicate the definitions using #defines to get at the
  constants. Perhaps it would be cleaner to leave
  context_tracking_user_{enter,exit} intact as C wrappers around
  context_tracking_{enter,exit} passing the appropriate constant?
  That way we don't actually need to change the arch code at all.
 
 If Paul and Frederic have no objections, I would be happy to do that.
 
 Paul, Frederic?

Sure, that's fine by me.

Thanks.
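
For illustration, a minimal sketch of the wrapper approach Will describes
(function and constant names assumed from the rest of this thread): the enum
constant is supplied from C, so ARM assembly keeps calling the old
user-specific entry points unchanged.

void context_tracking_user_enter(void)
{
        context_tracking_enter(IN_USER);
}

void context_tracking_user_exit(void)
{
        context_tracking_exit(IN_USER);
}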


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-07 Thread Frederic Weisbecker
On Fri, Feb 06, 2015 at 11:14:53PM -0800, Paul E. McKenney wrote:
 On Fri, Feb 06, 2015 at 10:34:21PM -0800, Paul E. McKenney wrote:
  On Fri, Feb 06, 2015 at 10:53:34PM -0500, Rik van Riel wrote:
   
   On 02/06/2015 06:15 PM, Frederic Weisbecker wrote:
   
Just a few things then:

1) In this case rename context_tracking_user_enter/exit() to 
context_tracking_enter() and context_tracking_exit(), since it's
not anymore about user only but about any generic context.

2) We have the WARN_ON_ONCE(!current->mm); condition that is a
debug check specific to userspace transitions because kernel
threads aren't expected to resume to userspace. Can we also expect
that we never switch to/from guest from a kernel thread? AFAICS
this happens from an ioctl (thus user task) in x86 for kvm. But I
only know this case.

3) You might want to update a few comments that assume we only deal
with userspace transitions.

4) trace_user_enter/exit() should stay user-transitions specific.
   
   Paul, would you like me to send follow-up patches with the cleanups
   suggested by Frederic, or would you prefer me to send a new series
   with the cleanups integrated?
  
  I would prefer a new series, in order to prevent possible future
  confusion.
 
 Of course, if Frederic would rather push them himself, I am fine with
 that.  And in that case, you should ask him for his preferences, which
 just might differ from mine.  ;-)

I prefer a new series too. Now whether you or I take the patches, I don't mind
either way :-)

Also I wonder how this feature is going to be enabled. Will it be enabled on
full dynticks, or should it be a separate feature depending on full dynticks?
Or even just CONFIG_RCU_USER_EQS? Because I'm still unclear about how and for
what this is used, and whether it involves full dynticks or only RCU extended
quiescent states.

Thanks.


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-06 Thread Frederic Weisbecker
On Fri, Feb 06, 2015 at 01:51:56PM -0500, Rik van Riel wrote:
 
 On 02/06/2015 01:23 PM, Frederic Weisbecker wrote:
  On Fri, Feb 06, 2015 at 01:20:21PM -0500, Rik van Riel wrote: On
  02/06/2015 12:22 PM, Frederic Weisbecker wrote:
  On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com
  wrote:
  From: Rik van Riel r...@redhat.com
  
  Add the expected ctx_state as a parameter to 
  context_tracking_user_enter and
  context_tracking_user_exit, allowing the same functions to
  not just track kernel <-> user space switching, but also
  kernel <-> guest transitions.
  
  Signed-off-by: Rik van Riel r...@redhat.com
  
  You should consider using guest_enter() and guest_exit()
  instead. These are context tracking APIs too but specifically
  for guest.
  
  What do you mean instead?  KVM already uses those.
  
  I just wanted to avoid duplicating the code...
  
  I mean you can call rcu_user APIs directly from
  guest_enter/exit. You don't really need to call the
  context_tracking_user functions since guest_enter/guest_exit
  already handle the vtime accounting.
 
 I would still have to modify exception_enter and exception_exit,
 and with them context_tracking_user_enter and
 context_tracking_user_exit.
 
 We have to re-enable RCU when an exception happens.
 
 I suspect exceptions in a guest just trigger VMEXIT, and we
 figure later why the exception happened. However, if we were
 to get an exception during the code where we transition into
 or out of guest mode, we would still need exception_enter
 and exception_exit...

Ah that's a fair point. I didn't think about that. Ok then a real
IN_GUEST mode makes sense. And context_tracking_user_enter/exit() can
be reused as is indeed.

Just a few things then:

1) In this case rename context_tracking_user_enter/exit() to
context_tracking_enter() and context_tracking_exit(), since it's not
anymore about user only but about any generic context.

2) We have the WARN_ON_ONCE(!current->mm); condition that is a debug
check specific to userspace transitions because kernel threads aren't
expected to resume to userspace. Can we also expect that we never switch
to/from guest from a kernel thread? AFAICS this happens from an ioctl (thus
user task) in x86 for kvm. But I only know this case.

3) You might want to update a few comments that assume we only deal with
userspace transitions.

4) trace_user_enter/exit() should stay user-transitions specific.
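
Putting points 1) to 4) together, a rough sketch of what the generalized entry
path could look like (helper names taken from the existing user-enter code,
IN_USER/IN_GUEST from this series; this is not the posted patch):

void context_tracking_enter(enum ctx_state state)
{
        unsigned long flags;

        if (in_interrupt())
                return;

        /* Kernel threads are never expected to resume to userspace. */
        if (state == IN_USER)
                WARN_ON_ONCE(!current->mm);

        local_irq_save(flags);
        if (__this_cpu_read(context_tracking.state) != state) {
                if (__this_cpu_read(context_tracking.active)) {
                        /* Tracepoint and user vtime stay user-specific (point 4). */
                        if (state == IN_USER) {
                                trace_user_enter(0);
                                vtime_user_enter(current);
                        }
                        rcu_user_enter();
                }
                __this_cpu_write(context_tracking.state, state);
        }
        local_irq_restore(flags);
}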

Thanks.


Re: [PATCH 4/5] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-06 Thread Frederic Weisbecker
On Thu, Feb 05, 2015 at 03:23:51PM -0500, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com
 
 The host kernel is not doing anything while the CPU is executing
 a KVM guest VCPU, so it can be marked as being in an extended
 quiescent state, identical to that used when running user space
 code.
 
 The only exception to that rule is when the host handles an
 interrupt, which is already handled by the irq code, which
 calls rcu_irq_enter and rcu_irq_exit.
 
 The guest_enter and guest_exit functions already switch vtime
 accounting independent of context tracking, so leave those calls
 where they are, instead of moving them into the context tracking
 code.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  include/linux/context_tracking.h   | 8 +++-
  include/linux/context_tracking_state.h | 1 +
  include/linux/kvm_host.h   | 3 ++-
  3 files changed, 10 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/context_tracking.h 
 b/include/linux/context_tracking.h
 index bd9f000fc98d..a5d3bb44b897 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
  static inline void exception_exit(enum ctx_state prev_ctx)
  {
   if (context_tracking_is_enabled()) {
 - if (prev_ctx == IN_USER)
 + if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)

That's nitpicking, but != IN_KERNEL would be more generic. We are exiting an
exception and we know that the exception executes IN_KERNEL, so we want to
restore whatever context (IN_USER, IN_GUEST, or anything added in the future)
preceded the exception, if it was anything other than IN_KERNEL.
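
A sketch of the more generic check being suggested (not the posted patch):

static inline void exception_exit(enum ctx_state prev_ctx)
{
        if (context_tracking_is_enabled()) {
                /* Restore whatever non-kernel context the exception interrupted. */
                if (prev_ctx != IN_KERNEL)
                        context_tracking_user_enter(prev_ctx);
        }
}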

   context_tracking_user_enter(prev_ctx);
   }
  }
 @@ -78,6 +78,9 @@ static inline void guest_enter(void)
   vtime_guest_enter(current);
   else
   current->flags |= PF_VCPU;
 +
 + if (context_tracking_is_enabled())
 + context_tracking_user_enter(IN_GUEST);
  }
  
  static inline void guest_exit(void)
 @@ -86,6 +89,9 @@ static inline void guest_exit(void)
   vtime_guest_exit(current);
   else
   current->flags &= ~PF_VCPU;
 +
 + if (context_tracking_is_enabled())
 + context_tracking_user_exit(IN_GUEST);

I suggest you restore RCU before anything else. I believe cputime
accounting doesn't use RCU, but you never know with all the debug/tracing
code behind it, the acct accounting...
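
A sketch of the ordering being suggested, keeping the surrounding code as in
the patch (assuming vtime_accounting_enabled() as the condition elided from the
hunk above): leave the RCU extended quiescent state before touching accounting.

static inline void guest_exit(void)
{
        if (context_tracking_is_enabled())
                context_tracking_user_exit(IN_GUEST);

        if (vtime_accounting_enabled())
                vtime_guest_exit(current);
        else
                current->flags &= ~PF_VCPU;
}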

Thanks.

  }
  
  #else
 diff --git a/include/linux/context_tracking_state.h 
 b/include/linux/context_tracking_state.h
 index 97a81225d037..f3ef027af749 100644
 --- a/include/linux/context_tracking_state.h
 +++ b/include/linux/context_tracking_state.h
 @@ -15,6 +15,7 @@ struct context_tracking {
   enum ctx_state {
   IN_KERNEL = 0,
   IN_USER,
 + IN_GUEST,
   } state;
  };
  
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 26f106022c88..c7828a6a9614 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
* one time slice). Lets treat guest mode as quiescent state, just like
* we do with user-mode execution.
*/
 - rcu_virt_note_context_switch(smp_processor_id());
 + if (!context_tracking_cpu_is_enabled())
 + rcu_virt_note_context_switch(smp_processor_id());
  }
  
  static inline void kvm_guest_exit(void)
 -- 
 1.9.3
 


Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-06 Thread Frederic Weisbecker
On Thu, Feb 05, 2015 at 03:23:47PM -0500, r...@redhat.com wrote:
 When running a KVM guest on a system with NOHZ_FULL enabled

I just need to clarify the motivation first, does the above situation
really happen? Ok some distros enable NOHZ_FULL to let the user stop
the tick in userspace. So most of the time, CONFIG_NOHZ_FULL=y but
nohz full is runtime disabled (we need to pass a nohz_full= boot
parameter to enable it). And when it is runtime disabled, there should
be no rcu nocb CPU.

(Although not setting CPUs in nocb mode when nohz full is runtime disabled
is perhaps a recent change.)

So for the problem to arise, one need to enable nohz_full and run KVM
guest. And I never heard about such workloads. That said it's potentially
interesting to turn off the tick on the host when the guest runs.

, and the
 KVM guest running with idle=poll mode, we still get wakeups of the
 rcuos/N threads.

So we need nohz_full on the host and idle=poll mode on the guest. Is it
likely to happen? (sorry, again I'm just trying to make sure we agree on
why we do this change).

 
 This problem has already been solved for user space by telling the
 RCU subsystem that the CPU is in an extended quiescent state while
 running user space code.
 
 This patch series extends that code a little bit to make it usable
 to track KVM guest space, too.
 
 I tested the code by booting a KVM guest with idle=poll, on a system
 with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
 CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
 up several times, while the rcuos thread on the CPU running the bound
  and always running VCPU thread never got woken up once.

So what you're describing is to set RCU in extended quiescent state, right?
This doesn't include stopping the tick while running in guest mode? Those
are indeed two different things, although stopping the tick most often
requires setting RCU in extended quiescent state.

 
 Thanks to Christian Borntraeger and Paul McKenney for reviewing the
 first version of this patch series, and helping optimize patch 4/5.
 


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-06 Thread Frederic Weisbecker
On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com
 
 Add the expected ctx_state as a parameter to context_tracking_user_enter
 and context_tracking_user_exit, allowing the same functions to not just
  track kernel <-> user space switching, but also kernel <-> guest transitions.
 
 Signed-off-by: Rik van Riel r...@redhat.com

You should consider using guest_enter() and guest_exit() instead. These are
context tracking APIs too but specifically for guest.

These can be uninlined if needed.


Re: [PATCH 4/5] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-06 Thread Frederic Weisbecker
On Thu, Feb 05, 2015 at 03:23:51PM -0500, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com
 
 The host kernel is not doing anything while the CPU is executing
 a KVM guest VCPU, so it can be marked as being in an extended
 quiescent state, identical to that used when running user space
 code.
 
 The only exception to that rule is when the host handles an
 interrupt, which is already handled by the irq code, which
 calls rcu_irq_enter and rcu_irq_exit.
 
 The guest_enter and guest_exit functions already switch vtime
 accounting independent of context tracking, so leave those calls
 where they are, instead of moving them into the context tracking
 code.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  include/linux/context_tracking.h   | 8 +++-
  include/linux/context_tracking_state.h | 1 +
  include/linux/kvm_host.h   | 3 ++-
  3 files changed, 10 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/context_tracking.h 
 b/include/linux/context_tracking.h
 index bd9f000fc98d..a5d3bb44b897 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
  static inline void exception_exit(enum ctx_state prev_ctx)
  {
   if (context_tracking_is_enabled()) {
 - if (prev_ctx == IN_USER)
 + if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
   context_tracking_user_enter(prev_ctx);
   }
  }
 @@ -78,6 +78,9 @@ static inline void guest_enter(void)
   vtime_guest_enter(current);
   else
   current->flags |= PF_VCPU;
 +
 + if (context_tracking_is_enabled())
 + context_tracking_user_enter(IN_GUEST);

So you should probably just call rcu_user_enter() directly from
there. context_tracking_user_enter() is really about userspace
boundaries.
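
A sketch of that alternative (hypothetical, assuming vtime_accounting_enabled()
as the condition elided from the hunk above): guest_enter() enters the RCU
extended quiescent state itself, keeping context_tracking_user_enter() for real
userspace boundaries.

static inline void guest_enter(void)
{
        if (vtime_accounting_enabled())
                vtime_guest_enter(current);
        else
                current->flags |= PF_VCPU;

        /* Enter the RCU extended quiescent state directly. */
        rcu_user_enter();
}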

  }
  
  static inline void guest_exit(void)
 @@ -86,6 +89,9 @@ static inline void guest_exit(void)
   vtime_guest_exit(current);
   else
   current->flags &= ~PF_VCPU;
 +
 + if (context_tracking_is_enabled())
 + context_tracking_user_exit(IN_GUEST);
  }
  
  #else
 diff --git a/include/linux/context_tracking_state.h 
 b/include/linux/context_tracking_state.h
 index 97a81225d037..f3ef027af749 100644
 --- a/include/linux/context_tracking_state.h
 +++ b/include/linux/context_tracking_state.h
 @@ -15,6 +15,7 @@ struct context_tracking {
   enum ctx_state {
   IN_KERNEL = 0,
   IN_USER,
 + IN_GUEST,
   } state;
  };
  
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 26f106022c88..c7828a6a9614 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
* one time slice). Lets treat guest mode as quiescent state, just like
* we do with user-mode execution.
*/
 - rcu_virt_note_context_switch(smp_processor_id());
 + if (!context_tracking_cpu_is_enabled())
 + rcu_virt_note_context_switch(smp_processor_id());

Should we have a specific CONFIG for this feature? Or is relying on full
dynticks being enabled (and thus context tracking enabled) enough?

Thanks.

  }
  
  static inline void kvm_guest_exit(void)
 -- 
 1.9.3
 


Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-06 Thread Frederic Weisbecker
On Fri, Feb 06, 2015 at 09:56:43AM -0500, Rik van Riel wrote:
 
 On 02/06/2015 08:46 AM, Frederic Weisbecker wrote:
  On Thu, Feb 05, 2015 at 03:23:47PM -0500, r...@redhat.com wrote:
  When running a KVM guest on a system with NOHZ_FULL enabled
  
  I just need to clarify the motivation first, does the above
  situation really happen? Ok some distros enable NOHZ_FULL to let
  the user stop the tick in userspace. So most of the time,
  CONFIG_NOHZ_FULL=y but nohz full is runtime disabled (we need to
  pass a nohz_full= boot parameter to enable it). And when it is
  runtime disabled, there should be no rcu nocb CPU.
  
  (Although not setting CPUs in nocb mode when nohz full is runtime
  disabled is perhaps a recent change.)
  
  So for the problem to arise, one need to enable nohz_full and run
  KVM guest. And I never heard about such workloads. That said it's
  potentially interesting to turn off the tick on the host when the
  guest runs.
  
  , and the KVM guest running with idle=poll mode, we still get
  wakeups of the rcuos/N threads.
  
  So we need nohz_full on the host and idle=poll mode on the guest.
  Is it likely to happen? (sorry, again I'm just trying to make sure
  we agree on why we do this change).
 
 We have users interested in doing just that, in order to run
 KVM guests with the least amount of perturbation to the guest.

So what are you interested in exactly? Only RCU extended quiescent state
or also full dynticks? Because your cover letter only points to disturbing
RCU nocb and quiescent states.

Also I'm curious why you run guests in idle=poll. Maybe that avoids host/guest
context switches?

Thanks.


Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-06 Thread Frederic Weisbecker
On Fri, Feb 06, 2015 at 02:50:44PM +0100, Paolo Bonzini wrote:
 
 
 On 06/02/2015 14:46, Frederic Weisbecker wrote:
   When running a KVM guest on a system with NOHZ_FULL enabled
  
  I just need to clarify the motivation first, does the above situation
  really happen? Ok some distros enable NOHZ_FULL to let the user stop
  the tick in userspace. So most of the time, CONFIG_NOHZ_FULL=y but
  nohz full is runtime disabled (we need to pass a nohz_full= boot
  parameter to enable it). And when it is runtime disabled, there should
  be no rcu nocb CPU.
  
  (Although not setting CPUs in nocb mode when nohz full is runtime disabled
  is perhaps a recent change.)
  
  So for the problem to arise, one need to enable nohz_full and run KVM
  guest. And I never heard about such workloads.
 
 Yeah, it's a new thing but Marcelo, Luiz and Rik have been having a lot
 of fun with them (with PREEMPT_RT too).  They're getting pretty good
 results given the right tuning.

Ok, but I'm still not sure about the details of what you're trying to do,
whether it's only about RCU or it also involves the tick. What kind of tuning
are you doing, and what kind of performance gain do you get?

Thanks.

 
 I'll let Paul queue the patches for 3.21 then!
 
 Paolo
 
  That said it's potentially
  interesting to turn off the tick on the host when the guest runs.


Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-06 Thread Frederic Weisbecker
On Fri, Feb 06, 2015 at 04:00:27PM +0100, Christian Borntraeger wrote:
 Am 05.02.2015 um 21:23 schrieb r...@redhat.com:
  When running a KVM guest on a system with NOHZ_FULL enabled, and the
  KVM guest running with idle=poll mode, we still get wakeups of the
  rcuos/N threads.
  
  This problem has already been solved for user space by telling the
  RCU subsystem that the CPU is in an extended quiescent state while
  running user space code.
  
  This patch series extends that code a little bit to make it usable
  to track KVM guest space, too.
  
  I tested the code by booting a KVM guest with idle=poll, on a system
  with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
  CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
  up several times, while the rcuos thread on the CPU running the bound
   and always running VCPU thread never got woken up once.
  
  Thanks to Christian Borntraeger and Paul McKenney for reviewing the
  first version of this patch series, and helping optimize patch 4/5.
 
 I gave it a quick run on s390/kvm and everything still seems to be
 running fine. Also, I like the idea of this patch set.
 
 We have seen several cases where the fact that we are in guest context
 for a full tick for cpu bound guests (10ms on s390) caused significant
 latencies for host synchronize-rcu heavy workloads - e.g. getting rid
 of macvtap devices on guest shutdown, adding hundreds of irq routes
 for many guest devices
 
 s390 has no context tracking infrastructure yet (no nohz_full), but
 with this series it looks like the current case (nohz_idle) still works.
 With this in place, having nohz=full on s390 now makes even more
 sense, as KVM hosts with cpu bound guests should get much quicker
 rcu response times when most host CPUs are in an extended quiescent
 state.

Sure, if you need any help with context tracking, don't hesitate to ask;
it can be a bit tricky to implement sometimes. Perhaps x86 isn't the
best example because it does quite a few weird dances to minimize fast path
overhead. ARM is perhaps clearer.
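
As a very rough illustration of what wiring up context tracking in an arch
involves (hypothetical glue, not s390 or ARM code; only user_enter() and
user_exit() are the real generic API): the arch calls user_exit() when entering
the kernel from userspace and user_enter() right before returning, so context
tracking and RCU see every boundary crossing.

#include <linux/context_tracking.h>

static void arch_enter_from_user(void)          /* made-up hook name */
{
        user_exit();    /* kernel entry: leave the RCU extended quiescent state */
}

static void arch_return_to_user(void)           /* made-up hook name */
{
        user_enter();   /* back to userspace: re-enter the extended quiescent state */
}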

 
 Christian
 


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-06 Thread Frederic Weisbecker
On Fri, Feb 06, 2015 at 01:20:21PM -0500, Rik van Riel wrote:
 
 On 02/06/2015 12:22 PM, Frederic Weisbecker wrote:
  On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
  From: Rik van Riel r...@redhat.com
  
  Add the expected ctx_state as a parameter to
  context_tracking_user_enter and context_tracking_user_exit,
  allowing the same functions to not just track kernel <-> user
  space switching, but also kernel <-> guest transitions.
  
  Signed-off-by: Rik van Riel r...@redhat.com
  
  You should consider using guest_enter() and guest_exit() instead.
  These are context tracking APIs too but specifically for guest.
 
 What do you mean instead?  KVM already uses those.
 
 I just wanted to avoid duplicating the code...

I mean you can call rcu_user APIs directly from guest_enter/exit.
You don't really need to call the context_tracking_user functions
since guest_enter/guest_exit already handle the vtime accounting.

 
 - -- 
 All rights reversed


Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-05-17 Thread Frederic Weisbecker
On Fri, May 17, 2013 at 07:09:42AM -0700, Kevin Hilman wrote:
 Frederic Weisbecker fweis...@gmail.com writes:
 
  On Thu, May 16, 2013 at 12:52:03AM +0200, Frederic Weisbecker wrote:
  On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote:
   Gleb Natapov g...@redhat.com writes:
   
On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote:
2013/3/21 Gleb Natapov g...@redhat.com:
 Isn't it simpler for kernel/context_tracking.c to define empty
 __guest_enter()/__guest_exit() if !CONFIG_KVM.

That doesn't look right. Off-cases are usually handled from the
headers, right? So that we avoid ifdeffery ugliness in core code.
Let's put it in linux/context_tracking.h header then.
   
   Here's a version to do that.
   
   Kevin
   
   From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001
   From: Kevin Hilman khil...@linaro.org
   Date: Mon, 25 Mar 2013 14:12:41 -0700
    Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest
     enter/exit
  
  Sorry for my very delayed response...
  
   
   When KVM is not enabled, or not available on a platform, the KVM
   headers should not be included.  Instead, just define stub
   __guest_[enter|exit] functions.
  
  Maybe it would be cleaner to move guest_enter/exit definitions altogether
  in linux/context_tracking.h
  
  After all that's where the implementation mostly belongs.
  
  Let me see if I can get that in shape.
 
  Does the following work for you?
 
 Nope. 
 
 Since it still includes kvm_host.h on non-KVM builds, there is potential
 for problems.  For example, ARM (v3.10-rc1 + this patch) has this
 build error:
 
   CC  kernel/context_tracking.o
 In file included from 
 /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_host.h:41:0,
  from /work/kernel/linaro/nohz/include/linux/kvm_host.h:34,
  from /work/kernel/linaro/nohz/kernel/context_tracking.c:18:
 /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_vgic.h:38:6: warning: 
 CONFIG_KVM_ARM_MAX_VCPUS is not defined [-Wundef]
 In file included from 
 /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_host.h:41:0,
  from /work/kernel/linaro/nohz/include/linux/kvm_host.h:34,
  from /work/kernel/linaro/nohz/kernel/context_tracking.c:18:
 /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_vgic.h:59:11: error: 
 'CONFIG_KVM_ARM_MAX_VCPUS' undeclared here (not in a function)

Sorry I forgot to remove the include to kvm_host.h in context_tracking.c,
here's the fixed patch:

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 365f4a6..fc09d7b 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/percpu.h>
+#include <linux/vtime.h>
 #include <asm/ptrace.h>
 
 struct context_tracking {
@@ -19,6 +20,26 @@ struct context_tracking {
} state;
 };
 
+static inline void __guest_enter(void)
+{
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system(current);
+   current->flags |= PF_VCPU;
+}
+
+static inline void __guest_exit(void)
+{
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system(current);
+   current->flags &= ~PF_VCPU;
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
 
@@ -35,6 +56,9 @@ static inline bool context_tracking_active(void)
 extern void user_enter(void);
 extern void user_exit(void);
 
+extern void guest_enter(void);
+extern void guest_exit(void);
+
 static inline enum ctx_state exception_enter(void)
 {
enum ctx_state prev_ctx;
@@ -57,6 +81,17 @@ extern void context_tracking_task_switch(struct task_struct *prev,
 static inline bool context_tracking_in_user(void) { return false; }
 static inline void user_enter(void) { }
 static inline void user_exit(void) { }
+
+static inline void guest_enter(void)
+{
+   __guest_enter();
+}
+
+static inline void guest_exit(void)
+{
+   __guest_exit();
+}
+
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline void context_tracking_task_switch(struct task_struct *prev,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f0eea07..8db53cf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -23,6 +23,7 @@
 #include <linux/ratelimit.h>
 #include <linux/err.h>
 #include <linux/irqflags.h>
+#include <linux/context_tracking.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -760,42 +761,6 @@ static inline int kvm_iommu_unmap_guest(struct kvm *kvm)
 }
 #endif
 
-static inline void __guest_enter(void)
-{
-   /*
-* This is running in ioctl

Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-05-16 Thread Frederic Weisbecker
On Thu, May 16, 2013 at 12:52:03AM +0200, Frederic Weisbecker wrote:
 On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote:
  Gleb Natapov g...@redhat.com writes:
  
   On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote:
   2013/3/21 Gleb Natapov g...@redhat.com:
Isn't it simpler for kernel/context_tracking.c to define empty
__guest_enter()/__guest_exit() if !CONFIG_KVM.
   
   That doesn't look right. Off-cases are usually handled from the
   headers, right? So that we avoid ifdeffery ugliness in core code.
   Let's put it in linux/context_tracking.h header then.
  
  Here's a version to do that.
  
  Kevin
  
  From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001
  From: Kevin Hilman khil...@linaro.org
  Date: Mon, 25 Mar 2013 14:12:41 -0700
  Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest
   enter/exit
 
 Sorry for my very delayed response...
 
  
  When KVM is not enabled, or not available on a platform, the KVM
  headers should not be included.  Instead, just define stub
  __guest_[enter|exit] functions.
 
 Maybe it would be cleaner to move guest_enter/exit definitions altogether
 in linux/context_tracking.h
 
 After all that's where the implementation mostly belongs.
 
 Let me see if I can get that in shape.

Does the following work for you?

---
commit b1633c075527653a99df6fd54d2611cf8c8047f5
Author: Frederic Weisbecker fweis...@gmail.com
Date:   Thu May 16 01:21:38 2013 +0200

kvm: Move guest entry/exit APIs to context_tracking

Signed-off-by: Frederic Weisbecker fweis...@gmail.com

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 365f4a6..fc09d7b 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/percpu.h>
+#include <linux/vtime.h>
 #include <asm/ptrace.h>
 
 struct context_tracking {
@@ -19,6 +20,26 @@ struct context_tracking {
} state;
 };
 
+static inline void __guest_enter(void)
+{
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system(current);
+   current->flags |= PF_VCPU;
+}
+
+static inline void __guest_exit(void)
+{
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system(current);
+   current->flags &= ~PF_VCPU;
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
 
@@ -35,6 +56,9 @@ static inline bool context_tracking_active(void)
 extern void user_enter(void);
 extern void user_exit(void);
 
+extern void guest_enter(void);
+extern void guest_exit(void);
+
 static inline enum ctx_state exception_enter(void)
 {
enum ctx_state prev_ctx;
@@ -57,6 +81,17 @@ extern void context_tracking_task_switch(struct task_struct *prev,
 static inline bool context_tracking_in_user(void) { return false; }
 static inline void user_enter(void) { }
 static inline void user_exit(void) { }
+
+static inline void guest_enter(void)
+{
+   __guest_enter();
+}
+
+static inline void guest_exit(void)
+{
+   __guest_exit();
+}
+
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline void context_tracking_task_switch(struct task_struct *prev,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f0eea07..8db53cf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -23,6 +23,7 @@
 #include <linux/ratelimit.h>
 #include <linux/err.h>
 #include <linux/irqflags.h>
+#include <linux/context_tracking.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -760,42 +761,6 @@ static inline int kvm_iommu_unmap_guest(struct kvm *kvm)
 }
 #endif
 
-static inline void __guest_enter(void)
-{
-   /*
-* This is running in ioctl context so we can avoid
-* the call to vtime_account() with its unnecessary idle check.
-*/
-   vtime_account_system(current);
-   current->flags |= PF_VCPU;
-}
-
-static inline void __guest_exit(void)
-{
-   /*
-* This is running in ioctl context so we can avoid
-* the call to vtime_account() with its unnecessary idle check.
-*/
-   vtime_account_system(current);
-   current->flags &= ~PF_VCPU;
-}
-
-#ifdef CONFIG_CONTEXT_TRACKING
-extern void guest_enter(void);
-extern void guest_exit(void);
-
-#else /* !CONFIG_CONTEXT_TRACKING */
-static inline void guest_enter(void)
-{
-   __guest_enter();
-}
-
-static inline void guest_exit(void)
-{
-   __guest_exit();
-}
-#endif /* !CONFIG_CONTEXT_TRACKING */
-
 static inline void kvm_guest_enter(void)
 {
unsigned long flags;

Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-05-15 Thread Frederic Weisbecker
On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote:
 Gleb Natapov g...@redhat.com writes:
 
  On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote:
  2013/3/21 Gleb Natapov g...@redhat.com:
   Isn't it simpler for kernel/context_tracking.c to define empty
   __guest_enter()/__guest_exit() if !CONFIG_KVM.
  
  That doesn't look right. Off-cases are usually handled from the
  headers, right? So that we avoid ifdeffery ugliness in core code.
  Let's put it in linux/context_tracking.h header then.
 
 Here's a version to do that.
 
 Kevin
 
 From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001
 From: Kevin Hilman khil...@linaro.org
 Date: Mon, 25 Mar 2013 14:12:41 -0700
 Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest
  enter/exit

Sorry for my very delayed response...

 
 When KVM is not enabled, or not available on a platform, the KVM
 headers should not be included.  Instead, just define stub
 __guest_[enter|exit] functions.

Maybe it would be cleaner to move guest_enter/exit definitions altogether
in linux/context_tracking.h

After all that's where the implementation mostly belongs.

Let me see if I can get that in shape.

Thanks.


Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-03-24 Thread Frederic Weisbecker
2013/3/21 Gleb Natapov g...@redhat.com:
 Isn't it simpler for kernel/context_tracking.c to define empty
 __guest_enter()/__guest_exit() if !CONFIG_KVM.

That doesn't look right. Off-cases are usually handled from the
headers, right? So that we avoid ifdeffery ugliness in core code.


Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Frederic Weisbecker
2013/2/19 Marcelo Tosatti mtosa...@redhat.com:
 On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
  In the case of where you have a system that is running in a
  capped or overcommitted environment the user may see steal time
  being reported in accounting tools such as top or vmstat.  This can
  cause confusion for the end user.

 Sorry, I'm no expert in this area. But I don't really understand what
 is confusing for the end user here.

 I suppose that what is wanted is to subtract stolen time due to 'known
 reasons' from steal time reporting. 'Known reasons' being, for example,
 hard caps. So a vcpu executing instructions with no halt, but limited to
 80% of available bandwidth, would not have 20% of stolen time reported.

Ok, that's a good explanation to add to make that subtle steal time
issue clearer.


 But yes, a description of the scenario that is being dealt with, with
 details, is important.

Yeah especially for such a significant user ABI change.


Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Frederic Weisbecker
2013/3/5 Michael Wolf m...@linux.vnet.ibm.com:
 Sorry for the delay in the response.  I did not see the email
 right away.

 On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
 On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
  2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
   In the case of where you have a system that is running in a
   capped or overcommitted environment the user may see steal time
   being reported in accounting tools such as top or vmstat.  This can
   cause confusion for the end user.
 
  Sorry, I'm no expert in this area. But I don't really understand what
  is confusing for the end user here.

 I suppose that what is wanted is to subtract stolen time due to 'known
 reasons' from steal time reporting. 'Known reasons' being, for example,
 hard caps. So a vcpu executing instructions with no halt, but limited to
 80% of available bandwidth, would not have 20% of stolen time reported.

 Yes exactly and the end user many times did not set up the guest and is
 not aware of the capping.  The end user is only aware of the performance
 level that they were told they would get with the guest.


 But yes, a description of the scenario that is being dealt with, with
 details, is important.

 I will add more detail to the description next time I submit the
 patches.  How about something like, "In a cloud environment the user of a
 kvm guest is not aware of the underlying hardware or how many other
 guests are running on it.  The end user is only aware of a level of
 performance that they should see."  Or does that just muddy the picture
 more??

That alone is probably not enough. But yeah, make sure you clearly
state the difference between expected (caps, sched bandwidth...) and
unexpected (overcommitting, competing load...) stolen time. Then add a
practical example as you made above that explains why it matters to
make that distinction and why you want to report it.
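
For instance (hypothetical numbers, following Marcelo's description above): a
vcpu hard-capped to 80% of a CPU that also loses 5% to competing host load
would then report 5% steal and 20% consigned time, instead of the 25% steal it
reports today.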


Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-02-18 Thread Frederic Weisbecker
2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.

Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.

 To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The steal time will only be altered
 if hard limits (cfs bandwidth control) are used.  The period and the quota used
 to separate the consigned time (expected steal) from the steal time are taken
 from the cfs bandwidth control settings. Any other steal time accruing during
 that period will show as the traditional steal time.

I'm also a bit confused here. steal time will then only account the
cpu time lost due to quotas from cfs bandwidth control? Also what do
you exactly mean by expected steal time? Is it steal time due to
overcommitting minus scheduler quotas?

Thanks.


Re: [RFC/PATCH 1/1] s390/kvm: Fix BUG in include/linux/kvm_host.h:745

2013-01-08 Thread Frederic Weisbecker
2013/1/8 Christian Borntraeger borntrae...@de.ibm.com:
 commit b080935c8638e08134629d0a9ebdf35669bec14d
 kvm: Directly account vtime to system on guest switch

 also removed the irq_disable/enable around kvm guest switch, which
 is correct in itself. Unfortunately, there is a BUG ON that (correctly)
 checks for preemptible to cover the call to rcu later on.
 (Introduced with commit 8fa2206821953a50a3a02ea33fcfb3ced2fd9997
 KVM: make guest mode entry to be rcu quiescent state)

 This check might trigger depending on the kernel config.
 Lets make sure that no preemption happens during kvm_guest_enter.
 We can enable preemption again after the call to
 rcu_virt_note_context_switch returns.

 Please note that we continue to run s390 guests with interrupts
 enabled.

 CC: Frederic Weisbecker fweis...@gmail.com
 CC: Gleb Natapov g...@redhat.com
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 ---
  arch/s390/kvm/kvm-s390.c | 2 ++
  1 file changed, 2 insertions(+)

 diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
 index c9011bf..f090e81 100644
 --- a/arch/s390/kvm/kvm-s390.c
 +++ b/arch/s390/kvm/kvm-s390.c
 @@ -613,7 +613,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 kvm_s390_deliver_pending_interrupts(vcpu);

 vcpu-arch.sie_block-icptcode = 0;
 +   preempt_disable();
 kvm_guest_enter();
 +   preempt_enable();

Sorry for the issue.

The fix looks good to me, thanks!

Acked-by: Frederic Weisbecker fweis...@gmail.com


Re: [PATCH 2/6] rcu: Allow rcu_user_enter()/exit() to nest

2012-07-09 Thread Frederic Weisbecker
On Sun, Jul 08, 2012 at 06:54:18PM +0300, Avi Kivity wrote:
 On 07/06/2012 03:00 PM, Frederic Weisbecker wrote:
  Allow calls to rcu_user_enter() even if we are already
  in userspace (as seen by RCU) and allow calls to rcu_user_exit()
  even if we are already in the kernel.
  
  This makes the APIs more flexible to be called from architectures.
  Exception entries for example won't need to know if they come from
  userspace before calling rcu_user_exit().
 
 I guess I should switch kvm to rcu_user_enter() and co, so we can
 disable the tick while running in a guest.  But where are those
 functions?  What are the rules for calling them?

I guess we need to have a closer look at the guest case first. We probably need
to take some care about specifics in time and load accounting usually
handled by the tick before we can shut it down. RCU is only part of the
problem.


Re: [PATCH 6/9] perf: expose perf capability to other modules.

2011-11-15 Thread Frederic Weisbecker
On Thu, Nov 10, 2011 at 06:12:23AM -0600, Jason Wessel wrote:
 On 11/10/2011 02:58 AM, Frederic Weisbecker wrote:
  On Mon, Nov 07, 2011 at 02:45:17PM +, Will Deacon wrote:
  Hi Frederic,
 
  On Wed, Nov 02, 2011 at 07:42:04AM +, Frederic Weisbecker wrote:
  On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote:
  Right. Originally it could be enabled/disabled. Right now it cannot be,
  but I believe Frederic is working on making it configurable again.
 
  David
  Yep. Will Deacon is working on making the breakpoints able to process
  pure arch information (i.e. without being forced to use the perf attr
  as a midlayer to define them).
 
  Once we have that I can separate the breakpoints implementation from perf
  and make it opt-able.
  How do you foresee kdb fitting into this? I see that currently [on x86] we
  cook up perf_event structures with a specific overflow handler set. If we
  want to move this over to using a completely arch-defined structure, then
  we're going to end up with an overflow handler field in both perf_event
  *and* the arch-specific structure, which doesn't feel right to me.
 
  Of course, if the goal is only to separate ptrace (i.e. user debugging) 
  from
  the perf dependency then we don't need the overflow handler because we'll
  always just send SIGTRAP to the current task.
 
  Any ideas?
  I don't know if we want to convert x86/kgdb to use pure arch breakpoints.
  If kgdb one day wants to extend this use to generic code, it may be a good
  idea to keep the things as is. I don't know, I'm adding Jason in Cc.
 
 I think the important part is to share the allocation code (meaning who owns 
 which break point slots).  This is  why kgdb/kdb allocates the perf 
 structures.  The kgdb code will also directly write data to the slots once it 
 has reserved them it would be good to share this code, but it was not shared 
 because it was not usable early enough in the boot cycle on x86.
 
 Certainly there are others who could consume the same infrastructure such as 
 kprobes.
 
 Jason.

Yeah sure, in any case we want to keep the slot allocation/reservation handled
in kernel/events/hw_breakpoint.c.
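
For reference, the entry points I mean are roughly these (prototypes quoted from memory, so treat them as approximate):

/* kernel/events/hw_breakpoint.c -- approximate prototypes */
int reserve_bp_slot(struct perf_event *bp);	/* claim a debug register slot */
void release_bp_slot(struct perf_event *bp);	/* give the slot back          */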


Re: [PATCH 6/9] perf: expose perf capability to other modules.

2011-11-10 Thread Frederic Weisbecker
On Mon, Nov 07, 2011 at 02:45:17PM +, Will Deacon wrote:
 Hi Frederic,
 
 On Wed, Nov 02, 2011 at 07:42:04AM +, Frederic Weisbecker wrote:
  On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote:
   Right. Originally it could be enabled/disabled. Right now it cannot be,
   but I believe Frederic is working on making it configurable again.
   
   David
  
  Yep. Will Deacon is working on making the breakpoints able to process
  pure arch informations (ie: without beeing forced to use the perf attr
  as a midlayer to define them).
  
  Once we have that I can seperate the breakpoints implementation from perf
  and make it opt-able.
 
 How do you foresee kdb fitting into this? I see that currently [on x86] we
 cook up perf_event structures with a specific overflow handler set. If we
 want to move this over to using a completely arch-defined structure, then
 we're going to end up with an overflow handler field in both perf_event
 *and* the arch-specific structure, which doesn't feel right to me.
 
 Of course, if the goal is only to separate ptrace (i.e. user debugging) from
 the perf dependency then we don't need the overflow handler because we'll
 always just send SIGTRAP to the current task.
 
 Any ideas?

I don't know if we want to convert x86/kgdb to use pure arch breakpoints.
If kgdb one day wants to extend this use to generic code, it may be a good
idea to keep things as they are. I don't know, I'm adding Jason in Cc.

In any case I think we have a problem if we want to default to sending a
SIGTRAP. Look at this:

bp = per_cpu(bp_per_reg[i], cpu);
/*
 * Reset the 'i'th TRAP bit in dr6 to denote completion of
 * exception handling
 */
(*dr6_p) &= ~(DR_TRAP0 << i);
/*
 * bp can be NULL due to lazy debug register switching
 * or due to concurrent perf counter removing.
 */
if (!bp) {
rcu_read_unlock();
break;
}

perf_bp_event(bp, args->regs);


I don't have the details about how lazy the debug register switching
can be. And we also want to avoid locking between the perf event
scheduling (removal) and the breakpoint triggering path.

A solution is to look at the ptrace breakpoints in the thread
struct and see if the one at that index is there. That can reside
in its own callback or as a fallback in hw_breakpoint_handler().
I don't feel strongly about choosing either of those solutions.
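
Roughly, the fallback version could look like this (untested sketch; in particular the plain force_sig(SIGTRAP) delivery is an assumption, not something taken from an existing patch):

	/*
	 * Sketch: when the per-cpu slot is empty because of lazy debug
	 * register switching, fall back to the ptrace breakpoint stored
	 * in the thread struct for that index and notify the task.
	 */
	if (!bp) {
		struct perf_event *tbp = current->thread.ptrace_bps[i];

		if (tbp && !tbp->attr.disabled)
			force_sig(SIGTRAP, current);
		rcu_read_unlock();
		break;
	}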


Re: [PATCH 6/9] perf: expose perf capability to other modules.

2011-11-02 Thread Frederic Weisbecker
On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote:
 
 
 On 11/01/2011 10:13 AM, Gleb Natapov wrote:
  On Tue, Nov 01, 2011 at 09:49:19AM -0600, David Ahern wrote:
  On 10/30/2011 10:53 AM, Gleb Natapov wrote:
  KVM needs to know perf capability to decide which PMU it can expose to a
  guest.
 
  Signed-off-by: Gleb Natapov g...@redhat.com
  ---
   arch/x86/include/asm/perf_event.h  |   11 +++
   arch/x86/kernel/cpu/perf_event.c   |   11 +++
   arch/x86/kernel/cpu/perf_event.h   |2 ++
   arch/x86/kernel/cpu/perf_event_intel.c |3 +++
   4 files changed, 27 insertions(+), 0 deletions(-)
 
  diff --git a/arch/x86/include/asm/perf_event.h 
  b/arch/x86/include/asm/perf_event.h
  index f61c62f..7d7e57f 100644
  --- a/arch/x86/include/asm/perf_event.h
  +++ b/arch/x86/include/asm/perf_event.h
  @@ -201,7 +201,18 @@ struct perf_guest_switch_msr {
u64 host, guest;
   };
   
  +struct x86_pmu_capability {
  + int version;
  + int num_counters_gp;
  + int num_counters_fixed;
  + int bit_width_gp;
  + int bit_width_fixed;
  + unsigned int events_mask;
  + int events_mask_len;
  +};
  +
   extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
  +extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
   #else
   static inline perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
   {
 
  What about version of perf_get_x86_pmu_capability for CONFIG_PERF_EVENTS
  not enabled in host kernel? Next patch for KVM assumes the function is
  defined.
 
  As far as I understand it is not possible to build x86 without
  CONFIG_PERF_EVENTS right now. Actually kvm pmu code depends on
  CONFIG_PERF_EVENTS been enabled. I can easily provide the stub if
  needed though.
 
 Right. Originally it could be enabled/disabled. Right now it cannot be,
 but I believe Frederic is working on making it configurable again.
 
 David

Yep. Will Deacon is working on making the breakpoints able to process
pure arch information (i.e. without being forced to use the perf attr
as a midlayer to define them).

Once we have that I can separate the breakpoints implementation from perf
and make it optional.


Re: [PATCH 1/3] perf: add context field to perf_event

2011-07-05 Thread Frederic Weisbecker
On Tue, Jul 05, 2011 at 03:30:26PM +0100, Will Deacon wrote:
 Hi Frederic,
 
 On Mon, Jul 04, 2011 at 02:58:24PM +0100, Frederic Weisbecker wrote:
  On Wed, Jun 29, 2011 at 05:27:25PM +0100, Will Deacon wrote:
   Hi Frederic,
   
   Thanks for including me on CC.
   
   On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote:
On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote:
 The perf_event overflow handler does not receive any caller-derived
 argument, so many callers need to resort to looking up the perf_event
 in their local data structure.  This is ugly and doesn't scale if a
 single callback services many perf_events.

 Fix by adding a context parameter to 
 perf_event_create_kernel_counter()
 (and derived hardware breakpoints APIs) and storing it in the 
 perf_event.
 The field can be accessed from the callback as 
 event-overflow_handler_context.
 All callers are updated.

 Signed-off-by: Avi Kivity a...@redhat.com

I believe it can micro-optimize ptrace through 
register_user_hw_breakpoint() because
we could store the index of the breakpoint that way, instead of 
iterating through 4 slots.

Perhaps it can help in arm too, adding Will in Cc.
   
   Yes, we could store the breakpoint index in there and it would save us
   walking over the breakpoints when one fires. Not sure this helps us for
   anything else though. My main gripe with the ptrace interface to
   hw_breakpoints is that we have to convert all the breakpoint information
   from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back 
   again
   in the hw_breakpoint code. Yuck!
  
  Agreed, I don't like that either.
  
  Would you like to improve that? We probably need to be able to pass some 
  arch data
  through the whole call of breakpoint creation, including 
  perf_event_create_kernel_counter().
 
 Sure, I'll make some time to look at this and try and get an RFC out in the
 next few weeks.

Great! Thanks a lot!

 
  There can be a transition step where we can either take generic attr or 
  arch datas, until
  every archs are converted. So that you can handle the arm part and other 
  arch developers
  can relay.
 
 Yup.
 
  
  Another thing I would like to do in the even longer term is to not use perf 
  anymore
  for ptrace breakpoints, because that involves a heavy dependency and few 
  people are
  happy with that. Instead we should just have a generic hook into the 
  sched_switch()
  and handle pure ptrace breakpoints there. The central breakpoint API would 
  still be
  there to reserve/schedule breakpoint resources between ptrace and perf.
  
  But beeing able to create ptrace breakpoints without converting to generic 
  perf attr
  is a necessary first step in order to achieve this.
 
 Agreed, but I'll bear that in mind so I don't make it any more difficult
 than it already is!
 
 Cheers,
 
 Will


Re: [PATCH 1/3] perf: add context field to perf_event

2011-07-04 Thread Frederic Weisbecker
On Wed, Jun 29, 2011 at 05:27:25PM +0100, Will Deacon wrote:
 Hi Frederic,
 
 Thanks for including me on CC.
 
 On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote:
  On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote:
   The perf_event overflow handler does not receive any caller-derived
   argument, so many callers need to resort to looking up the perf_event
   in their local data structure.  This is ugly and doesn't scale if a
   single callback services many perf_events.
  
   Fix by adding a context parameter to perf_event_create_kernel_counter()
   (and derived hardware breakpoints APIs) and storing it in the perf_event.
   The field can be accessed from the callback as 
   event-overflow_handler_context.
   All callers are updated.
  
   Signed-off-by: Avi Kivity a...@redhat.com
  
  I believe it can micro-optimize ptrace through 
  register_user_hw_breakpoint() because
  we could store the index of the breakpoint that way, instead of iterating 
  through 4 slots.
  
  Perhaps it can help in arm too, adding Will in Cc.
 
 Yes, we could store the breakpoint index in there and it would save us
 walking over the breakpoints when one fires. Not sure this helps us for
 anything else though. My main gripe with the ptrace interface to
 hw_breakpoints is that we have to convert all the breakpoint information
 from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again
 in the hw_breakpoint code. Yuck!

Agreed, I don't like that either.

Would you like to improve that? We probably need to be able to pass some
arch data through the whole breakpoint creation path, including
perf_event_create_kernel_counter().

There can be a transition step where we can take either the generic attr
or arch data, until every arch is converted. So you can handle the arm
part and other arch developers can take over from there.


Another thing I would like to do in the even longer term is to not use
perf anymore for ptrace breakpoints, because that involves a heavy
dependency and few people are happy with that. Instead we should just
have a generic hook into sched_switch() and handle pure ptrace
breakpoints there. The central breakpoint API would still be there to
reserve/schedule breakpoint resources between ptrace and perf.

But being able to create ptrace breakpoints without converting to the
generic perf attr is a necessary first step in order to achieve this.
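
As a strawman for that transition step (every name below is invented for illustration), the registration side could temporarily grow a variant that takes the raw arch data directly:

/*
 * Hypothetical transitional prototype: let an arch hand its native
 * breakpoint description straight down to the breakpoint core without
 * building a perf_event_attr first.
 */
struct perf_event *
register_user_hw_breakpoint_arch(void *arch_bp_data,
				 perf_overflow_handler_t triggered,
				 struct task_struct *tsk);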

Thanks. 


Re: [PATCH 1/3] perf: add context field to perf_event

2011-07-04 Thread Frederic Weisbecker
On Mon, Jul 04, 2011 at 05:10:20PM +0300, Avi Kivity wrote:
 On 07/04/2011 04:58 PM, Frederic Weisbecker wrote:
 Another thing I would like to do in the even longer term is to not use perf 
 anymore
 for ptrace breakpoints, because that involves a heavy dependency and few 
 people are
 happy with that. Instead we should just have a generic hook into the 
 sched_switch()
 and handle pure ptrace breakpoints there. The central breakpoint API would 
 still be
 there to reserve/schedule breakpoint resources between ptrace and perf.
 
 
 'struct preempt_notifier' may be the hook you're looking for.

Yeah looks like a perfect fit as it's per task.

Thanks.
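
Something along these lines, as a sketch only (the callback bodies are the part that would remain to be written):

#include <linux/preempt.h>

/* Sketch: hook pure ptrace breakpoints via preempt notifiers. */
static void bp_sched_in(struct preempt_notifier *pn, int cpu)
{
	/* install this task's ptrace breakpoints in the debug registers */
}

static void bp_sched_out(struct preempt_notifier *pn, struct task_struct *next)
{
	/* remove them before the next task runs */
}

static struct preempt_ops bp_preempt_ops = {
	.sched_in	= bp_sched_in,
	.sched_out	= bp_sched_out,
};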


Re: [PATCH 1/3] perf: add context field to perf_event

2011-06-29 Thread Frederic Weisbecker
On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote:
 The perf_event overflow handler does not receive any caller-derived
 argument, so many callers need to resort to looking up the perf_event
 in their local data structure.  This is ugly and doesn't scale if a
 single callback services many perf_events.
 
 Fix by adding a context parameter to perf_event_create_kernel_counter()
 (and derived hardware breakpoints APIs) and storing it in the perf_event.
 The field can be accessed from the callback as 
 event-overflow_handler_context.
 All callers are updated.
 
 Signed-off-by: Avi Kivity a...@redhat.com

I believe it can micro-optimize ptrace through register_user_hw_breakpoint(),
because we could store the index of the breakpoint that way, instead of
iterating through the 4 slots.

Perhaps it can help on arm too, adding Will in Cc.

But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main user;
maybe Jason could find some use of it.
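
To spell out the ptrace micro-optimization (sketch only; encoding the slot index through a cast is my illustration, not part of Avi's patch):

	/* registration side: stash the slot index as the context */
	bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
					 (void *)(long)nr, tsk);

	/* handler side: read it back instead of scanning the 4 slots */
	int slot = (long)bp->overflow_handler_context;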

 ---
  arch/arm/kernel/ptrace.c|3 ++-
  arch/powerpc/kernel/ptrace.c|2 +-
  arch/sh/kernel/ptrace_32.c  |3 ++-
  arch/x86/kernel/kgdb.c  |2 +-
  arch/x86/kernel/ptrace.c|3 ++-
  drivers/oprofile/oprofile_perf.c|2 +-
  include/linux/hw_breakpoint.h   |   10 --
  include/linux/perf_event.h  |4 +++-
  kernel/events/core.c|   21 +++--
  kernel/events/hw_breakpoint.c   |   10 +++---
  kernel/watchdog.c   |2 +-
  samples/hw_breakpoint/data_breakpoint.c |2 +-
  12 files changed, 44 insertions(+), 20 deletions(-)
 
 diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
 index 9726006..4911c94 100644
 --- a/arch/arm/kernel/ptrace.c
 +++ b/arch/arm/kernel/ptrace.c
 @@ -479,7 +479,8 @@ static struct perf_event *ptrace_hbp_create(struct 
 task_struct *tsk, int type)
   attr.bp_type= type;
   attr.disabled   = 1;
  
 - return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, tsk);
 + return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, NULL,
 +tsk);
  }
  
  static int ptrace_gethbpregs(struct task_struct *tsk, long num,
 diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
 index cb22024..5249308 100644
 --- a/arch/powerpc/kernel/ptrace.c
 +++ b/arch/powerpc/kernel/ptrace.c
 @@ -973,7 +973,7 @@ int ptrace_set_debugreg(struct task_struct *task, 
 unsigned long addr,
   attr.bp_type);
  
   thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
 - ptrace_triggered, task);
 +ptrace_triggered, NULL, task);
   if (IS_ERR(bp)) {
   thread->ptrace_bps[0] = NULL;
   ptrace_put_breakpoints(task);
 diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c
 index 3d7b209..930312f 100644
 --- a/arch/sh/kernel/ptrace_32.c
 +++ b/arch/sh/kernel/ptrace_32.c
 @@ -91,7 +91,8 @@ static int set_single_step(struct task_struct *tsk, 
 unsigned long addr)
   attr.bp_len = HW_BREAKPOINT_LEN_2;
   attr.bp_type = HW_BREAKPOINT_R;
  
 - bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
 + bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
 +  NULL, tsk);
   if (IS_ERR(bp))
   return PTR_ERR(bp);
  
 diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
 index 5f9ecff..473ab53 100644
 --- a/arch/x86/kernel/kgdb.c
 +++ b/arch/x86/kernel/kgdb.c
 @@ -638,7 +638,7 @@ void kgdb_arch_late(void)
   for (i = 0; i  HBP_NUM; i++) {
   if (breakinfo[i].pev)
   continue;
 - breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL);
 + breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL, 
 NULL);
   if (IS_ERR((void * __force)breakinfo[i].pev)) {
   printk(KERN_ERR "kgdb: Could not allocate hw"
  "breakpoints\nDisabling the kernel debugger\n");
 diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
 index 807c2a2..28092ae 100644
 --- a/arch/x86/kernel/ptrace.c
 +++ b/arch/x86/kernel/ptrace.c
 @@ -715,7 +715,8 @@ static int ptrace_set_breakpoint_addr(struct task_struct 
 *tsk, int nr,
   attr.bp_type = HW_BREAKPOINT_W;
   attr.disabled = 1;
  
 - bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
 + bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
 +  NULL, tsk);
  
   /*
* CHECKME: the previous code returned -EIO if the addr wasn't
 diff --git a/drivers/oprofile/oprofile_perf.c 
 

Re: ftrace/perf_event leak

2010-09-01 Thread Frederic Weisbecker
On Wed, Sep 01, 2010 at 01:02:57PM +0200, Peter Zijlstra wrote:
 On Wed, 2010-09-01 at 13:38 +0300, Avi Kivity wrote:
  On 09/01/2010 12:38 PM, Li Zefan wrote:
  
   Then try this:
  
  Tested-by: Avi Kivity a...@redhat.com
  
 
 Thanks, queued as:
 
 ---
 Subject: perf, trace: Fix module leak
 From: Li Zefan l...@cn.fujitsu.com
 Date: Wed Sep 01 12:58:43 CEST 2010
 
 Commit 1c024eca (perf, trace: Optimize tracepoints by using
 per-tracepoint-per-cpu hlist to track events) caused a module refcount
 leak.
 
 Tested-by: Avi Kivity a...@redhat.com
 Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
 LKML-Reference: 4c7e1f12.8030...@cn.fujitsu.com
 ---
  kernel/trace/trace_event_perf.c |3 +++
  1 file changed, 3 insertions(+)
 
 Index: linux-2.6/kernel/trace/trace_event_perf.c
 ===
 --- linux-2.6.orig/kernel/trace/trace_event_perf.c
 +++ linux-2.6/kernel/trace/trace_event_perf.c
 @@ -91,6 +91,8 @@ int perf_trace_init(struct perf_event *p_event)
   tp_event->class && tp_event->class->reg &&
   try_module_get(tp_event->mod)) {
   ret = perf_trace_event_init(tp_event, p_event);
 + if (ret)
 + module_put(tp_event->mod);
   break;
   }
   }
 @@ -147,6 +149,7 @@ void perf_trace_destroy(struct perf_event *p_event)
   }
   }
  out:
 + module_put(tp_event->mod);
   mutex_unlock(&event_mutex);
  }
  
 



Thanks for fixing this.

However, can we split this into two patches to ease the backport?

The lack of a module_put() after a perf_trace_init() failure has been there
for a while (the backport needs to start in 2.6.32).

But the lack of a module_put() in the destroy path only needs a .35 backport.



Re: [RFC] Unify KVM kernel-space and user-space code into a single project

2010-03-18 Thread Frederic Weisbecker
On Thu, Mar 18, 2010 at 12:32:51PM +0200, Avi Kivity wrote:
 By serious developer I mean

  - someone who is interested in contributing, not in getting their name  
 into the kernel commits list
  - someone who is willing to read the wiki page and find out where the  
 repository and mailing list for a project is
  - someone who will spend enough time on the project so that the time to 
 clone two repositories will not be a factor in their contributions


I'm not going to argue about the Qemu merging here.
But your above assessment is incomplete.

tools/perf is not a success merely because developers don't want to clone
two different trees. Or maybe that's a factor, but I suspect a very minimal
one; I can script git commands if needed. It is actually because both the
kernel and user sides stay in sync in this scheme.



 Let's wait and see then.  If the tools/perf/ experience has really good  
 results, we can reconsider this at a later date.


I think it already has really good results.



Re: kvm guest: hrtimer: interrupt too slow

2009-10-10 Thread Frederic Weisbecker
On Sat, Oct 10, 2009 at 01:18:16PM +0400, Michael Tokarev wrote:
 Michael Tokarev wrote:
 Frederic Weisbecker wrote:
 []
 Was there swapping going on?
 Not as far as I can see, and sar output agrees.

 But I can read this from you guest traces:

 I missed this one yesterday.  Note it's GUEST traces
 indeed.  Higher (read: non-zero) pgp{in,out} and faults
 values happens in *guest*, not on host (original question
 was if we've swapping in HOST, which'd explain the timer
 issues)



Yeah indeed. But still, that's a strange happenstance.



 [cutting extra all-zero columns]
 pgpgin/s pgpgout/s   fault/s  pgfree/s
 11:44:47 0.00 32.32907.07   277.78
 11:44:4827.59 22.99 44.83   150.57
 11:44:49 0.00 33.68 22.11   218.95
 [...]
 21:46:54 0.00 31.68 16.8390.10
 21:46:55 0.00108.00 17.0089.00
 21:46:56 9.76482.93   3890.24   439.02
 21:46:57 0.00760.00   8627.00  1133.00
 21:46:58 0.00 84.85   2612.12   138.38
 21:46:59 0.00 16.00 17.0090.00

 So it looks like there was some swapping in when the hrtimer (spuriously)
 hanged.

 One possible guess.  Since the guest hanged for some time, the
 higher values there might be a result of accumulated values for
 several seconds.



Maybe, yeah. I don't know enough about the virtualization internals, so...



Re: kvm guest: hrtimer: interrupt too slow

2009-10-09 Thread Frederic Weisbecker
On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote:
 Marcelo Tosatti wrote:
 [snip]

 Would be useful to collect sar (sar -B -b -u) output every one second
 in both host/guest. You already mentioned load was low, but this should
 give more details.

 Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/

 Two incindents - cases when hrtimer: interrupt is too slow were
 reported in the guest (with Marcelo's patch so that min_delta is
 increased to 50% only), happened at 11:44:48 and 21:46:56 (as
 shown in guest-dmesg file).  For both, there's `sar -BWbd' output
 for a 2-minute interval (starting one minute before the delay
 and ending one minute after) from both host and guest.

 Was there swapping going on?

 Not as far as I can see, and sar output agrees.


But I can read this from your guest traces:

         pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff

11:44:45     0.00     32.32    174.75      0.00    176.77      0.00      0.00      0.00      0.00
11:44:46     0.00     16.00    789.00      0.00    323.00      0.00      0.00      0.00      0.00
11:44:47     0.00     32.32    907.07      0.00    277.78      0.00      0.00      0.00      0.00
11:44:48    27.59     22.99     44.83      0.00    150.57      0.00      0.00      0.00      0.00
11:44:49     0.00     33.68     22.11      0.00    218.95      0.00      0.00      0.00      0.00
11:44:50     0.00    101.01     17.17      0.00    151.52      0.00      0.00      0.00      0.00
11:44:51     0.00     15.69     16.67      0.00    126.47      0.00      0.00      0.00      0.00
[...]
21:46:52     0.00     40.00     17.00      0.00     82.00      0.00      0.00      0.00      0.00
21:46:53     0.00     31.68     18.81      0.00     94.06      0.00      0.00      0.00      0.00
21:46:54     0.00     31.68     16.83      0.00     90.10      0.00      0.00      0.00      0.00
21:46:55     0.00    108.00     17.00      0.00     89.00      0.00      0.00      0.00      0.00
21:46:56     9.76    482.93   3890.24      0.00    439.02      0.00      0.00      0.00      0.00
21:46:57     0.00    760.00   8627.00      0.00   1133.00      0.00      0.00      0.00      0.00
21:46:58     0.00     84.85   2612.12      0.00    138.38      0.00      0.00      0.00      0.00
21:46:59     0.00     16.00     17.00      0.00     90.00      0.00      0.00      0.00      0.00
21:47:00     0.00     36.36     17.17      0.00     90.91      0.00      0.00      0.00      0.00



So it looks like there was some swapping going on when the hrtimer (spuriously)
hung.

Not sure how to correlate these results with the incident though...





Re: kvm guest: hrtimer: interrupt too slow

2009-10-07 Thread Frederic Weisbecker
(Adding Thomas in Cc)


On Sat, Oct 03, 2009 at 08:12:05PM -0300, Marcelo Tosatti wrote:
 Michael,
 
 Can you please give the patch below a try please? (without acpi_pm timer 
 or priority adjustments for the guest).
 
 On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
  Hello.
 
  I'm having quite an.. unusable system here.
  It's not really a regresssion with 0.11.0,
  it was something similar before, but with
  0.11.0 and/or 2.6.31 it become much worse.
 
  The thing is that after some uptime, kvm
  guest prints something like this:
 
  hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
 
  after which system (guest) speeed becomes
  very slow.  The above message is from
  2.6.31 guest running wiht 0.11.0  2.6.31
  host.  Before I tried it with 0.10.6 and
  2.6.30 or 2.6.27, and the delta were a
  bit less than that:
 
  hrtimer: interrupt too slow, forcing clock min delta to 15415 ns
  hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
 
 It seems the way hrtimer_interrupt_hanging calculates min_delta is
 wrong (especially to virtual machines). The guest vcpu can be scheduled
 out during the execution of the hrtimer callbacks (and the callbacks
 themselves can do operations that translate to blocking operations in
 the hypervisor).
 
 So high min_delta values can be calculated if, for example, a single
 hrtimer_interrupt run takes two host time slices to execute, while some
 other higher priority task runs for N slices in between.
 
 Using the hrtimer_interrupt execution time (which can be the worse
 case at any given time), as the min_delta is problematic.
 
 So simply increase min_delta_ns by 50% once every detected failure,
 which will eventually lead to an acceptable threshold (the algorithm
 should scale back to down lower min_delta, to adjust back to wealthier
 times, too).
 
 diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
 index 49da79a..8997978 100644
 --- a/kernel/hrtimer.c
 +++ b/kernel/hrtimer.c
 @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
  
  #ifdef CONFIG_HIGH_RES_TIMERS
  
 -static int force_clock_reprogram;
 -
  /*
   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
   * is hanging, which could happen with something that slows the interrupt
 - * such as the tracing. Then we force the clock reprogramming for each future
 - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
 - * threshold that we will overwrite.
 - * The next tick event will be scheduled to 3 times we currently spend on
 - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
 - * 1/4 of their time to process the hrtimer interrupts. This is enough to
 - * let it running without serious starvation.
 + * such as the tracing, so we increase min_delta_ns.
   */
  
  static inline void
 -hrtimer_interrupt_hanging(struct clock_event_device *dev,
 - ktime_t try_time)
 +hrtimer_interrupt_hanging(struct clock_event_device *dev)
  {
 - force_clock_reprogram = 1;
 - dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
 - printk(KERN_WARNING "hrtimer: interrupt too slow, "
 - "forcing clock min delta to %lu ns\n", dev->min_delta_ns);
 + dev->min_delta_ns += dev->min_delta_ns >> 1;


I hadn't thought about the guest being scheduled out in the middle of
servicing the timers, which invalidates this check based on the time
spent in hrtimer_interrupt().

I guess there is no easy/generic/cheap way to rebase this check on the
_virtual_ time spent servicing the timers. By virtual, I mean the time
spent in the guest only.

In a non-guest kernel, the old check enforces an adaptive rate
sharing:

- we spent n nanosecs servicing the batch of timers
- we are hanging
- we want at least 3/4 of the time reserved for non-timer
  work in the kernel, this is a minimum prerequisite
  for the system not to starve
- adapt min_delta_ns to fit the above constraint

None of that makes sense anymore in a guest. The hang detection
and warnings, and the recalibrations of min_delta_ns, are completely
wrong in this context.
Not only does it warn spuriously, but the minimum delta keeps slowly
increasing and the guest progressively suffers from higher and higher
latencies.

That's really bad.

Your patch lowers the immediate impact and makes this illness progress
more smoothly by scaling down the recalibration of min_delta_ns.
That papers over the bug but doesn't solve it. I fear it could even be
worse because it makes the problem more discreet.


Maybe we can instead increase the minimum number of loops in the
hrtimer interrupt before considering it a hang? Hmm, but too high a
number could make this check useless, depending on the number of pending
timers, which is finite.
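
Something like this is what I have in mind, purely as an illustration (the constant and the surrounding code are made up, not the actual hrtimer_interrupt() source):

	/*
	 * Illustration only: tolerate more passes over the expired timers
	 * before declaring the interrupt hung, instead of recalibrating
	 * min_delta_ns.  The threshold value is arbitrary.
	 */
	if (++nr_retries > 20)		/* instead of the current 5 iterations */
		hrtimer_interrupt_hanging(dev);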

Well, actually I'm not confident in this check anymore. Or rather we
should change it. Maybe we can rebase it on the time spent in the hrtimer
interrupt (and check it every 10 

Re: [TOOL] c2kpe: C expression to kprobe event format converter

2009-08-31 Thread Frederic Weisbecker
On Mon, Aug 31, 2009 at 12:14:34AM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
  On Thu, Aug 13, 2009 at 04:59:19PM -0400, Masami Hiramatsu wrote:
  This program converts probe point in C expression to kprobe event
  format for kprobe-based event tracer. This helps to define kprobes
  events by C source line number or function name, and local variable
  name. Currently, this supports only x86(32/64) kernels.
 
 
  Compile
  
  Before compilation, please install libelf and libdwarf development
  packages.
  (e.g. elfutils-libelf-devel and libdwarf-devel on Fedora)
  
  
  This may probably need a specific libdwarf version?
  
  c2kpe.c: In function ‘die_get_entrypc’:
  c2kpe.c:422: erreur: ‘Dwarf_Ranges’ undeclared (first use in this function)
  c2kpe.c:422: erreur: (Each undeclared identifier is reported only once
  c2kpe.c:422: erreur: for each function it appears in.)
  c2kpe.c:422: erreur: ‘ranges’ undeclared (first use in this function)
  c2kpe.c:447: attention : implicit declaration of function ‘dwarf_get_ranges’
  c2kpe.c:451: attention : implicit declaration of function 
  ‘dwarf_ranges_dealloc’
 
 Aah, sure, it should be compiled with libdwarf newer than 20090324.
 You can find it in http://reality.sgiweb.org/davea/dwarf.html


Ah ok.

 
 BTW, libdwarf and libdw (which is the yet another implementation of
 dwarf library) are still under development, e.g. libdwarf doesn't
 support gcc-4.4.1(very new) and only the latest libdw(0.142) can
 support it. So, perhaps I might better port it on libdw, even that is
 less documented...:(



Maybe let's continue with libdwarf for now, and we'll see if support
for libdw is required later.



  TODO
  
   - Fix bugs.
   - Support multiple probepoints from stdin.
   - Better kmodule support.
   - Use elfutils-libdw?
   - Merge into trace-cmd or perf-tools?
  
  
  Yeah definetly, that would be a veeery interesting thing to have.
  I've played with kprobe ftrace to debug something this evening.
  
  It's very cool to be able to put dynamic tracepoints in desired places.
  
  But...
  I firstly needed to put random trace_printk() in some places to
  observe some variables values. And then I thought about the kprobes
  tracer and realized I could do that without the need of rebuilding
  my kernel. Then I've played with it and indeed it works well and
  it's useful, but at the cost of reading objdump based assembly
  code to find the places where I could find my variables values.
  And after two or three probes in such conditions, I've become
  tired of that, then I wanted to try this tool.
  
  
  While I cannot yet because of this build error, I can imagine
  the power of such facility from perf.
  
  We could have a perf probe that creates a kprobe event in debugfs
  (default enable = 0) and which then rely on perf record for the actual
  recording.
  
  Then we could analyse it through perf trace.
  Let's imagine a simple example:
  
  int foo(int arg1, int arg2)
  {
  int var1;
  
  var1 = arg1;
  var1 *= arg2;
  var1 -= arg1;
  
  -- insert a probe here (file bar.c : line 60)
  
  var1 ^= ...
  
  return var1;
  }
  
  ./perf kprobe --file bar.c:60 --action arg1=%d,arg2=%d,var1=%d -- ls 
  -R /
 
 I recommend it should be separated from record, like below:
 
 # set new event
 ./perf kprobe --add kprobe:event1 --file bar.c:60 --action 
 arg1=%d,arg2=%d,var1=%d
 # record new event
 ./perf record -e kprobe:event1 -a -R -- ls -R /


That indeed solves the command line overkill, but it also
breaks the workflow a bit :)

Well, I guess we can start simple in the beginning and follow the above
mockup, which is indeed better decoupled.
And if something more intuitive comes to mind later, we can still
change it.


 This will allow us to focus on one thing -- convert C to kprobe-tracer.
 And also, it can be listed as like as tracepoint events.


Yeah.


  ./perf trace
  arg1=1 arg2=1 var1=0
  arg1=2 arg2=2 var1=2
  etc..
  
  You may want to sort by field:
  
  ./perf trace -s arg1 --order desc
  arg1=1
|
--- arg2=1 var=1
|
--- arg2=2 var=1
  
  arg1=2
|
--- arg2=1 var=0
|
--- [...]
  
  ./perf trace -s arg1,arg2 --order asc
  arg1=1
|
--- arg2=1
  |
  - var1=0
  |
  - var1=
arg2=...
  |
  
  Ok the latter is a bad example because var1 will always have only one
  value for a given arg1 and arg2. But I guess you see the point.
  
  You won't have to care about the perf trace part, it's already
  implemented and I'll soon handle the sorting part.
  
  All we need is the perf kprobes that translate a C level
  probing expression to a /debug/tracing/kprobe_events compliant
  thing. And then just call

Re: [TOOL] c2kpe: C expression to kprobe event format converter

2009-08-30 Thread Frederic Weisbecker
On Thu, Aug 13, 2009 at 04:59:19PM -0400, Masami Hiramatsu wrote:
 This program converts probe point in C expression to kprobe event
 format for kprobe-based event tracer. This helps to define kprobes
 events by C source line number or function name, and local variable
 name. Currently, this supports only x86(32/64) kernels.


 Compile
 
 Before compilation, please install libelf and libdwarf development
 packages.
 (e.g. elfutils-libelf-devel and libdwarf-devel on Fedora)


This probably needs a specific libdwarf version?

c2kpe.c: In function ‘die_get_entrypc’:
c2kpe.c:422: error: ‘Dwarf_Ranges’ undeclared (first use in this function)
c2kpe.c:422: error: (Each undeclared identifier is reported only once
c2kpe.c:422: error: for each function it appears in.)
c2kpe.c:422: error: ‘ranges’ undeclared (first use in this function)
c2kpe.c:447: warning: implicit declaration of function ‘dwarf_get_ranges’
c2kpe.c:451: warning: implicit declaration of function ‘dwarf_ranges_dealloc’


 TODO
 
  - Fix bugs.
  - Support multiple probepoints from stdin.
  - Better kmodule support.
  - Use elfutils-libdw?
  - Merge into trace-cmd or perf-tools?


Yeah definitely, that would be a very interesting thing to have.
I've played with the kprobe ftrace tracer to debug something this evening.

It's very cool to be able to put dynamic tracepoints in desired places.

But...
I first needed to put random trace_printk() calls in some places to
observe some variables' values. And then I thought about the kprobes
tracer and realized I could do that without needing to rebuild
my kernel. So I played with it and indeed it works well and
it's useful, but at the cost of reading objdump-based assembly
code to find the places where I could find my variables' values.
After two or three probes in such conditions, I became
tired of that, so I wanted to try this tool.


While I can't yet because of this build error, I can imagine
the power of such a facility from perf.

We could have a perf probe that creates a kprobe event in debugfs
(default enable = 0) and which then relies on perf record for the actual
recording.

Then we could analyse it through perf trace.
Let's imagine a simple example:

int foo(int arg1, int arg2)
{
int var1;

var1 = arg1;
var1 *= arg2;
var1 -= arg1;

-- insert a probe here (file bar.c : line 60)

var1 ^= ...

return var1;
}

./perf kprobe --file bar.c:60 --action arg1=%d,arg2=%d,var1=%d -- ls -R /
./perf trace
arg1=1 arg2=1 var1=0
arg1=2 arg2=2 var1=2
etc..

You may want to sort by field:

./perf trace -s arg1 --order desc
arg1=1
  |
  --- arg2=1 var=1
  |
  --- arg2=2 var=1

arg1=2
  |
  --- arg2=1 var=0
  |
  --- [...]

./perf trace -s arg1,arg2 --order asc
arg1=1
  |
  --- arg2=1
|
- var1=0
|
- var1=
  arg2=...
|

Ok the latter is a bad example because var1 will always have only one
value for a given arg1 and arg2. But I guess you see the point.

You won't have to care about the perf trace part, it's already
implemented and I'll handle the sorting part soon.

All we need is the perf kprobe part that translates a C level
probing expression into a /debug/tracing/kprobe_events compliant
definition, and then just calls perf record with the newly created
event as an argument.
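
Just to sketch that final step (everything in the probe string below -- event name, symbol offset, fetch args -- is a made-up example of what the converter would have computed, and the debugfs mount point is assumed to be /debug):

#include <stdio.h>

/* Minimal illustration: push a converted probe definition to the
 * kprobe tracer control file. */
int main(void)
{
	FILE *f = fopen("/debug/tracing/kprobe_events", "w");

	if (!f)
		return 1;
	/* "p:" defines a kprobe; foo+0x2a and the registers would come
	 * from the DWARF info for bar.c:60 */
	fprintf(f, "p:probe_bar60 foo+0x2a %%di %%si\n");
	fclose(f);
	return 0;
}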

Frederic.



Re: [PATCH tracing/kprobes 1/4] x86: Fix x86 instruction decoder selftest to check only .text

2009-08-23 Thread Frederic Weisbecker
On Fri, Aug 21, 2009 at 03:43:07PM -0400, Masami Hiramatsu wrote:
 Fix x86 instruction decoder selftest to check only .text because other
 sections (e.g. .notes) will have random bytes which don't need to be checked.


Applied these 4 patches in

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git \
tracing/kprobes

Thanks!



Re: [PATCH -tip v14 01/12] x86: instruction decoder API

2009-08-20 Thread Frederic Weisbecker
On Thu, Aug 20, 2009 at 10:42:05AM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
 On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote:
 Add x86 instruction decoder to arch-specific libraries. This decoder
 can decode x86 instructions used in kernel into prefix, opcode, modrm,
 sib, displacement and immediates. This can also show the length of
 instructions.

 This version introduces instruction attributes for decoding instructions.
 The instruction attribute tables are generated from the opcode map file
 (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk).

 Currently, the opcode maps are based on opcode maps in Intel(R) 64 and
 IA-32 Architectures Software Developers Manual Vol.2: Appendix.A,
 and consist of below two types of opcode tables.

 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are
 written as below;

   Table: table-name
   Referrer: escaped-name
   opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] 
 [| 2nd-mnemonic ...]
(or)
   opcode: escape # escaped-name
   EndTable

 Group opcodes, which has 8 elements, are written as below;

   GrpTable: GrpXXX
   reg:  mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 
 2nd-mnemonic ...]
   EndTable

 These opcode maps include a few SSE and FP opcodes (for setup), because
 those opcodes are used in the kernel.



 I'm getting the following build error on an old K7 box:

 arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’:
 arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use 
 in this function)
 arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only 
 once
 arch/x86/lib/inat.c:29: erreur: for each function it appears in.)

 Thanks for reporting!
 Hmm, it seems that inat-tables.c is not correctly generated.
 Could you tell me which awk you used and send the inat-tables.c?

 Thank you,



Sure:

$ awk -Wv
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits:
max NF 32767
sprintf buffer  2040

And I've sent you the content of inat_tables.c in the other answer :)



Re: [PATCH -tip v14 01/12] x86: instruction decoder API

2009-08-20 Thread Frederic Weisbecker
On Thu, Aug 20, 2009 at 11:03:40AM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
 On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote:
 On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote:
 Add x86 instruction decoder to arch-specific libraries. This decoder
 can decode x86 instructions used in kernel into prefix, opcode, modrm,
 sib, displacement and immediates. This can also show the length of
 instructions.

 This version introduces instruction attributes for decoding instructions.
 The instruction attribute tables are generated from the opcode map file
 (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk).

 Currently, the opcode maps are based on opcode maps in Intel(R) 64 and
 IA-32 Architectures Software Developers Manual Vol.2: Appendix.A,
 and consist of below two types of opcode tables.

 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are
 written as below;

   Table: table-name
   Referrer: escaped-name
   opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] 
 [| 2nd-mnemonic ...]
(or)
   opcode: escape # escaped-name
   EndTable

 Group opcodes, which has 8 elements, are written as below;

   GrpTable: GrpXXX
   reg:  mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 
 2nd-mnemonic ...]
   EndTable

 These opcode maps include a few SSE and FP opcodes (for setup), because
 those opcodes are used in the kernel.



 I'm getting the following build error on an old K7 box:

 arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’:
 arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use 
 in this function)
 arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported 
 only once
 arch/x86/lib/inat.c:29: erreur: for each function it appears in.)


 I've attached my config. I haven't such problem on a dual x86-64 box.


 Actually I have the same problem in x86-64
 The content of my arch/x86/lib/inat-tables.c:

 /* x86 opcode map generated from x86-opcode-map.txt */
 /* Do not change this code. */
 /* Table: one byte opcode */
 /* Escape opcode map array */
 const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 
 1][INAT_LPREFIX_MAX + 1] = {
 };

 /* Group opcode map array */
 const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 
 1][INAT_LPREFIX_MAX + 1] = {
 };


 I guess there is a problem with the generation of this file.

 Aah, you may use mawk on Ubuntu 9.04, right?
 If so, unfortunately, mawk is still under development.

 http://invisible-island.net/mawk/CHANGES



Aargh...


 20090727
  add check/fix to prevent gsub from recurring to modify on a substring
  of the current line when the regular expression is anchored to the
  beginning of the line; fixes gawk's anchgsub testcase.

  add check for implicit concatenation mistaken for exponent; fixes
  gawk's hex testcase.

  add character-classes to built-in regular expressions.
 ^^
 Look, this means we can't use char-class expressions like
 [:lower:] until this version...

 And I've found another bug in mawk-1.3.3-20090728(the latest one).
 it almost works, but;

 $ mawk 'BEGIN {printf("0x%x\n", 0)}'
 0x1


Ouch, indeed.



 $ gawk 'BEGIN {printf("0x%x\n", 0)}'
 0x0

 This bug skips an array element index 0x0 in inat-tables.c :(

 So, I recommend you to install gawk instead mawk until that
 supports all posix-awk features, since I don't think it is
 good idea to avoid all those bugs which depends on
 implementation (not specification).


 Thank you,



Yeah, indeed. Maybe add a warning (or build error) in case the user uses
mawk?

Anyway that works fine now with gawk, thanks!
All your patches build well :-)



 -- 
 Masami Hiramatsu

 Software Engineer
 Hitachi Computer Products (America), Inc.
 Software Solutions Division

 e-mail: mhira...@redhat.com




Re: [PATCH -tip v14 01/12] x86: instruction decoder API

2009-08-20 Thread Frederic Weisbecker
On Thu, Aug 20, 2009 at 12:16:05PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
 On Thu, Aug 20, 2009 at 11:03:40AM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
 On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote:
 On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote:
 Add x86 instruction decoder to arch-specific libraries. This decoder
 can decode x86 instructions used in kernel into prefix, opcode, modrm,
 sib, displacement and immediates. This can also show the length of
 instructions.

 This version introduces instruction attributes for decoding instructions.
 The instruction attribute tables are generated from the opcode map file
 (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk).

 Currently, the opcode maps are based on opcode maps in Intel(R) 64 and
 IA-32 Architectures Software Developers Manual Vol.2: Appendix.A,
 and consist of below two types of opcode tables.

 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are
 written as below;

Table: table-name
Referrer: escaped-name
opcode: mnemonic|GrpXXX [operand1[,operand2...]] 
 [(extra1)[,(extra2)...] [| 2nd-mnemonic ...]
 (or)
opcode: escape # escaped-name
EndTable

 Group opcodes, which has 8 elements, are written as below;

GrpTable: GrpXXX
reg:  mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 
 2nd-mnemonic ...]
EndTable

 These opcode maps include a few SSE and FP opcodes (for setup), because
 those opcodes are used in the kernel.



 I'm getting the following build error on an old K7 box:

 arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’:
 arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first 
 use in this function)
 arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported 
 only once
 arch/x86/lib/inat.c:29: erreur: for each function it appears in.)


 I've attached my config. I haven't such problem on a dual x86-64 box.


 Actually I have the same problem in x86-64
 The content of my arch/x86/lib/inat-tables.c:

 /* x86 opcode map generated from x86-opcode-map.txt */
 /* Do not change this code. */
 /* Table: one byte opcode */
 /* Escape opcode map array */
 const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 
 1][INAT_LPREFIX_MAX + 1] = {
 };

 /* Group opcode map array */
 const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 
 1][INAT_LPREFIX_MAX + 1] = {
 };


 I guess there is a problem with the generation of this file.

 Aah, you may use mawk on Ubuntu 9.04, right?
 If so, unfortunately, mawk is still under development.

 http://invisible-island.net/mawk/CHANGES



 Aargh...


 20090727
add check/fix to prevent gsub from recurring to modify on a substring
of the current line when the regular expression is anchored to the
beginning of the line; fixes gawk's anchgsub testcase.

add check for implicit concatenation mistaken for exponent; fixes
gawk's hex testcase.

add character-classes to built-in regular expressions.
  ^^
 Look, this means we can't use char-class expressions like
 [:lower:] until this version...

 And I've found another bug in mawk-1.3.3-20090728(the latest one).
 it almost works, but;

 $ mawk 'BEGIN {printf("0x%x\n", 0)}'
 0x1


 Ouch, indeed.



 $ gawk 'BEGIN {printf("0x%x\n", 0)}'
 0x0

 This bug skips an array element index 0x0 in inat-tables.c :(

 So, I recommend you to install gawk instead mawk until that
 supports all posix-awk features, since I don't think it is
 good idea to avoid all those bugs which depends on
 implementation (not specification).


 Thank you,



 Yeah, indeed. May be add a warning (or build error) in case the user uses
 mawk?

 Hmm, it is possible that mawk will fix those bugs and catch up soon,
 so, I think checking mawk is not a good idea.
 (and since there will be other awk implementations, it's not fair.)

 I think what all I can do now is reporting bugs to
 mawk and ubuntu people.:-)



Yeah, but without your tip I wouldn't have been able to find the origin
for quite some time.
And the kernel couldn't build anyway.

At least we should do something with this version of mawk.




 Anyway that works fine now with gawk, thanks!
 All your patches build well :-)

 Thank you for testing!

 -- 
 Masami Hiramatsu

 Software Engineer
 Hitachi Computer Products (America), Inc.
 Software Solutions Division

 e-mail: mhira...@redhat.com




Re: [TOOL] kprobestest : Kprobe stress test tool

2009-08-20 Thread Frederic Weisbecker
On Thu, Aug 13, 2009 at 04:57:20PM -0400, Masami Hiramatsu wrote:
 This script tests kprobes to probe on all symbols in the kernel and finds
 symbols which must be blacklisted.


 Usage
 -
   kprobestest [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]
  Run stress test. If SYMLIST file is specified, use it as
  an initial symbol list (This is useful for verifying white list
  after diagnosing all symbols).

   kprobestest cleanup
  Cleanup all lists


 How to Work
 ---
 This tool list up all symbols in the kernel via /proc/kallsyms, and sorts
 it into groups (each of them including 64 symbols in default). And then,
 it tests each group by using kprobe-tracer. If a kernel crash occurred,
 that group is moved into 'failed' dir. If the group passed the test, this
 script moves it into 'passed' dir and saves kprobe_profile into
 'passed/profiles/'.
 After testing all groups, all 'failed' groups are merged and sorted into
 smaller groups (divided by 4, in default). And those are tested again.
 This loop will be repeated until all group has just 1 symbol.

 Finally, the script sorts all 'passed' symbols into 'tested', 'untested',
 and 'missed' based on profiles.


 Note
 
  - This script just gives us some clues to the blacklisted functions.
In some cases, a combination of probe points will cause a problem, but
each of them doesn't cause the problem alone.

 Thank you,



This script easily makes my x86-64 dual core hard-lock
on the 1st batch of symbols to test.
I have one symbol list in the failed and unset directories:

int_very_careful
int_signal
int_restore_rest
stub_clone
stub_fork
stub_vfork
stub_sigaltstack
stub_iopl
ptregscall_common
stub_execve
stub_rt_sigreturn
irq_entries_start
common_interrupt
ret_from_intr
exit_intr
retint_with_reschedule
retint_check
retint_swapgs
retint_restore_args
restore_args
irq_return
retint_careful
retint_signal
retint_kernel
irq_move_cleanup_interrupt
reboot_interrupt
apic_timer_interrupt
generic_interrupt
invalidate_interrupt0
invalidate_interrupt1
invalidate_interrupt2
invalidate_interrupt3
invalidate_interrupt4
invalidate_interrupt5
invalidate_interrupt6
invalidate_interrupt7
threshold_interrupt
thermal_interrupt
mce_self_interrupt
call_function_single_interrupt
call_function_interrupt
reschedule_interrupt
error_interrupt
spurious_interrupt
perf_pending_interrupt
divide_error
overflow
bounds
invalid_op
device_not_available
double_fault
coprocessor_segment_overrun
invalid_TSS
segment_not_present
spurious_interrupt_bug
coprocessor_error
alignment_check
simd_coprocessor_error
native_load_gs_index
gs_change
kernel_thread
child_rip
kernel_execve
call_softirq


I don't have a crash log because I was running with X.
But it also happened with other batches of symbols.

The problem is that I don't have any serial line on this
box, so I can't catch any crash log.
My K7 testbox also died in my arms this afternoon.

But I still have two other testboxes (one P2 and one P3);
hopefully I can reproduce the problem on those boxes,
to which I can connect a serial line.

I've pushed your patches in the following git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git \
tracing/kprobes

So you can send patches on top of this one.

Config attached:

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.31-rc5
# Thu Aug 20 19:35:39 2009
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_OUTPUT_FORMAT=elf64-x86-64
CONFIG_ARCH_DEFCONFIG=arch/x86/configs/x86_64_defconfig
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
CONFIG_CONSTRUCTORS=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y

Re: [PATCH -tip v14 01/12] x86: instruction decoder API

2009-08-20 Thread Frederic Weisbecker
On Thu, Aug 20, 2009 at 03:01:25PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
 On Thu, Aug 20, 2009 at 12:16:05PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
 On Thu, Aug 20, 2009 at 11:03:40AM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
 On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote:
 On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote:
 Add x86 instruction decoder to arch-specific libraries. This decoder
 can decode x86 instructions used in kernel into prefix, opcode, modrm,
 sib, displacement and immediates. This can also show the length of
 instructions.

 This version introduces instruction attributes for decoding 
 instructions.
 The instruction attribute tables are generated from the opcode map file
 (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk).

 Currently, the opcode maps are based on opcode maps in Intel(R) 64 and
 IA-32 Architectures Software Developers Manual Vol.2: Appendix.A,
 and consist of below two types of opcode tables.

 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are
 written as below;

 Table: table-name
 Referrer: escaped-name
 opcode: mnemonic|GrpXXX [operand1[,operand2...]] 
 [(extra1)[,(extra2)...] [| 2nd-mnemonic ...]
  (or)
 opcode: escape # escaped-name
 EndTable

 Group opcodes, which has 8 elements, are written as below;

 GrpTable: GrpXXX
 reg:  mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 
 2nd-mnemonic ...]
 EndTable

 These opcode maps include a few SSE and FP opcodes (for setup), because
 those opcodes are used in the kernel.



 I'm getting the following build error on an old K7 box:

 arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’:
 arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first 
 use in this function)
 arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported 
 only once
 arch/x86/lib/inat.c:29: erreur: for each function it appears in.)


 I've attached my config. I haven't such problem on a dual x86-64 box.


 Actually I have the same problem in x86-64
 The content of my arch/x86/lib/inat-tables.c:

 /* x86 opcode map generated from x86-opcode-map.txt */
 /* Do not change this code. */
 /* Table: one byte opcode */
 /* Escape opcode map array */
 const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 
 1][INAT_LPREFIX_MAX + 1] = {
 };

 /* Group opcode map array */
 const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 
 1][INAT_LPREFIX_MAX + 1] = {
 };


 I guess there is a problem with the generation of this file.

 Aah, you may use mawk on Ubuntu 9.04, right?
 If so, unfortunately, mawk is still under development.

 http://invisible-island.net/mawk/CHANGES



 Aargh...


 20090727
  add check/fix to prevent gsub from recurring to modify on a substring
  of the current line when the regular expression is anchored to the
  beginning of the line; fixes gawk's anchgsub testcase.

  add check for implicit concatenation mistaken for exponent; fixes
  gawk's hex testcase.

  add character-classes to built-in regular expressions.
   ^^
 Look, this means we can't use char-class expressions like
 [:lower:] until this version...

 And I've found another bug in mawk-1.3.3-20090728 (the latest one).
 It almost works, but:

 $ mawk 'BEGIN {printf("0x%x\n", 0)}'
 0x1


 Ouch, indeed.



 $ gawk 'BEGIN {printf("0x%x\n", 0)}'
 0x0

 This bug skips an array element index 0x0 in inat-tables.c :(

  So, I recommend installing gawk instead of mawk until mawk
  supports all POSIX awk features, since I don't think it is a
  good idea to work around all those bugs, which depend on the
  implementation (not the specification).


 Thank you,



 Yeah, indeed. Maybe add a warning (or build error) in case the user uses
 mawk?

  Hmm, it is possible that mawk will fix those bugs and catch up soon,
  so I think checking specifically for mawk is not a good idea.
  (And since there are other awk implementations, it wouldn't be fair.)

  I think all I can do now is report the bugs to the
  mawk and Ubuntu people. :-)



 Yeah, but without your tip I wouldn't have been able to find the origin
 for quite some time.
 And the kernel couldn't build anyway.

 At least we should do something with this version of mawk.

 Hm, indeed.
  Maybe we can run an additional sanity-check script before using
  awk, like this:

 ---
  res=`echo a | $AWK '/[[:lower:]]+/{print "OK"}'`
  [ "$res" != "OK" ] && exit 1
 
  res=`$AWK 'BEGIN {printf("%x", 0)}'`
  [ "$res" != "0" ] && exit 1

 exit 0
 ---

 Thanks,



Yeah, that looks good.

Thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip v14 07/12] tracing: Introduce TRACE_FIELD_ZERO() macro

2009-08-19 Thread Frederic Weisbecker
On Tue, Aug 18, 2009 at 10:20:11PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
  On Thu, Aug 13, 2009 at 04:35:01PM -0400, Masami Hiramatsu wrote:
  Use TRACE_FIELD_ZERO(type, item) instead of TRACE_FIELD_ZERO_CHAR(item).
  This also includes a fix of TRACE_ZERO_CHAR() macro.
  
  
  I can't find what the fix is about (see below)
 
 Ah, OK. This patch actually includes two parts.
 
 One is introducing TRACE_FIELD_ZERO which is more generic than
 TRACE_FIELD_ZERO_CHAR, I think.
 
 Another is a typo fix of TRACE_ZERO_CHAR.
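
For reference, the new macro's format-dump expansion looks roughly like this (reconstructed from the trace_export.c hunk of this patch, with the string quoting restored by hand, so treat it as a sketch rather than the exact committed text):

---
#undef TRACE_FIELD_ZERO
#define TRACE_FIELD_ZERO(type, item)					\
	ret = trace_seq_printf(s, "\tfield:" #type " " #item ";\t"	\
			       "offset:%u;\tsize:0;\n",			\
			       (unsigned int)offsetof(typeof(field), item)); \
	if (!ret)							\
		return 0;

/* so callers write TRACE_FIELD_ZERO(char, buf) instead of TRACE_FIELD_ZERO_CHAR(buf) */
---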


Ok.

 
   
  +#undef TRACE_FIELD_ZERO
  +#define TRACE_FIELD_ZERO(type, item)
  +
  
  
  
  Is it about the above moving?
  If so, could you just tell so that I can add something about
  it in the changelog.
 
 No, I assume that TRACE_ZERO_CHAR is just a typo of TRACE_FIELD_ZERO_CHAR.
 (because I couldn't find any other TRACE_ZERO_CHAR)
 
 BTW, this patch may not be needed after applying patch 10/12, since
 it removes ftrace event definitions of TRACE_KPROBE/KRETPROBE.
 
  Perhaps I had better merge and split those additional patches (and
  remove this change)?
  (That could also make the incremental review harder...)
 
 Thank you,
 


No let's keep it as is. It should be fine.

Thanks.


 -- 
 Masami Hiramatsu
 
 Software Engineer
 Hitachi Computer Products (America), Inc.
 Software Solutions Division
 
 e-mail: mhira...@redhat.com
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip v14 01/12] x86: instruction decoder API

2009-08-19 Thread Frederic Weisbecker
On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote:
 Add x86 instruction decoder to arch-specific libraries. This decoder
 can decode x86 instructions used in kernel into prefix, opcode, modrm,
 sib, displacement and immediates. This can also show the length of
 instructions.
 
 This version introduces instruction attributes for decoding instructions.
 The instruction attribute tables are generated from the opcode map file
 (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk).
 
 Currently, the opcode maps are based on opcode maps in Intel(R) 64 and
 IA-32 Architectures Software Developers Manual Vol.2: Appendix.A,
 and consist of below two types of opcode tables.
 
 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are
 written as below;
 
  Table: table-name
  Referrer: escaped-name
  opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 
 2nd-mnemonic ...]
   (or)
  opcode: escape # escaped-name
  EndTable
 
 Group opcodes, which has 8 elements, are written as below;
 
  GrpTable: GrpXXX
  reg:  mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 
 2nd-mnemonic ...]
  EndTable
 
 These opcode maps include a few SSE and FP opcodes (for setup), because
 those opcodes are used in the kernel.
 


I'm getting the following build error on an old K7 box:

arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’:
arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in 
this function)
arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only 
once
arch/x86/lib/inat.c:29: erreur: for each function it appears in.)


I've attached my config. I haven't such problem on a dual x86-64 box.

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.31-rc5
# Fri Jun 11 05:59:12 2010
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_OUTPUT_FORMAT=elf32-i386
CONFIG_ARCH_DEFCONFIG=arch/x86/configs/i386_defconfig
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
CONFIG_CONSTRUCTORS=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
# CONFIG_AUDITSYSCALL is not set

#
# RCU Subsystem
#
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
# CONFIG_GROUP_SCHED is not set
# CONFIG_CGROUPS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y

Re: [PATCH -tip v14 01/12] x86: instruction decoder API

2009-08-19 Thread Frederic Weisbecker
On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote:
 On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote:
  Add x86 instruction decoder to arch-specific libraries. This decoder
  can decode x86 instructions used in kernel into prefix, opcode, modrm,
  sib, displacement and immediates. This can also show the length of
  instructions.
  
  This version introduces instruction attributes for decoding instructions.
  The instruction attribute tables are generated from the opcode map file
  (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk).
  
  Currently, the opcode maps are based on opcode maps in Intel(R) 64 and
  IA-32 Architectures Software Developers Manual Vol.2: Appendix.A,
  and consist of below two types of opcode tables.
  
  1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are
  written as below;
  
   Table: table-name
   Referrer: escaped-name
   opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] 
  [| 2nd-mnemonic ...]
(or)
   opcode: escape # escaped-name
   EndTable
  
  Group opcodes, which has 8 elements, are written as below;
  
   GrpTable: GrpXXX
   reg:  mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 
  2nd-mnemonic ...]
   EndTable
  
  These opcode maps include a few SSE and FP opcodes (for setup), because
  those opcodes are used in the kernel.
  
 
 
 I'm getting the following build error on an old K7 box:
 
 arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’:
 arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in 
 this function)
 arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only 
 once
 arch/x86/lib/inat.c:29: erreur: for each function it appears in.)
 
 
 I've attached my config. I haven't such problem on a dual x86-64 box.


Actually I have the same problem in x86-64
The content of my arch/x86/lib/inat-tables.c:

/* x86 opcode map generated from x86-opcode-map.txt */
/* Do not change this code. */
/* Table: one byte opcode */
/* Escape opcode map array */
const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 1][INAT_LPREFIX_MAX 
+ 1] = {
};

/* Group opcode map array */
const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 1][INAT_LPREFIX_MAX + 
1] = {
};


I guess there is a problem with the generation of this file.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip v14 03/12] kprobes: checks probe address is instruction boudary on x86

2009-08-18 Thread Frederic Weisbecker
On Thu, Aug 13, 2009 at 04:34:28PM -0400, Masami Hiramatsu wrote:
  Ensure the safety of inserting kprobes by checking whether the specified
  address is at the first byte of an instruction on x86.
  This is done by decoding the probed function from its head to the probe point.
 
 Signed-off-by: Masami Hiramatsu mhira...@redhat.com
 Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com
 Cc: Avi Kivity a...@redhat.com
 Cc: Andi Kleen a...@linux.intel.com
 Cc: Christoph Hellwig h...@infradead.org
 Cc: Frank Ch. Eigler f...@redhat.com
 Cc: Frederic Weisbecker fweis...@gmail.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Jason Baron jba...@redhat.com
 Cc: Jim Keniston jkeni...@us.ibm.com
 Cc: K.Prasad pra...@linux.vnet.ibm.com
 Cc: Lai Jiangshan la...@cn.fujitsu.com
 Cc: Li Zefan l...@cn.fujitsu.com
 Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
 Cc: Roland McGrath rol...@redhat.com
 Cc: Sam Ravnborg s...@ravnborg.org
 Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Tom Zanussi tzanu...@gmail.com
 Cc: Vegard Nossum vegard.nos...@gmail.com
 ---
 
  arch/x86/kernel/kprobes.c |   69 
 +
  1 files changed, 69 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
 index b5b1848..80d493f 100644
 --- a/arch/x86/kernel/kprobes.c
 +++ b/arch/x86/kernel/kprobes.c
 @@ -48,6 +48,7 @@
  #include linux/preempt.h
  #include linux/module.h
  #include linux/kdebug.h
 +#include linux/kallsyms.h
  
  #include asm/cacheflush.h
  #include asm/desc.h
 @@ -55,6 +56,7 @@
  #include asm/uaccess.h
  #include asm/alternative.h
  #include asm/debugreg.h
 +#include asm/insn.h
  
  void jprobe_return_end(void);
  
 @@ -245,6 +247,71 @@ retry:
   }
  }
  
 +/* Recover the probed instruction at addr for further analysis. */
 +static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long 
 addr)
 +{
 + struct kprobe *kp;
 + kp = get_kprobe((void *)addr);
 + if (!kp)
 + return -EINVAL;
 +
 + /*
 +  *  Basically, kp->ainsn.insn has an original instruction.
 +  *  However, RIP-relative instruction can not do single-stepping
 +  *  at different place, fix_riprel() tweaks the displacement of
 +  *  that instruction. In that case, we can't recover the instruction
 +  *  from the kp->ainsn.insn.
 +  *
 +  *  On the other hand, kp->opcode has a copy of the first byte of
 +  *  the probed instruction, which is overwritten by int3. And
 +  *  the instruction at kp->addr is not modified by kprobes except
 +  *  for the first byte, we can recover the original instruction
 +  *  from it and kp->opcode.
 +  */
 + memcpy(buf, kp->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
 + buf[0] = kp->opcode;
 + return 0;
 +}
 +
 +/* Dummy buffers for kallsyms_lookup */
 +static char __dummy_buf[KSYM_NAME_LEN];
 +
 +/* Check if paddr is at an instruction boundary */
 +static int __kprobes can_probe(unsigned long paddr)
 +{
 + int ret;
 + unsigned long addr, offset = 0;
 + struct insn insn;
 + kprobe_opcode_t buf[MAX_INSN_SIZE];
 +
 + if (!kallsyms_lookup(paddr, NULL, &offset, NULL, __dummy_buf))
 + return 0;
 +
 + /* Decode instructions */
 + addr = paddr - offset;
 + while (addr < paddr) {
 + kernel_insn_init(&insn, (void *)addr);
 + insn_get_opcode(&insn);
 +
 + /* Check if the instruction has been modified. */
 + if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
 + ret = recover_probed_instruction(buf, addr);



I'm confused about the reason for this recovering. Is it meant to remove
kprobes set before the one currently being inserted in the same function?

If such cleanup is needed for whatever reason, I wonder what happens
to the corresponding kprobe structure, and why it isn't using the arch_disarm_
helper to patch the text back?

(Questions that may just prove my solid misunderstanding of the kprobes code ;-)

Frederic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip v14 03/12] kprobes: checks probe address is instruction boudary on x86

2009-08-18 Thread Frederic Weisbecker
On Tue, Aug 18, 2009 at 07:17:39PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
  +  while (addr < paddr) {
  +  kernel_insn_init(&insn, (void *)addr);
  +  insn_get_opcode(&insn);
  +
  +  /* Check if the instruction has been modified. */
  +  if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
  +  ret = recover_probed_instruction(buf, addr);
  
  
  
  I'm confused about the reason of this recovering. Is it to remove
  kprobes behind the current setting one in the current function?
 
 No, it just recovers the instruction that is probed by a kprobe,
 because we need to know the first byte of this instruction in order to
 decode it.
 
 Perhaps we'd better have a more generic interface (text_peek?)
 for it, because another subsystem (e.g. kgdb) may want to insert int3...
 
 Thank you,


Aah, I see now: it's to keep a sane check of the instruction
boundaries without int3 artifacts in the middle.

But in that case, you should re-arm the breakpoint after your
check, right?

Or maybe you could do the check without repatching?
Perhaps by making a copy of insn.opcode.bytes and replacing bytes[0]
with what a random kprobe has stolen?
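
Something like the following minimal sketch of that idea (purely hypothetical, with made-up names; only get_kprobe(), kp->opcode, MAX_INSN_SIZE and the decoder calls are assumed from the existing code):

---
#include <linux/kprobes.h>
#include <linux/string.h>
#include <asm/insn.h>

/* Decode from a private copy, so the live int3 never has to be removed. */
static int decode_shadow_copy(unsigned long addr, struct insn *insn)
{
	kprobe_opcode_t buf[MAX_INSN_SIZE];
	struct kprobe *kp = get_kprobe((void *)addr);

	memcpy(buf, (void *)addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
	if (kp)
		buf[0] = kp->opcode;	/* undo the int3 in the copy only */

	kernel_insn_init(insn, buf);
	insn_get_length(insn);
	return insn->length;
}
---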

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip v14 07/12] tracing: Introduce TRACE_FIELD_ZERO() macro

2009-08-18 Thread Frederic Weisbecker
On Thu, Aug 13, 2009 at 04:35:01PM -0400, Masami Hiramatsu wrote:
 Use TRACE_FIELD_ZERO(type, item) instead of TRACE_FIELD_ZERO_CHAR(item).
 This also includes a fix of TRACE_ZERO_CHAR() macro.


I can't find what the fix is about (see below)



 
 Signed-off-by: Masami Hiramatsu mhira...@redhat.com
 Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
 Cc: Avi Kivity a...@redhat.com
 Cc: Andi Kleen a...@linux.intel.com
 Cc: Christoph Hellwig h...@infradead.org
 Cc: Frank Ch. Eigler f...@redhat.com
 Cc: Frederic Weisbecker fweis...@gmail.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Jason Baron jba...@redhat.com
 Cc: Jim Keniston jkeni...@us.ibm.com
 Cc: K.Prasad pra...@linux.vnet.ibm.com
 Cc: Lai Jiangshan la...@cn.fujitsu.com
 Cc: Li Zefan l...@cn.fujitsu.com
 Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
 Cc: Roland McGrath rol...@redhat.com
 Cc: Sam Ravnborg s...@ravnborg.org
 Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Tom Zanussi tzanu...@gmail.com
 Cc: Vegard Nossum vegard.nos...@gmail.com
 ---
 
  kernel/trace/trace_event_types.h |4 ++--
  kernel/trace/trace_export.c  |   16 
  2 files changed, 10 insertions(+), 10 deletions(-)
 
 diff --git a/kernel/trace/trace_event_types.h 
 b/kernel/trace/trace_event_types.h
 index 6db005e..e74f090 100644
 --- a/kernel/trace/trace_event_types.h
 +++ b/kernel/trace/trace_event_types.h
 @@ -109,7 +109,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, 
 ignore,
   TRACE_STRUCT(
   TRACE_FIELD(unsigned long, ip, ip)
   TRACE_FIELD(char *, fmt, fmt)
 - TRACE_FIELD_ZERO_CHAR(buf)
 + TRACE_FIELD_ZERO(char, buf)
   ),
   TP_RAW_FMT(%08lx (%d) fmt:%p %s)
  );
 @@ -117,7 +117,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, 
 ignore,
  TRACE_EVENT_FORMAT(print, TRACE_PRINT, print_entry, ignore,
   TRACE_STRUCT(
   TRACE_FIELD(unsigned long, ip, ip)
 - TRACE_FIELD_ZERO_CHAR(buf)
 + TRACE_FIELD_ZERO(char, buf)
   ),
   TP_RAW_FMT(%08lx (%d) fmt:%p %s)
  );
 diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c
 index 71c8d7f..b0ac92c 100644
 --- a/kernel/trace/trace_export.c
 +++ b/kernel/trace/trace_export.c
 @@ -42,9 +42,9 @@ extern void __bad_type_size(void);
   if (!ret)   \
   return 0;
  
 -#undef TRACE_FIELD_ZERO_CHAR
 -#define TRACE_FIELD_ZERO_CHAR(item)  \
 - ret = trace_seq_printf(s, "\tfield:char " #item ";\t"   \
 +#undef TRACE_FIELD_ZERO
 +#define TRACE_FIELD_ZERO(type, item) \
 + ret = trace_seq_printf(s, "\tfield:" #type " " #item ";\t"  \
  "offset:%u;\tsize:0;\n", \
  (unsigned int)offsetof(typeof(field), item)); \
   if (!ret)   \
 @@ -92,9 +92,6 @@ ftrace_format_##call(struct ftrace_event_call *unused,  
 \
  
  #include trace_event_types.h
  
 -#undef TRACE_ZERO_CHAR
 -#define TRACE_ZERO_CHAR(arg)
 -
  #undef TRACE_FIELD
  #define TRACE_FIELD(type, item, assign)\
   entry-item = assign;
 @@ -107,6 +104,9 @@ ftrace_format_##call(struct ftrace_event_call *unused,
 \
  #define TRACE_FIELD_SIGN(type, item, assign, is_signed)  \
   TRACE_FIELD(type, item, assign)
  
 +#undef TRACE_FIELD_ZERO
 +#define TRACE_FIELD_ZERO(type, item)
 +



Is it about the above moving?
If so, could you just tell so that I can add something about
it in the changelog.

Thanks.
Frederic.




  #undef TP_CMD
  #define TP_CMD(cmd...)   cmd
  
 @@ -178,8 +178,8 @@ __attribute__((section(_ftrace_events))) event_##call = 
 {   \
   if (ret)\
   return ret;
  
 -#undef TRACE_FIELD_ZERO_CHAR
 -#define TRACE_FIELD_ZERO_CHAR(item)
 +#undef TRACE_FIELD_ZERO
 +#define TRACE_FIELD_ZERO(type, item)
  
  #undef TRACE_EVENT_FORMAT
  #define TRACE_EVENT_FORMAT(call, proto, args, fmt, tstruct, tpfmt)   \
 
 
 -- 
 Masami Hiramatsu
 
 Software Engineer
 Hitachi Computer Products (America), Inc.
 Software Solutions Division
 
 e-mail: mhira...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer

2009-07-07 Thread Frederic Weisbecker
On Tue, Jun 30, 2009 at 09:09:23PM -0400, Masami Hiramatsu wrote:
 Add kprobes-based event tracer on ftrace.
 
  This tracer is similar to the events tracer, which is based on the Tracepoint
  infrastructure. Instead of Tracepoints, this tracer is based on kprobes (kprobe
  and kretprobe). It can probe anywhere kprobes can probe (this means all
  function bodies except for __kprobes functions).
  
  Similar to the events tracer, this tracer doesn't need to be activated via
  current_tracer; instead, you just set probe points via
  /sys/kernel/debug/tracing/kprobe_events. And you can set filters on each
  probe event via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.
 
 This tracer supports following probe arguments for each probe.
 
   %REG  : Fetch register REG
   sN: Fetch Nth entry of stack (N = 0)
   @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
   @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
   aN: Fetch function argument. (N = 0)
   rv: Fetch return value.
   ra: Fetch return address.
   +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.
 
 See Documentation/trace/kprobes.txt for details.
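
For example, a user-space sketch of driving that file (hypothetical; the event name, symbol and fetch arguments are only examples of the syntax above):

---
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/tracing/kprobe_events", "w");

	if (!f)
		return 1;
	/* probe do_fork, name the event "myfork", fetch the first two arguments */
	fprintf(f, "p:myfork do_fork a0 a1\n");
	fclose(f);
	return 0;
}
---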
 
 Changes from v9:
  - Select CONFIG_GENERIC_TRACER when CONFIG_KPROBE_TRACER=y.
 
 Signed-off-by: Masami Hiramatsu mhira...@redhat.com
 Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com
 Cc: Christoph Hellwig h...@infradead.org
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Frederic Weisbecker fweis...@gmail.com
 Cc: Tom Zanussi tzanu...@gmail.com
 ---
 
  Documentation/trace/kprobes.txt  |  138 
  kernel/trace/Kconfig |   12 
  kernel/trace/Makefile|1 
  kernel/trace/trace.h |   22 +
  kernel/trace/trace_event_types.h |   20 +
  kernel/trace/trace_kprobe.c  | 1183 
 ++
  6 files changed, 1376 insertions(+), 0 deletions(-)
  create mode 100644 Documentation/trace/kprobes.txt
  create mode 100644 kernel/trace/trace_kprobe.c
 
 diff --git a/Documentation/trace/kprobes.txt b/Documentation/trace/kprobes.txt
 new file mode 100644
 index 000..3a90ebb
 --- /dev/null
 +++ b/Documentation/trace/kprobes.txt
 @@ -0,0 +1,138 @@
 + Kprobe-based Event Tracer
 + =
 +
 + Documentation is written by Masami Hiramatsu
 +
 +
 +Overview
 +
 +This tracer is similar to the events tracer which is based on Tracepoint
 +infrastructure. Instead of Tracepoint, this tracer is based on kprobes(kprobe
 +and kretprobe). It probes anywhere where kprobes can probe(this means, all
 +functions body except for __kprobes functions).
 +
 +Unlike the function tracer, this tracer can probe instructions inside of
 +kernel functions. It allows you to check which instruction has been executed.
 +
 +Unlike the Tracepoint based events tracer, this tracer can add and remove
 +probe points on the fly.
 +
 +Similar to the events tracer, this tracer doesn't need to be activated via
 +current_tracer, instead of that, just set probe points via
 +/sys/kernel/debug/tracing/kprobe_events. And you can set filters on each
 +probe events via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter.
 +
 +
 +Synopsis of kprobe_events
 +-
 +  p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS]  : set a probe
 +  r[:EVENT] SYMBOL[+0] [FETCHARGS]   : set a return probe
 +
 + EVENT   : Event name
 + SYMBOL[+offs|-offs] : Symbol+offset where the probe is inserted
 + MEMADDR : Address where the probe is inserted
 +
 + FETCHARGS   : Arguments
 +  %REG   : Fetch register REG
 +  sN : Fetch Nth entry of stack (N = 0)
 +  @ADDR  : Fetch memory at ADDR (ADDR should be in kernel)
 +  @SYM[+|-offs]  : Fetch memory at SYM +|- offs (SYM should be a data 
 symbol)
 +  aN : Fetch function argument. (N = 0)(*)
 +  rv : Fetch return value.(**)
 +  ra : Fetch return address.(**)
 +  +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.(***)
 +
  +  (*) aN may not be correct on asmlinkage'd functions and in the middle of
 +  function body.
 +  (**) only for return probe.
 +  (***) this is useful for fetching a field of data structures.
 +
 +
 +Per-Probe Event Filtering
 +-
 + Per-probe event filtering feature allows you to set different filter on each
 +probe and gives you what arguments will be shown in trace buffer. If an event
 +name is specified right after 'p:' or 'r:' in kprobe_events, the tracer adds
 +an event under tracing/events/kprobes/EVENT, at the directory you can see
 +'id', 'enabled', 'format' and 'filter'.
 +
 +enabled:
 +  You can enable/disable the probe by writing 1 or 0 on it.
 +
 +format:
 +  It shows the format of this probe event. It also shows aliases of arguments
 + which you specified to kprobe_events.
 +
 +filter:
 +  You can write filtering rules of this event

Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer

2009-07-07 Thread Frederic Weisbecker
On Tue, Jul 07, 2009 at 03:55:28PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
  diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
  index 206cb7d..65945eb 100644
  --- a/kernel/trace/trace.h
  +++ b/kernel/trace/trace.h
  @@ -45,6 +45,8 @@ enum trace_type {
 TRACE_POWER,
 TRACE_BLK,
 TRACE_KSYM,
  +  TRACE_KPROBE,
  +  TRACE_KRETPROBE,
   
 __TRACE_LAST_TYPE,
   };
  @@ -227,6 +229,22 @@ struct trace_ksym {
 charksym_name[KSYM_NAME_LEN];
 charp_name[TASK_COMM_LEN];
   };
  +#define TRACE_KPROBE_ARGS 6
  +
  +struct kprobe_trace_entry {
  +  struct trace_entry  ent;
  +  unsigned long   ip;
  +  int nargs;
  +  unsigned long   args[TRACE_KPROBE_ARGS];
  
  
  
  I see that you actually make use of arg as a dynamic sizeable
  array.
  For clarity, args[TRACE_KPROBE_ARGS] could be args[0].
  
  It's just a neat and wouldn't affect the code nor the data
  but would be clearer for readers of that code.
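
Roughly like this minimal sketch of the args[0] idea (hypothetical, not the patch as posted; SIZEOF_KPROBE_TRACE_ENTRY is an invented helper name, and the struct assumes the surrounding trace.h definitions):

---
struct kprobe_trace_entry {
	struct trace_entry	ent;
	unsigned long		ip;
	int			nargs;
	unsigned long		args[];		/* nargs elements follow the header */
};

/* size of one ring-buffer entry carrying n arguments */
#define SIZEOF_KPROBE_TRACE_ENTRY(n)	\
	(offsetof(struct kprobe_trace_entry, args) + sizeof(unsigned long) * (n))
---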
 
 Hmm. In that case, I think we'll need a new macro for field
 definition, like TRACE_FIELD_ZERO(type, item).



You mean for trace_define_field(), to describe the fields of events?
Actually the fields should be defined dynamically depending on how
the kprobe event is built (which arguments are requested, how many,
etc.).

Frederic.


 
  +};
  +
  +struct kretprobe_trace_entry {
  +  struct trace_entry  ent;
  +  unsigned long   func;
  +  unsigned long   ret_ip;
  +  int nargs;
  +  unsigned long   args[TRACE_KPROBE_ARGS];
  +};
  
  
  ditto
  
  

   /*
* trace_flag_type is an enumeration that holds different
  @@ -344,6 +362,10 @@ extern void __ftrace_bad_type(void);
 IF_ASSIGN(var, ent, struct syscall_trace_exit,  \
   TRACE_SYSCALL_EXIT);  \
 IF_ASSIGN(var, ent, struct trace_ksym, TRACE_KSYM); \
  +  IF_ASSIGN(var, ent, struct kprobe_trace_entry,  \
  +TRACE_KPROBE);\
  +  IF_ASSIGN(var, ent, struct kretprobe_trace_entry,   \
  +TRACE_KRETPROBE); \
 __ftrace_bad_type();\
 } while (0)
   
  diff --git a/kernel/trace/trace_event_types.h 
  b/kernel/trace/trace_event_types.h
  index 6db005e..ec2e6f3 100644
  --- a/kernel/trace/trace_event_types.h
  +++ b/kernel/trace/trace_event_types.h
  @@ -175,4 +175,24 @@ TRACE_EVENT_FORMAT(kmem_free, TRACE_KMEM_FREE, 
  kmemtrace_free_entry, ignore,
 TP_RAW_FMT(type:%u call_site:%lx ptr:%p)
   );
   
  +TRACE_EVENT_FORMAT(kprobe, TRACE_KPROBE, kprobe_trace_entry, ignore,
  +  TRACE_STRUCT(
  +  TRACE_FIELD(unsigned long, ip, ip)
  +  TRACE_FIELD(int, nargs, nargs)
  +  TRACE_FIELD_SPECIAL(unsigned long args[TRACE_KPROBE_ARGS],
  +  args, TRACE_KPROBE_ARGS, args)
  +  ),
  +  TP_RAW_FMT(%08lx: args:0x%lx ...)
  +);
  +
  +TRACE_EVENT_FORMAT(kretprobe, TRACE_KRETPROBE, kretprobe_trace_entry, 
  ignore,
  +  TRACE_STRUCT(
  +  TRACE_FIELD(unsigned long, func, func)
  +  TRACE_FIELD(unsigned long, ret_ip, ret_ip)
  +  TRACE_FIELD(int, nargs, nargs)
  +  TRACE_FIELD_SPECIAL(unsigned long args[TRACE_KPROBE_ARGS],
  +  args, TRACE_KPROBE_ARGS, args)
  +  ),
  +  TP_RAW_FMT(%08lx - %08lx: args:0x%lx ...)
  +);
   #undef TRACE_SYSTEM
  diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
  new file mode 100644
  index 000..0951512
  --- /dev/null
  +++ b/kernel/trace/trace_kprobe.c
  @@ -0,0 +1,1183 @@
  +/*
  + * kprobe based kernel tracer
  + *
  + * Created by Masami Hiramatsu mhira...@redhat.com
  + *
  + * This program is free software; you can redistribute it and/or modify
  + * it under the terms of the GNU General Public License version 2 as
  + * published by the Free Software Foundation.
  + *
  + * This program is distributed in the hope that it will be useful,
  + * but WITHOUT ANY WARRANTY; without even the implied warranty of
  + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  + * GNU General Public License for more details.
  + *
  + * You should have received a copy of the GNU General Public License
  + * along with this program; if not, write to the Free Software
  + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  
  USA
  + */
  +
  +#include linux/module.h
  +#include linux/uaccess.h
  +#include linux/kprobes.h
  +#include linux/seq_file.h
  +#include linux/slab.h
  +#include linux/smp.h
  +#include linux/debugfs.h
  +#include linux/types.h
  +#include linux/string.h
  +#include linux/ctype.h
  +#include linux/ptrace.h
  +
  +#include trace.h
  +#include trace_output.h
  +
  +#define MAX_ARGSTR_LEN 63
  +
  +/* currently, trace_kprobe only

Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer

2009-07-07 Thread Frederic Weisbecker
On Tue, Jul 07, 2009 at 04:42:32PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
  On Tue, Jul 07, 2009 at 03:55:28PM -0400, Masami Hiramatsu wrote:
  Frederic Weisbecker wrote:
  diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
  index 206cb7d..65945eb 100644
  --- a/kernel/trace/trace.h
  +++ b/kernel/trace/trace.h
  @@ -45,6 +45,8 @@ enum trace_type {
   TRACE_POWER,
   TRACE_BLK,
   TRACE_KSYM,
  +TRACE_KPROBE,
  +TRACE_KRETPROBE,
   
   __TRACE_LAST_TYPE,
   };
  @@ -227,6 +229,22 @@ struct trace_ksym {
   charksym_name[KSYM_NAME_LEN];
   charp_name[TASK_COMM_LEN];
   };
  +#define TRACE_KPROBE_ARGS 6
  +
  +struct kprobe_trace_entry {
  +struct trace_entry  ent;
  +unsigned long   ip;
  +int nargs;
  +unsigned long   args[TRACE_KPROBE_ARGS];
 
 
  I see that you actually make use of arg as a dynamic sizeable
  array.
  For clarity, args[TRACE_KPROBE_ARGS] could be args[0].
 
  It's just a neat and wouldn't affect the code nor the data
  but would be clearer for readers of that code.
  Hmm. In that case, I think we'll need a new macro for field
  definition, like TRACE_FIELD_ZERO(type, item).
  
  
  
  You mean that for trace_define_field() to describe fields of events?
  Actually the fields should be defined dynamically depending on how
  is built the kprobe event (which arguments are requested, how many,
  etc..).
 
 Yeah, if you specified a probe point with its event name, the tracer
 will make a corresponding event dynamically. There are also anonymous
 probes which don't have corresponding events. For those anonymous
 probes, I need to define two generic event types(kprobe and kretprobe).
 
 Thank you,


Ok. Btw, why do you need to define those two anonymous events?
Actually your event types are always dynamically created.
Those you defined through TRACE_FORMAT_EVENT are only ghost events;
they only stand there as an abstract pattern, right?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer

2009-07-07 Thread Frederic Weisbecker
On Tue, Jul 07, 2009 at 05:31:25PM -0400, Masami Hiramatsu wrote:
 Frederic Weisbecker wrote:
  On Tue, Jul 07, 2009 at 04:42:32PM -0400, Masami Hiramatsu wrote:
  Frederic Weisbecker wrote:
  On Tue, Jul 07, 2009 at 03:55:28PM -0400, Masami Hiramatsu wrote:
  Frederic Weisbecker wrote:
  diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
  index 206cb7d..65945eb 100644
  --- a/kernel/trace/trace.h
  +++ b/kernel/trace/trace.h
  @@ -45,6 +45,8 @@ enum trace_type {
 TRACE_POWER,
 TRACE_BLK,
 TRACE_KSYM,
  +  TRACE_KPROBE,
  +  TRACE_KRETPROBE,
   
 __TRACE_LAST_TYPE,
   };
  @@ -227,6 +229,22 @@ struct trace_ksym {
 charksym_name[KSYM_NAME_LEN];
 charp_name[TASK_COMM_LEN];
   };
  +#define TRACE_KPROBE_ARGS 6
  +
  +struct kprobe_trace_entry {
  +  struct trace_entry  ent;
  +  unsigned long   ip;
  +  int nargs;
  +  unsigned long   args[TRACE_KPROBE_ARGS];
 
  I see that you actually make use of arg as a dynamic sizeable
  array.
  For clarity, args[TRACE_KPROBE_ARGS] could be args[0].
 
  It's just a neat and wouldn't affect the code nor the data
  but would be clearer for readers of that code.
  Hmm. In that case, I think we'll need a new macro for field
  definition, like TRACE_FIELD_ZERO(type, item).
 
 
  You mean that for trace_define_field() to describe fields of events?
  Actually the fields should be defined dynamically depending on how
  is built the kprobe event (which arguments are requested, how many,
  etc..).
  Yeah, if you specified a probe point with its event name, the tracer
  will make a corresponding event dynamically. There are also anonymous
  probes which don't have corresponding events. For those anonymous
  probes, I need to define two generic event types(kprobe and kretprobe).
 
  Thank you,
  
  
  Ok. Btw, why do you need to define those two anonymous events?
  Actually your event types are always dynamically created.
  Those you defined through TRACE_FORMAT_EVENT are only ghost events,
  they only stand there as a abstract pattern, right?
  
 
 Not always created.
 
 Below command will create an event event1;
  p probe_point:event1 a1 a2 a3 ... > /debug/tracing/kprobe_events
 
 But next command doesn't create.
  p probe_point a1 a2 a3 ... > /debug/tracing/kprobe_events


Aah, ok.


  This just inserts a kprobe at probe_point. The advantage of this
  simple command is that you are never annoyed by having to make up a
  different name for each new event :-)


Indeed.
But speaking of that, maybe you could dynamically create a name
following this simple model: func+offset.
Unless we can set several kprobes on the exact same address?
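
A trivial sketch of that naming idea (hypothetical helper; the names are invented):

---
#include <linux/kernel.h>

/* Derive a default event name such as "do_fork+0x10" when none was given. */
static void make_default_event_name(char *buf, size_t len,
				    const char *symbol, unsigned long offset)
{
	snprintf(buf, len, "%s+0x%lx", symbol, offset);
}
---

It would still need a tie-breaker if several probes really can share the exact same address.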

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip -v10 5/7] x86: add pt_regs register and stack access APIs

2009-07-05 Thread Frederic Weisbecker
On Tue, Jun 30, 2009 at 09:09:11PM -0400, Masami Hiramatsu wrote:
 Add following APIs for accessing registers and stack entries from pt_regs.
 
 - regs_query_register_offset(const char *name)
Query the offset of name register.
 
 - regs_query_register_name(unsigned offset)
Query the name of register by its offset.
 
 - regs_get_register(struct pt_regs *regs, unsigned offset)
Get the value of a register by its offset.
 
 - regs_within_kernel_stack(struct pt_regs *regs, unsigned long addr)
Check the address is in the kernel stack.
 
 - regs_get_kernel_stack_nth(struct pt_regs *reg, unsigned nth)
Get Nth entry of the kernel stack. (N = 0)
 
 - regs_get_argument_nth(struct pt_regs *reg, unsigned nth)
Get Nth argument at function call. (N = 0)
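
A hypothetical sketch (not part of the patch) of a kprobe pre-handler using the accessors listed above; the register name and the pr_info() output are examples only:

---
#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <asm/ptrace.h>

static int sample_pre_handler(struct kprobe *p, struct pt_regs *regs)
{
	int off = regs_query_register_offset("ax");

	if (off >= 0)
		pr_info("ax=0x%lx arg0=0x%lx stack0=0x%lx\n",
			regs_get_register(regs, off),
			regs_get_argument_nth(regs, 0),
			regs_get_kernel_stack_nth(regs, 0));
	return 0;
}
---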
 
 Changes from v9:
  -Fix a typo in a comment.
 
 Signed-off-by: Masami Hiramatsu mhira...@redhat.com
 Cc: Christoph Hellwig h...@infradead.org
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Frederic Weisbecker fweis...@gmail.com
 Cc: Roland McGrath rol...@redhat.com
 Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Cc: linux-a...@vger.kernel.org



Looks good!

Reviewed-by: Frederic Weisbecker fweis...@gmail.com

Frederic.



 ---
 
  arch/x86/include/asm/ptrace.h |  122 
 +
  arch/x86/kernel/ptrace.c  |   73 +
  2 files changed, 195 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
 index 0f0d908..d5e3b3b 100644
 --- a/arch/x86/include/asm/ptrace.h
 +++ b/arch/x86/include/asm/ptrace.h
 @@ -7,6 +7,7 @@
  
  #ifdef __KERNEL__
  #include asm/segment.h
 +#include asm/page_types.h
  #endif
  
  #ifndef __ASSEMBLY__
 @@ -216,6 +217,127 @@ static inline unsigned long user_stack_pointer(struct 
 pt_regs *regs)
   return regs->sp;
  }
  
 +/* Query offset/name of register from its name/offset */
 +extern int regs_query_register_offset(const char *name);
 +extern const char *regs_query_register_name(unsigned offset);
 +#define MAX_REG_OFFSET (offsetof(struct pt_regs, ss))
 +
 +/**
 + * regs_get_register() - get register value from its offset
 + * @regs:pt_regs from which register value is gotten.
 + * @offset:  offset number of the register.
 + *
 + * regs_get_register returns the value of a register whose offset from @regs
 + * is @offset. The @offset is the offset of the register in struct pt_regs.
 + * If @offset is bigger than MAX_REG_OFFSET, this returns 0.
 + */
 +static inline unsigned long regs_get_register(struct pt_regs *regs,
 +   unsigned offset)
 +{
 + if (unlikely(offset > MAX_REG_OFFSET))
 + return 0;
 + return *(unsigned long *)((unsigned long)regs + offset);
 +}
 +
 +/**
 + * regs_within_kernel_stack() - check the address in the stack
 + * @regs:pt_regs which contains kernel stack pointer.
 + * @addr:address which is checked.
 + *
 + * regs_within_kenel_stack() checks @addr is within the kernel stack page(s).
 + * If @addr is within the kernel stack, it returns true. If not, returns 
 false.
 + */
 +static inline int regs_within_kernel_stack(struct pt_regs *regs,
 +unsigned long addr)
 +{
 + return ((addr & ~(THREAD_SIZE - 1)) ==
 + (kernel_stack_pointer(regs) & ~(THREAD_SIZE - 1)));
 +}
 +
 +/**
 + * regs_get_kernel_stack_nth() - get Nth entry of the stack
 + * @regs:pt_regs which contains kernel stack pointer.
 + * @n:   stack entry number.
 + *
 + * regs_get_kernel_stack_nth() returns @n th entry of the kernel stack which
 + * is specifined by @regs. If the @n th entry is NOT in the kernel stack,
 + * this returns 0.
 + */
 +static inline unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs,
 +   unsigned n)
 +{
 + unsigned long *addr = (unsigned long *)kernel_stack_pointer(regs);
 + addr += n;
 + if (regs_within_kernel_stack(regs, (unsigned long)addr))
 + return *addr;
 + else
 + return 0;
 +}
 +
 +/**
 + * regs_get_argument_nth() - get Nth argument at function call
 + * @regs:pt_regs which contains registers at function entry.
 + * @n:   argument number.
 + *
 + * regs_get_argument_nth() returns @n th argument of a function call.
 + * Since usually the kernel stack will be changed right after function entry,
 + * you must use this at function entry. If the @n th entry is NOT in the
 + * kernel stack or pt_regs, this returns 0.
 + */
 +#ifdef CONFIG_X86_32
 +#define NR_REGPARMS 3
 +static inline unsigned long regs_get_argument_nth(struct pt_regs *regs,
 +   unsigned n)
 +{
 + if (n < NR_REGPARMS) {
 + switch (n) {
 + case 0:
 + return regs->ax;
 + case 1

Re: [PATCH -tip -v10 6/7] tracing: ftrace dynamic ftrace_event_call support

2009-07-05 Thread Frederic Weisbecker
On Tue, Jun 30, 2009 at 09:09:17PM -0400, Masami Hiramatsu wrote:
 Add dynamic ftrace_event_call support to ftrace. Trace engines can add new
 ftrace_event_calls to ftrace on the fly. Each operator function of the call
 takes a ftrace_event_call data structure as an argument, because these
 functions may be shared among several ftrace_event_calls.
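
A hypothetical sketch of what a trace engine could do with this interface. The my_event_* names and stub callbacks are invented, and the name/system members are assumed from the existing struct ftrace_event_call; only trace_add_event_call()/trace_remove_event_call() and the callback signatures come from the patch:

---
#include <linux/ftrace_event.h>
#include <linux/module.h>

static int  my_event_reg(struct ftrace_event_call *call)     { return 0; }
static void my_event_unreg(struct ftrace_event_call *call)   { }
static int  my_event_init_cb(struct ftrace_event_call *call) { return 0; }

static struct ftrace_event_call my_event_call = {
	.name		= "my_dynamic_event",
	.system		= "kprobes",
	.raw_init	= my_event_init_cb,
	.regfunc	= my_event_reg,
	.unregfunc	= my_event_unreg,
};

static int __init my_event_init(void)
{
	return trace_add_event_call(&my_event_call);	/* added on the fly */
}

static void __exit my_event_exit(void)
{
	trace_remove_event_call(&my_event_call);	/* and removed again */
}

module_init(my_event_init);
module_exit(my_event_exit);
MODULE_LICENSE("GPL");
---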
 
 Signed-off-by: Masami Hiramatsu mhira...@redhat.com
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Tom Zanussi tzanu...@gmail.com
 Cc: Frederic Weisbecker fweis...@gmail.com



Looks good too.

Acked-by: Frederic Weisbecker fweis...@gmail.com



 ---
 
  include/linux/ftrace_event.h |   13 +---
  include/trace/ftrace.h   |   22 +++--
  kernel/trace/trace_events.c  |   70 
 --
  kernel/trace/trace_export.c  |   27 
  4 files changed, 85 insertions(+), 47 deletions(-)
 
 diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
 index 5c093ff..f7733b6 100644
 --- a/include/linux/ftrace_event.h
 +++ b/include/linux/ftrace_event.h
 @@ -108,12 +108,13 @@ struct ftrace_event_call {
   struct dentry   *dir;
   struct trace_event  *event;
   int enabled;
 - int (*regfunc)(void);
 - void(*unregfunc)(void);
 + int (*regfunc)(struct ftrace_event_call *);
 + void(*unregfunc)(struct ftrace_event_call *);
   int id;
 - int (*raw_init)(void);
 - int (*show_format)(struct trace_seq *s);
 - int (*define_fields)(void);
 + int (*raw_init)(struct ftrace_event_call *);
 + int (*show_format)(struct ftrace_event_call *,
 +struct trace_seq *);
 + int (*define_fields)(struct ftrace_event_call *);
   struct list_headfields;
   int filter_active;
   void*filter;
 @@ -138,6 +139,8 @@ extern int filter_current_check_discard(struct 
 ftrace_event_call *call,
  
  extern int trace_define_field(struct ftrace_event_call *call, char *type,
 char *name, int offset, int size, int is_signed);
 +extern int trace_add_event_call(struct ftrace_event_call *call);
 +extern void trace_remove_event_call(struct ftrace_event_call *call);
  
  #define is_signed_type(type) (((type)(-1))  0)
  
 diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
 index 1867553..d696580 100644
 --- a/include/trace/ftrace.h
 +++ b/include/trace/ftrace.h
 @@ -147,7 +147,8 @@
  #undef TRACE_EVENT
  #define TRACE_EVENT(call, proto, args, tstruct, func, print) \
  static int   \
 -ftrace_format_##call(struct trace_seq *s)\
 +ftrace_format_##call(struct ftrace_event_call *event_call,   \
 +  struct trace_seq *s)   \
  {\
   struct ftrace_raw_##call field __attribute__((unused)); \
   int ret = 0;\
 @@ -289,10 +290,9 @@ ftrace_raw_output_##call(struct trace_iterator *iter, 
 int flags) \
  #undef TRACE_EVENT
  #define TRACE_EVENT(call, proto, args, tstruct, func, print) \
  int  \
 -ftrace_define_fields_##call(void)\
 +ftrace_define_fields_##call(struct ftrace_event_call *event_call)\
  {\
   struct ftrace_raw_##call field; \
 - struct ftrace_event_call *event_call = event_##call;   \
   int ret;\
   \
   __common_field(int, type, 1);   \
 @@ -355,7 +355,7 @@ static inline int ftrace_get_offsets_##call(  
 \
   *   event_trace_printk(_RET_IP_, call:  fmt);
   * }
   *
 - * static int ftrace_reg_event_call(void)
 + * static int ftrace_reg_event_call(struct ftrace_event_call *unused)
   * {
   *   int ret;
   *
 @@ -366,7 +366,7 @@ static inline int ftrace_get_offsets_##call(  
 \
   *   return ret;
   * }
   *
 - * static void ftrace_unreg_event_call(void)
 + * static void ftrace_unreg_event_call(struct ftrace_event_call *unused)
   * {
   *   unregister_trace_call(ftrace_event_call);
   * }
 @@ -399,7 +399,7 @@ static inline int ftrace_get_offsets_##call(  
 \
   *   trace_current_buffer_unlock_commit(event, irq_flags, pc

Re: [PATCH -tip v5 4/7] tracing: add kprobe-based event tracer

2009-05-11 Thread Frederic Weisbecker
On Sat, May 09, 2009 at 01:33:53PM -0400, Masami Hiramatsu wrote:
 Frédéric Weisbecker wrote:
  Hi,
  
  2009/5/9 Masami Hiramatsu mhira...@redhat.com:
 [...]
  +
  +/* event recording functions */
  +static void kprobe_trace_record(unsigned long ip, struct trace_probe *tp,
  +   struct pt_regs *regs)
  +{
  +   __trace_bprintk(ip, %s%s%+ld\n,
  +   probe_is_return(tp) ? - : @,
  +   probe_symbol(tp), probe_offset(tp));
  +}
  
  
  
  What happens here if you have:
  
  kprobe_trace_record() {
probe_symbol() {
   probes_open() {
cleanup_all_probes() {
   free_trace_probe();
    return tp->symbol ? ; //crack!
 
  I wonder if you shouldn't use a per_cpu list of probes,
  spinlocked/irqsaved  accessed
  and also a kind of prevention against nmi.
 
  Sure, cleanup_all_probes() invokes unregister_kprobe() via
  unregister_trace_probe(), which waits for running probe handlers by
  using synchronize_sched() (because kprobes disables preemption
  around its handlers), before free_trace_probe().
  
  So you don't need any locks there :-)
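
In other words, the teardown can be as simple as this hypothetical sketch (the trace_probe layout and helper names are illustrative; the guarantee relied on is the synchronize_sched() mentioned above):

---
static void remove_trace_probe(struct trace_probe *tp)
{
	/* unregister_kprobe() waits for all running handlers (synchronize_sched()) */
	unregister_kprobe(&tp->kp);
	/* now no handler can still dereference tp, so freeing needs no locks */
	free_trace_probe(tp);
}
---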
 
 Thank you,
 
 


Aah, ok :)
So this patch looks sane.

Thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -tip 2/6 V4] x86: add arch-dep register and stack access API to ptrace

2009-04-02 Thread Frederic Weisbecker
On Thu, Apr 02, 2009 at 01:24:47PM -0400, Masami Hiramatsu wrote:
 Add following APIs for accessing registers and stack entries from pt_regs.
 - query_register_offset(const char *name)
Query the offset of name register.
 
 - query_register_name(unsigned offset)
Query the name of register by its offset.
 
 - get_register(struct pt_regs *regs, unsigned offset)
Get the value of a register by its offset.
 
 - valid_stack_address(struct pt_regs *regs, unsigned long addr)
Check the address is in the stack.
 
 - get_stack_nth(struct pt_regs *reg, unsigned nth)
Get Nth entry of the stack. (N = 0)
 
 - get_argument_nth(struct pt_regs *reg, unsigned nth)
Get Nth argument at function call. (N = 0)
 
 Signed-off-by: Masami Hiramatsu mhira...@redhat.com
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Frederic Weisbecker fweis...@gmail.com
 ---
 
  arch/x86/include/asm/ptrace.h |   66 
 +
  arch/x86/kernel/ptrace.c  |   59 +
  2 files changed, 125 insertions(+), 0 deletions(-)
 
 
 diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
 index aed0894..44773b8 100644
 --- a/arch/x86/include/asm/ptrace.h
 +++ b/arch/x86/include/asm/ptrace.h
 @@ -7,6 +7,7 @@
 
  #ifdef __KERNEL__
  #include asm/segment.h
 +#include asm/page_types.h
  #endif
 
  #ifndef __ASSEMBLY__
 @@ -215,6 +216,71 @@ static inline unsigned long user_stack_pointer(struct 
 pt_regs *regs)
   return regs-sp;
  }
 
 +/* Query offset/name of register from its name/offset */
 +extern int query_register_offset(const char *name);
 +extern const char *query_register_name(unsigned offset);
 +#define MAX_REG_OFFSET (offsetof(struct pt_regs, sp))
 +
 +/* Get register value from its offset */
 +static inline unsigned long get_register(struct pt_regs *regs, unsigned 
 offset)
 +{
 + if (unlikely(offset > MAX_REG_OFFSET))
 + return 0;
 + return *(unsigned long *)((unsigned long)regs + offset);
 +}
 +
 +/* Check the address in the stack */
 +static inline int valid_stack_address(struct pt_regs *regs, unsigned long 
 addr)
 +{
 + return ((addr & ~(THREAD_SIZE - 1)) ==
 + (kernel_trap_sp(regs) & ~(THREAD_SIZE - 1)));
 +}
 +
 +/* Get Nth entry of the stack */
 +static inline unsigned long get_stack_nth(struct pt_regs *regs, unsigned n)
 +{
 + unsigned long *addr = (unsigned long *)kernel_trap_sp(regs);
 + addr += n;
 + if (valid_stack_address(regs, (unsigned long)addr))
 + return *addr;
 + else
 + return 0;
 +}
 +
 +/* Get Nth argument at function call */
 +static inline unsigned long get_argument_nth(struct pt_regs *regs, unsigned 
 n)
 +{
 +#ifdef CONFIG_X86_32
 +#define NR_REGPARMS 3
 + if (n < NR_REGPARMS) {
 + switch (n) {
 + case 0: return regs->ax;
 + case 1: return regs->dx;
 + case 2: return regs->cx;
 + }
 + return 0;
 +#else /* CONFIG_X86_64 */
 +#define NR_REGPARMS 6
 + if (n < NR_REGPARMS) {
 + switch (n) {
 + case 0: return regs->di;
 + case 1: return regs->si;
 + case 2: return regs->dx;
 + case 3: return regs->cx;
 + case 4: return regs->r8;
 + case 5: return regs->r9;
 + }
 + return 0;
 +#endif
 + } else {
 + /*
 +  * The typical case: arg n is on the stack.
 +  * (Note: stack[0] = return address, so skip it)
 +  */
 + return get_stack_nth(regs, 1 + n - NR_REGPARMS);
 + }
 +}
 +
  /*
   * These are defined as per linux/ptrace.h, which see.
   */
 diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
 index 5c6e463..8d65dcb 100644
 --- a/arch/x86/kernel/ptrace.c
 +++ b/arch/x86/kernel/ptrace.c
 @@ -46,6 +46,65 @@ enum x86_regset {
   REGSET_IOPERM32,
  };
 
 +struct pt_regs_offset {
 + const char *name;
 + int offset;
 +};
 +
 +#define REG_OFFSET(r) offsetof(struct pt_regs, r)
 +#define REG_OFFSET_NAME(r) {.name = #r, .offset = REG_OFFSET(r)}
 +#define REG_OFFSET_END {.name = NULL, .offset = 0}
 +
 +static struct pt_regs_offset regoffset_table[] = {
 +#ifdef CONFIG_X86_64
 + REG_OFFSET_NAME(r15),
 + REG_OFFSET_NAME(r14),
 + REG_OFFSET_NAME(r13),
 + REG_OFFSET_NAME(r12),
 + REG_OFFSET_NAME(r11),
 + REG_OFFSET_NAME(r10),
 + REG_OFFSET_NAME(r9),
 + REG_OFFSET_NAME(r8),
 +#endif
 + REG_OFFSET_NAME(bx),
 + REG_OFFSET_NAME(cx),
 + REG_OFFSET_NAME(dx),
 + REG_OFFSET_NAME(si),
 + REG_OFFSET_NAME(di),
 + REG_OFFSET_NAME(bp),
 + REG_OFFSET_NAME(ax),
 +#ifdef CONFIG_X86_32
 + REG_OFFSET_NAME(ds),
 + REG_OFFSET_NAME(es),
 + REG_OFFSET_NAME(fs),
 + REG_OFFSET_NAME(gs),
 +#endif
 + REG_OFFSET_NAME(orig_ax),
 + REG_OFFSET_NAME(ip