Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
> If exception_enter happens when already in IN_KERNEL state, the code
> still calls context_tracking_exit, which ends up in
> rcu_eqs_exit_common, which explodes with a WARN_ON when it is called
> in a situation where dynticks are not enabled.

Fortunately context_tracking_exit() already has a current_state ==
IN_KERNEL check, so this shouldn't be a problem. Meanwhile I'll still
take the patch; it's better to handle that from the caller.

Thanks.

> This can be avoided by having exception_enter only switch to IN_KERNEL
> state if the current state is not already IN_KERNEL.
>
> Signed-off-by: Rik van Riel r...@redhat.com
> Reported-by: Luiz Capitulino lcapitul...@redhat.com
> ---
> Frederic, you will want this bonus patch, too :) Thanks to Luiz for
> finding this one. Whatever I was running did not trigger this issue...
>
>  include/linux/context_tracking.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index b65fd1420e53..9da230406e8c 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -37,7 +37,8 @@ static inline enum ctx_state exception_enter(void)
>  		return 0;
>
>  	prev_ctx = this_cpu_read(context_tracking.state);
> -	context_tracking_exit(prev_ctx);
> +	if (prev_ctx != IN_KERNEL)
> +		context_tracking_exit(prev_ctx);
>
>  	return prev_ctx;
>  }

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v5 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On Tue, Feb 10, 2015 at 03:27:49PM -0500, r...@redhat.com wrote:
> When running a KVM guest on a system with NOHZ_FULL enabled, and the
> KVM guest running with idle=poll mode, we still get wakeups of the
> rcuos/N threads.
>
> This problem has already been solved for user space by telling the RCU
> subsystem that the CPU is in an extended quiescent state while running
> user space code. This patch series extends that code a little bit to
> make it usable to track KVM guest space, too.
>
> I tested the code by booting a KVM guest with idle=poll, on a system
> with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a CPU.
> In a 10 second interval, rcuos/N threads on other CPUs got woken up
> several times, while the rcuos thread on the CPU running the bound and
> always running VCPU thread never got woken up once.
>
> Thanks to Christian Borntraeger, Paul McKenney, Paolo Bonzini,
> Frederic Weisbecker, and Will Deacon for reviewing and improving
> earlier versions of this patch series.

So the patchset looks good, thanks everyone. I'm applying the series and
will see if anything explodes. I'll wait until the end of the merge
window to push it to Ingo.

Thanks.
Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On Thu, Feb 12, 2015 at 10:47:10AM -0500, Rik van Riel wrote:
> On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
> > On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
> > > If exception_enter happens when already in IN_KERNEL state, the
> > > code still calls context_tracking_exit, which ends up in
> > > rcu_eqs_exit_common, which explodes with a WARN_ON when it is
> > > called in a situation where dynticks are not enabled.
> >
> > Fortunately context_tracking_exit() already has a current_state ==
> > IN_KERNEL check so this shouldn't be a problem.
>
> No, it had a hard-coded current_state == IN_USER check, which is very
> close, but ...
>
> ... I replaced that with a state argument, and forgot to ensure that
> it never gets called with state == IN_KERNEL. This patch fixes that.

Ah, that's right! Well, I'm going to merge this patch into 1/5 then, to
avoid breaking bisection.

Thanks.
Re: [PATCH 1/6] rcu,nohz: add context_tracking_user_enter/exit wrapper functions
On Tue, Feb 10, 2015 at 09:41:45AM -0500, r...@redhat.com wrote:
> From: Rik van Riel r...@redhat.com
>
> These wrapper functions allow architecture code (eg. ARM) to keep
> calling context_tracking_user_enter & context_tracking_user_exit the
> same way it always has, without error prone tricks like duplicate
> defines of argument values in assembly code.
>
> Signed-off-by: Rik van Riel r...@redhat.com

This patch alone doesn't make much sense. The changelog says it's about
keeping the context_tracking_user_*() functions as wrappers, but fails
to explain what they wrap, why, and what the new
context_tracking_enter/exit functions are for.

Perhaps patches 1 and 2 should be merged together into something like:

    context_tracking: Generalize context tracking APIs to support
    user and guest

    Do that because we'll also track guest, etc...

    And keep the old user context tracking APIs for now to avoid
    painful enum parameter support in ARM assembly.

> ---
>  include/linux/context_tracking.h |  2 ++
>  kernel/context_tracking.c        | 37 +++++++++++++++++++-------------
>  2 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 37b81bd51ec0..03b9c733eae7 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -10,6 +10,8 @@
>  #ifdef CONFIG_CONTEXT_TRACKING
>  extern void context_tracking_cpu_set(int cpu);
> +extern void context_tracking_enter(void);
> +extern void context_tracking_exit(void);
>  extern void context_tracking_user_enter(void);
>  extern void context_tracking_user_exit(void);
>  extern void __context_tracking_task_switch(struct task_struct *prev,
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 937ecdfdf258..bbdc423936e6 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -39,15 +39,15 @@ void context_tracking_cpu_set(int cpu)
>  }
>
>  /**
> - * context_tracking_user_enter - Inform the context tracking that the CPU is going to
> - *                               enter userspace mode.
> + * context_tracking_enter - Inform the context tracking that the CPU is going
> + *                          enter user or guest space mode.
>   *
>   * This function must be called right before we switch from the kernel
> - * to userspace, when it's guaranteed the remaining kernel instructions
> - * to execute won't use any RCU read side critical section because this
> - * function sets RCU in extended quiescent state.
> + * to user or guest space, when it's guaranteed the remaining kernel
> + * instructions to execute won't use any RCU read side critical section
> + * because this function sets RCU in extended quiescent state.
>   */
> -void context_tracking_user_enter(void)
> +void context_tracking_enter(void)
>  {
>  	unsigned long flags;
> @@ -105,20 +105,27 @@ void context_tracking_user_enter(void)
>  	}
>  	local_irq_restore(flags);
>  }
> +NOKPROBE_SYMBOL(context_tracking_enter);
> +
> +void context_tracking_user_enter(void)
> +{
> +	context_tracking_enter();
> +}
>  NOKPROBE_SYMBOL(context_tracking_user_enter);
>
>  /**
> - * context_tracking_user_exit - Inform the context tracking that the CPU is
> - *                              exiting userspace mode and entering the kernel.
> + * context_tracking_exit - Inform the context tracking that the CPU is
> + *                         exiting user or guest mode and entering the kernel.
>   *
> - * This function must be called after we entered the kernel from userspace
> - * before any use of RCU read side critical section. This potentially include
> - * any high level kernel code like syscalls, exceptions, signal handling, etc...
> + * This function must be called after we entered the kernel from user or
> + * guest space before any use of RCU read side critical section. This
> + * potentially include any high level kernel code like syscalls, exceptions,
> + * signal handling, etc...
>   *
>   * This call supports re-entrancy. This way it can be called from any exception
>   * handler without needing to know if we came from userspace or not.
>   */
> -void context_tracking_user_exit(void)
> +void context_tracking_exit(void)

Comment changes are good, thanks!

>  {
>  	unsigned long flags;
> @@ -143,6 +150,12 @@ void context_tracking_user_exit(void)
>  	}
>  	local_irq_restore(flags);
>  }
> +NOKPROBE_SYMBOL(context_tracking_exit);
> +
> +void context_tracking_user_exit(void)
> +{
> +	context_tracking_exit();
> +}
>  NOKPROBE_SYMBOL(context_tracking_user_exit);
>
>  /**
> --
> 1.9.3
Re: [PATCH 1/6] rcu,nohz: add context_tracking_user_enter/exit wrapper functions
On Tue, Feb 10, 2015 at 09:25:26AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 10, 2015 at 11:48:37AM -0500, Rik van Riel wrote:
> > On 02/10/2015 10:28 AM, Frederic Weisbecker wrote:
> > > On Tue, Feb 10, 2015 at 09:41:45AM -0500, r...@redhat.com wrote:
> > > > From: Rik van Riel r...@redhat.com
> > > >
> > > > These wrapper functions allow architecture code (eg. ARM) to
> > > > keep calling context_tracking_user_enter &
> > > > context_tracking_user_exit the same way it always has, without
> > > > error prone tricks like duplicate defines of argument values in
> > > > assembly code.
> > > >
> > > > Signed-off-by: Rik van Riel r...@redhat.com
> > >
> > > This patch alone doesn't make much sense.
> >
> > Agreed, my goal was to keep things super easy to review, to reduce
> > the chance of introducing bugs.
> >
> > > The changelog says it's about keeping context_tracking_user_*()
> > > functions as wrappers but fails to explain what they wrap, why,
> > > and what the new context_tracking_enter/exit functions are for.
> > >
> > > Perhaps patches 1 and 2 should be merged together into something
> > > like:
> > >
> > >     context_tracking: Generalize context tracking APIs to support
> > >     user and guest
> > >
> > >     Do that because we'll also track guest, etc...
> > >
> > >     And keep the old user context tracking APIs for now to avoid
> > >     painful enum parameter support in ARM assembly.
> >
> > Can do... Paul, would you like me to resend the whole series, or
> > just a merged patch that replaces patches 1 & 2?
>
> I prefer the whole series, as it reduces my opportunity to introduce
> human error when applying them. ;-)
>
> > That is assuming Paul prefers having the patches merged into one :)
>
> I am OK with that. I will merge the next set with the first two
> patches merged; people have had sufficient opportunity to review.

BTW, I have a few patches to make on the next cycle to fix a few
context tracking related things. And since it's too late to push this
series for the current merge window, now I wonder if it may be easier
if I take these patches. Otherwise you might experience unpleasant
rebase conflicts. Is that ok for you?

Thanks.
Re: [PATCH -v3 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On Mon, Feb 09, 2015 at 08:22:59PM -0500, Rik van Riel wrote:
> On 02/09/2015 08:15 PM, Will Deacon wrote:
> > Hi Rik,
> >
> > On Mon, Feb 09, 2015 at 04:04:38PM +0000, r...@redhat.com wrote:
> > > Apologies to Catalin and Will for not fixing up ARM. I am not
> > > familiar with ARM assembly, and not sure how to pass a constant
> > > argument to a function from assembly code on ARM :)
> >
> > It's a bit of a faff getting enum values into asm -- we actually
> > have to duplicate the definitions using #defines to get at the
> > constants. Perhaps it would be cleaner to leave
> > context_tracking_user_{enter,exit} intact as C wrappers around
> > context_tracking_{enter,exit}, passing the appropriate constant?
> > That way we don't actually need to change the arch code at all.
>
> If Paul and Frederic have no objections, I would be happy to do that.
>
> Paul, Frederic?

Sure, that's fine by me.

Thanks.
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On Fri, Feb 06, 2015 at 11:14:53PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 06, 2015 at 10:34:21PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 06, 2015 at 10:53:34PM -0500, Rik van Riel wrote:
> > > On 02/06/2015 06:15 PM, Frederic Weisbecker wrote:
> > > > Just a few things then:
> > > >
> > > > 1) In this case rename context_tracking_user_enter/exit() to
> > > >    context_tracking_enter() and context_tracking_exit(), since
> > > >    it's not about user only anymore but about any generic
> > > >    context.
> > > >
> > > > 2) We have the WARN_ON_ONCE(!current->mm); condition that is a
> > > >    debug check specific to userspace transitions, because kernel
> > > >    threads aren't expected to resume to userspace. Can we also
> > > >    expect that we never switch to/from guest from a kernel
> > > >    thread? AFAICS this happens from an ioctl (thus a user task)
> > > >    in x86 for kvm. But I only know this case.
> > > >
> > > > 3) You might want to update a few comments that assume we only
> > > >    deal with userspace transitions.
> > > >
> > > > 4) trace_user_enter/exit() should stay user-transitions
> > > >    specific.
> > >
> > > Paul, would you like me to send follow-up patches with the
> > > cleanups suggested by Frederic, or would you prefer me to send a
> > > new series with the cleanups integrated?
> >
> > I would prefer a new series, in order to prevent possible future
> > confusion. Of course, if Frederic would rather push them himself, I
> > am fine with that.
>
> And in that case, you should ask him for his preferences, which just
> might differ from mine. ;-)

I prefer a new series too. Now whether you or me take the patches, I
don't mind either way :-)

Also I wonder how this feature is going to be enabled. Will it be
enabled on full dynticks, or should it be a separate feature depending
on full dynticks? Or even just CONFIG_RCU_USER_EQS? Because I'm still
unclear about how and for what this is used, and whether it involves
full dynticks or only RCU extended quiescent states.

Thanks.
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On Fri, Feb 06, 2015 at 01:51:56PM -0500, Rik van Riel wrote:
> On 02/06/2015 01:23 PM, Frederic Weisbecker wrote:
> > On Fri, Feb 06, 2015 at 01:20:21PM -0500, Rik van Riel wrote:
> > > On 02/06/2015 12:22 PM, Frederic Weisbecker wrote:
> > > > On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
> > > > > From: Rik van Riel r...@redhat.com
> > > > >
> > > > > Add the expected ctx_state as a parameter to
> > > > > context_tracking_user_enter and context_tracking_user_exit,
> > > > > allowing the same functions to not just track kernel <-> user
> > > > > space switching, but also kernel <-> guest transitions.
> > > > >
> > > > > Signed-off-by: Rik van Riel r...@redhat.com
> > > >
> > > > You should consider using guest_enter() and guest_exit()
> > > > instead. These are context tracking APIs too, but specifically
> > > > for guests.
> > >
> > > What do you mean, instead? KVM already uses those. I just wanted
> > > to avoid duplicating the code...
> >
> > I mean you can call the rcu_user APIs directly from
> > guest_enter/exit. You don't really need to call the
> > context_tracking_user functions, since guest_enter/guest_exit
> > already handle the vtime accounting.
>
> I would still have to modify exception_enter and exception_exit, and
> with them context_tracking_user_enter and context_tracking_user_exit.
> We have to re-enable RCU when an exception happens.
>
> I suspect exceptions in a guest just trigger a VMEXIT, and we figure
> out later why the exception happened. However, if we were to get an
> exception during the code where we transition into or out of guest
> mode, we would still need exception_enter and exception_exit...

Ah, that's a fair point. I didn't think about that. Ok, then a real
IN_GUEST mode makes sense. And context_tracking_user_enter/exit() can
be reused as is, indeed.

Just a few things then:

1) In this case rename context_tracking_user_enter/exit() to
   context_tracking_enter() and context_tracking_exit(), since it's
   not about user only anymore but about any generic context.

2) We have the WARN_ON_ONCE(!current->mm); condition that is a debug
   check specific to userspace transitions, because kernel threads
   aren't expected to resume to userspace. Can we also expect that we
   never switch to/from guest from a kernel thread? AFAICS this
   happens from an ioctl (thus a user task) in x86 for kvm. But I only
   know this case.

3) You might want to update a few comments that assume we only deal
   with userspace transitions.

4) trace_user_enter/exit() should stay user-transitions specific.

Thanks.
Re: [PATCH 4/5] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest
On Thu, Feb 05, 2015 at 03:23:51PM -0500, r...@redhat.com wrote:
> From: Rik van Riel r...@redhat.com
>
> The host kernel is not doing anything while the CPU is executing a
> KVM guest VCPU, so it can be marked as being in an extended quiescent
> state, identical to that used when running user space code.
>
> The only exception to that rule is when the host handles an
> interrupt, which is already handled by the irq code, which calls
> rcu_irq_enter and rcu_irq_exit.
>
> The guest_enter and guest_exit functions already switch vtime
> accounting independent of context tracking, so leave those calls
> where they are, instead of moving them into the context tracking code.
>
> Signed-off-by: Rik van Riel r...@redhat.com
> ---
>  include/linux/context_tracking.h       | 8 +++++++-
>  include/linux/context_tracking_state.h | 1 +
>  include/linux/kvm_host.h               | 3 ++-
>  3 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index bd9f000fc98d..a5d3bb44b897 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>  static inline void exception_exit(enum ctx_state prev_ctx)
>  {
>  	if (context_tracking_is_enabled()) {
> -		if (prev_ctx == IN_USER)
> +		if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)

That's nitpicking, but != IN_KERNEL would be more generic. We are
exiting an exception, and we know that the exception executes
IN_KERNEL, so we want to restore any context (whether IN_USER,
IN_GUEST, or anything added in the future) prior to the exception, if
that was anything else than IN_KERNEL.

>  			context_tracking_user_enter(prev_ctx);
>  	}
>  }
> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>  		vtime_guest_enter(current);
>  	else
>  		current->flags |= PF_VCPU;
> +
> +	if (context_tracking_is_enabled())
> +		context_tracking_user_enter(IN_GUEST);
>  }
>
>  static inline void guest_exit(void)
> @@ -86,6 +89,9 @@ static inline void guest_exit(void)
>  		vtime_guest_exit(current);
>  	else
>  		current->flags &= ~PF_VCPU;
> +
> +	if (context_tracking_is_enabled())
> +		context_tracking_user_exit(IN_GUEST);

I suggest you restore RCU before anything else. I believe cputime
accounting doesn't use RCU, but we never know with all the
debug/tracing code behind it, the acct accounting...

Thanks.

>  }
>
>  #else
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index 97a81225d037..f3ef027af749 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -15,6 +15,7 @@ struct context_tracking {
>  	enum ctx_state {
>  		IN_KERNEL = 0,
>  		IN_USER,
> +		IN_GUEST,
>  	} state;
>  };
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 26f106022c88..c7828a6a9614 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
>  	 * one time slice). Lets treat guest mode as quiescent state, just like
>  	 * we do with user-mode execution.
>  	 */
> -	rcu_virt_note_context_switch(smp_processor_id());
> +	if (!context_tracking_cpu_is_enabled())
> +		rcu_virt_note_context_switch(smp_processor_id());
>  }
>
>  static inline void kvm_guest_exit(void)
> --
> 1.9.3
Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On Thu, Feb 05, 2015 at 03:23:47PM -0500, r...@redhat.com wrote:
> When running a KVM guest on a system with NOHZ_FULL enabled

I just need to clarify the motivation first: does the above situation
really happen?

Ok, some distros enable NOHZ_FULL to let the user stop the tick in
userspace. So most of the time, CONFIG_NOHZ_FULL=y but nohz full is
runtime disabled (we need to pass a nohz_full= boot parameter to enable
it). And when it is runtime disabled, there should be no rcu nocb CPU.
(Although not setting CPUs in nocb mode when nohz full is runtime
disabled is perhaps a recent change.)

So for the problem to arise, one needs to enable nohz_full and run a
KVM guest. And I never heard about such workloads. That said, it's
potentially interesting to turn off the tick on the host when the guest
runs.

> , and the KVM guest running with idle=poll mode, we still get wakeups
> of the rcuos/N threads.

So we need nohz_full on the host and idle=poll mode on the guest. Is it
likely to happen? (Sorry, again I'm just trying to make sure we agree
on why we do this change.)

> This problem has already been solved for user space by telling the
> RCU subsystem that the CPU is in an extended quiescent state while
> running user space code. This patch series extends that code a little
> bit to make it usable to track KVM guest space, too.
>
> I tested the code by booting a KVM guest with idle=poll, on a system
> with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a CPU.
> In a 10 second interval, rcuos/N threads on other CPUs got woken up
> several times, while the rcuos thread on the CPU running the bound and
> always running VCPU thread never got woken up once.

So what you're describing is to set RCU in extended quiescent state,
right? This doesn't include stopping the tick while running in guest
mode? Those are indeed two different things, although stopping the tick
most often requires to set RCU in extended quiescent state.

> Thanks to Christian Borntraeger and Paul McKenney for reviewing the
> first version of this patch series, and helping optimize patch 4/5.
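For concreteness, the host/guest configuration being discussed looks roughly like this; the CPU ranges are examples, not values from the thread:

```
# Host kernel command line (example ranges): keep CPU 0 for
# housekeeping, isolate CPUs 1-7 for pinned VCPU threads.
nohz_full=1-7 rcu_nocbs=1-7

# Guest kernel command line, as in the test described above:
idle=poll

# At runtime, the effective nohz_full set can be read back on the host:
#   cat /sys/devices/system/cpu/nohz_full
```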
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
> From: Rik van Riel r...@redhat.com
>
> Add the expected ctx_state as a parameter to
> context_tracking_user_enter and context_tracking_user_exit, allowing
> the same functions to not just track kernel <-> user space switching,
> but also kernel <-> guest transitions.
>
> Signed-off-by: Rik van Riel r...@redhat.com

You should consider using guest_enter() and guest_exit() instead.
These are context tracking APIs too, but specifically for guests.
These can be uninlined if needed.
Re: [PATCH 4/5] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest
On Thu, Feb 05, 2015 at 03:23:51PM -0500, r...@redhat.com wrote:
> From: Rik van Riel r...@redhat.com
>
> The host kernel is not doing anything while the CPU is executing a
> KVM guest VCPU, so it can be marked as being in an extended quiescent
> state, identical to that used when running user space code.
>
> The only exception to that rule is when the host handles an
> interrupt, which is already handled by the irq code, which calls
> rcu_irq_enter and rcu_irq_exit.
>
> The guest_enter and guest_exit functions already switch vtime
> accounting independent of context tracking, so leave those calls
> where they are, instead of moving them into the context tracking code.
>
> Signed-off-by: Rik van Riel r...@redhat.com
> ---
>  include/linux/context_tracking.h       | 8 +++++++-
>  include/linux/context_tracking_state.h | 1 +
>  include/linux/kvm_host.h               | 3 ++-
>  3 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index bd9f000fc98d..a5d3bb44b897 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>  static inline void exception_exit(enum ctx_state prev_ctx)
>  {
>  	if (context_tracking_is_enabled()) {
> -		if (prev_ctx == IN_USER)
> +		if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
>  			context_tracking_user_enter(prev_ctx);
>  	}
>  }
> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>  		vtime_guest_enter(current);
>  	else
>  		current->flags |= PF_VCPU;
> +
> +	if (context_tracking_is_enabled())
> +		context_tracking_user_enter(IN_GUEST);

So you should probably just call rcu_user_enter() directly from there.
context_tracking_user_enter() is really about userspace boundaries.

>  }
>
>  static inline void guest_exit(void)
> @@ -86,6 +89,9 @@ static inline void guest_exit(void)
>  		vtime_guest_exit(current);
>  	else
>  		current->flags &= ~PF_VCPU;
> +
> +	if (context_tracking_is_enabled())
> +		context_tracking_user_exit(IN_GUEST);
>  }
>
>  #else
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index 97a81225d037..f3ef027af749 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -15,6 +15,7 @@ struct context_tracking {
>  	enum ctx_state {
>  		IN_KERNEL = 0,
>  		IN_USER,
> +		IN_GUEST,
>  	} state;
>  };
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 26f106022c88..c7828a6a9614 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
>  	 * one time slice). Lets treat guest mode as quiescent state, just like
>  	 * we do with user-mode execution.
>  	 */
> -	rcu_virt_note_context_switch(smp_processor_id());
> +	if (!context_tracking_cpu_is_enabled())
> +		rcu_virt_note_context_switch(smp_processor_id());

Should we have a specific CONFIG for this feature? Or is relying on
full dynticks being enabled (and thus context tracking enabled) enough?

Thanks.

>  }
>
>  static inline void kvm_guest_exit(void)
> --
> 1.9.3
Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On Fri, Feb 06, 2015 at 09:56:43AM -0500, Rik van Riel wrote:
> On 02/06/2015 08:46 AM, Frederic Weisbecker wrote:
> > On Thu, Feb 05, 2015 at 03:23:47PM -0500, r...@redhat.com wrote:
> > > When running a KVM guest on a system with NOHZ_FULL enabled
> >
> > I just need to clarify the motivation first: does the above
> > situation really happen?
> >
> > Ok, some distros enable NOHZ_FULL to let the user stop the tick in
> > userspace. So most of the time, CONFIG_NOHZ_FULL=y but nohz full is
> > runtime disabled (we need to pass a nohz_full= boot parameter to
> > enable it). And when it is runtime disabled, there should be no rcu
> > nocb CPU. (Although not setting CPUs in nocb mode when nohz full is
> > runtime disabled is perhaps a recent change.)
> >
> > So for the problem to arise, one needs to enable nohz_full and run
> > a KVM guest. And I never heard about such workloads. That said,
> > it's potentially interesting to turn off the tick on the host when
> > the guest runs.
> >
> > > , and the KVM guest running with idle=poll mode, we still get
> > > wakeups of the rcuos/N threads.
> >
> > So we need nohz_full on the host and idle=poll mode on the guest.
> > Is it likely to happen? (Sorry, again I'm just trying to make sure
> > we agree on why we do this change.)
>
> We have users interested in doing just that, in order to run KVM
> guests with the least amount of perturbation to the guest.

So what are you interested in exactly? Only RCU extended quiescent
state, or also full dynticks? Because your cover letter only points to
disturbing RCU nocb and quiescent states.

Also, I'm curious why you run guests in idle=poll. Maybe that avoids
host/guest context switches?

Thanks.
Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On Fri, Feb 06, 2015 at 02:50:44PM +0100, Paolo Bonzini wrote:
> On 06/02/2015 14:46, Frederic Weisbecker wrote:
> > > When running a KVM guest on a system with NOHZ_FULL enabled
> >
> > I just need to clarify the motivation first: does the above
> > situation really happen?
> >
> > Ok, some distros enable NOHZ_FULL to let the user stop the tick in
> > userspace. So most of the time, CONFIG_NOHZ_FULL=y but nohz full is
> > runtime disabled (we need to pass a nohz_full= boot parameter to
> > enable it). And when it is runtime disabled, there should be no rcu
> > nocb CPU. (Although not setting CPUs in nocb mode when nohz full is
> > runtime disabled is perhaps a recent change.)
> >
> > So for the problem to arise, one needs to enable nohz_full and run
> > a KVM guest. And I never heard about such workloads.
>
> Yeah, it's a new thing, but Marcelo, Luiz and Rik have been having a
> lot of fun with them (with PREEMPT_RT too). They're getting pretty
> good results given the right tuning.

Ok, but I'm still not sure about the details of what you're trying to
do: whether it's only about RCU or it also involves ticks. What kind of
tuning are you doing, and what kind of performance gain do you get?

Thanks.

> I'll let Paul queue the patches for 3.21 then!
>
> Paolo
>
> > That said it's potentially interesting to turn off the tick on the
> > host when the guest runs.
Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On Fri, Feb 06, 2015 at 04:00:27PM +0100, Christian Borntraeger wrote:
> Am 05.02.2015 um 21:23 schrieb r...@redhat.com:
> > When running a KVM guest on a system with NOHZ_FULL enabled, and
> > the KVM guest running with idle=poll mode, we still get wakeups of
> > the rcuos/N threads.
> >
> > This problem has already been solved for user space by telling the
> > RCU subsystem that the CPU is in an extended quiescent state while
> > running user space code. This patch series extends that code a
> > little bit to make it usable to track KVM guest space, too.
> >
> > I tested the code by booting a KVM guest with idle=poll, on a
> > system with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound
> > to a CPU. In a 10 second interval, rcuos/N threads on other CPUs
> > got woken up several times, while the rcuos thread on the CPU
> > running the bound and always running VCPU thread never got woken up
> > once.
> >
> > Thanks to Christian Borntraeger and Paul McKenney for reviewing the
> > first version of this patch series, and helping optimize patch 4/5.
>
> I gave it a quick run on s390/kvm and everything still seems to be
> running fine. Also, I like the idea of this patch set. We have seen
> several cases where the fact that we are in guest context for a full
> tick for cpu bound guests (10ms on s390) caused significant latencies
> for host synchronize-rcu heavy workloads - e.g. getting rid of
> macvtap devices on guest shutdown, or adding hundreds of irq routes
> for many guest devices.
>
> s390 has no context tracking infrastructure yet (no nohz_full), but
> this series looks like the current case (nohz_idle) still works. With
> this in place, having nohz_full on s390 now makes even more sense, as
> KVM hosts with cpu bound guests should get much quicker rcu response
> times when most host CPUs are in an extended quiescent state.

Sure, if you need any help for context tracking, don't hesitate to
ask; it can be a bit tricky to implement sometimes. Perhaps x86 isn't
the best example because it does quite some weird dances to minimize
fast path overhead. ARM is perhaps clearer.

> Christian
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On Fri, Feb 06, 2015 at 01:20:21PM -0500, Rik van Riel wrote: On 02/06/2015 12:22 PM, Frederic Weisbecker wrote: On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote: From: Rik van Riel r...@redhat.com Add the expected ctx_state as a parameter to context_tracking_user_enter and context_tracking_user_exit, allowing the same functions to not just track kernel <-> user space switching, but also kernel <-> guest transitions. Signed-off-by: Rik van Riel r...@redhat.com You should consider using guest_enter() and guest_exit() instead. These are context tracking APIs too, but specifically for guests. What do you mean instead? KVM already uses those. I just wanted to avoid duplicating the code... I mean you can call the rcu_user APIs directly from guest_enter/exit. You don't really need to call the context_tracking_user functions, since guest_enter/guest_exit already handle the vtime accounting. - -- All rights reversed
Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM
On Fri, May 17, 2013 at 07:09:42AM -0700, Kevin Hilman wrote: Frederic Weisbecker fweis...@gmail.com writes: On Thu, May 16, 2013 at 12:52:03AM +0200, Frederic Weisbecker wrote: On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote: Gleb Natapov g...@redhat.com writes: On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote: 2013/3/21 Gleb Natapov g...@redhat.com: Isn't it simpler for kernel/context_tracking.c to define empty __guest_enter()/__guest_exit() if !CONFIG_KVM. That doesn't look right. Off-cases are usually handled from the headers, right? So that we avoid ifdeffery ugliness in core code. Let's put it in the linux/context_tracking.h header then. Here's a version to do that. Kevin From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001 From: Kevin Hilman khil...@linaro.org Date: Mon, 25 Mar 2013 14:12:41 -0700 Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest enter/exit Sorry for my very delayed response... When KVM is not enabled, or not available on a platform, the KVM headers should not be included. Instead, just define stub __guest_[enter|exit] functions. Maybe it would be cleaner to move the guest_enter/exit definitions altogether into linux/context_tracking.h After all, that's where the implementation mostly belongs. Let me see if I can get that in shape. Does the following work for you? Nope. Since it still includes kvm_host.h on non-KVM builds, there is potential for problems. 
For example, ARM (v3.10-rc1 + this patch) has this build error: CC kernel/context_tracking.o In file included from /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_host.h:41:0, from /work/kernel/linaro/nohz/include/linux/kvm_host.h:34, from /work/kernel/linaro/nohz/kernel/context_tracking.c:18: /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_vgic.h:38:6: warning: CONFIG_KVM_ARM_MAX_VCPUS is not defined [-Wundef] In file included from /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_host.h:41:0, from /work/kernel/linaro/nohz/include/linux/kvm_host.h:34, from /work/kernel/linaro/nohz/kernel/context_tracking.c:18: /work/kernel/linaro/nohz/arch/arm/include/asm/kvm_vgic.h:59:11: error: 'CONFIG_KVM_ARM_MAX_VCPUS' undeclared here (not in a function) Sorry, I forgot to remove the include of kvm_host.h in context_tracking.c, here's the fixed patch: diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 365f4a6..fc09d7b 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/percpu.h> +#include <linux/vtime.h> #include <asm/ptrace.h> struct context_tracking { @@ -19,6 +20,26 @@ struct context_tracking { } state; }; +static inline void __guest_enter(void) +{ + /* +* This is running in ioctl context so we can avoid +* the call to vtime_account() with its unnecessary idle check. +*/ + vtime_account_system(current); + current->flags |= PF_VCPU; +} + +static inline void __guest_exit(void) +{ + /* +* This is running in ioctl context so we can avoid +* the call to vtime_account() with its unnecessary idle check. 
+*/ + vtime_account_system(current); + current->flags &= ~PF_VCPU; +} + #ifdef CONFIG_CONTEXT_TRACKING DECLARE_PER_CPU(struct context_tracking, context_tracking); @@ -35,6 +56,9 @@ static inline bool context_tracking_active(void) extern void user_enter(void); extern void user_exit(void); +extern void guest_enter(void); +extern void guest_exit(void); + static inline enum ctx_state exception_enter(void) { enum ctx_state prev_ctx; @@ -57,6 +81,17 @@ extern void context_tracking_task_switch(struct task_struct *prev, static inline bool context_tracking_in_user(void) { return false; } static inline void user_enter(void) { } static inline void user_exit(void) { } + +static inline void guest_enter(void) +{ + __guest_enter(); +} + +static inline void guest_exit(void) +{ + __guest_exit(); +} + static inline enum ctx_state exception_enter(void) { return 0; } static inline void exception_exit(enum ctx_state prev_ctx) { } static inline void context_tracking_task_switch(struct task_struct *prev, diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f0eea07..8db53cf 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -23,6 +23,7 @@ #include <linux/ratelimit.h> #include <linux/err.h> #include <linux/irqflags.h> +#include <linux/context_tracking.h> #include <asm/signal.h> #include <linux/kvm.h> @@ -760,42 +761,6 @@ static inline int kvm_iommu_unmap_guest(struct kvm *kvm) } #endif -static inline void __guest_enter(void) -{ - /* -* This is running in ioctl
Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM
On Thu, May 16, 2013 at 12:52:03AM +0200, Frederic Weisbecker wrote: On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote: Gleb Natapov g...@redhat.com writes: On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote: 2013/3/21 Gleb Natapov g...@redhat.com: Isn't it simpler for kernel/context_tracking.c to define empty __guest_enter()/__guest_exit() if !CONFIG_KVM. That doesn't look right. Off-cases are usually handled from the headers, right? So that we avoid ifdeffery ugliness in core code. Let's put it in the linux/context_tracking.h header then. Here's a version to do that. Kevin From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001 From: Kevin Hilman khil...@linaro.org Date: Mon, 25 Mar 2013 14:12:41 -0700 Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest enter/exit Sorry for my very delayed response... When KVM is not enabled, or not available on a platform, the KVM headers should not be included. Instead, just define stub __guest_[enter|exit] functions. Maybe it would be cleaner to move the guest_enter/exit definitions altogether into linux/context_tracking.h After all, that's where the implementation mostly belongs. Let me see if I can get that in shape. Does the following work for you? 
--- commit b1633c075527653a99df6fd54d2611cf8c8047f5 Author: Frederic Weisbecker fweis...@gmail.com Date: Thu May 16 01:21:38 2013 +0200 kvm: Move guest entry/exit APIs to context_tracking Signed-off-by: Frederic Weisbecker fweis...@gmail.com diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 365f4a6..fc09d7b 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/percpu.h> +#include <linux/vtime.h> #include <asm/ptrace.h> struct context_tracking { @@ -19,6 +20,26 @@ struct context_tracking { } state; }; +static inline void __guest_enter(void) +{ + /* +* This is running in ioctl context so we can avoid +* the call to vtime_account() with its unnecessary idle check. +*/ + vtime_account_system(current); + current->flags |= PF_VCPU; +} + +static inline void __guest_exit(void) +{ + /* +* This is running in ioctl context so we can avoid +* the call to vtime_account() with its unnecessary idle check. 
+*/ + vtime_account_system(current); + current->flags &= ~PF_VCPU; +} + #ifdef CONFIG_CONTEXT_TRACKING DECLARE_PER_CPU(struct context_tracking, context_tracking); @@ -35,6 +56,9 @@ static inline bool context_tracking_active(void) extern void user_enter(void); extern void user_exit(void); +extern void guest_enter(void); +extern void guest_exit(void); + static inline enum ctx_state exception_enter(void) { enum ctx_state prev_ctx; @@ -57,6 +81,17 @@ extern void context_tracking_task_switch(struct task_struct *prev, static inline bool context_tracking_in_user(void) { return false; } static inline void user_enter(void) { } static inline void user_exit(void) { } + +static inline void guest_enter(void) +{ + __guest_enter(); +} + +static inline void guest_exit(void) +{ + __guest_exit(); +} + static inline enum ctx_state exception_enter(void) { return 0; } static inline void exception_exit(enum ctx_state prev_ctx) { } static inline void context_tracking_task_switch(struct task_struct *prev, diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f0eea07..8db53cf 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -23,6 +23,7 @@ #include <linux/ratelimit.h> #include <linux/err.h> #include <linux/irqflags.h> +#include <linux/context_tracking.h> #include <asm/signal.h> #include <linux/kvm.h> @@ -760,42 +761,6 @@ static inline int kvm_iommu_unmap_guest(struct kvm *kvm) } #endif -static inline void __guest_enter(void) -{ - /* -* This is running in ioctl context so we can avoid -* the call to vtime_account() with its unnecessary idle check. 
-*/ - vtime_account_system(current); - current->flags &= ~PF_VCPU; -} - -#ifdef CONFIG_CONTEXT_TRACKING -extern void guest_enter(void); -extern void guest_exit(void); - -#else /* !CONFIG_CONTEXT_TRACKING */ -static inline void guest_enter(void) -{ - __guest_enter(); -} - -static inline void guest_exit(void) -{ - __guest_exit(); -} -#endif /* !CONFIG_CONTEXT_TRACKING */ - static inline void kvm_guest_enter(void) { unsigned long flags;
Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM
On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote: Gleb Natapov g...@redhat.com writes: On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote: 2013/3/21 Gleb Natapov g...@redhat.com: Isn't it simpler for kernel/context_tracking.c to define empty __guest_enter()/__guest_exit() if !CONFIG_KVM. That doesn't look right. Off-cases are usually handled from the headers, right? So that we avoid ifdeffery ugliness in core code. Let's put it in the linux/context_tracking.h header then. Here's a version to do that. Kevin From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001 From: Kevin Hilman khil...@linaro.org Date: Mon, 25 Mar 2013 14:12:41 -0700 Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest enter/exit Sorry for my very delayed response... When KVM is not enabled, or not available on a platform, the KVM headers should not be included. Instead, just define stub __guest_[enter|exit] functions. Maybe it would be cleaner to move the guest_enter/exit definitions altogether into linux/context_tracking.h After all, that's where the implementation mostly belongs. Let me see if I can get that in shape. Thanks.
Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM
2013/3/21 Gleb Natapov g...@redhat.com: Isn't it simpler for kernel/context_tracking.c to define empty __guest_enter()/__guest_exit() if !CONFIG_KVM. That doesn't look right. Off-cases are usually handled from the headers, right? So that we avoid ifdeffery ugliness in core code.
Re: [PATCH 0/4] Alter steal-time reporting in the guest
2013/2/19 Marcelo Tosatti mtosa...@redhat.com: On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here. I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported. Ok, that's a good explanation to add to make that subtle steal time issue clearer. But yes, a description of the scenario that is being dealt with, with details, is important. Yeah, especially for such a significant user ABI change.
Re: [PATCH 0/4] Alter steal-time reporting in the guest
2013/3/5 Michael Wolf m...@linux.vnet.ibm.com: Sorry for the delay in the response. I did not see the email right away. On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here. I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported. Yes, exactly, and the end user many times did not set up the guest and is not aware of the capping. The end user is only aware of the performance level that they were told they would get with the guest. But yes, a description of the scenario that is being dealt with, with details, is important. I will add more detail to the description next time I submit the patches. How about something like, "In a cloud environment the user of a kvm guest is not aware of the underlying hardware or how many other guests are running on it. The end user is only aware of a level of performance that they should see." or does that just muddy the picture more? That alone is probably not enough. But yeah, make sure you clearly state the difference between expected (caps, sched bandwidth...) and unexpected (overcommitting, competing load...) stolen time. Then add a practical example as you made above that explains why it matters to make that distinction and why you want to report it. 
Re: [PATCH 0/4] Alter steal-time reporting in the guest
2013/2/5 Michael Wolf m...@linux.vnet.ibm.com: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The steal time will only be altered if hard limits (cfs bandwidth control) are used. The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time. I'm also a bit confused here. steal time will then only account the cpu time lost due to quotas from cfs bandwidth control? Also, what do you exactly mean by expected steal time? Is it steal time due to overcommitting minus scheduler quotas? Thanks.
Re: [RFC/PATCH 1/1] s390/kvm: Fix BUG in include/linux/kvm_host.h:745
2013/1/8 Christian Borntraeger borntrae...@de.ibm.com: commit b080935c8638e08134629d0a9ebdf35669bec14d kvm: Directly account vtime to system on guest switch also removed the irq_disable/enable around kvm guest switch, which is correct in itself. Unfortunately, there is a BUG_ON that (correctly) checks for preemptible to cover the call to rcu later on. (Introduced with commit 8fa2206821953a50a3a02ea33fcfb3ced2fd9997 KVM: make guest mode entry to be rcu quiescent state) This check might trigger depending on the kernel config. Let's make sure that no preemption happens during kvm_guest_enter. We can enable preemption again after the call to rcu_virt_note_context_switch returns. Please note that we continue to run s390 guests with interrupts enabled. CC: Frederic Weisbecker fweis...@gmail.com CC: Gleb Natapov g...@redhat.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com --- arch/s390/kvm/kvm-s390.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index c9011bf..f090e81 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -613,7 +613,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) kvm_s390_deliver_pending_interrupts(vcpu); vcpu->arch.sie_block->icptcode = 0; + preempt_disable(); kvm_guest_enter(); + preempt_enable(); Sorry for the issue. The fix looks good to me, thanks! Acked-by: Frederic Weisbecker fweis...@gmail.com
Re: [PATCH 2/6] rcu: Allow rcu_user_enter()/exit() to nest
On Sun, Jul 08, 2012 at 06:54:18PM +0300, Avi Kivity wrote: On 07/06/2012 03:00 PM, Frederic Weisbecker wrote: Allow calls to rcu_user_enter() even if we are already in userspace (as seen by RCU) and allow calls to rcu_user_exit() even if we are already in the kernel. This makes the APIs more flexible to be called from architectures. Exception entries for example won't need to know if they come from userspace before calling rcu_user_exit(). I guess I should switch kvm to rcu_user_enter() and co, so we can disable the tick while running in a guest. But where are those functions? What are the rules for calling them? I guess we need to have a closer look at the guest case first. We probably need to take some care about specifics in time and load accounting usually handled by the tick before we can shut it down. RCU is only part of the problem.
Re: [PATCH 6/9] perf: expose perf capability to other modules.
On Thu, Nov 10, 2011 at 06:12:23AM -0600, Jason Wessel wrote: On 11/10/2011 02:58 AM, Frederic Weisbecker wrote: On Mon, Nov 07, 2011 at 02:45:17PM +, Will Deacon wrote: Hi Frederic, On Wed, Nov 02, 2011 at 07:42:04AM +, Frederic Weisbecker wrote: On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote: Right. Originally it could be enabled/disabled. Right now it cannot be, but I believe Frederic is working on making it configurable again. David Yep. Will Deacon is working on making the breakpoints able to process pure arch information (ie: without being forced to use the perf attr as a midlayer to define them). Once we have that I can separate the breakpoints implementation from perf and make it opt-able. How do you foresee kdb fitting into this? I see that currently [on x86] we cook up perf_event structures with a specific overflow handler set. If we want to move this over to using a completely arch-defined structure, then we're going to end up with an overflow handler field in both perf_event *and* the arch-specific structure, which doesn't feel right to me. Of course, if the goal is only to separate ptrace (i.e. user debugging) from the perf dependency then we don't need the overflow handler because we'll always just send SIGTRAP to the current task. Any ideas? I don't know if we want to convert x86/kgdb to use pure arch breakpoints. If kgdb one day wants to extend this use to generic code, it may be a good idea to keep things as they are. I don't know, I'm adding Jason in Cc. I think the important part is to share the allocation code (meaning who owns which breakpoint slots). This is why kgdb/kdb allocates the perf structures. The kgdb code will also directly write data to the slots once it has reserved them; it would be good to share this code, but it was not shared because it was not usable early enough in the boot cycle on x86. Certainly there are others who could consume the same infrastructure, such as kprobes. Jason. 
Yeah sure, in any case we want to keep the slot allocation/reservation handled in kernel/events/hw_breakpoint.c
Re: [PATCH 6/9] perf: expose perf capability to other modules.
On Mon, Nov 07, 2011 at 02:45:17PM +, Will Deacon wrote: Hi Frederic, On Wed, Nov 02, 2011 at 07:42:04AM +, Frederic Weisbecker wrote: On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote: Right. Originally it could be enabled/disabled. Right now it cannot be, but I believe Frederic is working on making it configurable again. David Yep. Will Deacon is working on making the breakpoints able to process pure arch information (ie: without being forced to use the perf attr as a midlayer to define them). Once we have that I can separate the breakpoints implementation from perf and make it opt-able. How do you foresee kdb fitting into this? I see that currently [on x86] we cook up perf_event structures with a specific overflow handler set. If we want to move this over to using a completely arch-defined structure, then we're going to end up with an overflow handler field in both perf_event *and* the arch-specific structure, which doesn't feel right to me. Of course, if the goal is only to separate ptrace (i.e. user debugging) from the perf dependency then we don't need the overflow handler because we'll always just send SIGTRAP to the current task. Any ideas? I don't know if we want to convert x86/kgdb to use pure arch breakpoints. If kgdb one day wants to extend this use to generic code, it may be a good idea to keep things as they are. I don't know, I'm adding Jason in Cc. In any case I think we have a problem if we want to default to sending a SIGTRAP. Look at this: bp = per_cpu(bp_per_reg[i], cpu); /* * Reset the 'i'th TRAP bit in dr6 to denote completion of * exception handling */ (*dr6_p) &= ~(DR_TRAP0 << i); /* * bp can be NULL due to lazy debug register switching * or due to concurrent perf counter removing. */ if (!bp) { rcu_read_unlock(); break; } perf_bp_event(bp, args->regs); I don't have the details about how lazy the debug register switching can be. 
And also we want to avoid locking between the perf event scheduling (removing) and the breakpoint triggering path. A solution is to look at the ptrace breakpoints in the thread struct and see if the one in the index is there. That can reside in its own callback or as a fallback in hw_breakpoint_handler(). I don't feel that strongly about choosing either of those solutions.
Re: [PATCH 6/9] perf: expose perf capability to other modules.
On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote: On 11/01/2011 10:13 AM, Gleb Natapov wrote: On Tue, Nov 01, 2011 at 09:49:19AM -0600, David Ahern wrote: On 10/30/2011 10:53 AM, Gleb Natapov wrote: KVM needs to know perf capability to decide which PMU it can expose to a guest. Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/include/asm/perf_event.h | 11 +++ arch/x86/kernel/cpu/perf_event.c | 11 +++ arch/x86/kernel/cpu/perf_event.h | 2 ++ arch/x86/kernel/cpu/perf_event_intel.c | 3 +++ 4 files changed, 27 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index f61c62f..7d7e57f 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -201,7 +201,18 @@ struct perf_guest_switch_msr { u64 host, guest; }; +struct x86_pmu_capability { + int version; + int num_counters_gp; + int num_counters_fixed; + int bit_width_gp; + int bit_width_fixed; + unsigned int events_mask; + int events_mask_len; +}; + extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr); +extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap); #else static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr) { What about a version of perf_get_x86_pmu_capability for when CONFIG_PERF_EVENTS is not enabled in the host kernel? The next patch for KVM assumes the function is defined. As far as I understand it is not possible to build x86 without CONFIG_PERF_EVENTS right now. Actually the kvm pmu code depends on CONFIG_PERF_EVENTS being enabled. I can easily provide the stub if needed though. Right. Originally it could be enabled/disabled. Right now it cannot be, but I believe Frederic is working on making it configurable again. David Yep. Will Deacon is working on making the breakpoints able to process pure arch information (ie: without being forced to use the perf attr as a midlayer to define them). 
Once we have that I can separate the breakpoints implementation from perf and make it opt-able.
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 05, 2011 at 03:30:26PM +0100, Will Deacon wrote: Hi Frederic, On Mon, Jul 04, 2011 at 02:58:24PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 05:27:25PM +0100, Will Deacon wrote: Hi Frederic, Thanks for including me on CC. On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event->overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. Yes, we could store the breakpoint index in there and it would save us walking over the breakpoints when one fires. Not sure this helps us for anything else though. My main gripe with the ptrace interface to hw_breakpoints is that we have to convert all the breakpoint information from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again in the hw_breakpoint code. Yuck! Agreed, I don't like that either. Would you like to improve that? We probably need to be able to pass some arch data through the whole call of breakpoint creation, including perf_event_create_kernel_counter(). Sure, I'll make some time to look at this and try and get an RFC out in the next few weeks. Great! Thanks a lot! There can be a transition step where we can either take generic attr or arch data, until all archs are converted. 
So that you can handle the arm part and other arch developers can relay. Yup. Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. But being able to create ptrace breakpoints without converting to generic perf attr is a necessary first step in order to achieve this. Agreed, but I'll bear that in mind so I don't make it any more difficult than it already is! Cheers, Will
Re: [PATCH 1/3] perf: add context field to perf_event
On Wed, Jun 29, 2011 at 05:27:25PM +0100, Will Deacon wrote: Hi Frederic, Thanks for including me on CC. On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event->overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. Yes, we could store the breakpoint index in there and it would save us walking over the breakpoints when one fires. Not sure this helps us for anything else though. My main gripe with the ptrace interface to hw_breakpoints is that we have to convert all the breakpoint information from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again in the hw_breakpoint code. Yuck! Agreed, I don't like that either. Would you like to improve that? We probably need to be able to pass some arch data through the whole call of breakpoint creation, including perf_event_create_kernel_counter(). There can be a transition step where we can either take generic attr or arch data, until all archs are converted. So that you can handle the arm part and other arch developers can relay. Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. 
Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. But being able to create ptrace breakpoints without converting to generic perf attr is a necessary first step in order to achieve this. Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Mon, Jul 04, 2011 at 05:10:20PM +0300, Avi Kivity wrote: On 07/04/2011 04:58 PM, Frederic Weisbecker wrote: Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. 'struct preempt_notifier' may be the hook you're looking for. Yeah looks like a perfect fit as it's per task. Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event->overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main user, may be Jason could find some use of it.

---
 arch/arm/kernel/ptrace.c                |  3 ++-
 arch/powerpc/kernel/ptrace.c            |  2 +-
 arch/sh/kernel/ptrace_32.c              |  3 ++-
 arch/x86/kernel/kgdb.c                  |  2 +-
 arch/x86/kernel/ptrace.c                |  3 ++-
 drivers/oprofile/oprofile_perf.c        |  2 +-
 include/linux/hw_breakpoint.h           | 10 --
 include/linux/perf_event.h              |  4 +++-
 kernel/events/core.c                    | 21 +++--
 kernel/events/hw_breakpoint.c           | 10 +++---
 kernel/watchdog.c                       |  2 +-
 samples/hw_breakpoint/data_breakpoint.c |  2 +-
 12 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 9726006..4911c94 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -479,7 +479,8 @@ static struct perf_event *ptrace_hbp_create(struct task_struct *tsk, int type)
 	attr.bp_type	= type;
 	attr.disabled	= 1;
 
-	return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, tsk);
+	return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, NULL,
+					   tsk);
 }
 
 static int ptrace_gethbpregs(struct task_struct *tsk, long num,

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index cb22024..5249308 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -973,7 +973,7 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 			       attr.bp_type);
 	thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
-							ptrace_triggered, task);
+					       ptrace_triggered, NULL, task);
 	if (IS_ERR(bp)) {
 		thread->ptrace_bps[0] = NULL;
 		ptrace_put_breakpoints(task);

diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c
index 3d7b209..930312f 100644
--- a/arch/sh/kernel/ptrace_32.c
+++ b/arch/sh/kernel/ptrace_32.c
@@ -91,7 +91,8 @@ static int set_single_step(struct task_struct *tsk, unsigned long addr)
 	attr.bp_len = HW_BREAKPOINT_LEN_2;
 	attr.bp_type = HW_BREAKPOINT_R;
 
-	bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+	bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+					 NULL, tsk);
 	if (IS_ERR(bp))
 		return PTR_ERR(bp);

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 5f9ecff..473ab53 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -638,7 +638,7 @@ void kgdb_arch_late(void)
 	for (i = 0; i < HBP_NUM; i++) {
 		if (breakinfo[i].pev)
 			continue;
-		breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL);
+		breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL, NULL);
 		if (IS_ERR((void * __force)breakinfo[i].pev)) {
 			printk(KERN_ERR "kgdb: Could not allocate hw breakpoints\nDisabling the kernel debugger\n");

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 807c2a2..28092ae 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -715,7 +715,8 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
 	attr.bp_type = HW_BREAKPOINT_W;
 	attr.disabled = 1;
 
-	bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+	bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+					 NULL, tsk);
 
 	/*
 	 * CHECKME: the previous code returned -EIO if the addr wasn't

diff --git a/drivers/oprofile/oprofile_perf.c
Re: ftrace/perf_event leak
On Wed, Sep 01, 2010 at 01:02:57PM +0200, Peter Zijlstra wrote: On Wed, 2010-09-01 at 13:38 +0300, Avi Kivity wrote: On 09/01/2010 12:38 PM, Li Zefan wrote: Then try this: Tested-by: Avi Kivity a...@redhat.com Thanks, queued as: --- Subject: perf, trace: Fix module leak From: Li Zefan l...@cn.fujitsu.com Date: Wed Sep 01 12:58:43 CEST 2010 Commit 1c024eca (perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events) caused a module refcount leak. Tested-by: Avi Kivity a...@redhat.com Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl LKML-Reference: 4c7e1f12.8030...@cn.fujitsu.com
---
 kernel/trace/trace_event_perf.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/kernel/trace/trace_event_perf.c
===================================================================
--- linux-2.6.orig/kernel/trace/trace_event_perf.c
+++ linux-2.6/kernel/trace/trace_event_perf.c
@@ -91,6 +91,8 @@ int perf_trace_init(struct perf_event *p
 		    tp_event->class && tp_event->class->reg &&
 		    try_module_get(tp_event->mod)) {
 			ret = perf_trace_event_init(tp_event, p_event);
+			if (ret)
+				module_put(tp_event->mod);
 			break;
 		}
 	}
@@ -147,6 +149,7 @@ void perf_trace_destroy(struct perf_even
 		}
 	}
 out:
+	module_put(tp_event->mod);
 	mutex_unlock(&event_mutex);
 }

Thanks for fixing this. However, can we split this into two patches to ease the backport? The lack of a module_put() after perf_trace_init() failure has been there for a while (the backport needs to start in 2.6.32). But the lack of a module_put in the destroy path needs a .35 backport only.
Re: [RFC] Unify KVM kernel-space and user-space code into a single project
On Thu, Mar 18, 2010 at 12:32:51PM +0200, Avi Kivity wrote: By serious developer I mean - someone who is interested in contributing, not in getting their name into the kernel commits list - someone who is willing to read the wiki page and find out where the repository and mailing list for a project is - someone who will spend enough time on the project so that the time to clone two repositories will not be a factor in their contributions I'm not going to argue about the Qemu merging here. But your above assessment is incomplete. It is not because developers don't want to clone two different trees that tools/perf is a success. Or may be it's a factor but I suspect it to be very minimal. I can script git commands if needed. It is actually because both kernel and user side are sync in this scheme. Let's wait and see then. If the tools/perf/ experience has really good results, we can reconsider this at a later date. I think it has already really good results. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm guest: hrtimer: interrupt too slow
On Sat, Oct 10, 2009 at 01:18:16PM +0400, Michael Tokarev wrote: Michael Tokarev wrote: Frederic Weisbecker wrote: [] Was there swapping going on? Not as far as I can see, and sar output agrees. But I can read this from your guest traces: I missed this one yesterday. Note it's GUEST traces indeed. Higher (read: non-zero) pgp{in,out} and fault values happen in the *guest*, not on the host (the original question was whether we have swapping in the HOST, which would explain the timer issues) Yeah indeed. But still, that's a strange happenstance.

[cutting extra all-zero columns]
          pgpgin/s pgpgout/s   fault/s  pgfree/s
11:44:47      0.00     32.32    907.07    277.78
11:44:48     27.59     22.99     44.83    150.57
11:44:49      0.00     33.68     22.11    218.95
[...]
21:46:54      0.00     31.68     16.83     90.10
21:46:55      0.00    108.00     17.00     89.00
21:46:56      9.76    482.93   3890.24    439.02
21:46:57      0.00    760.00   8627.00   1133.00
21:46:58      0.00     84.85   2612.12    138.38
21:46:59      0.00     16.00     17.00     90.00

So it looks like there was some swapping in when the hrtimer (spuriously) hanged. One possible guess: since the guest hanged for some time, the higher values there might be a result of values accumulated over several seconds. Maybe, yeah. I don't know enough about virtual internals so...
Re: kvm guest: hrtimer: interrupt too slow
On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote: Marcelo Tosatti wrote: [snip] Would be useful to collect sar (sar -B -b -u) output every one second in both host/guest. You already mentioned load was low, but this should give more details. Here we go: http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/ Two incidents - cases when "hrtimer: interrupt too slow" was reported in the guest (with Marcelo's patch so that min_delta is increased by 50% only) - happened at 11:44:48 and 21:46:56 (as shown in the guest-dmesg file). For both, there's `sar -BWbd' output for a 2-minute interval (starting one minute before the delay and ending one minute after) from both host and guest. Was there swapping going on? Not as far as I can see, and sar output agrees. But I can read this from your guest traces:

          pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
11:44:45      0.00     32.32    174.75      0.00    176.77      0.00      0.00      0.00      0.00
11:44:46      0.00     16.00    789.00      0.00    323.00      0.00      0.00      0.00      0.00
11:44:47      0.00     32.32    907.07      0.00    277.78      0.00      0.00      0.00      0.00
11:44:48     27.59     22.99     44.83      0.00    150.57      0.00      0.00      0.00      0.00
11:44:49      0.00     33.68     22.11      0.00    218.95      0.00      0.00      0.00      0.00
11:44:50      0.00    101.01     17.17      0.00    151.52      0.00      0.00      0.00      0.00
11:44:51      0.00     15.69     16.67      0.00    126.47      0.00      0.00      0.00      0.00
[...]
21:46:52      0.00     40.00     17.00      0.00     82.00      0.00      0.00      0.00      0.00
21:46:53      0.00     31.68     18.81      0.00     94.06      0.00      0.00      0.00      0.00
21:46:54      0.00     31.68     16.83      0.00     90.10      0.00      0.00      0.00      0.00
21:46:55      0.00    108.00     17.00      0.00     89.00      0.00      0.00      0.00      0.00
21:46:56      9.76    482.93   3890.24      0.00    439.02      0.00      0.00      0.00      0.00
21:46:57      0.00    760.00   8627.00      0.00   1133.00      0.00      0.00      0.00      0.00
21:46:58      0.00     84.85   2612.12      0.00    138.38      0.00      0.00      0.00      0.00
21:46:59      0.00     16.00     17.00      0.00     90.00      0.00      0.00      0.00      0.00
21:47:00      0.00     36.36     17.17      0.00     90.91      0.00      0.00      0.00      0.00

So it looks like there was some swapping in when the hrtimer (spuriously) hanged. Not sure how to correlate these results with the incident though...
Re: kvm guest: hrtimer: interrupt too slow
(Adding Thomas in Cc) On Sat, Oct 03, 2009 at 08:12:05PM -0300, Marcelo Tosatti wrote: Michael, Can you please give the patch below a try? (without acpi_pm timer or priority adjustments for the guest). On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote: Hello. I'm having quite an.. unusable system here. It's not really a regression with 0.11.0, it was something similar before, but with 0.11.0 and/or 2.6.31 it became much worse. The thing is that after some uptime, the kvm guest prints something like this:

  hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

after which the (guest) system becomes very slow. The above message is from a 2.6.31 guest running with 0.11.0 on a 2.6.31 host. Before I tried it with 0.10.6 and 2.6.30 or 2.6.27, and the deltas were a bit less than that:

  hrtimer: interrupt too slow, forcing clock min delta to 15415 ns
  hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns

It seems the way hrtimer_interrupt_hanging calculates min_delta is wrong (especially for virtual machines). The guest vcpu can be scheduled out during the execution of the hrtimer callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor). So high min_delta values can be calculated if, for example, a single hrtimer_interrupt run takes two host time slices to execute, while some other higher priority task runs for N slices in between. Using the hrtimer_interrupt execution time (which can be the worst case at any given time) as the min_delta is problematic. So simply increase min_delta_ns by 50% on every detected failure, which will eventually lead to an acceptable threshold (the algorithm should scale back down to a lower min_delta, to adjust back to wealthier times, too).
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 49da79a..8997978 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
- * The next tick event will be scheduled to 3 times we currently spend on
- * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
- * 1/4 of their time to process the hrtimer interrupts. This is enough to
- * let it running without serious starvation.
+ * such as the tracing, so we increase min_delta_ns.
  */
 static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-			  ktime_t try_time)
+hrtimer_interrupt_hanging(struct clock_event_device *dev)
 {
-	force_clock_reprogram = 1;
-	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
-	printk(KERN_WARNING "hrtimer: interrupt too slow, "
-	       "forcing clock min delta to %lu ns\n", dev->min_delta_ns);
+	dev->min_delta_ns += dev->min_delta_ns >> 1;

I haven't thought about the guest that could be scheduled out in the middle of the timers servicing, which makes this check, based on the time spent in hrtimer_interrupt(), wrong. I guess there is no easy/generic/cheap way to rebase this check on the _virtual_ time spent in the timers servicing. By virtual, I mean the time spent in the guest only. In a non-guest kernel, the old check forces an adaptive rate sharing:
- we spent n nanosecs to service the batch of timers.
- we are hanging
- we want at least 3/4 of the time reserved for non-timer servicing in the kernel; this is a minimum prerequisite for the system not to starve
- adapt the min_clock_delta to fit the above constraint

All that does not make sense anymore in a guest. The hang detection and warnings, the recalibrations of the min_clock_deltas, are completely wrong in this context. Not only does it spuriously warn, but the minimum timer delta increases slowly and the guest progressively suffers from higher and higher latencies. That's really bad. Your patch lowers the immediate impact and makes this illness evolve more smoothly by scaling down the recalibration of the min_clock_delta. This appeases the bug but doesn't solve it. I fear it could be even worse because it makes it more discreet. Maybe we can instead increase the minimum threshold of loops in the hrtimer interrupt before considering it a hang? Hmm, but a too-high number could make this check useless, depending on the number of pending timers, which is a finite number. Well, actually I'm not confident anymore in this check. Or actually we should change it. Maybe we can rebase it on the time spent on the hrtimer interrupt (and check it every 10
Re: [TOOL] c2kpe: C expression to kprobe event format converter
On Mon, Aug 31, 2009 at 12:14:34AM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 13, 2009 at 04:59:19PM -0400, Masami Hiramatsu wrote: This program converts probe point in C expression to kprobe event format for kprobe-based event tracer. This helps to define kprobes events by C source line number or function name, and local variable name. Currently, this supports only x86(32/64) kernels. Compile Before compilation, please install libelf and libdwarf development packages. (e.g. elfutils-libelf-devel and libdwarf-devel on Fedora) This may probably need a specific libdwarf version? c2kpe.c: In function ‘die_get_entrypc’: c2kpe.c:422: erreur: ‘Dwarf_Ranges’ undeclared (first use in this function) c2kpe.c:422: erreur: (Each undeclared identifier is reported only once c2kpe.c:422: erreur: for each function it appears in.) c2kpe.c:422: erreur: ‘ranges’ undeclared (first use in this function) c2kpe.c:447: attention : implicit declaration of function ‘dwarf_get_ranges’ c2kpe.c:451: attention : implicit declaration of function ‘dwarf_ranges_dealloc’ Aah, sure, it should be compiled with libdwarf newer than 20090324. You can find it in http://reality.sgiweb.org/davea/dwarf.html Ah ok. BTW, libdwarf and libdw (which is the yet another implementation of dwarf library) are still under development, e.g. libdwarf doesn't support gcc-4.4.1(very new) and only the latest libdw(0.142) can support it. So, perhaps I might better port it on libdw, even that is less documented...:( May be let's continue with libdwarf for now and we'll see if support for libdw is required later. TODO - Fix bugs. - Support multiple probepoints from stdin. - Better kmodule support. - Use elfutils-libdw? - Merge into trace-cmd or perf-tools? Yeah definetly, that would be a veeery interesting thing to have. I've played with kprobe ftrace to debug something this evening. It's very cool to be able to put dynamic tracepoints in desired places. But... 
I firstly needed to put random trace_printk() in some places to observe some variables values. And then I thought about the kprobes tracer and realized I could do that without the need of rebuilding my kernel. Then I've played with it and indeed it works well and it's useful, but at the cost of reading objdump based assembly code to find the places where I could find my variables values. And after two or three probes in such conditions, I've become tired of that, then I wanted to try this tool. While I cannot yet because of this build error, I can imagine the power of such facility from perf. We could have a perf probe that creates a kprobe event in debugfs (default enable = 0) and which then rely on perf record for the actual recording. Then we could analyse it through perf trace. Let's imagine a simple example: int foo(int arg1, int arg2) { int var1; var1 = arg1; var1 *= arg2; var1 -= arg1; -- insert a probe here (file bar.c : line 60) var1 ^= ... return var1; } ./perf kprobe --file bar.c:60 --action arg1=%d,arg2=%d,var1=%d -- ls -R / I recommend it should be separated from record, like below: # set new event ./perf kprobe --add kprobe:event1 --file bar.c:60 --action arg1=%d,arg2=%d,var1=%d # record new event ./perf record -e kprobe:event1 -a -R -- ls -R / That indeed solves the command line overkill, but that also breaks a bit the workflow :) Well, I guess we can start simple in the beginning and follow the above mockup which is indeed better decoupled. And if something more intuitive comes in mind later, then we can still change it. This will allow us to focus on one thing -- convert C to kprobe-tracer. And also, it can be listed as like as tracepoint events. Yeah. ./perf trace arg1=1 arg2=1 var1=0 arg1=2 arg2=2 var1=2 etc.. You may want to sort by field: ./perf trace -s arg1 --order desc arg1=1 | --- arg2=1 var=1 | --- arg2=2 var=1 arg1=2 | --- arg2=1 var=0 | --- [...] ./perf trace -s arg1,arg2 --order asc arg1=1 | --- arg2=1 | - var1=0 | - var1= arg2=... 
| Ok the latter is a bad example because var1 will always have only one value for a given arg1 and arg2. But I guess you see the point. You won't have to care about the perf trace part, it's already implemented and I'll soon handle the sorting part. All we need is the perf kprobes that translate a C level probing expression to a /debug/tracing/kprobe_events compliant thing. And then just call
Re: [TOOL] c2kpe: C expression to kprobe event format converter
On Thu, Aug 13, 2009 at 04:59:19PM -0400, Masami Hiramatsu wrote: This program converts probe point in C expression to kprobe event format for kprobe-based event tracer. This helps to define kprobes events by C source line number or function name, and local variable name. Currently, this supports only x86(32/64) kernels. Compile Before compilation, please install libelf and libdwarf development packages. (e.g. elfutils-libelf-devel and libdwarf-devel on Fedora) This may probably need a specific libdwarf version? c2kpe.c: In function ‘die_get_entrypc’: c2kpe.c:422: erreur: ‘Dwarf_Ranges’ undeclared (first use in this function) c2kpe.c:422: erreur: (Each undeclared identifier is reported only once c2kpe.c:422: erreur: for each function it appears in.) c2kpe.c:422: erreur: ‘ranges’ undeclared (first use in this function) c2kpe.c:447: attention : implicit declaration of function ‘dwarf_get_ranges’ c2kpe.c:451: attention : implicit declaration of function ‘dwarf_ranges_dealloc’ TODO - Fix bugs. - Support multiple probepoints from stdin. - Better kmodule support. - Use elfutils-libdw? - Merge into trace-cmd or perf-tools? Yeah definetly, that would be a veeery interesting thing to have. I've played with kprobe ftrace to debug something this evening. It's very cool to be able to put dynamic tracepoints in desired places. But... I firstly needed to put random trace_printk() in some places to observe some variables values. And then I thought about the kprobes tracer and realized I could do that without the need of rebuilding my kernel. Then I've played with it and indeed it works well and it's useful, but at the cost of reading objdump based assembly code to find the places where I could find my variables values. And after two or three probes in such conditions, I've become tired of that, then I wanted to try this tool. While I cannot yet because of this build error, I can imagine the power of such facility from perf. 
We could have a perf probe that creates a kprobe event in debugfs (default enable = 0) and which then rely on perf record for the actual recording. Then we could analyse it through perf trace. Let's imagine a simple example: int foo(int arg1, int arg2) { int var1; var1 = arg1; var1 *= arg2; var1 -= arg1; -- insert a probe here (file bar.c : line 60) var1 ^= ... return var1; } ./perf kprobe --file bar.c:60 --action arg1=%d,arg2=%d,var1=%d -- ls -R / ./perf trace arg1=1 arg2=1 var1=0 arg1=2 arg2=2 var1=2 etc.. You may want to sort by field: ./perf trace -s arg1 --order desc arg1=1 | --- arg2=1 var=1 | --- arg2=2 var=1 arg1=2 | --- arg2=1 var=0 | --- [...] ./perf trace -s arg1,arg2 --order asc arg1=1 | --- arg2=1 | - var1=0 | - var1= arg2=... | Ok the latter is a bad example because var1 will always have only one value for a given arg1 and arg2. But I guess you see the point. You won't have to care about the perf trace part, it's already implemented and I'll soon handle the sorting part. All we need is the perf kprobes that translate a C level probing expression to a /debug/tracing/kprobe_events compliant thing. And then just call perf record with the new created event as an argument. Frederic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH tracing/kprobes 1/4] x86: Fix x86 instruction decoder selftest to check only .text
On Fri, Aug 21, 2009 at 03:43:07PM -0400, Masami Hiramatsu wrote: Fix x86 instruction decoder selftest to check only .text because other sections (e.g. .notes) will have random bytes which don't need to be checked. Applied these 4 patches in git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git \ tracing/kprobes Thanks! -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip v14 01/12] x86: instruction decoder API
On Thu, Aug 20, 2009 at 10:42:05AM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote: Add x86 instruction decoder to arch-specific libraries. This decoder can decode x86 instructions used in kernel into prefix, opcode, modrm, sib, displacement and immediates. This can also show the length of instructions. This version introduces instruction attributes for decoding instructions. The instruction attribute tables are generated from the opcode map file (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk). Currently, the opcode maps are based on opcode maps in Intel(R) 64 and IA-32 Architectures Software Developers Manual Vol.2: Appendix.A, and consist of below two types of opcode tables. 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are written as below; Table: table-name Referrer: escaped-name opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] (or) opcode: escape # escaped-name EndTable Group opcodes, which has 8 elements, are written as below; GrpTable: GrpXXX reg: mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] EndTable These opcode maps include a few SSE and FP opcodes (for setup), because those opcodes are used in the kernel. I'm getting the following build error on an old K7 box: arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’: arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in this function) arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only once arch/x86/lib/inat.c:29: erreur: for each function it appears in.) Thanks for reporting! Hmm, it seems that inat-tables.c is not correctly generated. Could you tell me which awk you used and send the inat-tables.c? Thank you, Sure: $ awk -Wv mawk 1.3.3 Nov 1996, Copyright (C) Michael D. 
Brennan compiled limits: max NF 32767 sprintf buffer 2040 And I've sent you the content of inat_tables.c in the other answer :) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip v14 01/12] x86: instruction decoder API
On Thu, Aug 20, 2009 at 11:03:40AM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote: On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote: Add x86 instruction decoder to arch-specific libraries. This decoder can decode x86 instructions used in kernel into prefix, opcode, modrm, sib, displacement and immediates. This can also show the length of instructions. This version introduces instruction attributes for decoding instructions. The instruction attribute tables are generated from the opcode map file (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk). Currently, the opcode maps are based on opcode maps in Intel(R) 64 and IA-32 Architectures Software Developers Manual Vol.2: Appendix.A, and consist of below two types of opcode tables. 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are written as below; Table: table-name Referrer: escaped-name opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] (or) opcode: escape # escaped-name EndTable Group opcodes, which has 8 elements, are written as below; GrpTable: GrpXXX reg: mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] EndTable These opcode maps include a few SSE and FP opcodes (for setup), because those opcodes are used in the kernel. I'm getting the following build error on an old K7 box: arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’: arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in this function) arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only once arch/x86/lib/inat.c:29: erreur: for each function it appears in.) I've attached my config. I haven't such problem on a dual x86-64 box. Actually I have the same problem in x86-64 The content of my arch/x86/lib/inat-tables.c: /* x86 opcode map generated from x86-opcode-map.txt */ /* Do not change this code. 
*/ /* Table: one byte opcode */ /* Escape opcode map array */ const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 1][INAT_LPREFIX_MAX + 1] = { }; /* Group opcode map array */ const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 1][INAT_LPREFIX_MAX + 1] = { }; I guess there is a problem with the generation of this file. Aah, you may be using mawk on Ubuntu 9.04, right? If so, unfortunately, mawk is still under development. http://invisible-island.net/mawk/CHANGES Aargh... 20090727 add check/fix to prevent gsub from recurring to modify on a substring of the current line when the regular expression is anchored to the beginning of the line; fixes gawk's anchgsub testcase. add check for implicit concatenation mistaken for exponent; fixes gawk's hex testcase. add character-classes to built-in regular expressions. ^^ Look, this means we can't use char-class expressions like [:lower:] until this version... And I've found another bug in mawk-1.3.3-20090728 (the latest one). It almost works, but:

$ mawk 'BEGIN {printf("0x%x\n", 0)}'
0x1

Ouch, indeed.

$ gawk 'BEGIN {printf("0x%x\n", 0)}'
0x0

This bug skips the array element at index 0x0 in inat-tables.c :( So, I recommend you install gawk instead of mawk until mawk supports all POSIX awk features, since I don't think it is a good idea to work around all those bugs, which depend on implementation (not specification). Thank you, Yeah, indeed. Maybe add a warning (or build error) in case the user uses mawk? Anyway that works fine now with gawk, thanks! All your patches build well :-) -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [PATCH -tip v14 01/12] x86: instruction decoder API
On Thu, Aug 20, 2009 at 12:16:05PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 20, 2009 at 11:03:40AM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote: On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote: Add x86 instruction decoder to arch-specific libraries. This decoder can decode x86 instructions used in kernel into prefix, opcode, modrm, sib, displacement and immediates. This can also show the length of instructions. This version introduces instruction attributes for decoding instructions. The instruction attribute tables are generated from the opcode map file (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk). Currently, the opcode maps are based on opcode maps in Intel(R) 64 and IA-32 Architectures Software Developers Manual Vol.2: Appendix.A, and consist of below two types of opcode tables. 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are written as below; Table: table-name Referrer: escaped-name opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] (or) opcode: escape # escaped-name EndTable Group opcodes, which has 8 elements, are written as below; GrpTable: GrpXXX reg: mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] EndTable These opcode maps include a few SSE and FP opcodes (for setup), because those opcodes are used in the kernel. I'm getting the following build error on an old K7 box: arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’: arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in this function) arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only once arch/x86/lib/inat.c:29: erreur: for each function it appears in.) I've attached my config. I haven't such problem on a dual x86-64 box. 
Actually I have the same problem on x86-64. The content of my arch/x86/lib/inat-tables.c:

/* x86 opcode map generated from x86-opcode-map.txt */
/* Do not change this code. */
/* Table: one byte opcode */
/* Escape opcode map array */
const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 1][INAT_LPREFIX_MAX + 1] = { };
/* Group opcode map array */
const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 1][INAT_LPREFIX_MAX + 1] = { };

I guess there is a problem with the generation of this file. Aah, you may be using mawk on Ubuntu 9.04, right? If so, unfortunately, mawk is still under development. http://invisible-island.net/mawk/CHANGES Aargh...

20090727
add check/fix to prevent gsub from recurring to modify on a substring of the current line when the regular expression is anchored to the beginning of the line; fixes gawk's anchgsub testcase.
add check for implicit concatenation mistaken for exponent; fixes gawk's hex testcase.
add character-classes to built-in regular expressions. ^^

Look, this means we can't use char-class expressions like [:lower:] until this version... And I've found another bug in mawk-1.3.3-20090728 (the latest one). It almost works, but:

$ mawk 'BEGIN {printf("0x%x\n", 0)}'
0x1

Ouch, indeed.

$ gawk 'BEGIN {printf("0x%x\n", 0)}'
0x0

This bug skips the array element at index 0x0 in inat-tables.c :( So, I recommend you install gawk instead of mawk until mawk supports all POSIX awk features, since I don't think it is a good idea to work around all those bugs which depend on the implementation (not the specification). Thank you, Yeah, indeed. Maybe add a warning (or build error) in case the user uses mawk? Hmm, it is possible that mawk will fix those bugs and catch up soon, so I think checking for mawk is not a good idea. (And since there are other awk implementations, it's not fair.) Yeah, but without your tip I couldn't have found the origin for quite some time. 
And the kernel couldn't build anyway. At least we should do something with this version of mawk. Anyway that works fine now with gawk, thanks! All your patches build well :-) Thank you for testing! -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [TOOL] kprobestest : Kprobe stress test tool
On Thu, Aug 13, 2009 at 04:57:20PM -0400, Masami Hiramatsu wrote: This script tests kprobes by probing on all symbols in the kernel and finds symbols which must be blacklisted.

Usage -
kprobestest [-s SYMLIST] [-b BLACKLIST] [-w WHITELIST]
Run the stress test. If a SYMLIST file is specified, use it as the initial symbol list (this is useful for verifying a white list after diagnosing all symbols).
kprobestest cleanup
Cleanup all lists

How It Works ---
This tool lists up all symbols in the kernel via /proc/kallsyms and sorts them into groups (each of them including 64 symbols by default). It then tests each group by using kprobe-tracer. If a kernel crash occurs, that group is moved into the 'failed' dir. If the group passes the test, the script moves it into the 'passed' dir and saves kprobe_profile into 'passed/profiles/'. After testing all groups, all 'failed' groups are merged and sorted into smaller groups (divided by 4, by default), and those are tested again. This loop is repeated until every group has just 1 symbol. Finally, the script sorts all 'passed' symbols into 'tested', 'untested', and 'missed' based on the profiles.

Note -
This script just gives us some clues to the functions that need blacklisting. In some cases a combination of probe points will cause a problem, even though none of them causes the problem alone.

Thank you, This script easily hard-locks my x86-64 dual core on the 1st batch of symbols to test. 
I have one sym list in the failed and unset directories: int_very_careful int_signal int_restore_rest stub_clone stub_fork stub_vfork stub_sigaltstack stub_iopl ptregscall_common stub_execve stub_rt_sigreturn irq_entries_start common_interrupt ret_from_intr exit_intr retint_with_reschedule retint_check retint_swapgs retint_restore_args restore_args irq_return retint_careful retint_signal retint_kernel irq_move_cleanup_interrupt reboot_interrupt apic_timer_interrupt generic_interrupt invalidate_interrupt0 invalidate_interrupt1 invalidate_interrupt2 invalidate_interrupt3 invalidate_interrupt4 invalidate_interrupt5 invalidate_interrupt6 invalidate_interrupt7 threshold_interrupt thermal_interrupt mce_self_interrupt call_function_single_interrupt call_function_interrupt reschedule_interrupt error_interrupt spurious_interrupt perf_pending_interrupt divide_error overflow bounds invalid_op device_not_available double_fault coprocessor_segment_overrun invalid_TSS segment_not_present spurious_interrupt_bug coprocessor_error alignment_check simd_coprocessor_error native_load_gs_index gs_change kernel_thread child_rip kernel_execve call_softirq I don't have a crash log because I was running with X. But it also happened with other batch of symbols. The problem is that I don't have any serial line in this box then I can't catch any crash log. My K7 testbox also died in my arms this afternoon. But I still have two other testboxes (one P2 and one P3), hopefully I could reproduce the problem in these boxes in which I can connect a serial line. I've pushed your patches in the following git tree: git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git \ tracing/kprobes So you can send patches on top of this one. 
Config in attachment: [.config attachment for kernel 2.6.31-rc5, x86_64, generated Thu Aug 20 19:35:39 2009; truncated by the archive]
Re: [PATCH -tip v14 01/12] x86: instruction decoder API
On Thu, Aug 20, 2009 at 03:01:25PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 20, 2009 at 12:16:05PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 20, 2009 at 11:03:40AM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote: On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote: Add x86 instruction decoder to arch-specific libraries. This decoder can decode x86 instructions used in kernel into prefix, opcode, modrm, sib, displacement and immediates. This can also show the length of instructions. This version introduces instruction attributes for decoding instructions. The instruction attribute tables are generated from the opcode map file (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk). Currently, the opcode maps are based on opcode maps in Intel(R) 64 and IA-32 Architectures Software Developers Manual Vol.2: Appendix.A, and consist of below two types of opcode tables. 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are written as below; Table: table-name Referrer: escaped-name opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] (or) opcode: escape # escaped-name EndTable Group opcodes, which has 8 elements, are written as below; GrpTable: GrpXXX reg: mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] EndTable These opcode maps include a few SSE and FP opcodes (for setup), because those opcodes are used in the kernel. I'm getting the following build error on an old K7 box: arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’: arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in this function) arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only once arch/x86/lib/inat.c:29: erreur: for each function it appears in.) I've attached my config. 
I haven't such problem on a dual x86-64 box. Actually I have the same problem in x86-64 The content of my arch/x86/lib/inat-tables.c: /* x86 opcode map generated from x86-opcode-map.txt */ /* Do not change this code. */ /* Table: one byte opcode */ /* Escape opcode map array */ const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 1][INAT_LPREFIX_MAX + 1] = { }; /* Group opcode map array */ const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 1][INAT_LPREFIX_MAX + 1] = { }; I guess there is a problem with the generation of this file. Aah, you may use mawk on Ubuntu 9.04, right? If so, unfortunately, mawk is still under development. http://invisible-island.net/mawk/CHANGES Aargh... 20090727 add check/fix to prevent gsub from recurring to modify on a substring of the current line when the regular expression is anchored to the beginning of the line; fixes gawk's anchgsub testcase. add check for implicit concatenation mistaken for exponent; fixes gawk's hex testcase. add character-classes to built-in regular expressions. ^^ Look, this means we can't use char-class expressions like [:lower:] until this version... And I've found another bug in mawk-1.3.3-20090728(the latest one). it almost works, but; $ mawk 'BEGIN {printf(0x%x\n, 0)}' 0x1 Ouch, indeed. $ gawk 'BEGIN {printf(0x%x\n, 0)}' 0x0 This bug skips an array element index 0x0 in inat-tables.c :( So, I recommend you to install gawk instead mawk until that supports all posix-awk features, since I don't think it is good idea to avoid all those bugs which depends on implementation (not specification). Thank you, Yeah, indeed. May be add a warning (or build error) in case the user uses mawk? Hmm, it is possible that mawk will fix those bugs and catch up soon, so, I think checking mawk is not a good idea. (and since there will be other awk implementations, it's not fair.) 
I think all I can do for now is report the bugs to the mawk and Ubuntu people. :-) Yeah, but without your tip I couldn't have found the origin for quite some time. And the kernel couldn't build anyway. At least we should do something about this version of mawk. Hm, indeed. Maybe we can run an additional sanity-check script before using awk, like this;

---
res=`echo a | $AWK '/[[:lower:]]+/{print "OK"}'`
[ "$res" != "OK" ] && exit 1
res=`$AWK 'BEGIN {printf("%x", 0)}'`
[ "$res" != "0" ] && exit 1
exit 0
---

Thanks, Yeah, that looks good. Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip v14 07/12] tracing: Introduce TRACE_FIELD_ZERO() macro
On Tue, Aug 18, 2009 at 10:20:11PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Thu, Aug 13, 2009 at 04:35:01PM -0400, Masami Hiramatsu wrote: Use TRACE_FIELD_ZERO(type, item) instead of TRACE_FIELD_ZERO_CHAR(item). This also includes a fix of the TRACE_ZERO_CHAR() macro. I can't find what the fix is about (see below) Ah, OK. This patch actually includes two parts. One is introducing TRACE_FIELD_ZERO, which is more generic than TRACE_FIELD_ZERO_CHAR, I think. The other is a typo fix of TRACE_ZERO_CHAR. Ok.

+#undef TRACE_FIELD_ZERO
+#define TRACE_FIELD_ZERO(type, item)
+

Is it about the above moving? If so, could you just say so, so that I can add something about it in the changelog. No, I assume that TRACE_ZERO_CHAR is just a typo of TRACE_FIELD_ZERO_CHAR (because I couldn't find any other TRACE_ZERO_CHAR). BTW, this patch may not be needed after applying patch 10/12, since it removes the ftrace event definitions of TRACE_KPROBE/KRETPROBE. Perhaps I'd better merge and split those additional patches (and remove this change)? (It could also make the incremental review hard...) Thank you, No, let's keep it as is. It should be fine. Thanks. -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip v14 01/12] x86: instruction decoder API
On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote: Add x86 instruction decoder to arch-specific libraries. This decoder can decode x86 instructions used in kernel into prefix, opcode, modrm, sib, displacement and immediates. This can also show the length of instructions. This version introduces instruction attributes for decoding instructions. The instruction attribute tables are generated from the opcode map file (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk). Currently, the opcode maps are based on opcode maps in Intel(R) 64 and IA-32 Architectures Software Developers Manual Vol.2: Appendix.A, and consist of below two types of opcode tables. 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are written as below; Table: table-name Referrer: escaped-name opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] (or) opcode: escape # escaped-name EndTable Group opcodes, which has 8 elements, are written as below; GrpTable: GrpXXX reg: mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] EndTable These opcode maps include a few SSE and FP opcodes (for setup), because those opcodes are used in the kernel. I'm getting the following build error on an old K7 box: arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’: arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in this function) arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only once arch/x86/lib/inat.c:29: erreur: for each function it appears in.) I've attached my config. I haven't such problem on a dual x86-64 box. 
[Attached .config for kernel 2.6.31-rc5, i386 (the K7 test box); truncated by the archive]
Re: [PATCH -tip v14 01/12] x86: instruction decoder API
On Thu, Aug 20, 2009 at 01:42:31AM +0200, Frederic Weisbecker wrote: On Thu, Aug 13, 2009 at 04:34:13PM -0400, Masami Hiramatsu wrote: Add x86 instruction decoder to arch-specific libraries. This decoder can decode x86 instructions used in kernel into prefix, opcode, modrm, sib, displacement and immediates. This can also show the length of instructions. This version introduces instruction attributes for decoding instructions. The instruction attribute tables are generated from the opcode map file (x86-opcode-map.txt) by the generator script(gen-insn-attr-x86.awk). Currently, the opcode maps are based on opcode maps in Intel(R) 64 and IA-32 Architectures Software Developers Manual Vol.2: Appendix.A, and consist of below two types of opcode tables. 1-byte/2-bytes/3-bytes opcodes, which has 256 elements, are written as below; Table: table-name Referrer: escaped-name opcode: mnemonic|GrpXXX [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] (or) opcode: escape # escaped-name EndTable Group opcodes, which has 8 elements, are written as below; GrpTable: GrpXXX reg: mnemonic [operand1[,operand2...]] [(extra1)[,(extra2)...] [| 2nd-mnemonic ...] EndTable These opcode maps include a few SSE and FP opcodes (for setup), because those opcodes are used in the kernel. I'm getting the following build error on an old K7 box: arch/x86/lib/inat.c: In function ‘inat_get_opcode_attribute’: arch/x86/lib/inat.c:29: erreur: ‘inat_primary_table’ undeclared (first use in this function) arch/x86/lib/inat.c:29: erreur: (Each undeclared identifier is reported only once arch/x86/lib/inat.c:29: erreur: for each function it appears in.) I've attached my config. I haven't such problem on a dual x86-64 box. Actually I have the same problem in x86-64 The content of my arch/x86/lib/inat-tables.c: /* x86 opcode map generated from x86-opcode-map.txt */ /* Do not change this code. 
*/ /* Table: one byte opcode */ /* Escape opcode map array */ const insn_attr_t const *inat_escape_tables[INAT_ESC_MAX + 1][INAT_LPREFIX_MAX + 1] = { }; /* Group opcode map array */ const insn_attr_t const *inat_group_tables[INAT_GRP_MAX + 1][INAT_LPREFIX_MAX + 1] = { }; I guess there is a problem with the generation of this file. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip v14 03/12] kprobes: checks probe address is instruction boundary on x86
On Thu, Aug 13, 2009 at 04:34:28PM -0400, Masami Hiramatsu wrote: Ensure safeness of inserting kprobes by checking whether the specified address is at the first byte of an instruction on x86. This is done by decoding the probed function from its head to the probe point.

Signed-off-by: Masami Hiramatsu mhira...@redhat.com
Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Andi Kleen a...@linux.intel.com
Cc: Christoph Hellwig h...@infradead.org
Cc: Frank Ch. Eigler f...@redhat.com
Cc: Frederic Weisbecker fweis...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Jason Baron jba...@redhat.com
Cc: Jim Keniston jkeni...@us.ibm.com
Cc: K.Prasad pra...@linux.vnet.ibm.com
Cc: Lai Jiangshan la...@cn.fujitsu.com
Cc: Li Zefan l...@cn.fujitsu.com
Cc: Przemysław Pawełczyk przemys...@pawelczyk.it
Cc: Roland McGrath rol...@redhat.com
Cc: Sam Ravnborg s...@ravnborg.org
Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
Cc: Steven Rostedt rost...@goodmis.org
Cc: Tom Zanussi tzanu...@gmail.com
Cc: Vegard Nossum vegard.nos...@gmail.com
---
 arch/x86/kernel/kprobes.c | 69 +
 1 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c index b5b1848..80d493f 100644 --- a/arch/x86/kernel/kprobes.c +++ b/arch/x86/kernel/kprobes.c
@@ -48,6 +48,7 @@
 #include <linux/preempt.h>
 #include <linux/module.h>
 #include <linux/kdebug.h>
+#include <linux/kallsyms.h>
 #include <asm/cacheflush.h>
 #include <asm/desc.h>
@@ -55,6 +56,7 @@
 #include <asm/uaccess.h>
 #include <asm/alternative.h>
 #include <asm/debugreg.h>
+#include <asm/insn.h>
 void jprobe_return_end(void);
@@ -245,6 +247,71 @@ retry: } }
+/* Recover the probed instruction at addr for further analysis. */
+static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
+{
+	struct kprobe *kp;
+	kp = get_kprobe((void *)addr);
+	if (!kp)
+		return -EINVAL;
+
+	/*
+	 * Basically, kp->ainsn.insn has the original instruction.
+	 * However, a RIP-relative instruction can not do single-stepping
+	 * at a different place, so fix_riprel() tweaks the displacement of
+	 * that instruction. In that case, we can't recover the instruction
+	 * from kp->ainsn.insn.
+	 *
+	 * On the other hand, kp->opcode has a copy of the first byte of
+	 * the probed instruction, which is overwritten by int3. And
+	 * since the instruction at kp->addr is not modified by kprobes except
+	 * for the first byte, we can recover the original instruction
+	 * from it and kp->opcode.
+	 */
+	memcpy(buf, kp->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+	buf[0] = kp->opcode;
+	return 0;
+}
+
+/* Dummy buffers for kallsyms_lookup */
+static char __dummy_buf[KSYM_NAME_LEN];
+
+/* Check if paddr is at an instruction boundary */
+static int __kprobes can_probe(unsigned long paddr)
+{
+	int ret;
+	unsigned long addr, offset = 0;
+	struct insn insn;
+	kprobe_opcode_t buf[MAX_INSN_SIZE];
+
+	if (!kallsyms_lookup(paddr, NULL, &offset, NULL, __dummy_buf))
+		return 0;
+
+	/* Decode instructions */
+	addr = paddr - offset;
+	while (addr < paddr) {
+		kernel_insn_init(&insn, (void *)addr);
+		insn_get_opcode(&insn);
+
+		/* Check if the instruction has been modified. */
+		if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
+			ret = recover_probed_instruction(buf, addr);

I'm confused about the reason for this recovering. Is it to remove kprobes behind the current one being set in the current function? If such cleanup is needed for whatever reason, I wonder what happens to the corresponding kprobe structure; why isn't it using the arch_disarm_ helper to patch back? (Questions that may prove my solid misunderstanding of the kprobes code ;-) Frederic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip v14 03/12] kprobes: checks probe address is instruction boundary on x86
On Tue, Aug 18, 2009 at 07:17:39PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote:

+	while (addr < paddr) {
+		kernel_insn_init(&insn, (void *)addr);
+		insn_get_opcode(&insn);
+
+		/* Check if the instruction has been modified. */
+		if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
+			ret = recover_probed_instruction(buf, addr);

I'm confused about the reason for this recovering. Is it to remove kprobes behind the current one being set in the current function? No, it recovers just an instruction which is probed by a kprobe, because we need to know the first byte of this instruction for decoding it. Perhaps we'd better have a more generic interface (text_peek?) for this, because another subsystem (e.g. kgdb) may want to insert int3... Thank you, Aah, I see now: it's to keep a sane check of the instruction boundaries without int3 artifacts in the middle. But in that case, you should re-arm the breakpoint after your check, right? Or maybe you could do the check without repatching? Maybe by making a copy of insn.opcode.bytes and replacing bytes[0] with what a random kprobe has stolen? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip v14 07/12] tracing: Introduce TRACE_FIELD_ZERO() macro
On Thu, Aug 13, 2009 at 04:35:01PM -0400, Masami Hiramatsu wrote: Use TRACE_FIELD_ZERO(type, item) instead of TRACE_FIELD_ZERO_CHAR(item). This also includes a fix of TRACE_ZERO_CHAR() macro. I can't find what the fix is about (see below) Signed-off-by: Masami Hiramatsu mhira...@redhat.com Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Avi Kivity a...@redhat.com Cc: Andi Kleen a...@linux.intel.com Cc: Christoph Hellwig h...@infradead.org Cc: Frank Ch. Eigler f...@redhat.com Cc: Frederic Weisbecker fweis...@gmail.com Cc: H. Peter Anvin h...@zytor.com Cc: Ingo Molnar mi...@elte.hu Cc: Jason Baron jba...@redhat.com Cc: Jim Keniston jkeni...@us.ibm.com Cc: K.Prasad pra...@linux.vnet.ibm.com Cc: Lai Jiangshan la...@cn.fujitsu.com Cc: Li Zefan l...@cn.fujitsu.com Cc: Przemysław Pawełczyk przemys...@pawelczyk.it Cc: Roland McGrath rol...@redhat.com Cc: Sam Ravnborg s...@ravnborg.org Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com Cc: Steven Rostedt rost...@goodmis.org Cc: Tom Zanussi tzanu...@gmail.com Cc: Vegard Nossum vegard.nos...@gmail.com --- kernel/trace/trace_event_types.h |4 ++-- kernel/trace/trace_export.c | 16 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace_event_types.h b/kernel/trace/trace_event_types.h index 6db005e..e74f090 100644 --- a/kernel/trace/trace_event_types.h +++ b/kernel/trace/trace_event_types.h @@ -109,7 +109,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, ignore, TRACE_STRUCT( TRACE_FIELD(unsigned long, ip, ip) TRACE_FIELD(char *, fmt, fmt) - TRACE_FIELD_ZERO_CHAR(buf) + TRACE_FIELD_ZERO(char, buf) ), TP_RAW_FMT(%08lx (%d) fmt:%p %s) ); @@ -117,7 +117,7 @@ TRACE_EVENT_FORMAT(bprint, TRACE_BPRINT, bprint_entry, ignore, TRACE_EVENT_FORMAT(print, TRACE_PRINT, print_entry, ignore, TRACE_STRUCT( TRACE_FIELD(unsigned long, ip, ip) - TRACE_FIELD_ZERO_CHAR(buf) + TRACE_FIELD_ZERO(char, buf) ), TP_RAW_FMT(%08lx (%d) fmt:%p %s) ); diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c 
index 71c8d7f..b0ac92c 100644 --- a/kernel/trace/trace_export.c +++ b/kernel/trace/trace_export.c @@ -42,9 +42,9 @@ extern void __bad_type_size(void); if (!ret) \ return 0; -#undef TRACE_FIELD_ZERO_CHAR -#define TRACE_FIELD_ZERO_CHAR(item) \ - ret = trace_seq_printf(s, \tfield:char #item ;\t \ +#undef TRACE_FIELD_ZERO +#define TRACE_FIELD_ZERO(type, item) \ + ret = trace_seq_printf(s, \tfield: #type #item ;\t \ offset:%u;\tsize:0;\n, \ (unsigned int)offsetof(typeof(field), item)); \ if (!ret) \ @@ -92,9 +92,6 @@ ftrace_format_##call(struct ftrace_event_call *unused, \ #include trace_event_types.h -#undef TRACE_ZERO_CHAR -#define TRACE_ZERO_CHAR(arg) - #undef TRACE_FIELD #define TRACE_FIELD(type, item, assign)\ entry-item = assign; @@ -107,6 +104,9 @@ ftrace_format_##call(struct ftrace_event_call *unused, \ #define TRACE_FIELD_SIGN(type, item, assign, is_signed) \ TRACE_FIELD(type, item, assign) +#undef TRACE_FIELD_ZERO +#define TRACE_FIELD_ZERO(type, item) + Is it about the above moving? If so, could you just tell so that I can add something about it in the changelog. Thanks. Frederic. #undef TP_CMD #define TP_CMD(cmd...) cmd @@ -178,8 +178,8 @@ __attribute__((section(_ftrace_events))) event_##call = { \ if (ret)\ return ret; -#undef TRACE_FIELD_ZERO_CHAR -#define TRACE_FIELD_ZERO_CHAR(item) +#undef TRACE_FIELD_ZERO +#define TRACE_FIELD_ZERO(type, item) #undef TRACE_EVENT_FORMAT #define TRACE_EVENT_FORMAT(call, proto, args, fmt, tstruct, tpfmt) \ -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer
On Tue, Jun 30, 2009 at 09:09:23PM -0400, Masami Hiramatsu wrote: Add kprobes-based event tracer on ftrace. This tracer is similar to the events tracer which is based on Tracepoint infrastructure. Instead of Tracepoint, this tracer is based on kprobes(kprobe and kretprobe). It probes anywhere where kprobes can probe(this means, all functions body except for __kprobes functions). Similar to the events tracer, this tracer doesn't need to be activated via current_tracer, instead of that, just set probe points via /sys/kernel/debug/tracing/kprobe_events. And you can set filters on each probe events via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter. This tracer supports following probe arguments for each probe. %REG : Fetch register REG sN: Fetch Nth entry of stack (N = 0) @ADDR : Fetch memory at ADDR (ADDR should be in kernel) @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) aN: Fetch function argument. (N = 0) rv: Fetch return value. ra: Fetch return address. +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address. See Documentation/trace/kprobes.txt for details. Changes from v9: - Select CONFIG_GENERIC_TRACER when CONFIG_KPROBE_TRACER=y. 
Signed-off-by: Masami Hiramatsu mhira...@redhat.com Acked-by: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Christoph Hellwig h...@infradead.org Cc: Steven Rostedt rost...@goodmis.org Cc: Ingo Molnar mi...@elte.hu Cc: Frederic Weisbecker fweis...@gmail.com Cc: Tom Zanussi tzanu...@gmail.com --- Documentation/trace/kprobes.txt | 138 kernel/trace/Kconfig | 12 kernel/trace/Makefile|1 kernel/trace/trace.h | 22 + kernel/trace/trace_event_types.h | 20 + kernel/trace/trace_kprobe.c | 1183 ++ 6 files changed, 1376 insertions(+), 0 deletions(-) create mode 100644 Documentation/trace/kprobes.txt create mode 100644 kernel/trace/trace_kprobe.c diff --git a/Documentation/trace/kprobes.txt b/Documentation/trace/kprobes.txt new file mode 100644 index 000..3a90ebb --- /dev/null +++ b/Documentation/trace/kprobes.txt @@ -0,0 +1,138 @@ + Kprobe-based Event Tracer + = + + Documentation is written by Masami Hiramatsu + + +Overview + +This tracer is similar to the events tracer which is based on Tracepoint +infrastructure. Instead of Tracepoint, this tracer is based on kprobes(kprobe +and kretprobe). It probes anywhere where kprobes can probe(this means, all +functions body except for __kprobes functions). + +Unlike the function tracer, this tracer can probe instructions inside of +kernel functions. It allows you to check which instruction has been executed. + +Unlike the Tracepoint based events tracer, this tracer can add and remove +probe points on the fly. + +Similar to the events tracer, this tracer doesn't need to be activated via +current_tracer, instead of that, just set probe points via +/sys/kernel/debug/tracing/kprobe_events. And you can set filters on each +probe events via /sys/kernel/debug/tracing/events/kprobes/EVENT/filter. 
+ + +Synopsis of kprobe_events +- + p[:EVENT] SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS] : set a probe + r[:EVENT] SYMBOL[+0] [FETCHARGS] : set a return probe + + EVENT : Event name + SYMBOL[+offs|-offs] : Symbol+offset where the probe is inserted + MEMADDR : Address where the probe is inserted + + FETCHARGS : Arguments + %REG : Fetch register REG + sN : Fetch Nth entry of stack (N >= 0) + @ADDR : Fetch memory at ADDR (ADDR should be in kernel) + @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) + aN : Fetch function argument. (N >= 0)(*) + rv : Fetch return value.(**) + ra : Fetch return address.(**) + +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.(***) + + (*) aN may not be correct on asmlinkage'd functions or in the middle of a + function body. + (**) only for return probes. + (***) this is useful for fetching a field of data structures. + + +Per-Probe Event Filtering +- + The per-probe event filtering feature allows you to set a different filter on each +probe and to choose which arguments will be shown in the trace buffer. If an event +name is specified right after 'p:' or 'r:' in kprobe_events, the tracer adds +an event under tracing/events/kprobes/EVENT; in that directory you can see +'id', 'enabled', 'format' and 'filter'.
+enabled: + You can enable/disable the probe by writing 1 or 0 to it. + +format: + It shows the format of this probe event. It also shows aliases of the arguments + you specified in kprobe_events. + +filter: + You can write filtering rules for this event
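To make the FETCHARGS grammar above concrete, here is a minimal userspace sketch (not the tracer's actual parser) that classifies a single fetch-argument token according to the synopsis; the enum and function names are hypothetical illustrations only:

```c
#include <ctype.h>
#include <string.h>

enum fetch_kind {
    FETCH_REGISTER,   /* %REG                 */
    FETCH_STACK,      /* sN                   */
    FETCH_MEMORY,     /* @ADDR or @SYM+|-offs */
    FETCH_ARGUMENT,   /* aN                   */
    FETCH_RETVAL,     /* rv                   */
    FETCH_RETADDR,    /* ra                   */
    FETCH_INDIRECT,   /* +|-offs(FETCHARG)    */
    FETCH_INVALID,
};

/* Classify one FETCHARG token following the synopsis above. */
enum fetch_kind classify_fetch_arg(const char *s)
{
    switch (s[0]) {
    case '%':   /* register, e.g. %ax */
        return s[1] ? FETCH_REGISTER : FETCH_INVALID;
    case 's':   /* stack entry, e.g. s3 */
        return isdigit((unsigned char)s[1]) ? FETCH_STACK : FETCH_INVALID;
    case '@':   /* memory address or data symbol */
        return s[1] ? FETCH_MEMORY : FETCH_INVALID;
    case 'a':   /* function argument, e.g. a0 */
        return isdigit((unsigned char)s[1]) ? FETCH_ARGUMENT : FETCH_INVALID;
    case 'r':   /* rv = return value, ra = return address */
        if (!strcmp(s, "rv"))
            return FETCH_RETVAL;
        if (!strcmp(s, "ra"))
            return FETCH_RETADDR;
        return FETCH_INVALID;
    case '+':
    case '-':   /* indirect fetch, e.g. +8(s0) */
        return strchr(s, '(') ? FETCH_INDIRECT : FETCH_INVALID;
    default:
        return FETCH_INVALID;
    }
}
```

This only mirrors the documented surface syntax; the real tracer also parses offsets and validates register names against the arch tables.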
Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer
On Tue, Jul 07, 2009 at 03:55:28PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 206cb7d..65945eb 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -45,6 +45,8 @@ enum trace_type { TRACE_POWER, TRACE_BLK, TRACE_KSYM, + TRACE_KPROBE, + TRACE_KRETPROBE, __TRACE_LAST_TYPE, }; @@ -227,6 +229,22 @@ struct trace_ksym { charksym_name[KSYM_NAME_LEN]; charp_name[TASK_COMM_LEN]; }; +#define TRACE_KPROBE_ARGS 6 + +struct kprobe_trace_entry { + struct trace_entry ent; + unsigned long ip; + int nargs; + unsigned long args[TRACE_KPROBE_ARGS]; I see that you actually make use of arg as a dynamic sizeable array. For clarity, args[TRACE_KPROBE_ARGS] could be args[0]. It's just a neat and wouldn't affect the code nor the data but would be clearer for readers of that code. Hmm. In that case, I think we'll need a new macro for field definition, like TRACE_FIELD_ZERO(type, item). You mean that for trace_define_field() to describe fields of events? Actually the fields should be defined dynamically depending on how is built the kprobe event (which arguments are requested, how many, etc..). Frederic. 
+}; + +struct kretprobe_trace_entry { + struct trace_entry ent; + unsigned long func; + unsigned long ret_ip; + int nargs; + unsigned long args[TRACE_KPROBE_ARGS]; +}; ditto /* * trace_flag_type is an enumeration that holds different @@ -344,6 +362,10 @@ extern void __ftrace_bad_type(void); IF_ASSIGN(var, ent, struct syscall_trace_exit, \ TRACE_SYSCALL_EXIT); \ IF_ASSIGN(var, ent, struct trace_ksym, TRACE_KSYM); \ + IF_ASSIGN(var, ent, struct kprobe_trace_entry, \ + TRACE_KPROBE); \ + IF_ASSIGN(var, ent, struct kretprobe_trace_entry, \ + TRACE_KRETPROBE); \ __ftrace_bad_type(); \ } while (0) diff --git a/kernel/trace/trace_event_types.h b/kernel/trace/trace_event_types.h index 6db005e..ec2e6f3 100644 --- a/kernel/trace/trace_event_types.h +++ b/kernel/trace/trace_event_types.h @@ -175,4 +175,24 @@ TRACE_EVENT_FORMAT(kmem_free, TRACE_KMEM_FREE, kmemtrace_free_entry, ignore, TP_RAW_FMT("type:%u call_site:%lx ptr:%p") ); +TRACE_EVENT_FORMAT(kprobe, TRACE_KPROBE, kprobe_trace_entry, ignore, + TRACE_STRUCT( + TRACE_FIELD(unsigned long, ip, ip) + TRACE_FIELD(int, nargs, nargs) + TRACE_FIELD_SPECIAL(unsigned long args[TRACE_KPROBE_ARGS], + args, TRACE_KPROBE_ARGS, args) + ), + TP_RAW_FMT("%08lx: args:0x%lx ...") +); + +TRACE_EVENT_FORMAT(kretprobe, TRACE_KRETPROBE, kretprobe_trace_entry, ignore, + TRACE_STRUCT( + TRACE_FIELD(unsigned long, func, func) + TRACE_FIELD(unsigned long, ret_ip, ret_ip) + TRACE_FIELD(int, nargs, nargs) + TRACE_FIELD_SPECIAL(unsigned long args[TRACE_KPROBE_ARGS], + args, TRACE_KPROBE_ARGS, args) + ), + TP_RAW_FMT("%08lx - %08lx: args:0x%lx ...") 
+); #undef TRACE_SYSTEM diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c new file mode 100644 index 000..0951512 --- /dev/null +++ b/kernel/trace/trace_kprobe.c @@ -0,0 +1,1183 @@ +/* + * kprobe based kernel tracer + * + * Created by Masami Hiramatsu mhira...@redhat.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/module.h> +#include <linux/uaccess.h> +#include <linux/kprobes.h> +#include <linux/seq_file.h> +#include <linux/slab.h> +#include <linux/smp.h> +#include <linux/debugfs.h> +#include <linux/types.h> +#include <linux/string.h> +#include <linux/ctype.h> +#include <linux/ptrace.h> + +#include "trace.h" +#include "trace_output.h" + +#define MAX_ARGSTR_LEN 63 + +/* currently, trace_kprobe only
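Regarding the args[TRACE_KPROBE_ARGS] vs. args[0] point raised in the review above: a small standalone C sketch shows why a flexible array member makes the dynamic sizing explicit. This illustrates the general C idiom only, not the actual trace entry layout:

```c
#include <stddef.h>

#define TRACE_KPROBE_ARGS 6

/* Fixed-size form: always reserves TRACE_KPROBE_ARGS slots,
 * even when an event records fewer arguments. */
struct entry_fixed {
    unsigned long ip;
    int nargs;
    unsigned long args[TRACE_KPROBE_ARGS];
};

/* Flexible-array form (the args[0]-style suggestion): the trailing
 * array contributes nothing to sizeof(); storage is sized per entry. */
struct entry_flex {
    unsigned long ip;
    int nargs;
    unsigned long args[];
};

/* Bytes actually needed for an entry carrying `nargs` arguments. */
size_t entry_flex_size(int nargs)
{
    return sizeof(struct entry_flex) + (size_t)nargs * sizeof(unsigned long);
}
```

With the flexible form, a ring-buffer writer reserves exactly `entry_flex_size(nargs)` bytes per event instead of a worst-case fixed record.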
Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer
On Tue, Jul 07, 2009 at 04:42:32PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Tue, Jul 07, 2009 at 03:55:28PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 206cb7d..65945eb 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -45,6 +45,8 @@ enum trace_type { TRACE_POWER, TRACE_BLK, TRACE_KSYM, +TRACE_KPROBE, +TRACE_KRETPROBE, __TRACE_LAST_TYPE, }; @@ -227,6 +229,22 @@ struct trace_ksym { charksym_name[KSYM_NAME_LEN]; charp_name[TASK_COMM_LEN]; }; +#define TRACE_KPROBE_ARGS 6 + +struct kprobe_trace_entry { +struct trace_entry ent; +unsigned long ip; +int nargs; +unsigned long args[TRACE_KPROBE_ARGS]; I see that you actually make use of arg as a dynamic sizeable array. For clarity, args[TRACE_KPROBE_ARGS] could be args[0]. It's just a neat and wouldn't affect the code nor the data but would be clearer for readers of that code. Hmm. In that case, I think we'll need a new macro for field definition, like TRACE_FIELD_ZERO(type, item). You mean that for trace_define_field() to describe fields of events? Actually the fields should be defined dynamically depending on how is built the kprobe event (which arguments are requested, how many, etc..). Yeah, if you specified a probe point with its event name, the tracer will make a corresponding event dynamically. There are also anonymous probes which don't have corresponding events. For those anonymous probes, I need to define two generic event types(kprobe and kretprobe). Thank you, Ok. Btw, why do you need to define those two anonymous events? Actually your event types are always dynamically created. Those you defined through TRACE_FORMAT_EVENT are only ghost events, they only stand there as a abstract pattern, right? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -tip -v10 7/7] tracing: add kprobe-based event tracer
On Tue, Jul 07, 2009 at 05:31:25PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Tue, Jul 07, 2009 at 04:42:32PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: On Tue, Jul 07, 2009 at 03:55:28PM -0400, Masami Hiramatsu wrote: Frederic Weisbecker wrote: diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 206cb7d..65945eb 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -45,6 +45,8 @@ enum trace_type { TRACE_POWER, TRACE_BLK, TRACE_KSYM, + TRACE_KPROBE, + TRACE_KRETPROBE, __TRACE_LAST_TYPE, }; @@ -227,6 +229,22 @@ struct trace_ksym { charksym_name[KSYM_NAME_LEN]; charp_name[TASK_COMM_LEN]; }; +#define TRACE_KPROBE_ARGS 6 + +struct kprobe_trace_entry { + struct trace_entry ent; + unsigned long ip; + int nargs; + unsigned long args[TRACE_KPROBE_ARGS]; I see that you actually make use of arg as a dynamic sizeable array. For clarity, args[TRACE_KPROBE_ARGS] could be args[0]. It's just a neat and wouldn't affect the code nor the data but would be clearer for readers of that code. Hmm. In that case, I think we'll need a new macro for field definition, like TRACE_FIELD_ZERO(type, item). You mean that for trace_define_field() to describe fields of events? Actually the fields should be defined dynamically depending on how is built the kprobe event (which arguments are requested, how many, etc..). Yeah, if you specified a probe point with its event name, the tracer will make a corresponding event dynamically. There are also anonymous probes which don't have corresponding events. For those anonymous probes, I need to define two generic event types(kprobe and kretprobe). Thank you, Ok. Btw, why do you need to define those two anonymous events? Actually your event types are always dynamically created. Those you defined through TRACE_FORMAT_EVENT are only ghost events, they only stand there as a abstract pattern, right? Not always created. Below command will create an event event1; p probe_point:event1 a1 a2 a3 ... 
> /debug/tracing/kprobe_events But the next command doesn't create one: p probe_point a1 a2 a3 ... > /debug/tracing/kprobe_events Aah, ok. This just inserts a kprobe at probe_point. The advantage of this simple command is that you are never annoyed by having to make up different names for new events :-) Indeed. But speaking of that, maybe you could dynamically create a name following this simple model: func+offset Unless we can set several kprobes on the exact same address?
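The func+offset naming idea can be sketched as below, assuming a hypothetical make_default_event_name() helper (the tracer's real naming scheme for anonymous probes may differ):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: derive a default event name for an anonymous
 * probe from its symbol and offset, e.g. "do_fork+8" or "do_fork-4".
 * %+ld prints the sign explicitly, so positive offsets get a '+'. */
void make_default_event_name(char *buf, size_t len,
                             const char *symbol, long offset)
{
    if (offset)
        snprintf(buf, len, "%s%+ld", symbol, offset);
    else
        snprintf(buf, len, "%s", symbol);
}
```

Such names are unique as long as no two probes share the exact same address, which is the open question raised above.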
Re: [PATCH -tip -v10 5/7] x86: add pt_regs register and stack access APIs
On Tue, Jun 30, 2009 at 09:09:11PM -0400, Masami Hiramatsu wrote: Add the following APIs for accessing registers and stack entries from pt_regs. - regs_query_register_offset(const char *name) Query the offset of the named register. - regs_query_register_name(unsigned offset) Query the name of a register by its offset. - regs_get_register(struct pt_regs *regs, unsigned offset) Get the value of a register by its offset. - regs_within_kernel_stack(struct pt_regs *regs, unsigned long addr) Check whether the address is in the kernel stack. - regs_get_kernel_stack_nth(struct pt_regs *reg, unsigned nth) Get Nth entry of the kernel stack. (N >= 0) - regs_get_argument_nth(struct pt_regs *reg, unsigned nth) Get Nth argument at function call. (N >= 0) Changes from v9: - Fix a typo in a comment. Signed-off-by: Masami Hiramatsu mhira...@redhat.com Cc: Christoph Hellwig h...@infradead.org Cc: Steven Rostedt rost...@goodmis.org Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Ingo Molnar mi...@elte.hu Cc: Frederic Weisbecker fweis...@gmail.com Cc: Roland McGrath rol...@redhat.com Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com Cc: linux-a...@vger.kernel.org Looks good! Reviewed-by: Frederic Weisbecker fweis...@gmail.com Frederic. 
--- arch/x86/include/asm/ptrace.h | 122 + arch/x86/kernel/ptrace.c | 73 + 2 files changed, 195 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 0f0d908..d5e3b3b 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -7,6 +7,7 @@ #ifdef __KERNEL__ #include <asm/segment.h> +#include <asm/page_types.h> #endif #ifndef __ASSEMBLY__ @@ -216,6 +217,127 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs) return regs->sp; } +/* Query offset/name of register from its name/offset */ +extern int regs_query_register_offset(const char *name); +extern const char *regs_query_register_name(unsigned offset); +#define MAX_REG_OFFSET (offsetof(struct pt_regs, ss)) + +/** + * regs_get_register() - get register value from its offset + * @regs: pt_regs from which register value is gotten. + * @offset: offset number of the register. + * + * regs_get_register returns the value of a register whose offset from @regs + * is @offset. The @offset is the offset of the register in struct pt_regs. + * If @offset is bigger than MAX_REG_OFFSET, this returns 0. + */ +static inline unsigned long regs_get_register(struct pt_regs *regs, + unsigned offset) +{ + if (unlikely(offset > MAX_REG_OFFSET)) + return 0; + return *(unsigned long *)((unsigned long)regs + offset); +} + +/** + * regs_within_kernel_stack() - check the address in the stack + * @regs: pt_regs which contains kernel stack pointer. + * @addr: address which is checked. + * + * regs_within_kernel_stack() checks @addr is within the kernel stack page(s). + * If @addr is within the kernel stack, it returns true. If not, returns false. + */ +static inline int regs_within_kernel_stack(struct pt_regs *regs, + unsigned long addr) +{ + return ((addr & ~(THREAD_SIZE - 1)) == + (kernel_stack_pointer(regs) & ~(THREAD_SIZE - 1))); } + +/** + * regs_get_kernel_stack_nth() - get Nth entry of the stack + * @regs: pt_regs which contains kernel stack pointer. 
+ * @n: stack entry number. + * + * regs_get_kernel_stack_nth() returns the @n th entry of the kernel stack which + * is specified by @regs. If the @n th entry is NOT in the kernel stack, + * this returns 0. + */ +static inline unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs, + unsigned n) +{ + unsigned long *addr = (unsigned long *)kernel_stack_pointer(regs); + addr += n; + if (regs_within_kernel_stack(regs, (unsigned long)addr)) + return *addr; + else + return 0; +} + +/** + * regs_get_argument_nth() - get Nth argument at function call + * @regs: pt_regs which contains registers at function entry. + * @n: argument number. + * + * regs_get_argument_nth() returns the @n th argument of a function call. + * Since usually the kernel stack will be changed right after function entry, + * you must use this at function entry. If the @n th entry is NOT in the + * kernel stack or pt_regs, this returns 0. + */ +#ifdef CONFIG_X86_32 +#define NR_REGPARMS 3 +static inline unsigned long regs_get_argument_nth(struct pt_regs *regs, + unsigned n) +{ + if (n < NR_REGPARMS) { + switch (n) { + case 0: + return regs->ax; + case 1
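The THREAD_SIZE masking used by regs_within_kernel_stack() above can be modelled in plain userspace C: two addresses lie on the same THREAD_SIZE-aligned stack iff masking off the low bits yields the same base. THREAD_SIZE of 8 KB is an assumption for illustration only:

```c
#include <stdint.h>

#define THREAD_SIZE 8192UL  /* assumption: 8 KB kernel stacks, power of two */

/* Userspace model of regs_within_kernel_stack(): compare the
 * THREAD_SIZE-aligned bases of the candidate address and the
 * (stand-in) kernel stack pointer. */
int within_same_stack(uintptr_t addr, uintptr_t sp)
{
    return (addr & ~(THREAD_SIZE - 1)) == (sp & ~(THREAD_SIZE - 1));
}
```

The mask trick only works because THREAD_SIZE is a power of two, so `~(THREAD_SIZE - 1)` clears exactly the within-stack offset bits.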
Re: [PATCH -tip -v10 6/7] tracing: ftrace dynamic ftrace_event_call support
On Tue, Jun 30, 2009 at 09:09:17PM -0400, Masami Hiramatsu wrote: Add dynamic ftrace_event_call support to ftrace. Trace engines can adds new ftrace_event_call to ftrace on the fly. Each operator functions of the call takes a ftrace_event_call data structure as an argument, because these functions may be shared among several ftrace_event_calls. Signed-off-by: Masami Hiramatsu mhira...@redhat.com Cc: Steven Rostedt rost...@goodmis.org Cc: Ingo Molnar mi...@elte.hu Cc: Tom Zanussi tzanu...@gmail.com Cc: Frederic Weisbecker fweis...@gmail.com Looks good too. Acked-by: Frederic Weisbecker fweis...@gmail.com --- include/linux/ftrace_event.h | 13 +--- include/trace/ftrace.h | 22 +++-- kernel/trace/trace_events.c | 70 -- kernel/trace/trace_export.c | 27 4 files changed, 85 insertions(+), 47 deletions(-) diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h index 5c093ff..f7733b6 100644 --- a/include/linux/ftrace_event.h +++ b/include/linux/ftrace_event.h @@ -108,12 +108,13 @@ struct ftrace_event_call { struct dentry *dir; struct trace_event *event; int enabled; - int (*regfunc)(void); - void(*unregfunc)(void); + int (*regfunc)(struct ftrace_event_call *); + void(*unregfunc)(struct ftrace_event_call *); int id; - int (*raw_init)(void); - int (*show_format)(struct trace_seq *s); - int (*define_fields)(void); + int (*raw_init)(struct ftrace_event_call *); + int (*show_format)(struct ftrace_event_call *, +struct trace_seq *); + int (*define_fields)(struct ftrace_event_call *); struct list_headfields; int filter_active; void*filter; @@ -138,6 +139,8 @@ extern int filter_current_check_discard(struct ftrace_event_call *call, extern int trace_define_field(struct ftrace_event_call *call, char *type, char *name, int offset, int size, int is_signed); +extern int trace_add_event_call(struct ftrace_event_call *call); +extern void trace_remove_event_call(struct ftrace_event_call *call); #define is_signed_type(type) (((type)(-1)) 0) diff --git a/include/trace/ftrace.h 
b/include/trace/ftrace.h index 1867553..d696580 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -147,7 +147,8 @@ #undef TRACE_EVENT #define TRACE_EVENT(call, proto, args, tstruct, func, print) \ static int \ -ftrace_format_##call(struct trace_seq *s)\ +ftrace_format_##call(struct ftrace_event_call *event_call, \ + struct trace_seq *s) \ {\ struct ftrace_raw_##call field __attribute__((unused)); \ int ret = 0;\ @@ -289,10 +290,9 @@ ftrace_raw_output_##call(struct trace_iterator *iter, int flags) \ #undef TRACE_EVENT #define TRACE_EVENT(call, proto, args, tstruct, func, print) \ int \ -ftrace_define_fields_##call(void)\ +ftrace_define_fields_##call(struct ftrace_event_call *event_call)\ {\ struct ftrace_raw_##call field; \ - struct ftrace_event_call *event_call = event_##call; \ int ret;\ \ __common_field(int, type, 1); \ @@ -355,7 +355,7 @@ static inline int ftrace_get_offsets_##call( \ * event_trace_printk(_RET_IP_, call: fmt); * } * - * static int ftrace_reg_event_call(void) + * static int ftrace_reg_event_call(struct ftrace_event_call *unused) * { * int ret; * @@ -366,7 +366,7 @@ static inline int ftrace_get_offsets_##call( \ * return ret; * } * - * static void ftrace_unreg_event_call(void) + * static void ftrace_unreg_event_call(struct ftrace_event_call *unused) * { * unregister_trace_call(ftrace_event_call); * } @@ -399,7 +399,7 @@ static inline int ftrace_get_offsets_##call( \ * trace_current_buffer_unlock_commit(event, irq_flags, pc
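The shape of the change reviewed above — each operation now receives the ftrace_event_call it acts on, so one regfunc/unregfunc implementation can be shared by many dynamically created events — can be sketched outside the kernel like this (struct and function names simplified, not the real kernel types):

```c
#include <stddef.h>

/* Simplified model of an event object whose operations take the
 * object itself, so one implementation serves many instances. */
struct event_call {
    const char *name;
    int enabled;
    int (*regfunc)(struct event_call *call);
    void (*unregfunc)(struct event_call *call);
};

/* Shared registration op: because it gets the call as an argument,
 * it can be reused by every dynamically created event. */
int generic_reg(struct event_call *call)
{
    call->enabled = 1;   /* a real tracer would hook the probe here */
    return 0;
}

void generic_unreg(struct event_call *call)
{
    call->enabled = 0;   /* ...and unhook it here */
}
```

With `void (*regfunc)(void)` as before, each event would have needed its own generated function; passing the call pointer removes that duplication.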
Re: [PATCH -tip v5 4/7] tracing: add kprobe-based event tracer
On Sat, May 09, 2009 at 01:33:53PM -0400, Masami Hiramatsu wrote: Frédéric Weisbecker wrote: Hi, 2009/5/9 Masami Hiramatsu mhira...@redhat.com: [...] + +/* event recording functions */ +static void kprobe_trace_record(unsigned long ip, struct trace_probe *tp, + struct pt_regs *regs) +{ + __trace_bprintk(ip, "%s%s%+ld\n", + probe_is_return(tp) ? "-" : "@", + probe_symbol(tp), probe_offset(tp)); +} What happens here if you have: kprobe_trace_record() { probe_symbol() { probes_open() { cleanup_all_probes() { free_trace_probe(); return tp->symbol ? ; // crack! I wonder if you shouldn't use a per_cpu list of probes, spinlocked/irqsaved accessed, and also some kind of protection against NMIs. Sure, cleanup_all_probes() invokes unregister_kprobe() via unregister_trace_probe(), which waits for running probe handlers by using synchronize_sched() (because kprobes disables preemption around its handlers), before free_trace_probe(). So you don't need any locks there :-) Thank you, Aah, ok :) So this patch looks sane. Thanks.
Re: [PATCH -tip 2/6 V4] x86: add arch-dep register and stack access API to ptrace
On Thu, Apr 02, 2009 at 01:24:47PM -0400, Masami Hiramatsu wrote: Add the following APIs for accessing registers and stack entries from pt_regs. - query_register_offset(const char *name) Query the offset of the named register. - query_register_name(unsigned offset) Query the name of a register by its offset. - get_register(struct pt_regs *regs, unsigned offset) Get the value of a register by its offset. - valid_stack_address(struct pt_regs *regs, unsigned long addr) Check whether the address is in the stack. - get_stack_nth(struct pt_regs *reg, unsigned nth) Get Nth entry of the stack. (N >= 0) - get_argument_nth(struct pt_regs *reg, unsigned nth) Get Nth argument at function call. (N >= 0) Signed-off-by: Masami Hiramatsu mhira...@redhat.com Cc: Steven Rostedt rost...@goodmis.org Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Ingo Molnar mi...@elte.hu Cc: Frederic Weisbecker fweis...@gmail.com --- arch/x86/include/asm/ptrace.h | 66 + arch/x86/kernel/ptrace.c | 59 + 2 files changed, 125 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index aed0894..44773b8 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -7,6 +7,7 @@ #ifdef __KERNEL__ #include <asm/segment.h> +#include <asm/page_types.h> #endif #ifndef __ASSEMBLY__ @@ -215,6 +216,71 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs) return regs->sp; } +/* Query offset/name of register from its name/offset */ +extern int query_register_offset(const char *name); +extern const char *query_register_name(unsigned offset); +#define MAX_REG_OFFSET (offsetof(struct pt_regs, sp)) + +/* Get register value from its offset */ +static inline unsigned long get_register(struct pt_regs *regs, unsigned offset) +{ + if (unlikely(offset > MAX_REG_OFFSET)) + return 0; + return *(unsigned long *)((unsigned long)regs + offset); +} + +/* Check the address is in the stack */ +static inline int valid_stack_address(struct pt_regs *regs, unsigned long addr) 
+{ + return ((addr & ~(THREAD_SIZE - 1)) == + (kernel_trap_sp(regs) & ~(THREAD_SIZE - 1))); +} + +/* Get Nth entry of the stack */ +static inline unsigned long get_stack_nth(struct pt_regs *regs, unsigned n) +{ + unsigned long *addr = (unsigned long *)kernel_trap_sp(regs); + addr += n; + if (valid_stack_address(regs, (unsigned long)addr)) + return *addr; + else + return 0; +} + +/* Get Nth argument at function call */ +static inline unsigned long get_argument_nth(struct pt_regs *regs, unsigned n) +{ +#ifdef CONFIG_X86_32 +#define NR_REGPARMS 3 + if (n < NR_REGPARMS) { + switch (n) { + case 0: return regs->ax; + case 1: return regs->dx; + case 2: return regs->cx; + } + return 0; +#else /* CONFIG_X86_64 */ +#define NR_REGPARMS 6 + if (n < NR_REGPARMS) { + switch (n) { + case 0: return regs->di; + case 1: return regs->si; + case 2: return regs->dx; + case 3: return regs->cx; + case 4: return regs->r8; + case 5: return regs->r9; + } + return 0; +#endif + } else { + /* + * The typical case: arg n is on the stack. + * (Note: stack[0] = return address, so skip it) + */ + return get_stack_nth(regs, 1 + n - NR_REGPARMS); + } +} + /* * These are defined as per linux/ptrace.h, which see. 
*/ diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 5c6e463..8d65dcb 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -46,6 +46,65 @@ enum x86_regset { REGSET_IOPERM32, }; +struct pt_regs_offset { + const char *name; + int offset; +}; + +#define REG_OFFSET(r) offsetof(struct pt_regs, r) +#define REG_OFFSET_NAME(r) {.name = #r, .offset = REG_OFFSET(r)} +#define REG_OFFSET_END {.name = NULL, .offset = 0} + +static struct pt_regs_offset regoffset_table[] = { +#ifdef CONFIG_X86_64 + REG_OFFSET_NAME(r15), + REG_OFFSET_NAME(r14), + REG_OFFSET_NAME(r13), + REG_OFFSET_NAME(r12), + REG_OFFSET_NAME(r11), + REG_OFFSET_NAME(r10), + REG_OFFSET_NAME(r9), + REG_OFFSET_NAME(r8), +#endif + REG_OFFSET_NAME(bx), + REG_OFFSET_NAME(cx), + REG_OFFSET_NAME(dx), + REG_OFFSET_NAME(si), + REG_OFFSET_NAME(di), + REG_OFFSET_NAME(bp), + REG_OFFSET_NAME(ax), +#ifdef CONFIG_X86_32 + REG_OFFSET_NAME(ds), + REG_OFFSET_NAME(es), + REG_OFFSET_NAME(fs), + REG_OFFSET_NAME(gs), +#endif + REG_OFFSET_NAME(orig_ax), + REG_OFFSET_NAME(ip
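The regoffset_table name-to-offset lookup in the patch above can be demonstrated standalone with offsetof(); the struct here is a stand-in for illustration, not the real struct pt_regs:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in register frame; real code uses struct pt_regs. */
struct fake_regs {
    unsigned long bx, cx, dx, si, di, bp, ax;
};

struct reg_offset {
    const char *name;
    int offset;
};

/* Same stringify-and-offsetof trick as REG_OFFSET_NAME in the patch. */
#define REG_OFFSET_NAME(r) { .name = #r, .offset = offsetof(struct fake_regs, r) }

static const struct reg_offset regoffset_table[] = {
    REG_OFFSET_NAME(bx), REG_OFFSET_NAME(cx), REG_OFFSET_NAME(dx),
    REG_OFFSET_NAME(si), REG_OFFSET_NAME(di), REG_OFFSET_NAME(bp),
    REG_OFFSET_NAME(ax),
    { NULL, 0 },  /* sentinel, mirrors REG_OFFSET_END */
};

/* Linear scan from name to byte offset; -1 if the register is unknown. */
int query_register_offset(const char *name)
{
    const struct reg_offset *ro;

    for (ro = regoffset_table; ro->name; ro++)
        if (!strcmp(ro->name, name))
            return ro->offset;
    return -1;
}
```

The offset returned here is exactly what get_register() adds to the frame base pointer, which is why the two APIs compose.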