Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 12, 2011 at 10:31:00AM +0100, Peter Zijlstra wrote: On Tue, 2011-07-12 at 12:27 +0300, Avi Kivity wrote: On 07/12/2011 12:18 PM, Peter Zijlstra wrote: The guarantee is that the task was sleeping just before the function is called. Of course it's woken up to run the function. The idea is that you run the function in a known safe point to avoid extra synchronization. I'd much rather we didn't wake the task and let it sleep, that's usually a very safe place for tasks to be. All you'd need is a guarantee it won't be woken up while you're doing your thing. But it means that 'current' is not set to the right value. If the function depends on it, then it will misbehave. And in fact preempt_notifier_register(), which is the function we want to call here, does depend on current. Of course we need to find more users for this, but I have a feeling this will be generally useful. The alternative is to keep adding bits to thread_info::flags. Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. The problem with using a TIF bit to tell a task that it needs to perform some preempt_notifier registrations is that you end up with something that looks a lot like preempt notifiers! You also don't escape the concurrent read/write to thelist of pending registrations. One thing I tried was simply using an RCU protected hlist for the preempt notifiers so that we don't have to worry about atomicity when reading the notifiers in finish_task_switch. It's a bit odd, since we know we only ever have a single reader, but I've included it below anyway. If anybody has any better ideas, I'm all ears. Will diff --git a/include/linux/preempt.h b/include/linux/preempt.h index 2e681d9..2e21ffe 100644 --- a/include/linux/preempt.h +++ b/include/linux/preempt.h @@ -132,6 +132,11 @@ struct preempt_notifier { void preempt_notifier_register(struct preempt_notifier *notifier); void preempt_notifier_unregister(struct preempt_notifier *notifier); +void preempt_notifier_register_task(struct preempt_notifier *notifier, + struct task_struct *tsk); +void preempt_notifier_unregister_task(struct preempt_notifier *notifier, + struct task_struct *tsk); + static inline void preempt_notifier_init(struct preempt_notifier *notifier, struct preempt_ops *ops) { diff --git a/include/linux/sched.h b/include/linux/sched.h index 496770a..5530d91 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1233,6 +1233,7 @@ struct task_struct { #ifdef CONFIG_PREEMPT_NOTIFIERS /* list of struct preempt_notifier: */ struct hlist_head preempt_notifiers; + struct mutex preempt_notifiers_mutex; #endif /* diff --git a/kernel/sched.c b/kernel/sched.c index 9769c75..d3c46ca 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2784,6 +2784,7 @@ static void __sched_fork(struct task_struct *p) #ifdef CONFIG_PREEMPT_NOTIFIERS INIT_HLIST_HEAD(p-preempt_notifiers); + mutex_init(p-preempt_notifiers_mutex); #endif } @@ -2901,13 +2902,31 @@ void wake_up_new_task(struct task_struct *p) #ifdef CONFIG_PREEMPT_NOTIFIERS +void preempt_notifier_register_task(struct preempt_notifier *notifier, + struct task_struct *tsk) +{ + mutex_lock(tsk-preempt_notifiers_mutex); + hlist_add_head_rcu(notifier-link, tsk-preempt_notifiers); + mutex_unlock(tsk-preempt_notifiers_mutex); +} +EXPORT_SYMBOL_GPL(preempt_notifier_register_task); + +void preempt_notifier_unregister_task(struct preempt_notifier *notifier, + struct task_struct *tsk) +{ + mutex_lock(tsk-preempt_notifiers_mutex); + hlist_del_rcu(notifier-link); + mutex_unlock(tsk-preempt_notifiers_mutex); +} +EXPORT_SYMBOL_GPL(preempt_notifier_unregister_task); + /** * preempt_notifier_register - tell me when current is being preempted rescheduled * @notifier: notifier struct to register */ void preempt_notifier_register(struct preempt_notifier *notifier) { - hlist_add_head(notifier-link, current-preempt_notifiers); + preempt_notifier_register_task(notifier, current); } EXPORT_SYMBOL_GPL(preempt_notifier_register); @@ -2919,7 +2938,7 @@ EXPORT_SYMBOL_GPL(preempt_notifier_register); */ void preempt_notifier_unregister(struct preempt_notifier *notifier) { - hlist_del(notifier-link); + preempt_notifier_unregister_task(notifier, current); } EXPORT_SYMBOL_GPL(preempt_notifier_unregister); @@ -2928,8 +2947,12 @@ static void fire_sched_in_preempt_notifiers(struct task_struct *curr) struct preempt_notifier *notifier; struct hlist_node *node; - hlist_for_each_entry(notifier, node, curr-preempt_notifiers, link) +
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/21/2011 06:32 PM, Will Deacon wrote: Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. The problem with using a TIF bit to tell a task that it needs to perform some preempt_notifier registrations is that you end up with something that looks a lot like preempt notifiers! You also don't escape the concurrent read/write to thelist of pending registrations. One thing I tried was simply using an RCU protected hlist for the preempt notifiers so that we don't have to worry about atomicity when reading the notifiers in finish_task_switch. It's a bit odd, since we know we only ever have a single reader, but I've included it below anyway. If anybody has any better ideas, I'm all ears. +void preempt_notifier_register_task(struct preempt_notifier *notifier, + struct task_struct *tsk) +{ + mutex_lock(tsk-preempt_notifiers_mutex); + hlist_add_head_rcu(notifier-link,tsk-preempt_notifiers); + mutex_unlock(tsk-preempt_notifiers_mutex); +} +EXPORT_SYMBOL_GPL(preempt_notifier_register_task); + +void preempt_notifier_unregister_task(struct preempt_notifier *notifier, + struct task_struct *tsk) +{ + mutex_lock(tsk-preempt_notifiers_mutex); + hlist_del_rcu(notifier-link); + mutex_unlock(tsk-preempt_notifiers_mutex); +} +EXPORT_SYMBOL_GPL(preempt_notifier_unregister_task); + /** * preempt_notifier_register - tell me when current is being preempted rescheduled * @notifier: notifier struct to register */ void preempt_notifier_register(struct preempt_notifier *notifier) { - hlist_add_head(notifier-link,current-preempt_notifiers); + preempt_notifier_register_task(notifier, current); } EXPORT_SYMBOL_GPL(preempt_notifier_register); This is (and must be) called from a preempt disabled context, no mutexes around here. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Thu, Jul 21, 2011 at 04:36:43PM +0100, Avi Kivity wrote: On 07/21/2011 06:32 PM, Will Deacon wrote: Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. The problem with using a TIF bit to tell a task that it needs to perform some preempt_notifier registrations is that you end up with something that looks a lot like preempt notifiers! You also don't escape the concurrent read/write to thelist of pending registrations. One thing I tried was simply using an RCU protected hlist for the preempt notifiers so that we don't have to worry about atomicity when reading the notifiers in finish_task_switch. It's a bit odd, since we know we only ever have a single reader, but I've included it below anyway. If anybody has any better ideas, I'm all ears. +void preempt_notifier_register_task(struct preempt_notifier *notifier, + struct task_struct *tsk) +{ + mutex_lock(tsk-preempt_notifiers_mutex); + hlist_add_head_rcu(notifier-link,tsk-preempt_notifiers); + mutex_unlock(tsk-preempt_notifiers_mutex); +} +EXPORT_SYMBOL_GPL(preempt_notifier_register_task); + +void preempt_notifier_unregister_task(struct preempt_notifier *notifier, + struct task_struct *tsk) +{ + mutex_lock(tsk-preempt_notifiers_mutex); + hlist_del_rcu(notifier-link); + mutex_unlock(tsk-preempt_notifiers_mutex); +} +EXPORT_SYMBOL_GPL(preempt_notifier_unregister_task); + /** * preempt_notifier_register - tell me when current is being preempted rescheduled * @notifier: notifier struct to register */ void preempt_notifier_register(struct preempt_notifier *notifier) { - hlist_add_head(notifier-link,current-preempt_notifiers); + preempt_notifier_register_task(notifier, current); } EXPORT_SYMBOL_GPL(preempt_notifier_register); This is (and must be) called from a preempt disabled context, no mutexes around here. Bah, yes, that is essential if you're dealing with current. Maybe use a spinlock instead? Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/21/2011 06:46 PM, Will Deacon wrote: This is (and must be) called from a preempt disabled context, no mutexes around here. Bah, yes, that is essential if you're dealing with current. Maybe use a spinlock instead? Could work. Not thrilled about adding it to the kvm hot path, but I can't say it will make a measurable impact. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Thu, Jul 21, 2011 at 04:59:00PM +0100, Avi Kivity wrote: On 07/21/2011 06:46 PM, Will Deacon wrote: This is (and must be) called from a preempt disabled context, no mutexes around here. Bah, yes, that is essential if you're dealing with current. Maybe use a spinlock instead? Could work. Not thrilled about adding it to the kvm hot path, but I can't say it will make a measurable impact. Understood, but at least it doesn't contribute to finish_task_switch. I also wouldn't expect to have multiple preempt registrations in parallel for the same task so that lock should rarely (if ever) be contended. Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 12:07 AM, Will Deacon wrote: On Mon, Jul 04, 2011 at 03:36:57PM +0100, Frederic Weisbecker wrote: On Mon, Jul 04, 2011 at 05:10:20PM +0300, Avi Kivity wrote: On 07/04/2011 04:58 PM, Frederic Weisbecker wrote: Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. 'struct preempt_notifier' may be the hook you're looking for. Yeah looks like a perfect fit as it's per task. I had a quick look at this and I think the preempt_notifier stuff needs slightly extending so that we can register a notifier for a task other than current [e.g. the child of current on which we are installing breakpoints]. If the task in question is running, it looks like this will introduce a race condition between notifier registration and rescheduling. For the purposes of ptrace this shouldn't be a problem as the child will be stopped, but others might also want to make use of the new functionality. Any ideas on how this could be achieved, or am I better off just restricting this to children that are being traced? Maybe we need a generic run this function in this task's context mechanism instead. Like an IPI, but targeting tasks instead of cpus. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, 2011-07-12 at 10:20 +0300, Avi Kivity wrote: Maybe we need a generic run this function in this task's context mechanism instead. Like an IPI, but targeting tasks instead of cpus. kernel/event/core.c:task_function_call() like? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 11:38 AM, Peter Zijlstra wrote: On Tue, 2011-07-12 at 10:20 +0300, Avi Kivity wrote: Maybe we need a generic run this function in this task's context mechanism instead. Like an IPI, but targeting tasks instead of cpus. kernel/event/core.c:task_function_call() like? Similar, but with stronger guarantees: when the function is called, current == p, and the task was either sleeping or in userspace. This can be used to reduce synchronization requirements between tasks. For example, you can set a task-local bit flag with this, without atomics. If this is rare enough, it's a net win compared to adding atomics. aio completions might also make use of this. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, 2011-07-12 at 12:08 +0300, Avi Kivity wrote: Similar, but with stronger guarantees: when the function is called, current == p, and the task was either sleeping or in userspace. If the task is sleeping, current can never be p. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 12:14 PM, Peter Zijlstra wrote: On Tue, 2011-07-12 at 12:08 +0300, Avi Kivity wrote: Similar, but with stronger guarantees: when the function is called, current == p, and the task was either sleeping or in userspace. If the task is sleeping, current can never be p. The guarantee is that the task was sleeping just before the function is called. Of course it's woken up to run the function. The idea is that you run the function in a known safe point to avoid extra synchronization. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 12:18 PM, Peter Zijlstra wrote: The guarantee is that the task was sleeping just before the function is called. Of course it's woken up to run the function. The idea is that you run the function in a known safe point to avoid extra synchronization. I'd much rather we didn't wake the task and let it sleep, that's usually a very safe place for tasks to be. All you'd need is a guarantee it won't be woken up while you're doing your thing. But it means that 'current' is not set to the right value. If the function depends on it, then it will misbehave. And in fact preempt_notifier_register(), which is the function we want to call here, does depend on current. Of course we need to find more users for this, but I have a feeling this will be generally useful. The alternative is to keep adding bits to thread_info::flags. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, 2011-07-12 at 12:27 +0300, Avi Kivity wrote: On 07/12/2011 12:18 PM, Peter Zijlstra wrote: The guarantee is that the task was sleeping just before the function is called. Of course it's woken up to run the function. The idea is that you run the function in a known safe point to avoid extra synchronization. I'd much rather we didn't wake the task and let it sleep, that's usually a very safe place for tasks to be. All you'd need is a guarantee it won't be woken up while you're doing your thing. But it means that 'current' is not set to the right value. If the function depends on it, then it will misbehave. And in fact preempt_notifier_register(), which is the function we want to call here, does depend on current. Of course we need to find more users for this, but I have a feeling this will be generally useful. The alternative is to keep adding bits to thread_info::flags. Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 12:31 PM, Peter Zijlstra wrote: But it means that 'current' is not set to the right value. If the function depends on it, then it will misbehave. And in fact preempt_notifier_register(), which is the function we want to call here, does depend on current. Of course we need to find more users for this, but I have a feeling this will be generally useful. The alternative is to keep adding bits to thread_info::flags. Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. We do need a way to make the task look at them. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 12, 2011 at 11:31:00AM +0200, Peter Zijlstra wrote: On Tue, 2011-07-12 at 12:27 +0300, Avi Kivity wrote: But it means that 'current' is not set to the right value. If the function depends on it, then it will misbehave. And in fact preempt_notifier_register(), which is the function we want to call here, does depend on current. Of course we need to find more users for this, but I have a feeling this will be generally useful. The alternative is to keep adding bits to thread_info::flags. Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. I would rather vote for Avi's solution too. Such a functionality is very helpful for LWP-perf integration as well (because we need a way to call do_mmap for a task != current). Regards, Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 12, 2011 at 10:36:05AM +0100, Avi Kivity wrote: On 07/12/2011 12:31 PM, Peter Zijlstra wrote: But it means that 'current' is not set to the right value. If the function depends on it, then it will misbehave. And in fact preempt_notifier_register(), which is the function we want to call here, does depend on current. Of course we need to find more users for this, but I have a feeling this will be generally useful. The alternative is to keep adding bits to thread_info::flags. Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. We do need a way to make the task look at them. ... especially without pushing this down into the arch code where I don't think it belongs. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 12:41 PM, Joerg Roedel wrote: Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. I would rather vote for Avi's solution too. Such a functionality is very helpful for LWP-perf integration as well (because we need a way to call do_mmap for a task != current). It's not needed for that. See use_mm() (caller must be a kernel thread). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 12, 2011 at 12:44:17PM +0300, Avi Kivity wrote: On 07/12/2011 12:41 PM, Joerg Roedel wrote: Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. I would rather vote for Avi's solution too. Such a functionality is very helpful for LWP-perf integration as well (because we need a way to call do_mmap for a task != current). It's not needed for that. See use_mm() (caller must be a kernel thread). But in our case the caller would be the process that wants to count, not a kernel thread. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, 2011-07-12 at 12:16 +0300, Avi Kivity wrote: On 07/12/2011 12:14 PM, Peter Zijlstra wrote: On Tue, 2011-07-12 at 12:08 +0300, Avi Kivity wrote: Similar, but with stronger guarantees: when the function is called, current == p, and the task was either sleeping or in userspace. If the task is sleeping, current can never be p. The guarantee is that the task was sleeping just before the function is called. Of course it's woken up to run the function. The idea is that you run the function in a known safe point to avoid extra synchronization. I'd much rather we didn't wake the task and let it sleep, that's usually a very safe place for tasks to be. All you'd need is a guarantee it won't be woken up while you're doing your thing. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 12:48 PM, Joerg Roedel wrote: On Tue, Jul 12, 2011 at 12:44:17PM +0300, Avi Kivity wrote: On 07/12/2011 12:41 PM, Joerg Roedel wrote: Using TIF_bits sounds like a much better solution for this, wakeups are really rather expensive and its best to avoid extra if at all possible. I would rather vote for Avi's solution too. Such a functionality is very helpful for LWP-perf integration as well (because we need a way to call do_mmap for a task != current). It's not needed for that. See use_mm() (caller must be a kernel thread). But in our case the caller would be the process that wants to count, not a kernel thread. Have a helper kernel thread do it for you. Or extend use_mm() to return the old mm (without dropping its refcount) and add a way to restore it. Regarding LWP - I thought the intent was self-profiling by the process for jits and the like? If you also use it for perf, won't it be unusable for that? Also, can't the process interfere, from userspace, by executing the unprivileged LWP instructions? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 12, 2011 at 12:55:25PM +0300, Avi Kivity wrote: Have a helper kernel thread do it for you. Or extend use_mm() to return the old mm (without dropping its refcount) and add a way to restore it. Making use_mm usable for regular tasks too sounds like a good idea. Thanks. Regarding LWP - I thought the intent was self-profiling by the process for jits and the like? If you also use it for perf, won't it be unusable for that? Also, can't the process interfere, from userspace, by executing the unprivileged LWP instructions? Ingo made perf-integration a merge-requirement for LWP. It is not really well-suited for being integrated into perf because the design goal was easy and efficient self-profiling of tasks (like you stated). So integrating it into perf causes some pain. But lets see how it works out. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 01:03 PM, Joerg Roedel wrote: Regarding LWP - I thought the intent was self-profiling by the process for jits and the like? If you also use it for perf, won't it be unusable for that? Also, can't the process interfere, from userspace, by executing the unprivileged LWP instructions? Ingo made perf-integration a merge-requirement for LWP. It is not really well-suited for being integrated into perf because the design goal was easy and efficient self-profiling of tasks (like you stated). So integrating it into perf causes some pain. But lets see how it works out. I don't think it's workable. Having do_mmap() called in the task's context can change how it works. And the task being able to kill/modify the profile, and not able to use LWP for itself, is a show stopper IMO. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 12, 2011 at 01:07:25PM +0300, Avi Kivity wrote: On 07/12/2011 01:03 PM, Joerg Roedel wrote: Ingo made perf-integration a merge-requirement for LWP. It is not really well-suited for being integrated into perf because the design goal was easy and efficient self-profiling of tasks (like you stated). So integrating it into perf causes some pain. But lets see how it works out. I don't think it's workable. Having do_mmap() called in the task's context can change how it works. According to Ingo that is already done at other places in the kernel and should not be an issue. And the task being able to kill/modify the profile, and not able to use LWP for itself, is a show stopper IMO. I agree, that is a problem we have no solution for so far. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/12/2011 01:24 PM, Joerg Roedel wrote: On Tue, Jul 12, 2011 at 01:07:25PM +0300, Avi Kivity wrote: On 07/12/2011 01:03 PM, Joerg Roedel wrote: Ingo made perf-integration a merge-requirement for LWP. It is not really well-suited for being integrated into perf because the design goal was easy and efficient self-profiling of tasks (like you stated). So integrating it into perf causes some pain. But lets see how it works out. I don't think it's workable. Having do_mmap() called in the task's context can change how it works. According to Ingo that is already done at other places in the kernel and should not be an issue. kvm is one of those places, unfortunately. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Mon, Jul 04, 2011 at 03:36:57PM +0100, Frederic Weisbecker wrote: On Mon, Jul 04, 2011 at 05:10:20PM +0300, Avi Kivity wrote: On 07/04/2011 04:58 PM, Frederic Weisbecker wrote: Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. 'struct preempt_notifier' may be the hook you're looking for. Yeah looks like a perfect fit as it's per task. I had a quick look at this and I think the preempt_notifier stuff needs slightly extending so that we can register a notifier for a task other than current [e.g. the child of current on which we are installing breakpoints]. If the task in question is running, it looks like this will introduce a race condition between notifier registration and rescheduling. For the purposes of ptrace this shouldn't be a problem as the child will be stopped, but others might also want to make use of the new functionality. Any ideas on how this could be achieved, or am I better off just restricting this to children that are being traced? Cheers, Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
Hi Frederic, On Mon, Jul 04, 2011 at 02:58:24PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 05:27:25PM +0100, Will Deacon wrote: Hi Frederic, Thanks for including me on CC. On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event-overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. Yes, we could store the breakpoint index in there and it would save us walking over the breakpoints when one fires. Not sure this helps us for anything else though. My main gripe with the ptrace interface to hw_breakpoints is that we have to convert all the breakpoint information from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again in the hw_breakpoint code. Yuck! Agreed, I don't like that either. Would you like to improve that? We probably need to be able to pass some arch data through the whole call of breakpoint creation, including perf_event_create_kernel_counter(). Sure, I'll make some time to look at this and try and get an RFC out in the next few weeks. There can be a transition step where we can either take generic attr or arch datas, until every archs are converted. So that you can handle the arm part and other arch developers can relay. Yup. Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. But beeing able to create ptrace breakpoints without converting to generic perf attr is a necessary first step in order to achieve this. Agreed, but I'll bear that in mind so I don't make it any more difficult than it already is! Cheers, Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Tue, Jul 05, 2011 at 03:30:26PM +0100, Will Deacon wrote: Hi Frederic, On Mon, Jul 04, 2011 at 02:58:24PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 05:27:25PM +0100, Will Deacon wrote: Hi Frederic, Thanks for including me on CC. On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event-overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. Yes, we could store the breakpoint index in there and it would save us walking over the breakpoints when one fires. Not sure this helps us for anything else though. My main gripe with the ptrace interface to hw_breakpoints is that we have to convert all the breakpoint information from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again in the hw_breakpoint code. Yuck! Agreed, I don't like that either. Would you like to improve that? We probably need to be able to pass some arch data through the whole call of breakpoint creation, including perf_event_create_kernel_counter(). Sure, I'll make some time to look at this and try and get an RFC out in the next few weeks. Great! Thanks a lot! There can be a transition step where we can either take generic attr or arch datas, until every archs are converted. So that you can handle the arm part and other arch developers can relay. Yup. Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. But beeing able to create ptrace breakpoints without converting to generic perf attr is a necessary first step in order to achieve this. Agreed, but I'll bear that in mind so I don't make it any more difficult than it already is! Cheers, Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Wed, Jun 29, 2011 at 05:27:25PM +0100, Will Deacon wrote: Hi Frederic, Thanks for including me on CC. On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event-overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. Yes, we could store the breakpoint index in there and it would save us walking over the breakpoints when one fires. Not sure this helps us for anything else though. My main gripe with the ptrace interface to hw_breakpoints is that we have to convert all the breakpoint information from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again in the hw_breakpoint code. Yuck! Agreed, I don't like that either. Would you like to improve that? We probably need to be able to pass some arch data through the whole call of breakpoint creation, including perf_event_create_kernel_counter(). There can be a transition step where we can either take generic attr or arch datas, until every archs are converted. So that you can handle the arm part and other arch developers can relay. Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. But beeing able to create ptrace breakpoints without converting to generic perf attr is a necessary first step in order to achieve this. Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On 07/04/2011 04:58 PM, Frederic Weisbecker wrote: Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. 'struct preempt_notifier' may be the hook you're looking for. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
On Mon, Jul 04, 2011 at 05:10:20PM +0300, Avi Kivity wrote: On 07/04/2011 04:58 PM, Frederic Weisbecker wrote: Another thing I would like to do in the even longer term is to not use perf anymore for ptrace breakpoints, because that involves a heavy dependency and few people are happy with that. Instead we should just have a generic hook into the sched_switch() and handle pure ptrace breakpoints there. The central breakpoint API would still be there to reserve/schedule breakpoint resources between ptrace and perf. 'struct preempt_notifier' may be the hook you're looking for. Yeah looks like a perfect fit as it's per task. Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] perf: add context field to perf_event
The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event-overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com --- arch/arm/kernel/ptrace.c|3 ++- arch/powerpc/kernel/ptrace.c|2 +- arch/sh/kernel/ptrace_32.c |3 ++- arch/x86/kernel/kgdb.c |2 +- arch/x86/kernel/ptrace.c|3 ++- drivers/oprofile/oprofile_perf.c|2 +- include/linux/hw_breakpoint.h | 10 -- include/linux/perf_event.h |4 +++- kernel/events/core.c| 21 +++-- kernel/events/hw_breakpoint.c | 10 +++--- kernel/watchdog.c |2 +- samples/hw_breakpoint/data_breakpoint.c |2 +- 12 files changed, 44 insertions(+), 20 deletions(-) diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c index 9726006..4911c94 100644 --- a/arch/arm/kernel/ptrace.c +++ b/arch/arm/kernel/ptrace.c @@ -479,7 +479,8 @@ static struct perf_event *ptrace_hbp_create(struct task_struct *tsk, int type) attr.bp_type= type; attr.disabled = 1; - return register_user_hw_breakpoint(attr, ptrace_hbptriggered, tsk); + return register_user_hw_breakpoint(attr, ptrace_hbptriggered, NULL, + tsk); } static int ptrace_gethbpregs(struct task_struct *tsk, long num, diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index cb22024..5249308 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -973,7 +973,7 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr, attr.bp_type); thread-ptrace_bps[0] = bp = register_user_hw_breakpoint(attr, - ptrace_triggered, task); + ptrace_triggered, NULL, task); if (IS_ERR(bp)) { thread-ptrace_bps[0] = NULL; ptrace_put_breakpoints(task); diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c index 3d7b209..930312f 100644 --- a/arch/sh/kernel/ptrace_32.c +++ b/arch/sh/kernel/ptrace_32.c @@ -91,7 +91,8 @@ static int set_single_step(struct task_struct *tsk, unsigned long addr) attr.bp_len = HW_BREAKPOINT_LEN_2; attr.bp_type = HW_BREAKPOINT_R; - bp = register_user_hw_breakpoint(attr, ptrace_triggered, tsk); + bp = register_user_hw_breakpoint(attr, ptrace_triggered, +NULL, tsk); if (IS_ERR(bp)) return PTR_ERR(bp); diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c index 5f9ecff..473ab53 100644 --- a/arch/x86/kernel/kgdb.c +++ b/arch/x86/kernel/kgdb.c @@ -638,7 +638,7 @@ void kgdb_arch_late(void) for (i = 0; i HBP_NUM; i++) { if (breakinfo[i].pev) continue; - breakinfo[i].pev = register_wide_hw_breakpoint(attr, NULL); + breakinfo[i].pev = register_wide_hw_breakpoint(attr, NULL, NULL); if (IS_ERR((void * __force)breakinfo[i].pev)) { printk(KERN_ERR kgdb: Could not allocate hw breakpoints\nDisabling the kernel debugger\n); diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 807c2a2..28092ae 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -715,7 +715,8 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr, attr.bp_type = HW_BREAKPOINT_W; attr.disabled = 1; - bp = register_user_hw_breakpoint(attr, ptrace_triggered, tsk); + bp = register_user_hw_breakpoint(attr, ptrace_triggered, +NULL, tsk); /* * CHECKME: the previous code returned -EIO if the addr wasn't diff --git a/drivers/oprofile/oprofile_perf.c b/drivers/oprofile/oprofile_perf.c index 9046f7b..59acf9e 100644 --- a/drivers/oprofile/oprofile_perf.c +++ b/drivers/oprofile/oprofile_perf.c @@ -79,7 +79,7 @@ static int op_create_counter(int cpu, int event) pevent = perf_event_create_kernel_counter(counter_config[event].attr, cpu, NULL, - op_overflow_handler); +
Re: [PATCH 1/3] perf: add context field to perf_event
On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event-overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main user, may be Jason could find some use of it. --- arch/arm/kernel/ptrace.c|3 ++- arch/powerpc/kernel/ptrace.c|2 +- arch/sh/kernel/ptrace_32.c |3 ++- arch/x86/kernel/kgdb.c |2 +- arch/x86/kernel/ptrace.c|3 ++- drivers/oprofile/oprofile_perf.c|2 +- include/linux/hw_breakpoint.h | 10 -- include/linux/perf_event.h |4 +++- kernel/events/core.c| 21 +++-- kernel/events/hw_breakpoint.c | 10 +++--- kernel/watchdog.c |2 +- samples/hw_breakpoint/data_breakpoint.c |2 +- 12 files changed, 44 insertions(+), 20 deletions(-) diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c index 9726006..4911c94 100644 --- a/arch/arm/kernel/ptrace.c +++ b/arch/arm/kernel/ptrace.c @@ -479,7 +479,8 @@ static struct perf_event *ptrace_hbp_create(struct task_struct *tsk, int type) attr.bp_type= type; attr.disabled = 1; - return register_user_hw_breakpoint(attr, ptrace_hbptriggered, tsk); + return register_user_hw_breakpoint(attr, ptrace_hbptriggered, NULL, +tsk); } static int ptrace_gethbpregs(struct task_struct *tsk, long num, diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index cb22024..5249308 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -973,7 +973,7 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr, attr.bp_type); thread-ptrace_bps[0] = bp = register_user_hw_breakpoint(attr, - ptrace_triggered, task); +ptrace_triggered, NULL, task); if (IS_ERR(bp)) { thread-ptrace_bps[0] = NULL; ptrace_put_breakpoints(task); diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c index 3d7b209..930312f 100644 --- a/arch/sh/kernel/ptrace_32.c +++ b/arch/sh/kernel/ptrace_32.c @@ -91,7 +91,8 @@ static int set_single_step(struct task_struct *tsk, unsigned long addr) attr.bp_len = HW_BREAKPOINT_LEN_2; attr.bp_type = HW_BREAKPOINT_R; - bp = register_user_hw_breakpoint(attr, ptrace_triggered, tsk); + bp = register_user_hw_breakpoint(attr, ptrace_triggered, + NULL, tsk); if (IS_ERR(bp)) return PTR_ERR(bp); diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c index 5f9ecff..473ab53 100644 --- a/arch/x86/kernel/kgdb.c +++ b/arch/x86/kernel/kgdb.c @@ -638,7 +638,7 @@ void kgdb_arch_late(void) for (i = 0; i HBP_NUM; i++) { if (breakinfo[i].pev) continue; - breakinfo[i].pev = register_wide_hw_breakpoint(attr, NULL); + breakinfo[i].pev = register_wide_hw_breakpoint(attr, NULL, NULL); if (IS_ERR((void * __force)breakinfo[i].pev)) { printk(KERN_ERR kgdb: Could not allocate hw breakpoints\nDisabling the kernel debugger\n); diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 807c2a2..28092ae 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -715,7 +715,8 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr, attr.bp_type = HW_BREAKPOINT_W; attr.disabled = 1; - bp = register_user_hw_breakpoint(attr, ptrace_triggered, tsk); + bp = register_user_hw_breakpoint(attr, ptrace_triggered, + NULL, tsk); /* * CHECKME: the previous code returned -EIO if the addr wasn't diff --git a/drivers/oprofile/oprofile_perf.c
Re: [PATCH 1/3] perf: add context field to perf_event
On 06/29/2011 07:08 PM, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event-overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivitya...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Right, I noticed that while writing the patch. Perhaps it can help in arm too, adding Will in Cc. But for register_wide_hw_breakpoint, I'm not sure. kgdb is the main user, may be Jason could find some use of it. I think an API should not require its users to iterate in their callbacks, even if it doesn't affect current users for some reason. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] perf: add context field to perf_event
Hi Frederic, Thanks for including me on CC. On Wed, Jun 29, 2011 at 05:08:45PM +0100, Frederic Weisbecker wrote: On Wed, Jun 29, 2011 at 06:42:35PM +0300, Avi Kivity wrote: The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event-overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity a...@redhat.com I believe it can micro-optimize ptrace through register_user_hw_breakpoint() because we could store the index of the breakpoint that way, instead of iterating through 4 slots. Perhaps it can help in arm too, adding Will in Cc. Yes, we could store the breakpoint index in there and it would save us walking over the breakpoints when one fires. Not sure this helps us for anything else though. My main gripe with the ptrace interface to hw_breakpoints is that we have to convert all the breakpoint information from ARM_BREAKPOINT_* to HW_BREAKPOINT_* and then convert it all back again in the hw_breakpoint code. Yuck! Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html