Re: kvm lockdep splat with 3.8-rc1+
Hi Borislav

On Thu, Dec 27, 2012 at 12:43 PM, Borislav Petkov b...@alien8.de wrote:

On Wed, Dec 26, 2012 at 08:18:13PM +0800, Hillf Danton wrote:

Can you please test with 5a505085f0 and 4fc3f1d66b reverted?

sure can do, but am travelling ATM so I'll run it with the reverted commits when I get back next week.

Jiri posted a similar locking issue at https://lkml.org/lkml/2013/1/4/380
Take a look?

Hillf
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm lockdep splat with 3.8-rc1+
On Wed, Dec 26, 2012 at 6:30 AM, Borislav Petkov b...@alien8.de wrote:

Hi all,

just saw this in dmesg while running -rc1 + tip/master:

[ 6983.694615] =
[ 6983.694617] [ INFO: possible recursive locking detected ]
[ 6983.694620] 3.8.0-rc1+ #26 Not tainted
[ 6983.694621] -
[ 6983.694623] kvm/20461 is trying to acquire lock:
[ 6983.694625]  (anon_vma->rwsem){..}, at: [8111d2c8] mm_take_all_locks+0x148/0x1a0
[ 6983.694636]
[ 6983.694636] but task is already holding lock:
[ 6983.694638]  (anon_vma->rwsem){..}, at: [8111d2c8] mm_take_all_locks+0x148/0x1a0
[ 6983.694645]
[ 6983.694645] other info that might help us debug this:
[ 6983.694647]  Possible unsafe locking scenario:
[ 6983.694647]
[ 6983.694649]        CPU0
[ 6983.694650]        ----
[ 6983.694651]   lock(anon_vma->rwsem);
[ 6983.694654]   lock(anon_vma->rwsem);
[ 6983.694657]
[ 6983.694657]  *** DEADLOCK ***
[ 6983.694657]
[ 6983.694659]  May be due to missing lock nesting notation
[ 6983.694659]
[ 6983.694661] 4 locks held by kvm/20461:
[ 6983.694663]  #0:  (mm->mmap_sem){++}, at: [8112afb3] do_mmu_notifier_register+0x153/0x180
[ 6983.694670]  #1:  (mm_all_locks_mutex){+.+...}, at: [8111d1bc] mm_take_all_locks+0x3c/0x1a0
[ 6983.694678]  #2:  (mapping->i_mmap_mutex){+.+...}, at: [8111d24d] mm_take_all_locks+0xcd/0x1a0
[ 6983.694686]  #3:  (anon_vma->rwsem){..}, at: [8111d2c8] mm_take_all_locks+0x148/0x1a0
[ 6983.694694]
[ 6983.694694] stack backtrace:
[ 6983.694696] Pid: 20461, comm: kvm Not tainted 3.8.0-rc1+ #26
[ 6983.694698] Call Trace:
[ 6983.694704]  [8109c2fa] __lock_acquire+0x89a/0x1f30
[ 6983.694708]  [810978ed] ? trace_hardirqs_off+0xd/0x10
[ 6983.694711]  [81099b8d] ? mark_held_locks+0x8d/0x110
[ 6983.694714]  [8111d24d] ? mm_take_all_locks+0xcd/0x1a0
[ 6983.694718]  [8109e05e] lock_acquire+0x9e/0x1f0
[ 6983.694720]  [8111d2c8] ? mm_take_all_locks+0x148/0x1a0
[ 6983.694724]  [81097ace] ? put_lock_stats.isra.17+0xe/0x40
[ 6983.694728]  [81519949] down_write+0x49/0x90
[ 6983.694731]  [8111d2c8] ? mm_take_all_locks+0x148/0x1a0
[ 6983.694734]  [8111d2c8] mm_take_all_locks+0x148/0x1a0
[ 6983.694737]  [8112afb3] ? do_mmu_notifier_register+0x153/0x180
[ 6983.694740]  [8112aedf] do_mmu_notifier_register+0x7f/0x180
[ 6983.694742]  [8112b013] mmu_notifier_register+0x13/0x20
[ 6983.694765]  [a00e665d] kvm_dev_ioctl+0x3cd/0x4f0 [kvm]
[ 6983.694768]  [8114bcb0] do_vfs_ioctl+0x90/0x570
[ 6983.694772]  [81157403] ? fget_light+0x323/0x4c0
[ 6983.694775]  [8114c1e0] sys_ioctl+0x50/0x90
[ 6983.694781]  [8123a25e] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 6983.694785]  [8151d4c2] system_call_fastpath+0x16/0x1b

Hey Boris,

Can you please test with 5a505085f0 and 4fc3f1d66b reverted?

Hillf
Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]
On Wed, Jan 5, 2011 at 5:41 PM, Peter Zijlstra pet...@infradead.org wrote:

On Wed, 2011-01-05 at 00:38 +0100, Tommaso Cucinotta wrote:

Il 04/01/2011 19:15, Dario Faggioli ha scritto:

Forwarded Message
From: Peter Zijlstra a.p.zijls...@chello.nl
To: Rik van Riel r...@redhat.com
Cc: Hillf Danton dhi...@gmail.com, kvm@vger.kernel.org, linux-ker...@vger.kernel.org, Avi Kivity a...@redhat.com, Srivatsa Vaddagiri va...@linux.vnet.ibm.com, Mike Galbraith efa...@gmx.de, Chris Wright chr...@sous-sol.org
Subject: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
Date: Tue, 04 Jan 2011 19:05:54 +0100

RT guests don't make sense, there's nowhere near enough infrastructure for that to work well. I'd argue that KVM running with RT priority is a bug.

Peter, can I ask why you stated that? In the IRMOS project, we are deploying KVM VMs using Fabio's real-time scheduler (for others, a.k.a. Fabio's EDF throttling patch, or the IRMOS RT scheduler) in order to let the VMs get precise CPU scheduling guarantees from the kernel. So, in this context we do have KVM running at RT priority, and we have experimental results showing how this can improve the performance stability of the hosted guest VMs. Of course, don't misunderstand me: this is a necessary condition for stable performance of KVM VMs, I'm not saying it is sufficient for

I was mostly referring to the existing RT cruft (SCHED_RR/FIFO), which is utterly useless for KVM. As to hosting vcpus with CBS, this might maybe make sense, but RT guests are still miles away. Anyway, I'm not quite sure how you would want to deal with the guest spinlock issue in CBS; ideally you'd use paravirt guests to avoid that whole problem.

Anyway, /me goes do something useful, virt sucks and should be taken out back and shot in the head.

I don't think we are still on track with the patch from Rik, in which Mike brought the yield_to method into scheduling. The focus, as I see it, is mainly on the effectiveness of the new method, since it could also be utilized in other environments, though currently it has nothing to do with the RT cruft but aims at easing certain lock contention in KVM. Another issue is whether the change in the fair scheduling class that accompanies the new method is justified, for whatever reasons Rik holds.

Let's please return to the patch, and defer the RT discussion.

thanks
Hillf
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Thu, Jan 6, 2011 at 12:57 AM, Mike Galbraith efa...@gmx.de wrote:

sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks. Add a yield_to_task() method to the fair scheduling class, allowing the caller of yield_to() to accelerate another thread in its thread group, task group, and sched class toward either its cpu, or potentially the caller's own cpu if the 'preempt' argument is also passed. Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de
---
 include/linux/sched.h |    1
 kernel/sched.c        |   56 ++
 kernel/sched_fair.c   |   52 ++
 3 files changed, 109 insertions(+)

Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);

+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;

to be clearer?

	if (task_running(p_rq, p) || p->state != TASK_RUNNING)
		goto out;

+	if (!same_thread_group(p, curr))
+		goto out;
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel r...@redhat.com wrote:

From: Mike Galbraith efa...@gmx.de

Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Not-signed-off-by: Mike Galbraith efa...@gmx.de
---
Mike, want to change the above into a Signed-off-by: ? :)
This code seems to work well.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..0b8a3e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,6 +1083,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
@@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif
+extern void yield_to(struct task_struct *p, int preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ffa7a9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6901,6 +6901,53 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);

+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
+	    !same_thread_group(p, curr) ||
+	    !curr->sched_class->yield_to_task ||
+	    curr->sched_class != p->sched_class) {
+		goto out;
+	}
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..3288e7c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
 }

 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct sched_entity *curr = &(task_rq(p)->curr)->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int yield = this_rq() == task_rq(p);
+	int want_preempt = preempt;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (cfs_rq->tg != p_cfs_rq->tg)
+		return 0;
+
+	/* Preemption only allowed within the same task group. */
+	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
+		preempt = 0;
+#endif
+	/* Preemption only allowed within the same thread group. */
+	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
+		preempt = 0;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 */
+	if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
+	    cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), smp_processor_id());
+		p_cfs_rq = cfs_rq_of(pse);
+		yield = 1;
+	}
+#endif
+
+	if
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel r...@redhat.com wrote:

From: Mike Galbraith efa...@gmx.de

Add a yield_to function to the scheduler code, allowing us to give enough of our timeslice to another thread to allow it to run and release whatever resource we need it to release. We may want to use this to provide a sys_yield_to system call one day.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Not-signed-off-by: Mike Galbraith efa...@gmx.de
---
Mike, want to change the above into a Signed-off-by: ? :)
This code seems to work well.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..0b8a3e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,6 +1083,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
@@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif
+extern void yield_to(struct task_struct *p, int preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ffa7a9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6901,6 +6901,53 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);

+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
	struct task_struct *next;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
+	    !same_thread_group(p, curr) ||
	    /* !curr->sched_class->yield_to_task || */
+	    curr->sched_class != p->sched_class) {
+		goto out;
+	}

	/*
	 * ask scheduler to compute the next for successfully kicking @p onto its CPU
	 * what if p_rq is rt_class to do?
	 */
	next = pick_next_task(p_rq);
	if (next != p)
		p->se.vruntime = next->se.vruntime - 1;
	deactivate_task(p_rq, p, 0);
	activate_task(p_rq, p, 0);
	if (rq == p_rq)
		schedule();
	else
		resched_task(p_rq->curr);
	yield = 0;
	/* yield = curr->sched_class->yield_to_task(p, preempt); */
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..3288e7c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
 }

 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+		      struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct sched_entity *curr = &(task_rq(p)->curr)->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int yield = this_rq() == task_rq(p);
+	int want_preempt = preempt;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (cfs_rq->tg != p_cfs_rq->tg)
+		return 0;
+
+	/* Preemption only allowed within the same task group. */
+	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
+		preempt = 0;
+#endif
+	/* Preemption only allowed within the same thread group. */
+	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
+		preempt = 0;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, Jan 5, 2011 at 12:44 AM, Rik van Riel r...@redhat.com wrote:

On 01/04/2011 11:41 AM, Hillf Danton wrote:

	    /* !curr->sched_class->yield_to_task || */
+	    curr->sched_class != p->sched_class) {
+		goto out;
+	}

	/*
	 * ask scheduler to compute the next for successfully kicking @p onto its CPU
	 * what if p_rq is rt_class to do?
	 */
	next = pick_next_task(p_rq);
	if (next != p)
		p->se.vruntime = next->se.vruntime - 1;
	deactivate_task(p_rq, p, 0);
	activate_task(p_rq, p, 0);
	if (rq == p_rq)
		schedule();
	else
		resched_task(p_rq->curr);
	yield = 0;

Wouldn't that break for FIFO and RR tasks? There's a reason all the scheduler folks wanted a per-class yield_to_task function :)

Where is the yield_to callback in the patch for the RT scheduling class? If @p is RT, what could you do?

Hillf
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, Jan 5, 2011 at 12:54 AM, Rik van Riel r...@redhat.com wrote:

On 01/04/2011 11:51 AM, Hillf Danton wrote:

Wouldn't that break for FIFO and RR tasks? There's a reason all the scheduler folks wanted a per-class yield_to_task function :)

Where is the yield_to callback in the patch for the RT scheduling class? If @p is RT, what could you do?

If the user chooses to overcommit the CPU with realtime tasks, the user cannot expect realtime response. For realtime, I have not implemented the yield_to callback at all, because it would probably break realtime semantics and I assume people will not overcommit the CPU with realtime tasks anyway. I could see running a few realtime guests on a system, with the number of realtime VCPUs not exceeding the number of physical CPUs.

Then it looks like the curr->sched_class != p->sched_class check is not enough, and yield_to cannot ease the lock contention in KVM in the case where p_rq->curr is RT.
Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
On Wed, Jan 5, 2011 at 1:08 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote:

On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:

Where is the yield_to callback in the patch for the RT scheduling class? If @p is RT, what could you do?

RT guests are a pipe dream, you first need to get the hypervisor (kvm in this case) to be RT, which it isn't. Then you either need to very statically set up the host and the guest scheduling constraints (not possible with RR/FIFO) or have a complete paravirt RT scheduler which communicates its requirements to the host.

Even if the guest is not RT, you cannot prevent it from being preempted by an RT task which has nothing to do with guests.
[PATCH] KVM: x86: mmu: fix counting of rmap entries in rmap_add()
It seems that rmap entries are undercounted.

Signed-off-by: Hillf Danton dhi...@gmail.com
---
--- o/linux-2.6.36-rc1/arch/x86/kvm/mmu.c	2010-08-16 08:41:38.0 +0800
+++ m/linux-2.6.36-rc1/arch/x86/kvm/mmu.c	2010-09-18 07:51:44.0 +0800
@@ -591,6 +591,7 @@ static int rmap_add(struct kvm_vcpu *vcp
 		desc->sptes[0] = (u64 *)*rmapp;
 		desc->sptes[1] = spte;
 		*rmapp = (unsigned long)desc | 1;
+		++count;
 	} else {
 		rmap_printk("rmap_add: %p %llx many->many\n", spte, *spte);
 		desc = (struct kvm_rmap_desc *)(*rmapp & ~1ul);
@@ -603,7 +604,7 @@ static int rmap_add(struct kvm_vcpu *vcp
 			desc = desc->more;
 		}
 		for (i = 0; desc->sptes[i]; ++i)
-			;
+			++count;
 		desc->sptes[i] = spte;
 	}
 	return count;