Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-12 Thread Rik van Riel

On 01/07/2011 12:29 AM, Mike Galbraith wrote:


+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 *
+	 * NOTE: the target may be alone in its cfs_rq if another class
+	 * task or another task group is currently executing on its cpu.
+	 * In this case, we still pull, to accelerate it toward the cpu.
+	 */
+	if (cfs_rq != p_cfs_rq && preempt && cfs_rq->nr_running == 1 &&
+	    cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), this_cpu);
+		p_cfs_rq = cfs_rq_of(pse);
+	}
+#endif


This causes some fun issues in a simple test case on
my system.  The test consists of two 4-VCPU KVM guests,
bound to the same 4 CPUs on the host.

One guest is running the AMQP performance test, the
other guest is totally idle.  This means that besides
the 4 very busy VCPUs, there are only a few percent
of CPU use from background tasks in the idle guest and
the qemu-kvm userspace bits.

However, the busy guest is restricted to using just
3 out of the 4 CPUs, leaving one idle!

A simple explanation for this is that the above
pulling code will pull another VCPU onto the local
CPU whenever we run into contention inside the guest
and some random background task runs on the CPU where
that other VCPU was.

From that point on, the 4 VCPUs will stay on 3
CPUs, leaving one idle.  Any time we have contention
inside the guest (pretty frequent), we move whoever
is not currently running to another CPU.

Cgroups only make matters worse - libvirt places
each KVM guest into its own cgroup, so a VCPU will
generally always be alone on its own per-cgroup, per-cpu
runqueue!  That can lead to pulling a VCPU onto our local
CPU because we think we are alone, when in reality we
share the CPU with others...

Removing the pulling code allows me to use all 4
CPUs with a 4-VCPU KVM guest in an uncontended situation.


+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;


Using set_next_buddy propagates this up to the root,
allowing the scheduler to actually know who we want to
run next when cgroups are involved.
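
For reference, set_next_buddy() in the sched_fair.c of this era looks
roughly like the sketch below (reconstructed from memory, not part of
this patch); the hierarchy walk is what makes the hint visible at the
root:

static void set_next_buddy(struct sched_entity *se)
{
	if (likely(task_of(se)->policy != SCHED_IDLE)) {
		/*
		 * Walk up through the group-entity hierarchy, setting
		 * the ->next hint on each level's cfs_rq, so the
		 * root-level pick can follow the buddy chain down to
		 * the task.
		 */
		for_each_sched_entity(se)
			cfs_rq_of(se)->next = se;
	}
}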


+	/* We know whether we want to preempt or not, but are we allowed? */
+	if (preempt && same_thread_group(p, task_of(p_cfs_rq->curr)))
+		resched_task(task_of(p_cfs_rq->curr));


With this in place, we can get into the situation where
we will gladly give up CPU time, but not actually give
any to the other VCPUs in our guest.

I believe we can get rid of that test, because pick_next_entity
already makes sure it ignores ->next if picking ->next would
lead to unfairness.
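
For context, pick_next_entity() already looks roughly like this (a
sketch from memory, not part of this patch); the wakeup_preempt_entity()
tests are the fairness guard referred to above:

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = __pick_next_entity(cfs_rq);
	struct sched_entity *left = se;

	/*
	 * Only honor the ->next buddy hint if running it instead of
	 * the leftmost entity would not be too unfair.
	 */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	/*
	 * Prefer the last buddy, to return the CPU to a preempted
	 * task, under the same fairness constraint.
	 */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	clear_buddies(cfs_rq, se);

	return se;
}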

Removing this test (and simplifying yield_to_task_fair) seems
to lead to more predictable test results.

I'll send the updated patch in another email, since this one is
already way too long for a changelog :)

--
All rights reversed


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-12 Thread Mike Galbraith
On Wed, 2011-01-12 at 22:02 -0500, Rik van Riel wrote:

 Cgroups only make matters worse - libvirt places
 each KVM guest into its own cgroup, so a VCPU will
 generally always be alone on its own per-cgroup, per-cpu
 runqueue!  That can lead to pulling a VCPU onto our local
 CPU because we think we are alone, when in reality we
 share the CPU with others...

How can that happen?  If the task you're trying to accelerate isn't in
your task group, the whole attempt should be a noop.

 Removing the pulling code allows me to use all 4
 CPUs with a 4-VCPU KVM guest in an uncontended situation.
 
  +	/* Tell the scheduler that we'd really like pse to run next. */
  +	p_cfs_rq->next = pse;
 
 Using set_next_buddy propagates this up to the root,
 allowing the scheduler to actually know who we want to
 run next when cgroups is involved.
 
  +	/* We know whether we want to preempt or not, but are we allowed? */
  +	if (preempt && same_thread_group(p, task_of(p_cfs_rq->curr)))
  +		resched_task(task_of(p_cfs_rq->curr));
 
 With this in place, we can get into the situation where
 we will gladly give up CPU time, but not actually give
 any to the other VCPUs in our guest.
 
  I believe we can get rid of that test, because pick_next_entity
  already makes sure it ignores ->next if picking ->next would
  lead to unfairness.

Preempting everybody who is in your way isn't playing nice neighbor, so
I think at least the same_thread_group() test needs to stay.  But that's
Peter's call.  Starting a zillion threads to play "wakeup preempt and
let's hog the cpu" isn't nice either, but it's allowed.

 Removing this test (and simplifying yield_to_task_fair) seems
 to lead to more predictable test results.

Less is more :)

-Mike



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-12 Thread Rik van Riel

On 01/12/2011 10:26 PM, Mike Galbraith wrote:

On Wed, 2011-01-12 at 22:02 -0500, Rik van Riel wrote:


Cgroups only make matters worse - libvirt places
each KVM guest into its own cgroup, so a VCPU will
generally always be alone on its own per-cgroup, per-cpu
runqueue!  That can lead to pulling a VCPU onto our local
CPU because we think we are alone, when in reality we
share the CPU with others...


How can that happen?  If the task you're trying to accelerate isn't in
your task group, the whole attempt should be a noop.


No no, all the VCPUs from the same guest are in the same
cgroup.  However, with 4 VCPUs and 4 physical CPUs,
chances are that they're all alone on their own per-cpu,
per-cgroup cfs_rq.

However, each CPU might have other runnable processes in
other cfs_rqs and sched entities.
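
A hypothetical illustration of that situation (vcpu_task and the helper
are made-up names; cpu_rq(), cfs_rq_of() and the cfs_rq fields are the
usual sched ones):

/* Illustration only: "am I alone?" depends on which cfs_rq you ask. */
static bool vcpu_looks_alone_but_is_not(struct task_struct *vcpu_task, int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct cfs_rq *group_cfs_rq = cfs_rq_of(&vcpu_task->se);

	/* The VCPU may be the only entity on its task group's
	 * per-cpu cfs_rq... */
	bool alone_in_group = group_cfs_rq->nr_running == 1;

	/* ...while the root cfs_rq on the same CPU still has other
	 * runnable entities (other groups, other classes' tasks). */
	bool cpu_contended = rq->cfs.nr_running > 1;

	return alone_in_group && cpu_contended;
}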

--
All rights reversed


Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]

2011-01-06 Thread Hillf Danton
On Wed, Jan 5, 2011 at 5:41 PM, Peter Zijlstra pet...@infradead.org wrote:
 On Wed, 2011-01-05 at 00:38 +0100, Tommaso Cucinotta wrote:
 On 04/01/2011 19:15, Dario Faggioli wrote:
 
   Forwarded Message 
  From: Peter Zijlstra <a.p.zijls...@chello.nl>
  To: Rik van Riel <r...@redhat.com>
  Cc: Hillf Danton <dhi...@gmail.com>, kvm@vger.kernel.org,
  linux-ker...@vger.kernel.org, Avi Kivity <a...@redhat.com>, Srivatsa
  Vaddagiri <va...@linux.vnet.ibm.com>, Mike Galbraith <efa...@gmx.de>,
  Chris Wright <chr...@sous-sol.org>
  Subject: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  Date: Tue, 04 Jan 2011 19:05:54 +0100
  RT guests don't make sense, there's nowhere near enough infrastructure
  for that to work well.
 
  I'd argue that KVM running with RT priority is a bug.
 Peter, can I ask why you stated that?  In the IRMOS project, we
 are just deploying KVM VMs using Fabio's real-time scheduler
 (for others: a.k.a. Fabio's EDF throttling patch, or the IRMOS RT
 scheduler)
 in order to let the VMs get precise CPU scheduling guarantees from the
 kernel. So, in this context we do have KVM running at RT priority, and
 we do have experimental results showing how this can improve stability
 of performance of the hosted guest VMs.
 Of course, don't misunderstand me: this is a necessary condition for a
 stable performance of KVM VMs, I'm not saying it is sufficient for

 I was mostly referring to the existing RT cruft (SCHED_RR/FIFO), that's
 utterly useless for KVM.

 As to hosting vcpus with CBS, this might make sense, but RT guests
 are still miles away. Anyway, I'm not quite sure how you would want to
 deal with the guest spinlock issue in CBS, ideally you'd use paravirt
 guests to avoid that whole problem.

 Anyway, /me goes do something useful, virt sucks and should be taken out
 back and shot in the head.


I don't think we are still on the track of the patch from Rik, in
which Mike brought the yield_to method into scheduling.

The focus, as I see it, is mainly on the effectiveness of the new
method, since it could also be utilized in other environments; though
currently it has nothing to do with the RT cruft, it aims at easing
certain lock contention in KVM.

Another issue is whether the change in the fair scheduling class that
accompanies the new method is warranted, for whatever reasons Rik holds.

Let's please return to the patch, and defer the RT discussion.

thanks
Hillf


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-06 Thread Hillf Danton
On Thu, Jan 6, 2011 at 12:57 AM, Mike Galbraith efa...@gmx.de wrote:
 sched: Add yield_to(task, preempt) functionality.

 Currently only implemented for fair class tasks.

 Add a yield_to_task() method to the fair scheduling class, allowing the
 caller of yield_to() to accelerate another thread in its thread group,
 task group, and sched class toward either its cpu, or potentially the
 caller's own cpu if the 'preempt' argument is also passed.

 Implemented via a scheduler hint, using cfs_rq->next to encourage the
 target being selected.

 Signed-off-by: Rik van Riel r...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Mike Galbraith efa...@gmx.de

 ---
  include/linux/sched.h |    1
  kernel/sched.c        |   56 ++++++++++++
  kernel/sched_fair.c   |   52 ++++++++++
  3 files changed, 109 insertions(+)

 Index: linux-2.6/include/linux/sched.h
 ===================================================================
 --- linux-2.6.orig/include/linux/sched.h
 +++ linux-2.6/include/linux/sched.h
 @@ -1056,6 +1056,7 @@ struct sched_class {
        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
        void (*yield_task) (struct rq *rq);
 +       int (*yield_to_task) (struct task_struct *p, int preempt);

        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

 Index: linux-2.6/kernel/sched.c
 ===================================================================
 --- linux-2.6.orig/kernel/sched.c
 +++ linux-2.6/kernel/sched.c
 @@ -5327,6 +5327,62 @@ void __sched yield(void)
  }
  EXPORT_SYMBOL(yield);

 +/**
 + * yield_to - yield the current processor to another thread in
 + * your thread group, or accelerate that thread toward the
 + * processor it's on.
 + *
 + * It's the caller's job to ensure that the target task struct
 + * can't go away on us before we can do any checks.
 + */
 +void __sched yield_to(struct task_struct *p, int preempt)
 +{
 +       struct task_struct *curr = current;
 +       struct rq *rq, *p_rq;
 +       unsigned long flags;
 +       int yield = 0;
 +
 +       local_irq_save(flags);
 +       rq = this_rq();
 +
 +again:
 +       p_rq = task_rq(p);
 +       double_rq_lock(rq, p_rq);
 +       while (task_rq(p) != p_rq) {
 +               double_rq_unlock(rq, p_rq);
 +               goto again;
 +       }
 +
 +       if (!curr->sched_class->yield_to_task)
 +               goto out;
 +
 +       if (curr->sched_class != p->sched_class)
 +               goto out;
 +

to be clearer?
        if (task_running(p_rq, p) || p->state != TASK_RUNNING)

 +               goto out;
 +
 +       if (!same_thread_group(p, curr))
 +               goto out;


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-06 Thread Mike Galbraith
On Wed, 2011-01-05 at 18:04 +0100, Peter Zijlstra wrote:
 On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:
  +	p_cfs_rq = cfs_rq_of(pse);
  +	local = 1;
  +	}
  +#endif
  +
  +	/* Tell the scheduler that we'd really like pse to run next. */
  +	p_cfs_rq->next = pse;
  +
  +	/* We know whether we want to preempt or not, but are we allowed? */
  +	preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
  +
  +	if (local)
  +		clear_buddies(cfs_rq, se);
 
 You might want to clear before setting next :-)

Or better, just remove the Dept. of Redundancy Dept. cruft.  We clear
buddies upon selection.  It's also pointless worrying whether to set
TIF_RESCHED or not, no cycle savings to be had there methinks.

While performing the cruftectomy, I also did a cosmetic int -> bool
conversion.
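
For reference, the selection-time clearing referred to above looks
roughly like this in the sched_fair.c of this era (a sketch from memory,
not part of the patch):

static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* Drop any buddy hints that point at the entity being selected. */
	if (!se || cfs_rq->last == se)
		cfs_rq->last = NULL;

	if (!se || cfs_rq->next == se)
		cfs_rq->next = NULL;
}

static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* Clear at every level of the group hierarchy, mirroring the
	 * way set_next_buddy() sets the hints on the way up. */
	for_each_sched_entity(se)
		__clear_buddies(cfs_rq_of(se), se);
}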



sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group,
task group, and sched class toward either its cpu, or potentially the
caller's own cpu if the 'preempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

---
 include/linux/sched.h |    1
 kernel/sched.c        |   56 ++++++++++++
 kernel/sched_fair.c   |   44 +++++++++
 3 files changed, 101 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	bool (*yield_to_task) (struct task_struct *p, bool preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, bool preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	bool yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1337,6 +1337,49 @@ static void yield_task_fair(struct rq *r
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+ struct rq *this_rq, int this_cpu);
+#endif
+
+static bool yield_to_task_fair(struct task_struct *p, bool preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int this_cpu = smp_processor_id();
+
+	if (!pse->on_rq)
+		return false;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+ 

Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Avi Kivity

On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:

  On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
avoid most of the Nehalem (and other P2P cache arch) lock unfairness
problem. (Probably creating pthread_condattr_setautoyield_np or a
similar knob would be a good one.)

  Often, the thread calling pthread_cond_signal() wants to continue
  executing, not yield.

Then, it doesn't work.

After calling pthread_cond_signal(), T1 (the cond_signal caller) and T2
(the woken thread) start to race for the GIL.  But T1 usually always wins,
because the lock variable is in T1's cpu cache.  Why do kernel and userland
get such different results?  One reason is that glibc doesn't have any
ticket lock scheme.

If you are interested in the GIL mess and its issues, please feel free to ask more.
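
For readers unfamiliar with the term: a ticket lock hands out FIFO turns
with two counters.  A minimal sketch (my illustration, not anything in
glibc; assumes C11 atomics):

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to dispense */
	atomic_uint owner;	/* ticket currently holding the lock */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* spin until our number comes up: strict FIFO */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);	/* hand off to the next ticket */
}

With this, the cache-hot signaller can no longer barge in ahead of a
waiter that took an earlier ticket.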


I suggest looking into an explicit round-robin scheme, where each thread 
adds itself to a queue and an unlock wakes up the first waiter.
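
A minimal sketch of that scheme (my illustration, with hypothetical
names), built on a plain mutex/condvar so an unlock hands the lock to
the longest-waiting thread:

#include <pthread.h>

struct fifo_lock {
	pthread_mutex_t mu;
	pthread_cond_t  cv;
	unsigned long   next_ticket;	/* next queue position to hand out */
	unsigned long   now_serving;	/* position allowed to hold the lock */
};

static void fifo_lock_acquire(struct fifo_lock *l)
{
	pthread_mutex_lock(&l->mu);
	unsigned long me = l->next_ticket++;	/* join the back of the queue */
	while (me != l->now_serving)
		pthread_cond_wait(&l->cv, &l->mu);
	pthread_mutex_unlock(&l->mu);
}

static void fifo_lock_release(struct fifo_lock *l)
{
	pthread_mutex_lock(&l->mu);
	l->now_serving++;		/* pass ownership to the next in line */
	pthread_cond_broadcast(&l->cv);	/* wake waiters; only the head proceeds */
	pthread_mutex_unlock(&l->mu);
}

The strict hand-off is exactly what costs a block/wakeup cycle per
transfer, which is what the rest of this thread argues about.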


--
error compiling committee.c: too many arguments to function



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Avi Kivity

On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:

  On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
   On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
  Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
  avoid most of the Nehalem (and other P2P cache arch) lock unfairness
  problem. (Probably creating pthread_condattr_setautoyield_np or a
  similar knob would be a good one.)

   Often, the thread calling pthread_cond_signal() wants to continue
   executing, not yield.
  
  Then, it doesn't work.
   
  After calling pthread_cond_signal(), T1 (the cond_signal caller) and T2
  (the woken thread) start to race for the GIL.  But T1 usually always wins,
  because the lock variable is in T1's cpu cache.  Why do kernel and userland
  get such different results?  One reason is that glibc doesn't have any
  ticket lock scheme.
   
  If you are interested in the GIL mess and its issues, please feel free to ask more.

  I suggest looking into an explicit round-robin scheme, where each thread
  adds itself to a queue and an unlock wakes up the first waiter.

I'm sure you haven't tried your scheme, but I did.  It's slow.


Won't anything with a heavily contended global/giant lock be slow?

What's the average lock hold time per thread? 10%? 50%? 90%?

--
error compiling committee.c: too many arguments to function



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread KOSAKI Motohiro
 On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:
On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
 On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
 Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
 avoid most of the Nehalem (and other P2P cache arch) lock unfairness
 problem. (Probably creating pthread_condattr_setautoyield_np or a
 similar knob would be a good one.)
  
 Often, the thread calling pthread_cond_signal() wants to continue
 executing, not yield.

  Then, it doesn't work.

  After calling pthread_cond_signal(), T1 (the cond_signal caller) and T2
  (the woken thread) start to race for the GIL.  But T1 usually always
  wins, because the lock variable is in T1's cpu cache.  Why do kernel
  and userland get such different results?  One reason is that glibc
  doesn't have any ticket lock scheme.

  If you are interested in the GIL mess and its issues, please feel free
  to ask more.
  
I suggest looking into an explicit round-robin scheme, where each thread
adds itself to a queue and an unlock wakes up the first waiter.
 
  I'm sure you haven't tried your scheme, but I did.  It's slow.
 
 Won't anything with a heavily contended global/giant lock be slow?
 What's the average lock hold time per thread? 10%? 50%? 90%?

Well, of course anything heavily contended is slow.  But we don't have to
compare heavily contended with lightly contended; we have to compare
heavily contended with heavily contended, or lightly contended with
lightly contended.  If we are talking about a scripting language VM, the
pipe benchmark shows impressive overhead for a FIFO scheme like the one
you proposed, because the pipe bench creates a storm of frequent GIL
grab/ungrab cycles.  A similar pipe bench once exposed our (very) old
kernel's bottleneck.  Sadly, userspace has no way to implement per-cpu
runqueues, I think.

And, if we are talking about a language VM, I can't give any average hold
time.  It depends on the running script.
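
The pipe benchmark referred to here is essentially the classic ping-pong
microbenchmark; a minimal sketch of the idea (my reconstruction, not the
exact test), typically run under time(1):

#include <unistd.h>

int main(void)
{
	int a[2], b[2];
	char c = 'x';

	pipe(a);
	pipe(b);

	if (fork() == 0) {		/* child: echo each byte back */
		for (int i = 0; i < 100000; i++) {
			read(a[0], &c, 1);
			write(b[1], &c, 1);
		}
		_exit(0);
	}

	for (int i = 0; i < 100000; i++) {	/* parent: ping, wait for pong */
		write(a[1], &c, 1);
		read(b[0], &c, 1);
	}
	return 0;
}

In a GIL'd language VM each hop grabs and releases the global lock, so a
strict-FIFO lock turns every hop into a forced block/wakeup pair.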




Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Avi Kivity

On 01/05/2011 11:30 AM, KOSAKI Motohiro wrote:

  On 01/05/2011 10:40 AM, KOSAKI Motohiro wrote:
   On 01/05/2011 04:39 AM, KOSAKI Motohiro wrote:
  On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
  Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
  avoid most of the Nehalem (and other P2P cache arch) lock unfairness
  problem. (Probably creating pthread_condattr_setautoyield_np or a
  similar knob would be a good one.)
  
  Often, the thread calling pthread_cond_signal() wants to continue
  executing, not yield.
   
  Then, it doesn't work.
   
  After calling pthread_cond_signal(), T1 (the cond_signal caller) and T2
  (the woken thread) start to race for the GIL.  But T1 usually always
  wins, because the lock variable is in T1's cpu cache.  Why do kernel
  and userland get such different results?  One reason is that glibc
  doesn't have any ticket lock scheme.
   
  If you are interested in the GIL mess and its issues, please feel free
  to ask more.

   I suggest looking into an explicit round-robin scheme, where each
   thread adds itself to a queue and an unlock wakes up the first waiter.
  
 I'm sure you haven't tried your scheme, but I did.  It's slow.

  Won't anything with a heavily contended global/giant lock be slow?
  What's the average lock hold time per thread? 10%? 50%? 90%?

Well, of course anything heavily contended is slow.  But we don't have to
compare heavily contended with lightly contended; we have to compare
heavily contended with heavily contended, or lightly contended with
lightly contended.  If we are talking about a scripting language VM, the
pipe benchmark shows impressive overhead for a FIFO scheme like the one
you proposed, because the pipe bench creates a storm of frequent GIL
grab/ungrab cycles.  A similar pipe bench once exposed our (very) old
kernel's bottleneck.  Sadly, userspace has no way to implement per-cpu
runqueues, I think.


A completely fair lock will likely be slower than an unfair lock.


And, if we are talking about a language VM, I can't give any average hold
time.  It depends on the running script.


Pick some parallel, compute-intensive script, please.

--
error compiling committee.c: too many arguments to function



Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]

2011-01-05 Thread Peter Zijlstra
On Wed, 2011-01-05 at 00:38 +0100, Tommaso Cucinotta wrote:
 On 04/01/2011 19:15, Dario Faggioli wrote:
 
   Forwarded Message 
  From: Peter Zijlstra <a.p.zijls...@chello.nl>
  To: Rik van Riel <r...@redhat.com>
  Cc: Hillf Danton <dhi...@gmail.com>, kvm@vger.kernel.org,
  linux-ker...@vger.kernel.org, Avi Kivity <a...@redhat.com>, Srivatsa
  Vaddagiri <va...@linux.vnet.ibm.com>, Mike Galbraith <efa...@gmx.de>,
  Chris Wright <chr...@sous-sol.org>
  Subject: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
  Date: Tue, 04 Jan 2011 19:05:54 +0100
  RT guests don't make sense, there's nowhere near enough infrastructure
  for that to work well.
 
  I'd argue that KVM running with RT priority is a bug.
 Peter, can I ask why you stated that?  In the IRMOS project, we
 are just deploying KVM VMs using Fabio's real-time scheduler
 (for others: a.k.a. Fabio's EDF throttling patch, or the IRMOS RT
 scheduler)
 in order to let the VMs get precise CPU scheduling guarantees from the
 kernel. So, in this context we do have KVM running at RT priority, and
 we do have experimental results showing how this can improve stability
 of performance of the hosted guest VMs.
 Of course, don't misunderstand me: this is a necessary condition for a
 stable performance of KVM VMs, I'm not saying it is sufficient for

I was mostly referring to the existing RT cruft (SCHED_RR/FIFO), that's
utterly useless for KVM.

As to hosting vcpus with CBS, this might make sense, but RT guests
are still miles away. Anyway, I'm not quite sure how you would want to
deal with the guest spinlock issue in CBS, ideally you'd use paravirt
guests to avoid that whole problem.

Anyway, /me goes do something useful, virt sucks and should be taken out
back and shot in the head.




Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Peter Zijlstra
On Wed, 2011-01-05 at 11:39 +0900, KOSAKI Motohiro wrote:

 After calling pthread_cond_signal(), T1 (the cond_signal caller) and T2
 (the woken thread) start to race for the GIL.  But T1 usually always wins,
 because the lock variable is in T1's cpu cache.  Why do kernel and userland
 get such different results?  One reason is that glibc doesn't have any
 ticket lock scheme.

The problem with making locks strictly fair is that it sucks for
performance.  IIRC most futex ops are FIFO-ordered when they hit the
block path, but we do allow lock-stealing.

Lock-stealing greatly improves performance since it avoids lots of
block/wakeup cycles, but it does make things unfair.

I'm not sure we have a futex option to disable lock-stealing, nor do I
think you really want one; performance suffers really badly.

[This btw is the reason why people reported a performance improvement
when they wrapped all mmap() calls in a pthread_mutex: the
rwsem_down_write() path doesn't allow lock-stealing, since it needs
to be strictly fifo-fair.]
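
A minimal sketch of the distinction (my illustration; glibc's
futex-based mutex is far more involved): with lock-stealing, a running
thread can take a free lock with one atomic op, even if a just-woken
waiter was "next in line":

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lock_word;	/* 0 = free, 1 = held */

static bool try_steal(void)
{
	int expected = 0;

	/*
	 * A thread arriving while the lock happens to be free takes it
	 * immediately and skips the block/wakeup cycle entirely; the
	 * woken waiter must retry and may go straight back to sleep.
	 * Fast, but unfair.
	 */
	return atomic_compare_exchange_strong(&lock_word, &expected, 1);
}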


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Peter Zijlstra
On Wed, 2011-01-05 at 17:40 +0900, KOSAKI Motohiro wrote:
   If you are interesting GIL mess and issue, please feel free to ask more.
  
  I suggest looking into an explicit round-robin scheme, where each thread 
  adds itself to a queue and an unlock wakes up the first waiter.
 
 I'm sure you haven't tried your scheme, but I did.  It's slow.

Of course it is, but then your locking scheme (GIL) is the problem, not
the locking constructs.


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Mike Galbraith
It's Rik's patch to do with whatever he wants (I only donated a
suggestion ;), but I took the liberty of cleaning it up a bit, see below.
It's essentially the same, but fixes a bug where the caller may have
targeted a task in its thread group etc, but when preemption decision
time came, may have preempted a forbidden task.

On Tue, 2011-01-04 at 19:04 +0100, Peter Zijlstra wrote:
 On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:
   
  +out:
  +   double_rq_unlock(rq, p_rq);
  +   local_irq_restore(flags);
  +
  +   if (yield) {
  +   set_current_state(TASK_RUNNING);
  +   schedule();
  +   }
  +}
  +EXPORT_SYMBOL(yield_to);
 
 This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
 I'd make it so only kvm.o could use it. It really sucks that kvm is a
 module.

Did that.

  +#ifdef CONFIG_FAIR_GROUP_SCHED
  +	if (cfs_rq->tg != p_cfs_rq->tg)
  +		return 0;
  +
  +	/* Preemption only allowed within the same task group. */
  +	if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
  +		preempt = 0;
  +#endif
 
 I'd simply bail if it's not the same cgroup, who cares about that case
 anyway, all KVM vcpu threads should be in the same cgroup I think.

See below.

  +	/* Preemption only allowed within the same thread group. */
  +	if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
  +		preempt = 0;
 
 The calling site already checks for same_thread_group(), we never even
 get here if that's not the case.

Yeah, task group check was over there already as well.  Hasty hack.

  +#ifdef CONFIG_SMP
  +	/*
  +	 * If this yield is important enough to want to preempt instead
  +	 * of only dropping a ->next hint, we're alone, and the target
  +	 * is not alone, pull the target to this cpu.
  +	 */
  +	if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
  +			cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
  +		pull_task(task_rq(p), p, this_rq(), smp_processor_id());
 
 This only works by the grace that the caller checked p->se.on_rq.  A
 comment might be in order.

I moved all of the entity decisions to the class method, and tried to
make it a bit less ugly.


sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group,
task group, and sched class toward either its cpu, or potentially the
caller's own cpu if the 'preempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

---
 include/linux/sched.h |    1
 kernel/sched.c        |   56 ++++++++++++
 kernel/sched_fair.c   |   52 ++++++++++
 3 files changed, 109 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield 

Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Peter Zijlstra
On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:
 +	p_cfs_rq = cfs_rq_of(pse);
 +	local = 1;
 +	}
 +#endif
 +
 +	/* Tell the scheduler that we'd really like pse to run next. */
 +	p_cfs_rq->next = pse;
 +
 +	/* We know whether we want to preempt or not, but are we allowed? */
 +	preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
 +
 +	if (local)
 +		clear_buddies(cfs_rq, se);

You might want to clear before setting next :-)


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Avi Kivity

On 01/04/2011 08:04 PM, Peter Zijlstra wrote:

This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
I'd make it so only kvm.o could use it. It really sucks that kvm is a
module.


Why does it suck?  I mean apart from the "virtualization is crap" song.

--
error compiling committee.c: too many arguments to function



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Avi Kivity

On 01/05/2011 07:15 PM, Peter Zijlstra wrote:

On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
  On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
I'd make it so only kvm.o could use it. It really sucks that kvm is a
module.

   Why does it suck?  I mean apart from the "virtualization is crap" song.

Because it needs hooks all over the core kernel, like this yield_to()
stuff. Exporting this might lead to others wanting to use it too.


Well, it's very convenient for development (modprobe vs. reboot).  What 
hooks do you object to? mmu notifiers are useful for some drivers, sched 
notifiers are useful for cmwq and possibly perf?
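
For context, the sched/preempt notifier API being discussed, roughly as
KVM consumes it (registration is per task; the handler and function
names here are made up, the structs and calls are the real interface):

#include <linux/preempt.h>

static void my_sched_in(struct preempt_notifier *pn, int cpu)
{
	/* our task was just scheduled in on @cpu */
}

static void my_sched_out(struct preempt_notifier *pn,
			 struct task_struct *next)
{
	/* our task is being scheduled out in favour of @next */
}

static struct preempt_ops my_preempt_ops = {
	.sched_in	= my_sched_in,
	.sched_out	= my_sched_out,
};

static struct preempt_notifier pn;

static void hook_current_task(void)
{
	preempt_notifier_init(&pn, &my_preempt_ops);
	preempt_notifier_register(&pn);	/* hooks the *current* task */
}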


--
error compiling committee.c: too many arguments to function



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Mike Galbraith
On Wed, 2011-01-05 at 18:04 +0100, Peter Zijlstra wrote:
 On Wed, 2011-01-05 at 17:57 +0100, Mike Galbraith wrote:
  +	p_cfs_rq = cfs_rq_of(pse);
  +	local = 1;
  +	}
  +#endif
  +
  +	/* Tell the scheduler that we'd really like pse to run next. */
  +	p_cfs_rq->next = pse;
  +
  +	/* We know whether we want to preempt or not, but are we allowed? */
  +	preempt &= same_thread_group(p, task_of(p_cfs_rq->curr));
  +
  +	if (local)
  +		clear_buddies(cfs_rq, se);
 
 You might want to clear before setting next :-)

Heh, you made me thwap my forehead.. but pse != se or p_cfs_rq->curr.
Moved it back where it came from anyway ;-)


sched: Add yield_to(task, preempt) functionality.

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group,
task group, and sched class toward either its cpu, or potentially the
caller's own cpu if the 'preempt' argument is also passed.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

---
 include/linux/sched.h |    1
 kernel/sched.c        |   56 ++++++++++++
 kernel/sched_fair.c   |   52 ++++++++++
 3 files changed, 109 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	int (*yield_to_task) (struct task_struct *p, int preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5327,6 +5327,62 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	int yield = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yield) {
+		set_current_state(TASK_RUNNING);
+		schedule();
+	}
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1337,6 +1337,57 @@ static void yield_task_fair(struct rq *r
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+ struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+	struct sched_entity *se = &current->se;
+	struct sched_entity *pse = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+	int local = cfs_rq == p_cfs_rq;
+	int this_cpu = smp_processor_id();
+
+	if (!pse->on_rq)
+		return 0;
+
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 *
+	 * NOTE: the target may be alone in its 

Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Peter Zijlstra
On Wed, 2011-01-05 at 19:19 +0200, Avi Kivity wrote:
 On 01/05/2011 07:15 PM, Peter Zijlstra wrote:
  On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
  This definitely wants to be EXPORT_SYMBOL_GPL() and if it were 
   possible
  I'd make it so only kvm.o could use it. It really sucks that kvm is a
  module.
  
 Why does it suck?  I mean apart from the "virtualization is crap" song.
 
  Because it needs hooks all over the core kernel, like this yield_to()
  stuff. Exporting this might lead to others wanting to use it too.
 
 Well, it's very convenient for development (modprobe vs. reboot).

Get a machine that boots fast ;-), it might be because I've never worked
on anything modular, but typically stuff crashes when I got it wrong, in
which case you're forced to power-cycle anyway.

   What hooks do you object to? 

Don't really keep a list, but every now and again see something that
makes me think that ought not be exported.

 mmu notifiers are useful for some drivers, sched 
 notifiers are useful for cmwq and possibly perf?

Both use explicit function calls because it's easier to read the code and
faster.


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Avi Kivity

On 01/05/2011 07:28 PM, Peter Zijlstra wrote:

On Wed, 2011-01-05 at 19:19 +0200, Avi Kivity wrote:
  On 01/05/2011 07:15 PM, Peter Zijlstra wrote:
On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
   On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
  This definitely wants to be EXPORT_SYMBOL_GPL() and if it were 
possible
  I'd make it so only kvm.o could use it. It really sucks that kvm 
is a
  module.

   Why does it suck?  I mean apart from the "virtualization is crap" song.
  
Because it needs hooks all over the core kernel, like this yield_to()
stuff. Exporting this might lead to others wanting to use it too.

  Well, it's very convenient for development (modprobe vs. reboot).

Get a machine that boots fast ;-), it might be because I've never worked
on anything modular, but typically stuff crashes when I got it wrong, in
which case you're forced to power-cycle anyway.


I often get by with just guest crashes.  Or maybe it oopses, I look 
around, then reboot at my leisure.



What hooks do you object to?

Don't really keep a list, but every now and again see something that
makes me think that ought not be exported.

  mmu notifiers are useful for some drivers, sched
  notifiers are useful for cmwq and possibly perf?

Both use explicit function calls because it's easier to read the code and
faster.


I doubt it's measurable, especially for perf with the MSR tweaking it does.

Tejun, why did you end up not using preempt_notifiers in cmwq?

--
error compiling committee.c: too many arguments to function



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Peter Zijlstra
On Wed, 2011-01-05 at 19:35 +0200, Avi Kivity wrote:

 Tejun, why did you end up not using preempt_notifiers in cmwq?

Because I told him to use explicit function calls because that keeps the
code easier to read.





Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-05 Thread Peter Zijlstra
On Wed, 2011-01-05 at 19:10 +0200, Avi Kivity wrote:
 On 01/04/2011 08:04 PM, Peter Zijlstra wrote:
  This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
  I'd make it so only kvm.o could use it. It really sucks that kvm is a
  module.
 
 Why does it suck?  I mean apart from the "virtualization is crap" song.

Because it needs hooks all over the core kernel, like this yield_to()
stuff. Exporting this might lead to others wanting to use it too.


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Avi Kivity

On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:

Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
avoid most of the Nehalem (and other P2P cache arch) lock unfairness
problem. (Probably creating pthread_condattr_setautoyield_np or a
similar knob would be a good one.)


Often, the thread calling pthread_cond_signal() wants to continue 
executing, not yield.


--
error compiling committee.c: too many arguments to function



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Hillf Danton
On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel r...@redhat.com wrote:
 From: Mike Galbraith efa...@gmx.de

 Add a yield_to function to the scheduler code, allowing us to
 give enough of our timeslice to another thread to allow it to
 run and release whatever resource we need it to release.

 We may want to use this to provide a sys_yield_to system call
 one day.

 Signed-off-by: Rik van Riel r...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Not-signed-off-by: Mike Galbraith efa...@gmx.de

 ---
 Mike, want to change the above into a Signed-off-by: ? :)
 This code seems to work well.

 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index c5f926c..0b8a3e6 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1083,6 +1083,7 @@ struct sched_class {
        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
        void (*yield_task) (struct rq *rq);
 +       int (*yield_to_task) (struct task_struct *p, int preempt);

        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

 @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
  # define rt_mutex_adjust_pi(p)         do { } while (0)
  #endif

 +extern void yield_to(struct task_struct *p, int preempt);
  extern void set_user_nice(struct task_struct *p, long nice);
  extern int task_prio(const struct task_struct *p);
  extern int task_nice(const struct task_struct *p);
 diff --git a/kernel/sched.c b/kernel/sched.c
 index f8e5a25..ffa7a9d 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -6901,6 +6901,53 @@ void __sched yield(void)
  }
  EXPORT_SYMBOL(yield);

 +/**
 + * yield_to - yield the current processor to another thread in
 + * your thread group, or accelerate that thread toward the
 + * processor it's on.
 + *
 + * It's the caller's job to ensure that the target task struct
 + * can't go away on us before we can do any checks.
 + */
 +void __sched yield_to(struct task_struct *p, int preempt)
 +{
 +       struct task_struct *curr = current;
 +       struct rq *rq, *p_rq;
 +       unsigned long flags;
 +       int yield = 0;
 +
 +       local_irq_save(flags);
 +       rq = this_rq();
 +
 +again:
 +       p_rq = task_rq(p);
 +       double_rq_lock(rq, p_rq);
 +       while (task_rq(p) != p_rq) {
 +               double_rq_unlock(rq, p_rq);
 +               goto again;
 +       }
 +
 +       if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
 +                       !same_thread_group(p, curr) ||
 +                       !curr->sched_class->yield_to_task ||
 +                       curr->sched_class != p->sched_class) {
 +               goto out;
 +       }
 +
 +       yield = curr->sched_class->yield_to_task(p, preempt);
 +
 +out:
 +       double_rq_unlock(rq, p_rq);
 +       local_irq_restore(flags);
 +
 +       if (yield) {
 +               set_current_state(TASK_RUNNING);
 +               schedule();
 +       }
 +}
 +EXPORT_SYMBOL(yield_to);
 +
 +
  /*
   * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index 5119b08..3288e7c 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
  }

  #ifdef CONFIG_SMP
 +static void pull_task(struct rq *src_rq, struct task_struct *p,
 +                     struct rq *this_rq, int this_cpu);
 +#endif
 +
 +static int yield_to_task_fair(struct task_struct *p, int preempt)
 +{
 +       struct sched_entity *se = &current->se;
 +       struct sched_entity *pse = &p->se;
 +       struct sched_entity *curr = &(task_rq(p)->curr)->se;
 +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
 +       struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
 +       int yield = this_rq() == task_rq(p);
 +       int want_preempt = preempt;
 +
 +#ifdef CONFIG_FAIR_GROUP_SCHED
 +       if (cfs_rq->tg != p_cfs_rq->tg)
 +               return 0;
 +
 +       /* Preemption only allowed within the same task group. */
 +       if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
 +               preempt = 0;
 +#endif
 +       /* Preemption only allowed within the same thread group. */
 +       if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
 +               preempt = 0;
 +
 +#ifdef CONFIG_SMP
 +       /*
 +        * If this yield is important enough to want to preempt instead
 +        * of only dropping a ->next hint, we're alone, and the target
 +        * is not alone, pull the target to this cpu.
 +        */
 +       if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
 +                       cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
 +               pull_task(task_rq(p), p, this_rq(), smp_processor_id());
 +               p_cfs_rq = cfs_rq_of(pse);
 +               yield = 1;
 +       }
 +#endif
 +
 +       if 

Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Hillf Danton
On Tue, Jan 4, 2011 at 5:29 AM, Rik van Riel r...@redhat.com wrote:
 From: Mike Galbraith efa...@gmx.de

 Add a yield_to function to the scheduler code, allowing us to
 give enough of our timeslice to another thread to allow it to
 run and release whatever resource we need it to release.

 We may want to use this to provide a sys_yield_to system call
 one day.

 Signed-off-by: Rik van Riel r...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Not-signed-off-by: Mike Galbraith efa...@gmx.de

 ---
 Mike, want to change the above into a Signed-off-by: ? :)
 This code seems to work well.

 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index c5f926c..0b8a3e6 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1083,6 +1083,7 @@ struct sched_class {
        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
        void (*yield_task) (struct rq *rq);
 +       int (*yield_to_task) (struct task_struct *p, int preempt);

        void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

 @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
  # define rt_mutex_adjust_pi(p)         do { } while (0)
  #endif

 +extern void yield_to(struct task_struct *p, int preempt);
  extern void set_user_nice(struct task_struct *p, long nice);
  extern int task_prio(const struct task_struct *p);
  extern int task_nice(const struct task_struct *p);
 diff --git a/kernel/sched.c b/kernel/sched.c
 index f8e5a25..ffa7a9d 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -6901,6 +6901,53 @@ void __sched yield(void)
  }
  EXPORT_SYMBOL(yield);

 +/**
 + * yield_to - yield the current processor to another thread in
 + * your thread group, or accelerate that thread toward the
 + * processor it's on.
 + *
 + * It's the caller's job to ensure that the target task struct
 + * can't go away on us before we can do any checks.
 + */
 +void __sched yield_to(struct task_struct *p, int preempt)
 +{
 +       struct task_struct *curr = current;

struct task_struct *next;

 +       struct rq *rq, *p_rq;
 +       unsigned long flags;
 +       int yield = 0;
 +
 +       local_irq_save(flags);
 +       rq = this_rq();
 +
 +again:
 +       p_rq = task_rq(p);
 +       double_rq_lock(rq, p_rq);
 +       while (task_rq(p) != p_rq) {
 +               double_rq_unlock(rq, p_rq);
 +               goto again;
 +       }
 +
 +       if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
 +                       !same_thread_group(p, curr) ||

/*                       !curr->sched_class->yield_to_task ||*/

 +                       curr->sched_class != p->sched_class) {
 +               goto out;
 +       }
 +
	/*
	 * Ask the scheduler to compute the next task, for successfully
	 * kicking @p onto its CPU.
	 * What if p_rq's current task is in the rt_class?
	 */
	next = pick_next_task(p_rq);
	if (next != p)
		p->se.vruntime = next->se.vruntime - 1;
	deactivate_task(p_rq, p, 0);
	activate_task(p_rq, p, 0);
	if (rq == p_rq)
		schedule();
	else
		resched_task(p_rq->curr);
	yield = 0;

/*	yield = curr->sched_class->yield_to_task(p, preempt); */

 +
 +out:
 +       double_rq_unlock(rq, p_rq);
 +       local_irq_restore(flags);
 +
 +       if (yield) {
 +               set_current_state(TASK_RUNNING);
 +               schedule();
 +       }
 +}
 +EXPORT_SYMBOL(yield_to);
 +
 +
  /*
   * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index 5119b08..3288e7c 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
  }

  #ifdef CONFIG_SMP
 +static void pull_task(struct rq *src_rq, struct task_struct *p,
 +                     struct rq *this_rq, int this_cpu);
 +#endif
 +
 +static int yield_to_task_fair(struct task_struct *p, int preempt)
 +{
 +       struct sched_entity *se = &current->se;
 +       struct sched_entity *pse = &p->se;
 +       struct sched_entity *curr = &(task_rq(p)->curr)->se;
 +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
 +       struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
 +       int yield = this_rq() == task_rq(p);
 +       int want_preempt = preempt;
 +
 +#ifdef CONFIG_FAIR_GROUP_SCHED
 +       if (cfs_rq->tg != p_cfs_rq->tg)
 +               return 0;
 +
 +       /* Preemption only allowed within the same task group. */
 +       if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
 +               preempt = 0;
 +#endif
 +       /* Preemption only allowed within the same thread group. */
 +       if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
 +               preempt = 0;
 +
 +#ifdef CONFIG_SMP
 +       /*
 +        * If this 

Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Rik van Riel

On 01/04/2011 11:41 AM, Hillf Danton wrote:


/*   !curr->sched_class->yield_to_task ||*/


+   curr->sched_class != p->sched_class) {
+   goto out;
+   }
+

/*
 * Ask the scheduler to compute the next task, for successfully
 * kicking @p onto its CPU.
 * What if p_rq's current task is in the rt_class?
 */
next = pick_next_task(p_rq);
if (next != p)
	p->se.vruntime = next->se.vruntime - 1;
deactivate_task(p_rq, p, 0);
activate_task(p_rq, p, 0);
if (rq == p_rq)
	schedule();
else
	resched_task(p_rq->curr);
yield = 0;


Wouldn't that break for FIFO and RR tasks?

There's a reason all the scheduler folks wanted a
per-class yield_to_task function :)

--
All rights reversed


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Hillf Danton
On Wed, Jan 5, 2011 at 12:44 AM, Rik van Riel r...@redhat.com wrote:
 On 01/04/2011 11:41 AM, Hillf Danton wrote:

 /*                       !curr->sched_class->yield_to_task ||        */

 +                       curr->sched_class != p->sched_class) {
 +               goto out;
 +       }
 +

        /*
         * Ask the scheduler to compute the next task, for successfully
         * kicking @p onto its CPU.
         * What if p_rq's current task is in the rt_class?
         */
        next = pick_next_task(p_rq);
        if (next != p)
                p->se.vruntime = next->se.vruntime - 1;
        deactivate_task(p_rq, p, 0);
        activate_task(p_rq, p, 0);
        if (rq == p_rq)
                schedule();
        else
                resched_task(p_rq->curr);
        yield = 0;

 Wouldn't that break for FIFO and RR tasks?

 There's a reason all the scheduler folks wanted a
 per-class yield_to_task function :)


Where is the yield_to callback in the patch for the RT scheduling class?
If @p is RT, what could you do?

Hillf


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Rik van Riel

On 01/04/2011 11:51 AM, Hillf Danton wrote:


Wouldn't that break for FIFO and RR tasks?

There's a reason all the scheduler folks wanted a
per-class yield_to_task function :)



Where is the yield_to callback in the patch for RT schedule class?
If @p is RT, what could you do?


If the user chooses to overcommit the CPU with realtime
tasks, the user cannot expect realtime response.

For realtime, I have not implemented the yield_to callback
at all because it would probably break realtime semantics
and I assume people will not overcommit the CPU with realtime
tasks anyway.

I could see running a few realtime guests on a system, with
the number of realtime VCPUs not exceeding the number of
physical CPUs.

--
All rights reversed


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Hillf Danton
On Wed, Jan 5, 2011 at 12:54 AM, Rik van Riel r...@redhat.com wrote:
 On 01/04/2011 11:51 AM, Hillf Danton wrote:

 Wouldn't that break for FIFO and RR tasks?

 There's a reason all the scheduler folks wanted a
 per-class yield_to_task function :)


 Where is the yield_to callback in the patch for RT schedule class?
 If @p is RT, what could you do?

 If the user chooses to overcommit the CPU with realtime
 tasks, the user cannot expect realtime response.

 For realtime, I have not implemented the yield_to callback
 at all because it would probably break realtime semantics
 and I assume people will not overcommit the CPU with realtime
 tasks anyway.

 I could see running a few realtime guests on a system, with
 the number of realtime VCPUs not exceeding the number of
 physical CPUs.

Then it looks like curr->sched_class != p->sched_class is not enough,
and yield_to cannot ease the lock contention in KVM in the case where
p's rq->curr is RT.


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Peter Zijlstra
On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:
 Where is the yield_to callback in the patch for RT schedule class?
 If @p is RT, what could you do? 

RT guests are a pipe dream, you first need to get the hypervisor (kvm in
this case) to be RT, which it isn't. Then you either need to very
statically set-up the host and the guest scheduling constraints (not
possible with RR/FIFO) or have a complete paravirt RT scheduler which
communicates its requirements to the host.




Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Hillf Danton
On Wed, Jan 5, 2011 at 1:08 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote:
 On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:
 Where is the yield_to callback in the patch for RT schedule class?
 If @p is RT, what could you do?

 RT guests are a pipe dream, you first need to get the hypervisor (kvm in
 this case) to be RT, which it isn't. Then you either need to very
 statically set-up the host and the guest scheduling constraints (not
 possible with RR/FIFO) or have a complete paravirt RT scheduler which
 communicates its requirements to the host.

Even if the guest is not RT, you cannot prevent it from being preempted
by an RT task which has nothing to do with guests.


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Peter Zijlstra
On Tue, 2011-01-04 at 15:14 +0900, KOSAKI Motohiro wrote:
  From: Mike Galbraith efa...@gmx.de
  
  Add a yield_to function to the scheduler code, allowing us to
  give enough of our timeslice to another thread to allow it to
  run and release whatever resource we need it to release.
  
  We may want to use this to provide a sys_yield_to system call
  one day.
 
 At least I want one. Ruby has a GIL (giant interpreter lock), and a giant
 lock naturally forces an app to implement a cooperative multithreading
 model. Therefore it has a similar problem to yours. Python solved this
 issue by slowing down its lock mechanism (two pthread-cond wakeups on each
 GIL release), but I don't want that.
 
 Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
 avoid most of the Nehalem (and other P2P cache arch) lock unfairness
 problem. (Probably creating pthread_condattr_setautoyield_np or a similar
 knob would be a good one.)

NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it
is because it wants to be utter crap (run unmodified guests).

There are plenty of sane serialization primitives for userspace; fix your
locking mess instead of pushing crap.

The only reason I'm maybe half-way considering this is because it's a
pure in-kernel interface which we can 'fix' once unmodified guests
aren't important anymore.




Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Peter Zijlstra
On Wed, 2011-01-05 at 01:12 +0800, Hillf Danton wrote:
 On Wed, Jan 5, 2011 at 1:08 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote:
  On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:
  Where is the yield_to callback in the patch for RT schedule class?
  If @p is RT, what could you do?
 
  RT guests are a pipe dream, you first need to get the hypervisor (kvm in
  this case) to be RT, which it isn't. Then you either need to very
  statically set-up the host and the guest scheduling constraints (not
  possible with RR/FIFO) or have a complete paravirt RT scheduler which
  communicates its requirements to the host.
 
 Even if the guest is not RT, you cannot prevent it from being preempted
 by an RT task which has nothing to do with guests.

Sure, but yield_to() being a NOP for RT tasks is perfectly fine. Pretty
much all yield*() usage is a bug anyway.


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Rik van Riel

On 01/04/2011 12:08 PM, Peter Zijlstra wrote:

On Wed, 2011-01-05 at 00:51 +0800, Hillf Danton wrote:

Where is the yield_to callback in the patch for RT schedule class?
If @p is RT, what could you do?


RT guests are a pipe dream, you first need to get the hypervisor (kvm in
this case) to be RT, which it isn't. Then you either need to very
statically set-up the host and the guest scheduling constraints (not
possible with RR/FIFO) or have a complete paravirt RT scheduler which
communicates its requirements to the host.


There's a limited use case.

One host can have a few RT guests, say a host with 8 CPUs
can have up to 6 or 7 RT VCPUs.  Those guests get top
priority.

The host can then have some other, low priority, guests
that scavenge remaining CPU time.

In this case, no yield_to is required for RT guests,
because they do not do overcommit.

--
All rights reversed


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Peter Zijlstra
On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:
 From: Mike Galbraith efa...@gmx.de
 
 Add a yield_to function to the scheduler code, allowing us to
 give enough of our timeslice to another thread to allow it to
 run and release whatever resource we need it to release.
 
 We may want to use this to provide a sys_yield_to system call
 one day.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Not-signed-off-by: Mike Galbraith efa...@gmx.de
 
 --- 
 Mike, want to change the above into a Signed-off-by: ? :)
 This code seems to work well.
 
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index c5f926c..0b8a3e6 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1083,6 +1083,7 @@ struct sched_class {
   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
   void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
   void (*yield_task) (struct rq *rq);
 + int (*yield_to_task) (struct task_struct *p, int preempt);
  
  void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
  
 @@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
  # define rt_mutex_adjust_pi(p)   do { } while (0)
  #endif
  
 +extern void yield_to(struct task_struct *p, int preempt);
  extern void set_user_nice(struct task_struct *p, long nice);
  extern int task_prio(const struct task_struct *p);
  extern int task_nice(const struct task_struct *p);
 diff --git a/kernel/sched.c b/kernel/sched.c
 index f8e5a25..ffa7a9d 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -6901,6 +6901,53 @@ void __sched yield(void)
  }
  EXPORT_SYMBOL(yield);
  
 +/**
 + * yield_to - yield the current processor to another thread in
 + * your thread group, or accelerate that thread toward the
 + * processor it's on.
 + *
 + * It's the caller's job to ensure that the target task struct
 + * can't go away on us before we can do any checks.
 + */
 +void __sched yield_to(struct task_struct *p, int preempt)
 +{
 + struct task_struct *curr = current;
 + struct rq *rq, *p_rq;
 + unsigned long flags;
 + int yield = 0;
 +
 + local_irq_save(flags);
 + rq = this_rq();
 +
 +again:
 + p_rq = task_rq(p);
 + double_rq_lock(rq, p_rq);
 + while (task_rq(p) != p_rq) {
 + double_rq_unlock(rq, p_rq);
 + goto again;
 + }
 +
 + if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
 + !same_thread_group(p, curr) ||
 + !curr->sched_class->yield_to_task ||
 + curr->sched_class != p->sched_class) {
 + goto out;
 + }
 +
 + yield = curr->sched_class->yield_to_task(p, preempt);
 +
 +out:
 + double_rq_unlock(rq, p_rq);
 + local_irq_restore(flags);
 +
 + if (yield) {
 + set_current_state(TASK_RUNNING);
 + schedule();
 + }
 +}
 +EXPORT_SYMBOL(yield_to);

This definitely wants to be EXPORT_SYMBOL_GPL() and if it were possible
I'd make it so only kvm.o could use it. It really sucks that kvm is a
module.
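
For context, the export matters because the caller lives in a module.
A module-side user might look like the hypothetical sketch below (the
helper name is invented; get_task_struct()/put_task_struct() are the
stock refcounting helpers, and holding a reference is what satisfies
the docbook comment above):

/* Hypothetical kvm-side caller; only the refcounting pattern matters. */
static void kick_probable_lock_holder(struct task_struct *target)
{
	get_task_struct(target);	/* keep target alive across the call */
	yield_to(target, 1);		/* donate our slice, allow preemption */
	put_task_struct(target);
}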

  /*
   * This task is about to go to sleep on IO. Increment rq->nr_iowait so
   * that process accounting knows that this is a task in IO wait state.
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index 5119b08..3288e7c 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
  }
  
  #ifdef CONFIG_SMP
 +static void pull_task(struct rq *src_rq, struct task_struct *p,
 +   struct rq *this_rq, int this_cpu);
 +#endif
 +
 +static int yield_to_task_fair(struct task_struct *p, int preempt)
 +{
 + struct sched_entity *se = &current->se;
 + struct sched_entity *pse = &p->se;
 + struct sched_entity *curr = &(task_rq(p)->curr)->se;
 + struct cfs_rq *cfs_rq = cfs_rq_of(se);
 + struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
 + int yield = this_rq() == task_rq(p);
 + int want_preempt = preempt;
 +
 +#ifdef CONFIG_FAIR_GROUP_SCHED
 + if (cfs_rq->tg != p_cfs_rq->tg)
 + return 0;
 +
 + /* Preemption only allowed within the same task group. */
 + if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
 + preempt = 0;
 +#endif

I'd simply bail if it's not the same cgroup; who cares about that case
anyway, all KVM vcpu threads should be in the same cgroup I think.

 + /* Preemption only allowed within the same thread group. */
 + if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
 + preempt = 0;

The calling site already checks for same_thread_group(), we never even
get here if that's not the case.

 +#ifdef CONFIG_SMP
 + /*
 +  * If this yield is important enough to want to preempt instead
 +  * of only dropping a ->next hint, we're alone, and the target
 +  * is not alone, pull the target to this cpu.
 +  */
 + if (want_preempt && !yield && cfs_rq->nr_running == 1 &&

Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread Mike Galbraith
On Tue, 2011-01-04 at 19:04 +0100, Peter Zijlstra wrote:

  +   p_cfs_rq = cfs_rq_of(pse);
  +   yield = 1;
  +   }
  +#endif
  +
  +   if (yield)
  +   clear_buddies(cfs_rq, se);
  +   else if (preempt)
  +   clear_buddies(p_cfs_rq, curr);
  +
  +   /* Tell the scheduler that we'd really like pse to run next. */
  +   p_cfs_rq->next = pse;
  +
  +   if (!yield && preempt)
  +   resched_task(task_of(p_cfs_rq->curr));
 
 I don't get this.. Why would you resched the remote cpu, surely you
 didn't just pull its current task over..

:) hope not.   Caller bails on task_running(p_rq, p).  wants a comment
too I suppose.

Target doesn't live here, preempt is still set/allowed, so we want the
remote cpu to schedule.
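
Spelled out with those invariants as comments (a re-rendering of the
hunk under discussion; the comments paraphrase this exchange and are
not in the patch itself):

	/* yield_to() bailed earlier on task_running(p_rq, p), so p is
	 * never the remote cpu's current task; pull_task() above can
	 * only have moved p itself, never the remote rq's curr. */
	if (!yield && preempt)
		/* p stayed on its own cpu: poke whatever runs there so
		 * it reschedules and honours the ->next hint just set. */
		resched_task(task_of(p_cfs_rq->curr));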

-Mike



Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]

2011-01-04 Thread Tommaso Cucinotta

On 04/01/2011 19:15, Dario Faggioli wrote:


 Forwarded Message 
From: Peter Zijlstra a.p.zijls...@chello.nl
To: Rik van Riel r...@redhat.com
Cc: Hillf Danton dhi...@gmail.com, kvm@vger.kernel.org,
linux-ker...@vger.kernel.org, Avi Kivity a...@redhat.com, Srivatsa
Vaddagiri va...@linux.vnet.ibm.com, Mike Galbraith efa...@gmx.de,
Chris Wright chr...@sous-sol.org
Subject: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function
Date: Tue, 04 Jan 2011 19:05:54 +0100
RT guests don't make sense, there's nowhere near enough infrastructure
for that to work well.

I'd argue that KVM running with RT priority is a bug.

Peter, can I ask why you stated that? In the IRMOS project, we are
deploying KVM VMs using Fabio's real-time scheduler (for others:
a.k.a. Fabio's EDF throttling patch, or the IRMOS RT scheduler)
in order to let the VMs get precise CPU scheduling guarantees from the
kernel. So, in this context we do have KVM running at RT priority, and
we do have experimental results showing how this can improve the
stability of performance of the hosted guest VMs.
Of course, don't misunderstand me: this is a necessary condition for
stable performance of KVM VMs, I'm not saying it is sufficient for

--
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso



Re: [Fwd: Re: [RFC -v3 PATCH 2/3] sched: add yield_to function]

2011-01-04 Thread Tommaso Cucinotta

On 05/01/2011 00:38, Tommaso Cucinotta wrote:

Of course, don't misunderstand me: this is a necessary condition for
stable performance of KVM VMs, I'm not saying it is sufficient for


... it.

Please, comment on this (reply to all, please, I'm not following LKML).

Thanks,

Tommaso


Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread KOSAKI Motohiro
 On 01/04/2011 08:14 AM, KOSAKI Motohiro wrote:
  Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
  avoid most of the Nehalem (and other P2P cache arch) lock unfairness
  problem. (Probably creating pthread_condattr_setautoyield_np or a similar
  knob would be a good one.)
 
 Often, the thread calling pthread_cond_signal() wants to continue 
 executing, not yield.

Then it doesn't work.

After calling pthread_cond_signal(), T1 (the cond_signal caller) and T2
(the thread that was woken) start racing to grab the GIL. But T1 almost
always wins, because the lock variable is in T1's CPU cache. Why do the
kernel and userland get such different results? One reason is that
glibc doesn't have any ticket lock scheme.

If you are interested in the GIL mess, please feel free to ask for more.
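
For the unfamiliar, a minimal ticket lock sketch in C11 atomics
(illustrative only, not glibc code): each thread takes a ticket and
waiters are served strictly in arrival order, so a cache-hot releaser
cannot jump the queue ahead of older waiters.

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* spin; a real implementation would block */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
}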




Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-04 Thread KOSAKI Motohiro
 NAK NAK NAK, yield_to is utter crap, and the only reason kvm 'needs' it
 is because it wants to be utter crap (run unmodified guests).
 
 There are plenty of sane serialization primitives for userspace; fix your
 locking mess instead of pushing crap.
 
 The only reason I'm maybe half-way considering this is because it's a
 pure in-kernel interface which we can 'fix' once unmodified guests
 aren't important anymore.

KVM also has slow and inefficient mechanisms. Userspace has a lot of
slow primitives too.

But okay, I withdraw this claim, even though I disagree that userspace
has plenty of sane serialization. It's off topic and we need more
discussion in a proper place.

Thanks.





[RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-03 Thread Rik van Riel
From: Mike Galbraith efa...@gmx.de

Add a yield_to function to the scheduler code, allowing us to
give enough of our timeslice to another thread to allow it to
run and release whatever resource we need it to release.

We may want to use this to provide a sys_yield_to system call
one day.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Not-signed-off-by: Mike Galbraith efa...@gmx.de

--- 
Mike, want to change the above into a Signed-off-by: ? :)
This code seems to work well.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5f926c..0b8a3e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,6 +1083,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
+   int (*yield_to_task) (struct task_struct *p, int preempt);
 
 void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
@@ -1981,6 +1982,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p) do { } while (0)
 #endif
 
+extern void yield_to(struct task_struct *p, int preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index f8e5a25..ffa7a9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6901,6 +6901,53 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+   struct task_struct *curr = current;
+   struct rq *rq, *p_rq;
+   unsigned long flags;
+   int yield = 0;
+
+   local_irq_save(flags);
+   rq = this_rq();
+
+again:
+   p_rq = task_rq(p);
+   double_rq_lock(rq, p_rq);
+   while (task_rq(p) != p_rq) {
+   double_rq_unlock(rq, p_rq);
+   goto again;
+   }
+
+   if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
+   !same_thread_group(p, curr) ||
+   !curr->sched_class->yield_to_task ||
+   curr->sched_class != p->sched_class) {
+   goto out;
+   }
+
+   yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+   double_rq_unlock(rq, p_rq);
+   local_irq_restore(flags);
+
+   if (yield) {
+   set_current_state(TASK_RUNNING);
+   schedule();
+   }
+}
+EXPORT_SYMBOL(yield_to);
+
+
 /*
 * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..3288e7c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1119,6 +1119,61 @@ static void yield_task_fair(struct rq *rq)
 }
 
 #ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+ struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+   struct sched_entity *se = &current->se;
+   struct sched_entity *pse = &p->se;
+   struct sched_entity *curr = &(task_rq(p)->curr)->se;
+   struct cfs_rq *cfs_rq = cfs_rq_of(se);
+   struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+   int yield = this_rq() == task_rq(p);
+   int want_preempt = preempt;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+   if (cfs_rq->tg != p_cfs_rq->tg)
+   return 0;
+
+   /* Preemption only allowed within the same task group. */
+   if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
+   preempt = 0;
+#endif
+   /* Preemption only allowed within the same thread group. */
+   if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
+   preempt = 0;
+
+#ifdef CONFIG_SMP
+   /*
+* If this yield is important enough to want to preempt instead
+* of only dropping a ->next hint, we're alone, and the target
+* is not alone, pull the target to this cpu.
+*/
+   if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
+   cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
+   pull_task(task_rq(p), p, this_rq(), smp_processor_id());
+   p_cfs_rq = cfs_rq_of(pse);
+   yield = 1;
+   }
+#endif
+
+   if (yield)
+   clear_buddies(cfs_rq, se);
+   else if (preempt)
+   clear_buddies(p_cfs_rq, curr);
+
+   /* Tell the scheduler that we'd really like pse to run next. */
+   p_cfs_rq->next = pse;
+
+   if (!yield && preempt)
+   resched_task(task_of(p_cfs_rq->curr));

Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-03 Thread Mike Galbraith
On Mon, 2011-01-03 at 16:29 -0500, Rik van Riel wrote:
 From: Mike Galbraith efa...@gmx.de
 
 Add a yield_to function to the scheduler code, allowing us to
 give enough of our timeslice to another thread to allow it to
 run and release whatever resource we need it to release.
 
 We may want to use this to provide a sys_yield_to system call
 one day.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Mike Galbraith efa...@gmx.de
 
 --- 
 Mike, want to change the above into a Signed-off-by: ? :)
 This code seems to work well.

Done.

-Mike



Re: [RFC -v3 PATCH 2/3] sched: add yield_to function

2011-01-03 Thread KOSAKI Motohiro
 From: Mike Galbraith efa...@gmx.de
 
 Add a yield_to function to the scheduler code, allowing us to
 give enough of our timeslice to another thread to allow it to
 run and release whatever resource we need it to release.
 
 We may want to use this to provide a sys_yield_to system call
 one day.

At least I want one. Ruby has a GIL (giant interpreter lock), and a giant
lock naturally forces an app to implement a cooperative multithreading
model. Therefore it has a similar problem to yours. Python solved this
issue by slowing down its lock mechanism (two pthread-cond wakeups on each
GIL release), but I don't want that.

Also, if pthread_cond_signal() called sys_yield_to implicitly, we could
avoid most of the Nehalem (and other P2P cache arch) lock unfairness
problem. (Probably creating pthread_condattr_setautoyield_np or a similar
knob would be a good one.)
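
To make that concrete, here is how such a knob might look from the
application side. Note that pthread_condattr_setautoyield_np does not
exist; the name and semantics are purely the suggestion above:

#include <pthread.h>

static pthread_cond_t cond;

static void init_autoyield_cond(void)
{
	pthread_condattr_t attr;

	pthread_condattr_init(&attr);
	/* Hypothetical knob: make the signalling thread implicitly
	 * sys_yield_to() the waiter it just woke. */
	pthread_condattr_setautoyield_np(&attr, 1);
	pthread_cond_init(&cond, &attr);
	pthread_condattr_destroy(&attr);
}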


