Re: [PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-23 Thread Paul Menage
On 10/22/07, Paul Jackson [EMAIL PROTECTED] wrote:
 Steven wrote:
  +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
  +{
  + cpu_set(cpu, task_cs(tsk)->rt_overload);
  +}

 Question for Steven:

   What locks are held when cpuset_rt_set_overload() is called?

 Questions for Paul Menage:

   Does 'tsk' need to be locked for the above task_cs() call?

Cgroups doesn't change the locking rules for accessing a cpuset from a
task - you have to have one of:

- task_lock(task)

- callback_mutex

- be in an RCU section from the point when you call task_cs to the
point when you stop using its result. (Additionally, in this case
there's no guarantee that the task stays in this cpuset for the
duration of the RCU section).
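
For illustration, a minimal read-side sketch of the third option (not
from the patch itself; tsk and cpu as in the code quoted above):

	rcu_read_lock();
	/* the task_cs() result is only safe to use inside this section */
	cpu_set(cpu, task_cs(tsk)->rt_overload);
	rcu_read_unlock();
	/* tsk may belong to a different cpuset from here on */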

Paul


Re: [PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-23 Thread Paul Jackson
Paul M wrote:
 Cgroups doesn't change the locking rules for accessing a cpuset from a
 task - you have to have one of:

Good - could you comment task_cs() with this explanation?

The rules are derived from the cpuset rules, as you explain,
and as I suspected, but now task_cs() is the most popular
way to access a task's cpuset from code within kernel/cpuset.c,
and that's code you added.

The reason that I started this subthread is that I didn't see
any immediate evidence that the RT code was honoring this locking,
and I suspected that a clear comment over task_cs() could have
clarified that for them.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.925.600.0401


Re: [PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-23 Thread Steven Rostedt

On Mon, 22 Oct 2007, Paul Menage wrote:

 On 10/22/07, Paul Jackson [EMAIL PROTECTED] wrote:
  Steven wrote:
   +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
   +{
   + cpu_set(cpu, task_cs(tsk)->rt_overload);
   +}
 
  Question for Steven:
 
What locks are held when cpuset_rt_set_overload() is called?

Right now only the rq lock. I know it's not enough. This needs to be
fixed. These are called in the heart of the scheduler so...

 
  Questions for Paul Menage:
 
Does 'tsk' need to be locked for the above task_cs() call?

 Cgroups doesn't change the locking rules for accessing a cpuset from a
 task - you have to have one of:

 - task_lock(task)

I'm not sure of the lock nesting between task_lock(task) and rq->lock.
If this nesting doesn't yet exist, then I can add code to do the
task_lock.  But if rq->lock is taken within the task_lock somewhere,
then this can't be used.
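
If the nesting were legal, the call site would presumably look
something like this (hypothetical sketch, valid only if no path takes
rq->lock while already holding task_lock()):

	spin_lock_irqsave(&rq->lock, flags);	/* scheduler side */
	task_lock(p);				/* nested: takes p->alloc_lock */
	cpu_set(cpu, task_cs(p)->rt_overload);
	task_unlock(p);
	spin_unlock_irqrestore(&rq->lock, flags);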


 - callback_mutex

Can schedule, so it's definitely out.


 - be in an RCU section from the point when you call task_cs to the
 point when you stop using its result. (Additionally, in this case
 there's no guarantee that the task stays in this cpuset for the
 duration of the RCU section).

This may also be an option. And since interrupts are disabled for the
entire time the cpusets are used, that alone should keep RCU grace
periods from moving forward.

But let me give you two some background to what I'm trying to solve. And
then I'll get to where cpusets come in.

Currently the mainline vanilla kernel does not handle migrating RT tasks
well. And this problem also exists (but not as badly) in the -rt patch.
When an RT task is queued on a CPU that is running an even higher priority
RT task, it should be pushed off to another CPU that is running a lower
priority task. But this does not always happen, and an RT task may take
several milliseconds before it gets a chance to run.

This latency is not acceptable for RT tasks. So I added logic to push and
pull RT tasks to and from CPUs.  The push happens when an RT task wakes up
and can't preempt the RT task running on the same CPU, or when a lower
priority RT task is preempted by a higher priority one. The lower priority
task may be pushed to another CPU.
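
Roughly, the push side does something like this (pseudo-C sketch; the
helpers marked hypothetical are illustrative, not the patch code):

	static void push_rt_task_sketch(struct rq *rq)
	{
		/* hypothetical: next queued (not running) RT task on rq */
		struct task_struct *p = pick_next_pushable_rt(rq);
		int dst;

		if (!p)
			return;
		dst = find_lowest_prio_cpu(p);	/* hypothetical */
		/* lower prio value means higher priority */
		if (dst >= 0 && cpu_rq(dst)->rt.highest_prio > p->prio)
			migrate_rt_task(p, dst);	/* hypothetical */
	}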

This alone is not enough to cover RT migration. We also need to pull RT
tasks to a CPU if that CPU is lowering its priority (a high priority RT
task has just gone to sleep).

Ideally, and when CONFIG_CPUSETS is not defined, I keep track of all CPUs
that have more than one RT task queued to run on them.  This I call an RT
overload.  There's an RT overload bitmask that keeps track of the CPUs
that have more than one RT task queued.  When a CPU stops running a high
priority RT task, a search is made of all the CPUs that are in the RT
overload state to see if there exists an RT task that can migrate to the
CPU that is lowering its priority.
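
In sketch form (again with hypothetical helpers, not the patch code),
the pull side looks like:

	static void pull_rt_tasks_sketch(struct rq *this_rq)
	{
		cpumask_t mask;
		int cpu;

		if (!rt_overloaded(this_rq->curr))
			return;
		mask = rt_overload(this_rq->curr);

		for_each_cpu_mask(cpu, mask) {
			struct task_struct *p;

			if (cpu == this_rq->cpu)
				continue;
			/* hypothetical: highest queued RT task on cpu
			 * that is allowed to run on this_rq->cpu */
			p = pick_highest_queued_rt(cpu_rq(cpu), this_rq->cpu);
			if (p && p->prio < this_rq->rt.highest_prio)
				migrate_rt_task(p, this_rq->cpu);
		}
	}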

Ingo Molnar and Peter Zijlstra pointed out to me that this global cpumask
would kill performance on 64 CPU boxes due to cacheline bouncing. To
solve this issue, I placed the RT overload mask into the cpusets.

Ideally, the RT overload mask would keep track of all CPUs that have tasks
that can run on a given CPU. In other words, a cpuset from the point of
view of the CPU (or runqueue).  But this is overkill for the RT migration
code.  Boxes with many CPUs probably don't have the RT balancing problems
that boxes with few CPUs have, since there are more CPUs available to run
RT tasks.

The cpusets seemed a nice place to add this functionality, to keep the
RT overload code from crippling boxes with many CPUs.

The RT overload mask is bound to the CPU (runqueue) and not to the task.
But to get to the cpuset, I needed to go through the task. The task I used
was whatever was currently running on the given CPU, or sometimes the task
that was being pushed to a CPU.

Due to overlapping cpusets, there can be inconsistencies between the RT
overload mask and the actual CPUs that are in the overload state. This is
tolerated, as the more CPUs you have, the less of a problem it should be
to have overloaded CPUs. Remember, the push logic doesn't use the overload
mask. Only the pull does, and that happens when a push didn't succeed.
With more CPUs, pushes are more likely to succeed.

So the switch to cpusets was more to keep the RT balancing code from
hurting large SMP boxes than to be strictly correct on those boxes.
The RT balancing is much more important when the CPU resource is limited.
The cpusets were picked just because it seemed reasonable that most of the
time the cpuset of one task on a runqueue would equal that of another task
on the same runqueue. But the code is good enough even if that's not the
case.

My code can handle inconsistencies between the RT overload mask and the
actual overloaded CPUs. So what I really need to protect, with regard to
cpusets, is against them disappearing and causing an oops.  Whether or not
a task comes and goes from a cpuset is not the important part.  The RT
balance code only uses the cpuset to determine what other RT tasks are

[PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-22 Thread Steven Rostedt
This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded. That is, it is
a candidate to have RT tasks pulled from it.

If CONFIG_CPUSETS is defined, then an RT overloaded CPU bitmask is created
in the cpusets.  The algorithm for pulling tasks is limited to the cpuset
of the current task on the runqueue. Because of overlapping cpusets, it is
possible that the bitmask may get out of sync with the actual overloaded RT
runqueues. But it won't cause any real harm. The worst that can happen is
that an RT task may not migrate to a CPU that it can run on when it could.
But that's an OK price to pay to keep the accounting simple and not kill
the cache on large SMP boxes.

When CONFIG_CPUSETS is not set, then a single RT overload CPU mask is used.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---
 include/linux/cpuset.h |    6
 kernel/cpuset.c        |   62 +
 kernel/sched_rt.c      |   29 ++
 3 files changed, 97 insertions(+)

Index: linux-test.git/kernel/sched_rt.c
===
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:31:59.0 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:32:18.0 -0400
@@ -3,6 +3,31 @@
  * policies)
  */
 
+#ifdef CONFIG_CPUSETS
+# define rt_overload(p) cpuset_rt_overload(p)
+# define rt_set_overload(p, cpu) cpuset_rt_set_overload(p, cpu)
+# define rt_clear_overload(p, cpu) cpuset_rt_clear_overload(p, cpu)
+# define rt_overloaded(p) cpuset_rt_overloaded(p)
+#else
+static cpumask_t rt_overload_mask;
+static inline int rt_overloaded(struct task_struct *p)
+{
+   return !cpus_empty(rt_overload_mask);
+}
+static inline cpumask_t rt_overload(struct task_struct *p)
+{
+   return rt_overload_mask;
+}
+static inline void rt_set_overload(struct task_struct *p, int cpu)
+{
+   cpu_set(cpu, rt_overload_mask);
+}
+static inline void rt_clear_overload(struct task_struct *p, int cpu)
+{
+   cpu_clear(cpu, rt_overload_mask);
+}
+#endif
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -32,6 +57,8 @@ static inline void inc_rt_tasks(struct t
 #ifdef CONFIG_SMP
 	if (p->prio < rq->rt.highest_prio)
 		rq->rt.highest_prio = p->prio;
+	if (rq->rt.rt_nr_running > 1)
+		rt_set_overload(p, rq->cpu);
 #endif /* CONFIG_SMP */
 }
 
@@ -53,6 +80,8 @@ static inline void dec_rt_tasks(struct t
 		} /* otherwise leave rq->highest_prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
+	if (rq->rt.rt_nr_running < 2)
+		rt_clear_overload(p, rq->cpu);
 #endif /* CONFIG_SMP */
 }
 
Index: linux-test.git/include/linux/cpuset.h
===
--- linux-test.git.orig/include/linux/cpuset.h	2007-10-22 22:18:53.0 -0400
+++ linux-test.git/include/linux/cpuset.h	2007-10-22 22:32:18.0 -0400
@@ -78,6 +78,12 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+/* The cpuset_rt_overload code is only used when CONFIG_CPUSETS is defined */
+extern int cpuset_rt_overloaded(struct task_struct *tsk);
+extern void cpuset_rt_set_overload(struct task_struct *tsk, int cpu);
+extern cpumask_t cpuset_rt_overload(struct task_struct *tsk);
+extern void cpuset_rt_clear_overload(struct task_struct *tsk, int cpu);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
Index: linux-test.git/kernel/cpuset.c
===
--- linux-test.git.orig/kernel/cpuset.c 2007-10-22 22:18:53.0 -0400
+++ linux-test.git/kernel/cpuset.c  2007-10-22 22:36:29.0 -0400
@@ -84,6 +84,9 @@ struct cpuset {
cpumask_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
nodemask_t mems_allowed;/* Memory Nodes allowed to tasks */
 
+   /* bits protected by rq locks. */
+   cpumask_t rt_overload;  /* runqueue overload mask */
+
struct cpuset *parent;  /* my parent */
 
/*
@@ -179,6 +182,7 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
.cpus_allowed = CPU_MASK_ALL,
.mems_allowed = NODE_MASK_ALL,
+   .rt_overload = CPU_MASK_NONE,
 };
 
 /*
@@ -1566,6 +1570,7 @@ static void cpuset_post_clone(struct cgr
 
 	cs->mems_allowed = parent_cs->mems_allowed;
 	cs->cpus_allowed = parent_cs->cpus_allowed;
+	cs->rt_overload = parent_cs->rt_overload;
return;
 }
 
@@ -1604,6 +1609,7 @@ static struct cgroup_subsys_state *cpuse
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cs->cpus_allowed = CPU_MASK_NONE;
 	cs->mems_allowed = NODE_MASK_NONE;
+	cs->rt_overload = CPU_MASK_NONE;
  

Re: [PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-22 Thread Paul Jackson
Steven wrote:
 +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
 +{
 + cpu_set(cpu, task_cs(tsk)->rt_overload);
 +}

Question for Steven:

  What locks are held when cpuset_rt_set_overload() is called?

Questions for Paul Menage:

  Does 'tsk' need to be locked for the above task_cs() call?

  Background concern -- one of the things that I like to think has
  allowed cpusets to be useful to others is the careful documentation
  of its locking requirements.  I hope that the cgroup infrastructure,
  and the portions of the cpuset code, such as this task_cs() call,
  that were adapted to work with cgroups, have their locking needs
  documented as well.  I suspect that there is still some presence of
  stale cpuset locking comments, and perhaps an absence of complete
  comments on new cgroup locking needs.  My recollection is that this
  is already on your todo list -- I'm just being annoying here ;).

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.925.600.0401