Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
On 10/22/07, Paul Jackson [EMAIL PROTECTED] wrote:
> Steven wrote:
> > +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
> > +{
> > +	cpu_set(cpu, task_cs(tsk)->rt_overload);
> > +}
>
> Question for Steven: What locks are held when cpuset_rt_set_overload()
> is called?
>
> Questions for Paul Menage: Does 'tsk' need to be locked for the above
> task_cs() call?

Cgroups doesn't change the locking rules for accessing a cpuset from a
task - you have to have one of:

- task_lock(task)
- callback_mutex
- be in an RCU section from the point when you call task_cs to the
  point when you stop using its result. (Additionally, in this case
  there's no guarantee that the task stays in this cpuset for the
  duration of the RCU section).

Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-rt-users"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
Paul M wrote: Cgroups doesn't change the locking rules for accessing a cpuset from a task - you have to have one of: Good - could you comment task_cs() with this explanation? The rules are derived from the cpuset rules, as you explain, and as I suspected, but now task_cs() is the most popular way to access a tasks cpuset from code within kernel/cpuset.c, and that's code you added. The reason that I started this subthread is that I didn't see any immediate evidence that the RT code was honoring this locking, and I suspected that I clear comment over task_cs() could have clarified that for them. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson [EMAIL PROTECTED] 1.925.600.0401 - To unsubscribe from this list: send the line unsubscribe linux-rt-users in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
On Mon, 22 Oct 2007, Paul Menage wrote:
> On 10/22/07, Paul Jackson [EMAIL PROTECTED] wrote:
> > Steven wrote:
> > > +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
> > > +{
> > > +	cpu_set(cpu, task_cs(tsk)->rt_overload);
> > > +}
> >
> > Question for Steven: What locks are held when cpuset_rt_set_overload()
> > is called?

Right now only the rq lock. I know it's not enough. This needs to be
fixed. These are called in the heart of the scheduler so...

> > Questions for Paul Menage: Does 'tsk' need to be locked for the above
> > task_cs() call?
>
> Cgroups doesn't change the locking rules for accessing a cpuset from a
> task - you have to have one of:
>
> - task_lock(task)

I'm not sure of the lock nesting between task_lock(task) and rq->lock.
If this nesting doesn't yet exist, then I can add code to do the
task_lock. But if the rq->lock is taken within the task_lock somewhere,
then this can't be used.

> - callback_mutex

Can schedule, so it's definitely out.

> - be in an RCU section from the point when you call task_cs to the
>   point when you stop using its result. (Additionally, in this case
>   there's no guarantee that the task stays in this cpuset for the
>   duration of the RCU section).

This may also be an option. Although with interrupts disabled for the
entire time the cpusets are used, that should keep RCU grace periods
from moving forward.

But let me give you two some background on what I'm trying to solve,
and then I'll get to where cpusets come in.

Currently the mainline vanilla kernel does not handle migrating RT
tasks well. And this problem also exists (but not as badly) in the -rt
patch. When an RT task is queued on a CPU that is running an even
higher priority RT task, it should be pushed off to another CPU that is
running a lower priority task. But this does not always happen, and an
RT task may take several milliseconds before it gets a chance to run.
This latency is not acceptable for RT tasks.

So I added logic to push and pull RT tasks to and from CPUs.
The push happens when an RT task wakes up and can't preempt the RT task
running on the same CPU, or when a lower priority RT task is preempted
by a higher one. The lower one may be pushed to another CPU.

This alone is not enough to cover RT migration. We also need to pull RT
tasks to a CPU if that CPU is lowering its priority (a high priority RT
task has just gone to sleep).

Ideally, and when CONFIG_CPUSETS is not defined, I keep track of all
CPUs that have more than one RT task queued to run on them. This I call
an RT overload. There's an RT overload bitmask that keeps track of the
CPUs that have more than one RT task queued. When a CPU stops running a
high priority RT task, a search is made of all the CPUs that are in the
RT overload state, to see if there exists an RT task that can migrate
to the CPU that is lowering its priority.

Ingo Molnar and Peter Zijlstra pointed out to me that this global
cpumask would kill performance on 64 CPU boxes due to cacheline
bouncing. To solve this issue, I placed the RT overload mask into the
cpusets.

Ideally, the RT overload mask would keep track of all CPUs that have
tasks that can run on a given CPU. In other words, a cpuset from the
point of view of the CPU (or runqueue). But this is overkill for the RT
migration code. Large # CPU boxes probably don't have the RT balancing
problems that small # CPU boxes have, since the CPU resource to run RT
tasks on is greater. Using cpusets seemed to be a nice place to add
functionality to keep the RT overload code from crippling large # CPU
boxes.

The RT overload mask is bound to the CPU (runqueue) and not to the
task. But to get to the cpuset, I needed to go through the task. The
task I used was whatever was currently running on the given CPU, or
sometimes the task that was being pushed to a CPU. Due to overlapping
cpusets, there can be inconsistencies between the RT overload mask and
the actual CPUs that are in the overload state.
This is tolerated, as the more CPUs you have, the less of a problem it
should be to have overloaded CPUs. Remember, the push doesn't use the
overload mask. Only the pull does, and that happens when a push didn't
succeed. With more CPUs, pushes are more likely to succeed.

So the switch to use cpusets was more to keep the RT balancing code
from hurting large SMP boxes than for actually being correct on those
boxes. The RT balancing is much more important when the CPU resource is
limited. The cpusets were picked just because it seemed reasonable that
most of the time the cpuset of one task on a runqueue would equal that
of another task on the same runqueue. But the code is good enough if
that's not the case. My code can handle inconsistencies between the RT
overload mask and the actual overloaded CPUs.

So what I really need to protect, with regards to cpusets, is them
disappearing and causing an oops. Whether or not a task comes and goes
from a cpuset is not the important part. The RT balance code only uses
the cpuset to determine what other RT tasks are
[PATCH -v2 4/7] RT overloaded runqueues accounting
This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded. That is, it
is a candidate to have RT tasks pulled from it.

If CONFIG_CPUSETS is defined, then an RT overloaded CPU bitmask is
created in the cpusets. The algorithm for pulling tasks is limited to
the cpuset of the current task on the runqueue. Because of overlapping
cpusets, it is possible that the bitmask may get out of sync with
actual overloaded RT runqueues. But it won't cause any real harm. The
worst that can happen is that an RT task may not migrate to a CPU that
it can run on when it could. But that's an OK price to pay to keep the
accounting simple and not kill the cache on large SMP boxes.

When CONFIG_CPUSETS is not set, then a single RT overload CPU mask is
used.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]

---
 include/linux/cpuset.h |    6 
 kernel/cpuset.c        |   62 +
 kernel/sched_rt.c      |   29 ++
 3 files changed, 97 insertions(+)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:31:59.0 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:32:18.0 -0400
@@ -3,6 +3,31 @@
  * policies)
  */
 
+#ifdef CONFIG_CPUSETS
+# define rt_overload(p)			cpuset_rt_overload(p)
+# define rt_set_overload(p, cpu)	cpuset_rt_set_overload(p, cpu)
+# define rt_clear_overload(p, cpu)	cpuset_rt_clear_overload(p, cpu)
+# define rt_overloaded(p)		cpuset_rt_overloaded(p)
+#else
+static cpumask_t rt_overload_mask;
+static inline int rt_overloaded(struct task_struct *p)
+{
+	return !cpus_empty(rt_overload_mask);
+}
+static inline cpumask_t rt_overload(struct task_struct *p)
+{
+	return rt_overload_mask;
+}
+static inline void rt_set_overload(struct task_struct *p, int cpu)
+{
+	cpu_set(cpu, rt_overload_mask);
+}
+static inline void rt_clear_overload(struct task_struct *p, int cpu)
+{
+	cpu_clear(cpu, rt_overload_mask);
+}
+#endif
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -32,6 +57,8 @@ static inline void inc_rt_tasks(struct t
 #ifdef CONFIG_SMP
 	if (p->prio < rq->rt.highest_prio)
 		rq->rt.highest_prio = p->prio;
+	if (rq->rt.rt_nr_running > 1)
+		rt_set_overload(p, rq->cpu);
 #endif /* CONFIG_SMP */
 }
 
@@ -53,6 +80,8 @@ static inline void dec_rt_tasks(struct t
 		} /* otherwise leave rq->highest prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
+	if (rq->rt.rt_nr_running < 2)
+		rt_clear_overload(p, rq->cpu);
 #endif /* CONFIG_SMP */
 }
 
Index: linux-test.git/include/linux/cpuset.h
===================================================================
--- linux-test.git.orig/include/linux/cpuset.h	2007-10-22 22:18:53.0 -0400
+++ linux-test.git/include/linux/cpuset.h	2007-10-22 22:32:18.0 -0400
@@ -78,6 +78,12 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+/* The cpuset_rt_overload code is only used when CONFIG_CPUSETS is defined */
+extern int cpuset_rt_overloaded(struct task_struct *tsk);
+extern void cpuset_rt_set_overload(struct task_struct *tsk, int cpu);
+extern cpumask_t cpuset_rt_overload(struct task_struct *tsk);
+extern void cpuset_rt_clear_overload(struct task_struct *tsk, int cpu);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
Index: linux-test.git/kernel/cpuset.c
===================================================================
--- linux-test.git.orig/kernel/cpuset.c	2007-10-22 22:18:53.0 -0400
+++ linux-test.git/kernel/cpuset.c	2007-10-22 22:36:29.0 -0400
@@ -84,6 +84,9 @@ struct cpuset {
 	cpumask_t cpus_allowed;		/* CPUs allowed to tasks in cpuset */
 	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
 
+	/* bits protected by rq locks. */
+	cpumask_t rt_overload;		/* runqueue overload mask */
+
 	struct cpuset *parent;		/* my parent */
 
 	/*
@@ -179,6 +182,7 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.rt_overload = CPU_MASK_NONE,
 };
 
 /*
@@ -1566,6 +1570,7 @@ static void cpuset_post_clone(struct cgr
 	cs->mems_allowed = parent_cs->mems_allowed;
 	cs->cpus_allowed = parent_cs->cpus_allowed;
+	cs->rt_overload = parent_cs->rt_overload;
 	return;
 }
 
@@ -1604,6 +1609,7 @@ static struct cgroup_subsys_state *cpuse
 	set_bit(CS_SCHED_LOAD_BALANCE, cs->flags);
 	cs->cpus_allowed = CPU_MASK_NONE;
 	cs->mems_allowed = NODE_MASK_NONE;
+	cs->rt_overload = CPU_MASK_NONE;
Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
Steven wrote:
> +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
> +{
> +	cpu_set(cpu, task_cs(tsk)->rt_overload);
> +}

Question for Steven: What locks are held when cpuset_rt_set_overload()
is called?

Questions for Paul Menage: Does 'tsk' need to be locked for the above
task_cs() call?

Background concern -- one of the things that I like to think has
allowed cpusets to be useful to others is the careful documentation of
its locking requirements. I hope that the cgroup infrastructure, and
the portions of the cpuset code, such as this task_cs() call, that were
adapted to work with cgroups, have their locking needs documented as
well.

I suspect that there is still some presence of stale cpuset locking
comments, and perhaps an absence of complete comments on new cgroup
locking needs. My recollection is that this is already on your todo
list -- I'm just being annoying here ;).

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson [EMAIL PROTECTED] 1.925.600.0401