Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets (v0.2)
> Is the above code equivalent to what the comment states:
>
>	if (is_cpu_isolated(trial) <= is_cpu_exclusive(trial))
>		return -EINVAL;

I think I got that backwards.  How about:

	/* An isolated cpuset has to be exclusive */
	if (!(is_cpu_isolated(trial) <= is_cpu_exclusive(trial)))
		return -EINVAL;

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets (v0.2)
Dinakar's patch contains:

+	/* Make the change */
+	par->cpus_allowed = t.cpus_allowed;
+	par->isolated_map = t.isolated_map;

Doesn't the above make changes to the parent cpus_allowed without
calling validate_change()?

Couldn't we do nasty things like empty that cpus_allowed, leaving
tasks in that cpuset starved (or testing the last chance code that
scans up the cpuset hierarchy looking for a non-empty cpus_allowed)?

What prevents all the immediate children of the top cpuset from using
up all the cpus as isolated cpus, leaving the top cpuset cpus_allowed
empty, which fails even that last chance check, falling through to the
really, really last chance code that allows any online cpu to tasks in
that cpuset?

These questions are in addition to my earlier question: why don't you
need to propagate this change upward to the parent's cpus_allowed and
isolated_map?  If a parent's isolated_map grows (or shrinks), doesn't
that affect every ancestor, all the way to the top cpuset?

I am unable to tell, just from code reading, whether this code has
adequately worked through the details involved in properly handling
nested changes.

I am unable to build or test this on ia64, because you have code such
as the rebuild_sched_domains() routine in the '#else' half of a very
large "#ifdef ARCH_HAS_SCHED_DOMAIN ... #else ... #endif" section of
kernel/sched.c, and the ia64 arch (and only that arch, so far as I
know) defines ARCH_HAS_SCHED_DOMAIN, so doesn't see this '#else' half.

+	/*
+	 * If current isolated cpuset has isolated children
+	 * disallow changes to cpu mask
+	 */
+	if (!cpus_empty(cs->isolated_map))
+		return -EBUSY;

 1) Spacing - there's 8 spaces, not a tab, on two of the lines above.
 2) I can't tell yet - but I am curious as to whether the above
    restriction prohibiting cpu mask changes to a cpuset with isolated
    children might be a bit draconian.
Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Dinakar wrote:
> Ok, Let me begin at the beginning and attempt to define what I am
> doing here

The statement of requirements and approach help.  Thank-you.  And the
comments in the code patch are much easier for me to understand.
Thanks.

Let me step back and consider where we are here.

I've not been entirely happy with the cpu_exclusive (and mem_exclusive)
properties.  They were easy to code, and they require only looking at
one's siblings and parent, but they don't provide all that people
usually want, which is system-wide exclusivity, because they don't
exclude tasks in one's parent (or more remote ancestor) cpusets from
stealing resources.

I take your isolated cpusets as a reasonable attempt to provide what's
really wanted.

I had avoided simple, system-wide exclusivity because I really wanted
cpusets to be hierarchical.  One should be able to subdivide and
manage one subtree of the cpuset hierarchy, oblivious to what someone
else is doing with a disjoint subtree.  Your work shows how to provide
a stronger form of isolation (exclusivity) without abandoning the
hierarchical structure.

There are three directions we could go from here.  I am not yet
decided between them:

 1) Remove the cpu and mem exclusive flags - they are of limited use.
 2) Leave the code as is.
 3) Extend the exclusive capability to include isolation from parents,
    along the lines of your patch.

If I were redoing cpusets from scratch, I might not include the
exclusive feature at all - not sure.  But it's cheap, at least in
terms of code, and of some use to some users.  So I would choose (2)
over (1), given where we are now.  The main cost at present of the
exclusive flags is the cost in understanding - they tend to confuse
people at first glance, due to their somewhat unusual approach.

If we go with (3), then I'd like to consider the overall design of
this a bit more.  Your patch, as is common for patches, attempts to
work within the current framework, minimizing change.
Better to take a step back and consider what would have been the best
design as if the past didn't matter, then with that clearly in mind,
ask how best to get there from here.

I don't think we would have both isolated and exclusive flags in the
'ideal design.'  The exclusive flags are essentially half (or a third)
of what's needed, and the isolated flags and masks the rest of it.

Essentially, your patch replaces the single set of CPUs in a cpuset
with three, related sets:

 A] the set of all CPUs managed by that cpuset
 B] the set of CPUs allowed to tasks attached to that cpuset
 C] the set of CPUs isolated for the dedicated use of some descendent

Sets [B] and [C] form a partition of [A] -- their intersection is
empty, and their union is [A].

Your current presentation of these sets of CPUs shows set [B] in the
cpus file, followed by set [C] in brackets, if I am recalling
correctly.  This format changes the format of the current cpus_allowed
file, and it violates the preference for a single value or vector per
file.  I would like to consider alternatives.

Your code automatically updates [C] if the child cpuset adds or
removes CPUs from those it manages in isolation (though I am not sure
that your code manages this change all the way back up the hierarchy
to the top cpuset, and I am wondering if perhaps your code should be
doing this, as noted in my detailed comments on your patch earlier
today.)

I'd be tempted, if taking this approach (3), to consider a couple of
alternatives.  As I spelled out a few days ago, one could mark some
cpusets that form a partition of the system's CPUs, for the purposes
of establishing isolated scheduler domains, without requiring the
above three related sets per cpuset instead of one.

I am still unsure how much of your motivation is the need to make the
scheduler more efficient by establishing useful isolated sched
domains, and how much is the need to keep the usage of CPUs by various
jobs isolated, even from tasks attached to parent cpusets.
One can obtain the job isolation just in user code - if you don't want
a task to use a parent cpuset's access to your isolated cpuset, then
simply don't attach tasks to the parent cpusets.

I do not understand yet how strong your requirement is to have the
_kernel_ enforce that there are no tasks in a parent cpuset which
could intrude on the non-isolated resources of a child.

I provide (non open source) user level tools to my users which enable
them to conveniently ensure that there are no such unwanted tasks, so
they don't have a problem with a parent cpuset's CPUs overlapping a
cpuset that they are using for an isolated job.  Perhaps I could
persuade my employer that it would be appropriate to open source these
tools.

In any case, going (3) would result in _one_ attribute, not two (both
exclusive and isolated, with overlapping semantics, which is
confusing.)
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets (v0.2)
A few code details (still working on more substantive reply):

+	/* An isolated cpuset has to be exclusive */
+	if ((is_cpu_isolated(trial) && !is_cpu_exclusive(cur))
+		|| (!is_cpu_exclusive(trial) && is_cpu_isolated(cur)))
+			return -EINVAL;

Is the above code equivalent to what the comment states:

	if (is_cpu_isolated(trial) <= is_cpu_exclusive(trial))
		return -EINVAL;

+	t = old_parent = *par;
+	cpus_or(all_map, cs->cpus_allowed, cs->isolated_map);
+
+	/* If cpuset empty or top_cpuset, return */
+	if (cpus_empty(all_map) || par == NULL)
+		return;

If the (par == NULL) check succeeds, then perhaps the earlier (*par)
dereference will have oopsed first?

+	struct cpuset *par = cs->parent, t, old_parent;

Looks like 't' was chosen to be a one-char variable name, to keep some
lines below within 80 columns.  I'd do the same myself.  But this
leaves a non-symmetrical naming pattern for the new and old parent
cpuset values.  Perhaps the following would work better?

	struct cpuset *parptr;
	struct cpuset o, n;	/* old and new parent cpuset values */

+static void update_cpu_domains(struct cpuset *cs, cpumask_t old_map)

Could old_map be passed as a (const cpumask_t *)?  The stack space of
this code, just for cpumask_t's (see the old and new above), is
getting large for (really) big systems.

+	/* Make the change */
+	par->cpus_allowed = t.cpus_allowed;
+	par->isolated_map = t.isolated_map;

Why don't you need to propagate this change upward to the parent's
cpus_allowed and isolated_map?  If a parent's isolated_map grows (or
shrinks), doesn't that affect every ancestor, all the way to the top
cpuset?
[RFC PATCH] Dynamic sched domains aka Isolated cpusets (v0.2)
Based on Paul's feedback, I have simplified and cleaned up the code
quite a bit.

o I have taken care of most of the nits, except for the output format
  change for cpusets with isolated children.
o Also most of my documentation has been part of my earlier mails and
  I have not yet added them to cpusets.txt.
o I still haven't looked at the memory side of things.
o Most of the changes are in the cpusets code and almost none in the
  sched code.  (I'll do that next week)
o Hopefully my earlier mails regarding the design have clarified many
  of the questions that were raised

So here goes version 0.2

-rw-r--r--    1 root     root        16548 Apr 21 20:54 cpuset.o.orig
-rw-r--r--    1 root     root        17548 Apr 21 22:09 cpuset.o.sd-v0.2

Around ~6% increase in kernel text size of cpuset.o

 include/linux/init.h  |    2
 include/linux/sched.h |    1
 kernel/cpuset.c       |  153 +-
 kernel/sched.c        |  111
 4 files changed, 216 insertions(+), 51 deletions(-)

diff -Naurp linux-2.6.12-rc1-mm1.orig/include/linux/init.h linux-2.6.12-rc1-mm1/include/linux/init.h
--- linux-2.6.12-rc1-mm1.orig/include/linux/init.h	2005-03-18 07:03:49.0 +0530
+++ linux-2.6.12-rc1-mm1/include/linux/init.h	2005-04-21 21:54:06.0 +0530
@@ -217,7 +217,7 @@ void __init parse_early_param(void);
 #define __initdata_or_module __initdata
 #endif /*CONFIG_MODULES*/
 
-#ifdef CONFIG_HOTPLUG
+#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
 #define __devinit
 #define __devinitdata
 #define __devexit
diff -Naurp linux-2.6.12-rc1-mm1.orig/include/linux/sched.h linux-2.6.12-rc1-mm1/include/linux/sched.h
--- linux-2.6.12-rc1-mm1.orig/include/linux/sched.h	2005-04-21 21:50:26.0 +0530
+++ linux-2.6.12-rc1-mm1/include/linux/sched.h	2005-04-21 21:53:57.0 +0530
@@ -155,6 +155,7 @@ typedef struct task_struct task_t;
 extern void sched_init(void);
 extern void sched_init_smp(void);
 extern void init_idle(task_t *idle, int cpu);
+extern void rebuild_sched_domains(cpumask_t span1, cpumask_t span2);
 
 extern cpumask_t nohz_cpu_mask;
diff -Naurp linux-2.6.12-rc1-mm1.orig/kernel/cpuset.c linux-2.6.12-rc1-mm1/kernel/cpuset.c
--- linux-2.6.12-rc1-mm1.orig/kernel/cpuset.c	2005-04-21 21:50:26.0 +0530
+++ linux-2.6.12-rc1-mm1/kernel/cpuset.c	2005-04-21 22:00:36.0 +0530
@@ -57,7 +57,13 @@ struct cpuset {
 	unsigned long flags;		/* "unsigned long" so bitops work */
-	cpumask_t cpus_allowed;		/* CPUs allowed to tasks in cpuset */
+	/*
+	 * CPUs allowed to tasks in cpuset and
+	 * not part of any isolated children
+	 */
+	cpumask_t cpus_allowed;
+
+	cpumask_t isolated_map;		/* CPUs associated with isolated children */
 	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
 	atomic_t count;			/* count tasks using this cpuset */
@@ -82,6 +88,7 @@ struct cpuset {
 /* bits in struct cpuset flags field */
 typedef enum {
 	CS_CPU_EXCLUSIVE,
+	CS_CPU_ISOLATED,
 	CS_MEM_EXCLUSIVE,
 	CS_REMOVED,
 	CS_NOTIFY_ON_RELEASE
@@ -93,6 +100,11 @@ static inline int is_cpu_exclusive(const
 	return !!test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
 }
 
+static inline int is_cpu_isolated(const struct cpuset *cs)
+{
+	return !!test_bit(CS_CPU_ISOLATED, &cs->flags);
+}
+
 static inline int is_mem_exclusive(const struct cpuset *cs)
 {
 	return !!test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
@@ -127,8 +139,10 @@ static inline int notify_on_release(cons
 static atomic_t cpuset_mems_generation = ATOMIC_INIT(1);
 
 static struct cpuset top_cpuset = {
-	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
+	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_CPU_ISOLATED) |
+			(1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
+	.isolated_map = CPU_MASK_NONE,
 	.mems_allowed = NODE_MASK_ALL,
 	.count = ATOMIC_INIT(0),
 	.sibling = LIST_HEAD_INIT(top_cpuset.sibling),
@@ -543,9 +557,14 @@ static void refresh_mems(void)
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 {
-	return cpus_subset(p->cpus_allowed, q->cpus_allowed) &&
+	cpumask_t all_map;
+
+	cpus_or(all_map, q->cpus_allowed, q->isolated_map);
+
+	return cpus_subset(p->cpus_allowed, all_map) &&
 		nodes_subset(p->mems_allowed, q->mems_allowed) &&
 		is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
+		is_cpu_isolated(p) <= is_cpu_isolated(q) &&
 		is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
@@ -587,6 +606,11 @@ static int validate_change(const struct
 	if (!is_cpuset_subset(trial, par))
 		return -EACCES;
 
+	/* An isolated cpuset has to be exclusive */
+	if ((is_cpu_isolated
Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Wed, Apr 20, 2005 at 12:09:46PM -0700, Paul Jackson wrote:
> Earlier, I wrote to Dinakar:
> > What are your invariants, and how can you assure yourself and us
> > that your code preserves these invariants?

Ok, let me begin at the beginning and attempt to define what I am
doing here:

1. I need a method to isolate a random set of cpus in such a way that
   only the set of processes that are specifically assigned can make
   use of these CPUs
2. I need to ensure that the sched load balance code does not pull any
   tasks other than the assigned ones onto these cpus
3. I need to be able to create multiple such groupings of cpus that
   are disjoint from the rest and run only specified tasks
4. I need a user interface to specify which random set of cpus form
   such a grouping of disjoint cpus
5. I need to be able to dynamically create and destroy these groupings
   of disjoint cpus
6. I need to be able to add/remove cpus to/from this grouping

Now if you try to fit these requirements onto cpusets, keeping in mind
that it already has a user interface and some of the framework
required to create disjoint groupings of cpus:

1. An exclusive cpuset ensures that the cpus it has are disjoint from
   all other cpusets except its parent and children
2. So now I need a way to disassociate the cpus of an exclusive cpuset
   from its parent, so that this set of cpus is truly disjoint from
   the rest of the system
3. After I have done (2) above, I now need to build two sets of sched
   domains corresponding to the cpus of this exclusive cpuset and the
   remaining cpus of its parent
4. Ensure that the current rules of non-isolated cpusets are all
   preserved, such that if this feature is not used, all other
   features work as before

This is exactly what I have tried to do:

1. Maintain a flag to indicate whether a cpuset is isolated
2. Maintain an isolated_map for every cpuset.  This contains a cache
   of all cpus associated with isolated children
3.
   To isolate a cpuset x, x has to be an exclusive cpuset and its
   parent has to be an isolated cpuset
4. On isolating a cpuset by issuing

	/bin/echo 1 > cpu_isolated

   it ensures that the conditions in (3) are satisfied, and then
   removes the cpus of the current cpuset from the parent's
   cpus_allowed mask.  (It also puts the cpus of the current cpuset
   into the isolated_map of its parent.)  This ensures that only the
   current cpuset and its children will have access to the now
   isolated cpus.  It also rebuilds the sched domains into two new
   domains consisting of
   a. all cpus in parent->cpus_allowed
   b. all cpus in current->cpus_allowed
5. Similarly, on setting isolated off on an isolated cpuset (or on
   doing an rmdir on an isolated cpuset), it adds all of the cpus of
   the current cpuset into its parent cpuset's cpus_allowed mask and
   removes them from its parent's isolated_map.  This ensures that all
   of the cpus in the current cpuset are now visible to the parent
   cpuset.  It now rebuilds only one sched domain, consisting of all
   of the cpus in its parent's cpus_allowed mask.
6. You can also modify the cpus present in an isolated cpuset x,
   provided that x does not have any children that are also isolated.
7. On adding or removing cpus from an isolated cpuset that does not
   have any isolated children, it reworks the parent cpuset's
   cpus_allowed and isolated_map masks and rebuilds the sched domains
   appropriately.
8. Since the function update_cpu_domains, which does all of the above
   updates to the parent cpuset's masks, is always called with
   cpuset_sem held, it ensures that all these changes are atomic.

> > He removes cpus 4-5 from batch and adds them to cint
>
> Could you spell out the exact steps the user would take, for this
> part of your example?  What does the user do, what does the kernel do
> in response, and what state the cpusets end up in, after each action
> of the user?
cpuset            cpus  isolated  cpus_allowed  isolated_map
top               0-7   1         0             0-7
top/lowlat        0-1   1         0-1           0
top/others        2-7   1         4-7           2-3
top/others/cint   2-3   1         2-3           0
top/others/batch  4-7   0         4-7           0

At this point, to remove cpus 4-5 from batch and add them to cint, the
admin would do the following steps:

# Remove cpus 4-5 from batch
# batch is not an isolated cpuset and hence this step
# has no other implications
/bin/echo 6-7 > /top/others/batch/cpus

cpuset            cpus  isolated  cpus_allowed  isolated_map
top               0-7   1         0             0-7
top/lowlat        0-1   1         0-1           0
top/others        2-7   1         4-7           2-3
top/others/cint   2-3   1         2-3
Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Earlier, I wrote to Dinakar:
> What are your invariants, and how can you assure yourself and us
> that your code preserves these invariants?

I repeat that question.

===

On my first reading of your example, I see the following.

It is sinking into my dense skull more than it had before that your
patch changes the meaning of the cpuset field 'cpus_allowed', to only
include the cpus not in isolated children.  However, there are other
uses of the 'cpus_allowed' field in the cpuset code that are not
changed, and comments and documentation describing this field that
are not changed.  I suspect this is an incomplete change.

You don't actually state it that I noticed, but the main point of your
example seems to be that you support incrementally moving individual
cpus between cpusets, without the constraint that both cpusets be in
the same subset of the partition (the same isolation group).  So you
can move a cpu in and out of an isolated group without tearing the
group down first, only to rebuild it after.

To do this, you've added new semantics to some of the operations that
write the 'cpus' special file of a cpuset, if and only if that cpuset
is marked isolated, which involves changing some other masks.  These
new semantics are something along the lines of "adding a cpu here
implies removing it from there."  This presumably allows you to move
cpus in or out of or between isolated cpusets, while preserving the
essential properties of a partition - that it is a disjoint covering.

> He removes cpus 4-5 from batch and adds them to cint

Could you spell out the exact steps the user would take, for this part
of your example?  What does the user do, what does the kernel do in
response, and what state do the cpusets end up in, after each action
of the user?

===

So far, to be honest, I am finding your patch to be rather
frustrating.  Perhaps the essential reason is this.
The interface that cpusets presents in the cpuset file system, mounted
at /dev/cpuset, is not in my intentions primarily a human interface.
It is primarily a programmatic interface.  As such, there is a high
premium on clarity of design, consistency of behaviour and absence of
side effects.  Each operation should do one thing, clearly defined,
changing only what is operated on, preserving clearly spelled out
invariants.

If it takes three steps instead of one to accomplish a typical task,
that's fine.  The programs that layer on top of /dev/cpuset don't mind
doing three things to get one thing done.  But such programs are a
pain in the backside to program correctly if the effects of each
operation are not clearly defined, not focused on the obvious object
being operated on, or not precisely consistent with an overriding
model.

This patch seems to add side effects and change the meanings of
things, doing so with the most minimum of mention in the description,
without clearly and consistently spelling out the new mental model,
and without uniformly changing all uses, comments and documentation to
fit the new model.

This cpuset facility is also a less commonly used kernel facility, and
changes to cpusets, outside of a few key hooks in the scheduler and
allocator, are not performance critical.  This means that there is a
premium on keeping the kernel code minimal, leaving as many details as
practical to userland.

This patch seems to increase the kernel text size, for an ia64 SN2
build using gcc 3.2.3 of a 2.6.12-rc1-mm4 tree I had at hand, _just_
for the cpuset.c changes, from 23071 bytes to 28999.  That's over a
25% increase in the kernel text size of the file kernel/cpuset.o, just
for this feature.  That's too much, in my view.

I don't know yet if the ability to move cpus between isolated sched
domains, without tearing them down and rebuilding them, is a critical
feature for you or not.
You have not been clear on what are the essential requirements of this
feature.  I don't even know for sure yet that this is the one key
feature in your view that separates your proposal from the variations
I explored.

But if this is for you the critical feature that your proposal has,
and mine lack, then I'd like to see if there is a way to do it without
implicit side effects, without messing with the semantics of what's
there now, and with significantly fewer bytes of kernel text space.
And I'd like to see if we can have uniform and precisely spelled out
semantics, in the code, comments and documentation, with any changes
to the current semantics made everywhere, uniformly.
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Tue, Apr 19, 2005 at 08:26:39AM -0700, Paul Jackson wrote:
> * Your understanding of "cpu_exclusive" is not the same as mine.

Sorry for creating confusion by what I said earlier.  I do understand
exactly what cpu_exclusive means.  It's just that when I started
working on this (a long time ago) I had a different notion, and that
is what I was referring to.  I probably should never have brought that
up.

> > Since isolated cpusets are trying to partition the system, this
> > can be restricted to only the first level of cpusets.
>
> I do not think such a restriction is a good idea.  For example, lets
> say our 8 CPU system has the following cpusets:

And my current implementation has no such restriction; I was only
suggesting that to simplify the code.

> > Also I think we can add further restrictions in terms of not being
> > able to change (add/remove) cpus within an isolated cpuset.
>
> My approach agrees on this restriction.  Earlier I wrote:
> > Also note that adding or removing a cpu from a cpuset that has
> > its domain_cpu_current flag set true must fail, and similarly
> > for domain_mem_current.
>
> This restriction is required in my approach because the CPUs in the
> domain_cpu_current cpusets (the isolated CPUs, in your terms) form a
> partition (disjoint cover) of the CPUs in the system, which property
> would be violated immediately if any CPU were added or removed from
> any cpuset defining the partition.

See my other note explaining how things work currently.  I do feel
that this restriction is not good.

	-Dinakar
Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Tue, Apr 19, 2005 at 10:23:48AM -0700, Paul Jackson wrote:
> How does this play out in your interface?  Are you convinced that
> your invariants are preserved at all times, to all users?  Can
> you present a convincing argument to others that this is so?

Let me give an example of how the current version of isolated cpusets
can be used, and hopefully clarify my approach.

Consider a system with 8 cpus that needs to run a mix of workloads.
One set of applications has low latency requirements and another set
has a mixed workload.  The administrator decides to allot 2 cpus to
the low latency application and the rest to other apps.  To do this,
he creates two cpusets (all cpusets are considered to be exclusive for
this discussion):

cpuset            cpus  isolated  cpus_allowed  isolated_map
top               0-7   1         0-7           0
top/lowlat        0-1   0         0-1           0
top/others        2-7   0         2-7           0

He now wants to partition the system along these lines, as he wants to
isolate lowlat from the rest of the system to ensure that

a. no tasks from the parent cpuset (top_cpuset in this case) use these
   cpus
b. load balance does not run across all cpus 0-7

He does this by

	cd /mount-point/lowlat
	/bin/echo 1 > cpu_isolated

Internally it takes the cpuset_sem, does some sanity checks and
ensures that these cpus are not visible to any other cpuset, including
its parent (by removing these cpus from its parent's cpus_allowed mask
and adding them to its parent's isolated_map), and then calls sched
code to partition the system as

	[0-1] [2-7]

The internal state of the data structures is as follows:

cpuset            cpus  isolated  cpus_allowed  isolated_map
top               0-7   1         2-7           0-1
top/lowlat        0-1   1         0-1           0
top/others        2-7   0         2-7           0

---

The administrator now wants to further partition the "others" cpuset
into a cpu intensive application and a batch one:

cpuset            cpus  isolated  cpus_allowed  isolated_map
top               0-7   1         2-7           0-1
top/lowlat        0-1   1         0-1           0
top/others        2-7   0         2-7           0
top/others/cint   2-3   0         2-3           0
top/others/batch  4-7   0         4-7           0

If now the administrator wants to isolate the cint cpuset...
	cd /mount-point/others
	/bin/echo 1 > cpu_isolated

(At this point no new sched domains are built, as there exists a sched
domain which exactly matches the cpus in the "others" cpuset.)

	cd /mount-point/others/cint
	/bin/echo 1 > cpu_isolated

At this point cpus from the "others" cpuset have also been taken away
from its parent's cpus_allowed mask and put into the parent's
isolated_map.  This means that the parent's cpus_allowed mask is
empty.  This now results in partitioning the "others" cpuset, building
two new sched domains as follows

	[2-3] [4-7]

Notice that cpus 0-1, having already been isolated, are not affected
by this operation.

cpuset            cpus  isolated  cpus_allowed  isolated_map
top               0-7   1         0             0-7
top/lowlat        0-1   1         0-1           0
top/others        2-7   1         4-7           2-3
top/others/cint   2-3   1         2-3           0
top/others/batch  4-7   0         4-7           0

---

The admin now wants to run more applications in the cint cpuset and
decides to borrow a couple of cpus from the batch cpuset.  He removes
cpus 4-5 from batch and adds them to cint:

cpuset            cpus  isolated  cpus_allowed  isolated_map
top               0-7   1         0             0-7
top/lowlat        0-1   1         0-1           0
top/others        2-7   1         6-7           2-5
top/others/cint   2-5   1         2-5           0
top/others/batch  6-7   0         6-7           0

As cint is already isolated, adding cpus causes it to rebuild all cpus
covered by its cpus_allowed and its parent's cpus_allowed, so the new
sched domains will look as follows

	[2-5] [6-7]

cpus 0-1 are of course still not affected.  Similarly, the admin can
remove cpus from cint, which will result in the domains being rebuilt
to what was before:

	[2-3] [4-7]

---

Hope this clears up my approach.  Also note that w
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Dinakar wrote:
> Also I think we can add further restrictions in terms of not being
> able to change (add/remove) cpus within an isolated cpuset.  Instead
> one would have to tear down an existing cpuset and make a new one
> with the required configuration.  That would simplify things even
> further.

My earlier reply to this missed the mark a little.  Instead what I
would say is this.

If one wants to move a CPU from one cpuset to another, where these two
cpusets are not in the same partitioned scheduler domain, then one
first has to collapse the scheduler domain partitions so that both
cpusets _are_ in the same partitioned scheduler domain.  Then one can
move the CPU between the two cpusets, and reestablish the more fine
grained partitioned scheduler domain structure that isolates these two
cpusets into different partitioned scheduler domains.
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Nick wrote:
> Well the scheduler simply can't handle it, so it is not so much a
> matter of pushing - you simply can't use partitioned domains and
> meaningfully have a cpuset above them.

Translating that into cpuset-speak, I think what you mean is that I
can't have partitioned sched domains and have a task attached to a
cpuset above them, if it matters to me that the task can actually use
all the CPUs in its larger cpuset.

But what you actually said was that I cannot have a cpuset above them.
I certainly _can_ have a cpuset above the cpusets that define the
partitioned domains.  I _have_ to have that, or toss the entire
hierarchical cpuset design.  The top cpuset encompasses all the CPUs
on the system, and is above all others.

Let's see if the following example helps clear up these confusions.

Let's say we started out as one big happy family, with a single top
cpuset, and a single sched domain, each encompassing the entire
machine.  All tasks are attached to that cpuset and load balanced and
scheduled in that sched domain.  Any task can be run anywhere.

Then some yahoo comes along and decides to complicate things.  They
create my two cpusets Alpha and Beta, each covering half the system.
They create two partitioned sched domains corresponding to Alpha and
Beta, respectively.  They move almost every task into one of Alpha or
Beta, expecting henceforth that each such moved task will only run on
whichever half of the system it was placed in.

For instance, if they moved init into Alpha, that means they _want_
the init task to be constrained to the Alpha half of the system, even
if every CPU in Beta has been idle for the last 5 hours.

So far, all fine and dandy.

But they leave behind a few tasks still attached to the top cpuset,
with those tasks' cpus_allowed still allowing any CPU in the system.
They actually don't give a rat's patootie about these few tasks, because they consume less than 10 seconds each per day, and so long as they are allowed their few CPU cycles when they want them, all is well. They could have moved these tasks as well into Alpha or Beta, but they wanted to be annoying and see if they could concoct a test case that would break something here. Or maybe they were just forgetful.

What breaks? You seem to be telling me that this is verboten, but I don't see yet where the problem is.

My timid guess is that about all that breaks is that each of these stray tasks will be forever after stuck in whichever one of Alpha or Beta it happened to be in at the point of the Great Divide. If say one of these tasks happened to be on the Beta side at that point, the Beta domain scheduler will never let an Alpha CPU see that task, leaving the task to only ever be picked up by a Beta CPU (even though the task's cpuset and cpus_allowed would have allowed an Alpha CPU, in theory).

Translating this back into a language my users might speak, I guess this means I tell them:

 * No scheduling or load balancing is done across partitioned
   scheduler domains.
 * Even if one such domain is hugely oversubscribed, and another
   totally idle, no task in one will run in the other. If that's
   what you want, then go for it.
 * Tasks left attached to cpusets higher up in the hierarchy don't
   get moved or load balanced between partitioned sched domains
   below their cpuset. They will get stuck in one of the domains,
   willy-nilly. So if it matters to you in the slightest which of
   the partitions a task runs in, attach it appropriately, to one
   of the cpusets that define the partitioned scheduler domains,
   or below.

In short, perhaps you were trying to make my life, or at least my efforts to understand this, simple, by telling me that I simply can't have any cpusets above partitioned sched domains.
The literal translation of that into cpuset-speak throws out the entire cpuset architecture. So I have to push back and figure out in more detail what really matters here.

Am I anywhere close?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Dinakar wrote:
> I was hoping that by the time we are done with this, we would
> be able to completely get rid of the isolcpus= option.

I won't miss it. Though, since it's in the main line kernel, do you need to mark it deprecated for a while first?

> For that of course we need to be able to build domains that don't
> run load balance

Ah - so that's what these isolcpus are - ones not load balanced? This was never clear to me.

> The wording [/* Set ... */ ] was from the user's point of view
> for what action was being done, guess I'll change that

Ok - at least now I can read and understand the comments, knowing this. The other comments in cpuset.c don't follow this convention, of speaking in the "user's voice", but rather speak in the "responding system's voice." Best to remain consistent in this matter.

> It is complicated because it has to handle all of the different
> possible actions that the user can initiate. It can be simplified
> if we have stricter rules of what the user can/cannot do
> w.r.t to isolated cpusets

It is complicated because you are trying to do a complex state change one step at a time, without a precise statement (at least, not that I saw) of what the invariants are, and atomic operations that preserve the invariants.

> > First, let me verify one thing. I understand that the _key_
> > purpose of your patch is not so much to isolate cpus, as it
> > is to allow for structuring scheduling domains to align with
> > cpuset boundaries. I understand real isolated cpus to be ones
> > that don't have a scheduling domain (have only the dummy one),
> > as requested by the "isolcpus=..." boot flag.
>
> Not really. Isolated cpusets allow you to do a soft-partition
> of the system, and it would make sense to continue to have load
> balancing within these partitions. I would think not having
> load balancing should be one of the options available

Ok ...
then is it correct to say that your purpose is to partition the system's CPUs into subsets, such that for each subset, either there is a scheduler domain for exactly the CPUs in that subset, or none of the CPUs in the subset are in any scheduler domain?

> I must confess that I haven't looked at the memory side all that much,
> having more interest in trying to build soft-partitioning of the cpu's

This is an understandable focus of interest. Just know that one of the sanity tests I will apply to a solution for CPUs is whether there is a corresponding solution for Memory Nodes, using much the same principles, invariants and conventions.

> ok I need to spend more time on your model Paul, but my first
> guess is that it doesn't seem to be very intuitive and seems
> to make it very complex from the user's perspective. However as
> I said I need to understand your model a bit more before I
> comment on it

Well ... I can't claim that my approach is simple. It does have a clearly defined (well, clear to me ;) mathematical model, with some invariants that are always preserved in what user space sees, with atomic operations for changing from one legal state to the next.

The primary invariant is that the sets of CPUs in the cpusets marked domain_cpu_current form a partition (disjoint covering) of the CPUs in the system.

What are your invariants, and how can you assure yourself and us that your code preserves these invariants?

Also, I don't know that the sequence of user operations required by my interface is that much worse than yours. Let's take an example, and compare what the user would have to do. Let's say we have the following cpusets on our 8 CPU system:

    /            # CPUs 0-7
    /Alpha       # CPUs 0-3
    /Alpha/phi   # CPUs 0-1
    /Alpha/chi   # CPUs 2-3
    /Beta        # CPUs 4-7

Let's say we currently have three scheduler domains, for three isolated (in your terms) cpusets: /Alpha/phi, /Alpha/chi and /Beta.
Let's say we want to change the configuration to have just two scheduler domains (two isolated cpusets): /Alpha and /Beta.

A user of my API would do the operations:

    echo 1 > /Alpha/domain_cpu_pending
    echo 1 > /Beta/domain_cpu_pending
    echo 0 > /Alpha/phi/domain_cpu_pending
    echo 0 > /Alpha/chi/domain_cpu_pending
    echo 1 > /domain_cpu_rebuild

The domain_cpu_current state would not change until the final write (echo) above, at which time the cpuset_sem lock would be taken, and the system would, atomically to all viewing tasks, change from having the three cpusets /Alpha/phi, /Alpha/chi and /Beta marked with a true domain_cpu_current, to having the two cpusets /Alpha and /Beta so marked.

The alternative API, which I didn't explore, could do this in one step by writing the new list of cpusets defining the partition, doing the rough equivalent (need nul separators, not space separators) of:

    echo /Alpha /Beta > /list_cpu_subdomains

How does this play out in your interface? Are you convinced that your
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Simon wrote:
> I guess we hit a limit of the filesystem-interface approach here.
> Are the possible failure reasons really that complex ?

Given the amount of head scratching my proposal has provoked so far, they might be that complex, yes. Failure reasons include:

 * The cpuset Foo whose domain_cpu_rebuild file we wrote does not
   align with the current partition of CPUs on the system (align:
   every subset of the partition is either within or outside the
   CPUs of Foo)
 * The cpusets Foo and its descendents which are marked with a true
   domain_cpu_pending do not form a partition of the CPUs in Foo.
   This could be either because two of these cpusets have overlapping
   CPUs, or because the union of all the CPUs in these cpusets
   doesn't cover.
 * The usual other reasons such as lacking write permission.

> If this is only to get a hint, OK.

Yes - it would be a hint. The official explanation would be the errno setting on the failed write. The hint, written to the domain_cpu_error file, could actually state which two cpusets had overlapping CPUs, or which CPUs in Foo were not covered by the union of the CPUs in the marked descendent cpusets.

Yes - it's pushing the limits of available mechanisms. Though I don't offhand see where the filesystem-interface approach is to blame here. Can you describe any other approach that would provide such a similarly useful error explanation in a less unusual fashion?

> Is such an error reporting scheme already in use in the kernel ?

I don't think so.

> On the other hand, there's also no guarantee that what we are triggering
> by writing in domain_cpu_rebuild is what we have set up by writing in
> domain_cpu_pending. User applications will need a bit of self-discipline.

True.
To preserve the invariant that the CPUs in the selected cpusets form a partition (disjoint cover) of the system's CPUs, we either need to provide an atomic operation that allows passing in a selection of cpusets, or we need to provide a sequence of operations that essentially drive a little finite state machine, building up a description of the new state while the old state remains in place, until the final trigger is fired.

This suggests what the primary alternative to my proposed API would be, and that would be an interface that let one pass in a list of cpusets, requesting that the partition below the specified cpuset subtree Foo be completely and atomically rebuilt, to be that defined by the list of cpusets, with the set of CPUs in each of these cpusets defining one subset of the partition.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Dinakar, replying to Nick:
> > It doesn't work if you have *most* jobs bound to either
> > {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
> > to use any CPU from 0-7.
>
> That is the current definition of cpu_exclusive on cpusets.
> I initially thought of attaching exclusive cpusets to sched domains,
> but that would not work because of this reason

I can't make any sense of this reply, Dinakar. You say "_That_" is the current definition of cpu_exclusive -- I have no idea what "_That_" refers to. I see nothing in what Nick wrote that has anything much to do with the definition of cpu_exclusive.

If a cpuset is marked cpu_exclusive, it means that the kernel will not allow any of its siblings to have overlapping CPUs. It doesn't mean that its parent can't overlap CPUs -- indeed its parent must contain a superset of all the CPUs in a cpu_exclusive cpuset and its siblings. It doesn't mean that there cannot be tasks attached to each of the cpu_exclusive cpuset, its siblings and its parent.

You say "attaching exclusive cpusets to sched domains ... would not work because of this reason." I have no idea what "this reason" is.

I am pretty sure of a couple of things:
 * Your understanding of "cpu_exclusive" is not the same as mine.
 * We want to avoid any dependency on "cpu_exclusive" here.

> Since isolated cpusets are trying to partition the system, this
> can be restricted to only the first level of cpusets.

I do not think such a restriction is a good idea. For example, let's say our 8 CPU system has the following cpusets:

    /            # 0-7
    /Alpha       # 0-3
    /Alpha/phi   # 0-1
    /Alpha/chi   # 2-3
    /Beta        # 4-7

Then I see no problem with cpusets /Alpha/phi, /Alpha/chi and /Beta being the isolated cpusets, with corresponding scheduler domains. But phi and chi are not "first level cpusets."

If we require a partition (disjoint cover) of the CPUs in the system, then enforce exactly that. Do not confuse a rough approximation with a simplified model.
> Also I think we can add further restrictions in terms of not being able
> to change (add/remove) cpus within an isolated cpuset.

My approach agrees on this restriction. Earlier I wrote:
> Also note that adding or removing a cpu from a cpuset that has
> its domain_cpu_current flag set true must fail, and similarly
> for domain_mem_current.

This restriction is required in my approach because the CPUs in the domain_cpu_current cpusets (the isolated CPUs, in your terms) form a partition (disjoint cover) of the CPUs in the system, which property would be violated immediately if any CPU were added or removed from any cpuset defining the partition.

> Instead one would
> have to tear down an existing cpuset and make a new one with the
> required configuration. That would simplify things even further

You've just come close to describing the approach that it took me "several more" words to describe. Though one doesn't need to tear down or make any new cpusets; rather one atomically selects a new set of cpusets to define the partition.

If one had to tear down and remake cpusets to change the partition, then one would be in trouble -- it would be difficult to provide an API that allowed doing that atomically. If it's not atomic, then we have illegal intermediate states, where one cpuset is gone and the new one has not arrived, and our partition of the cpusets in the system no longer covers the system ("our cover is blown", as they say in undercover police work.)

> And maybe also have a flag that says whether to have load balancing
> in this domain or not

It's probably too early to think about that.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Tue, Apr 19, 2005 at 04:19:35PM +1000, Nick Piggin wrote:
[...Snip...]
> Though I imagine this becomes a complete superset of the
> isolcpus= functionality, and it would actually be easier to
> manage a single isolated CPU and its associated processes with
> the cpusets interfaces after this.

That is the idea, though I think that we need to be able to provide users the option of not doing a load balance within a sched domain

> It doesn't work if you have *most* jobs bound to either
> {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
> to use any CPU from 0-7.

That is the current definition of cpu_exclusive on cpusets. I initially thought of attaching exclusive cpusets to sched domains, but that would not work because of this reason

> > In the case of cpus, we really do prefer the partitions to be
> > disjoint, because it would be better not to confuse the domain
> > scheduler with overlapping domains.
>
> Yes. The domain scheduler can't handle this at all, it would
> have to fall back on cpus_allowed, which in turn can create
> big problems for multiprocessor balancing.

I agree

> From what I gather, this partitioning does not exactly fit
> the cpusets architecture. Because with cpusets you are specifying
> on what cpus can a set of tasks run, not dividing the whole system.

Since isolated cpusets are trying to partition the system, this can be restricted to only the first level of cpusets. Keeping in mind that we have a flat sched domain hierarchy, I think probably this would simplify the update_sched_domains function quite a bit. Also I think we can add further restrictions in terms of not being able to change (add/remove) cpus within an isolated cpuset. Instead one would have to tear down an existing cpuset and make a new one with the required configuration. That would simplify things even further

> The sched-domains setup code will take care of all that for you
> already. It won't know or care about the partitions.
> If you
> partition a 64-way system into 2 32-ways, the domain setup code
> will just think it is setting up a 32-way system.
>
> Don't worry about the sched-domains side of things at all, that's
> pretty easy. Basically you just have to know that it has the
> capability to partition the system in an arbitrary disjoint set
> of sets of cpus.

And maybe also have a flag that says whether to have load balancing in this domain or not

-Dinakar
Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Mon, Apr 18, 2005 at 10:54:27PM -0700, Paul Jackson wrote:
> Hmmm ... interesting patch. My reaction to the changes in
> kernel/cpuset.c are complicated:

Thanks Paul for taking time off your vacation to reply to this. I was expecting to see one of your huge mails but this has exceeded all my expectations :)

> * I'd probably ditch the all_cpus() macro, on the
>   concern that it obfuscates more than it helps.
> * The need for _both_ a per-cpuset flag 'CS_CPU_ISOLATED'
>   and another per-cpuset mask 'isolated_map' concerns me.
>   I guess that the isolated_map is just a cache of the
>   set of CPUs isolated in child cpusets, not an independently
>   settable mask, but it needs to be clearly marked as such
>   if so.

Currently the isolated_map is read-only as you have guessed. I did think of the user adding cpus to this map from the cpus_allowed mask but thought the current approach made more sense

> * Some code lines go past column 80.

I need to set my vi to wrap past 80...

> * The name 'isolated' probably won't work. There is already
>   a boottime option "isolcpus=..." for 'isolated' cpus which
>   is (I think ?) rather different. Perhaps a better name will
>   fall out of the conceptual discussion, below.

I was hoping that by the time we are done with this, we would be able to completely get rid of the isolcpus= option. For that of course we need to be able to build domains that don't run load balance

> * The change to the output format of the special cpuset file
>   'cpus', to look like '0-3[4-7]' bothers me in a couple of
>   ways. It complicates the format from being a simple list.
>   And it means that the output format is not the same as the
>   input format (you can't just write back what you read from
>   such a file anymore).

As I had said in my earlier mail, this was just one way of representing what I call isolated cpus.
The other was to expose isolated_map to userspace and move cpus between cpus_allowed and isolated_map

> * Several comments start with the word 'Set', as in:
>     Set isolated ON on a non exclusive cpuset
>   Such wording suggests to me that something is being set,
>   some bit or value changed or turned on. But in each case,
>   you are just testing for some condition that will return
>   or error out. Some phrasing such as "If ..." or other
>   conditional would be clearer.

The wording was from the user's point of view for what action was being done, guess I'll change that

> * The update_sched_domains() routine is complicated, and
>   hence a primary clue that the conceptual model is not
>   clean yet.

It is complicated because it has to handle all of the different possible actions that the user can initiate. It can be simplified if we have stricter rules of what the user can/cannot do w.r.t to isolated cpusets

> * None of this was explained in Documentation/cpusets.txt.

Yes I plan to add the documentation shortly

> * Too bad that cpuset_common_file_write() has to have special
>   logic for this isolated case. The other flag settings just
>   turn on and off the associated bit, and don't trigger any
>   kernel code to adapt to new cpu or memory settings. We
>   should make an exception to that behaviour only if we must,
>   and then we must be explicit about the exception.

See my notes on isolated_map above

> First, let me verify one thing. I understand that the _key_
> purpose of your patch is not so much to isolate cpus, as it
> is to allow for structuring scheduling domains to align with
> cpuset boundaries. I understand real isolated cpus to be ones
> that don't have a scheduling domain (have only the dummy one),
> as requested by the "isolcpus=..." boot flag.

Not really. Isolated cpusets allow you to do a soft-partition of the system, and it would make sense to continue to have load balancing within these partitions.
I would think not having load balancing should be one of the options available

> Second, let me describe how this same issue shows up on the
> memory side.
> ...snip...
>
> In the case of cpus, we really do prefer the partitions to be
> disjoint, because it would be better not to confuse the domain
> scheduler with overlapping domains.

Absolutely one of the problems I had was to map the flat disjoint hierarchy of sched domains to the tree-like hierarchy of cpusets

> In the case of memory, we technically probably don't _have_ to
> keep the partitions disjoint. I doubt that the page allocator
> (mm/page_alloc.c:__alloc_pages()) really cares. It will strive
> valiantly to satisfy the memory request from any of the zones
> (each node specific) in the list passed into it.

I must confess that I haven't looked at the memory side all that much, having more interest in trying to build soft-partitioning of the cpu's

> But for the purposes of providing a clear conceptual model to
> our users, I think it is best that we impose this constraint on
> the memory side as well as on the cpu side
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Mon, 18 Apr 2005, Paul Jackson wrote:
> Hmmm ... interesting patch. My reaction to the changes in
> kernel/cpuset.c are complicated:
>
> * I'm supposed to be on vacation the rest of this month,
>   so trying (entirely unsuccessfully so far) not to think
>   about this.
> * This is perhaps the first non-trivial cpuset patch to come
>   in the last many months from someone other than Simon or
>   myself - welcome.

I'm glad to see this happening.

> This leads to a possible interface. For each of cpus and
> memory, add four per-cpuset control files. Let me take the
> cpu case first.
>
> Add the per-cpuset control files:
> * domain_cpu_current  # read only boolean
> * domain_cpu_pending  # read/write boolean
> * domain_cpu_rebuild  # write only trigger
> * domain_cpu_error    # read only - last error msg

> 4) If the write failed, read the domain_cpu_error file
>    for an explanation.

> Otherwise the write will fail, and an error message explaining
> the problem made available in domain_cpu_error for subsequent
> reading. Just setting errno would be insufficient in this
> case, as the possible reasons for error are too complex to be
> adequately described that way.

I guess we hit a limit of the filesystem-interface approach here. Are the possible failure reasons really that complex ? Is such an error reporting scheme already in use in the kernel ?

I find the two-files approach a bit disturbing -- we have no guarantee that the error we read is the error we produced. If this is only to get a hint, OK.

On the other hand, there's also no guarantee that what we are triggering by writing in domain_cpu_rebuild is what we have set up by writing in domain_cpu_pending. User applications will need a bit of self-discipline.

> The above scheme should significantly reduce the number of
> special cases in the update_sched_domains() routine (which I
> would rename to update_cpu_domains, alongside another one to be
> provided later, update_mem_domains.)
> These new update routines
> will verify that all the preconditions are met, tear down all
> the cpu or mem domains within the scope of the specified cpuset,
> and rebuild them according to the partition defined by the
> pending_*_domain flags on the descendent cpusets. It's the
> same complete rebuild of the partitioning of some subtree,
> each time, without all the special cases for incrementally
> adding and removing cpus or mems from this or that. Complex
> nested if-else-if-else logic is a breeding ground for bugs --
> good riddance.

Oh yes. There's already a good bunch of if-then-else logic in the cpusets because of the different flags that can apply. We don't need more.

> There -- what do you think of this alternative?

Most of all, that you write mails faster than I am able to read them, so I might have missed something. But so far I like your proposal.

Simon.
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Tue, 2005-04-19 at 00:19 -0700, Paul Jackson wrote:
> Nick wrote:
> > It doesn't work if you have *most* jobs bound to either
> > {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
> > to use any CPU from 0-7.
>
> How bad does it not work?
>
> My understanding is that Dinakar's patch did _not_ drive tasks out of
> larger cpusets that included two or more of what he called isolated
> cpusets, I call cpuset domains.
>
> For example:
>
>     System starts up with 8 CPUs and all tasks (except for
>     a few kernel per-cpu daemons) in the root cpuset, able
>     to run on CPUs 0-7.
>
>     Two cpusets, Alpha and Beta are created, where Alpha
>     has CPUs 0-3, and Beta has CPUs 4-7.
>
>     Anytime someone logs in, their login shell and all
>     they run from it are placed in one of Alpha or Beta.
>     The main spawning daemons, such as inetd and cron,
>     are placed in one of Alpha or Beta.
>
>     Only a few daemons that don't do much are left in the
>     root cpuset, able to run across 0-7.
>
> If we tried to partition the sched domains with Alpha and Beta as
> separate domains, what kind of pain do these few daemons in
> the root cpuset, on CPUs 0-7, cause?

They don't cause any pain for the scheduler. They will be *in* some pain because they can't escape from the domain in which they have been placed (unless you do a set_cpus_allowed thingy). So, eg. inetd might start up a million blahd servers, but they'll all be stuck in Alpha even if Beta is completely idle.

> If the pain is too intolerable, then I'd guess not only do we have to
> purge any cpusets superior to the ones determining the domain
> partitioning of _all_ tasks, but we'd also have to invent yet one more
> boolean flag attribute for any such superior cpusets, to mark them as
> _not_ able to allow a task to be attached to them.
> And we'd have to
> refine the hotplug co-existence logic in cpusets, which currently bumps
> a task up to its parent cpuset when all the cpus in its current cpuset
> are hot unplugged, to also rebuild the sched domains to some legal
> configuration, if the parent cpuset was not allowed to have any tasks
> attached.
>
> I'd rather not go there, unless push comes to shove. How hard are
> you pushing?

Well the scheduler simply can't handle it, so it is not so much a matter of pushing - you simply can't use partitioned domains and meaningfully have a cpuset above them.

--
SUSE Labs, Novell Inc.
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Tue, Apr 19, 2005 at 09:44:06AM +1000, Nick Piggin wrote:
> Very good, I was wondering when someone would try to implement this ;)

Thank you for the feedback !

> > -static void __devinit arch_init_sched_domains(void)
> > +static void attach_domains(cpumask_t cpu_map)
> > {
>
> This shouldn't be needed. There should probably just be one place that
> attaches all domains. It is a bit difficult to explain what I mean when
> you have 2 such places below.

Can you explain a bit more, not sure I understand what you mean

> Interface isn't bad. It would seem to be able to handle everything, but
> I think it can be made a bit simpler.
>
>     fn_name(cpumask_t span1, cpumask_t span2)
>
> Yeah? The change_map is implicitly the union of the 2 spans. Also I don't
> really like the name. It doesn't rebuild so much as split (or join). I
> can't think of anything good off the top of my head.

Yeah agreed. It kinda lived on from earlier versions I had

> > +unsigned long flags;
> > +int i;
> > +
> > +local_irq_save(flags);
> > +
> > +for_each_cpu_mask(i, change_map)
> > +spin_lock(&cpu_rq(i)->lock);
> > +
>
> Locking is wrong. And it has changed again in the latest -mm kernel.
> Please diff against that.

I haven't looked at the RCU sched domain changes as yet, but I put this in to address some problems I noticed during stress testing. Basically with the current hotplug code, it is possible to have a scenario like this:

    rebuild domains                 load balance
          |                              |
          |                    take existing sd pointer
    attach to dummy domain               |
          |                    loop thro sched groups
    change sched group info    access invalid pointer and panic

> > +if (!cpus_empty(span1))
> > +build_sched_domains(span1);
> > +if (!cpus_empty(span2))
> > +build_sched_domains(span2);
> > +
>
> You also can't do this - you have to 'offline' the domains first before
> building new ones. See the CPU hotplug code that handles this.
By offline if you mean attach to dummy domain, see above

> This makes a hotplug event destroy your nicely set up isolated domains,
> doesn't it?
>
> This looks like the most difficult problem to overcome. It needs some
> external information to redo the cpuset splits at cpu hotplug time.
> Probably a hotplug handler in the cpusets code might be the best way
> to do that.

Yes I am aware of this. What I have in mind is for the hotplug code in the scheduler to call into the cpusets code. This will just return, say, 1 when cpusets is not compiled in, and the sched code can continue to do what it is doing right now; else the cpusets code will find the leaf cpuset that contains the hotplugged cpu and rebuild the domains accordingly. However the question still remains as to how cpusets should handle this hotplugged cpu

-Dinakar
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Nick wrote:
> That would make sense. I'm not familiar with the workings of cpusets,
> but that would require every task to be assigned to one of these
> sets (or a subset within it), yes?

That's the rub, as I noted a couple of messages ago, while you were writing this message. It doesn't require every task to be in one of these or a subset. Tasks could be in some multiple-domain superset, unless that is so painful that we have to add mechanisms to cpusets to prohibit it.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
> > So you do _not_ want to consider nested sched domains, just disjoint
> > ones. Good.
>
> You don't either? Good. :)

From the point of view of cpusets, I'd rather not think about nested sched domains, for now at least. But my scheduler savvy colleagues on the big SGI boxes may well have ambitions here. I can't speak for them.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Nick wrote: > It doesn't work if you have *most* jobs bound to either > {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed > to use any CPU from 0-7. How bad does it not work? My understanding is that Dinakar's patch did _not_ drive tasks out of larger cpusets that included two or more of what he called isolated cpusets and I call cpuset domains. For example: System starts up with 8 CPUs and all tasks (except for a few kernel per-cpu daemons) in the root cpuset, able to run on CPUs 0-7. Two cpusets, Alpha and Beta, are created, where Alpha has CPUs 0-3, and Beta has CPUs 4-7. Anytime someone logs in, their login shell and all they run from it are placed in one of Alpha or Beta. The main spawning daemons, such as inetd and cron, are placed in one of Alpha or Beta. Only a few daemons that don't do much are left in the root cpuset, able to run across 0-7. If we tried to partition the sched domains with Alpha and Beta as separate domains, what kind of pain do these few daemons in the root cpuset, on CPUs 0-7, cause? If the pain is too intolerable, then I'd guess not only do we have to purge any cpusets superior to the ones determining the domain partitioning of _all_ tasks, but we'd also have to invent yet one more boolean flag attribute for any such superior cpusets, to mark them as _not_ able to allow a task to be attached to them. And we'd have to refine the hotplug co-existence logic in cpusets, which currently bumps a task up to its parent cpuset when all the cpus in its current cpuset are hot unplugged, to also rebuild the sched domains to some legal configuration, if the parent cpuset was not allowed to have any tasks attached. I'd rather not go there, unless push comes to shove. How hard are you pushing? -- I won't rest till it's the best ... 
Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
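Paul's Alpha/Beta scenario boils down to asking which tasks have a cpus_allowed mask that spans more than one proposed domain. A minimal userspace sketch of that check, with uint64_t bitmasks standing in for cpumask_t (the helper name is hypothetical, not from Dinakar's patch):

```c
#include <stdint.h>

/* Illustrative sketch only, not kernel code: given a proposed
 * partition of the CPUs into disjoint domain masks, report whether a
 * task's cpus_allowed mask crosses a domain boundary -- the situation
 * described above for daemons left in the root cpuset, able to run
 * across both Alpha (CPUs 0-3) and Beta (CPUs 4-7). */
static int task_crosses_domains(uint64_t cpus_allowed,
                                const uint64_t *domains, int ndomains)
{
        for (int i = 0; i < ndomains; i++) {
                /* contained entirely within one domain: no crossing */
                if ((cpus_allowed & ~domains[i]) == 0)
                        return 0;
        }
        return 1;
}
```

With Alpha = 0x0f and Beta = 0xf0, a login shell bound to Alpha (mask 0x03) does not cross, while a root-cpuset daemon allowed on 0-7 (mask 0xff) does, and is exactly the kind of task the domain scheduler cannot place cleanly.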
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Mon, 2005-04-18 at 23:59 -0700, Paul Jackson wrote: > Nick wrote: > > Basically you just have to know that it has the > > capability to partition the system in an arbitrary disjoint set > > of sets of cpus. > > > > If you can make use of that, then we're in business ;) > > You read fast ;) > > So you do _not_ want to consider nested sched domains, just disjoint > ones. Good. > You don't either? Good. :) > > > From what I gather, this partitioning does not exactly fit > > the cpusets architecture. Because with cpusets you are specifying > > on what cpus can a set of tasks run, not dividing the whole system. > > My evil scheme, and Dinakar's as well, is to provide a way for the user > to designate _some_ of their cpusets as also defining the partition that > controls which cpus are in each sched domain, and so dividing the > system. > > "partition" == "an arbitrary disjoint set of sets of cpus" > That would make sense. I'm not familiar with the workings of cpusets, but that would require every task to be assigned to one of these sets (or a subset within it), yes? > This fits naturally with the way people use cpusets anyway. They divide > up the system along boundaries that are natural topologically and that > provide a good fit for their jobs, and hope that the kernel will adapt > to such localized placement. They then throw a few more nested (smaller) > cpusets at the problem, to deal with various special needs. If we can > provide them with a means to tell us which of their cpusets define the > natural partitioning of their system, for the job mix and hardware > topology they have, then all is well. > Sounds like a good fit then. I'll touch up the sched-domains side of the equation when I get some time hopefully this week or next. -- SUSE Labs, Novell Inc. 
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Nick wrote: > Basically you just have to know that it has the > capability to partition the system in an arbitrary disjoint set > of sets of cpus. > > If you can make use of that, then we're in business ;) You read fast ;) So you do _not_ want to consider nested sched domains, just disjoint ones. Good. > From what I gather, this partitioning does not exactly fit > the cpusets architecture. Because with cpusets you are specifying > on what cpus can a set of tasks run, not dividing the whole system. My evil scheme, and Dinakar's as well, is to provide a way for the user to designate _some_ of their cpusets as also defining the partition that controls which cpus are in each sched domain, and so dividing the system. "partition" == "an arbitrary disjoint set of sets of cpus" This fits naturally with the way people use cpusets anyway. They divide up the system along boundaries that are natural topologically and that provide a good fit for their jobs, and hope that the kernel will adapt to such localized placement. They then throw a few more nested (smaller) cpusets at the problem, to deal with various special needs. If we can provide them with a means to tell us which of their cpusets define the natural partitioning of their system, for the job mix and hardware topology they have, then all is well. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
On Mon, 2005-04-18 at 22:54 -0700, Paul Jackson wrote: > Now, onto the real stuff. > > This same issue, in a strange way, comes up on the memory side, > as well as on the cpu side. > > First, let me verify one thing. I understand that the _key_ > purpose of your patch is not so much to isolate cpus, as it > is to allow for structuring scheduling domains to align with > cpuset boundaries. I understand real isolated cpus to be ones > that don't have a scheduling domain (have only the dummy one), > as requested by the "isolcpus=..." boot flag. > Yes. > The following code snippet from kernel/sched.c is what I derive > this understanding from: > Correct. A better name instead of isolated cpusets may be 'partitioned cpusets' or somesuch. On the other hand, it is more or less equivalent to a single isolated CPU. Instead of an isolated cpu, you have an isolated cpuset. Though I imagine this becomes a complete superset of the isolcpus= functionality, and it would actually be easier to manage a single isolated CPU and its associated processes with the cpusets interfaces after this. > In both cases, we have an intermediate degree of partitioning > of a system, neither at the most detailed leaf cpuset, nor at > the all encompassing top cpuset. And in both cases, we want > to partition the system, along cpuset boundaries. > Yep. This sched-domains partitioning only works when you have more than one completely disjoint top level cpusets. That is, you effectively partition the CPUs. It doesn't work if you have *most* jobs bound to either {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed to use any CPU from 0-7. > Here I use "partition" in the mathematical sense: > > === > A partition of a set X is a set of nonempty subsets of X such > that every element x in X is in exactly one of these subsets. > > Equivalently, a set P of subsets of X, is a partition of X if > > 1. No element of P is empty. > 2. The union of the elements of P is equal to X. 
(We say the > elements of P cover X.) > 3. The intersection of any two elements of P is empty. (We say > the elements of P are pairwise disjoint.) > > http://www.absoluteastronomy.com/encyclopedia/p/pa/partition_of_a_set.htm > === > > In the case of cpus, we really do prefer the partitions to be > disjoint, because it would be better not to confuse the domain > scheduler with overlapping domains. > Yes. The domain scheduler can't handle this at all, it would have to fall back on cpus_allowed, which in turn can create big problems for multiprocessor balancing. > For the cpu case, we would provide a scheduler domain for each > subset of the cpu partitioning. > Yes. [snip the rest, which I didn't finish reading :P] From what I gather, this partitioning does not exactly fit the cpusets architecture. Because with cpusets you are specifying on what cpus can a set of tasks run, not dividing the whole system. Basically for the sched-domains code to be happy, there should be some top level entity (whether it be cpusets or something else) which records your current partitioning (the default being one set, containing all cpus). > As stated above, there is a single system wide partition of > cpus, and another of mems. I suspect we should consider finding > a way to nest partitions. My (shaky) understanding of what > Nick is doing with scheduler domains is that for the biggest of > systems, we will probably want little scheduler domains inside > bigger ones. The sched-domains setup code will take care of all that for you already. It won't know or care about the partitions. If you partition a 64-way system into 2 32-ways, the domain setup code will just think it is setting up a 32-way system. Don't worry about the sched-domains side of things at all, that's pretty easy. Basically you just have to know that it has the capability to partition the system in an arbitrary disjoint set of sets of cpus. If you can make use of that, then we're in business ;) -- SUSE Labs, Novell Inc. 
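The partition definition quoted above (nonempty elements, pairwise disjoint, union covering X) can be sketched as a small check. This is illustrative only, using uint64_t bitmasks rather than the kernel's cpumask_t API:

```c
#include <stdint.h>

/* Sketch of the partition-of-a-set property, with sets of CPUs
 * modeled as uint64_t bitmasks.  Returns 1 iff the n masks in
 * 'parts' form a partition of 'universe': every element nonempty,
 * pairwise disjoint, and their union equal to the universe. */
static int is_partition(uint64_t universe, const uint64_t *parts, int n)
{
        uint64_t covered = 0;

        for (int i = 0; i < n; i++) {
                if (parts[i] == 0)          /* 1. no element is empty */
                        return 0;
                if (parts[i] & covered)     /* 3. pairwise disjoint */
                        return 0;
                covered |= parts[i];
        }
        return covered == universe;         /* 2. the union covers X */
}
```

For an 8-CPU system, { 0x0f, 0xf0 } (CPUs 0-3 and 4-7) is a partition; an overlapping or gappy pair is not, which is exactly the property the domain scheduler needs.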
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Hmmm ... interesting patch. My reaction to the changes in kernel/cpuset.c is complicated: * I'm supposed to be on vacation the rest of this month, so trying (entirely unsuccessfully so far) not to think about this. * This is perhaps the first non-trivial cpuset patch to come in the last many months from someone other than Simon or myself - welcome. * Some coding style and comment details will need work. * The conceptual model for how to represent this in cpusets needs some work. Let me do two things in this reply. First I'll just shoot off, shotgun style, the nit-picking coding and comment details that I notice, in a scan of the patch. Then I will step back to a discussion of the conceptual model. I suspect that by the time we nail the conceptual model, the code will be sufficiently rewritten that most of the coding and comment nits will no longer apply anyway. But, since nit picking is easier than real thinking ... * I'd probably ditch the all_cpus() macro, on the concern that it obfuscates more than it helps. * The need for _both_ a per-cpuset flag 'CS_CPU_ISOLATED' and another per-cpuset mask 'isolated_map' concerns me. I guess that the isolated_map is just a cache of the set of CPUs isolated in child cpusets, not an independently settable mask, but it needs to be clearly marked as such if so. * Some code lines go past column 80. * The name 'isolated' probably won't work. There is already a boot-time option "isolcpus=..." for 'isolated' cpus which is (I think?) rather different. Perhaps a better name will fall out of the conceptual discussion, below. * The change to the output format of the special cpuset file 'cpus', to look like '0-3[4-7]' bothers me in a couple of ways. It complicates the format from being a simple list. And it means that the output format is not the same as the input format (you can't just write back what you read from such a file anymore). 
* Several comments start with the word 'Set', as in: Set isolated ON on a non exclusive cpuset Such wording suggests to me that something is being set, some bit or value changed or turned on. But in each case, you are just testing for some condition that will return or error out. Some phrasing such as "If ..." or other conditional would be clearer. * The update_sched_domains() routine is complicated, and hence a primary clue that the conceptual model is not clean yet. * None of this was explained in Documentation/cpusets.txt. * Too bad that cpuset_common_file_write() has to have special logic for this isolated case. The other flag settings just turn on and off the associated bit, and don't trigger any kernel code to adapt to new cpu or memory settings. We should make an exception to that behaviour only if we must, and then we must be explicit about the exception. Ok - enough nits. Now, onto the real stuff. This same issue, in a strange way, comes up on the memory side, as well as on the cpu side. First, let me verify one thing. I understand that the _key_ purpose of your patch is not so much to isolate cpus, as it is to allow for structuring scheduling domains to align with cpuset boundaries. I understand real isolated cpus to be ones that don't have a scheduling domain (have only the dummy one), as requested by the "isolcpus=..." boot flag. The following code snippet from kernel/sched.c is what I derive this understanding from: === static void __devinit arch_init_sched_domains(void) { ... /* * Setup mask for cpus without special case scheduling requirements. * For now this just excludes isolated cpus, but could be used to * exclude other special cases in the future. */ cpus_complement(cpu_default_map, cpu_isolated_map); cpus_and(cpu_default_map, cpu_default_map, cpu_online_map); /* * Set up domains. Isolated domains just stay on the dummy domain. */ for_each_cpu_mask(i, cpu_default_map) { ... 
=== Second, let me describe how this same issue shows up on the memory side. Let's say, for example, someone has partitioned a large system (100's of cpus and nodes) in two major halves using cpusets, each half being used by a different organization. On one of the halves, they are running a large scientific program that works on a huge data set that just fits in the memory available on that half, and they are running a set of related tools that run different passes over that data. Some of these tools might take several cpus, running parallel threads, and using a little more data shared by the threads in that tool. Each of these tools might get its own cpuset, a child (subset) of the big cpuset that defines the half of the system that this large scientific program is running within. The big dataset has to be constrained to the big cpus
Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
Dinakar Guniguntala wrote: Here's an attempt at dynamic sched domains aka isolated cpusets Very good, I was wondering when someone would try to implement this ;) It needs some work. A few initial comments on the kernel/sched.c change - sorry, don't have too much time right now... --- linux-2.6.12-rc1-mm1.orig/kernel/sched.c 2005-04-18 00:46:40.0 +0530 +++ linux-2.6.12-rc1-mm1/kernel/sched.c 2005-04-18 00:47:55.0 +0530 @@ -4895,40 +4895,41 @@ static void check_sibling_maps(void) } #endif -/* - * Set up scheduler domains and groups. Callers must hold the hotplug lock. - */ -static void __devinit arch_init_sched_domains(void) +static void attach_domains(cpumask_t cpu_map) { This shouldn't be needed. There should probably just be one place that attaches all domains. It is a bit difficult to explain what I mean when you have 2 such places below. [...] +void rebuild_sched_domains(cpumask_t change_map, cpumask_t span1, cpumask_t span2) +{ Interface isn't bad. It would seem to be able to handle everything, but I think it can be made a bit simpler. fn_name(cpumask_t span1, cpumask_t span2) Yeah? The change_map is implicitly the union of the 2 spans. Also I don't really like the name. It doesn't rebuild so much as split (or join). I can't think of anything good off the top of my head. + unsigned long flags; + int i; + + local_irq_save(flags); + + for_each_cpu_mask(i, change_map) + spin_lock(&cpu_rq(i)->lock); + Locking is wrong. And it has changed again in the latest -mm kernel. Please diff against that. + if (!cpus_empty(span1)) + build_sched_domains(span1); + if (!cpus_empty(span2)) + build_sched_domains(span2); + You also can't do this - you have to 'offline' the domains first before building new ones. See the CPU hotplug code that handles this. [...] 
@@ -5046,13 +5082,13 @@ static int update_sched_domains(struct n unsigned long action, void *hcpu) { int i; + cpumask_t temp_map, hotcpu = cpumask_of_cpu((long)hcpu); switch (action) { case CPU_UP_PREPARE: case CPU_DOWN_PREPARE: - for_each_online_cpu(i) - cpu_attach_domain(&sched_domain_dummy, i); - arch_destroy_sched_domains(); + cpus_andnot(temp_map, cpu_online_map, hotcpu); + rebuild_sched_domains(cpu_online_map, temp_map, CPU_MASK_NONE); This makes a hotplug event destroy your nicely set up isolated domains, doesn't it? This looks like the most difficult problem to overcome. It needs some external information to redo the cpuset splits at cpu hotplug time. Probably a hotplug handler in the cpusets code might be the best way to do that. -- SUSE Labs, Novell Inc.
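Nick's suggested simplification - pass only the two new spans and derive change_map as their union - might look like the following sketch. The function name and uint64_t masks are illustrative, not the actual kernel API, and the offline/rebuild steps are elided:

```c
#include <stdint.h>

/* Sketch of the two-span interface: the caller passes only the two
 * new domain spans, and the set of cpus whose domains must be torn
 * down and rebuilt is computed as their union rather than passed as
 * a third argument.  Returns the implicit change_map. */
static uint64_t split_sched_domains(uint64_t span1, uint64_t span2)
{
        uint64_t change_map = span1 | span2;   /* implicit change_map */

        /* ... offline the old domains covering change_map first,
         * then build new domains over span1 and span2 (skipping
         * whichever is empty), as the hotplug path does ... */
        return change_map;
}
```

An empty second span then naturally expresses the "join" case, where one span is rebuilt as a single domain.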
[RFC PATCH] Dynamic sched domains aka Isolated cpusets
Here's an attempt at dynamic sched domains aka isolated cpusets o This functionality is on top of CPUSETs and provides a way to completely isolate any set of CPUs dynamically. o There is a new cpu_isolated flag that allows users to convert an exclusive cpuset to an isolated one o The isolated CPUs are part of their own sched domain. This ensures that the rebalance code works within the domain, prevents overhead due to a cpu trying to pull tasks only to find that its cpus_allowed mask does not allow it to be pulled. However it does not kick existing processes off the isolated domain o There is very little code change in the scheduler sched domain code. Most of it is just splitting up of the arch_init_sched_domains code to be called dynamically instead of only at boot time. It has only one API which takes in the map of all cpus affected and the two new domains to be built rebuild_sched_domains(cpumask_t change_map, cpumask_t span1, cpumask_t span2) There are some things that may/will change o This has been tested only on x86 [8 way -> 4 way with HT]. Still needs work on other arch's o I didn't get a chance to see Nick Piggin's RCU sched domains code as yet, but I know there would be changes here because of that... o This does not support CPU hotplug as yet o Making a cpuset isolated manipulates its parent cpus_allowed mask. When viewed from userspace this is represented as follows [EMAIL PROTECTED] cpusets] cat cpus 0-3[4-7] This indicates that CPUs 4-7 are isolated and is/are part of some child cpuset/s Appreciate any feedback. Patch against linux-2.6.12-rc1-mm1. 
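The '0-3[4-7]' display described above - the parent's remaining cpus_allowed followed by its isolated_map in brackets - could be produced by something like this userspace sketch. The helpers are hypothetical, with uint64_t masks standing in for cpumask_t, not the kernel's cpulist formatting code:

```c
#include <stdint.h>
#include <stdio.h>

/* Append a comma-separated list of cpu ranges ("0-3", "0,5-7", ...)
 * for the bits set in 'mask'; returns the new end of the buffer. */
static char *format_range(char *p, uint64_t mask)
{
        int cpu = 0, first = 1;

        while (cpu < 64) {
                if (!(mask & (1ULL << cpu))) { cpu++; continue; }
                int start = cpu;
                while (cpu < 64 && (mask & (1ULL << cpu)))
                        cpu++;
                p += sprintf(p, "%s%d", first ? "" : ",", start);
                if (cpu - 1 > start)
                        p += sprintf(p, "-%d", cpu - 1);
                first = 0;
        }
        return p;
}

/* Render cpus_allowed, then isolated_map in brackets if nonempty,
 * giving the "0-3[4-7]" form shown in the patch description. */
static void format_cpus(char *buf, uint64_t cpus_allowed, uint64_t isolated_map)
{
        char *p = format_range(buf, cpus_allowed);
        if (isolated_map) {
                *p++ = '[';
                p = format_range(p, isolated_map);
                *p++ = ']';
        }
        *p = '\0';
}
```

As Paul notes later in the thread, this format is no longer a simple list and cannot be written back verbatim, which is one of the objections to it.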
include/linux/init.h |2 include/linux/sched.h |1 kernel/cpuset.c | 141 -- kernel/sched.c| 109 +- 4 files changed, 213 insertions(+), 40 deletions(-) -Dinakar diff -Naurp linux-2.6.12-rc1-mm1.orig/include/linux/init.h linux-2.6.12-rc1-mm1/include/linux/init.h --- linux-2.6.12-rc1-mm1.orig/include/linux/init.h 2005-03-18 07:03:49.0 +0530 +++ linux-2.6.12-rc1-mm1/include/linux/init.h 2005-04-18 00:48:26.0 +0530 @@ -217,7 +217,7 @@ void __init parse_early_param(void); #define __initdata_or_module __initdata #endif /*CONFIG_MODULES*/ -#ifdef CONFIG_HOTPLUG +#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS) #define __devinit #define __devinitdata #define __devexit diff -Naurp linux-2.6.12-rc1-mm1.orig/include/linux/sched.h linux-2.6.12-rc1-mm1/include/linux/sched.h --- linux-2.6.12-rc1-mm1.orig/include/linux/sched.h 2005-04-18 00:46:40.0 +0530 +++ linux-2.6.12-rc1-mm1/include/linux/sched.h 2005-04-18 00:48:19.0 +0530 @@ -155,6 +155,7 @@ typedef struct task_struct task_t; extern void sched_init(void); extern void sched_init_smp(void); extern void init_idle(task_t *idle, int cpu); +extern void rebuild_sched_domains(cpumask_t change_map, cpumask_t span1, cpumask_t span2); extern cpumask_t nohz_cpu_mask; diff -Naurp linux-2.6.12-rc1-mm1.orig/kernel/cpuset.c linux-2.6.12-rc1-mm1/kernel/cpuset.c --- linux-2.6.12-rc1-mm1.orig/kernel/cpuset.c 2005-04-18 00:46:40.0 +0530 +++ linux-2.6.12-rc1-mm1/kernel/cpuset.c2005-04-18 00:51:48.0 +0530 @@ -55,9 +55,17 @@ #define CPUSET_SUPER_MAGIC 0x27e0eb +#define all_cpus(cs) \ +({ \ + cpumask_t __tmp_map;\ + cpus_or(__tmp_map, cs->cpus_allowed, cs->isolated_map); \ + __tmp_map; \ +}) + struct cpuset { unsigned long flags;/* "unsigned long" so bitops work */ cpumask_t cpus_allowed; /* CPUs allowed to tasks in cpuset */ + cpumask_t isolated_map; /* CPUs associated with a sched domain */ nodemask_t mems_allowed;/* Memory Nodes allowed to tasks */ atomic_t count; /* count tasks using this cpuset */ @@ -82,6 +90,7 @@ struct cpuset { /* bits 
in struct cpuset flags field */ typedef enum { CS_CPU_EXCLUSIVE, + CS_CPU_ISOLATED, CS_MEM_EXCLUSIVE, CS_REMOVED, CS_NOTIFY_ON_RELEASE @@ -93,6 +102,11 @@ static inline int is_cpu_exclusive(const return !!test_bit(CS_CPU_EXCLUSIVE, &cs->flags); } +static inline int is_cpu_isolated(const struct cpuset *cs) +{ + return !!test_bit(CS_CPU_ISOLATED, &cs->flags); +} + static inline int is_mem_exclusive(const struct cpuset *cs) { return !!test_bit(CS_MEM_EXCLUSIVE, &cs->flags); @@ -127,8 +141,9 @@ static inline int notify_on_release(cons static atomic_t cpuset_mems_generation = ATOMIC_INIT(1); static struct cpuset top_cpuset = { - .flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIV
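Earlier in the thread, the validation rule "an isolated cpuset has to be exclusive" is written as returning -EINVAL when !(is_cpu_isolated(trial) <= is_cpu_exclusive(trial)). The implication itself, as a standalone sketch with plain ints rather than the test_bit-based helpers above:

```c
/* Sketch of the validation rule: isolated implies exclusive.
 * Returns 1 when the combination of flags is legal, 0 when the
 * cpuset claims to be isolated without being cpu-exclusive
 * (the case that should fail with -EINVAL). */
static int isolated_implies_exclusive(int is_isolated, int is_exclusive)
{
        /* equivalent to: is_isolated <= is_exclusive, for 0/1 flags */
        return !is_isolated || is_exclusive;
}
```

Only the isolated-but-not-exclusive combination is rejected; the other three flag combinations pass.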