Re: Tiny cpusets -- cpusets for small systems?
Paul Jackson wrote:
>> So. I see cpusets as a higher-level API/mechanism and cpu_isolated_map as a
>> lower-level mechanism that actually makes the kernel aware of what's isolated
>> and what's not. Kind of like the sched domain/cpuset relationship, ie cpusets
>> affect sched domains but the scheduler does not use cpusets directly.
>
> One could use cpusets to control the setting of cpu_isolated_map, separate
> from the code such as your select_irq_affinity() that uses it.

Yes. That's what I proposed too, in one of the CPU isolation threads with Peter.
The only issue is that you need to simulate a CPU_DOWN hotplug event in order
to clean up what's already running on those CPUs.

>> In the foreseeable future 2-8 cores will be the most common configuration.
>> Do you think that cpusets are needed/useful for those machines?
>> The reason I'm asking is because, given the restrictions you mentioned
>> above, it seems that you might as well just do
>>     taskset -c 1,2,3 app1
>>     taskset -c 3,4,5 app2
>
> People tend to manage the CPU and memory placement of the threads and
> processes within a single co-operating job using taskset (sched_setaffinity)
> and numactl (mbind, set_mempolicy). They tend to manage the placement of
> multiple unrelated jobs onto a single system, whether on separate or shared
> CPUs and nodes, using cpusets.
>
> Something like cpu_isolated_map looks to me like a system-wide mechanism,
> which should, like sched_domains, be managed system-wide. Managing it with a
> mechanism that encourages each thread to update it directly, as if that
> thread owned the system, will break down, resulting in conflicting updates,
> as multiple, insufficiently co-operating threads issue conflicting settings.

I'm not sure how to interpret that. I think you might have mixed a couple of
things I asked about in one reply ;-). The question was: given the
restrictions you talked about when you explained the tiny-cpusets
functionality, how much does one gain from using them compared to
taskset/numactl? ie On machines with 2-8 cores it's fairly easy to manage
cpus with simple affinity masks.

The second part of your reply seems to imply that I somehow made you think I
suggested that cpu_isolated_map be managed per thread. That is of course not
the case. It's definitely a system-wide mechanism and individual threads have
nothing to do with it. btw, I just re-read my previous reply; I definitely
did not say anything about threads managing cpu_isolated_map :).

>> Stuff that I'm working on these days (wireless basestations) is designed
>> with the following model:
>>     cpuN             - runs soft-RT networking and management code
>>     cpuN+1 to cpuN+x - are used as dedicated engines
>> ie The simplest example would be
>>     cpu0 - runs IP, L2 and control plane
>>     cpu1 - runs hard-RT MAC
>>
>> So if CPU isolation is implemented on top of the cpusets, what kind of API
>> do you envision for such an app?
>
> That depends on what more API is needed. Do we need to place irqs better ...
> cpusets might not be a natural fit for that use. Aren't irqs directed to
> specific CPUs, not to hierarchically nested subsets of CPUs?

You clipped the part where I elaborated, which was:

>> So if CPU isolation is implemented on top of the cpusets, what kind of API
>> do you envision for such an app? I mean currently cpusets seem to deal
>> mostly with entire processes, whereas in this case we're really dealing
>> with threads. ie Different threads of the same process require different
>> policies; some must run on isolated cpus, some must not. I guess one could
>> write a thread's pid into the cpusets fs, but that's not very convenient.
>> pthread_set_affinity() is exactly what's needed.

In other words, how would an app place its individual threads into different
cpusets? The IRQ stuff is separate; like we said above, cpusets could simply
update cpu_isolated_map, which would take care of IRQs. I was talking
specifically about the thread management.

> Separate question: Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x
> even appear as general purpose systems running a Linux kernel in your
> systems? These dedicated engines seem more like intelligent devices to me,
> such as disk controllers, which the kernel controls via device drivers, not
> by loading itself on them too.

We still want to be able to run normal threads on them, which means IPI,
memory management, etc. are still needed. So yes, they had better show up as
normal CPUs :) Also, with dynamic isolation you can, for example, un-isolate
a cpu when you're compiling stuff on the machine and then isolate it when
you're running special app(s).

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Tiny cpusets -- cpusets for small systems?
> So. I see cpusets as a higher level API/mechanism and cpu_isolated_map as
> lower level mechanism that actually makes kernel aware of what's isolated
> what's not. Kind of like sched domain/cpuset relationship. ie cpusets
> affect sched domains but scheduler does not use cpusets directly.

One could use cpusets to control the setting of cpu_isolated_map, separate
from the code such as your select_irq_affinity() that uses it.

> In the foreseeable future 2-8 cores will be the most common configuration.
> Do you think that cpusets are needed/useful for those machines?
> The reason I'm asking is because given the restrictions you mentioned
> above it seems that you might as well just do
>     taskset -c 1,2,3 app1
>     taskset -c 3,4,5 app2

People tend to manage the CPU and memory placement of the threads and
processes within a single co-operating job using taskset (sched_setaffinity)
and numactl (mbind, set_mempolicy). They tend to manage the placement of
multiple unrelated jobs onto a single system, whether on separate or shared
CPUs and nodes, using cpusets.

Something like cpu_isolated_map looks to me like a system-wide mechanism,
which should, like sched_domains, be managed system-wide. Managing it with a
mechanism that encourages each thread to update it directly, as if that
thread owned the system, will break down, resulting in conflicting updates,
as multiple, insufficiently co-operating threads issue conflicting settings.

> Stuff that I'm working on these days (wireless basestations) is designed
> with the following model:
>     cpuN             - runs soft-RT networking and management code
>     cpuN+1 to cpuN+x - are used as dedicated engines
> ie The simplest example would be
>     cpu0 - runs IP, L2 and control plane
>     cpu1 - runs hard-RT MAC
>
> So if CPU isolation is implemented on top of the cpusets, what kind of API
> do you envision for such an app?

That depends on what more API is needed. Do we need to place irqs better ...
cpusets might not be a natural fit for that use. Aren't irqs directed to
specific CPUs, not to hierarchically nested subsets of CPUs?

Separate question: Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x
even appear as general purpose systems running a Linux kernel in your
systems? These dedicated engines seem more like intelligent devices to me,
such as disk controllers, which the kernel controls via device drivers, not
by loading itself on them too.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
Re: Tiny cpusets -- cpusets for small systems?
Hi Paul,

> Looking at some IA64 sn2 config builds I have laying about, I see the
> following text sizes for a couple of versions, showing the growth of
> the cpuset/cgroup apparatus over time:
>
>     25933  2.6.18-rc3-mm1/kernel/cpuset.o  (Aug 2006)
> vs.
>     37823  2.6.25-rc2-mm1/kernel/cgroup.o  (Feb 2008)
>     19558  2.6.25-rc2-mm1/kernel/cpuset.o
>
> So the total has grown from 25933 to 57381 text bytes (note that
> this is IA64 arch; most arch's will have proportionately smaller
> text sizes.)

Hm, interesting. But unfortunately cpusets have more dependencies than just
text size (i.e. CONFIG_SMP). Worse, some embedded CPUs have poor or no atomic
instruction support; on those, turning on CONFIG_SMP becomes a large
performance regression ;)

I am not an embedded engineer any more, so I might be mistaken (BTW: I am a
large-server engineer now), but ignoring that dependency may be wrong.

Pavel, what do you think?

- kosaki
Re: Tiny cpusets -- cpusets for small systems?
Hi Paul,

> A couple of proposals have been made recently by people working Linux
> on smaller systems, for improving realtime isolation and memory
> pressure handling:
>
> (1) cpu isolation for hard(er) realtime
>     http://lkml.org/lkml/2008/2/21/517
>     Max Krasnyanskiy <[EMAIL PROTECTED]>
>     [PATCH sched-devel 0/7] CPU isolation extensions
>
> (2) notify user space of tight memory
>     http://lkml.org/lkml/2008/2/9/144
>     KOSAKI Motohiro <[EMAIL PROTECTED]>
>     [PATCH 0/8][for -mm] mem_notify v6
>
> In both cases, some of us have responded "why not use cpusets", and the
> original submitters have replied "cpusets are too fat" (well, they
> were more diplomatic than that, but I guess I can say that ;)

My primary issue with cpusets (from the CPU isolation perspective, that is)
was not the fatness. I did make a couple of comments like "On a dual-cpu box
I do not need cpusets to manage the CPUs", but that's not directly related to
the CPU isolation. For the CPU isolation in particular I need code like this:

	int select_irq_affinity(unsigned int irq)
	{
		cpumask_t usable_cpus;

		cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
		irq_desc[irq].affinity = usable_cpus;
		irq_desc[irq].chip->set_affinity(irq, usable_cpus);
		return 0;
	}

How would you implement that with cpusets? I haven't seen your patches, but
I'd imagine that they will still need locks and iterators for the
"Is CPU N isolated?" functionality.

So. I see cpusets as a higher-level API/mechanism and cpu_isolated_map as a
lower-level mechanism that actually makes the kernel aware of what's isolated
and what's not. Kind of like the sched domain/cpuset relationship, ie cpusets
affect sched domains but the scheduler does not use cpusets directly.

> I wonder if there might be room for a "tiny cpusets" configuration option:
>  * provide the same hooks to the rest of the kernel, and
>  * provide the same syntactic interface to user space, but
>  * with more limited semantics.
>
> The primary semantic limit I'd suggest would be supporting exactly
> one layer depth of cpusets, not a full hierarchy. So one could still
> successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
> to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of
> very early FAT file systems, which had just a single, fixed size
> root directory ;). There might even be a configurable fixed upper
> limit on how many /dev/cpuset/* directories were allowed, further
> simplifying the locking and dynamic memory behavior of this apparatus.

In the foreseeable future 2-8 cores will be the most common configuration.
Do you think that cpusets are needed/useful for those machines? The reason
I'm asking is because, given the restrictions you mentioned above, it seems
that you might as well just do

	taskset -c 1,2,3 app1
	taskset -c 3,4,5 app2

Yes, it's not quite the same of course, but imo it covers most cases. That's
what we do on 2-4 cores these days, and we are quite happy with that. ie We
either let the specialized apps manage their thread affinities themselves or
use "taskset" to manage the apps.

> User space would see the same API, except that some valid operations
> on full cpusets, such as a nested mkdir, would fail on tiny cpusets.

Speaking of the user-space API: I guess this is not directly related to the
tiny-cpusets proposal but rather to cpusets in general. Stuff that I'm
working on these days (wireless basestations) is designed with the following
model:

	cpuN             - runs soft-RT networking and management code
	cpuN+1 to cpuN+x - are used as dedicated engines

ie The simplest example would be

	cpu0 - runs IP, L2 and control plane
	cpu1 - runs hard-RT MAC

So if CPU isolation is implemented on top of the cpusets, what kind of API do
you envision for such an app? I mean currently cpusets seem to deal mostly
with entire processes, whereas in this case we're really dealing with
threads. ie Different threads of the same process require different policies;
some must run on isolated cpus, some must not. I guess one could write a
thread's pid into the cpusets fs, but that's not very convenient.
pthread_set_affinity() is exactly what's needed.

Personally I do not see much use for cpusets for those kinds of designs, but
maybe I'm missing something. I got really excited when cpusets were first
merged into mainline, but after looking closer I could not really find a use
for them, at least not for our apps.

Max
Re: Tiny cpusets -- cpusets for small systems?
Paul M wrote:
> I don't think that either of these would be enough to justify big
> changes to cpusets or cgroups, although eliminating bloat is always a
> good thing.

My "tiny cpuset" idea doesn't so much eliminate bloat as provide a thin
alternative alongside the existing fat alternative. So far as kernel source
goes, it would get bigger, not smaller, with now two CONFIG choices for
cpusets, fat or tiny.

The odds are, however, given that one of us has just promised not to code
this, and the other of us doesn't figure it's worth it, that this idea will
not live long. Someone would have to step up from the embedded side with a
coded version that saved a nice chunk of memory (from their perspective) to
get this off the ground, and there's no telling whether even that would meet
with a warm reception.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
Re: Tiny cpusets -- cpusets for small systems?
On Sat, Feb 23, 2008 at 4:09 AM, Paul Jackson <[EMAIL PROTECTED]> wrote:
> A couple of proposals have been made recently by people working Linux
> on smaller systems, for improving realtime isolation and memory
> pressure handling:
>
> (1) cpu isolation for hard(er) realtime
>     http://lkml.org/lkml/2008/2/21/517
>     Max Krasnyanskiy <[EMAIL PROTECTED]>
>     [PATCH sched-devel 0/7] CPU isolation extensions
>
> (2) notify user space of tight memory
>     http://lkml.org/lkml/2008/2/9/144
>     KOSAKI Motohiro <[EMAIL PROTECTED]>
>     [PATCH 0/8][for -mm] mem_notify v6
>
> In both cases, some of us have responded "why not use cpusets", and the
> original submitters have replied "cpusets are too fat" (well, they
> were more diplomatic than that, but I guess I can say that ;)

Having read those threads, it looks to me as though:
- the parts of Max's problem that would be solved by cpusets can be mostly
  accomplished just via sched_setaffinity()
- Motohiro wants to add a new system-wide API that you would also like to
  have available on a per-cpuset basis. (Why not just add two access points
  for the same feature?)

I don't think that either of these would be enough to justify big changes to
cpusets or cgroups, although eliminating bloat is always a good thing.

> The primary semantic limit I'd suggest would be supporting exactly
> one layer depth of cpusets, not a full hierarchy. So one could still
> successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
> to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of
> very early FAT file systems, which had just a single, fixed size
> root directory ;). There might even be a configurable fixed upper
> limit on how many /dev/cpuset/* directories were allowed, further
> simplifying the locking and dynamic memory behavior of this apparatus.

I'm not sure that either of these would make much difference to the overall
footprint. A single layer of cpusets would allow you to simplify
validate_change() but not much else. I don't see how a fixed upper limit on
the number of cpusets makes the locking sufficiently simpler to save much
code.

> How this extends to cgroups I don't know; for now I suspect that most
> cgroup module development is motivated by the needs of larger systems,
> not smaller systems. However, cpusets is now a module client of
> cgroups, and it is cgroups that now provides cpusets with its interface
> to the vfs infrastructure. It would seem unfortunate if this relation
> was not continued with tiny cpusets. Perhaps someone can imagine a tiny
> cgroups? This might be the most difficult part of this proposal.

If we wanted to go this way, I can imagine a cgroups config option that
forces just a single hierarchy, which would allow a bunch of simplifications
that would save plenty of text.

> Looking at some IA64 sn2 config builds I have laying about, I see the
> following text sizes for a couple of versions, showing the growth of
> the cpuset/cgroup apparatus over time:
>
>     25933  2.6.18-rc3-mm1/kernel/cpuset.o  (Aug 2006)
> vs.
>     37823  2.6.25-rc2-mm1/kernel/cgroup.o  (Feb 2008)
>     19558  2.6.25-rc2-mm1/kernel/cpuset.o
>
> So the total has grown from 25933 to 57381 text bytes (note that
> this is IA64 arch; most arch's will have proportionately smaller
> text sizes.)

On x86_64 they're:
    17348  cgroup.o
     8533  cpuset.o

Paul
Tiny cpusets -- cpusets for small systems?
A couple of proposals have been made recently by people working Linux on
smaller systems, for improving realtime isolation and memory pressure
handling:

(1) cpu isolation for hard(er) realtime
    http://lkml.org/lkml/2008/2/21/517
    Max Krasnyanskiy <[EMAIL PROTECTED]>
    [PATCH sched-devel 0/7] CPU isolation extensions

(2) notify user space of tight memory
    http://lkml.org/lkml/2008/2/9/144
    KOSAKI Motohiro <[EMAIL PROTECTED]>
    [PATCH 0/8][for -mm] mem_notify v6

In both cases, some of us have responded "why not use cpusets", and the
original submitters have replied "cpusets are too fat" (well, they were more
diplomatic than that, but I guess I can say that ;)

I wonder if there might be room for a "tiny cpusets" configuration option:
 * provide the same hooks to the rest of the kernel, and
 * provide the same syntactic interface to user space, but
 * with more limited semantics.

The primary semantic limit I'd suggest would be supporting exactly one layer
depth of cpusets, not a full hierarchy. So one could still successfully issue
from user space 'mkdir /dev/cpuset/foo', but trying to do
'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of very early FAT
file systems, which had just a single, fixed size root directory ;). There
might even be a configurable fixed upper limit on how many /dev/cpuset/*
directories were allowed, further simplifying the locking and dynamic memory
behavior of this apparatus.

Some other features that aren't so easy to implement, and which have less
value on small systems, such as notify_on_release, could also be stubbed out
and always disabled, simply returning an error if requested to be enabled
from user space. The recent, chunky piece of code needed to compute dynamic
sched domains from the cpuset hierarchy probably admits of a simpler variant
in the tiny cpuset configuration.
I suppose it would still be a vfs-based pseudo file system (even embedded
Linux still has that infrastructure), except that the vfs operator functions
could be simpler, as this would really be just a flat set of cpumask_t's and
nodemask_t's at the core of the implementation, not an arbitrarily nested
hierarchy of them. See further my comments on cgroups, below.

The rest of the kernel would see no difference ... except that some of the
cpuset_*() hooks would return more quickly. This tiny cpuset option would
provide the same kernel hooks as are now provided by the defines and inline
stubs, in the "#else" to "#endif" half of the "#ifdef CONFIG_CPUSETS" code
lines in linux/cpuset.h.

User space would see the same API, except that some valid operations on full
cpusets, such as a nested mkdir, would fail on tiny cpusets.

How this extends to cgroups I don't know; for now I suspect that most cgroup
module development is motivated by the needs of larger systems, not smaller
systems. However, cpusets is now a module client of cgroups, and it is
cgroups that now provides cpusets with its interface to the vfs
infrastructure. It would seem unfortunate if this relation was not continued
with tiny cpusets. Perhaps someone can imagine a tiny cgroups? This might be
the most difficult part of this proposal.

Looking at some IA64 sn2 config builds I have laying about, I see the
following text sizes for a couple of versions, showing the growth of the
cpuset/cgroup apparatus over time:

    25933  2.6.18-rc3-mm1/kernel/cpuset.o  (Aug 2006)
vs.
    37823  2.6.25-rc2-mm1/kernel/cgroup.o  (Feb 2008)
    19558  2.6.25-rc2-mm1/kernel/cpuset.o

So the total has grown from 25933 to 57381 text bytes (note that this is IA64
arch; most arch's will have proportionately smaller text sizes.)

Unfortunately, ideas without code are usually met with the sound of silence,
as well they should be.
Furthermore, I can promise that I have no time to design or develop this
myself; my good employer is quite focused on the other end of things - the
big honkin NUMA and cluster systems.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214
Tiny cpusets -- cpusets for small systems?
A couple of proposals have been made recently by people working Linux on smaller systems, for improving realtime isolation and memory pressure handling: (1) cpu isolation for hard(er) realtime http://lkml.org/lkml/2008/2/21/517 Max Krasnyanskiy [EMAIL PROTECTED] [PATCH sched-devel 0/7] CPU isolation extensions (2) notify user space of tight memory http://lkml.org/lkml/2008/2/9/144 KOSAKI Motohiro [EMAIL PROTECTED] [PATCH 0/8][for -mm] mem_notify v6 In both cases, some of us have responded why not use cpusets, and the original submitters have replied cpusets are too fat (well, they were more diplomatic than that, but I guess I can say that ;) I wonder if there might be room for a tiny cpusets configuration option: * provide the same hooks to the rest of the kernel, and * provide the same syntactic interface to user space, but * with more limited semantics. The primary semantic limit I'd suggest would be supporting exactly one layer depth of cpusets, not a full hierarchy. So one could still successfully issue from user space 'mkdir /dev/cpuset/foo', but trying to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of very early FAT file systems, which had just a single, fixed size root directory ;). There might even be a configurable fixed upper limit on how many /dev/cpuset/* directories were allowed, further simplifying the locking and dynamic memory behavior of this apparatus. Some other features that aren't so easy to implement, and which have less value on small systems, such as notify_on_release, could also be stubbed out and always disabled, simply returning error if requested to be enabled from user space. The recent, chunky piece of code needed to compute dynamic sched domains from the cpuset hierarchy probably admits of a simpler variant in the tiny cpuset configuration. 
I suppose it would still be a vfs-based pseudo file system (even embedded Linux still has that infrastructure), except that the vfs operator functions could be simpler, as this would really be just a flat set of cpumask_t's and nodemask_t's at the core of the implementation, not an arbitrarily nested hierarchy of them. See further my comments on cgroups, below.

The rest of the kernel would see no difference ... except that some of the cpuset_*() hooks would return more quickly. This tiny cpuset option would provide the same kernel hooks as are now provided by the defines and inline stubs in the #else to #endif half of the #ifdef CONFIG_CPUSETS code in linux/cpuset.h. User space would see the same API, except that some valid operations on full cpusets, such as a nested mkdir, would fail on tiny cpusets.

How this extends to cgroups I don't know; for now I suspect that most cgroup module development is motivated by the needs of larger systems, not smaller systems. However, cpusets is now a module client of cgroups, and it is cgroups that now provides cpusets with its interface to the vfs infrastructure. It would seem unfortunate if this relation were not continued with tiny cpusets. Perhaps someone can imagine a tiny cgroups? This might be the most difficult part of this proposal.

Looking at some IA64 sn2 config builds I have lying about, I see the following text sizes for a couple of versions, showing the growth of the cpuset/cgroup apparatus over time:

	25933	2.6.18-rc3-mm1/kernel/cpuset.o	(Aug 2006)
vs.
	37823	2.6.25-rc2-mm1/kernel/cgroup.o	(Feb 2008)
	19558	2.6.25-rc2-mm1/kernel/cpuset.o

So the total has grown from 25933 to 57381 text bytes (note that this is the IA64 arch; most arch's will have proportionately smaller text sizes.)

Unfortunately, ideas without code are usually met with the sound of silence, as well they should be.
Furthermore, I can promise that I have no time to design or develop this myself; my good employer is quite focused on the other end of things - the big honkin NUMA and cluster systems. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson [EMAIL PROTECTED] 1.940.382.4214 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Tiny cpusets -- cpusets for small systems?
On Sat, Feb 23, 2008 at 4:09 AM, Paul Jackson [EMAIL PROTECTED] wrote:
> A couple of proposals have been made recently by people working Linux
> on smaller systems, for improving realtime isolation and memory
> pressure handling:
> (1) cpu isolation for hard(er) realtime
>     http://lkml.org/lkml/2008/2/21/517
>     Max Krasnyanskiy [EMAIL PROTECTED]
>     [PATCH sched-devel 0/7] CPU isolation extensions
> (2) notify user space of tight memory
>     http://lkml.org/lkml/2008/2/9/144
>     KOSAKI Motohiro [EMAIL PROTECTED]
>     [PATCH 0/8][for -mm] mem_notify v6
> In both cases, some of us have responded "why not use cpusets", and
> the original submitters have replied "cpusets are too fat" (well, they
> were more diplomatic than that, but I guess I can say that ;)

Having read those threads, it looks to me as though:

- the parts of Max's problem that would be solved by cpusets can be mostly accomplished just via sched_setaffinity()
- Motohiro wants to add a new system-wide API that you would also like to have available on a per-cpuset basis. (Why not just add two access points for the same feature?)

I don't think that either of these would be enough to justify big changes to cpusets or cgroups, although eliminating bloat is always a good thing.

> The primary semantic limit I'd suggest would be supporting exactly one
> layer depth of cpusets, not a full hierarchy. So one could still
> successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
> to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of very
> early FAT file systems, which had just a single, fixed size root
> directory ;). There might even be a configurable fixed upper limit on
> how many /dev/cpuset/* directories were allowed, further simplifying
> the locking and dynamic memory behavior of this apparatus.

I'm not sure that either of these would make much difference to the overall footprint. A single layer of cpusets would allow you to simplify validate_change() but not much else.
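For reference, the sched_setaffinity() route mentioned above looks like this from user space. A minimal sketch that pins the calling process to CPU 0 (which always exists) and reads the mask back; the helper name is illustrative:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to CPU 0 and verify the mask took effect.
 * Returns 0 on success, -1 on error. */
int pin_self_to_cpu0(void)
{
	cpu_set_t set, got;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* CPU 0 always exists */

	/* pid 0 means "the calling process" */
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		return -1;

	CPU_ZERO(&got);
	if (sched_getaffinity(0, sizeof(got), &got) != 0)
		return -1;

	/* Exactly one CPU left in the mask, and it is CPU 0. */
	return (CPU_ISSET(0, &got) && CPU_COUNT(&got) == 1) ? 0 : -1;
}
```

This is the per-task mechanism that taskset(1) wraps; it needs no cpuset filesystem at all, which is the point being made above.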
I don't see how a fixed upper limit on the number of cpusets makes the locking sufficiently simpler to save much code.

> How this extends to cgroups I don't know; for now I suspect that most
> cgroup module development is motivated by the needs of larger systems,
> not smaller systems. However, cpusets is now a module client of
> cgroups, and it is cgroups that now provides cpusets with its
> interface to the vfs infrastructure. It would seem unfortunate if this
> relation was not continued with tiny cpusets. Perhaps someone can
> imagine a tiny cgroups? This might be the most difficult part of this
> proposal.

If we wanted to go this way, I can imagine a cgroups config option that forces just a single hierarchy, which would allow a bunch of simplifications that would save plenty of text.

> Looking at some IA64 sn2 config builds I have laying about, I see the
> following text sizes for a couple of versions, showing the growth of
> the cpuset/cgroup apparatus over time:
>	25933	2.6.18-rc3-mm1/kernel/cpuset.o	(Aug 2006)
> vs.
>	37823	2.6.25-rc2-mm1/kernel/cgroup.o	(Feb 2008)
>	19558	2.6.25-rc2-mm1/kernel/cpuset.o
> So the total has grown from 25933 to 57381 text bytes (note that this
> is IA64 arch; most arch's will have proportionately smaller text
> sizes.)

On x86_64 they're:

	cgroup.o: 17348
	cpuset.o:  8533

Paul
Re: Tiny cpusets -- cpusets for small systems?
Paul M wrote:
> I don't think that either of these would be enough to justify big
> changes to cpusets or cgroups, although eliminating bloat is always a
> good thing.

My tiny cpuset idea doesn't so much eliminate bloat, as provide a thin alternative, alongside the existing fat alternative. So far as kernel source goes, it would get bigger, not smaller, with now two CONFIG choices for cpusets, fat or tiny.

The odds are, however, given that one of us has just promised not to code this, and the other of us doesn't figure it's worth it, that this idea will not live long. Someone would have to step up from the embedded side with a coded version that saved a nice chunk of memory (from their perspective) to get this off the ground, and no telling whether even that would meet with a warm reception.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson [EMAIL PROTECTED] 1.940.382.4214
Re: Tiny cpusets -- cpusets for small systems?
Hi Paul,

> A couple of proposals have been made recently by people working Linux
> on smaller systems, for improving realtime isolation and memory
> pressure handling:
> (1) cpu isolation for hard(er) realtime
>     http://lkml.org/lkml/2008/2/21/517
>     Max Krasnyanskiy [EMAIL PROTECTED]
>     [PATCH sched-devel 0/7] CPU isolation extensions
> (2) notify user space of tight memory
>     http://lkml.org/lkml/2008/2/9/144
>     KOSAKI Motohiro [EMAIL PROTECTED]
>     [PATCH 0/8][for -mm] mem_notify v6
> In both cases, some of us have responded "why not use cpusets", and
> the original submitters have replied "cpusets are too fat" (well, they
> were more diplomatic than that, but I guess I can say that ;)

My primary issue with cpusets (from the CPU isolation perspective, that is) was not the fatness. I did make a couple of comments like "on a dual-cpu box I do not need cpusets to manage the CPUs", but that's not directly related to the CPU isolation. For the CPU isolation in particular I need code like this:

	int select_irq_affinity(unsigned int irq)
	{
		cpumask_t usable_cpus;

		cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
		irq_desc[irq].affinity = usable_cpus;
		irq_desc[irq].chip->set_affinity(irq, usable_cpus);
		return 0;
	}

How would you implement that with cpusets ? I haven't seen your patches but I'd imagine that they will still need locks and iterators for the "is CPU N isolated" functionality.

So I see cpusets as a higher-level API/mechanism and cpu_isolated_map as a lower-level mechanism that actually makes the kernel aware of what's isolated and what's not. Kind of like the sched domain/cpuset relationship: ie cpusets affect sched domains but the scheduler does not use cpusets directly.

> I wonder if there might be room for a tiny cpusets configuration option:
>  * provide the same hooks to the rest of the kernel, and
>  * provide the same syntactic interface to user space, but
>  * with more limited semantics.
> The primary semantic limit I'd suggest would be supporting exactly one
> layer depth of cpusets, not a full hierarchy.
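The heart of that snippet is the cpus_andnot() step. In plain bitmask terms (a user-space sketch, with unsigned long standing in for the kernel's cpumask_t and the function name chosen for illustration):

```c
/* usable = online & ~isolated: the same masking that
 * select_irq_affinity() does with cpus_andnot(), so IRQs are
 * only ever directed at online, non-isolated CPUs. */
unsigned long select_usable_cpus(unsigned long online, unsigned long isolated)
{
	return online & ~isolated;
}
```

For example, with cpus 0-3 online (0xF) and cpu 1 isolated (0x2), the usable mask comes out as cpus 0, 2 and 3 (0xD).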
> So one could still successfully issue from user space 'mkdir
> /dev/cpuset/foo', but trying to do 'mkdir /dev/cpuset/foo/bar' would
> fail. This reminds me of very early FAT file systems, which had just a
> single, fixed size root directory ;). There might even be a
> configurable fixed upper limit on how many /dev/cpuset/* directories
> were allowed, further simplifying the locking and dynamic memory
> behavior of this apparatus.

In the foreseeable future 2-8 cores will be the most common configuration. Do you think that cpusets are needed/useful for those machines ? The reason I'm asking is because, given the restrictions you mentioned above, it seems that you might as well just do

	taskset -c 1,2,3 app1
	taskset -c 3,4,5 app2

Yes, it's not quite the same of course, but imo it covers most cases. That's what we do on 2-4 cores these days, and we are quite happy with that. ie We either let the specialized apps manage their thread affinities themselves or use taskset to manage the apps.

> User space would see the same API, except that some valid operations
> on full cpusets, such as a nested mkdir, would fail on tiny cpusets.

Speaking of the user-space API. I guess it's not directly related to the tiny-cpusets proposal but rather to cpusets in general. Stuff that I'm working on these days (wireless basestations) is designed with the following model:

	cpuN             - runs soft-RT networking and management code
	cpuN+1 to cpuN+x - are used as dedicated engines

ie The simplest example would be:

	cpu0 - runs IP, L2 and control plane
	cpu1 - runs hard-RT MAC

So if CPU isolation is implemented on top of the cpusets, what kind of API do you envision for such an app ? I mean, currently cpusets seem to be mostly dealing with entire processes, whereas in this case we're really dealing with threads. ie Different threads of the same process require different policies: some must run on isolated cpus, some must not. I guess one could write a thread's pid into the cpusets fs, but that's not very convenient.
pthread_set_affinity() is exactly what's needed.

Personally I do not see much use for cpusets for those kinds of designs. But maybe I'm missing something. I got really excited when cpusets were first merged into mainline, but after looking closer I could not really find a use for them, at least not for our apps.

Max
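The glibc spelling of that call is pthread_setaffinity_np(3), or pthread_attr_setaffinity_np(3) to place a thread before it ever runs. A minimal sketch of the per-thread placement described above, assuming CPU 0 is online; the helper names are illustrative:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Worker just reports which CPU it is running on. */
static void *worker(void *arg)
{
	(void)arg;
	return (void *)(long)sched_getcpu();
}

/* Create one thread pinned to CPU 0 via its attributes, so it starts
 * life on the right CPU.  Returns the CPU the worker ran on, or -1. */
int run_pinned_worker(void)
{
	pthread_t t;
	pthread_attr_t attr;
	cpu_set_t set;
	void *ret;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* CPU 0 always exists */

	if (pthread_attr_init(&attr) != 0)
		return -1;
	/* Per-thread policy: the affinity lives in the attr, not the
	 * process, so sibling threads can get different masks. */
	if (pthread_attr_setaffinity_np(&attr, sizeof(set), &set) != 0)
		return -1;
	if (pthread_create(&t, &attr, worker, (void *)0) != 0)
		return -1;
	pthread_join(t, &ret);
	pthread_attr_destroy(&attr);
	return (int)(long)ret;
}
```

In the basestation model above, the soft-RT threads would get a mask covering cpu0 while the hard-RT MAC thread gets cpu1, all within one process and without touching any cpuset filesystem.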