Re: vmstat: On demand vmstat workers V3
Hi Viresh,

On 04/22/2014 03:32 AM, Viresh Kumar wrote:
> On Thu, Oct 3, 2013 at 11:10 PM, Christoph Lameter wrote:
>> V2->V3:
>> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
>>   if that is exactly what we want but it is a start. Thomas?
>> - Migrate the shepherd task if the output of
>>   tick_get_housekeeping_cpu() changes.
>> - Fixes recommended by Andrew.
>
> Hi Christoph,
>
> This vmstat interrupt is disturbing my core isolation :), have you
> got any further with this patchset?

You don't mean an interrupt, right? The updates are done via the
regular-priority workqueue.

I'm playing with isolation as well (it has been more or less a background
thing for the last 6+ years). Our threads that run on the isolated cores
are SCHED_FIFO and therefore low-prio workqueue stuff, like vmstat,
doesn't get in the way. I do have a few patches for the workqueues to
make things better for isolation.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: why does kernel 3.8-rc1 put all TAP devices into state RUNNING during boot
On 01/05/2013 02:16 AM, Toralf Förster wrote:
> On my stable Gentoo Linux I've observed a change in behaviour of the
> configured TAP devices after the boot process.
>
> $ diff 3.7.1 3.8.0-rc1+ | grep UP
> - br0: flags=4355<UP,BROADCAST,PROMISC,MULTICAST> mtu 1500
> + br0: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 1500
> - tap0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> + tap0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
>
> May I ask you if this changed behaviour is intended?

I'm not aware of any changes in this behavior. btw It looks like it
changed for your bridge interface as well, so it's not really specific
to the TAP devices. Someone from netdev would know :)

Max
Re: [PATCH] MAINTAINERS: fix bouncing tun/tap entries
On 11/30/2012 09:28 AM, David Miller wrote:
> From: Jiri Slaby
> Date: Fri, 30 Nov 2012 18:05:40 +0100
>
>> Delivery to the following recipient failed permanently:
>>
>> v...@office.satix.net
>>
>> Technical details of permanent failure:
>> DNS Error: Domain name not found
>>
>> Of course:
>> $ host office.satix.net
>> Host office.satix.net not found: 3(NXDOMAIN)
>>
>> ===
>>
>> And "Change of Email Address Notification":
>> Old Address        New Address            Email Subject
>> --
>> m...@qualcomm.com  m...@qti.qualcomm.com  "tuntap: multiqueue...
>>
>> Signed-off-by: Jiri Slaby
>
> Applied.

Thanks for fixing that guys.

Max
Re: [net-next v5 0/7] Multiqueue support in tuntap
On 10/31/2012 10:45 PM, Jason Wang wrote:
> Hello All:
>
> This is an update of multiqueue support in tuntap from V3. Please
> consider it for merging.
>
> The main idea of this series is to let the tun/tap device benefit from
> multiqueue network cards and multi-core hosts. We used to have a single
> queue for tuntap, which could be a bottleneck in a multiqueue/multi-core
> environment. So this series lets the device be attached to multiple
> sockets and exposes them through fds to userspace as multiple queues.
> The series was originally designed to serve as a backend for multiqueue
> virtio-net in KVM, but the design is generic enough for other
> applications to use.
>
> Some quick overview of the design:
>
> - Moving the socket from tun_device to tun_file.
> - Allowing multiple sockets to be attached to a tun/tap device.
> - Using RCU to synchronize the data path and system calls.
> - Two new ioctls were added for userspace to attach and detach a socket
>   to/from the device.
> - API compatibility is maintained without notable userspace changes, so
>   legacy userspace that only uses one queue won't need any changes.
> - A flow (rxhash) to queue table is maintained by tuntap, which chooses
>   the txq based on the last rxq the flow came from.

I'm still trying to wrap my head around the new locking/RCU stuff but it
looks like Paul and others already looked at it. Otherwise looks good
to me.

btw In the description above you really meant allowing for attaching
multiple file descriptors, not sockets.

Thanks
Max
Re: Tiny cpusets -- cpusets for small systems?
Hi Paul,

> A couple of proposals have been made recently by people working Linux
> on smaller systems, for improving realtime isolation and memory
> pressure handling:
>
> (1) cpu isolation for hard(er) realtime
>     http://lkml.org/lkml/2008/2/21/517
>     Max Krasnyanskiy <[EMAIL PROTECTED]>
>     [PATCH sched-devel 0/7] CPU isolation extensions
>
> (2) notify user space of tight memory
>     http://lkml.org/lkml/2008/2/9/144
>     KOSAKI Motohiro <[EMAIL PROTECTED]>
>     [PATCH 0/8][for -mm] mem_notify v6
>
> In both cases, some of us have responded "why not use cpusets", and the
> original submitters have replied "cpusets are too fat" (well, they
> were more diplomatic than that, but I guess I can say that ;)

My primary issue with cpusets (from a CPU isolation perspective, that is)
was not the fatness. I did make a couple of comments like "on a dual-cpu
box I do not need cpusets to manage the CPUs" but that's not directly
related to CPU isolation.

For CPU isolation in particular I need code like this:

	int select_irq_affinity(unsigned int irq)
	{
		cpumask_t usable_cpus;

		cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
		irq_desc[irq].affinity = usable_cpus;
		irq_desc[irq].chip->set_affinity(irq, usable_cpus);
		return 0;
	}

How would you implement that with cpusets? I haven't seen your patches
but I'd imagine that they will still need locks and iterators for the
"is CPU N isolated" functionality.

So I see cpusets as a higher-level API/mechanism and cpu_isolated_map as
a lower-level mechanism that actually makes the kernel aware of what's
isolated and what's not. Kind of like the sched domain/cpuset
relationship: cpusets affect sched domains, but the scheduler does not
use cpusets directly.

> I wonder if there might be room for a "tiny cpusets" configuration option:
>  * provide the same hooks to the rest of the kernel, and
>  * provide the same syntactic interface to user space, but
>  * with more limited semantics.
>
> The primary semantic limit I'd suggest would be supporting exactly
> one layer depth of cpusets, not a full hierarchy. So one could still
> successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
> to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of
> very early FAT file systems, which had just a single, fixed size
> root directory ;). There might even be a configurable fixed upper
> limit on how many /dev/cpuset/* directories were allowed, further
> simplifying the locking and dynamic memory behavior of this apparatus.

In the foreseeable future 2-8 cores will be the most common
configuration. Do you think that cpusets are needed/useful for those
machines? The reason I'm asking is because given the restrictions you
mentioned above it seems that you might as well just do

	taskset -c 1,2,3 app1
	taskset -c 3,4,5 app2

Yes, it's not quite the same of course, but imo it covers most cases.
That's what we do on 2-4 cores these days, and we are quite happy with
that. ie We either let the specialized apps manage their thread
affinities themselves or use "taskset" to manage the apps.

> User space would see the same API, except that some valid operations
> on full cpusets, such as a nested mkdir, would fail on tiny cpusets.

Speaking of the user-space API. I guess it's not directly related to the
tiny-cpusets proposal but rather to cpusets in general. Stuff that I'm
working on these days (wireless basestations) is designed with the
following model:

	cpuN             - runs soft-RT networking and management code
	cpuN+1 to cpuN+x - are used as dedicated engines

ie The simplest example would be:

	cpu0 - runs IP, L2 and control plane
	cpu1 - runs hard-RT MAC

So if CPU isolation is implemented on top of cpusets, what kind of API
do you envision for such an app? I mean, currently cpusets seem to
mostly deal with entire processes, whereas in this case we're really
dealing with threads. ie Different threads of the same process require
different policies: some must run on isolated cpus, some must not. I
guess one could write a thread's pid into the cpusets fs, but that's not
very convenient. pthread_setaffinity_np() is exactly what's needed.

Personally I do not see much use for cpusets for those kinds of designs.
But maybe I'm missing something. I got really excited when cpusets were
first merged into mainline, but after looking closer I could not really
find a use for them, at least not for our apps.

Max
[RFC] Genirq and CPU isolation
Hi Thomas,

While reviewing CPU isolation patches Peter pointed out that instead of
changing arch specific irq handling I should be extending genirq code.
Which makes perfect sense. Why didn't I think of that before :)

Basically the idea is that by default isolated CPUs must not get HW irqs
routed to them (besides IPIs and stuff of course). Does the patch
included below look like the right approach?

btw select_smp_affinity(), which is currently used only by alpha, seemed
out of place. It's called multiple times for shared irqs, ie every time
a new handler is registered the irq is moved to a different CPU. So I
moved it under the "if (!shared)" check inside setup_irq().

The patch introduces a generic version of select_smp_affinity() that
sets the affinity mask to "online_cpus - isolated_cpus", and updates the
x86_32 and alpha load balancers to ignore isolated cpus.

Booted on Core2 laptop and dual Opteron boxes with and w/o isolcpus=
options and everything seems to work as expected. I wanted to run this
by you before I include it in my patch series.

Thanx
Max

diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index facf82a..6b01702 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -51,7 +51,7 @@ select_smp_affinity(unsigned int irq)
 	if (!irq_desc[irq].chip->set_affinity || irq_user_affinity[irq])
 		return 1;
 
-	while (!cpu_possible(cpu))
+	while (!cpu_possible(cpu) || cpu_isolated(cpu))
 		cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0);
 	last_cpu = cpu;
diff --git a/arch/x86/kernel/genapic_flat_64.c b/arch/x86/kernel/genapic_flat_64.c
index e02e58c..07352b7 100644
--- a/arch/x86/kernel/genapic_flat_64.c
+++ b/arch/x86/kernel/genapic_flat_64.c
@@ -21,9 +21,7 @@
 static cpumask_t flat_target_cpus(void)
 {
-	cpumask_t target;
-	cpus_andnot(target, cpu_online_map, cpu_isolated_map);
-	return target;
+	return cpu_online_map;
 }
 
 static cpumask_t flat_vector_allocation_domain(int cpu)
diff --git a/arch/x86/kernel/io_apic_32.c b/arch/x86/kernel/io_apic_32.c
index 4ca5486..9c8816f 100644
--- a/arch/x86/kernel/io_apic_32.c
+++ b/arch/x86/kernel/io_apic_32.c
@@ -468,7 +468,7 @@ static void do_irq_balance(void)
 	for_each_possible_cpu(i) {
 		int package_index;
 		CPU_IRQ(i) = 0;
-		if (!cpu_online(i))
+		if (!cpu_online(i) || cpu_isolated(i))
 			continue;
 		package_index = CPU_TO_PACKAGEINDEX(i);
 		for (j = 0; j < NR_IRQS; j++) {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 176e5e7..287bc64 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -253,14 +253,7 @@ static inline void set_balance_irq_affinity(unsigned int irq, cpumask_t mask)
 }
 #endif
 
-#ifdef CONFIG_AUTO_IRQ_AFFINITY
 extern int select_smp_affinity(unsigned int irq);
-#else
-static inline int select_smp_affinity(unsigned int irq)
-{
-	return 1;
-}
-#endif
 
 extern int no_irq_affinity;
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 438a014..e74db94 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -376,6 +376,9 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 		} else
 			/* Undo nested disables: */
 			desc->depth = 1;
+
+		/* Set default affinity mask once everything is setup */
+		select_smp_affinity(irq);
 	}
 	/* Reset broken irq detection when installing new handler */
 	desc->irq_count = 0;
@@ -488,6 +491,26 @@ void free_irq(unsigned int irq, void *dev_id)
 }
 EXPORT_SYMBOL(free_irq);
 
+#ifndef CONFIG_AUTO_IRQ_AFFINITY
+/**
+ * Generic version of the affinity autoselector.
+ * Called under desc->lock from setup_irq().
+ * btw Should we rename this to select_irq_affinity() ?
+ */
+int select_smp_affinity(unsigned int irq)
+{
+	cpumask_t usable_cpus;
+
+	if (!irq_can_set_affinity(irq))
+		return 0;
+
+	cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
+	irq_desc[irq].affinity = usable_cpus;
+	irq_desc[irq].chip->set_affinity(irq, usable_cpus);
+	return 0;
+}
+#endif
+
 /**
  * request_irq - allocate an interrupt line
  * @irq: Interrupt line to allocate
@@ -555,8 +578,6 @@ int request_irq(unsigned int irq, irq_handler_t handler,
 	action->next = NULL;
 	action->dev_id = dev_id;
 
-	select_smp_affinity(irq);
-
 #ifdef CONFIG_DEBUG_SHIRQ
 	if (irqflags & IRQF_SHARED) {
 		/*
[PATCH sched-devel 1/7] cpuisol: Make cpu isolation configurable and export isolated map
This simple patch introduces a new config option for CPU isolation. The
reason I created the separate Kconfig file here is because more options
will be added by the following patches.

The patch also exports cpu_isolated_map, provides a cpu_isolated()
accessor macro and provides access to the isolation bit via sysfs. In
other words cpu_isolated_map is exposed to the rest of the kernel and
the user-space in much the same way cpu_online_map is exposed today.

While at it I also moved cpu_*_map from kernel/sched.c into kernel/cpu.c.
Those maps have very little to do with the scheduler these days and
therefore seem out of place in the scheduler code.

This patch does not change/affect any existing scheduler functionality.

Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]>
---
 arch/x86/Kconfig        |    1 +
 drivers/base/cpu.c      |   48 ++++++++++++++++++++++++++
 include/linux/cpumask.h |    3 ++
 kernel/Kconfig.cpuisol  |   15 ++++++++
 kernel/Makefile         |    4 +-
 kernel/cpu.c            |   49 +++++++++++++++++++++++++
 kernel/sched.c          |   36 ------------------
 7 files changed, 118 insertions(+), 38 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3be2305..d228488 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -526,6 +526,7 @@ config SCHED_MC
 	  increased overhead in some places. If unsure say N here.
 
 source "kernel/Kconfig.preempt"
+source "kernel/Kconfig.cpuisol"
 
 config X86_UP_APIC
 	bool "Local APIC support on uniprocessors"
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 499b003..b6c5e0f 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -55,10 +55,58 @@ static ssize_t store_online(struct sys_device *dev, const char *buf,
 }
 static SYSDEV_ATTR(online, 0644, show_online, store_online);
 
+#ifdef CONFIG_CPUISOL
+/*
+ * This is under config hotplug because in order to
+ * dynamically isolate a CPU it needs to be brought off-line first.
+ * In other words the sequence is
+ *	echo 0 > /sys/device/system/cpuN/online
+ *	echo 1 > /sys/device/system/cpuN/isolated
+ *	echo 1 > /sys/device/system/cpuN/online
+ */
+static ssize_t show_isol(struct sys_device *dev, char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+
+	return sprintf(buf, "%u\n", !!cpu_isolated(cpu->sysdev.id));
+}
+
+static ssize_t store_isol(struct sys_device *dev, const char *buf,
+			  size_t count)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+	ssize_t ret = 0;
+
+	if (cpu_online(cpu->sysdev.id))
+		return -EBUSY;
+
+	switch (buf[0]) {
+	case '0':
+		cpu_clear(cpu->sysdev.id, cpu_isolated_map);
+		break;
+	case '1':
+		cpu_set(cpu->sysdev.id, cpu_isolated_map);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	if (ret >= 0)
+		ret = count;
+	return ret;
+}
+static SYSDEV_ATTR(isolated, 0600, show_isol, store_isol);
+#endif /* CONFIG_CPUISOL */
+
 static void __devinit register_cpu_control(struct cpu *cpu)
 {
 	sysdev_create_file(&cpu->sysdev, &attr_online);
+
+#ifdef CONFIG_CPUISOL
+	sysdev_create_file(&cpu->sysdev, &attr_isolated);
+#endif
 }
 
 void unregister_cpu(struct cpu *cpu)
 {
 	int logical_cpu = cpu->sysdev.id;
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 7047f58..cde2964 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_t *dstp, const cpumask_t *srcp,
 extern cpumask_t cpu_possible_map;
 extern cpumask_t cpu_online_map;
 extern cpumask_t cpu_present_map;
+extern cpumask_t cpu_isolated_map;
 
 #if NR_CPUS > 1
 #define num_online_cpus()	cpus_weight(cpu_online_map)
@@ -388,6 +389,7 @@
 #define cpu_online(cpu)		cpu_isset((cpu), cpu_online_map)
 #define cpu_possible(cpu)	cpu_isset((cpu), cpu_possible_map)
 #define cpu_present(cpu)	cpu_isset((cpu), cpu_present_map)
+#define cpu_isolated(cpu)	cpu_isset((cpu), cpu_isolated_map)
 #else
 #define num_online_cpus()	1
 #define num_possible_cpus()	1
@@ -395,6 +397,7 @@ extern cpumask_t cpu_present_map;
 #define cpu_online(cpu)		((cpu) == 0)
 #define cpu_possible(cpu)	((cpu) == 0)
 #define cpu_present(cpu)	((cpu) == 0)
+#define cpu_isolated(cpu)	(0)
 #endif
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol
new file mode 100644
index 000..e606477
--- /dev/null
+++ b/kernel/Kconfig.cpuisol
@@ -0,0 +1,15 @@
+config CPUISOL
+	depends on SMP
+	bool "CPU isolation"
+	help
+	  This option enables support for CPU isolation.
+	  If enabled the kernel will try to avoid kernel activity on the isolated CPUs.
+	  By default user-space threads are not scheduled on the isolated CPUs unless
+	  they explicitly request it (via sched_ and pthread_ affinity calls). Isolated
+	  CPUs are not subject to the scheduler load-balancing algorithms.
+
+	  CPUs can be marked as isolated using 'isolcpus=' command line option or by
+	  writing '1' into /sys/devices/system/cpu/cpuN/isolated.
+
+	  This feature is useful for hard realtime and high performance applications.
+	  If unsure say 'N'.
[PATCH sched-devel 5/7] cpuisol: Documentation updates
Documented sysfs interface as suggested by Andrew Morton. Added general
documentation that describes how to configure and use CPU isolation
features.

Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |   41 +++++++
 Documentation/cpu-isolation.txt                    |  113 ++++++++++++++++++++
 2 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
new file mode 100644
index 000..32dde5b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -0,0 +1,41 @@
+What:		/sys/devices/system/cpu/...
+Date:		Feb. 2008
+KernelVersion:	2.6.24
+Contact:	LKML
+Description:
+
+The /sys/devices/system/cpu tree provides information about all cpu's
+known to the running kernel.
+
+Following files are created for each cpu. 'N' is the cpu number.
+
+/sys/devices/system/cpu/cpuN/
+   online (0644)    On-line attribute. Indicates whether the cpu is on-line.
+                    The cpu can be brought off-line by writing '0' into
+                    this file. Similarly it can be brought back on-line
+                    by writing '1' into this file. This attribute is
+                    not available for the cpu's that cannot be brought
+                    off-line. Typically cpu0. For more information see
+                    Documentation/cpu-hotplug.txt
+
+   isolated (0644)  Isolation attribute. Indicates whether the cpu
+                    is isolated.
+                    The cpu can be isolated by writing '1' into this
+                    file. Similarly it can be un-isolated by writing
+                    '0' into this file. In order to isolate the cpu it
+                    must first be brought off-line. This attribute is
+                    not available for the cpu's that cannot be brought
+                    off-line. Typically cpu0.
+                    Note this attribute is present only if "CPU isolation"
+                    is enabled. For more information see
+                    Documentation/cpu-isolation.txt
+
+   cpufreq (0755)   Frequency scaling state.
+                    For more info see
+                    Documentation/cpu-freq/...
+
+   cache (0755)     Cache information. FIXME
+
+   cpuidle (0755)   Idle state information. FIXME
+
+   topology (0755)  Topology information. FIXME
diff --git a/Documentation/cpu-isolation.txt b/Documentation/cpu-isolation.txt
new file mode 100644
index 000..b9ca425
--- /dev/null
+++ b/Documentation/cpu-isolation.txt
@@ -0,0 +1,113 @@
+CPU isolation support in Linux(tm) Kernel
+
+Maintainers:
+
+Scheduler and scheduler domain bits:
+   Ingo Molnar <[EMAIL PROTECTED]>
+
+General framework, irq and workqueue isolation:
+   Max Krasnyanskiy <[EMAIL PROTECTED]>
+
+ChangeLog:
+- Initial version. Feb 2008, MaxK
+
+Introduction
+------------
+
+The primary idea behind CPU isolation is the ability to use some CPU cores
+as a dedicated engines for running user-space code with minimal kernel
+overhead/intervention, think of it as an SPE in the Cell processor. For
+example CPU isolation allows for running CPU intensive(100%) RT task
+on one of the processors without adversely affecting or being affected
+by the other system activities. With the current (as of early 2008)
+multi-core CPU trend we may see more and more applications that explore
+this capability: real-time gaming engines, simulators, hard real-time
+apps, etc.
+
+Current CPU isolation support consists of the following features:
+
+1. Isolated CPU(s) are excluded from the scheduler load balancing logic.
+   Applications must explicitly bind threads in order to run on those
+   CPU(s).
+
+2. By default interrupts are not routed to the isolated CPU(s).
+   Users must route interrupts (if any) to those CPU(s) explicitly.
+
+3. Kernel avoids any activity on the isolated CPU(s) as much as possible.
+   This includes workqueues, per CPU threads, etc. Please note that
+   this feature is optional and is disabled by default.
+
+Kernel configuration options
+----------------------------
+
+Following options need to be enabled in order to use CPU isolation
+
+   CONFIG_CPUISOL              Top-level config option. Enables general
+                               CPU isolation framework and enables
+                               features #1 and #2 described above.
+
+   CONFIG_CPUISOL_WORKQUEUE    These options provide deeper isolation
+   CONFIG_CPUISOL_STOPMACHINE  from various kernel subsystems. They
+   CONFIG_CPUISOL_...          implement feature #3 described above.
+                               See Kconfig help for more information on
+                               each individual option.
+
+How to isolate a CPU
+--------------------
+
+There are two ways for isolating a CPU
+
+Kernel boot comm
[PATCH sched-devel 6/7] cpuisol: Minor updates to the Kconfig options
Fixed a couple of typos, long lines and referred to the documentation
file.

Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]>
---
 kernel/Kconfig.cpuisol |   31 +++++++++++++++++--------------
 1 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol
index 81f1972..e681b02 100644
--- a/kernel/Kconfig.cpuisol
+++ b/kernel/Kconfig.cpuisol
@@ -2,23 +2,26 @@ config CPUISOL
 	depends on SMP
 	bool "CPU isolation"
 	help
-	  This option enables support for CPU isolation.
-	  If enabled the kernel will try to avoid kernel activity on the isolated CPUs.
-	  By default user-space threads are not scheduled on the isolated CPUs unless
-	  they explicitly request it (via sched_ and pthread_ affinity calls). Isolated
-	  CPUs are not subject to the scheduler load-balancing algorithms.
-
-	  CPUs can be marked as isolated using 'isolcpus=' command line option or by
-	  writing '1' into /sys/devices/system/cpu/cpuN/isolated.
-
-	  This feature is useful for hard realtime and high performance applications.
+	  This option enables support for CPU isolation. If enabled the
+	  kernel will try to avoid kernel activity on the isolated CPUs.
+	  By default user-space threads are not scheduled on the isolated
+	  CPUs unless they explicitly request it via sched_setaffinity()
+	  and pthread_setaffinity_np() calls. Isolated CPUs are not
+	  subject to the scheduler load-balancing algorithms.
+
+	  This feature is useful for hard realtime and high performance
+	  applications.
+	  See Documentation/cpu-isolation.txt for more details.
+	  If unsure say 'N'.
 
 config CPUISOL_WORKQUEUE
 	bool "Do not schedule workqueues on the isolated CPUs (EXPERIMENTAL)"
 	depends on CPUISOL && EXPERIMENTAL
 	help
-	  In this option is enabled kernel will not schedule workqueues on the
-	  isolated CPUs.
-	  Please note that at this point this feature is experimental. It brakes
-	  certain things like OProfile that heavily rely on per cpu workqueues.
+	  If this option is enabled kernel will not schedule workqueues on
+	  the isolated CPUs. Please note that at this point this feature
+	  is experimental. It breaks certain things like OProfile that
+	  heavily rely on per cpu workqueues.
+
+	  Say 'Y' to enable workqueue isolation. If unsure say 'N'.
-- 
1.5.4.1
[PATCH sched-devel 3/7] cpuisol: Do not schedule workqueues on the isolated CPUs
This patch addresses the use case where a high-priority realtime (FIFO, RR) user-space thread uses 100% of a CPU for extended periods of time. In that case kernel workqueue threads do not get a chance to run, and the entire machine essentially hangs because other CPUs are waiting for scheduled workqueues to flush. This use case is perfectly valid if one is using a CPU as a dedicated engine (crunching numbers, hard realtime, etc). Think of it as an SPE in the Cell processor, which is what CPU isolation enables in the first place. Most kernel subsystems do not rely on the per CPU workqueues. In fact we already have support for single threaded workqueues; this patch just makes it automatic. As mentioned in the introductory email this functionality has been tested on a wide range of full-fledged systems (with IDE, SATA, USB, automount, NFS, NUMA, etc) in a production environment. The only feature (that I know of) that does not work when workqueue isolation is enabled is OProfile. It does not result in crashes or instability, OProfile is just unable to collect stats from the isolated CPUs. Hence this feature is marked as experimental. There is zero overhead if workqueue isolation is disabled. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- kernel/Kconfig.cpuisol |9 + kernel/workqueue.c | 30 +++--- 2 files changed, 32 insertions(+), 7 deletions(-) diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol index e606477..81f1972 100644 --- a/kernel/Kconfig.cpuisol +++ b/kernel/Kconfig.cpuisol @@ -13,3 +13,12 @@ config CPUISOL This feature is useful for hard realtime and high performance applications. If unsure say 'N'. + +config CPUISOL_WORKQUEUE + bool "Do not schedule workqueues on the isolated CPUs (EXPERIMENTAL)" + depends on CPUISOL && EXPERIMENTAL + help + In this option is enabled kernel will not schedule workqueues on the + isolated CPUs. + Please note that at this point this feature is experimental.
It brakes + certain things like OProfile that heavily rely on per cpu workqueues. diff --git a/kernel/workqueue.c b/kernel/workqueue.c index ff06611..f48e13c 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -35,6 +35,16 @@ #include /* + * Stub out cpu_isolated() if isolated CPUs are allowed to + * run workqueues. + */ +#ifdef CONFIG_CPUISOL_WORKQUEUE +#define cpu_unusable(cpu) cpu_isolated(cpu) +#else +#define cpu_unusable(cpu) (0) +#endif + +/* * The per-CPU workqueue (if single thread, we always use the first * possible cpu). */ @@ -97,7 +107,7 @@ static const cpumask_t *wq_cpu_map(struct workqueue_struct *wq) static struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu) { - if (unlikely(is_single_threaded(wq))) + if (unlikely(is_single_threaded(wq)) || cpu_unusable(cpu)) cpu = singlethread_cpu; return per_cpu_ptr(wq->cpu_wq, cpu); } @@ -229,9 +239,11 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq, timer->data = (unsigned long)dwork; timer->function = delayed_work_timer_fn; - if (unlikely(cpu >= 0)) + if (unlikely(cpu >= 0)) { + if (cpu_unusable(cpu)) + cpu = singlethread_cpu; add_timer_on(timer, cpu); - else + } else add_timer(timer); ret = 1; } @@ -605,7 +617,8 @@ int schedule_on_each_cpu(work_func_t func) get_online_cpus(); for_each_online_cpu(cpu) { struct work_struct *work = per_cpu_ptr(works, cpu); - + if (cpu_unusable(cpu)) + continue; INIT_WORK(work, func); set_bit(WORK_STRUCT_PENDING, work_data_bits(work)); __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work); @@ -754,7 +767,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name, for_each_possible_cpu(cpu) { cwq = init_cpu_workqueue(wq, cpu); - if (err || !cpu_online(cpu)) + if (err || !cpu_online(cpu) || cpu_unusable(cpu)) continue; err = create_workqueue_thread(cwq, cpu); start_workqueue_thread(cwq, cpu); @@ -833,8 +846,11 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb, struct cpu_workqueue_struct *cwq; 
struct workqueue_struct *wq; - action &= ~CPU_TASKS_FROZEN; + if (cpu_unusable(cpu)) + return NOTIFY_OK; + action &= ~CPU_TASKS_FROZEN; + switch (action) { case CPU_UP_PREPARE: @@ -869,7 +885,7 @@ static int __devinit workqueue
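The redirection rule this patch adds to wq_per_cpu() can be modeled in a few lines of userspace C. This is a sketch, not kernel code: plain bitmasks stand in for cpumask_t, and the names (`isolated_mask`, `wq_target_cpu`) are illustrative, with CPU 0 assumed to be the housekeeping/single-thread CPU.

```c
#include <assert.h>

/* Userspace model of the patch's queueing rule: work queued for an
 * unusable (isolated) CPU falls back to the single-threaded CPU,
 * exactly as wq_per_cpu() does when cpu_unusable(cpu) is true. */
enum { SINGLETHREAD_CPU = 0 };   /* assumed housekeeping CPU */

unsigned long isolated_mask;     /* bit n set => CPU n is isolated */

int cpu_unusable(int cpu)
{
    return (isolated_mask >> cpu) & 1;
}

/* Which CPU actually runs work queued for 'cpu'? */
int wq_target_cpu(int cpu)
{
    return cpu_unusable(cpu) ? SINGLETHREAD_CPU : cpu;
}
```

With CPUs 2 and 3 isolated, work queued for CPU 1 stays on CPU 1 while work queued for CPU 2 or 3 lands on CPU 0.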
[PATCH sched-devel 7/7] cpuisol: Do not halt isolated CPUs with Stop Machine
This patch makes "stop machine" ignore isolated CPUs (if the config option is enabled). It addresses the exact same use case explained in the previous workqueue isolation patch, where a user-space RT thread can prevent stop machine threads from running, which causes the entire system to hang. Stop machine is particularly bad when it comes to latencies because it halts every single CPU and may take several milliseconds to complete. It's currently used for module insertion and removal only. As some folks pointed out in the previous discussions this patch is potentially unsafe if applications running on the isolated CPUs use kernel services affected by the module insertion and removal. I've been running kernels with this patch on a wide range of machines in a production environment where we routinely insert/remove modules with applications running on isolated CPUs. Also I've recently done quite a bit of testing on live multi-core systems with "stop machine" _completely_ disabled, and was not able to trigger any problems. For more details please see this thread http://marc.info/?l=linux-kernel&m=120243837206248&w=2 That of course does not mean that the patch is totally safe but it does not seem to cause any instability in real life. This feature does not add any overhead when disabled. It's marked as experimental due to potential issues mentioned above. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- kernel/Kconfig.cpuisol | 15 +++ kernel/stop_machine.c |8 +++- 2 files changed, 22 insertions(+), 1 deletions(-) diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol index e681b02..24c1ef0 100644 --- a/kernel/Kconfig.cpuisol +++ b/kernel/Kconfig.cpuisol @@ -25,3 +25,18 @@ config CPUISOL_WORKQUEUE heavily rely on per cpu workqueues. Say 'Y' to enable workqueue isolation. If unsure say 'N'.
+ +config CPUISOL_STOPMACHINE + bool "Do not halt isolated CPUs with Stop Machine (EXPERIMENTAL)" + depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL + help + If this option is enabled kernel will not halt isolated CPUs + when Stop Machine is triggered. Stop Machine is currently only + used by the module insertion and removal. + Please note that at this point this feature is experimental. It is + not known to really break anything but can potentially introduce + an instability due to race conditions in module removal logic. + + Say 'Y' if support for dynamic module insertion and removal is + required for the system that uses isolated CPUs. + If unsure say 'N'. diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c index 6f4e0e1..aa3af15 100644 --- a/kernel/stop_machine.c +++ b/kernel/stop_machine.c @@ -89,6 +89,12 @@ static void stopmachine_set_state(enum stopmachine_state state) cpu_relax(); } +#ifdef CONFIG_CPUISOL_STOPMACHINE +#define cpu_unusable(cpu) cpu_isolated(cpu) +#else +#define cpu_unusable(cpu) (0) +#endif + static int stop_machine(void) { int i, ret = 0; @@ -98,7 +104,7 @@ static int stop_machine(void) stopmachine_state = STOPMACHINE_WAIT; for_each_online_cpu(i) { - if (i == raw_smp_processor_id()) + if (i == raw_smp_processor_id() || cpu_unusable(i)) continue; ret = kernel_thread(stopmachine, (void *)(long)i,CLONE_KERNEL); if (ret < 0) -- 1.5.4.1
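The effect of the stop_machine() change above can be sketched in userspace C. This model only counts how many stopmachine threads would be spawned; bitmasks stand in for cpumask_t and all names are illustrative, not taken from the kernel.

```c
#include <assert.h>

/* Sketch of the patched stop_machine() loop: with the option enabled,
 * the CPU running stop_machine itself and all isolated CPUs are
 * skipped, so no stopmachine thread is created for them. */
int stopmachine_threads(unsigned long online, unsigned long isolated,
                        int self, int ncpus)
{
    int i, n = 0;

    for (i = 0; i < ncpus; i++) {
        if (!((online >> i) & 1))
            continue;               /* offline CPUs are never halted */
        if (i == self || ((isolated >> i) & 1))
            continue;               /* current CPU or isolated: skipped */
        n++;                        /* this CPU would be halted */
    }
    return n;
}
```

On a 4-CPU box with CPU 3 isolated and stop_machine running on CPU 0, only CPUs 1 and 2 get halted, which is exactly the latency win the commit message describes.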
[PATCH sched-devel 4/7] cpuisol: Move on-stack array used for boot cmd parsing into __initdata
Suggested by Andrew Morton: isolated_cpu_setup() has an on-stack array of NR_CPUS integers. This will consume 4k of stack on ia64 (at least). We'll just squeak through for a little while, but this needs to be fixed. Just move it into __initdata. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- kernel/cpu.c | 15 ++- 1 files changed, 10 insertions(+), 5 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index a0ac386..b3af739 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -446,15 +446,20 @@ out: #ifdef CONFIG_CPUISOL /* Setup the mask of isolated cpus */ + +static int __initdata isolcpu[NR_CPUS]; + static int __init isolated_cpu_setup(char *str) { - int ints[NR_CPUS], i; + int i, n; + + str = get_options(str, ARRAY_SIZE(isolcpu), isolcpu); + n = isolcpu[0]; - str = get_options(str, ARRAY_SIZE(ints), ints); cpus_clear(cpu_isolated_map); - for (i = 1; i <= ints[0]; i++) - if (ints[i] < NR_CPUS) - cpu_set(ints[i], cpu_isolated_map); + for (i = 1; i <= n; i++) + if (isolcpu[i] < NR_CPUS) + cpu_set(isolcpu[i], cpu_isolated_map); return 1; } -- 1.5.4.1
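The patch above depends on the get_options() contract: element 0 of the output array receives the count, and parsed values start at element 1. A minimal userspace model of that contract (range syntax such as "0-3" is omitted, and the name `parse_int_list` is this sketch's, not the kernel's):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of get_options(): parse a comma-separated integer
 * list into out[], with out[0] receiving the count and the values
 * stored from out[1] on -- the layout isolated_cpu_setup() iterates
 * over with "for (i = 1; i <= n; i++)". */
const char *parse_int_list(const char *str, int nints, int *out)
{
    int n = 0;
    char *end;

    while (n < nints - 1) {
        long v = strtol(str, &end, 0);
        if (end == str)
            break;                  /* no digits: stop parsing */
        out[++n] = (int)v;
        str = end;
        if (*str != ',')
            break;
        str++;                      /* skip the comma */
    }
    out[0] = n;
    return str;                     /* unparsed remainder */
}
```

Parsing "1,3,5" yields out[0]=3 with values 1, 3, 5 in slots 1..3, matching how the boot parameter "isolcpus=1,3,5" populates cpu_isolated_map.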
[PATCH sched-devel 2/7] cpuisol: Do not route IRQs to the CPUs isolated at boot
Most people would expect isolated CPUs to not get any IRQs by default. This happens naturally if a CPU is brought off-line, marked isolated and then brought back online. There was some confusion about this patch originally, so I wanted to clarify that it does not completely disable IRQ handling on the isolated CPUs. Users still have the option of routing IRQs to them by modifying the IRQ affinity mask. I cannot test other archs, hence the patch is for x86_64 only. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- arch/x86/kernel/genapic_flat_64.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/genapic_flat_64.c b/arch/x86/kernel/genapic_flat_64.c index 07352b7..e02e58c 100644 --- a/arch/x86/kernel/genapic_flat_64.c +++ b/arch/x86/kernel/genapic_flat_64.c @@ -21,7 +21,9 @@ static cpumask_t flat_target_cpus(void) { - return cpu_online_map; + cpumask_t target; + cpus_andnot(target, cpu_online_map, cpu_isolated_map); + return target; } static cpumask_t flat_vector_allocation_domain(int cpu) -- 1.5.4.1
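The mask arithmetic at the heart of this one-hunk patch is just set subtraction. A userspace sketch with plain bitmasks standing in for cpumask_t (the function name is illustrative):

```c
#include <assert.h>

/* Model of the patched flat_target_cpus(): the default APIC target
 * set is the online CPUs minus the isolated ones (cpus_andnot in the
 * diff). Users can still widen an individual irq's affinity later
 * via the irq affinity mask, as the commit message notes. */
unsigned long flat_target_cpus_model(unsigned long online,
                                     unsigned long isolated)
{
    return online & ~isolated;
}
```

With CPUs 0-3 online and 2-3 isolated (masks 0xF and 0xC), new IRQs default to CPUs 0-1 only (mask 0x3).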
[PATCH sched-devel 5/7] cpuisol: Documentation updates
Documented sysfs interface as suggested by Andrew Morton. Added general documentation that describes how to configure and use CPU isolation features. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- Documentation/ABI/testing/sysfs-devices-system-cpu | 41 +++ Documentation/cpu-isolation.txt | 113 2 files changed, 154 insertions(+), 0 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu new file mode 100644 index 000..32dde5b --- /dev/null +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -0,0 +1,41 @@ +What: /sys/devices/system/cpu/... +Date: Feb. 2008 +KernelVersion: 2.6.24 +Contact: LKML <linux-kernel@vger.kernel.org> +Description: + +The /sys/devices/system/cpu tree provides information about all cpus +known to the running kernel. + +Following files are created for each cpu. 'N' is the cpu number. + +/sys/devices/system/cpu/cpuN/ + online (0644) On-line attribute. Indicates whether the cpu is on-line. +The cpu can be brought off-line by writing '0' into +this file. Similarly it can be brought back on-line +by writing '1' into this file. This attribute is +not available for the cpus that cannot be brought +off-line. Typically cpu0. For more information see +Documentation/cpu-hotplug.txt + + isolated (0644) Isolation attribute. Indicates whether the cpu +is isolated. +The cpu can be isolated by writing '1' into this +file. Similarly it can be un-isolated by writing +'0' into this file. In order to isolate the cpu it +must first be brought off-line. This attribute is +not available for the cpus that cannot be brought +off-line. Typically cpu0. +Note this attribute is present only if CPU isolation +is enabled. For more information see +Documentation/cpu-isolation.txt + + cpufreq (0755) Frequency scaling state. +For more info see +Documentation/cpu-freq/... + + cache (0755) Cache information. FIXME + + cpuidle (0755) Idle state information.
FIXME + + topology (0755) Topology information. FIXME diff --git a/Documentation/cpu-isolation.txt b/Documentation/cpu-isolation.txt new file mode 100644 index 000..b9ca425 --- /dev/null +++ b/Documentation/cpu-isolation.txt @@ -0,0 +1,113 @@ +CPU isolation support in Linux(tm) Kernel + +Maintainers: + +Scheduler and scheduler domain bits: + Ingo Molnar <[EMAIL PROTECTED]> + +General framework, irq and workqueue isolation: + Max Krasnyanskiy <[EMAIL PROTECTED]> + +ChangeLog: +- Initial version. Feb 2008, MaxK + +Introduction + + +The primary idea behind CPU isolation is the ability to use some CPU cores +as dedicated engines for running user-space code with minimal kernel +overhead/intervention, think of it as an SPE in the Cell processor. For +example CPU isolation allows for running a CPU intensive (100%) RT task +on one of the processors without adversely affecting or being affected +by the other system activities. With the current (as of early 2008) +multi-core CPU trend we may see more and more applications that explore +this capability: real-time gaming engines, simulators, hard real-time +apps, etc. + +Current CPU isolation support consists of the following features: + +1. Isolated CPU(s) are excluded from the scheduler load balancing logic. + Applications must explicitly bind threads in order to run on those + CPU(s). + +2. By default interrupts are not routed to the isolated CPU(s). + Users must route interrupts (if any) to those CPU(s) explicitly. + +3. Kernel avoids any activity on the isolated CPU(s) as much as possible. + This includes workqueues, per CPU threads, etc. Please note that + this feature is optional and is disabled by default. + +Kernel configuration options + + +Following options need to be enabled in order to use CPU isolation + CONFIG_CPUISOL Top-level config option. Enables general +CPU isolation framework and enables features +#1 and #2 described above.
+ + CONFIG_CPUISOL_WORKQUEUEThese options provide deeper isolation + CONFIG_CPUISOL_STOPMACHINE from various kernel subsystems. They implement + CONFIG_CPUISOL_... feature #3 described above. +See Kconfig help for more information on each +individual option. + +How to isolate a CPU + + +There are two ways for isolating a CPU + +Kernel boot command line
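Feature #1 above means applications must opt onto an isolated CPU themselves. A minimal sketch of that step in C, assuming a glibc Linux environment; the actual sched_setaffinity() call is left commented out so the sketch stays side-effect free, and the function name is illustrative:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Build the single-CPU affinity mask an application would pass to
 * sched_setaffinity() to bind a thread onto isolated CPU 'cpu'.
 * Binding is required because isolated CPUs are excluded from the
 * scheduler's load balancing. */
void isolated_cpu_mask(cpu_set_t *set, int cpu)
{
    CPU_ZERO(set);
    CPU_SET(cpu, set);
    /* Real code would follow with:
     *   sched_setaffinity(0, sizeof(*set), set);
     * and optionally switch to SCHED_FIFO via sched_setscheduler(). */
}
```

After this call the mask contains exactly the requested CPU and nothing else.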
[PATCH sched-devel 1/7] cpuisol: Make cpu isolation configurable and export isolated map
This simple patch introduces a new config option for CPU isolation. The reason I created the separate Kconfig file here is because more options will be added by the following patches. The patch also exports cpu_isolated_map, provides a cpu_isolated() accessor macro and provides access to the isolation bit via sysfs. In other words cpu_isolated_map is exposed to the rest of the kernel and the user-space in much the same way cpu_online_map is exposed today. While at it I also moved cpu_*_map from kernel/sched.c into kernel/cpu.c. Those maps have very little to do with the scheduler these days and therefore seem out of place in the scheduler code. This patch does not change/affect any existing scheduler functionality. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- arch/x86/Kconfig|1 + drivers/base/cpu.c | 48 ++ include/linux/cpumask.h |3 ++ kernel/Kconfig.cpuisol | 15 ++ kernel/Makefile |4 +- kernel/cpu.c| 49 +++ kernel/sched.c | 36 -- 7 files changed, 118 insertions(+), 38 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3be2305..d228488 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -526,6 +526,7 @@ config SCHED_MC increased overhead in some places. If unsure say N here. source "kernel/Kconfig.preempt" +source "kernel/Kconfig.cpuisol" config X86_UP_APIC bool "Local APIC support on uniprocessors" diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 499b003..b6c5e0f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -55,10 +55,58 @@ static ssize_t store_online(struct sys_device *dev, const char *buf, } static SYSDEV_ATTR(online, 0644, show_online, store_online); +#ifdef CONFIG_CPUISOL +/* + * This is under config hotplug because in order to + * dynamically isolate a CPU it needs to be brought off-line first.
+ * In other words the sequence is + * echo 0 > /sys/device/system/cpuN/online + * echo 1 > /sys/device/system/cpuN/isolated + * echo 1 > /sys/device/system/cpuN/online + */ +static ssize_t show_isol(struct sys_device *dev, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + + return sprintf(buf, "%u\n", !!cpu_isolated(cpu->sysdev.id)); +} + +static ssize_t store_isol(struct sys_device *dev, const char *buf, + size_t count) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + ssize_t ret = 0; + + if (cpu_online(cpu->sysdev.id)) + return -EBUSY; + + switch (buf[0]) { + case '0': + cpu_clear(cpu->sysdev.id, cpu_isolated_map); + break; + case '1': + cpu_set(cpu->sysdev.id, cpu_isolated_map); + break; + default: + ret = -EINVAL; + } + + if (ret >= 0) + ret = count; + return ret; +} +static SYSDEV_ATTR(isolated, 0600, show_isol, store_isol); +#endif /* CONFIG_CPUISOL */ + static void __devinit register_cpu_control(struct cpu *cpu) { sysdev_create_file(&cpu->sysdev, &attr_online); + +#ifdef CONFIG_CPUISOL + sysdev_create_file(&cpu->sysdev, &attr_isolated); +#endif } + void unregister_cpu(struct cpu *cpu) { int logical_cpu = cpu->sysdev.id; diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 7047f58..cde2964 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_t *dstp, const cpumask_t *srcp, extern cpumask_t cpu_possible_map; extern cpumask_t cpu_online_map; extern cpumask_t cpu_present_map; +extern cpumask_t cpu_isolated_map; #if NR_CPUS > 1 #define num_online_cpus() cpus_weight(cpu_online_map) @@ -388,6 +389,7 @@ extern cpumask_t cpu_present_map; #define cpu_online(cpu)cpu_isset((cpu), cpu_online_map) #define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map) #define cpu_present(cpu) cpu_isset((cpu), cpu_present_map) +#define cpu_isolated(cpu) cpu_isset((cpu), cpu_isolated_map) #else #define num_online_cpus() 1 #define num_possible_cpus() 1 @@ -395,6 +397,7 @@ extern
cpumask_t cpu_present_map; #define cpu_online(cpu)((cpu) == 0) #define cpu_possible(cpu) ((cpu) == 0) #define cpu_present(cpu) ((cpu) == 0) +#define cpu_isolated(cpu) (0) #endif #define cpu_is_offline(cpu)unlikely(!cpu_online(cpu)) diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol new file mode 100644 index 000..e606477 --- /dev/null +++ b/kernel/Kconfig.cpuisol @@ -0,0 +1,15 @@ +config CPUISOL + depends on SMP + bool CPU isolation + help + This option enables support for CPU isolation. + If enabled the kernel will try to avoid kernel activity on the isolated CPUs. + By default user-space
[RFC] Genirq and CPU isolation
Hi Thomas,

While reviewing CPU isolation patches Peter pointed out that instead of changing arch specific irq handling I should be extending genirq code. Which makes perfect sense. Why didn't I think of that before :)

Basically the idea is that by default isolated CPUs must not get HW irqs routed to them (besides IPIs and stuff of course). Does the patch included below look like the right approach ?

btw select_smp_affinity() which is currently used only by alpha seemed out of place. It's called multiple times for shared irqs. ie Every time a new handler is registered the irq is moved to a different CPU. So I moved it under the if (!shared) check inside setup_irq().

The patch introduces a generic version of select_smp_affinity() that sets the affinity mask to online_cpus - isolated_cpus, and updates the x86_32 and alpha load balancers to ignore isolated cpus.

Booted on Core2 laptop and dual Opteron boxes with and w/o isolcpus= options and everything seems to work as expected. I wanted to run this by you before I include it in my patch series.

Thanx
Max

diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index facf82a..6b01702 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -51,7 +51,7 @@ select_smp_affinity(unsigned int irq)
 	if (!irq_desc[irq].chip->set_affinity || irq_user_affinity[irq])
 		return 1;
 
-	while (!cpu_possible(cpu))
+	while (!cpu_possible(cpu) || cpu_isolated(cpu))
 		cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0);
 	last_cpu = cpu;
 
diff --git a/arch/x86/kernel/genapic_flat_64.c b/arch/x86/kernel/genapic_flat_64.c
index e02e58c..07352b7 100644
--- a/arch/x86/kernel/genapic_flat_64.c
+++ b/arch/x86/kernel/genapic_flat_64.c
@@ -21,9 +21,7 @@
 static cpumask_t flat_target_cpus(void)
 {
-	cpumask_t target;
-	cpus_andnot(target, cpu_online_map, cpu_isolated_map);
-	return target;
+	return cpu_online_map;
 }
 
 static cpumask_t flat_vector_allocation_domain(int cpu)
diff --git a/arch/x86/kernel/io_apic_32.c b/arch/x86/kernel/io_apic_32.c
index 4ca5486..9c8816f 100644
--- a/arch/x86/kernel/io_apic_32.c
+++ b/arch/x86/kernel/io_apic_32.c
@@ -468,7 +468,7 @@ static void do_irq_balance(void)
 	for_each_possible_cpu(i) {
 		int package_index;
 		CPU_IRQ(i) = 0;
-		if (!cpu_online(i))
+		if (!cpu_online(i) || cpu_isolated(i))
 			continue;
 		package_index = CPU_TO_PACKAGEINDEX(i);
 		for (j = 0; j < NR_IRQS; j++) {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 176e5e7..287bc64 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -253,14 +253,7 @@ static inline void set_balance_irq_affinity(unsigned int irq, cpumask_t mask)
 }
 #endif
 
-#ifdef CONFIG_AUTO_IRQ_AFFINITY
 extern int select_smp_affinity(unsigned int irq);
-#else
-static inline int select_smp_affinity(unsigned int irq)
-{
-	return 1;
-}
-#endif
 
 extern int no_irq_affinity;
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 438a014..e74db94 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -376,6 +376,9 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 		} else
 			/* Undo nested disables: */
 			desc->depth = 1;
+
+		/* Set default affinity mask once everything is setup */
+		select_smp_affinity(irq);
 	}
 	/* Reset broken irq detection when installing new handler */
 	desc->irq_count = 0;
@@ -488,6 +491,26 @@ void free_irq(unsigned int irq, void *dev_id)
 }
 EXPORT_SYMBOL(free_irq);
 
+#ifndef CONFIG_AUTO_IRQ_AFFINITY
+/**
+ * Generic version of the affinity autoselector.
+ * Called under desc->lock from setup_irq().
+ * btw Should we rename this to select_irq_affinity() ?
+ */
+int select_smp_affinity(unsigned int irq)
+{
+	cpumask_t usable_cpus;
+
+	if (!irq_can_set_affinity(irq))
+		return 0;
+
+	cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
+	irq_desc[irq].affinity = usable_cpus;
+	irq_desc[irq].chip->set_affinity(irq, usable_cpus);
+	return 0;
+}
+#endif
+
 /**
  * request_irq - allocate an interrupt line
  * @irq: Interrupt line to allocate
@@ -555,8 +578,6 @@ int request_irq(unsigned int irq, irq_handler_t handler,
 	action->next = NULL;
 	action->dev_id = dev_id;
 
-	select_smp_affinity(irq);
-
 #ifdef CONFIG_DEBUG_SHIRQ
 	if (irqflags & IRQF_SHARED) {
 		/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull] CPU isolation extensions (updated)
Ingo Molnar wrote:
> * Max Krasnyansky <[EMAIL PROTECTED]> wrote:
>
>> Ingo said a few different things (a bit too large to quote).
> [...]
>> And at the end he said:
>>> Also, i'd not mind some test-coverage in sched.git as well.
>
>> As far as I know "do not mind" does not mean "must go to" ;-). [...]
>
> the CPU isolation related patches have typically flown through
> sched.git/sched-devel.git, so yes, you can take my "i'd not mind"
> comment as "i'd not mind it at all". That's the tree that all the folks
> who deal with this (such as Paul) are following. So let's go via the
> normal contribution cycle and let this trickle through with all the
> scheduler folks? I'd say 2.6.26 would be a tentative target, if it holds
> up to scrutiny in sched-devel.git (both testing and review wise). And
> because Andrew tracks sched-devel.git it will thus show up in -mm too.

Sounds good. Can you pull my tree then ? Or do you want me to resend the patches.

The tree is here:
	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
Take the for-linus branch. Or as I said please let me know and I'll resend the patches.

Thanx
Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Nick Piggin wrote:
> On Wednesday 13 February 2008 17:06, Max Krasnyansky wrote:
>> Nick Piggin wrote:
>>> But don't let me dissuade you from making these good improvements
>>> to Linux as well :) Just that it isn't really going to be hard-rt
>>> in general.
>> Actually that's the cool thing about CPU isolation. Get rid of all latency
>> sources from the CPU(s) and you get yourself as hard-RT as it gets.
>
> Hmm, maybe. Removing all sources of latency from the CPU kind of
> implies that you have to audit the whole kernel for sources of
> latency.

That's exactly where cpu isolation comes in. It makes sure that an isolated CPU is excluded from:
1. HW interrupts. This means no softirq, etc.
2. Things like workqueues, stop machine, etc. This typically means no timers, etc.
3. Scheduler load balancing (we had support for that for awhile now).

All that's left on that CPU is the scheduler tick and IPIs. And those are just fine. At that point it's up to the app to use or not to use kernel services. In other words no auditing is required. It's RT preempt that needs to audit in order to be general purpose RT.

>> I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec
>> worst case and ~200nsec average latency. I do not even need Adeos/Xenomai
>> or Preempt-RT, just a few very small patches. And it can be used for non RT
>> stuff too.
>
> OK, but you then are very restricted in what you can do, and easily
> can break it especially if you run any userspace on that CPU. If
> you just run a kernel module that, after setup, doesn't use any
> other kernel resources except interrupt handling, then you might be
> OK (depending on whether even interrupt handling can run into
> contended locks)...
>
> If you started doing very much more, then you can easily run into
> trouble.

Yes I'm definitely not selling it as general purpose. And no, it's not just kernel code, it's pure user-space code. Carefully designed user-space code that is. The model is pretty simple.
Let's say you have a dual cpu/core box. The app can be partitioned like this:
- CPU0 handles HW irqs, runs general services, etc and soft-RT threads
- CPU1 runs hard-RT threads or a special engine. For the description of the engine see
  http://marc.info/?l=linux-kernel&m=120232425515556&w=2

hard-RT threads do not need any system call besides pthread_mutex and signals (those are perfectly fine). They can use direct HW access (if needed). ie Memory mapping something and accessing it without the syscalls (see libe1000.sf.net for example). Communication between hard-RT and soft-RT threads is lock-less (single reader/single writer queues, etc).

It may sound fairly limited but you'd be surprised how much you can do. It's relatively easy to design the app that way once you get the hang of it :).

I'm working with our legal folks on releasing the user-space framework and aforementioned engine with a bunch of examples.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Nick Piggin wrote:
> On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
>> David Miller wrote:
>>> From: Nick Piggin <[EMAIL PROTECTED]>
>>> Date: Tue, 12 Feb 2008 17:41:21 +1100
>>>
>>>> stop machine is used for more than just module loading and unloading.
>>>> I don't think you can just disable it.
>>>
>>> Right, in particular it is used for CPU hotplug.
>>
>> Ooops. Totally missed that. And a bunch of other places.
>>
>> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
>> Documentation/cpu-hotplug.txt
>> arch/s390/kernel/kprobes.c
>> drivers/char/hw_random/intel-rng.c
>> include/linux/stop_machine.h
>> kernel/cpu.c
>> kernel/module.c
>> kernel/stop_machine.c
>> mm/page_alloc.c
>>
>> I wonder why I did not see any issues when I disabled stop machine
>> completely. I mentioned in the other thread that I commented out the part
>> that actually halts the machine and ran it for several hours on my dual
>> core laptop and on the quad core server. Tried all kinds of workloads,
>> which include constant module removal and insertion, and cpu hotplug as
>> well. It cannot be just luck :).
>
> It really is. With subtle races, it can take a lot more than a few
> hours. Consider that we have subtle races still in the kernel now,
> which are almost never or rarely hit in maybe 10,000 hours * every
> single person who has been using the current kernel for the past
> year.
>
> For a less theoretical example -- when I was writing the RCU radix
> tree code, I tried to run directed stress tests on a 64 CPU Altix
> machine (which found no bugs). Then I ran it on a dedicated test
> harness that could actually do a lot more than the existing kernel
> users are able to, and promptly found a couple more bugs (on a 2
> CPU system).
>
> But your primary defence against concurrency bugs _has_ to be
> knowing the code and all its interactions.

100% agree. btw For modules though it does not seem like luck (ie that it worked fine for me).
I mean subsystems are supposed to cleanly register/unregister anyway. But I can of course be wrong. We'll see what Rusty says.

>> Clearly though, you guys are right. It cannot be simply disabled. Based on
>> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
>> and the intel rng driver. Hopefully we can avoid it at least in module
>> insertion/removal.
>
> Yes, reducing the number of users by going through their code and
> showing that it is safe, is the right way to do this. Also, you
> could avoid module insertion/removal?

I could. But it'd be nice if I did not have to :)

> FWIW, I think the idea of trying to turn Linux into giving hard
> realtime guarantees is just insane. If that is what you want, you
> would IMO be much better off to spend effort with something like
> improving adeos and communication/administration between Linux and
> the hard-rt kernel.
>
> But don't let me dissuade you from making these good improvements
> to Linux as well :) Just that it isn't really going to be hard-rt
> in general.

Actually that's the cool thing about CPU isolation. Get rid of all latency sources from the CPU(s) and you get yourself as hard-RT as it gets. I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec worst case and ~200nsec average latency. I do not even need Adeos/Xenomai or Preempt-RT, just a few very small patches. And it can be used for non RT stuff too.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Steven Rostedt wrote:
> On Tue, 12 Feb 2008, Peter Zijlstra wrote:
>>> Rusty - Stop machine.
>>>    After doing a bunch of testing last three days I actually downgraded stop machine
>>>    changes from [highly experimental] to simply [experimental]. Please see this thread
>>>    for more info: http://marc.info/?l=linux-kernel&m=120243837206248&w=2
>>>    Short story is that I ran several insmod/rmmod workloads on live multi-core boxes
>>>    with stop machine _completely_ disabled and did not see any issues. Rusty did not get
>>>    a chance to reply yet, I'm hoping that we'll be able to make "stop machine" completely
>>>    optional for some configurations.
>
> This part really scares me. The comment that you say you have run several
> insmod/rmmod workloads without kstop_machine doesn't mean that it is still
> safe. A lot of races that things like this protect may only happen under
> load once a month. But the fact that it happens at all is reason to have
> the protection.
>
> Before taking out any protection, please analyze it in detail and report
> your findings why something is not needed. Not just some general hand
> waving and "it doesn't crash on my box".

Sure. I did not say lets disable it. I was hoping we could and I wanted to see what Rusty Russell has to say about this.

> Besides that, kstop_machine may be used by other features that can have an
> impact.

Yes it is. I missed a few. Nick and Dave already pointed out CPU hotplug. I looked around and found more users. So disabling stop machine completely is definitely out.

> Again, if you have a system that cant handle things like kstop_machine,
> than don't do things that require a kstop_machine run. All modules should
> be loaded, and no new modules should be added when the system is
> performing critical work. I see no reason for disabling kstop_machine.

I'm considering that option. So far it does not seem practical. At least the way we use those machines at this point.
If we can prove that at least not halting isolated CPUs is safe, that'd be better.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Peter Zijlstra wrote:
> On Mon, 2008-02-11 at 20:10 -0800, Max Krasnyansky wrote:
>> Andrew, looks like Linus decided not to pull this stuff.
>> Can we please put it into -mm then.
>>
>> My tree is here
>>	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
>> Please use 'master' branch (or 'for-linus' they are identical).
>
> I'm wondering why you insist on offering a git tree that bypasses the
> regular maintainers. Why not post the patches and walk the normal route?
>
> To me this feels rather aggressive, which makes me feel less inclined to
> look at it.

Peter, it may sound stupid but I'm honestly not sure what you mean. Please bear with me, I do not mean to sound arrogant. I'm looking for advice here. So here are some questions:

- First, who would the regular maintainer be in this case ? I felt that cpu isolation can just sit in its own tree since it does not seem to belong to any existing stuff. So far people suggested -mm and -sched. I do not think it has much to do with -sched. -mm seems more general purpose; since Linus did not pull it directly I asked Andrew to take this stuff into -mm. He was already ok with the patches when I sent the original pull request to Linus.

- Is it not easier for a regular maintainer (whoever it turns out to be in this case) to pull from GIT rather than use patches ? In any case I did post patches along with the pull request. So for example if Andrew prefers patches he could take those instead of the git. In fact if you look at my email I mentioned that if needed I can repost the patches.

- And last but not least I want to be able to just tell people who want to use CPU isolation "Go get this tree and use it". Git is the best for that.

I can see how the pull request to Linus may have been a bit aggressive. But then again I posted patches (_without_ pull request). Got feedback from You, Paul and a couple of other guys. Addressed/explained issues/questions. Posted patches again (_without_ pull request).
Got _zero_ replies even though folks who replied to the first patchset were replying to other things in the same timeframe. So I figured since I addressed everything you guys are happy, why not push it to Linus.

So what did I do wrong ?

Max

>> Diffstat:
>>  Documentation/ABI/testing/sysfs-devices-system-cpu |   41 +++
>>  Documentation/cpu-isolation.txt                    |  113 +
>>  arch/x86/Kconfig                                   |    1
>>  arch/x86/kernel/genapic_flat_64.c                  |    4
>>  drivers/base/cpu.c                                 |   48
>>  include/linux/cpumask.h                            |    3
>>  kernel/Kconfig.cpuisol                             |   42 +++
>>  kernel/Makefile                                    |    4
>>  kernel/cpu.c                                       |   54 ++
>>  kernel/sched.c                                     |   36 --
>>  kernel/stop_machine.c                              |    8 +
>>  kernel/workqueue.c                                 |   30 -
>>  12 files changed, 337 insertions(+), 47 deletions(-)
>>
>> This addresses all Andrew's comments for the last submission. Details here:
>>	http://marc.info/?l=linux-kernel&m=120236394012766&w=2
>>
>> There are no code changes since last time, besides minor fix for moving
>> on-stack array to __initdata as suggested by Andrew. Other stuff is just
>> documentation updates.
>>
>> List of commits
>>	cpuisol: Make cpu isolation configrable and export isolated map
>>	cpuisol: Do not route IRQs to the CPUs isolated at boot
>>	cpuisol: Do not schedule workqueues on the isolated CPUs
>>	cpuisol: Move on-stack array used for boot cmd parsing into __initdata
>>	cpuisol: Documentation updates
>>	cpuisol: Minor updates to the Kconfig options
>>	cpuisol: Do not halt isolated CPUs with Stop Machine
>>
>> As suggested by Ingo I'm CC'ing everyone who is even remotely
>> connected/affected ;-)

> You forgot Oleg, he does a lot of the workqueue work.
>
> I'm worried by your approach to never start any workqueue on these cpus.
> Like you said, it breaks Oprofile and others who depend on cpu local
> workqueues being present.
>
> Under normal circumstances these workqueues will not do any work,
> someone needs to provide work for them. That is, workqueues are passive.
>
> So I think your approach is the wrong way about.
Instead of taking the workqueue away, take away those that generate the work.

>> Ingo, Peter - Scheduler.
>>	There are _no_ changes in this area besides moving cpu_*_map maps from kernel/sched.c to kernel/cpu.c.
>> Ingo (and Thomas) do the genirq bits

> The IRQ isolation in concept isn't wrong. But it seems to me that
> arch/x86/kernel/genapic_flat_64.c isn't the best place to do this. It just
> considers one architecture, if you do this, please make it work across all.

>> Paul - Cpuset
>>	Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see
Re: [git pull for -mm] CPU isolation extensions (updated2)
David Miller wrote:
> From: Nick Piggin <[EMAIL PROTECTED]>
> Date: Tue, 12 Feb 2008 17:41:21 +1100
>
>> stop machine is used for more than just module loading and unloading.
>> I don't think you can just disable it.
>
> Right, in particular it is used for CPU hotplug.

Ooops. Totally missed that. And a bunch of other places.

[EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
Documentation/cpu-hotplug.txt
arch/s390/kernel/kprobes.c
drivers/char/hw_random/intel-rng.c
include/linux/stop_machine.h
kernel/cpu.c
kernel/module.c
kernel/stop_machine.c
mm/page_alloc.c

I wonder why I did not see any issues when I disabled stop machine completely. I mentioned in the other thread that I commented out the part that actually halts the machine and ran it for several hours on my dual core laptop and on the quad core server. Tried all kinds of workloads, which include constant module removal and insertion, and cpu hotplug as well. It cannot be just luck :).

Clearly though, you guys are right. It cannot be simply disabled. Based on the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390 and the intel rng driver. Hopefully we can avoid it at least in module insertion/removal.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Peter Zijlstra wrote: On Mon, 2008-02-11 at 20:10 -0800, Max Krasnyansky wrote: Andrew, looks like Linus decided not to pull this stuff. Can we please put it into -mm then. My tree is here git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git Please use 'master' branch (or 'for-linus' they are identical). I'm wondering why you insist on offering a git tree that bypasses the regular maintainers. Why not post the patches and walk the normal route? To me this feels rather aggressive, which makes me feel less inclined to look at it. Peter, it may sound stupid but I'm honestly not sure what you mean. Please bear with me I do not mean to sounds arrogant. I'm looking for advice here. So here are some questions: - First, who would the regular maintainer be in this case ? I felt that cpu isolation can just sit in its own tree since it does not seem to belong to any existing stuff. So far people suggested -mm and -shed. I do not think it has much to do much with the -sched. -mm seems more general purpose, since Linus did not pull it directly I asked Andrew to take this stuff into -mm. He was already ok with the patches when I sent original pull request to Linus. - Is it not easier for a regular maintainer (whoever it turns out to be in this case) to pull from GIT rather than use patches ? In any case I did post patches along with pull request. So for example if Andrew prefers patches he could take those instead of the git. In fact if you look at my email I mentioned that if needed I can repost the patches. - And last but not least I want to be able to just tell people who want to use CPU isolation Go get get this tree and use it. Git it the best for that. I can see how pull request to Linus may have been a bit aggressive. But then again I posted patches (_without_ pull request). Got feedback from You, Paul and couple of other guys. Addressed/explained issues/questions. Posted patches again (_without_ pull request). 
Got _zero_ replies even though folks who replied to the first patchset were replying to other things in the same timeframe. So I figured since I addressed everything you guys are happy, why not push it to Linus. So what did I do wrong ? Max Diffstat: Documentation/ABI/testing/sysfs-devices-system-cpu | 41 +++ Documentation/cpu-isolation.txt| 113 + arch/x86/Kconfig |1 arch/x86/kernel/genapic_flat_64.c |4 drivers/base/cpu.c | 48 include/linux/cpumask.h|3 kernel/Kconfig.cpuisol | 42 +++ kernel/Makefile|4 kernel/cpu.c | 54 ++ kernel/sched.c | 36 -- kernel/stop_machine.c |8 + kernel/workqueue.c | 30 - 12 files changed, 337 insertions(+), 47 deletions(-) This addresses all Andrew's comments for the last submission. Details here: http://marc.info/?l=linux-kernelm=120236394012766w=2 There are no code changes since last time, besides minor fix for moving on-stack array to __initdata as suggested by Andrew. Other stuff is just documentation updates. List of commits cpuisol: Make cpu isolation configrable and export isolated map cpuisol: Do not route IRQs to the CPUs isolated at boot cpuisol: Do not schedule workqueues on the isolated CPUs cpuisol: Move on-stack array used for boot cmd parsing into __initdata cpuisol: Documentation updates cpuisol: Minor updates to the Kconfig options cpuisol: Do not halt isolated CPUs with Stop Machine I suggested by Ingo I'm CC'ing everyone who is even remotely connected/affected ;-) You forgot Oleg, he does a lot of the workqueue work. I'm worried by your approach to never start any workqueue on these cpus. Like you said, it breaks Oprofile and others who depend on cpu local workqueues being present. Under normal circumstances these workqueues will not do any work, someone needs to provide work for them. That is, workqueues are passive. So I think your approach is the wrong way about. Instead of taking the workqueue away, take away those that generate the work. Ingo, Peter - Scheduler. 
There are _no_ changes in this area besides moving cpu_*_map maps from kerne/sched.c to kernel/cpu.c. Ingo (and Thomas) do the genirq bits The IRQ isolation in concept isn't wrong. But it seems to me that arch/x86/kernel/genapic_flat_64.c isn't the best place to do this. It just considers one architecture, if you do this, please make it work across all. Paul - Cpuset Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see
Re: [git pull for -mm] CPU isolation extensions (updated2)
David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Tue, 12 Feb 2008 17:41:21 +1100 stop machine is used for more than just module loading and unloading. I don't think you can just disable it. Right, in particular it is used for CPU hotplug. Ooops. Totally missed that. And a bunch of other places. [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run Documentation/cpu-hotplug.txt arch/s390/kernel/kprobes.c drivers/char/hw_random/intel-rng.c include/linux/stop_machine.h kernel/cpu.c kernel/module.c kernel/stop_machine.c mm/page_alloc.c I wonder why I did not see any issues when I disabled stop machine completely. I mentioned in the other thread that I commented out the part that actually halts the machine and ran it for several hours on my dual core laptop and on the quad core server. Tried all kinds of workloads, which include constant module removal and insertion, and cpu hotplug as well. It cannot be just luck :). Clearly though, you guys are right. It cannot be simply disabled. Based on the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390 and intel rng driver. Hopefully we can avoid it at least in module insertion/removal. Max -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull for -mm] CPU isolation extensions (updated2)
Steven Rostedt wrote: On Tue, 12 Feb 2008, Peter Zijlstra wrote: Rusty - Stop machine. After doing a bunch of testing last three days I actually downgraded stop machine changes from [highly experimental] to simply [experimental]. Pleas see this thread for more info: http://marc.info/?l=linux-kernelm=120243837206248w=2 Short story is that I ran several insmod/rmmod workloads on live multi-core boxes with stop machine _completely_ disabled and did no see any issues. Rusty did not get a chance to reply yet, I hopping that we'll be able to make stop machine completely optional for some configurations. This part really scares me. The comment that you say you have run several insmod/rmmod workloads without kstop_machine doesn't mean that it is still safe. A lot of races that things like this protect may only happen under load once a month. But the fact that it happens at all is reason to have the protection. Before taking out any protection, please analyze it in detail and report your findings why something is not needed. Not just some general hand waving and it doesn't crash on my box. Sure. I did not say lets disable it. I was hopping we could and I wanted to see what Rusty Russell has to say about this. Besides that, kstop_machine may be used by other features that can have an impact. Yes it is. I missed a few. Nick and Dave already pointed out CPU hotplug. I looked around and found more users. So disabling stop machine completely is definitely out. Again, if you have a system that cant handle things like kstop_machine, than don't do things that require a kstop_machine run. All modules should be loaded, and no new modules should be added when the system is performing critical work. I see no reason for disabling kstop_machine. I'm considering that option. So far it does not seem practical. At least the way we use those machines at this point. If we can prove that at least not halting isolation CPUs is safe that'd be better. 
Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull for -mm] CPU isolation extensions (updated2)
Nick Piggin wrote:
> On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
>> David Miller wrote:
>>> From: Nick Piggin [EMAIL PROTECTED]
>>> Date: Tue, 12 Feb 2008 17:41:21 +1100
>>>> stop machine is used for more than just module loading and unloading.
>>>> I don't think you can just disable it.
>>> Right, in particular it is used for CPU hotplug.
>> Ooops. Totally missed that. And a bunch of other places.
>>
>> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
>> Documentation/cpu-hotplug.txt
>> arch/s390/kernel/kprobes.c
>> drivers/char/hw_random/intel-rng.c
>> include/linux/stop_machine.h
>> kernel/cpu.c
>> kernel/module.c
>> kernel/stop_machine.c
>> mm/page_alloc.c
>>
>> I wonder why I did not see any issues when I disabled stop machine
>> completely. I mentioned in the other thread that I commented out the part
>> that actually halts the machine and ran it for several hours on my dual
>> core laptop and on the quad core server. Tried all kinds of workloads,
>> which include constant module removal and insertion, and cpu hotplug as
>> well. It cannot be just luck :).
> It really is. With subtle races, it can take a lot more than a few hours.
> Consider that we have subtle races still in the kernel now, which are almost
> never or rarely hit in maybe 10,000 hours * every single person who has been
> using the current kernel for the past year.
> For a less theoretical example -- when I was writing the RCU radix tree
> code, I tried to run directed stress tests on a 64 CPU Altix machine (which
> found no bugs). Then I ran it on a dedicated test harness that could
> actually do a lot more than the existing kernel users are able to, and
> promptly found a couple more bugs (on a 2 CPU system).
> But your primary defence against concurrency bugs _has_ to be knowing the
> code and all its interactions.

100% agree. btw For modules though it does not seem like luck (i.e. that it worked fine for me). I mean subsystems are supposed to cleanly register/unregister anyway. But I can of course be wrong. We'll see what Rusty says.

>> Clearly though, you guys are right. It cannot be simply disabled. Based on
>> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
>> and the intel rng driver. Hopefully we can avoid it at least in module
>> insertion/removal.
> Yes, reducing the number of users by going through their code and showing
> that it is safe, is the right way to do this. Also, you could avoid module
> insertion/removal?

I could. But it'd be nice if I did not have to :)

> FWIW, I think the idea of trying to turn Linux into giving hard realtime
> guarantees is just insane. If that is what you want, you would IMO be much
> better off to spend effort with something like improving adeos and
> communication/administration between Linux and the hard-rt kernel.
> But don't let me dissuade you from making these good improvements to Linux
> as well :) Just that it isn't really going to be hard-rt in general.

Actually that's the cool thing about CPU isolation. Get rid of all latency sources from the CPU(s) and you get yourself as hard-RT as it gets. I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec worst case and ~200 nsec average latency. I do not even need Adeos/Xenomai or Preempt-RT, just a few very small patches. And it can be used for non-RT stuff too.

Max
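Nick's "subtle races" point is easy to demonstrate outside the kernel. Here is a toy Python sketch (nothing to do with stop_machine itself, and the `time.sleep` is an artificial trick): a read-modify-write race whose window has been deliberately widened so the lost update happens every run, whereas with a realistic nanosecond-wide window a stress test could spin for days without hitting it.

```python
import threading
import time

counter = 0

def racy_increment():
    """Unlocked read-modify-write with an artificially widened race window."""
    global counter
    tmp = counter        # read
    time.sleep(0.2)      # both threads now hold the same stale value
    counter = tmp + 1    # write back -- one of the two updates is lost

threads = [threading.Thread(target=racy_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 1, not 2: the classic lost update
```

Shrink the sleep to zero and the same bug is still there, but you may never observe it under stress -- which is exactly why "it ran all day on my laptop" is weak evidence.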
[git pull for -mm] CPU isolation extensions (updated2)
Andrew, looks like Linus decided not to pull this stuff. Can we please put it into -mm then.

My tree is here
	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
Please use 'master' branch (or 'for-linus', they are identical). There are no changes since last time I sent it. Details below. Patches were sent out two days ago. I can resend them if needed.

Thanx
Max

Diffstat:
 Documentation/ABI/testing/sysfs-devices-system-cpu |  41 +++
 Documentation/cpu-isolation.txt                    | 113 +
 arch/x86/Kconfig                                   |   1
 arch/x86/kernel/genapic_flat_64.c                  |   4
 drivers/base/cpu.c                                 |  48
 include/linux/cpumask.h                            |   3
 kernel/Kconfig.cpuisol                             |  42 +++
 kernel/Makefile                                    |   4
 kernel/cpu.c                                       |  54 ++
 kernel/sched.c                                     |  36 --
 kernel/stop_machine.c                              |   8 +
 kernel/workqueue.c                                 |  30 -
 12 files changed, 337 insertions(+), 47 deletions(-)

This addresses all Andrew's comments for the last submission. Details here:
http://marc.info/?l=linux-kernel&m=120236394012766&w=2
There are no code changes since last time, besides a minor fix for moving an on-stack array to __initdata as suggested by Andrew. Other stuff is just documentation updates.

List of commits:
 cpuisol: Make cpu isolation configrable and export isolated map
 cpuisol: Do not route IRQs to the CPUs isolated at boot
 cpuisol: Do not schedule workqueues on the isolated CPUs
 cpuisol: Move on-stack array used for boot cmd parsing into __initdata
 cpuisol: Documentation updates
 cpuisol: Minor updates to the Kconfig options
 cpuisol: Do not halt isolated CPUs with Stop Machine

As suggested by Ingo I'm CC'ing everyone who is even remotely connected/affected ;-)

Ingo, Peter - Scheduler.
There are _no_ changes in this area besides moving cpu_*_map maps from kernel/sched.c to kernel/cpu.c.

Paul - Cpuset.
Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see this thread:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Rusty - Stop machine.
After doing a bunch of testing the last three days I actually downgraded the stop machine changes from [highly experimental] to simply [experimental]. Please see this thread for more info:
http://marc.info/?l=linux-kernel&m=120243837206248&w=2
Short story is that I ran several insmod/rmmod workloads on live multi-core boxes with stop machine _completely_ disabled and did not see any issues. Rusty did not get a chance to reply yet; I'm hoping that we'll be able to make "stop machine" completely optional for some configurations.

Greg - ABI documentation.
Nothing interesting here. I simply added Documentation/ABI/testing/sysfs-devices-system-cpu and documented some of the attributes exposed in there. Suggested by Andrew.

I believe this is ready for inclusion and my impression is that Andrew is ok with that. Most changes are very simple and do not affect existing behavior. As I mentioned before, I've been using the Workqueue and StopMachine changes in production for a couple of years now and have high confidence in them. Yet they are marked as experimental for now, just to be safe. My original explanation is included below.

btw I'll be out skiing/snowboarding for the next 4 days and will have sporadic email access. Will do my best to address questions/concerns (if any) during that time.

Thanx
Max
--
This patch series extends CPU isolation support. Yes, most people want to virtualize CPUs these days and I want to isolate them :).
The primary idea here is to be able to use some CPU cores as dedicated engines for running user-space code with minimal kernel overhead/intervention, think of it as an SPE in the Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. I'm personally using this for hard realtime purposes.
With CPU isolation it's very easy to achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load. I'm working with legal folks on releasing a hard RT user-space framework for that. I believe with the current multi-core CPU trend we will see more and more applications that explore this capability: RT gaming engines, simulators, hard RT apps, etc. Hence the
Re: [git pull] CPU isolation extensions (updated)
Paul Jackson wrote:
> Max wrote:
>> Linus, please pull CPU isolation extensions from
> Did I miss something in this discussion? I thought Ingo was quite clear,
> and Linus pretty clear too, that this patch should bake in *-mm or some
> such place for a bit first.

Andrew said:
> The feature as a whole seems useful, and I don't actually oppose the merge
> based on what I see here. As long as you're really sure that cpusets are
> inappropriate (and bear in mind that Paul has a track record of being wrong
> on this :)). But I see a few glitches

As far as I can understand Andrew is ok with the merge. And I addressed all his comments.

Linus said:
> Have these been in -mm and widely discussed etc? I'd like to start more
> carefully, and (a) have that controversial last patch not merged initially
> and (b) make sure everybody is on the same page wrt this all..

As far as I can understand Linus _asked_ whether it was in -mm or not and whether everybody's on the same page. He did not say "this must be in -mm first". I explained that it has not been in -mm, and who it was discussed with, and did a bunch more testing/investigation on the controversial patch and explained why I think it's not that controversial any more.

Ingo said a few different things (a bit too large to quote).
- That it was not discussed. I explained that it was in fact discussed and provided a bunch of pointers to the mail threads.
- That he thinks that cpuset is the way to do it. Again I explained why it's not.
And at the end he said:
> Also, i'd not mind some test-coverage in sched.git as well.
As far as I know "do not mind" does not mean "must go to" ;-). Also I replied that I did not mind either but I do not think that it has much (if anything) to do with the scheduler.

Anyway. I think I mentioned that I did not mind -mm either. I think it's ready for the mainline. But if people still strongly feel that it has to be in -mm, that's fine. Let's just do s/Linus/Andrew/ on the first line and move on.
But if Linus pulls it now, even better ;-)
Andrew, Linus, I'll let you guys decide which tree it needs to go into.

Max
[git pull] CPU isolation extensions (updated)
Linus, please pull CPU isolation extensions from
	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus

Diffstat:
 Documentation/ABI/testing/sysfs-devices-system-cpu |  41 +++
 Documentation/cpu-isolation.txt                    | 113 +
 arch/x86/Kconfig                                   |   1
 arch/x86/kernel/genapic_flat_64.c                  |   4
 drivers/base/cpu.c                                 |  48
 include/linux/cpumask.h                            |   3
 kernel/Kconfig.cpuisol                             |  42 +++
 kernel/Makefile                                    |   4
 kernel/cpu.c                                       |  54 ++
 kernel/sched.c                                     |  36 --
 kernel/stop_machine.c                              |   8 +
 kernel/workqueue.c                                 |  30 -
 12 files changed, 337 insertions(+), 47 deletions(-)

This addresses all Andrew's comments for the last submission. Details here:
http://marc.info/?l=linux-kernel&m=120236394012766&w=2
There are no code changes since last time, besides a minor fix for moving an on-stack array to __initdata as suggested by Andrew. Other stuff is just documentation updates.

List of commits:
 cpuisol: Make cpu isolation configrable and export isolated map
 cpuisol: Do not route IRQs to the CPUs isolated at boot
 cpuisol: Do not schedule workqueues on the isolated CPUs
 cpuisol: Move on-stack array used for boot cmd parsing into __initdata
 cpuisol: Documentation updates
 cpuisol: Minor updates to the Kconfig options
 cpuisol: Do not halt isolated CPUs with Stop Machine

As suggested by Ingo I'm CC'ing everyone who is even remotely connected/affected ;-)

Ingo, Peter - Scheduler.
There are _no_ changes in this area besides moving cpu_*_map maps from kernel/sched.c to kernel/cpu.c.

Paul - Cpuset.
Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see this thread:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Rusty - Stop machine.
After doing a bunch of testing the last three days I actually downgraded the stop machine changes from [highly experimental] to simply [experimental].
Please see this thread for more info:
http://marc.info/?l=linux-kernel&m=120243837206248&w=2
Short story is that I ran several insmod/rmmod workloads on live multi-core boxes with stop machine _completely_ disabled and did not see any issues. Rusty did not get a chance to reply yet; I'm hoping that we'll be able to make "stop machine" completely optional for some configurations.

Greg - ABI documentation.
Nothing interesting here. I simply added Documentation/ABI/testing/sysfs-devices-system-cpu and documented some of the attributes exposed in there. Suggested by Andrew.

I believe this is ready for inclusion and my impression is that Andrew is ok with that. Most changes are very simple and do not affect existing behavior. As I mentioned before, I've been using the Workqueue and StopMachine changes in production for a couple of years now and have high confidence in them. Yet they are marked as experimental for now, just to be safe. My original explanation is included below.

btw I'll be out skiing/snowboarding for the next 4 days and will have sporadic email access. Will do my best to address questions/concerns (if any) during that time.

Thanx
Max
--
This patch series extends CPU isolation support. Yes, most people want to virtualize CPUs these days and I want to isolate them :).
The primary idea here is to be able to use some CPU cores as dedicated engines for running user-space code with minimal kernel overhead/intervention, think of it as an SPE in the Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load.
I'm working with legal folks on releasing a hard RT user-space framework for that. I believe with the current multi-core CPU trend we will see more and more applications that explore this capability: RT gaming engines, simulators, hard RT apps, etc. Hence the proposal is to extend the current CPU isolation feature.
The new definition of the CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).
2. By default interrupts
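Point 1 -- explicit binding -- is plain CPU affinity from the application's side; nothing below is specific to the isolation patches. A minimal user-space sketch using Python's os.sched_setaffinity (a wrapper for the sched_setaffinity(2) syscall; the target CPU is just whichever one we are allowed to run on, standing in for an isolated core):

```python
import os

def bind_to_cpu(cpu):
    """Pin the calling process to a single CPU, the way one would pin an
    RT worker thread to an isolated core."""
    os.sched_setaffinity(0, {cpu})      # pid 0 == the calling process
    return os.sched_getaffinity(0)      # read back the effective mask

# Pick some CPU we are currently allowed to run on and pin to it.
target = min(os.sched_getaffinity(0))
print(bind_to_cpu(target))
```

A real RT application would additionally request SCHED_FIFO via sched_setscheduler(2), which requires privileges; the affinity call alone works unprivileged.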
Module loading/unloading and "The Stop Machine"
Hi Rusty,

I was hoping you could answer a couple of questions about module loading/unloading and the stop machine.
There was a recent discussion on LKML about CPU isolation patches I'm working on. One of the patches makes stop machine ignore the isolated CPUs. People of course had questions about that. So I started looking into more details and got this silly, crazy idea that maybe we do not need the stop machine any more :)

As far as I can tell the stop machine is basically a safety net in case some locking and refcounting mechanisms aren't bulletproof. In other words, if a subsystem can actually handle registration/unregistration in a robust way, the module loader/unloader does not necessarily have to halt the entire machine in order to load/unload a module that belongs to that subsystem. I may of course be completely wrong on that.

The problem with the stop machine is that it's a very very big gun :). In the sense that it totally kills all the latencies and stuff, since the entire machine gets halted while a module is being (un)loaded. Which is a major issue for any realtime apps. Specifically for CPU isolation the issue is that a high-priority RT user-space thread prevents the stop machine threads from running and the entire box just hangs waiting for it.
I'm kind of surprised that folks who use monster boxes with over 100 CPUs have not complained. It must be a huge hit for those machines to halt the entire thing.

It seems that over the last few years most subsystems got much better at locking and refcounting. And I'm hoping that we can avoid halting the entire machine these days. For CPU isolation in particular the solution is simple: we can just ignore isolated CPUs. What I'm trying to figure out is how safe it is and whether we can avoid the full halt altogether. So.
Here is what I tried today on my Core2 Duo laptop:

> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -204,11 +204,14 @@ int stop_machine_run(int (*fn)(void *), void *data, unsigned int cpu)
>
>  	/* No CPUs can come up or down during this. */
>  	lock_cpu_hotplug();
> +/*
>  	p = __stop_machine_run(fn, data, cpu);
>  	if (!IS_ERR(p))
>  		ret = kthread_stop(p);
>  	else
>  		ret = PTR_ERR(p);
> +*/
> +	ret = fn(data);
>  	unlock_cpu_hotplug();
>
>  	return ret;

i.e. completely disabled stop machine. It just loads/unloads modules without the full halt. I then ran three scripts:

	while true; do
		/sbin/modprobe -r uhci_hcd
		/sbin/modprobe uhci_hcd
		sleep 10
	done

	while true; do
		/sbin/modprobe -r tg3
		/sbin/modprobe tg3
		sleep 2
	done

	while true; do
		/usr/sbin/tcpdump -i eth0
	done

The machine has a bunch of USB devices connected to it. The two most interesting are a Bluetooth dongle and a USB mouse. By loading/unloading the UHCI driver we're touching sysfs, the USB stack, the Bluetooth stack, the HID layer, and the input layer. X is running and is using that USB mouse. The Bluetooth services are running too.
By loading/unloading the TG3 driver we're touching sysfs and the network stack (a bunch of layers). The machine is running NetworkManager and tcpdumping on eth0, which is registered by TG3.

This is a pretty good stress test in general, let alone with stop machine disabled. I left all that running for the whole day while doing normal day-to-day things. Compiling a bunch of things, emails, office apps, etc. That's where I'm writing this email from :). It's still running all that :)

So the question is: do we still need stop machine? I must be missing something obvious. But things seem to be working pretty well without it. I certainly feel much better about at least ignoring isolated CPUs during stop machine execution. Which btw I've been doing for a couple of years now on a wide range of machines where people are inserting modules left and right.
What do you think?
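For readers who haven't looked at kernel/stop_machine.c: the mechanism is a rendezvous -- one thread per CPU, all meeting at the same point, one of them doing the work while the rest spin. That idea can be modeled in user space; this is only a toy Python sketch of the concept (threads standing in for CPUs, an Event standing in for the spin loop), not the kernel implementation:

```python
import threading

def stop_machine_run(fn, ncpus=4):
    """Toy model of the stop_machine rendezvous: all 'CPUs' meet at a
    barrier, then one runs fn while the rest stay parked until it is done."""
    barrier = threading.Barrier(ncpus)
    done = threading.Event()
    result = []

    def cpu(idx):
        barrier.wait()            # the 'machine' is now halted in lockstep
        if idx == 0:
            result.append(fn())   # the designated CPU does the atomic work
            done.set()
        else:
            done.wait()           # the others do nothing until it completes

    threads = [threading.Thread(target=cpu, args=(i,)) for i in range(ncpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

print(stop_machine_run(lambda: "module loaded"))
```

The model also shows the isolation problem discussed above: in the kernel the per-CPU stop threads run at maximum RT priority, so a user-space SCHED_FIFO hog on one CPU keeps its thread from ever reaching the barrier, and every other CPU waits forever.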
Thanx
Max
Re: [git pull] CPU isolation extensions
Hi Ingo,

Thanks for your reply.

> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>>> Linus, please pull CPU isolation extensions from
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
>>> for-linus
>> Have these been in -mm and widely discussed etc? I'd like to start
>> more carefully, and (a) have that controversial last patch not merged
>> initially and (b) make sure everybody is on the same page wrt this
>> all..
>
> no, they have not been under nearly enough testing and review - these
> patches surfaced on lkml for the first time one week ago (!).

Almost two weeks actually. Ok, 1.8 :)

> I find the pull request totally premature, this stuff has not been
> discussed and agreed on _at all_.

Ingo, I may have the wrong impression, but my impression is that you ignored all the other emails and just read Linus' reply. I do not believe this accusation is valid. I apologize if my impression is incorrect.
Since the patches _do not_ change/affect existing scheduler/cpuset functionality, I did not know who to CC in the first email that I sent. Luckily Peter picked it up and CC'ed a bunch of folks, including Paul, Steven and you. All of them replied and had questions/concerns. As I mentioned before, I believe I addressed all of them.

> None of the people who maintain and have interest in this code and
> participated in the (short) one-week discussion were Cc:-ed to the pull
> request.

Ok. I did not realize I was supposed to do that. Since I got no replies to the second round of patches (take 2), which again was CC'ed to the same people that Peter CC'ed, I assumed that people were ok with it. That's what the discussion on the first take ended with.
> I think these patches also need a buy-in from Peter Zijlstra and Paul
> Jackson (or really good reasoning why any objections from them should
> be overridden) - all of whom deal with the code affected by these changes
> on a daily basis and have an interest in CPU isolation features.

See above. The following issues were raised:
1. Peter and Steven initially thought that workqueue isolation is not needed.
2. Paul thought that it should be implemented on top of cpusets.
3. Peter thought that the stopmachine change is not safe.
There were a couple of other minor misunderstandings (for example Peter thought that I'm completely disallowing IRQs on isolated CPUs, which is obviously not the case). I clarified all of them.

#1 I explained in the original thread and then followed up with a concrete code example of why it is needed:
http://marc.info/?l=linux-kernel&m=120217173001671&w=2
Got no replies so far. So I'm assuming folks are happy.

#2 I started a separate thread on that:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
The conclusion was, well, let me just quote exactly what Paul had said:

> Paul Jackson wrote:
>> Max wrote:
>>> Looks like I failed to explain what I'm trying to achieve. So let me try
>>> again.
>> Well done. I read through that, expecting to disagree or at least
>> to not understand at some point, and got all the way through nodding
>> my head in agreement. Good.
>> Whether the earlier confusions were lack of clarity in the presentation,
>> or lack of competence in my brain ... well guess I don't want to ask that
>> question ;).

And #3: Peter did not agree with me but said that it's up to Linus or Andrew to decide whether it's appropriate in mainline or not. I _clearly_ indicated that this part is somewhat controversial and maybe dangerous; I'm _not_ trying to sneak something in. Andrew picked it up and I'm going to do some more investigation on whether it's really not safe or is actually fine (about to send an email to Rusty).
> Generally i think that cpusets is actually the feature and API that
> should be used (and extended) for CPU isolation - and we already
> extended it recently in the direction of CPU isolation. Most enterprise
> distros have cpusets enabled so it's in use. Also, cpusets has the
> appeal of being commonly used in the "big honking boxes" arena, so
> reusing the same concept for RT and virtualization stuff would be the
> natural approach. It already ties in to the scheduler domains code
> dynamically and is flexible and scalable. I resisted ad-hoc CPU
> isolation patches in -rt for that reason.

That's exactly what Paul proposed initially. I completely disagree with that, but I did look at it in _detail_. Please take a look here for a detailed explanation:
http://marc.info/?l=linux-kernel=120180692331461=2
This email is getting too long and I did not want to inline everything.

> Also, i'd not mind some test-coverage in sched.git as well.

I believe it has _nothin
Re: [E1000-devel] e1000 1sec latency problem
Kok, Auke wrote:
> Max Krasnyansky wrote:
>> Kok, Auke wrote:
>>> Max Krasnyansky wrote:
>>>> Kok, Auke wrote:
>>>>> Max Krasnyansky wrote:
>>>>>> So you don't think it's related to the interrupt coalescing by any chance ?
>>>>>> I'd suggest to try and disable the coalescing and see if it makes any difference.
>>>>>> We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.
>>>>>>
>>>>>> Add this to modprobe.conf and reload e1000 module
>>>>>>
>>>>>> options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0
>>>>>
>>>>> that can't be the problem. irq moderation would only account for 2-3ms
>>>>> variance maximum.
>>>>
>>>> Oh, I've definitely seen worse than that. Not as bad as 1 second though.
>>>> Plus you're talking about the case when coalescing logic is working as
>>>> designed ;-). What if there is some kind of bug where timer did not
>>>> expire or something.
>>>
>>> we don't use a software timer in e1000 irq coalescing/moderation, it's all in
>>> hardware, so we don't have that problem at all. And I certainly have never seen
>>> anything you are referring to with e1000 hardware, and I do not know of any bug
>>> related to this.
>>>
>>> are you maybe confused with other hardware ?
>>>
>>> feel free to demonstrate an example...
>>
>> Just to give you a background. I wrote and maintain http://libe1000.sf.net
>> So I know E1000 HW and SW in and out.
>
> wow, even I do not dare to say that!

Ok, maybe that was a bit of an overstatement :).

>> And no I'm not confused with other HW and I know that we're not using SW
>> timers for the coalescing. HW can be buggy as well. Note that I'm not saying
>> that I know for sure that the problem is coalescing, I'm just suggesting to
>> take it out of the equation while Pavel is investigating.
>> Unfortunately I cannot demonstrate an example but I've seen unexplained
>> packet delays in the range of 1-20 milliseconds on E1000 HW (and boy ...
>> I do have a lot of it in my labs). Once coalescing was disabled those
>> problems have gone away.
>
> this sounds like you have some sort of PCI POST-ing problem and those can
> indeed be worse if you use any form of interrupt coalescing. In any case
> that is largely irrelevant to the in-kernel drivers, and as I said we
> definately have no open issues on that right now, and I really do not
> recollect any as well either (other than the issue of interference when
> both ends are irq coalescing)

I was actually talking about in-kernel drivers, i.e. we were seeing delays with TIPC running over the in-kernel E1000 driver. And no, it was not a TIPC issue; everything worked fine over TG3, and the issues went away when coalescing was disabled. Anyway, I think we can drop this subject.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [E1000-devel] e1000 1sec latency problem
Kok, Auke wrote:
> Max Krasnyansky wrote:
>> Kok, Auke wrote:
>>> Max Krasnyansky wrote:
>>>> So you don't think it's related to the interrupt coalescing by any chance ?
>>>> I'd suggest to try and disable the coalescing and see if it makes any difference.
>>>> We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.
>>>>
>>>> Add this to modprobe.conf and reload e1000 module
>>>>
>>>> options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0
>>>
>>> that can't be the problem. irq moderation would only account for 2-3ms
>>> variance maximum.
>>
>> Oh, I've definitely seen worse than that. Not as bad as 1 second though.
>> Plus you're talking about the case when coalescing logic is working as
>> designed ;-). What if there is some kind of bug where timer did not
>> expire or something.
>
> we don't use a software timer in e1000 irq coalescing/moderation, it's all in
> hardware, so we don't have that problem at all. And I certainly have never seen
> anything you are referring to with e1000 hardware, and I do not know of any bug
> related to this.
>
> are you maybe confused with other hardware ?
>
> feel free to demonstrate an example...

Just to give you a background: I wrote and maintain http://libe1000.sf.net so I know E1000 HW and SW in and out. And no, I'm not confused with other HW, and I know that we're not using SW timers for the coalescing. HW can be buggy as well. Note that I'm not saying that I know for sure that the problem is coalescing; I'm just suggesting to take it out of the equation while Pavel is investigating.

Unfortunately I cannot demonstrate an example, but I've seen unexplained packet delays in the range of 1-20 milliseconds on E1000 HW (and boy ... I do have a lot of it in my labs). Once coalescing was disabled those problems have gone away.
Max
Re: [git pull] CPU isolation extensions
Paul Jackson wrote:
> Max - Andrew wondered if the rt tree had seen the
> code or commented it on it. What became of that?

I just replied to Andrew. It's not an RT feature per se. And yes, Peter CC'ed RT folks. You probably did not get a chance to read all the replies. They had some questions/concerns and stuff. I believe I answered/clarified all of them.

> My two cents isn't worth a plug nickel here, but
> I'm inclined to nod in agreement when Linus wants
> to see these patches get some more exposure before
> going into Linus's tree. ... what's the hurry?

No hurry, I guess. I did mention in the introductory email that I've been maintaining this stuff for a while now. SLAB patches used to be messy; with the new SLUB the mess goes away. CFS handles CPU hotplug much better than O(1), and cpu hotplug is needed to be able to change the isolated bit from sysfs. That's why I think it's a good time to merge. I don't mind of course if we put this stuff in -mm first. Although the first part of the patchset (ie exporting the isolated map, sysfs interface, etc) seems very simple and totally not controversial. The stop machine patch is really the only thing that may look suspicious.

Max
Re: [git pull] CPU isolation extensions
Andrew Morton wrote:
> On Thu, 7 Feb 2008 01:59:54 -0600 Paul Jackson <[EMAIL PROTECTED]> wrote:
>
>> but hard real time is not my expertise
>
> Speaking of which.. there is the -rt tree. Have those people had a look
> at the feature, perhaps played with the code?

Peter Z. and Steven R. sent me some comments; I believe I explained and addressed them. Ingo's been quiet. Probably too busy.

btw It's not an RT feature per se. It certainly helps RT by removing all the latency sources from isolated CPUs. But in general it's just a "reducing kernel overhead on some CPUs" kind of feature.

Max
Re: [E1000-devel] e1000 1sec latency problem
Kok, Auke wrote:
> Max Krasnyansky wrote:
>> So you don't think it's related to the interrupt coalescing by any chance ?
>> I'd suggest to try and disable the coalescing and see if it makes any difference.
>> We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.
>>
>> Add this to modprobe.conf and reload e1000 module
>>
>> options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0
>
> that can't be the problem. irq moderation would only account for 2-3ms
> variance maximum.

Oh, I've definitely seen worse than that. Not as bad as 1 second though. Plus you're talking about the case when the coalescing logic is working as designed ;-). What if there is some kind of bug where the timer did not expire or something?

Max
Re: [git pull] CPU isolation extensions
Paul Jackson wrote:
> Andrew wrote:
>> (and bear in mind that Paul has a track record of being wrong
>> on this :))
>
> heh - I saw that.
>
> Max - Andrew's about right, as usual. You answered my initial
> questions on this patch set adequately, but hard real time is
> not my expertise, so in the final analysis, other than my saying
> I don't have any more objections, my input doesn't mean much
> either way.

I honestly think this one is a no-brainer and I do not think this one will hurt Paul's track record :). Paul initially disagreed with me and that's when he was wrong ;-))

Andrew, I looked at this in detail, and here is an explanation that I sent to Paul a few days ago (a bit shortened/updated version).

I thought some more about your proposal to use the sched_load_balance flag in cpusets instead of extending cpu_isolated_map. I looked at the cpusets and cgroups code, and here are my thoughts on this. Here is the list of issues with the sched_load_balance flag from a CPU isolation perspective:

--
(1) Boot time isolation is not possible. There is currently no way to set up a cpuset at boot time. For example, we won't be able to isolate cpus from irqs and workqueues at boot. Not a major issue, but still an inconvenience.

--
(2) There is currently no easy way to figure out what cpuset a cpu belongs to in order to query its sched_load_balance flag. In order to do that we need a method that iterates all active cpusets and checks their cpus_allowed masks. This implies holding cgroup and cpuset mutexes. It's not clear whether it's ok to do that from the contexts CPU isolation happens in (apic, sched, workqueue). It seems that the cgroup/cpuset api is designed for top-down access, ie adding a cpu to a set and then recomputing domains. Which makes perfect sense for the common cpuset use case but is not what cpu isolation needs. In other words, I think it's much simpler and cleaner to use the cpu_isolated_map for isolation purposes. No locks, no races, etc.
--
(3) cpusets are a bit too dynamic :). What I mean by this is that the sched_load_balance flag can be changed at any time without bringing a CPU offline. What that means is that we'll need some notifier mechanisms for killing and restarting workqueue threads when that flag changes. Also, we'd need some logic that makes sure that a user does not disable load balancing on all cpus, because that effectively will kill workqueues on all the cpus. This particular case is already handled very nicely in my patches. The isolated bit can be set only when a cpu is offline, and it cannot be set on the first online cpu. Workqueues and other subsystems already handle cpu hotplug events nicely and can easily ignore isolated cpus when they come online.

--
#1 is probably unfixable. #2 and #3 can be fixed, but at the expense of extra complexity across the board. I seriously doubt that I'll be able to push that through the reviews ;-).

Also, personally I still think cpusets and cpu isolation attack two different problems. cpusets is about partitioning cpus and memory nodes, and managing tasks. Most of the cgroups/cpuset APIs are designed to deal with tasks. CPU isolation is much simpler and is at the lower layer. It deals with IRQs, kernel per-cpu threads, etc. The only intersection I see is that both features affect scheduling domains. CPU isolation is again simple here: it uses existing logic in sched.c and does not change anything in this area.

-
Andrew, hopefully that clarifies it. Let me know if you're not convinced.

Max
Re: [git pull] CPU isolation extensions
Hi Linus,

Linus Torvalds wrote:
> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>> Linus, please pull CPU isolation extensions from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>
> Have these been in -mm and widely discussed etc? I'd like to start more
> carefully, and (a) have that controversial last patch not merged initially
> and (b) make sure everybody is on the same page wrt this all..

They've been discussed with RT/scheduler/cpuset folks. Andrew is definitely in the loop. He just replied and asked for some fixes and clarifications. He seems to be ok with merging this in general.

The last patch may not be as bad as I originally thought. We'll discuss it some more with Andrew. I'll also check with Rusty, who wrote the stopmachine in the first place. It actually seems like an overkill at this point. My impression is that it was supposed to be a safety net in case some refcounting/locking is not fully safe, and it may not be needed or as critical anymore. I may be wrong of course. So I'll find that out :)

Thanx
Max
Re: [git pull] CPU isolation extensions
Andrew Morton wrote:
> On Wed, 06 Feb 2008 21:32:55 -0800 Max Krasnyansky <[EMAIL PROTECTED]> wrote:
>
>> Linus, please pull CPU isolation extensions from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>
> The feature as a whole seems useful, and I don't actually oppose the merge
> based on what I see here.

Awesome :). I think it'll get more and more useful as people start trying to figure out what the heck they're supposed to do with the spare CPU cores. I mean, pretty soon most machines will have 4 cores and some will have 8. One way to use those cores is the "dedicated engine" model.

> As long as you're really sure that cpusets are
> inappropriate (and bear in mind that Paul has a track record of being wrong
> on this :)).

I'll cover this in a separate email with more details.

> But I see a few glitches

Good catches. Thanks for reviewing.

> - There are two separate and identical implementations of
>   cpu_unusable(cpu). Please do it once, in a header, preferably with C
>   function, not macros.

Those are local versions that depend on whether a feature is enabled or not. If CONFIG_CPUISOL_WORKQUEUE is disabled we want cpu_unusable() in workqueue.c to be a noop, and if it's enabled that macro resolves to cpu_isolated(). Same thing for stopmachine.c: if CONFIG_CPUISOL_STOPMACHINE is disabled, cpu_unusable() is a noop. In other words, cpu_isolated() is the one common macro that a subsystem may want to stub out. Do you see another way of doing this?

> - The Kconfig help is a bit scraggly:
>
> +config CPUISOL_STOPMACHINE
> +	bool "Do not halt isolated CPUs with Stop Machine (HIGHLY EXPERIMENTAL)"
> +	depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL
> +	help
> +	  If this option is enabled kernel will not halt isolated CPUs when Stop Machine
>
> "the kernel"
>
> text is too wide
>
> +	  is triggered.
> +	  Stop Machine is currently only used by the module insertion and removal logic.
> +	  Please note that at this point this feature is highly experimental and maybe
> +	  dangerous. It is not known to really brake anything but can potentially
> +	  introduce an instability.
>
> s/maybe/may be/
> s/brake/break/

Man, the typos are killing me :). Will fix.

> Neither this text, nor the changelog nor the code comments tell us what the
> potential instability with stopmachine *is*? Or maybe I missed it.

That's the thing, we don't really know :). In real life it does not seem to be a problem at all. As I mentioned in previous emails, we've been running all kinds of machines with this enabled, and inserting all kinds of modules left and right. Never seen any crashes or anything. But the fact that stopmachine is supposed to halt all cpus during module insertion/removal seems to imply that something bad may happen if some cpus are not halted. It may very well turn out that it's no longer needed because our locking and refcounting handles this just fine. I mean, ideally we should not have to halt the entire box; it causes terrible latencies.

> - Adding new sysfs files without updating Documentation/ABI/ makes Greg cry.

Oh, did not know that. Will fix.

> - Why is cpu_isolated_map exported to modules? Just for api consistency, it
>   appears?

Yes. For consistency. We'd want cpu_isolated() to work everywhere.

> pre-existing problems:
>
> - isolated_cpu_setup() has an on-stack array of NR_CPUS integers. This
>   will consume 4k of stack on ia64 (at least). We'll just squeak through
>   for a little while, but this needs to be fixed. Just move it into
>   __initdata.

Will do.

> - isolated_cpu_setup() expects that the user can provide an up-to-1024
>   character kernel boot parameter. Is this reasonable given cpu command
>   line limits, and given that NR_CPUS will surely grow beyond 1024 in the
>   future?

I'm thinking that is reasonable for now. I'll fix and resend the patches asap.
Thanx
Max
Re: e1000 1sec latency problem
Pavel Machek wrote:
> Hi!
>
> I have the famous e1000 latency problems:
>
> 64 bytes from 195.113.31.123: icmp_seq=68 ttl=56 time=351.9 ms
> 64 bytes from 195.113.31.123: icmp_seq=69 ttl=56 time=209.2 ms
> 64 bytes from 195.113.31.123: icmp_seq=70 ttl=56 time=1004.1 ms
> 64 bytes from 195.113.31.123: icmp_seq=71 ttl=56 time=308.9 ms
> 64 bytes from 195.113.31.123: icmp_seq=72 ttl=56 time=305.4 ms
> 64 bytes from 195.113.31.123: icmp_seq=73 ttl=56 time=9.8 ms
> 64 bytes from 195.113.31.123: icmp_seq=74 ttl=56 time=3.7 ms
>
> ...and they are still there in 2.6.25-git0. I had ethernet EEPROM
> checksum problems, which I fixed by the update, but problems are not
> gone.
>
> irqpoll helps.
>
> nosmp (which implies XT-PIC is being used) does not help.
>
> 16: 1925 0 IO-APIC-fasteoi ahci, yenta, uhci_hcd:usb2, eth0
>
> Booting kernel with nosmp / no yenta, no usb does not help.
>
> Hmm, as expected, interrupt load on ahci (find /) makes latencies go
> away.
>
> It should be easily reproducible on x60 with latest bios, it is 100%
> reproducible for me...

So you don't think it's related to the interrupt coalescing by any chance? I'd suggest trying to disable the coalescing and seeing if it makes any difference. We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.

Add this to modprobe.conf and reload the e1000 module:

options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0

Max
Hi Ingo,

Thanks for your reply.

> * Linus Torvalds [EMAIL PROTECTED] wrote:
>> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>>> Linus, please pull CPU isolation extensions from
>>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>>
>> Have these been in -mm and widely discussed etc? I'd like to start more
>> carefully, and (a) have that controversial last patch not merged initially
>> and (b) make sure everybody is on the same page wrt this all..
>
> no, they have not been under nearly enough testing and review - these
> patches surfaced on lkml for the first time one week ago (!).

Almost two weeks actually. Ok 1.8 :)

> I find the pull request totally premature, this stuff has not been
> discussed and agreed on _at all_.

Ingo, I may have the wrong impression but my impression is that you ignored all the other emails and just read Linus' reply. I do not believe this accusation is valid. I apologize if my impression is incorrect. Since the patches _do not_ change/affect existing scheduler/cpuset functionality I did not know who to CC in the first email that I sent. Luckily Peter picked it up and CC'ed a bunch of folks, including Paul, Steven and you. All of them replied and had questions/concerns. As I mentioned before, I believe I addressed all of them.

> None of the people who maintain and have interest in this code and
> participated in the (short) one-week discussion were Cc:-ed to the pull
> request.

Ok. I did not realize I was supposed to do that. Since I got no replies to the second round of patches (take 2), which again was CC'ed to the same people that Peter CC'ed, I assumed that people were ok with it. That's what the discussion on the first take ended with.

> I think these patches also need a buy-in from Peter Zijlstra and Paul
> Jackson (or really good reasoning why any objections from them should be
> overridden) - all of whom deal with the code affected by these changes on
> a daily basis and have an interest in CPU isolation features.

See above. The following issues were raised:
1.
Peter and Steven initially thought that workqueue isolation is not needed.
2. Paul thought that it should be implemented on top of cpusets.
3. Peter thought that the stopmachine change is not safe.
There were a couple of other minor misunderstandings (for example Peter thought that I'm completely disallowing IRQs on isolated CPUs, which is obviously not the case). I clarified all of them.

#1 I explained in the original thread and then followed up with a concrete code example of why it is needed:
http://marc.info/?l=linux-kernel&m=120217173001671&w=2
Got no replies so far. So I'm assuming folks are happy.

#2 I started a separate thread on that:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
The conclusion was, well, let me just quote exactly what Paul had said:

Paul Jackson wrote:
> Max wrote:
>> Looks like I failed to explain what I'm trying to achieve. So let me try again.
>
> Well done. I read through that, expecting to disagree or at least to not
> understand at some point, and got all the way through nodding my head in
> agreement. Good. Whether the earlier confusions were lack of clarity in the
> presentation, or lack of competence in my brain ... well guess I don't want
> to ask that question ;).

And #3, Peter did not agree with me but said that it's up to Linus or Andrew to decide whether it's appropriate in mainline or not. I _clearly_ indicated that this part is somewhat controversial and maybe dangerous; I'm _not_ trying to sneak something in. Andrew picked it up and I'm going to do some more investigation on whether it's really not safe or is actually fine (about to send an email to Rusty).

> Generally i think that cpusets is actually the feature and API that should
> be used (and extended) for CPU isolation - and we already extended it
> recently in the direction of CPU isolation. Most enterprise distros have
> cpusets enabled so it's in use.
> Also, cpusets has the appeal of being commonly used in the big honking
> boxes arena, so reusing the same concept for RT and virtualization stuff
> would be the natural approach. It already ties in to the scheduler domains
> code dynamically and is flexible and scalable. I resisted ad-hoc CPU
> isolation patches in -rt for that reason.

That's exactly what Paul proposed initially. I completely disagree with that but I did look at it in _detail_. Please take a look here for a detailed explanation:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
This email is getting too long and I did not want to inline everything.

> Also, i'd not mind some test-coverage in sched.git as well.

I believe it has _nothing_ to do with the scheduler but I do not mind it being in that tree. Please read this email on why it has nothing to do with the scheduler:
http://marc.info/?l=linux-kernel&m=120210515323578&w=2
That's the email that convinced Paul.

To sum it up: it has been discussed with the right people. I do
Module loading/unloading and The Stop Machine
Hi Rusty,

I was hoping you could answer a couple of questions about module loading/unloading and the stop machine. There was a recent discussion on LKML about CPU isolation patches I'm working on. One of the patches makes stop machine ignore the isolated CPUs. People of course had questions about that. So I started looking into more details and got this silly, crazy idea that maybe we do not need the stop machine any more :)

As far as I can tell the stop machine is basically a safety net in case some locking and refcounting mechanisms aren't bullet proof. In other words, if a subsystem can actually handle registration/unregistration in a robust way, the module loader/unloader does not necessarily have to halt the entire machine in order to load/unload a module that belongs to that subsystem. I may of course be completely wrong on that.

The problem with the stop machine is that it's a very very big gun :). In the sense that it totally kills all the latencies, since the entire machine gets halted while a module is being (un)loaded. Which is a major issue for any realtime apps. Specifically for CPU isolation the issue is that a high-priority RT user-space thread prevents the stop machine threads from running and the entire box just hangs waiting for it. I'm kind of surprised that folks who use monster boxes with over 100 CPUs have not complained. It must be a huge hit for those machines to halt the entire thing.

It seems that over the last few years most subsystems got much better at locking and refcounting. And I'm hoping that we can avoid halting the entire machine these days. For CPU isolation in particular the solution is simple: we can just ignore isolated CPUs. What I'm trying to figure out is how safe it is and whether we can avoid the full halt altogether. So.
Here is what I tried today on my Core2 Duo laptop:

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -204,11 +204,14 @@ int stop_machine_run(int (*fn)(void *), void *data, unsigned int cpu)
 	/* No CPUs can come up or down during this. */
 	lock_cpu_hotplug();
+/*
 	p = __stop_machine_run(fn, data, cpu);
 	if (!IS_ERR(p))
 		ret = kthread_stop(p);
 	else
 		ret = PTR_ERR(p);
+*/
+	ret = fn(data);
 	unlock_cpu_hotplug();
 	return ret;

ie I completely disabled the stop machine. It just loads/unloads modules without the full halt. I then ran three scripts:

while true; do
	/sbin/modprobe -r uhci_hcd
	/sbin/modprobe uhci_hcd
	sleep 10
done

while true; do
	/sbin/modprobe -r tg3
	/sbin/modprobe tg3
	sleep 2
done

while true; do
	/usr/sbin/tcpdump -i eth0
done

The machine has a bunch of USB devices connected to it. The two most interesting are a Bluetooth dongle and a USB mouse. By loading/unloading the UHCI driver we're touching sysfs, the USB stack, the Bluetooth stack, the HID layer and the input layer. X is running and is using that USB mouse. The Bluetooth services are running too. By loading/unloading the TG3 driver we're touching sysfs and the network stack (a bunch of layers). The machine is running NetworkManager and tcpdumping on eth0, which is registered by TG3. This is a pretty good stress test in general, let alone with the stop machine disabled. I left all that running for the whole day while doing normal day to day things. Compiling a bunch of things, emails, office apps, etc. That's where I'm writing this email from :). It's still running all that :)

So the question is: do we still need the stop machine? I must be missing something obvious. But things seem to be working pretty well without it. I certainly feel much better about at least ignoring isolated CPUs during stop machine execution. Which btw I've been doing for a couple of years now on a wide range of machines where people are inserting modules left and right.

What do you think?
Thanx
Max
[git pull] CPU isolation extensions
Linus, please pull CPU isolation extensions from

git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus

Diffstat:

 b/arch/x86/Kconfig                  |  1
 b/arch/x86/kernel/genapic_flat_64.c |  5 ++-
 b/drivers/base/cpu.c                | 48 +++
 b/include/linux/cpumask.h           |  3 ++
 b/kernel/Kconfig.cpuisol            | 15 +++
 b/kernel/Makefile                   |  4 +-
 b/kernel/cpu.c                      | 49
 b/kernel/sched.c                    | 37 ---
 b/kernel/stop_machine.c             |  9 +-
 b/kernel/workqueue.c                | 31 --
 kernel/Kconfig.cpuisol              | 26 ++-
 11 files changed, 176 insertions(+), 52 deletions(-)

The patchset consists of 4 patches:

 cpuisol: Make cpu isolation configurable and export isolated map
 cpuisol: Do not route IRQs to the CPUs isolated at boot
 cpuisol: Do not schedule workqueues on the isolated CPUs
 cpuisol: Do not halt isolated CPUs with Stop Machine

The first two are very simple. They simply make "CPU isolation" a configurable feature, export cpu_isolated_map and provide some helper functions to access it (just like cpu_online() and friends). The last two patches add support for isolating CPUs from running workqueues and the stop machine. The last patch is kind of controversial; let me know if you think it's too ugly and I'll resend without it. For more details see below.

This patch series extends CPU isolation support. Yes, most people want to virtualize CPUs these days and I want to isolate them :) . The primary idea here is to be able to use some CPU cores as dedicated engines for running user-space code with minimal kernel overhead/intervention; think of it as an SPE in the Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. I'm personally using this for hard realtime purposes.
With CPU isolation it's very easy to achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load. I'm working with legal folks on releasing a hard RT user-space framework for that. I believe with the current multi-core CPU trend we will see more and more applications that exploit this capability: RT gaming engines, simulators, hard RT apps, etc.

Hence the proposal is to extend the current CPU isolation feature. The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).
2. By default interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) to those CPUs explicitly.
3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible.
   Includes workqueues, per CPU threads, etc.
This feature is configurable and is disabled by default.
---
I've been maintaining this stuff since around 2.6.18 and it's been running in a production environment for a couple of years now. It's been tested on all kinds of machines, from NUMA boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110. The messiest part used to be the SLAB garbage collector changes. With the new SLUB all that mess goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1) did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs). So this seems like a good time to merge.

We've had scheduler support for CPU isolation ever since the O(1) scheduler went in. In other words #1 is already supported. These patches do not change/affect that functionality in any way. #2 is a trivial one-liner change to the IRQ init code. #3 is addressed by a couple of separate patches.
The main problem here is that an RT thread can prevent kernel threads from running and the machine gets stuck because the other CPUs are waiting for those threads to run and report back.

Folks involved in the scheduler/cpuset development provided a lot of feedback on the first series of patches. I believe I managed to explain and clarify every aspect. Paul Jackson initially suggested implementing #2 and #3 using the cpusets subsystem. Paul and I looked at it more closely and determined that exporting cpu_isolated_map instead is a better option.

The last patch to the stop machine is potentially unsafe and is marked as highly experimental. Unfortunately it's currently the only option that allows dynamic module insertion/removal for the above scenarios. If people still feel that it's too ugly I can revert that change and keep it in the separate tree
Re: CPU hotplug and IRQ affinity with 2.6.24-rt1
Daniel Walker wrote:
> On Mon, Feb 04, 2008 at 03:35:13PM -0800, Max Krasnyanskiy wrote:
>> This is just an FYI. As part of the "Isolated CPU extensions" thread Daniel
>> suggested that I check out the latest RT kernels. So I did, or at least
>> tried to, and immediately spotted a couple of issues.
>>
>> The machine I'm running it on is:
>> HP xw9300, Dual Opteron, NUMA
>>
>> It looks like with the -rt kernel IRQ affinity masks are ignored on that
>> system. ie I write 1 to, let's say, /proc/irq/23/smp_affinity but the
>> interrupts keep coming to CPU1. Vanilla 2.6.24 does not have that issue.
>
> I tried this, and it works according to /proc/interrupts .. Are you
> looking at the interrupt threads affinity ?

Nope. I'm looking at /proc/interrupts. ie The interrupt count keeps incrementing for cpu1 even though the affinity mask is set to 1. The IRQ thread affinity was btw set to 3, which is probably wrong. To clarify, by default after reboot:
- IRQ affinity is set to 3, IRQ thread affinity is set to 3
- User writes 1 into /proc/irq/N/smp_affinity
- IRQ affinity is now set to 1, IRQ thread affinity is still set to 3
It'd still work I guess but does not seem right. Ideally the IRQ thread affinity should have changed as well. We could of course just have some user-space tool that adjusts both.

Looks like Greg already replied to the cpu hotplug issue. For me it did not oops. It just got stuck, probably because it could not move an IRQ due to the broken IRQ affinity logic.

Max
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max K wrote:
>>> And for another thing, we already declare externs in cpumask.h for
>>> the other, more widely used, cpu_*_map variables cpu_possible_map,
>>> cpu_online_map, and cpu_present_map.
>> Well, to address #2 and #3 the isolated map will need to be exported as well.
>> Those other maps do not really have much to do with the scheduler code.
>> That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place
>> for them.
>
> Well, if you need it to be exported for #2 or #3, then that's ok
> by me - export it.
>
> I'm unaware of any kernel/cpumask.c. If you meant lib/cpumask.c, then
> I'd prefer you not put it there, as lib/cpumask.c just contains the
> implementation details of the abstract data type cpumask_t, not any of
> its uses. If you mean kernel/cpuset.c, then that's not a good choice
> either, as that just contains the implementation details of the cpuset
> subsystem. You should usually define such things in one of the files
> using it, and unless there is clearly a -better- place to move the
> definition, it's usually better to just leave it where it is.

I was thinking of creating a new file, kernel/cpumask.c. But it probably does not make sense just for the masks. I'm now thinking kernel/cpu.c is the best place for it. It contains all the cpu hotplug logic that deals with those maps, and at the very top it has stuff like

	/* Serializes the updates to cpu_online_map, cpu_present_map */
	static DEFINE_MUTEX(cpu_add_remove_lock);

So it seems to make sense to keep the maps in there.

Max
Re: [CPUISOL] CPU isolation extensions
Hi Daniel,

Sorry for not replying right away.

Daniel Walker wrote:
> On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:
>
>> Not accurate enough and way too much overhead for what I need. I know at
>> this point it probably sounds like I'm talking BS :). I wish I'd released
>> the engine and examples by now. Anyway let me just say that the SW MAC has
>> crazy tight deadlines with lots of small tasks. Using nanosleep() &
>> gettimeofday() is simply not practical. So it's all TSC based with clever
>> time sync logic between HW and SW.
>
> I don't know if it's BS or not, you clearly fixed your own problem which
> is good .. Although when you say "RT patches cannot achieve what I
> needed. Even RTAI/Xenomai can't do that." , and HRT is "Not accurate
> enough and way too much overhead" .. Given the hardware you're using,
> that's all difficult to believe.. You also said this code has been
> running on production systems for two years, which means it's at least
> two years old .. There's been some good sized leaps in real time linux
> in the past two years ..

I've actually been tracking the RT patches fairly closely. I can't say I tried all of them but I do try them from time to time. I just got the latest 2.6.24-rt1 running on an HP xw9300. Looks like it does not handle CPU hotplug very well; I managed to kill it by bringing cpu 1 off-line. So I cannot run any tests right now, will run some tomorrow.

For now let me mention that I have a simple test that sleeps for a millisecond and then does some bitbanging for 200 usec. It measures jitter caused by the periodic scheduler tick, IPIs and other kernel activities. With high-res timers disabled, on most of the machines I mentioned before it shows around 1-1.2 usec worst case. With high-res timers enabled it shows 5-6 usec. This is with 2.6.24 running on an isolated CPU. Forget about using a user-space timer (nanosleep(), etc). Even the scheduler tick itself is fairly heavy.
A gettimeofday() call on that machine takes on average 2-3 usec (it's not a vsyscall) and the SW MAC is all about precise timing. That's why I said that it's not practical for me to use that stuff. I do not see anything in the -rt kernel that would improve this.

This is btw not to say that the -rt kernel is not useful for my app in general. We have a bunch of soft-RT threads that talk to the MAC thread. Those would definitely benefit. I think cpu isolation + -rt would work beautifully for wireless basestations.

Max
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max wrote:
>> Paul, I actually mentioned at the beginning of my email that I did read that
>> thread started by Peter. I did learn quite a bit from it :)
>
> Ah - sorry - I missed that part. However, I'm still getting the feeling
> that there were some key points in that thread that we have not managed
> to communicate successfully.

I think you are assuming that I only need to deal with the RT scheduler and scheduler domains, which is not correct. See below.

>> Sounds like at this point we're in agreement that sched_load_balance is not
>> suitable for what I'd like to achieve.
>
> I don't think we're in agreement; I think we're in confusion ;)

Yeah. I don't believe I'm the confused side though ;-)

> Yes, sched_load_balance does not *directly* have anything to do with this.
>
> But indirectly it is a critical element in what I think you'd like to
> achieve. It affects how the cpuset code sets up sched_domains, and
> if I understand correctly, you require either (1) some sched_domains to
> only contain RT tasks, or (2) some CPUs to be in no sched_domain at all.
>
> Proper configuration of the cpuset hierarchy, including the setting of
> the per-cpuset sched_load_balance flag, can provide either of these
> sched_domain partitions, as desired.

Again you're assuming that scheduling domain partitioning satisfies my requirements or addresses my use case. It does not. See below for more details.

>> But how about making cpusets aware of the cpu_isolated_map ?
>
> No. That's confusing cpusets and the scheduler again.
>
> The cpu_isolated_map is a file static variable known only within
> the kernel/sched.c file; this should not change.

I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated) variables do not belong in the scheduler code. I'm thinking of submitting a patch that factors them out into kernel/cpumask.c. We already have cpumask.h.
> Presently, the boot parameter isolcpus= is just used to initialize
> what CPUs are isolated at boot, and then the sched_domain partitioning,
> as done in kernel/sched.c:partition_sched_domains() (the hook into
> the sched code that cpusets uses) determines which CPUs are isolated
> from that point forward. I doubt that this should change either.

Sure, I did not even touch that part. I just proposed to extend the meaning of the 'isolated' bit.

> In that thread referenced above, did you see the part where RT is
> achieved not by isolating CPUs from any scheduler, but rather by
> polymorphically having several schedulers available to operate on each
> sched_domain, and having RT threads self-select the RT scheduler?

Absolutely, yes, I saw that part. But it has nothing to do with my use case. Looks like I failed to explain what I'm trying to achieve. So let me try again.

I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. Hence the proposal is to extend the current CPU isolation feature. The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).
2. By default interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) explicitly.
3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible.
   Includes workqueues, per CPU threads, etc.
This feature is configurable and is disabled by default.
---
#1 affects the scheduler and scheduler domains. It's already supported either by using the isolcpus= boot option or by setting "sched_load_balance" in cpusets. I'm totally happy with the current behavior and my original patch did not mess with this functionality in any way.
#2 and #3 have _nothing_ to do with the scheduler or scheduler domains. I've been trying to explain that for a few days now ;-). When you saw my patches for #2 and #3 you told me that you'd be interested to see them implemented on top of the "sched_load_balance" flag. Here is your original reply:
http://marc.info/?l=linux-kernel&m=120153260217699&w=2
So I looked into that and provided an explanation of why it would not work, or would work but would add lots of complexity (access to internal cpuset structures, locking, etc). My email on that is here:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
Now, I felt from the beginning that cpusets is not the right mechanism to address #2 and #3. The best mechanism IMO is to simply provide access to the cpu_isolated_map to the rest of the kernel. Again, the fact that cpu_isolated_map currently lives in the scheduler code does not change anything here because, as I explained, I'm proposing to extend the meaning of "CPU isolation". I provided dynamic access to the "isolated" bit only for
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max wrote:
>> Paul, I actually mentioned at the beginning of my email that I did read
>> that thread started by Peter. I did learn quite a bit from it :)
>
> Ah - sorry - I missed that part. However, I'm still getting the feeling
> that there were some key points in that thread that we have not managed
> to communicate successfully.

I think you are assuming that I only need to deal with the RT scheduler
and scheduler domains, which is not correct. See below.

>> Sounds like at this point we're in agreement that sched_load_balance
>> is not suitable for what I'd like to achieve.
>
> I don't think we're in agreement; I think we're in confusion ;)

Yeah. I don't believe I'm the confused side though ;-)

> Yes, sched_load_balance does not *directly* have anything to do with
> this. But indirectly it is a critical element in what I think you'd
> like to achieve. It affects how the cpuset code sets up sched_domains,
> and if I understand correctly, you require either (1) some
> sched_domains to only contain RT tasks, or (2) some CPUs to be in no
> sched_domain at all. Proper configuration of the cpuset hierarchy,
> including the setting of the per-cpuset sched_load_balance flag, can
> provide either of these sched_domain partitions, as desired.

Again you're assuming that scheduling domain partitioning satisfies my
requirements or addresses my use case. It does not. See below for more
details.

>> But how about making cpusets aware of the cpu_isolated_map ?
>
> No. That's confusing cpusets and the scheduler again. The
> cpu_isolated_map is a file static variable known only within the
> kernel/sched.c file; this should not change.

I completely disagree. In fact I think all the cpu_xxx_map (online,
present, isolated) variables do not belong in the scheduler code. I'm
thinking of submitting a patch that factors them out into
kernel/cpumask.c. We already have cpumask.h.
> Presently, the boot parameter isolcpus= is just used to initialize
> what CPUs are isolated at boot, and then the sched_domain
> partitioning, as done in kernel/sched.c:partition_sched_domains() (the
> hook into the sched code that cpusets uses) determines which CPUs are
> isolated from that point forward. I doubt that this should change
> either.

Sure, I did not even touch that part. I just proposed to extend the
meaning of the 'isolated' bit.

> In that thread referenced above, did you see the part where RT is
> achieved not by isolating CPUs from any scheduler, but rather by
> polymorphically having several schedulers available to operate on each
> sched_domain, and having RT threads self-select the RT scheduler?

Absolutely, I saw that part. But it has nothing to do with my use case.
Looks like I failed to explain what I'm trying to achieve, so let me try
again. I'd like to be able to run a CPU-intensive (100%) RT task on one
of the processors without adversely affecting, or being affected by, the
other system activities. System activities here include _kernel_
activities as well. Hence the proposal is to extend the current CPU
isolation feature. The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).

2. By default, interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) explicitly.

3. In general, kernel subsystems must avoid activity on the isolated
   CPU(s) as much as possible. This includes workqueues, per-CPU
   threads, etc. This feature is configurable and is disabled by
   default.
---
#1 affects the scheduler and scheduler domains. It's already supported,
either by using the isolcpus= boot option or by setting
"sched_load_balance" in cpusets. I'm totally happy with the current
behavior, and my original patch did not mess with this functionality in
any way.

#2 and #3 have _nothing_ to do with the scheduler or scheduler domains.
I've been trying to explain that for a few days now ;-). When you saw my
patches for #2 and #3 you told me that you'd be interested to see them
implemented on top of the "sched_load_balance" flag. Here is your
original reply:
   http://marc.info/?l=linux-kernel&m=120153260217699&w=2
So I looked into that and provided an explanation of why it would not
work, or would work but would add lots of complexity (access to internal
cpuset structures, locking, etc). My email on that is here:
   http://marc.info/?l=linux-kernel&m=120180692331461&w=2
Now, I felt from the beginning that cpusets is not the right mechanism
to address #2 and #3. The best mechanism IMO is to simply provide access
to the cpu_isolated_map to the rest of the kernel. Again, the fact that
cpu_isolated_map currently lives in the scheduler code does not change
anything here because, as I explained, I'm proposing to extend the
meaning of "CPU isolation". I provided dynamic access to the "isolated"
bit only for convenience; it does _not_ change existing
Re: [CPUISOL] CPU isolation extensions
Hi Daniel,

Sorry for not replying right away.

Daniel Walker wrote:
> On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:
>> Not accurate enough and way too much overhead for what I need. I know
>> at this point it probably sounds like I'm talking BS :). I wish I'd
>> released the engine and examples by now. Anyway let me just say that
>> SW MAC has crazy tight deadlines with lots of small tasks. Using
>> nanosleep() & gettimeofday() is simply not practical. So it's all TSC
>> based with clever time sync logic between HW and SW.
>
> I don't know if it's BS or not, you clearly fixed your own problem
> which is good .. Although when you say "RT patches cannot achieve what
> I needed. Even RTAI/Xenomai can't do that.", and HRT is "Not accurate
> enough and way too much overhead" .. Given the hardware you're using,
> that's all difficult to believe .. You also said this code has been
> running on production systems for two years, which means it's at least
> two years old .. There's been some good sized leaps in real time linux
> in the past two years ..

I've actually been tracking the RT patches fairly closely. I can't say
I've tried all of them, but I do try them from time to time. I just got
the latest 2.6.24-rt1 running on an HP xw9300. Looks like it does not
handle CPU hotplug very well; I managed to kill it by bringing CPU 1
off-line. So I cannot run any tests right now; I will run some tomorrow.
For now let me mention that I have a simple test that sleeps for a
millisecond and then does some bitbanging for 200 usec. It measures
jitter caused by the periodic scheduler tick, IPIs and other kernel
activities. With high-res timers disabled, on most of the machines I
mentioned before it shows around 1-1.2 usec worst case. With high-res
timers enabled it shows 5-6 usec. This is with 2.6.24 running on an
isolated CPU. Forget about using a user-space timer (nanosleep(), etc);
even the scheduler tick itself is fairly heavy. A gettimeofday() call on
that machine takes on average 2-3 usec (it's not a vsyscall), and SW MAC
is all about precise timing. That's why I said that it's not practical
for me to use that stuff. I do not see anything in the -rt kernel that
would improve this. This is btw not to say that the -rt kernel is not
useful for my app in general. We have a bunch of soft-RT threads that
talk to the MAC thread; those would definitely benefit. I think CPU
isolation + -rt would work beautifully for wireless basestations.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max wrote:
>> Here is the list of issues with the sched_load_balance flag from a CPU
>> isolation perspective:
>
> A separate thread happened to start up on lkml.org, shortly after
> yours, that went into this in considerable detail.
>
> For example, the interaction of cpusets, sched_load_balance,
> sched_domains and real time scheduling is examined in some detail on
> this thread. Everyone participating on that thread learned something
> (we all came into it with less than a full picture of what's there.)
>
> I would encourage you to read it closely. For example, the scheduler
> code should not be trying to access per-cpuset attributes such as
> the sched_load_balance flag (you are correct that this would be
> difficult to do because of the locking; however by design, that is
> not to be done.)
>
> This thread begins at:
>
>     scheduler scalability - cgroups, cpusets and load-balancing
>     http://lkml.org/lkml/2008/1/29/60
>
> Too bad we didn't think to include you in the CC list of that
> thread from the beginning.

Paul, I actually mentioned at the beginning of my email that I did read
that thread started by Peter. I did learn quite a bit from it :) You
guys did not discuss isolation stuff though. The thread was only about
scheduling, and my CPU isolation extension patches deal with other
aspects.

Sounds like at this point we're in agreement that sched_load_balance is
not suitable for what I'd like to achieve. But how about making cpusets
aware of the cpu_isolated_map? Even without my patches it's somewhat of
an issue right now. I mean, if you use the isolcpus= boot option to put
CPUs into the null domain, cpusets will not be aware of it. The result
may be a bit confusing if an isolated CPU is added to some cpuset.

Max
Re: Strange freezes (seems like SATA related)
Robert Hancock wrote:
> Can you post the full dmesg output? What kind of drive is this?

Sorry for the delay. I'm on vacation and have sporadic email access.
Full dmesg is pretty long; here is the SATA-related section.

sata_nv :00:07.0: version 3.4
ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 23
ACPI: PCI Interrupt :00:07.0[A] -> Link [LSA0] -> GSI 23 (level, high) -> IRQ 23
sata_nv :00:07.0: Using ADMA mode
PCI: Setting latency timer of device :00:07.0 to 64
scsi0 : sata_nv
scsi1 : sata_nv
ata1: SATA max UDMA/133 cmd 0xc2a16480 ctl 0xc2a164a0 bmdma 0x000158b0 irq 23
ata2: SATA max UDMA/133 cmd 0xc2a16580 ctl 0xc2a165a0 bmdma 0x000158b8 irq 23
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: SAMSUNG HD080HJ, WT100-33, max UDMA/100
ata1.00: 156301488 sectors, multi 16: LBA48
ata1.00: configured for UDMA/100
ata2: SATA link down (SStatus 0 SControl 300)
scsi 0:0:0:0: Direct-Access ATA SAMSUNG HD080HJ WT10 PQ: 0 ANSI: 5
ata1: bounce limit 0x, segment boundary 0x, hw segs 61
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 22
ACPI: PCI Interrupt :00:08.0[A] -> Link [LSA1] -> GSI 22 (level, high) -> IRQ 22
sata_nv :00:08.0: Using ADMA mode

Max
Re: Strange freezes (seems like SATA related)
Andrew Morton wrote:
> On Mon, 29 Oct 2007 09:54:27 -0700
> Max Krasnyansky <[EMAIL PROTECTED]> wrote:
>
>> A couple of HP xw9300 machines (dual Opterons) started freezing up.
>> We're running 2.6.22.1 on them. The freezes are somewhat weird: the
>> VGA console is alive (I can switch vts, etc) but everything else is
>> dead (network, etc). Unfortunately SYSRQ was not enabled and I could
>> not get backtraces and stuff.
>>
>> Hooked up a serial console and the only error that shows up is this.
>>
>> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
>> status 0x1540 next cpb count 0x0 next cpb idx 0x0
>> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Descriptor sense data with sense descriptors (in hex):
>> end_request: I/O error, dev sda, sector 8388695
>> Buffer I/O error on device sda1, logical block 1048579
>> lost page write due to I/O error on sda1
>> sd 0:0:0:0: [sda] Write Protect is off
>>
>> I see a bunch of those and then the box just sits there spewing this
>> periodically
>>
>> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
>> status 0x1540 next cpb count 0x0 next cpb idx 0x0
>> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
>> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>
>> SMART selftest on the drive passed without errors.
>>
>> Here is what this machine looks like
>>
>> ...
>
> So this happens on more than one machine?

Yep.

> The kernel shouldn't freeze, so even if both machines have magically
> identical hardware faults, there's a kernel bug there somewhere.
>
> I guess it would be useful to test a 2.6.23 kernel if poss. We've seen
> a very large number of reports like this one in recent months (many of
> which have not been responded to, btw) and perhaps someone has done
> something about them.

I may not be able to run an identical workload on 2.6.23. Will try to
give it a shot sometime next week. Also, I upgraded to 2.6.22.10 last
week. There are a few fixes in there that may potentially affect those
boxes.

Max
Strange freezes (seems like SATA related)
A couple of HP xw9300 machines (dual Opterons) started freezing up.
We're running 2.6.22.1 on them. The freezes are somewhat weird: the VGA
console is alive (I can switch vts, etc) but everything else is dead
(network, etc). Unfortunately SYSRQ was not enabled and I could not get
backtraces and stuff.

Hooked up a serial console, and the only error that shows up is this:

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
status 0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Descriptor sense data with sense descriptors (in hex):
end_request: I/O error, dev sda, sector 8388695
Buffer I/O error on device sda1, logical block 1048579
lost page write due to I/O error on sda1
sd 0:0:0:0: [sda] Write Protect is off

I see a bunch of those, and then the box just sits there spewing this
periodically:

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
status 0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

SMART selftest on the drive passed without errors.

Here is what this machine looks like:

00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
05:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
0a:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
61:04.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:09.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
62:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
63:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
81:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

As I mentioned: dual Opteron, NUMA. Nothing fancy in the kernel config.
Any ideas what the problem might be?

Max
Re: TUN/TAP driver - MAINTAINERS - bad mailing list entry?
Joe Perches wrote:
> MAINTAINERS currently has:
>
> TUN/TAP driver
> P:   Maxim Krasnyansky
> M:   [EMAIL PROTECTED]
> L:   [EMAIL PROTECTED]
>
> [EMAIL PROTECTED] doesn't seem to be a valid email address. Should it
> be removed or modified?

Sorry for the late response, I just noticed this. Yes, it's an ancient
mailing list and should be removed. I totally forgot about it.

Max
Re: [PATCH] Allow group ownership of TUN/TAP devices
Jeff Dike wrote:
> I received from Guido Guenther the patch below to the TUN/TAP driver
> which allows group ownerships to be effective. It seems reasonable to me.

Looks good to me too. I'll add it to my tree. In the meantime I don't mind if one of the net driver maintainers pushes it upstream.

Thanx
Max
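For context, here is a minimal userspace sketch of the access rule such a patch implies. This is an illustration, not the kernel source: the helper name `tun_open_allowed` and the use of `(uid_t)-1`/`(gid_t)-1` for "not configured" (mirroring the TUNSETOWNER/TUNSETGROUP ioctl convention) are my assumptions.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical illustration (not the kernel code): opening a TUN/TAP
 * device is allowed unless a configured owner uid or group gid does not
 * match the caller's effective credentials. (uid_t)-1 / (gid_t)-1 mean
 * "not configured". In the kernel, a mismatch can additionally be
 * overridden by CAP_NET_ADMIN, which this sketch omits. */
static bool tun_open_allowed(uid_t euid, gid_t egid, uid_t owner, gid_t group)
{
    bool owner_mismatch = (owner != (uid_t)-1) && (euid != owner);
    bool group_mismatch = (group != (gid_t)-1) && (egid != group);
    return !(owner_mismatch || group_mismatch);
}
```

With group ownership configured, members of that group can open the device without matching the owner uid, which is what the patch enables.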
Re: SLAB cache reaper on isolated cpus
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>
>> Suppose I need to isolate a CPU. We already support it at the scheduler
>> and irq levels (irq affinity). But I want to go a bit further and avoid
>> doing kernel work on isolated cpus as much as possible. For example I
>> would not want to schedule work queues and stuff on them. Currently
>> there are just a few users of schedule_delayed_work_on(): cpufreq
>> (don't care for isolation purposes), oprofile (same here) and slab. For
>> the slab it'd be nice to run the reaper on some other CPU. But you're
>> saying that locking depends on CPU pinning. Is there any other option
>> besides disabling cache reap? Is there a way for example to constrain
>> the slabs on CPU X to not exceed N megs?
>
> There is no way to constrain the amount of slab work. In order to make the
> above work we would have to disable the per cpu caches for a certain cpu.
> Then there would be no need to run the cache reaper at all.
>
> To some extent such functionality already exists. F.e. kmalloc_node()
> already bypasses the per cpu caches (most of the time). kmalloc_node will
> have to take a spinlock on a shared cacheline on each invocation. kmalloc
> does only touch per cpu data during regular operations. Thus kmalloc() is
> much faster than kmalloc_node() and the cachelines for kmalloc() can be
> kept in the per cpu cache.
>
> If we could disable all per cpu caches for certain cpus then you could
> make this work. All slab OS interference would be off the processor.

Hmm. That's an idea. I'll play with it later today or tomorrow.

Thanks
Max
SLAB cache reaper on isolated cpus
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>> Ok. Sounds like disabling cache_reaper is a better option for now. Like
>> you said, it's unlikely that slabs will grow much if that cpu is not
>> heavily used by the kernel.
> Running for prolonged times without cache_reaper is no good. What we are
> talking about here is to disable the cache_reaper during cpu shutdown.
> The slab cpu shutdown will clean the per cpu caches anyways so we really
> do not need the slab_reaper running during cpu shutdown.

Ok. Let me restart the thread so that we're not confusing two issues :). I'm not talking about CPU shutdown or CPU hotplug in general. My proposal seemed related to the CPU shutdown issue that you guys were discussing, but it turns out it's not.

Suppose I need to isolate a CPU. We already support it at the scheduler and irq levels (irq affinity). But I want to go a bit further and avoid doing kernel work on isolated cpus as much as possible. For example I would not want to schedule work queues and stuff on them. Currently there are just a few users of schedule_delayed_work_on(): cpufreq (don't care for isolation purposes), oprofile (same here) and slab. For the slab it'd be nice to run the reaper on some other CPU. But you're saying that locking depends on CPU pinning. Is there any other option besides disabling cache reap? Is there a way for example to constrain the slabs on CPU X to not exceed N megs?

Max
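On the userspace side, an isolation setup like the one described above usually starts by pinning the application threads to the isolated CPU. A minimal sketch (my illustration, assuming Linux and that the requested CPU is online; `pin_to_cpu` is a hypothetical helper name):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU. This is the userspace half of
 * the isolation scheme discussed above; keeping irqs and deferred kernel
 * work (work queues, cache_reap, ...) off that CPU is the kernel half.
 * Returns the CPU the thread is now running on, or -1 on failure. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    /* after a successful setaffinity, we must be running on 'cpu' */
    return sched_getcpu();
}
```

A thread restricted this way never migrates off the isolated CPU, which is what makes per-CPU tricks like the TSC usage elsewhere in this thread viable.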
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>> I guess I kind of hijacked the thread. The second part of my first email
>> was dropped. Basically I was saying that I'm working on CPU isolation
>> extensions, where an isolated CPU is not supposed to do much kernel
>> work. In which case you'd want to run the slab cache reaper on some
>> other CPU on behalf of the isolated one. Hence the proposal to
>> explicitly pass cpu_id to the reaper. I guess now that you guys fixed
>> the hotplug case it does not help in that scenario.
> A cpu must have a per cpu cache in order to do slab allocations. The
> locking in the slab allocator depends on it. If the isolated cpus have no
> need for slab allocations then you will also not need to run the
> slab_reaper().

Ok. Sounds like disabling cache_reaper is a better option for now. Like you said, it's unlikely that slabs will grow much if that cpu is not heavily used by the kernel.

Thanks
Max
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Oleg Nesterov wrote:
> On 02/20, Christoph Lameter wrote:
>> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>>> Well, it seems that we have a set of unresolved issues with workqueues
>>> and cpu hotplug. How about storing 'cpu' explicitly in the work queue
>>> instead of relying on smp_processor_id() and friends? That way there
>>> is no ambiguity when threads/timers get moved around.
>> The slab functionality is designed to work on the processor with the
>> queue. These tricks will only cause more trouble in the future. The
>> cache_reaper needs to be either disabled or run on the right processor.
>> It should never run on the wrong processor.
> I personally agree. Besides, cache_reaper is not alone. Note the comment
> in debug_smp_processor_id() about cpu-bound processes. The slab does the
> correct thing right now, stops the timer on CPU_DEAD. Other problems imho
> should be solved by fixing cpu-hotplug. Gautham and Srivatsa are working
> on that.

I guess I kind of hijacked the thread. The second part of my first email was dropped. Basically I was saying that I'm working on CPU isolation extensions, where an isolated CPU is not supposed to do much kernel work. In which case you'd want to run the slab cache reaper on some other CPU on behalf of the isolated one. Hence the proposal to explicitly pass cpu_id to the reaper. I guess now that you guys fixed the hotplug case it does not help in that scenario.

Max
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>> Well, it seems that we have a set of unresolved issues with workqueues
>> and cpu hotplug. How about storing 'cpu' explicitly in the work queue
>> instead of relying on smp_processor_id() and friends? That way there is
>> no ambiguity when threads/timers get moved around.
> The slab functionality is designed to work on the processor with the
> queue. These tricks will only cause more trouble in the future. The
> cache_reaper needs to be either disabled or run on the right processor.
> It should never run on the wrong processor.
> The cache_reaper() is of no importance to hotplug. You just need to make
> sure that it is not in the way (disable it and if it's running wait until
> the cache_reaper has finished).

I agree that running the reaper on the wrong CPU is not the best way to go about it. But it seems like disabling it is even worse (i.e. wasting memory), unless I'm missing something. Btw, what kind of troubles were you talking about? Performance or robustness? As I said, performance-wise it does not make sense to run the reaper on the wrong CPU, but it does seem to work just fine from the correctness (locking, etc.) perspective. Again, it's totally possible that I'm missing something here :).

Thanks
Max
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Folks,

Oleg Nesterov wrote:
> Even if smp_processor_id() was stable during the execution of
> cache_reap(), this work_struct can be moved to another CPU if CPU_DEAD
> happens. We can't avoid this, and this is correct.

Uhh. This may not be correct in terms of how the slab operates.

> But this is practically impossible to avoid. We can't delay CPU_DOWN
> until all workqueues flush their cwq->worklist. This is livelockable,
> the work can re-queue itself, and new works can be added since the dying
> CPU is still on cpu_online_map. This means that some pending works will
> be processed on another CPU. delayed_work is even worse, the timer can
> migrate as well.
>
> The first problem (smp_processor_id() is not stable) could be solved if
> we use freezer or with the help of not-yet-implemented scalable
> lock_cpu_hotplug.
>
> This means that __get_cpu_var(reap_work) returns a "wrong" struct
> delayed_work. This is absolutely harmless right now, but may be it's
> better to use container_of(unused, struct delayed_work, work).

Well, it seems that we have a set of unresolved issues with workqueues and cpu hotplug. How about storing 'cpu' explicitly in the work queue instead of relying on smp_processor_id() and friends? That way there is no ambiguity when threads/timers get moved around.

I'm cooking a set of patches to extend the cpu isolation concept a bit, in which case I'd like one CPU to run the cache_reap timer on behalf of another cpu. See the patch below.
diff --git a/mm/slab.c b/mm/slab.c
index c610062..0f46d11 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -766,7 +766,17 @@ int slab_is_available(void)
 	return g_cpucache_up == FULL;
 }
 
-static DEFINE_PER_CPU(struct delayed_work, reap_work);
+struct slab_work {
+	struct delayed_work dw;
+	unsigned int cpu;
+};
+
+static DEFINE_PER_CPU(struct slab_work, reap_work);
+
+static inline struct array_cache *cpu_cache_get_on(struct kmem_cache *cachep, unsigned int cpu)
+{
+	return cachep->array[cpu];
+}
 
 static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
 {
@@ -915,9 +925,9 @@ static void init_reap_node(int cpu)
 	per_cpu(reap_node, cpu) = node;
 }
 
-static void next_reap_node(void)
+static void next_reap_node(unsigned int cpu)
 {
-	int node = __get_cpu_var(reap_node);
+	int node = per_cpu(reap_node, cpu);
 
 	/*
 	 * Also drain per cpu pages on remote zones
@@ -928,12 +938,12 @@ static void next_reap_node(void)
 	node = next_node(node, node_online_map);
 	if (unlikely(node >= MAX_NUMNODES))
 		node = first_node(node_online_map);
-	__get_cpu_var(reap_node) = node;
+	per_cpu(reap_node, cpu) = node;
 }
 
 #else
 #define init_reap_node(cpu) do { } while (0)
-#define next_reap_node(void) do { } while (0)
+#define next_reap_node(cpu) do { } while (0)
 #endif
 
 /*
@@ -945,17 +955,18 @@ static void next_reap_node(void)
  */
 static void __devinit start_cpu_timer(int cpu)
 {
-	struct delayed_work *reap_work = &per_cpu(reap_work, cpu);
+	struct slab_work *reap_work = &per_cpu(reap_work, cpu);
 
 	/*
 	 * When this gets called from do_initcalls via cpucache_init(),
 	 * init_workqueues() has already run, so keventd will be setup
 	 * at that time.
 	 */
-	if (keventd_up() && reap_work->work.func == NULL) {
+	if (keventd_up() && reap_work->dw.work.func == NULL) {
 		init_reap_node(cpu);
-		INIT_DELAYED_WORK(reap_work, cache_reap);
-		schedule_delayed_work_on(cpu, reap_work,
+		INIT_DELAYED_WORK(&reap_work->dw, cache_reap);
+		reap_work->cpu = cpu;
+		schedule_delayed_work_on(cpu, &reap_work->dw,
 					__round_jiffies_relative(HZ, cpu));
 	}
 }
@@ -1004,7 +1015,7 @@ static int transfer_objects(struct array_cache *to,
 #ifndef CONFIG_NUMA
 
 #define drain_alien_cache(cachep, alien) do { } while (0)
-#define reap_alien(cachep, l3) do { } while (0)
+#define reap_alien(cachep, l3, cpu) do { } while (0)
 
 static inline struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -1099,9 +1110,9 @@ static void __drain_alien_cache(struct kmem_cache *cachep,
 /*
  * Called from cache_reap() to regularly drain alien caches round robin.
  */
-static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
+static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3, unsigned int cpu)
 {
-	int node = __get_cpu_var(reap_node);
+	int node = per_cpu(reap_node, cpu);
 
 	if (l3->alien) {
 		struct array_cache *ac = l3->alien[node];
@@ -4017,16 +4028,17 @@ void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3,
 * If we cannot acquire the cache chain mutex then just give up - we'll try
 * again on the next iteration.
 */
-static void cache_reap(struct work_struct *unused)
+static void cache_reap(struct work_struct *_work)
 {
 	struct kmem_cache *searchp;
 	struct
Re: [patch] x86: unify/rewrite SMP TSC sync code
> Using gtod() can amount to a substantial disturbance of the thing to be
> measured. Using rdtsc, things seem reliable so far, and we have an FPGA
> (accessed through the PCI bus) that has been programmed to give access
> to an 8MHz clock and we do some checks against that.

Same here. gettimeofday() is way too slow (dual Opteron box) for the frequency I need to call it at. HPET is not available. But TSC is doing just fine. Plus in my case I don't care about sync between CPUs (the thread that uses TSC is running on the isolated CPU) and I have an external time source that takes care of the drift. So please no trapping of RDTSC. Making it clear (bold kernel message during boot :-) that TSC(s) are not in sync or unstable (from the GTOD point of view) is of course perfectly fine.

Thanx
Max
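A sketch of the cheap-timestamp approach described above (my illustration, not anyone's actual code; it assumes x86's rdtsc where available with a clock_gettime fallback elsewhere, and, as noted in the discussion, that the calling thread stays on one CPU and drift is corrected against an external source):

```c
#include <stdint.h>
#include <time.h>

/* Low-overhead timestamp: raw TSC on x86 (the approach described above),
 * CLOCK_MONOTONIC nanoseconds as a portable fallback. The raw TSC value
 * is in CPU cycles, not wall time, and is only meaningful if the caller
 * stays pinned to one CPU and handles calibration/drift itself. */
static inline uint64_t cheap_timestamp(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}
```

Reading the TSC this way costs a few dozen cycles, versus the syscall (or vsyscall plus clocksource read) that gettimeofday() performs, which is the gap being complained about here.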
CD/DVD drive access hangs when media is not inserted
Hi Folks,

I've got an ASUS DVD-E616P2 drive and it seems that media detection is broken with it. Processes that try to access the drive when a CD or DVD is not inserted simply hang until the machine is rebooted. So for example if I do 'cat /dev/cdrom', the first few attempts fail with a 'No medium found' error and dmesg shows 'cdrom: open failed'. But then it hangs in ide_do_drive_cmd:

4435 D+ cat /dev/cdrom ide_do_drive_cmd

From then on the drive is dead. Inserting a CD does not help. Reboot is the only way to bring it back to life. Everything else works just fine. Actually, almost everything. Another annoying problem is that if I pause DVD playback for too long (let's say 10-15 minutes) and then hit play again, DVD access hangs just like in the 'no medium' case. Any ideas? I tried a bunch of kernels: 2.6.8.1, 2.6.9, 2.6.10.

Here is how the drive is recognized at boot:

ICH5: IDE controller at PCI slot 0000:00:1f.1
ACPI: PCI interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 177
ICH5: chipset revision 2
ICH5: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:pio
Probing IDE interface ide0...
hda: SAMSUNG SP1614N, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hdc: ASUS DVD-E616P2, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15

And this is what lspci has to say about my system:

00:00.0 Host bridge: Intel Corp. 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corp. 82865G/PE/P PCI to AGP Controller (rev 02)
00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #3 (rev 02)
00:1d.3 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV34 [GeForce FX 5200] (rev a1)
02:04.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
02:04.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
02:07.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5788 Gigabit Ethernet (rev 03)
02:0d.0 FireWire (IEEE 1394): Agere Systems (former Lucent Microelectronics) FW323 (rev 61)

Max