Re: vmstat: On demand vmstat workers V3
Hi Viresh,

On 04/22/2014 03:32 AM, Viresh Kumar wrote:
> On Thu, Oct 3, 2013 at 11:10 PM, Christoph Lameter wrote:
>> V2->V3:
>> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
>>   if that is exactly what we want but it is a start. Thomas?
>> - Migrate the shepherd task if the output of
>>   tick_get_housekeeping_cpu() changes.
>> - Fixes recommended by Andrew.
>
> Hi Christoph,
>
> This vmstat interrupt is disturbing my core isolation :), have you
> got any further with this patchset?

You don't mean an interrupt, right? The updates are done via the
regular-priority workqueue.

I'm playing with isolation as well (it has been more or less a background
thing for the last 6+ years). Our threads that run on the isolated cores
are SCHED_FIFO and therefore low-prio workqueue stuff, like vmstat,
doesn't get in the way. I do have a few patches for the workqueues to
make things better for isolation.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: why does kernel 3.8-rc1 put all TAP devices into state RUNNING during boot
On 01/05/2013 02:16 AM, Toralf Förster wrote:
> On my stable Gentoo Linux I've observed a change in behaviour of the
> configured TAP devices after the boot process.
>
> $ diff 3.7.1 3.8.0-rc1+ | grep UP
> - br0: flags=4355<UP,BROADCAST,PROMISC,MULTICAST> mtu 1500
> + br0: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 1500
> - tap0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> + tap0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
>
> May I ask you if this changed behaviour is intended?

I'm not aware of any changes in this behavior. btw It looks like it
changed for your bridge interface as well, so it's not really specific
to the TAP devices. Someone from netdev would know :)

Max
Re: [PATCH] MAINTAINERS: fix bouncing tun/tap entries
On 11/30/2012 09:28 AM, David Miller wrote:
> From: Jiri Slaby
> Date: Fri, 30 Nov 2012 18:05:40 +0100
>
>> Delivery to the following recipient failed permanently:
>>
>> v...@office.satix.net
>>
>> Technical details of permanent failure:
>> DNS Error: Domain name not found
>>
>> Of course:
>> $ host office.satix.net
>> Host office.satix.net not found: 3(NXDOMAIN)
>>
>> ===
>>
>> And "Change of Email Address Notification":
>> Old Address        New Address            Email Subject
>> --
>> m...@qualcomm.com  m...@qti.qualcomm.com  "tuntap: multiqueue...
>>
>> Signed-off-by: Jiri Slaby
>
> Applied.

Thanks for fixing that guys.

Max
Re: [net-next v5 0/7] Multiqueue support in tuntap
On 10/31/2012 10:45 PM, Jason Wang wrote:
> Hello All:
>
> This is an update of multiqueue support in tuntap from V3. Please
> consider it for merging.
>
> The main idea of this series is to let the tun/tap device benefit from
> multiqueue network cards and multi-core hosts. We used to have a single
> queue for tuntap, which could be a bottleneck in a multiqueue/multi-core
> environment. So this series lets the device be attached to multiple
> sockets and exposes them through fds to userspace as multiple queues.
> The series was originally designed to serve as a backend for multiqueue
> virtio-net in KVM, but the design is generic enough for other
> applications to use.
>
> Some quick overview of the design:
>
> - Moving the socket from tun_device to tun_file.
> - Allowing multiple sockets to be attached to a tun/tap device.
> - Using RCU to synchronize the data path and system calls.
> - Two new ioctls were added for userspace to attach and detach a socket
>   to/from the device.
> - API compatibility is maintained without notable userspace changes, so
>   legacy userspace that only uses one queue won't need any changes.
> - A flow (rxhash) to queue table is maintained by tuntap, which chooses
>   the txq based on the last rxq the flow came from.

I'm still trying to wrap my head around the new locking/RCU stuff but it
looks like Paul and others already looked at it. Otherwise looks good
to me.

btw In the description above you really meant allowing for attaching
multiple file descriptors, not sockets.

Thanks
Max
Re: Tiny cpusets -- cpusets for small systems?
Hi Paul,

> A couple of proposals have been made recently by people working Linux
> on smaller systems, for improving realtime isolation and memory
> pressure handling:
>
> (1) cpu isolation for hard(er) realtime
>     http://lkml.org/lkml/2008/2/21/517
>     Max Krasnyanskiy <[EMAIL PROTECTED]>
>     [PATCH sched-devel 0/7] CPU isolation extensions
>
> (2) notify user space of tight memory
>     http://lkml.org/lkml/2008/2/9/144
>     KOSAKI Motohiro <[EMAIL PROTECTED]>
>     [PATCH 0/8][for -mm] mem_notify v6
>
> In both cases, some of us have responded "why not use cpusets", and the
> original submitters have replied "cpusets are too fat" (well, they
> were more diplomatic than that, but I guess I can say that ;)

My primary issue with cpusets (from a CPU isolation perspective, that is)
was not the fatness. I did make a couple of comments like "on a dual-cpu
box I do not need cpusets to manage the CPUs" but that's not directly
related to CPU isolation.

For CPU isolation in particular I need code like this:

	int select_irq_affinity(unsigned int irq)
	{
		cpumask_t usable_cpus;

		cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
		irq_desc[irq].affinity = usable_cpus;
		irq_desc[irq].chip->set_affinity(irq, usable_cpus);
		return 0;
	}

How would you implement that with cpusets? I haven't seen your patches
but I'd imagine that they will still need locks and iterators for the
"is CPU N isolated" functionality.

So I see cpusets as a higher-level API/mechanism and cpu_isolated_map as
a lower-level mechanism that actually makes the kernel aware of what's
isolated and what's not. Kind of like the sched domain/cpuset
relationship: cpusets affect sched domains, but the scheduler does not
use cpusets directly.

> I wonder if there might be room for a "tiny cpusets" configuration option:
>  * provide the same hooks to the rest of the kernel, and
>  * provide the same syntactic interface to user space, but
>  * with more limited semantics.
>
> The primary semantic limit I'd suggest would be supporting exactly
> one layer depth of cpusets, not a full hierarchy. So one could still
> successfully issue from user space 'mkdir /dev/cpuset/foo', but trying
> to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of
> very early FAT file systems, which had just a single, fixed size
> root directory ;). There might even be a configurable fixed upper
> limit on how many /dev/cpuset/* directories were allowed, further
> simplifying the locking and dynamic memory behavior of this apparatus.

In the foreseeable future 2-8 cores will be the most common
configuration. Do you think that cpusets are needed/useful for those
machines? The reason I'm asking is because given the restrictions you
mentioned above it seems that you might as well just do

	taskset -c 1,2,3 app1
	taskset -c 3,4,5 app2

Yes, it's not quite the same of course, but imo it covers most cases.
That's what we do on 2-4 cores these days, and we are quite happy with
that. ie We either let the specialized apps manage their thread
affinities themselves or use "taskset" to manage the apps.

> User space would see the same API, except that some valid operations
> on full cpusets, such as a nested mkdir, would fail on tiny cpusets.

Speaking of the user-space API. I guess it's not directly related to the
tiny-cpusets proposal but rather to cpusets in general. Stuff that I'm
working on these days (wireless basestations) is designed with the
following model:

	cpuN             - runs soft-RT networking and management code
	cpuN+1 to cpuN+x - are used as dedicated engines

ie The simplest example would be:

	cpu0 - runs IP, L2 and control plane
	cpu1 - runs hard-RT MAC

So if CPU isolation is implemented on top of cpusets, what kind of API
do you envision for such an app? I mean, currently cpusets seem to
mostly deal with entire processes, whereas in this case we're really
dealing with threads. ie Different threads of the same process require
different policies: some must run on isolated cpus, some must not. I
guess one could write a thread's pid into the cpusets fs, but that's not
very convenient. pthread_setaffinity_np() is exactly what's needed.

Personally I do not see much use for cpusets for those kinds of designs.
But maybe I'm missing something. I got really excited when cpusets were
first merged into mainline, but after looking closer I could not really
find a use for them, at least not for our apps.

Max
[RFC] Genirq and CPU isolation
Hi Thomas,

While reviewing CPU isolation patches Peter pointed out that instead of
changing arch specific irq handling I should be extending genirq code.
Which makes perfect sense. Why didn't I think of that before :)

Basically the idea is that by default isolated CPUs must not get HW irqs
routed to them (besides IPIs and stuff of course). Does the patch
included below look like the right approach?

btw select_smp_affinity(), which is currently used only by alpha, seemed
out of place. It's called multiple times for shared irqs, ie every time
a new handler is registered the irq is moved to a different CPU. So I
moved it under the "if (!shared)" check inside setup_irq().

The patch introduces a generic version of select_smp_affinity() that
sets the affinity mask to "online_cpus - isolated_cpus", and updates the
x86_32 and alpha load balancers to ignore isolated cpus.

Booted on Core2 laptop and dual Opteron boxes with and w/o isolcpus=
options and everything seems to work as expected. I wanted to run this
by you before I include it in my patch series.

Thanx
Max

diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index facf82a..6b01702 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -51,7 +51,7 @@ select_smp_affinity(unsigned int irq)
 	if (!irq_desc[irq].chip->set_affinity || irq_user_affinity[irq])
 		return 1;
 
-	while (!cpu_possible(cpu))
+	while (!cpu_possible(cpu) || cpu_isolated(cpu))
 		cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0);
 	last_cpu = cpu;
diff --git a/arch/x86/kernel/genapic_flat_64.c b/arch/x86/kernel/genapic_flat_64.c
index e02e58c..07352b7 100644
--- a/arch/x86/kernel/genapic_flat_64.c
+++ b/arch/x86/kernel/genapic_flat_64.c
@@ -21,9 +21,7 @@
 static cpumask_t flat_target_cpus(void)
 {
-	cpumask_t target;
-	cpus_andnot(target, cpu_online_map, cpu_isolated_map);
-	return target;
+	return cpu_online_map;
 }
 
 static cpumask_t flat_vector_allocation_domain(int cpu)
diff --git a/arch/x86/kernel/io_apic_32.c b/arch/x86/kernel/io_apic_32.c
index 4ca5486..9c8816f 100644
--- a/arch/x86/kernel/io_apic_32.c
+++ b/arch/x86/kernel/io_apic_32.c
@@ -468,7 +468,7 @@ static void do_irq_balance(void)
 	for_each_possible_cpu(i) {
 		int package_index;
 		CPU_IRQ(i) = 0;
-		if (!cpu_online(i))
+		if (!cpu_online(i) || cpu_isolated(i))
 			continue;
 		package_index = CPU_TO_PACKAGEINDEX(i);
 		for (j = 0; j < NR_IRQS; j++) {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 176e5e7..287bc64 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -253,14 +253,7 @@ static inline void set_balance_irq_affinity(unsigned int irq, cpumask_t mask)
 }
 #endif
 
-#ifdef CONFIG_AUTO_IRQ_AFFINITY
 extern int select_smp_affinity(unsigned int irq);
-#else
-static inline int select_smp_affinity(unsigned int irq)
-{
-	return 1;
-}
-#endif
 
 extern int no_irq_affinity;
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 438a014..e74db94 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -376,6 +376,9 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 		} else
 			/* Undo nested disables: */
 			desc->depth = 1;
+
+		/* Set default affinity mask once everything is setup */
+		select_smp_affinity(irq);
 	}
 	/* Reset broken irq detection when installing new handler */
 	desc->irq_count = 0;
@@ -488,6 +491,26 @@ void free_irq(unsigned int irq, void *dev_id)
 }
 EXPORT_SYMBOL(free_irq);
 
+#ifndef CONFIG_AUTO_IRQ_AFFINITY
+/**
+ * Generic version of the affinity autoselector.
+ * Called under desc->lock from setup_irq().
+ * btw Should we rename this to select_irq_affinity() ?
+ */
+int select_smp_affinity(unsigned int irq)
+{
+	cpumask_t usable_cpus;
+
+	if (!irq_can_set_affinity(irq))
+		return 0;
+
+	cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
+	irq_desc[irq].affinity = usable_cpus;
+	irq_desc[irq].chip->set_affinity(irq, usable_cpus);
+	return 0;
+}
+#endif
+
 /**
  * request_irq - allocate an interrupt line
  * @irq: Interrupt line to allocate
@@ -555,8 +578,6 @@ int request_irq(unsigned int irq, irq_handler_t handler,
 	action->next = NULL;
 	action->dev_id = dev_id;
 
-	select_smp_affinity(irq);
-
 #ifdef CONFIG_DEBUG_SHIRQ
 	if (irqflags & IRQF_SHARED) {
 		/*
[PATCH sched-devel 1/7] cpuisol: Make cpu isolation configurable and export isolated map
This simple patch introduces a new config option for CPU isolation. The
reason I created the separate Kconfig file here is because more options
will be added by the following patches.

The patch also exports cpu_isolated_map, provides a cpu_isolated()
accessor macro and provides access to the isolation bit via sysfs. In
other words cpu_isolated_map is exposed to the rest of the kernel and
the user-space in much the same way cpu_online_map is exposed today.

While at it I also moved cpu_*_map from kernel/sched.c into kernel/cpu.c.
Those maps have very little to do with the scheduler these days and
therefore seem out of place in the scheduler code.

This patch does not change/affect any existing scheduler functionality.

Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]>
---
 arch/x86/Kconfig        |    1 +
 drivers/base/cpu.c      |   48 ++++++++++++++++++++++++++
 include/linux/cpumask.h |    3 ++
 kernel/Kconfig.cpuisol  |   15 ++++++++
 kernel/Makefile         |    4 +-
 kernel/cpu.c            |   49 +++++++++++++++++++++++++
 kernel/sched.c          |   36 ------------------
 7 files changed, 118 insertions(+), 38 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3be2305..d228488 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -526,6 +526,7 @@ config SCHED_MC
 	  increased overhead in some places. If unsure say N here.
 
 source "kernel/Kconfig.preempt"
+source "kernel/Kconfig.cpuisol"
 
 config X86_UP_APIC
 	bool "Local APIC support on uniprocessors"
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 499b003..b6c5e0f 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -55,10 +55,58 @@ static ssize_t store_online(struct sys_device *dev, const char *buf,
 }
 static SYSDEV_ATTR(online, 0644, show_online, store_online);
 
+#ifdef CONFIG_CPUISOL
+/*
+ * This is under config hotplug because in order to
+ * dynamically isolate a CPU it needs to be brought off-line first.
+ * In other words the sequence is
+ *	echo 0 > /sys/device/system/cpuN/online
+ *	echo 1 > /sys/device/system/cpuN/isolated
+ *	echo 1 > /sys/device/system/cpuN/online
+ */
+static ssize_t show_isol(struct sys_device *dev, char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+
+	return sprintf(buf, "%u\n", !!cpu_isolated(cpu->sysdev.id));
+}
+
+static ssize_t store_isol(struct sys_device *dev, const char *buf,
+			  size_t count)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+	ssize_t ret = 0;
+
+	if (cpu_online(cpu->sysdev.id))
+		return -EBUSY;
+
+	switch (buf[0]) {
+	case '0':
+		cpu_clear(cpu->sysdev.id, cpu_isolated_map);
+		break;
+	case '1':
+		cpu_set(cpu->sysdev.id, cpu_isolated_map);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	if (ret >= 0)
+		ret = count;
+	return ret;
+}
+static SYSDEV_ATTR(isolated, 0600, show_isol, store_isol);
+#endif /* CONFIG_CPUISOL */
+
 static void __devinit register_cpu_control(struct cpu *cpu)
 {
 	sysdev_create_file(&cpu->sysdev, &attr_online);
+
+#ifdef CONFIG_CPUISOL
+	sysdev_create_file(&cpu->sysdev, &attr_isolated);
+#endif
 }
 
 void unregister_cpu(struct cpu *cpu)
 {
 	int logical_cpu = cpu->sysdev.id;
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 7047f58..cde2964 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_t *dstp, const cpumask_t *srcp,
 extern cpumask_t cpu_possible_map;
 extern cpumask_t cpu_online_map;
 extern cpumask_t cpu_present_map;
+extern cpumask_t cpu_isolated_map;
 
 #if NR_CPUS > 1
 #define num_online_cpus()	cpus_weight(cpu_online_map)
@@ -388,6 +389,7 @@
 #define cpu_online(cpu)		cpu_isset((cpu), cpu_online_map)
 #define cpu_possible(cpu)	cpu_isset((cpu), cpu_possible_map)
 #define cpu_present(cpu)	cpu_isset((cpu), cpu_present_map)
+#define cpu_isolated(cpu)	cpu_isset((cpu), cpu_isolated_map)
 #else
 #define num_online_cpus()	1
 #define num_possible_cpus()	1
@@ -395,6 +397,7 @@ extern cpumask_t cpu_present_map;
 #define cpu_online(cpu)		((cpu) == 0)
 #define cpu_possible(cpu)	((cpu) == 0)
 #define cpu_present(cpu)	((cpu) == 0)
+#define cpu_isolated(cpu)	(0)
 #endif
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol
new file mode 100644
index 000..e606477
--- /dev/null
+++ b/kernel/Kconfig.cpuisol
@@ -0,0 +1,15 @@
+config CPUISOL
+	depends on SMP
+	bool "CPU isolation"
+	help
+	  This option enables support for CPU isolation.
+	  If enabled the kernel will try to avoid kernel activity on the isolated CPUs.
+	  By default user-space threads are not scheduled on the isolated CPUs unless
+	  they explicitly request it (via sched_ and pthread_ affinity calls). Isolated
+	  CPUs are not subject to the scheduler load-balancing algorithms.
+
+	  CPUs can be marked as isolated using 'isolcpus=' command line option or by
+	  writing '1' into /sys/devices/system/cpu/cpuN/isolated.
+
+	  This feature is useful for hard realtime and high performance applications.
+	  If unsure say 'N'.
[PATCH sched-devel 5/7] cpuisol: Documentation updates
Documented sysfs interface as suggested by Andrew Morton. Added general
documentation that describes how to configure and use CPU isolation
features.

Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |   41 +++++++
 Documentation/cpu-isolation.txt                    |  113 ++++++++++++++++++++
 2 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
new file mode 100644
index 000..32dde5b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -0,0 +1,41 @@
+What:		/sys/devices/system/cpu/...
+Date:		Feb. 2008
+KernelVersion:	2.6.24
+Contact:	LKML
+Description:
+
+The /sys/devices/system/cpu tree provides information about all cpu's
+known to the running kernel.
+
+Following files are created for each cpu. 'N' is the cpu number.
+
+/sys/devices/system/cpu/cpuN/
+   online (0644)    On-line attribute. Indicates whether the cpu is on-line.
+                    The cpu can be brought off-line by writing '0' into
+                    this file. Similarly it can be brought back on-line
+                    by writing '1' into this file. This attribute is
+                    not available for the cpu's that cannot be brought
+                    off-line. Typically cpu0. For more information see
+                    Documentation/cpu-hotplug.txt
+
+   isolated (0644)  Isolation attribute. Indicates whether the cpu
+                    is isolated.
+                    The cpu can be isolated by writing '1' into this
+                    file. Similarly it can be un-isolated by writing
+                    '0' into this file. In order to isolate the cpu it
+                    must first be brought off-line. This attribute is
+                    not available for the cpu's that cannot be brought
+                    off-line. Typically cpu0.
+                    Note this attribute is present only if "CPU isolation"
+                    is enabled. For more information see
+                    Documentation/cpu-isolation.txt
+
+   cpufreq (0755)   Frequency scaling state.
+                    For more info see
+                    Documentation/cpu-freq/...
+
+   cache (0755)     Cache information. FIXME
+
+   cpuidle (0755)   Idle state information. FIXME
+
+   topology (0755)  Topology information. FIXME
diff --git a/Documentation/cpu-isolation.txt b/Documentation/cpu-isolation.txt
new file mode 100644
index 000..b9ca425
--- /dev/null
+++ b/Documentation/cpu-isolation.txt
@@ -0,0 +1,113 @@
+CPU isolation support in Linux(tm) Kernel
+
+Maintainers:
+
+Scheduler and scheduler domain bits:
+   Ingo Molnar <[EMAIL PROTECTED]>
+
+General framework, irq and workqueue isolation:
+   Max Krasnyanskiy <[EMAIL PROTECTED]>
+
+ChangeLog:
+- Initial version. Feb 2008, MaxK
+
+Introduction
+------------
+
+The primary idea behind CPU isolation is the ability to use some CPU cores
+as a dedicated engines for running user-space code with minimal kernel
+overhead/intervention, think of it as an SPE in the Cell processor. For
+example CPU isolation allows for running CPU intensive(100%) RT task
+on one of the processors without adversely affecting or being affected
+by the other system activities. With the current (as of early 2008)
+multi-core CPU trend we may see more and more applications that explore
+this capability: real-time gaming engines, simulators, hard real-time
+apps, etc.
+
+Current CPU isolation support consists of the following features:
+
+1. Isolated CPU(s) are excluded from the scheduler load balancing logic.
+   Applications must explicitly bind threads in order to run on those
+   CPU(s).
+
+2. By default interrupts are not routed to the isolated CPU(s).
+   Users must route interrupts (if any) to those CPU(s) explicitly.
+
+3. Kernel avoids any activity on the isolated CPU(s) as much as possible.
+   This includes workqueues, per CPU threads, etc. Please note that
+   this feature is optional and is disabled by default.
+
+Kernel configuration options
+----------------------------
+
+Following options need to be enabled in order to use CPU isolation
+
+   CONFIG_CPUISOL              Top-level config option. Enables general
+                               CPU isolation framework and enables
+                               features #1 and #2 described above.
+
+   CONFIG_CPUISOL_WORKQUEUE    These options provide deeper isolation
+   CONFIG_CPUISOL_STOPMACHINE  from various kernel subsystems. They
+   CONFIG_CPUISOL_...          implement feature #3 described above.
+                               See Kconfig help for more information on
+                               each individual option.
+
+How to isolate a CPU
+--------------------
+
+There are two ways for isolating a CPU
+
+Kernel boot comm
[PATCH sched-devel 6/7] cpuisol: Minor updates to the Kconfig options
Fixed a couple of typos, long lines and referred to the documentation
file.

Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]>
---
 kernel/Kconfig.cpuisol |   31 +++++++++++++++++--------------
 1 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol
index 81f1972..e681b02 100644
--- a/kernel/Kconfig.cpuisol
+++ b/kernel/Kconfig.cpuisol
@@ -2,23 +2,26 @@ config CPUISOL
 	depends on SMP
 	bool "CPU isolation"
 	help
-	  This option enables support for CPU isolation.
-	  If enabled the kernel will try to avoid kernel activity on the isolated CPUs.
-	  By default user-space threads are not scheduled on the isolated CPUs unless
-	  they explicitly request it (via sched_ and pthread_ affinity calls). Isolated
-	  CPUs are not subject to the scheduler load-balancing algorithms.
-
-	  CPUs can be marked as isolated using 'isolcpus=' command line option or by
-	  writing '1' into /sys/devices/system/cpu/cpuN/isolated.
-
-	  This feature is useful for hard realtime and high performance applications.
+	  This option enables support for CPU isolation. If enabled the
+	  kernel will try to avoid kernel activity on the isolated CPUs.
+	  By default user-space threads are not scheduled on the isolated
+	  CPUs unless they explicitly request it via sched_setaffinity()
+	  and pthread_setaffinity_np() calls. Isolated CPUs are not
+	  subject to the scheduler load-balancing algorithms.
+
+	  This feature is useful for hard realtime and high performance
+	  applications.
+	  See Documentation/cpu-isolation.txt for more details.
+	  If unsure say 'N'.
 
 config CPUISOL_WORKQUEUE
 	bool "Do not schedule workqueues on the isolated CPUs (EXPERIMENTAL)"
 	depends on CPUISOL && EXPERIMENTAL
 	help
-	  In this option is enabled kernel will not schedule workqueues on the
-	  isolated CPUs.
-	  Please note that at this point this feature is experimental. It brakes
-	  certain things like OProfile that heavily rely on per cpu workqueues.
+	  If this option is enabled kernel will not schedule workqueues on
+	  the isolated CPUs. Please note that at this point this feature
+	  is experimental. It breaks certain things like OProfile that
+	  heavily rely on per cpu workqueues.
+
+	  Say 'Y' to enable workqueue isolation. If unsure say 'N'.
-- 
1.5.4.1
[PATCH sched-devel 3/7] cpuisol: Do not schedule workqueues on the isolated CPUs
This patch addresses the use case where a high-priority realtime (FIFO, RR) user-space thread uses 100% of a CPU for extended periods of time. In that case kernel workqueue threads do not get a chance to run, and the entire machine essentially hangs because other CPUs are waiting for scheduled workqueues to flush. This use case is perfectly valid if one is using a CPU as a dedicated engine (crunching numbers, hard realtime, etc). Think of it as an SPE in the Cell processor, which is what CPU isolation enables in the first place. Most kernel subsystems do not rely on the per CPU workqueues. In fact we already have support for single threaded workqueues; this patch just makes it automatic. As mentioned in the introductory email this functionality has been tested on a wide range of full-fledged systems (with IDE, SATA, USB, automount, NFS, NUMA, etc) in a production environment. The only feature (that I know of) that does not work when workqueue isolation is enabled is OProfile. It does not result in crashes or instability, OProfile is just unable to collect stats from the isolated CPUs. Hence this feature is marked as experimental. There is zero overhead if workqueue isolation is disabled. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- kernel/Kconfig.cpuisol |9 + kernel/workqueue.c | 30 +++--- 2 files changed, 32 insertions(+), 7 deletions(-) diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol index e606477..81f1972 100644 --- a/kernel/Kconfig.cpuisol +++ b/kernel/Kconfig.cpuisol @@ -13,3 +13,12 @@ config CPUISOL This feature is useful for hard realtime and high performance applications. If unsure say 'N'. + +config CPUISOL_WORKQUEUE + bool "Do not schedule workqueues on the isolated CPUs (EXPERIMENTAL)" + depends on CPUISOL && EXPERIMENTAL + help + In this option is enabled kernel will not schedule workqueues on the + isolated CPUs. + Please note that at this point this feature is experimental.
It brakes + certain things like OProfile that heavily rely on per cpu workqueues. diff --git a/kernel/workqueue.c b/kernel/workqueue.c index ff06611..f48e13c 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -35,6 +35,16 @@ #include /* + * Stub out cpu_isolated() if isolated CPUs are allowed to + * run workqueues. + */ +#ifdef CONFIG_CPUISOL_WORKQUEUE +#define cpu_unusable(cpu) cpu_isolated(cpu) +#else +#define cpu_unusable(cpu) (0) +#endif + +/* * The per-CPU workqueue (if single thread, we always use the first * possible cpu). */ @@ -97,7 +107,7 @@ static const cpumask_t *wq_cpu_map(struct workqueue_struct *wq) static struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu) { - if (unlikely(is_single_threaded(wq))) + if (unlikely(is_single_threaded(wq)) || cpu_unusable(cpu)) cpu = singlethread_cpu; return per_cpu_ptr(wq->cpu_wq, cpu); } @@ -229,9 +239,11 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq, timer->data = (unsigned long)dwork; timer->function = delayed_work_timer_fn; - if (unlikely(cpu >= 0)) + if (unlikely(cpu >= 0)) { + if (cpu_unusable(cpu)) + cpu = singlethread_cpu; add_timer_on(timer, cpu); - else + } else add_timer(timer); ret = 1; } @@ -605,7 +617,8 @@ int schedule_on_each_cpu(work_func_t func) get_online_cpus(); for_each_online_cpu(cpu) { struct work_struct *work = per_cpu_ptr(works, cpu); - + if (cpu_unusable(cpu)) + continue; INIT_WORK(work, func); set_bit(WORK_STRUCT_PENDING, work_data_bits(work)); __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work); @@ -754,7 +767,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name, for_each_possible_cpu(cpu) { cwq = init_cpu_workqueue(wq, cpu); - if (err || !cpu_online(cpu)) + if (err || !cpu_online(cpu) || cpu_unusable(cpu)) continue; err = create_workqueue_thread(cwq, cpu); start_workqueue_thread(cwq, cpu); @@ -833,8 +846,11 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb, struct cpu_workqueue_struct *cwq; 
struct workqueue_struct *wq; - action &= ~CPU_TASKS_FROZEN; + if (cpu_unusable(cpu)) + return NOTIFY_OK; + action &= ~CPU_TASKS_FROZEN; + switch (action) { case CPU_UP_PREPARE: @@ -869,7 +885,7 @@ static int __devinit workqueue
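The redirection rule this patch adds to wq_per_cpu() can be modeled in a few lines of userspace C. This is a sketch, not kernel code: plain bitmasks stand in for cpumask_t, and the names (`isolated_mask`, `wq_target_cpu`) are illustrative, with CPU 0 assumed to be the housekeeping/single-thread CPU.

```c
#include <assert.h>

/* Userspace model of the patch's queueing rule: work queued for an
 * unusable (isolated) CPU falls back to the single-threaded CPU,
 * exactly as wq_per_cpu() does when cpu_unusable(cpu) is true. */
enum { SINGLETHREAD_CPU = 0 };   /* assumed housekeeping CPU */

unsigned long isolated_mask;     /* bit n set => CPU n is isolated */

int cpu_unusable(int cpu)
{
    return (isolated_mask >> cpu) & 1;
}

/* Which CPU actually runs work queued for 'cpu'? */
int wq_target_cpu(int cpu)
{
    return cpu_unusable(cpu) ? SINGLETHREAD_CPU : cpu;
}
```

With CPUs 2 and 3 isolated, work queued for CPU 1 stays on CPU 1 while work queued for CPU 2 or 3 lands on CPU 0.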
[PATCH sched-devel 7/7] cpuisol: Do not halt isolated CPUs with Stop Machine
This patch makes "stop machine" ignore isolated CPUs (if the config option is enabled). It addresses the exact same use case explained in the previous workqueue isolation patch, where a user-space RT thread can prevent stop machine threads from running, which causes the entire system to hang. Stop machine is particularly bad when it comes to latencies because it halts every single CPU and may take several milliseconds to complete. It's currently used for module insertion and removal only. As some folks pointed out in the previous discussions this patch is potentially unsafe if applications running on the isolated CPUs use kernel services affected by the module insertion and removal. I've been running kernels with this patch on a wide range of machines in a production environment where we routinely insert/remove modules with applications running on isolated CPUs. Also I've recently done quite a bit of testing on live multi-core systems with "stop machine" _completely_ disabled, and was not able to trigger any problems. For more details please see this thread http://marc.info/?l=linux-kernel&m=120243837206248&w=2 That of course does not mean that the patch is totally safe but it does not seem to cause any instability in real life. This feature does not add any overhead when disabled. It's marked as experimental due to potential issues mentioned above. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- kernel/Kconfig.cpuisol | 15 +++ kernel/stop_machine.c |8 +++- 2 files changed, 22 insertions(+), 1 deletions(-) diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol index e681b02..24c1ef0 100644 --- a/kernel/Kconfig.cpuisol +++ b/kernel/Kconfig.cpuisol @@ -25,3 +25,18 @@ config CPUISOL_WORKQUEUE heavily rely on per cpu workqueues. Say 'Y' to enable workqueue isolation. If unsure say 'N'.
+ +config CPUISOL_STOPMACHINE + bool "Do not halt isolated CPUs with Stop Machine (EXPERIMENTAL)" + depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL + help + If this option is enabled kernel will not halt isolated CPUs + when Stop Machine is triggered. Stop Machine is currently only + used by the module insertion and removal. + Please note that at this point this feature is experimental. It is + not known to really break anything but can potentially introduce + an instability due to race conditions in module removal logic. + + Say 'Y' if support for dynamic module insertion and removal is + required for the system that uses isolated CPUs. + If unsure say 'N'. diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c index 6f4e0e1..aa3af15 100644 --- a/kernel/stop_machine.c +++ b/kernel/stop_machine.c @@ -89,6 +89,12 @@ static void stopmachine_set_state(enum stopmachine_state state) cpu_relax(); } +#ifdef CONFIG_CPUISOL_STOPMACHINE +#define cpu_unusable(cpu) cpu_isolated(cpu) +#else +#define cpu_unusable(cpu) (0) +#endif + static int stop_machine(void) { int i, ret = 0; @@ -98,7 +104,7 @@ static int stop_machine(void) stopmachine_state = STOPMACHINE_WAIT; for_each_online_cpu(i) { - if (i == raw_smp_processor_id()) + if (i == raw_smp_processor_id() || cpu_unusable(i)) continue; ret = kernel_thread(stopmachine, (void *)(long)i,CLONE_KERNEL); if (ret < 0) -- 1.5.4.1
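The effect of the stop_machine() change above can be sketched in userspace C. This model only counts how many stopmachine threads would be spawned; bitmasks stand in for cpumask_t and all names are illustrative, not taken from the kernel.

```c
#include <assert.h>

/* Sketch of the patched stop_machine() loop: with the option enabled,
 * the CPU running stop_machine itself and all isolated CPUs are
 * skipped, so no stopmachine thread is created for them. */
int stopmachine_threads(unsigned long online, unsigned long isolated,
                        int self, int ncpus)
{
    int i, n = 0;

    for (i = 0; i < ncpus; i++) {
        if (!((online >> i) & 1))
            continue;               /* offline CPUs are never halted */
        if (i == self || ((isolated >> i) & 1))
            continue;               /* current CPU or isolated: skipped */
        n++;                        /* this CPU would be halted */
    }
    return n;
}
```

On a 4-CPU box with CPU 3 isolated and stop_machine running on CPU 0, only CPUs 1 and 2 get halted, which is exactly the latency win the commit message describes.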
[PATCH sched-devel 4/7] cpuisol: Move on-stack array used for boot cmd parsing into __initdata
Suggested by Andrew Morton: isolated_cpu_setup() has an on-stack array of NR_CPUS integers. This will consume 4k of stack on ia64 (at least). We'll just squeak through for a little while, but this needs to be fixed. Just move it into __initdata. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- kernel/cpu.c | 15 ++- 1 files changed, 10 insertions(+), 5 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index a0ac386..b3af739 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -446,15 +446,20 @@ out: #ifdef CONFIG_CPUISOL /* Setup the mask of isolated cpus */ + +static int __initdata isolcpu[NR_CPUS]; + static int __init isolated_cpu_setup(char *str) { - int ints[NR_CPUS], i; + int i, n; + + str = get_options(str, ARRAY_SIZE(isolcpu), isolcpu); + n = isolcpu[0]; - str = get_options(str, ARRAY_SIZE(ints), ints); cpus_clear(cpu_isolated_map); - for (i = 1; i <= ints[0]; i++) - if (ints[i] < NR_CPUS) - cpu_set(ints[i], cpu_isolated_map); + for (i = 1; i <= n; i++) + if (isolcpu[i] < NR_CPUS) + cpu_set(isolcpu[i], cpu_isolated_map); return 1; } -- 1.5.4.1
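The patch above depends on the get_options() contract: element 0 of the output array receives the count, and parsed values start at element 1. A minimal userspace model of that contract (range syntax such as "0-3" is omitted, and the name `parse_int_list` is this sketch's, not the kernel's):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of get_options(): parse a comma-separated integer
 * list into out[], with out[0] receiving the count and the values
 * stored from out[1] on -- the layout isolated_cpu_setup() iterates
 * over with "for (i = 1; i <= n; i++)". */
const char *parse_int_list(const char *str, int nints, int *out)
{
    int n = 0;
    char *end;

    while (n < nints - 1) {
        long v = strtol(str, &end, 0);
        if (end == str)
            break;                  /* no digits: stop parsing */
        out[++n] = (int)v;
        str = end;
        if (*str != ',')
            break;
        str++;                      /* skip the comma */
    }
    out[0] = n;
    return str;                     /* unparsed remainder */
}
```

Parsing "1,3,5" yields out[0]=3 with values 1, 3, 5 in slots 1..3, matching how the boot parameter "isolcpus=1,3,5" populates cpu_isolated_map.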
[PATCH sched-devel 2/7] cpuisol: Do not route IRQs to the CPUs isolated at boot
Most people would expect isolated CPUs to not get any IRQs by default. This happens naturally if a CPU is brought off-line, marked isolated and then brought back online. There was some confusion about this patch originally, so I wanted to clarify that it does not completely disable IRQ handling on the isolated CPUs. Users still have the option of routing IRQs to them by modifying the IRQ affinity mask. I cannot test other archs, hence the patch is for x86_64 only. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- arch/x86/kernel/genapic_flat_64.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/genapic_flat_64.c b/arch/x86/kernel/genapic_flat_64.c index 07352b7..e02e58c 100644 --- a/arch/x86/kernel/genapic_flat_64.c +++ b/arch/x86/kernel/genapic_flat_64.c @@ -21,7 +21,9 @@ static cpumask_t flat_target_cpus(void) { - return cpu_online_map; + cpumask_t target; + cpus_andnot(target, cpu_online_map, cpu_isolated_map); + return target; } static cpumask_t flat_vector_allocation_domain(int cpu) -- 1.5.4.1
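The mask arithmetic at the heart of this one-hunk patch is just set subtraction. A userspace sketch with plain bitmasks standing in for cpumask_t (the function name is illustrative):

```c
#include <assert.h>

/* Model of the patched flat_target_cpus(): the default APIC target
 * set is the online CPUs minus the isolated ones (cpus_andnot in the
 * diff). Users can still widen an individual irq's affinity later
 * via the irq affinity mask, as the commit message notes. */
unsigned long flat_target_cpus_model(unsigned long online,
                                     unsigned long isolated)
{
    return online & ~isolated;
}
```

With CPUs 0-3 online and 2-3 isolated (masks 0xF and 0xC), new IRQs default to CPUs 0-1 only (mask 0x3).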
[PATCH sched-devel 5/7] cpuisol: Documentation updates
Documented sysfs interface as suggested by Andrew Morton. Added general documentation that describes how to configure and use CPU isolation features. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- Documentation/ABI/testing/sysfs-devices-system-cpu | 41 +++ Documentation/cpu-isolation.txt | 113 2 files changed, 154 insertions(+), 0 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu new file mode 100644 index 000..32dde5b --- /dev/null +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -0,0 +1,41 @@ +What: /sys/devices/system/cpu/... +Date: Feb. 2008 +KernelVersion: 2.6.24 +Contact: LKML <linux-kernel@vger.kernel.org> +Description: + +The /sys/devices/system/cpu tree provides information about all cpus +known to the running kernel. + +Following files are created for each cpu. 'N' is the cpu number. + +/sys/devices/system/cpu/cpuN/ + online (0644) On-line attribute. Indicates whether the cpu is on-line. +The cpu can be brought off-line by writing '0' into +this file. Similarly it can be brought back on-line +by writing '1' into this file. This attribute is +not available for the cpus that cannot be brought +off-line. Typically cpu0. For more information see +Documentation/cpu-hotplug.txt + + isolated (0644) Isolation attribute. Indicates whether the cpu +is isolated. +The cpu can be isolated by writing '1' into this +file. Similarly it can be un-isolated by writing +'0' into this file. In order to isolate the cpu it +must first be brought off-line. This attribute is +not available for the cpus that cannot be brought +off-line. Typically cpu0. +Note this attribute is present only if CPU isolation +is enabled. For more information see +Documentation/cpu-isolation.txt + + cpufreq (0755) Frequency scaling state. +For more info see +Documentation/cpu-freq/... + + cache (0755) Cache information. FIXME + + cpuidle (0755) Idle state information.
FIXME + + topology (0755) Topology information. FIXME diff --git a/Documentation/cpu-isolation.txt b/Documentation/cpu-isolation.txt new file mode 100644 index 000..b9ca425 --- /dev/null +++ b/Documentation/cpu-isolation.txt @@ -0,0 +1,113 @@ +CPU isolation support in Linux(tm) Kernel + +Maintainers: + +Scheduler and scheduler domain bits: + Ingo Molnar <[EMAIL PROTECTED]> + +General framework, irq and workqueue isolation: + Max Krasnyanskiy <[EMAIL PROTECTED]> + +ChangeLog: +- Initial version. Feb 2008, MaxK + +Introduction + + +The primary idea behind CPU isolation is the ability to use some CPU cores +as dedicated engines for running user-space code with minimal kernel +overhead/intervention, think of it as an SPE in the Cell processor. For +example CPU isolation allows for running a CPU intensive (100%) RT task +on one of the processors without adversely affecting or being affected +by the other system activities. With the current (as of early 2008) +multi-core CPU trend we may see more and more applications that explore +this capability: real-time gaming engines, simulators, hard real-time +apps, etc. + +Current CPU isolation support consists of the following features: + +1. Isolated CPU(s) are excluded from the scheduler load balancing logic. + Applications must explicitly bind threads in order to run on those + CPU(s). + +2. By default interrupts are not routed to the isolated CPU(s). + Users must route interrupts (if any) to those CPU(s) explicitly. + +3. Kernel avoids any activity on the isolated CPU(s) as much as possible. + This includes workqueues, per CPU threads, etc. Please note that + this feature is optional and is disabled by default. + +Kernel configuration options + + +Following options need to be enabled in order to use CPU isolation + CONFIG_CPUISOL Top-level config option. Enables general +CPU isolation framework and enables features +#1 and #2 described above.
+ + CONFIG_CPUISOL_WORKQUEUEThese options provide deeper isolation + CONFIG_CPUISOL_STOPMACHINE from various kernel subsystems. They implement + CONFIG_CPUISOL_... feature #3 described above. +See Kconfig help for more information on each +individual option. + +How to isolate a CPU + + +There are two ways for isolating a CPU + +Kernel boot command line
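Feature #1 above means applications must opt onto an isolated CPU themselves. A minimal sketch of that step in C, assuming a glibc Linux environment; the actual sched_setaffinity() call is left commented out so the sketch stays side-effect free, and the function name is illustrative:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Build the single-CPU affinity mask an application would pass to
 * sched_setaffinity() to bind a thread onto isolated CPU 'cpu'.
 * Binding is required because isolated CPUs are excluded from the
 * scheduler's load balancing. */
void isolated_cpu_mask(cpu_set_t *set, int cpu)
{
    CPU_ZERO(set);
    CPU_SET(cpu, set);
    /* Real code would follow with:
     *   sched_setaffinity(0, sizeof(*set), set);
     * and optionally switch to SCHED_FIFO via sched_setscheduler(). */
}
```

After this call the mask contains exactly the requested CPU and nothing else.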
[PATCH sched-devel 1/7] cpuisol: Make cpu isolation configurable and export isolated map
This simple patch introduces a new config option for CPU isolation. The reason I created the separate Kconfig file here is because more options will be added by the following patches. The patch also exports cpu_isolated_map, provides a cpu_isolated() accessor macro and provides access to the isolation bit via sysfs. In other words cpu_isolated_map is exposed to the rest of the kernel and the user-space in much the same way cpu_online_map is exposed today. While at it I also moved cpu_*_map from kernel/sched.c into kernel/cpu.c. Those maps have very little to do with the scheduler these days and therefore seem out of place in the scheduler code. This patch does not change/affect any existing scheduler functionality. Signed-off-by: Max Krasnyansky <[EMAIL PROTECTED]> --- arch/x86/Kconfig|1 + drivers/base/cpu.c | 48 ++ include/linux/cpumask.h |3 ++ kernel/Kconfig.cpuisol | 15 ++ kernel/Makefile |4 +- kernel/cpu.c| 49 +++ kernel/sched.c | 36 -- 7 files changed, 118 insertions(+), 38 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3be2305..d228488 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -526,6 +526,7 @@ config SCHED_MC increased overhead in some places. If unsure say N here. source "kernel/Kconfig.preempt" +source "kernel/Kconfig.cpuisol" config X86_UP_APIC bool "Local APIC support on uniprocessors" diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 499b003..b6c5e0f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -55,10 +55,58 @@ static ssize_t store_online(struct sys_device *dev, const char *buf, } static SYSDEV_ATTR(online, 0644, show_online, store_online); +#ifdef CONFIG_CPUISOL +/* + * This is under config hotplug because in order to + * dynamically isolate a CPU it needs to be brought off-line first.
+ * In other words the sequence is + * echo 0 > /sys/device/system/cpuN/online + * echo 1 > /sys/device/system/cpuN/isolated + * echo 1 > /sys/device/system/cpuN/online + */ +static ssize_t show_isol(struct sys_device *dev, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + + return sprintf(buf, "%u\n", !!cpu_isolated(cpu->sysdev.id)); +} + +static ssize_t store_isol(struct sys_device *dev, const char *buf, + size_t count) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + ssize_t ret = 0; + + if (cpu_online(cpu->sysdev.id)) + return -EBUSY; + + switch (buf[0]) { + case '0': + cpu_clear(cpu->sysdev.id, cpu_isolated_map); + break; + case '1': + cpu_set(cpu->sysdev.id, cpu_isolated_map); + break; + default: + ret = -EINVAL; + } + + if (ret >= 0) + ret = count; + return ret; +} +static SYSDEV_ATTR(isolated, 0600, show_isol, store_isol); +#endif /* CONFIG_CPUISOL */ + static void __devinit register_cpu_control(struct cpu *cpu) { sysdev_create_file(&cpu->sysdev, &attr_online); + +#ifdef CONFIG_CPUISOL + sysdev_create_file(&cpu->sysdev, &attr_isolated); +#endif } + void unregister_cpu(struct cpu *cpu) { int logical_cpu = cpu->sysdev.id; diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 7047f58..cde2964 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_t *dstp, const cpumask_t *srcp, extern cpumask_t cpu_possible_map; extern cpumask_t cpu_online_map; extern cpumask_t cpu_present_map; +extern cpumask_t cpu_isolated_map; #if NR_CPUS > 1 #define num_online_cpus() cpus_weight(cpu_online_map) @@ -388,6 +389,7 @@ extern cpumask_t cpu_present_map; #define cpu_online(cpu)cpu_isset((cpu), cpu_online_map) #define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map) #define cpu_present(cpu) cpu_isset((cpu), cpu_present_map) +#define cpu_isolated(cpu) cpu_isset((cpu), cpu_isolated_map) #else #define num_online_cpus() 1 #define num_possible_cpus() 1 @@ -395,6 +397,7 @@ extern
cpumask_t cpu_present_map; #define cpu_online(cpu)((cpu) == 0) #define cpu_possible(cpu) ((cpu) == 0) #define cpu_present(cpu) ((cpu) == 0) +#define cpu_isolated(cpu) (0) #endif #define cpu_is_offline(cpu)unlikely(!cpu_online(cpu)) diff --git a/kernel/Kconfig.cpuisol b/kernel/Kconfig.cpuisol new file mode 100644 index 000..e606477 --- /dev/null +++ b/kernel/Kconfig.cpuisol @@ -0,0 +1,15 @@ +config CPUISOL + depends on SMP + bool CPU isolation + help + This option enables support for CPU isolation. + If enabled the kernel will try to avoid kernel activity on the isolated CPUs. + By default user-space
[RFC] Genirq and CPU isolation
Hi Thomas,

While reviewing CPU isolation patches Peter pointed out that instead of changing arch specific irq handling I should be extending genirq code. Which makes perfect sense. Why didn't I think of that before :)

Basically the idea is that by default isolated CPUs must not get HW irqs routed to them (besides IPIs and stuff of course). Does the patch included below look like the right approach ?

btw select_smp_affinity() which is currently used only by alpha seemed out of place. It's called multiple times for shared irqs. ie Every time a new handler is registered the irq is moved to a different CPU. So I moved it under the if (!shared) check inside setup_irq().

The patch introduces a generic version of select_smp_affinity() that sets the affinity mask to online_cpus - isolated_cpus, and updates the x86_32 and alpha load balancers to ignore isolated cpus.

Booted on Core2 laptop and dual Opteron boxes with and w/o isolcpus= options and everything seems to work as expected. I wanted to run this by you before I include it in my patch series.

Thanx
Max

diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index facf82a..6b01702 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -51,7 +51,7 @@ select_smp_affinity(unsigned int irq)
 	if (!irq_desc[irq].chip->set_affinity || irq_user_affinity[irq])
 		return 1;
 
-	while (!cpu_possible(cpu))
+	while (!cpu_possible(cpu) || cpu_isolated(cpu))
 		cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0);
 	last_cpu = cpu;
 
diff --git a/arch/x86/kernel/genapic_flat_64.c b/arch/x86/kernel/genapic_flat_64.c
index e02e58c..07352b7 100644
--- a/arch/x86/kernel/genapic_flat_64.c
+++ b/arch/x86/kernel/genapic_flat_64.c
@@ -21,9 +21,7 @@
 static cpumask_t flat_target_cpus(void)
 {
-	cpumask_t target;
-	cpus_andnot(target, cpu_online_map, cpu_isolated_map);
-	return target;
+	return cpu_online_map;
 }
 
 static cpumask_t flat_vector_allocation_domain(int cpu)
diff --git a/arch/x86/kernel/io_apic_32.c b/arch/x86/kernel/io_apic_32.c
index 4ca5486..9c8816f 100644
--- a/arch/x86/kernel/io_apic_32.c
+++ b/arch/x86/kernel/io_apic_32.c
@@ -468,7 +468,7 @@ static void do_irq_balance(void)
 	for_each_possible_cpu(i) {
 		int package_index;
 		CPU_IRQ(i) = 0;
-		if (!cpu_online(i))
+		if (!cpu_online(i) || cpu_isolated(i))
 			continue;
 		package_index = CPU_TO_PACKAGEINDEX(i);
 		for (j = 0; j < NR_IRQS; j++) {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 176e5e7..287bc64 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -253,14 +253,7 @@ static inline void set_balance_irq_affinity(unsigned int irq, cpumask_t mask)
 }
 #endif
 
-#ifdef CONFIG_AUTO_IRQ_AFFINITY
 extern int select_smp_affinity(unsigned int irq);
-#else
-static inline int select_smp_affinity(unsigned int irq)
-{
-	return 1;
-}
-#endif
 
 extern int no_irq_affinity;
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 438a014..e74db94 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -376,6 +376,9 @@ int setup_irq(unsigned int irq, struct irqaction *new)
 		} else
 			/* Undo nested disables: */
 			desc->depth = 1;
+
+		/* Set default affinity mask once everything is setup */
+		select_smp_affinity(irq);
 	}
 	/* Reset broken irq detection when installing new handler */
 	desc->irq_count = 0;
@@ -488,6 +491,26 @@ void free_irq(unsigned int irq, void *dev_id)
 }
 EXPORT_SYMBOL(free_irq);
 
+#ifndef CONFIG_AUTO_IRQ_AFFINITY
+/**
+ * Generic version of the affinity autoselector.
+ * Called under desc->lock from setup_irq().
+ * btw Should we rename this to select_irq_affinity() ?
+ */
+int select_smp_affinity(unsigned int irq)
+{
+	cpumask_t usable_cpus;
+
+	if (!irq_can_set_affinity(irq))
+		return 0;
+
+	cpus_andnot(usable_cpus, cpu_online_map, cpu_isolated_map);
+	irq_desc[irq].affinity = usable_cpus;
+	irq_desc[irq].chip->set_affinity(irq, usable_cpus);
+	return 0;
+}
+#endif
+
 /**
  * request_irq - allocate an interrupt line
  * @irq: Interrupt line to allocate
@@ -555,8 +578,6 @@ int request_irq(unsigned int irq, irq_handler_t handler,
 	action->next = NULL;
 	action->dev_id = dev_id;
 
-	select_smp_affinity(irq);
-
 #ifdef CONFIG_DEBUG_SHIRQ
 	if (irqflags & IRQF_SHARED) {
 		/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull] CPU isolation extensions (updated)
Ingo Molnar wrote:
> * Max Krasnyansky <[EMAIL PROTECTED]> wrote:
>
>> Ingo said a few different things (a bit too large to quote).
> [...]
>> And at the end he said:
>>> Also, i'd not mind some test-coverage in sched.git as well.
>
>> As far as I know "do not mind" does not mean "must go to" ;-). [...]
>
> the CPU isolation related patches have typically flown through
> sched.git/sched-devel.git, so yes, you can take my "i'd not mind"
> comment as "i'd not mind it at all". That's the tree that all the folks
> who deal with this (such as Paul) are following. So let's go via the
> normal contribution cycle and let this trickle through with all the
> scheduler folks? I'd say 2.6.26 would be a tentative target, if it holds
> up to scrutiny in sched-devel.git (both testing and review wise). And
> because Andrew tracks sched-devel.git it will thus show up in -mm too.

Sounds good. Can you pull my tree then ? Or do you want me to resend the patches.

The tree is here:
	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
Take the for-linus branch. Or as I said please let me know and I'll resend the patches.

Thanx
Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Nick Piggin wrote:
> On Wednesday 13 February 2008 17:06, Max Krasnyansky wrote:
>> Nick Piggin wrote:
>>> But don't let me dissuade you from making these good improvements
>>> to Linux as well :) Just that it isn't really going to be hard-rt
>>> in general.
>> Actually that's the cool thing about CPU isolation. Get rid of all latency
>> sources from the CPU(s) and you get yourself as hard-RT as it gets.
>
> Hmm, maybe. Removing all sources of latency from the CPU kind of
> implies that you have to audit the whole kernel for sources of
> latency.

That's exactly where cpu isolation comes in. It makes sure that an isolated CPU is excluded from:
1. HW interrupts. This means no softirq, etc.
2. Things like workqueues, stop machine, etc. This typically means no timers, etc.
3. Scheduler load balancing (we had support for that for awhile now).

All that's left on that CPU is the scheduler tick and IPIs. And those are just fine. At that point it's up to the app to use or not to use kernel services. In other words no auditing is required. It's RT preempt that needs to audit in order to be general purpose RT.

>> I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec
>> worst case and ~200nsec average latency. I do not even need Adeos/Xenomai
>> or Preempt-RT, just a few very small patches. And it can be used for non RT
>> stuff too.
>
> OK, but you then are very restricted in what you can do, and easily
> can break it especially if you run any userspace on that CPU. If
> you just run a kernel module that, after setup, doesn't use any
> other kernel resources except interrupt handling, then you might be
> OK (depending on whether even interrupt handling can run into
> contended locks)...
>
> If you started doing very much more, then you can easily run into
> trouble.

Yes I'm definitely not selling it as general purpose. And no, it's not just kernel code, it's pure user-space code. Carefully designed user-space code that is. The model is pretty simple.
Let's say you have a dual cpu/core box. The app can be partitioned like this:
- CPU0 handles HW irqs, runs general services, etc and soft-RT threads
- CPU1 runs hard-RT threads or a special engine. For the description of the engine see
  http://marc.info/?l=linux-kernel&m=120232425515556&w=2

hard-RT threads do not need any system call besides pthread_mutex and signals (those are perfectly fine). They can use direct HW access (if needed). ie Memory mapping something and accessing it without the syscalls (see libe1000.sf.net for example). Communication between hard-RT and soft-RT threads is lock-less (single reader/single writer queues, etc).

It may sound fairly limited but you'd be surprised how much you can do. It's relatively easy to design the app that way once you get the hang of it :).

I'm working with our legal folks on releasing the user-space framework and aforementioned engine with a bunch of examples.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Nick Piggin wrote:
> On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
>> David Miller wrote:
>>> From: Nick Piggin <[EMAIL PROTECTED]>
>>> Date: Tue, 12 Feb 2008 17:41:21 +1100
>>>
>>>> stop machine is used for more than just module loading and unloading.
>>>> I don't think you can just disable it.
>>>
>>> Right, in particular it is used for CPU hotplug.
>>
>> Ooops. Totally missed that. And a bunch of other places.
>>
>> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
>> Documentation/cpu-hotplug.txt
>> arch/s390/kernel/kprobes.c
>> drivers/char/hw_random/intel-rng.c
>> include/linux/stop_machine.h
>> kernel/cpu.c
>> kernel/module.c
>> kernel/stop_machine.c
>> mm/page_alloc.c
>>
>> I wonder why I did not see any issues when I disabled stop machine
>> completely. I mentioned in the other thread that I commented out the part
>> that actually halts the machine and ran it for several hours on my dual
>> core laptop and on the quad core server. Tried all kinds of workloads,
>> which include constant module removal and insertion, and cpu hotplug as
>> well. It cannot be just luck :).
>
> It really is. With subtle races, it can take a lot more than a few
> hours. Consider that we have subtle races still in the kernel now,
> which are almost never or rarely hit in maybe 10,000 hours * every
> single person who has been using the current kernel for the past
> year.
>
> For a less theoretical example -- when I was writing the RCU radix
> tree code, I tried to run directed stress tests on a 64 CPU Altix
> machine (which found no bugs). Then I ran it on a dedicated test
> harness that could actually do a lot more than the existing kernel
> users are able to, and promptly found a couple more bugs (on a 2
> CPU system).
>
> But your primary defence against concurrency bugs _has_ to be
> knowing the code and all its interactions.

100% agree. btw For modules though it does not seem like luck (ie that it worked fine for me).
I mean subsystems are supposed to cleanly register/unregister anyway. But I can of course be wrong. We'll see what Rusty says.

>> Clearly though, you guys are right. It cannot be simply disabled. Based on
>> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
>> and the intel rng driver. Hopefully we can avoid it at least in module
>> insertion/removal.
>
> Yes, reducing the number of users by going through their code and
> showing that it is safe, is the right way to do this. Also, you
> could avoid module insertion/removal?

I could. But it'd be nice if I did not have to :)

> FWIW, I think the idea of trying to turn Linux into giving hard
> realtime guarantees is just insane. If that is what you want, you
> would IMO be much better off to spend effort with something like
> improving adeos and communication/administration between Linux and
> the hard-rt kernel.
>
> But don't let me dissuade you from making these good improvements
> to Linux as well :) Just that it isn't really going to be hard-rt
> in general.

Actually that's the cool thing about CPU isolation. Get rid of all latency sources from the CPU(s) and you get yourself as hard-RT as it gets. I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec worst case and ~200nsec average latency. I do not even need Adeos/Xenomai or Preempt-RT, just a few very small patches. And it can be used for non RT stuff too.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Steven Rostedt wrote:
> On Tue, 12 Feb 2008, Peter Zijlstra wrote:
>>> Rusty - Stop machine.
>>>    After doing a bunch of testing last three days I actually downgraded stop machine
>>>    changes from [highly experimental] to simply [experimental]. Please see this thread
>>>    for more info: http://marc.info/?l=linux-kernel&m=120243837206248&w=2
>>>    Short story is that I ran several insmod/rmmod workloads on live multi-core boxes
>>>    with stop machine _completely_ disabled and did not see any issues. Rusty did not get
>>>    a chance to reply yet, I'm hoping that we'll be able to make "stop machine" completely
>>>    optional for some configurations.
>
> This part really scares me. The comment that you say you have run several
> insmod/rmmod workloads without kstop_machine doesn't mean that it is still
> safe. A lot of races that things like this protect may only happen under
> load once a month. But the fact that it happens at all is reason to have
> the protection.
>
> Before taking out any protection, please analyze it in detail and report
> your findings why something is not needed. Not just some general hand
> waving and "it doesn't crash on my box".

Sure. I did not say lets disable it. I was hoping we could and I wanted to see what Rusty Russell has to say about this.

> Besides that, kstop_machine may be used by other features that can have an
> impact.

Yes it is. I missed a few. Nick and Dave already pointed out CPU hotplug. I looked around and found more users. So disabling stop machine completely is definitely out.

> Again, if you have a system that cant handle things like kstop_machine,
> than don't do things that require a kstop_machine run. All modules should
> be loaded, and no new modules should be added when the system is
> performing critical work. I see no reason for disabling kstop_machine.

I'm considering that option. So far it does not seem practical. At least the way we use those machines at this point.
If we can prove that at least not halting isolated CPUs is safe, that'd be better.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Peter Zijlstra wrote:
> On Mon, 2008-02-11 at 20:10 -0800, Max Krasnyansky wrote:
>> Andrew, looks like Linus decided not to pull this stuff.
>> Can we please put it into -mm then.
>>
>> My tree is here
>>	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
>> Please use 'master' branch (or 'for-linus' they are identical).
>
> I'm wondering why you insist on offering a git tree that bypasses the
> regular maintainers. Why not post the patches and walk the normal route?
>
> To me this feels rather aggressive, which makes me feel less inclined to
> look at it.

Peter, it may sound stupid but I'm honestly not sure what you mean. Please bear with me, I do not mean to sound arrogant. I'm looking for advice here. So here are some questions:

- First, who would the regular maintainer be in this case ? I felt that cpu isolation can just sit in its own tree since it does not seem to belong to any existing stuff. So far people suggested -mm and -sched. I do not think it has much to do with -sched. -mm seems more general purpose; since Linus did not pull it directly I asked Andrew to take this stuff into -mm. He was already ok with the patches when I sent the original pull request to Linus.

- Is it not easier for a regular maintainer (whoever it turns out to be in this case) to pull from GIT rather than use patches ? In any case I did post patches along with the pull request. So for example if Andrew prefers patches he could take those instead of the git. In fact if you look at my email I mentioned that if needed I can repost the patches.

- And last but not least I want to be able to just tell people who want to use CPU isolation "Go get this tree and use it". Git is the best for that.

I can see how the pull request to Linus may have been a bit aggressive. But then again I posted patches (_without_ pull request). Got feedback from You, Paul and a couple of other guys. Addressed/explained issues/questions. Posted patches again (_without_ pull request).
Got _zero_ replies even though folks who replied to the first patchset were replying to other things in the same timeframe. So I figured since I addressed everything you guys are happy, why not push it to Linus.

So what did I do wrong ?

Max

>> Diffstat:
>>  Documentation/ABI/testing/sysfs-devices-system-cpu |   41 +++
>>  Documentation/cpu-isolation.txt                    |  113 +
>>  arch/x86/Kconfig                                   |    1
>>  arch/x86/kernel/genapic_flat_64.c                  |    4
>>  drivers/base/cpu.c                                 |   48
>>  include/linux/cpumask.h                            |    3
>>  kernel/Kconfig.cpuisol                             |   42 +++
>>  kernel/Makefile                                    |    4
>>  kernel/cpu.c                                       |   54 ++
>>  kernel/sched.c                                     |   36 --
>>  kernel/stop_machine.c                              |    8 +
>>  kernel/workqueue.c                                 |   30 -
>>  12 files changed, 337 insertions(+), 47 deletions(-)
>>
>> This addresses all Andrew's comments for the last submission. Details here:
>>	http://marc.info/?l=linux-kernel&m=120236394012766&w=2
>>
>> There are no code changes since last time, besides minor fix for moving
>> on-stack array to __initdata as suggested by Andrew. Other stuff is just
>> documentation updates.
>>
>> List of commits
>>	cpuisol: Make cpu isolation configrable and export isolated map
>>	cpuisol: Do not route IRQs to the CPUs isolated at boot
>>	cpuisol: Do not schedule workqueues on the isolated CPUs
>>	cpuisol: Move on-stack array used for boot cmd parsing into __initdata
>>	cpuisol: Documentation updates
>>	cpuisol: Minor updates to the Kconfig options
>>	cpuisol: Do not halt isolated CPUs with Stop Machine
>>
>> As suggested by Ingo I'm CC'ing everyone who is even remotely
>> connected/affected ;-)

> You forgot Oleg, he does a lot of the workqueue work.
>
> I'm worried by your approach to never start any workqueue on these cpus.
> Like you said, it breaks Oprofile and others who depend on cpu local
> workqueues being present.
>
> Under normal circumstances these workqueues will not do any work,
> someone needs to provide work for them. That is, workqueues are passive.
>
> So I think your approach is the wrong way about.
Instead of taking the workqueue away, take away those that generate the work.

>> Ingo, Peter - Scheduler.
>>	There are _no_ changes in this area besides moving cpu_*_map maps from kernel/sched.c to kernel/cpu.c.
>> Ingo (and Thomas) do the genirq bits

> The IRQ isolation in concept isn't wrong. But it seems to me that
> arch/x86/kernel/genapic_flat_64.c isn't the best place to do this. It just
> considers one architecture, if you do this, please make it work across all.

>> Paul - Cpuset
>>	Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see
Re: [git pull for -mm] CPU isolation extensions (updated2)
David Miller wrote:
> From: Nick Piggin <[EMAIL PROTECTED]>
> Date: Tue, 12 Feb 2008 17:41:21 +1100
>
>> stop machine is used for more than just module loading and unloading.
>> I don't think you can just disable it.
>
> Right, in particular it is used for CPU hotplug.

Ooops. Totally missed that. And a bunch of other places.

[EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
Documentation/cpu-hotplug.txt
arch/s390/kernel/kprobes.c
drivers/char/hw_random/intel-rng.c
include/linux/stop_machine.h
kernel/cpu.c
kernel/module.c
kernel/stop_machine.c
mm/page_alloc.c

I wonder why I did not see any issues when I disabled stop machine completely. I mentioned in the other thread that I commented out the part that actually halts the machine and ran it for several hours on my dual core laptop and on the quad core server. Tried all kinds of workloads, which include constant module removal and insertion, and cpu hotplug as well. It cannot be just luck :).

Clearly though, you guys are right. It cannot be simply disabled. Based on the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390 and the intel rng driver. Hopefully we can avoid it at least in module insertion/removal.

Max
Re: [git pull for -mm] CPU isolation extensions (updated2)
Peter Zijlstra wrote: On Mon, 2008-02-11 at 20:10 -0800, Max Krasnyansky wrote: Andrew, looks like Linus decided not to pull this stuff. Can we please put it into -mm then. My tree is here git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git Please use 'master' branch (or 'for-linus' they are identical). I'm wondering why you insist on offering a git tree that bypasses the regular maintainers. Why not post the patches and walk the normal route? To me this feels rather aggressive, which makes me feel less inclined to look at it. Peter, it may sound stupid but I'm honestly not sure what you mean. Please bear with me I do not mean to sounds arrogant. I'm looking for advice here. So here are some questions: - First, who would the regular maintainer be in this case ? I felt that cpu isolation can just sit in its own tree since it does not seem to belong to any existing stuff. So far people suggested -mm and -shed. I do not think it has much to do much with the -sched. -mm seems more general purpose, since Linus did not pull it directly I asked Andrew to take this stuff into -mm. He was already ok with the patches when I sent original pull request to Linus. - Is it not easier for a regular maintainer (whoever it turns out to be in this case) to pull from GIT rather than use patches ? In any case I did post patches along with pull request. So for example if Andrew prefers patches he could take those instead of the git. In fact if you look at my email I mentioned that if needed I can repost the patches. - And last but not least I want to be able to just tell people who want to use CPU isolation Go get get this tree and use it. Git it the best for that. I can see how pull request to Linus may have been a bit aggressive. But then again I posted patches (_without_ pull request). Got feedback from You, Paul and couple of other guys. Addressed/explained issues/questions. Posted patches again (_without_ pull request). 
Got _zero_ replies even though folks who replied to the first patchset were replying to other things in the same timeframe. So I figured since I addressed everything you guys are happy, why not push it to Linus. So what did I do wrong ? Max Diffstat: Documentation/ABI/testing/sysfs-devices-system-cpu | 41 +++ Documentation/cpu-isolation.txt| 113 + arch/x86/Kconfig |1 arch/x86/kernel/genapic_flat_64.c |4 drivers/base/cpu.c | 48 include/linux/cpumask.h|3 kernel/Kconfig.cpuisol | 42 +++ kernel/Makefile|4 kernel/cpu.c | 54 ++ kernel/sched.c | 36 -- kernel/stop_machine.c |8 + kernel/workqueue.c | 30 - 12 files changed, 337 insertions(+), 47 deletions(-) This addresses all Andrew's comments for the last submission. Details here: http://marc.info/?l=linux-kernelm=120236394012766w=2 There are no code changes since last time, besides minor fix for moving on-stack array to __initdata as suggested by Andrew. Other stuff is just documentation updates. List of commits cpuisol: Make cpu isolation configrable and export isolated map cpuisol: Do not route IRQs to the CPUs isolated at boot cpuisol: Do not schedule workqueues on the isolated CPUs cpuisol: Move on-stack array used for boot cmd parsing into __initdata cpuisol: Documentation updates cpuisol: Minor updates to the Kconfig options cpuisol: Do not halt isolated CPUs with Stop Machine I suggested by Ingo I'm CC'ing everyone who is even remotely connected/affected ;-) You forgot Oleg, he does a lot of the workqueue work. I'm worried by your approach to never start any workqueue on these cpus. Like you said, it breaks Oprofile and others who depend on cpu local workqueues being present. Under normal circumstances these workqueues will not do any work, someone needs to provide work for them. That is, workqueues are passive. So I think your approach is the wrong way about. Instead of taking the workqueue away, take away those that generate the work. Ingo, Peter - Scheduler. 
There are _no_ changes in this area besides moving cpu_*_map maps from kerne/sched.c to kernel/cpu.c. Ingo (and Thomas) do the genirq bits The IRQ isolation in concept isn't wrong. But it seems to me that arch/x86/kernel/genapic_flat_64.c isn't the best place to do this. It just considers one architecture, if you do this, please make it work across all. Paul - Cpuset Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see
Re: [git pull for -mm] CPU isolation extensions (updated2)
David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Tue, 12 Feb 2008 17:41:21 +1100 stop machine is used for more than just module loading and unloading. I don't think you can just disable it. Right, in particular it is used for CPU hotplug. Ooops. Totally missed that. And a bunch of other places. [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run Documentation/cpu-hotplug.txt arch/s390/kernel/kprobes.c drivers/char/hw_random/intel-rng.c include/linux/stop_machine.h kernel/cpu.c kernel/module.c kernel/stop_machine.c mm/page_alloc.c I wonder why I did not see any issues when I disabled stop machine completely. I mentioned in the other thread that I commented out the part that actually halts the machine and ran it for several hours on my dual core laptop and on the quad core server. Tried all kinds of workloads, which include constant module removal and insertion, and cpu hotplug as well. It cannot be just luck :). Clearly though, you guys are right. It cannot be simply disabled. Based on the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390 and intel rng driver. Hopefully we can avoid it at least in module insertion/removal. Max -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull for -mm] CPU isolation extensions (updated2)
Steven Rostedt wrote: On Tue, 12 Feb 2008, Peter Zijlstra wrote: Rusty - Stop machine. After doing a bunch of testing last three days I actually downgraded stop machine changes from [highly experimental] to simply [experimental]. Pleas see this thread for more info: http://marc.info/?l=linux-kernelm=120243837206248w=2 Short story is that I ran several insmod/rmmod workloads on live multi-core boxes with stop machine _completely_ disabled and did no see any issues. Rusty did not get a chance to reply yet, I hopping that we'll be able to make stop machine completely optional for some configurations. This part really scares me. The comment that you say you have run several insmod/rmmod workloads without kstop_machine doesn't mean that it is still safe. A lot of races that things like this protect may only happen under load once a month. But the fact that it happens at all is reason to have the protection. Before taking out any protection, please analyze it in detail and report your findings why something is not needed. Not just some general hand waving and it doesn't crash on my box. Sure. I did not say lets disable it. I was hopping we could and I wanted to see what Rusty Russell has to say about this. Besides that, kstop_machine may be used by other features that can have an impact. Yes it is. I missed a few. Nick and Dave already pointed out CPU hotplug. I looked around and found more users. So disabling stop machine completely is definitely out. Again, if you have a system that cant handle things like kstop_machine, than don't do things that require a kstop_machine run. All modules should be loaded, and no new modules should be added when the system is performing critical work. I see no reason for disabling kstop_machine. I'm considering that option. So far it does not seem practical. At least the way we use those machines at this point. If we can prove that at least not halting isolation CPUs is safe that'd be better. 
Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull for -mm] CPU isolation extensions (updated2)
Nick Piggin wrote:
> On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
>> David Miller wrote:
>>> From: Nick Piggin [EMAIL PROTECTED]
>>> Date: Tue, 12 Feb 2008 17:41:21 +1100
>>>> stop machine is used for more than just module loading and unloading.
>>>> I don't think you can just disable it.
>>> Right, in particular it is used for CPU hotplug.
>> Ooops. Totally missed that. And a bunch of other places.
>>
>> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
>> Documentation/cpu-hotplug.txt
>> arch/s390/kernel/kprobes.c
>> drivers/char/hw_random/intel-rng.c
>> include/linux/stop_machine.h
>> kernel/cpu.c
>> kernel/module.c
>> kernel/stop_machine.c
>> mm/page_alloc.c
>>
>> I wonder why I did not see any issues when I disabled stop machine
>> completely. I mentioned in the other thread that I commented out the part
>> that actually halts the machine and ran it for several hours on my dual
>> core laptop and on the quad core server. Tried all kinds of workloads,
>> which include constant module removal and insertion, and cpu hotplug as
>> well. It cannot be just luck :).
> It really is. With subtle races, it can take a lot more than a few hours.
> Consider that we have subtle races still in the kernel now, which are almost
> never or rarely hit in maybe 10,000 hours * every single person who has been
> using the current kernel for the past year.
> For a less theoretical example -- when I was writing the RCU radix tree
> code, I tried to run directed stress tests on a 64 CPU Altix machine (which
> found no bugs). Then I ran it on a dedicated test harness that could
> actually do a lot more than the existing kernel users are able to, and
> promptly found a couple more bugs (on a 2 CPU system).
> But your primary defence against concurrency bugs _has_ to be knowing the
> code and all its interactions.

100% agree. btw For modules though it does not seem like luck (i.e. that it worked fine for me). I mean subsystems are supposed to cleanly register/unregister anyway. But I can of course be wrong. We'll see what Rusty says.

>> Clearly though, you guys are right. It cannot be simply disabled. Based on
>> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
>> and the intel rng driver. Hopefully we can avoid it at least in module
>> insertion/removal.
> Yes, reducing the number of users by going through their code and showing
> that it is safe, is the right way to do this. Also, you could avoid module
> insertion/removal?

I could. But it'd be nice if I did not have to :)

> FWIW, I think the idea of trying to turn Linux into giving hard realtime
> guarantees is just insane. If that is what you want, you would IMO be much
> better off to spend effort with something like improving adeos and
> communication/administration between Linux and the hard-rt kernel.
> But don't let me dissuade you from making these good improvements to Linux
> as well :) Just that it isn't really going to be hard-rt in general.

Actually that's the cool thing about CPU isolation. Get rid of all latency sources from the CPU(s) and you get yourself as hard-RT as it gets. I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec worst case and ~200 nsec average latency. I do not even need Adeos/Xenomai or Preempt-RT, just a few very small patches. And it can be used for non-RT stuff too.

Max
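Nick's "subtle races" point is easy to demonstrate outside the kernel. Here is a toy Python sketch (nothing to do with stop_machine itself, and the `time.sleep` is an artificial trick): a read-modify-write race whose window has been deliberately widened so the lost update happens every run, whereas with a realistic nanosecond-wide window a stress test could spin for days without hitting it.

```python
import threading
import time

counter = 0

def racy_increment():
    """Unlocked read-modify-write with an artificially widened race window."""
    global counter
    tmp = counter        # read
    time.sleep(0.2)      # both threads now hold the same stale value
    counter = tmp + 1    # write back -- one of the two updates is lost

threads = [threading.Thread(target=racy_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 1, not 2: the classic lost update
```

Shrink the sleep to zero and the same bug is still there, but you may never observe it under stress -- which is exactly why "it ran all day on my laptop" is weak evidence.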
[git pull for -mm] CPU isolation extensions (updated2)
Andrew, looks like Linus decided not to pull this stuff. Can we please put it into -mm then.

My tree is here
	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
Please use 'master' branch (or 'for-linus', they are identical). There are no changes since last time I sent it. Details below. Patches were sent out two days ago. I can resend them if needed.

Thanx
Max

Diffstat:
 Documentation/ABI/testing/sysfs-devices-system-cpu |  41 +++
 Documentation/cpu-isolation.txt                    | 113 +
 arch/x86/Kconfig                                   |   1
 arch/x86/kernel/genapic_flat_64.c                  |   4
 drivers/base/cpu.c                                 |  48
 include/linux/cpumask.h                            |   3
 kernel/Kconfig.cpuisol                             |  42 +++
 kernel/Makefile                                    |   4
 kernel/cpu.c                                       |  54 ++
 kernel/sched.c                                     |  36 --
 kernel/stop_machine.c                              |   8 +
 kernel/workqueue.c                                 |  30 -
 12 files changed, 337 insertions(+), 47 deletions(-)

This addresses all Andrew's comments for the last submission. Details here:
http://marc.info/?l=linux-kernel&m=120236394012766&w=2
There are no code changes since last time, besides a minor fix for moving an on-stack array to __initdata as suggested by Andrew. Other stuff is just documentation updates.

List of commits:
 cpuisol: Make cpu isolation configrable and export isolated map
 cpuisol: Do not route IRQs to the CPUs isolated at boot
 cpuisol: Do not schedule workqueues on the isolated CPUs
 cpuisol: Move on-stack array used for boot cmd parsing into __initdata
 cpuisol: Documentation updates
 cpuisol: Minor updates to the Kconfig options
 cpuisol: Do not halt isolated CPUs with Stop Machine

As suggested by Ingo I'm CC'ing everyone who is even remotely connected/affected ;-)

Ingo, Peter - Scheduler.
There are _no_ changes in this area besides moving cpu_*_map maps from kernel/sched.c to kernel/cpu.c.

Paul - Cpuset.
Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see this thread:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Rusty - Stop machine.
After doing a bunch of testing the last three days I actually downgraded the stop machine changes from [highly experimental] to simply [experimental]. Please see this thread for more info:
http://marc.info/?l=linux-kernel&m=120243837206248&w=2
Short story is that I ran several insmod/rmmod workloads on live multi-core boxes with stop machine _completely_ disabled and did not see any issues. Rusty did not get a chance to reply yet; I'm hoping that we'll be able to make "stop machine" completely optional for some configurations.

Greg - ABI documentation.
Nothing interesting here. I simply added Documentation/ABI/testing/sysfs-devices-system-cpu and documented some of the attributes exposed in there. Suggested by Andrew.

I believe this is ready for inclusion and my impression is that Andrew is ok with that. Most changes are very simple and do not affect existing behavior. As I mentioned before, I've been using the Workqueue and StopMachine changes in production for a couple of years now and have high confidence in them. Yet they are marked as experimental for now, just to be safe. My original explanation is included below.

btw I'll be out skiing/snowboarding for the next 4 days and will have sporadic email access. Will do my best to address questions/concerns (if any) during that time.

Thanx
Max
--
This patch series extends CPU isolation support. Yes, most people want to virtualize CPUs these days and I want to isolate them :).
The primary idea here is to be able to use some CPU cores as dedicated engines for running user-space code with minimal kernel overhead/intervention, think of it as an SPE in the Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. I'm personally using this for hard realtime purposes.
With CPU isolation it's very easy to achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load. I'm working with legal folks on releasing a hard RT user-space framework for that. I believe with the current multi-core CPU trend we will see more and more applications that explore this capability: RT gaming engines, simulators, hard RT apps, etc. Hence the
Re: [git pull] CPU isolation extensions (updated)
Paul Jackson wrote:
> Max wrote:
>> Linus, please pull CPU isolation extensions from
> Did I miss something in this discussion? I thought Ingo was quite clear,
> and Linus pretty clear too, that this patch should bake in *-mm or some
> such place for a bit first.

Andrew said:
> The feature as a whole seems useful, and I don't actually oppose the merge
> based on what I see here. As long as you're really sure that cpusets are
> inappropriate (and bear in mind that Paul has a track record of being wrong
> on this :)). But I see a few glitches

As far as I can understand Andrew is ok with the merge. And I addressed all his comments.

Linus said:
> Have these been in -mm and widely discussed etc? I'd like to start more
> carefully, and (a) have that controversial last patch not merged initially
> and (b) make sure everybody is on the same page wrt this all..

As far as I can understand Linus _asked_ whether it was in -mm or not and whether everybody's on the same page. He did not say "this must be in -mm first". I explained that it has not been in -mm, and who it was discussed with, and did a bunch more testing/investigation on the controversial patch and explained why I think it's not that controversial any more.

Ingo said a few different things (a bit too large to quote).
- That it was not discussed. I explained that it was in fact discussed and provided a bunch of pointers to the mail threads.
- That he thinks that cpuset is the way to do it. Again I explained why it's not.
And at the end he said:
> Also, i'd not mind some test-coverage in sched.git as well.
As far as I know "do not mind" does not mean "must go to" ;-). Also I replied that I did not mind either but I do not think that it has much (if anything) to do with the scheduler.

Anyway. I think I mentioned that I did not mind -mm either. I think it's ready for the mainline. But if people still strongly feel that it has to be in -mm, that's fine. Let's just do s/Linus/Andrew/ on the first line and move on.
But if Linus pulls it now, even better ;-)
Andrew, Linus, I'll let you guys decide which tree it needs to go into.

Max
[git pull] CPU isolation extensions (updated)
Linus, please pull CPU isolation extensions from
	git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus

Diffstat:
 Documentation/ABI/testing/sysfs-devices-system-cpu |  41 +++
 Documentation/cpu-isolation.txt                    | 113 +
 arch/x86/Kconfig                                   |   1
 arch/x86/kernel/genapic_flat_64.c                  |   4
 drivers/base/cpu.c                                 |  48
 include/linux/cpumask.h                            |   3
 kernel/Kconfig.cpuisol                             |  42 +++
 kernel/Makefile                                    |   4
 kernel/cpu.c                                       |  54 ++
 kernel/sched.c                                     |  36 --
 kernel/stop_machine.c                              |   8 +
 kernel/workqueue.c                                 |  30 -
 12 files changed, 337 insertions(+), 47 deletions(-)

This addresses all Andrew's comments for the last submission. Details here:
http://marc.info/?l=linux-kernel&m=120236394012766&w=2
There are no code changes since last time, besides a minor fix for moving an on-stack array to __initdata as suggested by Andrew. Other stuff is just documentation updates.

List of commits:
 cpuisol: Make cpu isolation configrable and export isolated map
 cpuisol: Do not route IRQs to the CPUs isolated at boot
 cpuisol: Do not schedule workqueues on the isolated CPUs
 cpuisol: Move on-stack array used for boot cmd parsing into __initdata
 cpuisol: Documentation updates
 cpuisol: Minor updates to the Kconfig options
 cpuisol: Do not halt isolated CPUs with Stop Machine

As suggested by Ingo I'm CC'ing everyone who is even remotely connected/affected ;-)

Ingo, Peter - Scheduler.
There are _no_ changes in this area besides moving cpu_*_map maps from kernel/sched.c to kernel/cpu.c.

Paul - Cpuset.
Again there are _no_ changes in this area. For reasons why cpuset is not the right mechanism for cpu isolation see this thread:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Rusty - Stop machine.
After doing a bunch of testing the last three days I actually downgraded the stop machine changes from [highly experimental] to simply [experimental].
Please see this thread for more info:
http://marc.info/?l=linux-kernel&m=120243837206248&w=2
Short story is that I ran several insmod/rmmod workloads on live multi-core boxes with stop machine _completely_ disabled and did not see any issues. Rusty did not get a chance to reply yet; I'm hoping that we'll be able to make "stop machine" completely optional for some configurations.

Greg - ABI documentation.
Nothing interesting here. I simply added Documentation/ABI/testing/sysfs-devices-system-cpu and documented some of the attributes exposed in there. Suggested by Andrew.

I believe this is ready for inclusion and my impression is that Andrew is ok with that. Most changes are very simple and do not affect existing behavior. As I mentioned before, I've been using the Workqueue and StopMachine changes in production for a couple of years now and have high confidence in them. Yet they are marked as experimental for now, just to be safe. My original explanation is included below.

btw I'll be out skiing/snowboarding for the next 4 days and will have sporadic email access. Will do my best to address questions/concerns (if any) during that time.

Thanx
Max
--
This patch series extends CPU isolation support. Yes, most people want to virtualize CPUs these days and I want to isolate them :).
The primary idea here is to be able to use some CPU cores as dedicated engines for running user-space code with minimal kernel overhead/intervention, think of it as an SPE in the Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load.
I'm working with legal folks on releasing a hard RT user-space framework for that. I believe with the current multi-core CPU trend we will see more and more applications that explore this capability: RT gaming engines, simulators, hard RT apps, etc. Hence the proposal is to extend the current CPU isolation feature.
The new definition of the CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).
2. By default interrupts
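Point 1 -- explicit binding -- is plain CPU affinity from the application's side; nothing below is specific to the isolation patches. A minimal user-space sketch using Python's os.sched_setaffinity (a wrapper for the sched_setaffinity(2) syscall; the target CPU is just whichever one we are allowed to run on, standing in for an isolated core):

```python
import os

def bind_to_cpu(cpu):
    """Pin the calling process to a single CPU, the way one would pin an
    RT worker thread to an isolated core."""
    os.sched_setaffinity(0, {cpu})      # pid 0 == the calling process
    return os.sched_getaffinity(0)      # read back the effective mask

# Pick some CPU we are currently allowed to run on and pin to it.
target = min(os.sched_getaffinity(0))
print(bind_to_cpu(target))
```

A real RT application would additionally request SCHED_FIFO via sched_setscheduler(2), which requires privileges; the affinity call alone works unprivileged.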
Module loading/unloading and "The Stop Machine"
Hi Rusty,

I was hoping you could answer a couple of questions about module loading/unloading and the stop machine.
There was a recent discussion on LKML about CPU isolation patches I'm working on. One of the patches makes stop machine ignore the isolated CPUs. People of course had questions about that. So I started looking into more details and got this silly, crazy idea that maybe we do not need the stop machine any more :)

As far as I can tell the stop machine is basically a safety net in case some locking and refcounting mechanisms aren't bulletproof. In other words, if a subsystem can actually handle registration/unregistration in a robust way, the module loader/unloader does not necessarily have to halt the entire machine in order to load/unload a module that belongs to that subsystem. I may of course be completely wrong on that.

The problem with the stop machine is that it's a very very big gun :). In the sense that it totally kills all the latencies and stuff, since the entire machine gets halted while a module is being (un)loaded. Which is a major issue for any realtime apps. Specifically for CPU isolation the issue is that a high-priority RT user-space thread prevents the stop machine threads from running and the entire box just hangs waiting for it.
I'm kind of surprised that folks who use monster boxes with over 100 CPUs have not complained. It must be a huge hit for those machines to halt the entire thing.

It seems that over the last few years most subsystems got much better at locking and refcounting. And I'm hoping that we can avoid halting the entire machine these days. For CPU isolation in particular the solution is simple: we can just ignore isolated CPUs. What I'm trying to figure out is how safe it is and whether we can avoid the full halt altogether. So.
Here is what I tried today on my Core2 Duo laptop:

> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -204,11 +204,14 @@ int stop_machine_run(int (*fn)(void *), void *data, unsigned int cpu)
>
>  	/* No CPUs can come up or down during this. */
>  	lock_cpu_hotplug();
> +/*
>  	p = __stop_machine_run(fn, data, cpu);
>  	if (!IS_ERR(p))
>  		ret = kthread_stop(p);
>  	else
>  		ret = PTR_ERR(p);
> +*/
> +	ret = fn(data);
>  	unlock_cpu_hotplug();
>
>  	return ret;

i.e. completely disabled stop machine. It just loads/unloads modules without the full halt. I then ran three scripts:

	while true; do
		/sbin/modprobe -r uhci_hcd
		/sbin/modprobe uhci_hcd
		sleep 10
	done

	while true; do
		/sbin/modprobe -r tg3
		/sbin/modprobe tg3
		sleep 2
	done

	while true; do
		/usr/sbin/tcpdump -i eth0
	done

The machine has a bunch of USB devices connected to it. The two most interesting are a Bluetooth dongle and a USB mouse. By loading/unloading the UHCI driver we're touching sysfs, the USB stack, the Bluetooth stack, the HID layer, and the input layer. X is running and is using that USB mouse. The Bluetooth services are running too.
By loading/unloading the TG3 driver we're touching sysfs and the network stack (a bunch of layers). The machine is running NetworkManager and tcpdumping on eth0, which is registered by TG3.

This is a pretty good stress test in general, let alone with stop machine disabled. I left all that running for the whole day while doing normal day-to-day things. Compiling a bunch of things, emails, office apps, etc. That's where I'm writing this email from :). It's still running all that :)

So the question is: do we still need stop machine? I must be missing something obvious. But things seem to be working pretty well without it. I certainly feel much better about at least ignoring isolated CPUs during stop machine execution. Which btw I've been doing for a couple of years now on a wide range of machines where people are inserting modules left and right.
What do you think?
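For readers who haven't looked at kernel/stop_machine.c: the mechanism is a rendezvous -- one thread per CPU, all meeting at the same point, one of them doing the work while the rest spin. That idea can be modeled in user space; this is only a toy Python sketch of the concept (threads standing in for CPUs, an Event standing in for the spin loop), not the kernel implementation:

```python
import threading

def stop_machine_run(fn, ncpus=4):
    """Toy model of the stop_machine rendezvous: all 'CPUs' meet at a
    barrier, then one runs fn while the rest stay parked until it is done."""
    barrier = threading.Barrier(ncpus)
    done = threading.Event()
    result = []

    def cpu(idx):
        barrier.wait()            # the 'machine' is now halted in lockstep
        if idx == 0:
            result.append(fn())   # the designated CPU does the atomic work
            done.set()
        else:
            done.wait()           # the others do nothing until it completes

    threads = [threading.Thread(target=cpu, args=(i,)) for i in range(ncpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

print(stop_machine_run(lambda: "module loaded"))
```

The model also shows the isolation problem discussed above: in the kernel the per-CPU stop threads run at maximum RT priority, so a user-space SCHED_FIFO hog on one CPU keeps its thread from ever reaching the barrier, and every other CPU waits forever.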
Thanx
Max
Re: [git pull] CPU isolation extensions
Hi Ingo,

Thanks for your reply.

> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>>> Linus, please pull CPU isolation extensions from
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git
>>> for-linus
>> Have these been in -mm and widely discussed etc? I'd like to start
>> more carefully, and (a) have that controversial last patch not merged
>> initially and (b) make sure everybody is on the same page wrt this
>> all..
>
> no, they have not been under nearly enough testing and review - these
> patches surfaced on lkml for the first time one week ago (!).

Almost two weeks actually. Ok, 1.8 :)

> I find the pull request totally premature, this stuff has not been
> discussed and agreed on _at all_.

Ingo, I may have the wrong impression, but my impression is that you ignored all the other emails and just read Linus' reply. I do not believe this accusation is valid. I apologize if my impression is incorrect.
Since the patches _do not_ change/affect existing scheduler/cpuset functionality, I did not know who to CC in the first email that I sent. Luckily Peter picked it up and CC'ed a bunch of folks, including Paul, Steven and you. All of them replied and had questions/concerns. As I mentioned before, I believe I addressed all of them.

> None of the people who maintain and have interest in this code and
> participated in the (short) one-week discussion were Cc:-ed to the pull
> request.

Ok. I did not realize I was supposed to do that. Since I got no replies to the second round of patches (take 2), which again was CC'ed to the same people that Peter CC'ed, I assumed that people were ok with it. That's what the discussion on the first take ended with.
> I think these patches also need a buy-in from Peter Zijlstra and Paul
> Jackson (or really good reasoning why any objections from them should
> be overridden) - all of whom deal with the code affected by these changes
> on a daily basis and have an interest in CPU isolation features.

See above. The following issues were raised:
1. Peter and Steven initially thought that workqueue isolation is not needed.
2. Paul thought that it should be implemented on top of cpusets.
3. Peter thought that the stopmachine change is not safe.
There were a couple of other minor misunderstandings (for example Peter thought that I'm completely disallowing IRQs on isolated CPUs, which is obviously not the case). I clarified all of them.

#1 I explained in the original thread and then followed up with a concrete code example of why it is needed:
http://marc.info/?l=linux-kernel&m=120217173001671&w=2
Got no replies so far. So I'm assuming folks are happy.

#2 I started a separate thread on that:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
The conclusion was, well, let me just quote exactly what Paul had said:

> Paul Jackson wrote:
>> Max wrote:
>>> Looks like I failed to explain what I'm trying to achieve. So let me try
>>> again.
>> Well done. I read through that, expecting to disagree or at least
>> to not understand at some point, and got all the way through nodding
>> my head in agreement. Good.
>> Whether the earlier confusions were lack of clarity in the presentation,
>> or lack of competence in my brain ... well guess I don't want to ask that
>> question ;).

And #3: Peter did not agree with me but said that it's up to Linus or Andrew to decide whether it's appropriate in mainline or not. I _clearly_ indicated that this part is somewhat controversial and maybe dangerous; I'm _not_ trying to sneak something in. Andrew picked it up and I'm going to do some more investigation on whether it's really not safe or is actually fine (about to send an email to Rusty).
> Generally i think that cpusets is actually the feature and API that
> should be used (and extended) for CPU isolation - and we already
> extended it recently in the direction of CPU isolation. Most enterprise
> distros have cpusets enabled so it's in use. Also, cpusets has the
> appeal of being commonly used in the "big honking boxes" arena, so
> reusing the same concept for RT and virtualization stuff would be the
> natural approach. It already ties in to the scheduler domains code
> dynamically and is flexible and scalable. I resisted ad-hoc CPU
> isolation patches in -rt for that reason.

That's exactly what Paul proposed initially. I completely disagree with that, but I did look at it in _detail_. Please take a look here for a detailed explanation:
http://marc.info/?l=linux-kernel=120180692331461=2
This email is getting too long and I did not want to inline everything.

> Also, i'd not mind some test-coverage in sched.git as well.

I believe it has _nothin
Re: [E1000-devel] e1000 1sec latency problem
Kok, Auke wrote:
> Max Krasnyansky wrote:
>> Kok, Auke wrote:
>>> Max Krasnyansky wrote:
>>>> Kok, Auke wrote:
>>>>> Max Krasnyansky wrote:
>>>>>> So you don't think it's related to the interrupt coalescing by any chance ?
>>>>>> I'd suggest to try and disable the coalescing and see if it makes any difference.
>>>>>> We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.
>>>>>>
>>>>>> Add this to modprobe.conf and reload e1000 module
>>>>>>
>>>>>> options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0
>>>>>
>>>>> that can't be the problem. irq moderation would only account for 2-3ms
>>>>> variance maximum.
>>>>
>>>> Oh, I've definitely seen worse than that. Not as bad as 1 second though.
>>>> Plus you're talking about the case when coalescing logic is working as
>>>> designed ;-). What if there is some kind of bug where timer did not
>>>> expire or something.
>>>
>>> we don't use a software timer in e1000 irq coalescing/moderation, it's all in
>>> hardware, so we don't have that problem at all. And I certainly have never seen
>>> anything you are referring to with e1000 hardware, and I do not know of any bug
>>> related to this.
>>>
>>> are you maybe confused with other hardware ?
>>>
>>> feel free to demonstrate an example...
>>
>> Just to give you a background. I wrote and maintain http://libe1000.sf.net
>> So I know E1000 HW and SW in and out.
>
> wow, even I do not dare to say that!

Ok, maybe that was a bit of an overstatement :).

>> And no I'm not confused with other HW and I know that we're not using SW
>> timers for the coalescing. HW can be buggy as well. Note that I'm not saying
>> that I know for sure that the problem is coalescing, I'm just suggesting to
>> take it out of the equation while Pavel is investigating.
>> Unfortunately I cannot demonstrate an example but I've seen unexplained
>> packet delays in the range of 1-20 milliseconds on E1000 HW (and boy ...
>> I do have a lot of it in my labs). Once coalescing was disabled those
>> problems have gone away.
>
> this sounds like you have some sort of PCI POST-ing problem and those can
> indeed be worse if you use any form of interrupt coalescing. In any case
> that is largely irrelevant to the in-kernel drivers, and as I said we
> definately have no open issues on that right now, and I really do not
> recollect any as well either (other than the issue of interference when
> both ends are irq coalescing)

I was actually talking about in-kernel drivers, i.e. we were seeing delays with TIPC running over the in-kernel E1000 driver. And no, it was not a TIPC issue; everything worked fine over TG3, and the issues went away when coalescing was disabled. Anyway, I think we can drop this subject.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [E1000-devel] e1000 1sec latency problem
Kok, Auke wrote:
> Max Krasnyansky wrote:
>> Kok, Auke wrote:
>>> Max Krasnyansky wrote:
>>>> So you don't think it's related to the interrupt coalescing by any chance ?
>>>> I'd suggest to try and disable the coalescing and see if it makes any difference.
>>>> We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.
>>>>
>>>> Add this to modprobe.conf and reload e1000 module
>>>>
>>>> options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0
>>>
>>> that can't be the problem. irq moderation would only account for 2-3ms
>>> variance maximum.
>>
>> Oh, I've definitely seen worse than that. Not as bad as 1 second though.
>> Plus you're talking about the case when coalescing logic is working as
>> designed ;-). What if there is some kind of bug where timer did not
>> expire or something.
>
> we don't use a software timer in e1000 irq coalescing/moderation, it's all in
> hardware, so we don't have that problem at all. And I certainly have never seen
> anything you are referring to with e1000 hardware, and I do not know of any bug
> related to this.
>
> are you maybe confused with other hardware ?
>
> feel free to demonstrate an example...

Just to give you a background: I wrote and maintain http://libe1000.sf.net so I know E1000 HW and SW in and out. And no, I'm not confused with other HW, and I know that we're not using SW timers for the coalescing. HW can be buggy as well. Note that I'm not saying that I know for sure that the problem is coalescing; I'm just suggesting to take it out of the equation while Pavel is investigating.

Unfortunately I cannot demonstrate an example, but I've seen unexplained packet delays in the range of 1-20 milliseconds on E1000 HW (and boy ... I do have a lot of it in my labs). Once coalescing was disabled those problems have gone away.
Max
Re: [git pull] CPU isolation extensions
Paul Jackson wrote:
> Max - Andrew wondered if the rt tree had seen the
> code or commented it on it. What became of that?

I just replied to Andrew. It's not an RT feature per se. And yes, Peter CC'ed RT folks. You probably did not get a chance to read all the replies. They had some questions/concerns and stuff. I believe I answered/clarified all of them.

> My two cents isn't worth a plug nickel here, but
> I'm inclined to nod in agreement when Linus wants
> to see these patches get some more exposure before
> going into Linus's tree. ... what's the hurry?

No hurry, I guess. I did mention in the introductory email that I've been maintaining this stuff for a while now. SLAB patches used to be messy; with the new SLUB the mess goes away. CFS handles CPU hotplug much better than O(1), and cpu hotplug is needed to be able to change the isolated bit from sysfs. That's why I think it's a good time to merge. I don't mind of course if we put this stuff in -mm first. Although the first part of the patchset (ie exporting the isolated map, sysfs interface, etc) seems very simple and totally not controversial. The stop machine patch is really the only thing that may look suspicious.

Max
Re: [git pull] CPU isolation extensions
Andrew Morton wrote:
> On Thu, 7 Feb 2008 01:59:54 -0600 Paul Jackson <[EMAIL PROTECTED]> wrote:
>
>> but hard real time is not my expertise
>
> Speaking of which.. there is the -rt tree. Have those people had a look
> at the feature, perhaps played with the code?

Peter Z. and Steven R. sent me some comments; I believe I explained and addressed them. Ingo's been quiet. Probably too busy.

btw It's not an RT feature per se. It certainly helps RT by removing all the latency sources from isolated CPUs. But in general it's just a "reducing kernel overhead on some CPUs" kind of feature.

Max
Re: [E1000-devel] e1000 1sec latency problem
Kok, Auke wrote:
> Max Krasnyansky wrote:
>> So you don't think it's related to the interrupt coalescing by any chance ?
>> I'd suggest to try and disable the coalescing and see if it makes any difference.
>> We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.
>>
>> Add this to modprobe.conf and reload e1000 module
>>
>> options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0
>
> that can't be the problem. irq moderation would only account for 2-3ms
> variance maximum.

Oh, I've definitely seen worse than that. Not as bad as 1 second though. Plus you're talking about the case when the coalescing logic is working as designed ;-). What if there is some kind of bug where the timer did not expire or something?

Max
Re: [git pull] CPU isolation extensions
Paul Jackson wrote:
> Andrew wrote:
>> (and bear in mind that Paul has a track record of being wrong
>> on this :))
>
> heh - I saw that.
>
> Max - Andrew's about right, as usual. You answered my initial
> questions on this patch set adequately, but hard real time is
> not my expertise, so in the final analysis, other than my saying
> I don't have any more objections, my input doesn't mean much
> either way.

I honestly think this one is a no-brainer and I do not think this one will hurt Paul's track record :). Paul initially disagreed with me and that's when he was wrong ;-))

Andrew, I looked at this in detail, and here is an explanation that I sent to Paul a few days ago (a bit shortened/updated version).

I thought some more about your proposal to use the sched_load_balance flag in cpusets instead of extending cpu_isolated_map. I looked at the cpusets and cgroups code, and here are my thoughts on this. Here is the list of issues with the sched_load_balance flag from a CPU isolation perspective:

--
(1) Boot time isolation is not possible. There is currently no way to set up a cpuset at boot time. For example, we won't be able to isolate cpus from irqs and workqueues at boot. Not a major issue, but still an inconvenience.

--
(2) There is currently no easy way to figure out what cpuset a cpu belongs to in order to query its sched_load_balance flag. In order to do that we need a method that iterates all active cpusets and checks their cpus_allowed masks. This implies holding cgroup and cpuset mutexes. It's not clear whether it's ok to do that from the contexts CPU isolation happens in (apic, sched, workqueue). It seems that the cgroup/cpuset api is designed for top-down access, ie adding a cpu to a set and then recomputing domains. Which makes perfect sense for the common cpuset use case but is not what cpu isolation needs. In other words, I think it's much simpler and cleaner to use the cpu_isolated_map for isolation purposes. No locks, no races, etc.
--
(3) cpusets are a bit too dynamic :). What I mean by this is that the sched_load_balance flag can be changed at any time without bringing a CPU offline. What that means is that we'll need some notifier mechanisms for killing and restarting workqueue threads when that flag changes. Also, we'd need some logic that makes sure that a user does not disable load balancing on all cpus, because that effectively will kill workqueues on all the cpus. This particular case is already handled very nicely in my patches. The isolated bit can be set only when a cpu is offline, and it cannot be set on the first online cpu. Workqueues and other subsystems already handle cpu hotplug events nicely and can easily ignore isolated cpus when they come online.

--
#1 is probably unfixable. #2 and #3 can be fixed, but at the expense of extra complexity across the board. I seriously doubt that I'll be able to push that through the reviews ;-).

Also, personally I still think cpusets and cpu isolation attack two different problems. cpusets is about partitioning cpus and memory nodes, and managing tasks. Most of the cgroups/cpuset APIs are designed to deal with tasks. CPU isolation is much simpler and is at the lower layer. It deals with IRQs, kernel per-cpu threads, etc. The only intersection I see is that both features affect scheduling domains. CPU isolation is again simple here: it uses existing logic in sched.c and does not change anything in this area.

-
Andrew, hopefully that clarifies it. Let me know if you're not convinced.

Max
Re: [git pull] CPU isolation extensions
Hi Linus,

Linus Torvalds wrote:
> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>> Linus, please pull CPU isolation extensions from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>
> Have these been in -mm and widely discussed etc? I'd like to start more
> carefully, and (a) have that controversial last patch not merged initially
> and (b) make sure everybody is on the same page wrt this all..

They've been discussed with RT/scheduler/cpuset folks. Andrew is definitely in the loop. He just replied and asked for some fixes and clarifications. He seems to be ok with merging this in general.

The last patch may not be as bad as I originally thought. We'll discuss it some more with Andrew. I'll also check with Rusty, who wrote the stopmachine in the first place. It actually seems like an overkill at this point. My impression is that it was supposed to be a safety net in case some refcounting/locking is not fully safe, and it may not be needed or as critical anymore. I may be wrong of course. So I'll find that out :)

Thanx
Max
Re: [git pull] CPU isolation extensions
Andrew Morton wrote:
> On Wed, 06 Feb 2008 21:32:55 -0800 Max Krasnyansky <[EMAIL PROTECTED]> wrote:
>
>> Linus, please pull CPU isolation extensions from
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>
> The feature as a whole seems useful, and I don't actually oppose the merge
> based on what I see here.

Awesome :). I think it'll get more and more useful as people start trying to figure out what the heck they're supposed to do with the spare CPU cores. I mean, pretty soon most machines will have 4 cores and some will have 8. One way to use those cores is the "dedicated engine" model.

> As long as you're really sure that cpusets are
> inappropriate (and bear in mind that Paul has a track record of being wrong
> on this :)).

I'll cover this in a separate email with more details.

> But I see a few glitches

Good catches. Thanks for reviewing.

> - There are two separate and identical implementations of
>   cpu_unusable(cpu). Please do it once, in a header, preferably with C
>   function, not macros.

Those are local versions that depend on whether a feature is enabled or not. If CONFIG_CPUISOL_WORKQUEUE is disabled we want cpu_unusable() in workqueue.c to be a noop, and if it's enabled that macro resolves to cpu_isolated(). Same thing for stopmachine.c: if CONFIG_CPUISOL_STOPMACHINE is disabled, cpu_unusable() is a noop. In other words, cpu_isolated() is the one common macro that a subsystem may want to stub out. Do you see another way of doing this?

> - The Kconfig help is a bit scraggly:
>
> +config CPUISOL_STOPMACHINE
> +	bool "Do not halt isolated CPUs with Stop Machine (HIGHLY EXPERIMENTAL)"
> +	depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL
> +	help
> +	  If this option is enabled kernel will not halt isolated CPUs when Stop Machine
>
> "the kernel"
>
> text is too wide
>
> +	  is triggered.
> +	  Stop Machine is currently only used by the module insertion and removal logic.
> +	  Please note that at this point this feature is highly experimental and maybe
> +	  dangerous. It is not known to really brake anything but can potentially
> +	  introduce an instability.
>
> s/maybe/may be/
> s/brake/break/

Man, the typos are killing me :). Will fix.

> Neither this text, nor the changelog nor the code comments tell us what the
> potential instability with stopmachine *is*? Or maybe I missed it.

That's the thing, we don't really know :). In real life it does not seem to be a problem at all. As I mentioned in previous emails, we've been running all kinds of machines with this enabled, and inserting all kinds of modules left and right. Never seen any crashes or anything. But the fact that stopmachine is supposed to halt all cpus during module insertion/removal seems to imply that something bad may happen if some cpus are not halted. It may very well turn out that it's no longer needed because our locking and refcounting handles this just fine. I mean, ideally we should not have to halt the entire box; it causes terrible latencies.

> - Adding new sysfs files without updating Documentation/ABI/ makes Greg cry.

Oh, did not know that. Will fix.

> - Why is cpu_isolated_map exported to modules? Just for api consistency, it
>   appears?

Yes. For consistency. We'd want cpu_isolated() to work everywhere.

> pre-existing problems:
>
> - isolated_cpu_setup() has an on-stack array of NR_CPUS integers. This
>   will consume 4k of stack on ia64 (at least). We'll just squeak through
>   for a little while, but this needs to be fixed. Just move it into
>   __initdata.

Will do.

> - isolated_cpu_setup() expects that the user can provide an up-to-1024
>   character kernel boot parameter. Is this reasonable given cpu command
>   line limits, and given that NR_CPUS will surely grow beyond 1024 in the
>   future?

I'm thinking that is reasonable for now. I'll fix and resend the patches asap.
Thanx
Max
Re: e1000 1sec latency problem
Pavel Machek wrote:
> Hi!
>
> I have the famous e1000 latency problems:
>
> 64 bytes from 195.113.31.123: icmp_seq=68 ttl=56 time=351.9 ms
> 64 bytes from 195.113.31.123: icmp_seq=69 ttl=56 time=209.2 ms
> 64 bytes from 195.113.31.123: icmp_seq=70 ttl=56 time=1004.1 ms
> 64 bytes from 195.113.31.123: icmp_seq=71 ttl=56 time=308.9 ms
> 64 bytes from 195.113.31.123: icmp_seq=72 ttl=56 time=305.4 ms
> 64 bytes from 195.113.31.123: icmp_seq=73 ttl=56 time=9.8 ms
> 64 bytes from 195.113.31.123: icmp_seq=74 ttl=56 time=3.7 ms
>
> ...and they are still there in 2.6.25-git0. I had ethernet EEPROM
> checksum problems, which I fixed by the update, but problems are not
> gone.
>
> irqpoll helps.
>
> nosmp (which implies XT-PIC is being used) does not help.
>
> 16: 1925 0 IO-APIC-fasteoi ahci, yenta, uhci_hcd:usb2, eth0
>
> Booting kernel with nosmp / no yenta, no usb does not help.
>
> Hmm, as expected, interrupt load on ahci (find /) makes latencies go
> away.
>
> It should be easily reproducible on x60 with latest bios, it is 100%
> reproducible for me...

So you don't think it's related to the interrupt coalescing by any chance? I'd suggest trying to disable the coalescing and seeing if it makes any difference. We've had lots of issues with coalescing misbehavior. Not this bad (ie 1 second) though.

Add this to modprobe.conf and reload the e1000 module:

options e1000 RxIntDelay=0,0 RxAbsIntDelay=0,0 InterruptThrottleRate=0,0 TxIntDelay=0,0 TxAbsIntDelay=0,0

Max
Hi Ingo,

Thanks for your reply.

> * Linus Torvalds [EMAIL PROTECTED] wrote:
>> On Wed, 6 Feb 2008, Max Krasnyansky wrote:
>>> Linus, please pull CPU isolation extensions from
>>> git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus
>>
>> Have these been in -mm and widely discussed etc? I'd like to start more
>> carefully, and (a) have that controversial last patch not merged initially
>> and (b) make sure everybody is on the same page wrt this all..
>
> no, they have not been under nearly enough testing and review - these
> patches surfaced on lkml for the first time one week ago (!).

Almost two weeks actually. Ok 1.8 :)

> I find the pull request totally premature, this stuff has not been
> discussed and agreed on _at all_.

Ingo, I may have the wrong impression but my impression is that you ignored all the other emails and just read Linus' reply. I do not believe this accusation is valid. I apologize if my impression is incorrect. Since the patches _do not_ change/affect existing scheduler/cpuset functionality I did not know who to CC in the first email that I sent. Luckily Peter picked it up and CC'ed a bunch of folks, including Paul, Steven and you. All of them replied and had questions/concerns. As I mentioned before, I believe I addressed all of them.

> None of the people who maintain and have interest in this code and
> participated in the (short) one-week discussion were Cc:-ed to the pull
> request.

Ok. I did not realize I was supposed to do that. Since I got no replies to the second round of patches (take 2), which again was CC'ed to the same people that Peter CC'ed, I assumed that people were ok with it. That's what the discussion on the first take ended with.

> I think these patches also need a buy-in from Peter Zijlstra and Paul
> Jackson (or really good reasoning why any objections from them should be
> overridden) - all of whom deal with the code affected by these changes on
> a daily basis and have an interest in CPU isolation features.

See above. The following issues were raised:
1.
Peter and Steven initially thought that workqueue isolation is not needed.
2. Paul thought that it should be implemented on top of cpusets.
3. Peter thought that the stopmachine change is not safe.
There were a couple of other minor misunderstandings (for example Peter thought that I'm completely disallowing IRQs on isolated CPUs, which is obviously not the case). I clarified all of them.

#1 I explained in the original thread and then followed up with a concrete code example of why it is needed:
http://marc.info/?l=linux-kernel&m=120217173001671&w=2
Got no replies so far. So I'm assuming folks are happy.

#2 I started a separate thread on that:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
The conclusion was, well, let me just quote exactly what Paul had said:

Paul Jackson wrote:
> Max wrote:
>> Looks like I failed to explain what I'm trying to achieve. So let me try again.
>
> Well done. I read through that, expecting to disagree or at least to not
> understand at some point, and got all the way through nodding my head in
> agreement. Good. Whether the earlier confusions were lack of clarity in the
> presentation, or lack of competence in my brain ... well guess I don't want
> to ask that question ;).

And #3, Peter did not agree with me but said that it's up to Linus or Andrew to decide whether it's appropriate in mainline or not. I _clearly_ indicated that this part is somewhat controversial and maybe dangerous; I'm _not_ trying to sneak something in. Andrew picked it up and I'm going to do some more investigation on whether it's really not safe or is actually fine (about to send an email to Rusty).

> Generally i think that cpusets is actually the feature and API that should
> be used (and extended) for CPU isolation - and we already extended it
> recently in the direction of CPU isolation. Most enterprise distros have
> cpusets enabled so it's in use.
> Also, cpusets has the appeal of being commonly used in the big honking
> boxes arena, so reusing the same concept for RT and virtualization stuff
> would be the natural approach. It already ties in to the scheduler domains
> code dynamically and is flexible and scalable. I resisted ad-hoc CPU
> isolation patches in -rt for that reason.

That's exactly what Paul proposed initially. I completely disagree with that but I did look at it in _detail_. Please take a look here for a detailed explanation:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
This email is getting too long and I did not want to inline everything.

> Also, i'd not mind some test-coverage in sched.git as well.

I believe it has _nothing_ to do with the scheduler but I do not mind it being in that tree. Please read this email on why it has nothing to do with the scheduler:
http://marc.info/?l=linux-kernel&m=120210515323578&w=2
That's the email that convinced Paul.

To sum it up: it has been discussed with the right people. I do
Module loading/unloading and The Stop Machine
Hi Rusty,

I was hoping you could answer a couple of questions about module loading/unloading and the stop machine. There was a recent discussion on LKML about CPU isolation patches I'm working on. One of the patches makes stop machine ignore the isolated CPUs. People of course had questions about that. So I started looking into more details and got this silly, crazy idea that maybe we do not need the stop machine any more :)

As far as I can tell the stop machine is basically a safety net in case some locking and refcounting mechanisms aren't bullet proof. In other words, if a subsystem can actually handle registration/unregistration in a robust way, the module loader/unloader does not necessarily have to halt the entire machine in order to load/unload a module that belongs to that subsystem. I may of course be completely wrong on that.

The problem with the stop machine is that it's a very very big gun :). In the sense that it totally kills all the latencies, since the entire machine gets halted while a module is being (un)loaded. Which is a major issue for any realtime apps. Specifically for CPU isolation the issue is that a high-priority RT user-space thread prevents the stop machine threads from running and the entire box just hangs waiting for it. I'm kind of surprised that folks who use monster boxes with over 100 CPUs have not complained. It must be a huge hit for those machines to halt the entire thing.

It seems that over the last few years most subsystems got much better at locking and refcounting. And I'm hoping that we can avoid halting the entire machine these days. For CPU isolation in particular the solution is simple: we can just ignore isolated CPUs. What I'm trying to figure out is how safe it is and whether we can avoid the full halt altogether. So.
Here is what I tried today on my Core2 Duo laptop:

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -204,11 +204,14 @@ int stop_machine_run(int (*fn)(void *), void *data, unsigned int cpu)
 	/* No CPUs can come up or down during this. */
 	lock_cpu_hotplug();
+/*
 	p = __stop_machine_run(fn, data, cpu);
 	if (!IS_ERR(p))
 		ret = kthread_stop(p);
 	else
 		ret = PTR_ERR(p);
+*/
+	ret = fn(data);
 	unlock_cpu_hotplug();
 	return ret;

ie I completely disabled the stop machine. It just loads/unloads modules without the full halt. I then ran three scripts:

while true; do
	/sbin/modprobe -r uhci_hcd
	/sbin/modprobe uhci_hcd
	sleep 10
done

while true; do
	/sbin/modprobe -r tg3
	/sbin/modprobe tg3
	sleep 2
done

while true; do
	/usr/sbin/tcpdump -i eth0
done

The machine has a bunch of USB devices connected to it. The two most interesting are a Bluetooth dongle and a USB mouse. By loading/unloading the UHCI driver we're touching sysfs, the USB stack, the Bluetooth stack, the HID layer and the input layer. X is running and is using that USB mouse. The Bluetooth services are running too. By loading/unloading the TG3 driver we're touching sysfs and the network stack (a bunch of layers). The machine is running NetworkManager and tcpdumping on eth0, which is registered by TG3. This is a pretty good stress test in general, let alone with the stop machine disabled. I left all that running for the whole day while doing normal day to day things. Compiling a bunch of things, emails, office apps, etc. That's where I'm writing this email from :). It's still running all that :)

So the question is: do we still need the stop machine? I must be missing something obvious. But things seem to be working pretty well without it. I certainly feel much better about at least ignoring isolated CPUs during stop machine execution. Which btw I've been doing for a couple of years now on a wide range of machines where people are inserting modules left and right.

What do you think?
Thanx
Max
[git pull] CPU isolation extensions
Linus, please pull CPU isolation extensions from

git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git for-linus

Diffstat:

 b/arch/x86/Kconfig                  |  1
 b/arch/x86/kernel/genapic_flat_64.c |  5 ++-
 b/drivers/base/cpu.c                | 48 +++
 b/include/linux/cpumask.h           |  3 ++
 b/kernel/Kconfig.cpuisol            | 15 +++
 b/kernel/Makefile                   |  4 +-
 b/kernel/cpu.c                      | 49
 b/kernel/sched.c                    | 37 ---
 b/kernel/stop_machine.c             |  9 +-
 b/kernel/workqueue.c                | 31 --
 kernel/Kconfig.cpuisol              | 26 ++-
 11 files changed, 176 insertions(+), 52 deletions(-)

The patchset consists of 4 patches:

 cpuisol: Make cpu isolation configurable and export isolated map
 cpuisol: Do not route IRQs to the CPUs isolated at boot
 cpuisol: Do not schedule workqueues on the isolated CPUs
 cpuisol: Do not halt isolated CPUs with Stop Machine

The first two are very simple. They simply make "CPU isolation" a configurable feature, export cpu_isolated_map and provide some helper functions to access it (just like cpu_online() and friends). The last two patches add support for isolating CPUs from running workqueues and the stop machine. The last patch is kind of controversial; let me know if you think it's too ugly and I'll resend without it. For more details see below.

This patch series extends CPU isolation support. Yes, most people want to virtualize CPUs these days and I want to isolate them :) . The primary idea here is to be able to use some CPU cores as dedicated engines for running user-space code with minimal kernel overhead/intervention; think of it as an SPE in the Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. I'm personally using this for hard realtime purposes.
With CPU isolation it's very easy to achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load. I'm working with legal folks on releasing a hard RT user-space framework for that. I believe with the current multi-core CPU trend we will see more and more applications that exploit this capability: RT gaming engines, simulators, hard RT apps, etc.

Hence the proposal is to extend the current CPU isolation feature. The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).
2. By default interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) to those CPUs explicitly.
3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible.
   Includes workqueues, per CPU threads, etc.
This feature is configurable and is disabled by default.
---
I've been maintaining this stuff since around 2.6.18 and it's been running in a production environment for a couple of years now. It's been tested on all kinds of machines, from NUMA boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110. The messiest part used to be the SLAB garbage collector changes. With the new SLUB all that mess goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1) did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs). So this seems like a good time to merge.

We've had scheduler support for CPU isolation ever since the O(1) scheduler went in. In other words #1 is already supported. These patches do not change/affect that functionality in any way. #2 is a trivial one-liner change to the IRQ init code. #3 is addressed by a couple of separate patches.
The main problem here is that an RT thread can prevent kernel threads from running and the machine gets stuck because the other CPUs are waiting for those threads to run and report back.

Folks involved in the scheduler/cpuset development provided a lot of feedback on the first series of patches. I believe I managed to explain and clarify every aspect. Paul Jackson initially suggested implementing #2 and #3 using the cpusets subsystem. Paul and I looked at it more closely and determined that exporting cpu_isolated_map instead is a better option.

The last patch to the stop machine is potentially unsafe and is marked as highly experimental. Unfortunately it's currently the only option that allows dynamic module insertion/removal for the above scenarios. If people still feel that it's too ugly I can revert that change and keep it in the separate tree
Re: CPU hotplug and IRQ affinity with 2.6.24-rt1
Daniel Walker wrote:
> On Mon, Feb 04, 2008 at 03:35:13PM -0800, Max Krasnyanskiy wrote:
>> This is just an FYI. As part of the "Isolated CPU extensions" thread Daniel
>> suggested that I check out the latest RT kernels. So I did, or at least
>> tried to, and immediately spotted a couple of issues.
>>
>> The machine I'm running it on is:
>> HP xw9300, Dual Opteron, NUMA
>>
>> It looks like with the -rt kernel IRQ affinity masks are ignored on that
>> system. ie I write 1 to, let's say, /proc/irq/23/smp_affinity but the
>> interrupts keep coming to CPU1. Vanilla 2.6.24 does not have that issue.
>
> I tried this, and it works according to /proc/interrupts .. Are you
> looking at the interrupt threads affinity ?

Nope. I'm looking at /proc/interrupts. ie The interrupt count keeps incrementing for cpu1 even though the affinity mask is set to 1. The IRQ thread affinity was btw set to 3, which is probably wrong. To clarify, by default after reboot:
- IRQ affinity is set to 3, IRQ thread affinity is set to 3
- User writes 1 into /proc/irq/N/smp_affinity
- IRQ affinity is now set to 1, IRQ thread affinity is still set to 3
It'd still work I guess but does not seem right. Ideally the IRQ thread affinity should have changed as well. We could of course just have some user-space tool that adjusts both.

Looks like Greg already replied to the cpu hotplug issue. For me it did not oops. It just got stuck, probably because it could not move an IRQ due to the broken IRQ affinity logic.

Max
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max K wrote:
>>> And for another thing, we already declare externs in cpumask.h for
>>> the other, more widely used, cpu_*_map variables cpu_possible_map,
>>> cpu_online_map, and cpu_present_map.
>> Well, to address #2 and #3 the isolated map will need to be exported as well.
>> Those other maps do not really have much to do with the scheduler code.
>> That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place
>> for them.
>
> Well, if you need it to be exported for #2 or #3, then that's ok
> by me - export it.
>
> I'm unaware of any kernel/cpumask.c. If you meant lib/cpumask.c, then
> I'd prefer you not put it there, as lib/cpumask.c just contains the
> implementation details of the abstract data type cpumask_t, not any of
> its uses. If you mean kernel/cpuset.c, then that's not a good choice
> either, as that just contains the implementation details of the cpuset
> subsystem. You should usually define such things in one of the files
> using it, and unless there is clearly a -better- place to move the
> definition, it's usually better to just leave it where it is.

I was thinking of creating a new file, kernel/cpumask.c. But it probably does not make sense just for the masks. I'm now thinking kernel/cpu.c is the best place for it. It contains all the cpu hotplug logic that deals with those maps, and at the very top it has stuff like

	/* Serializes the updates to cpu_online_map, cpu_present_map */
	static DEFINE_MUTEX(cpu_add_remove_lock);

So it seems to make sense to keep the maps in there.

Max
Re: [CPUISOL] CPU isolation extensions
Hi Daniel,

Sorry for not replying right away.

Daniel Walker wrote:
> On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:
>
>> Not accurate enough and way too much overhead for what I need. I know at
>> this point it probably sounds like I'm talking BS :). I wish I'd released
>> the engine and examples by now. Anyway let me just say that the SW MAC has
>> crazy tight deadlines with lots of small tasks. Using nanosleep() &
>> gettimeofday() is simply not practical. So it's all TSC based with clever
>> time sync logic between HW and SW.
>
> I don't know if it's BS or not, you clearly fixed your own problem which
> is good .. Although when you say "RT patches cannot achieve what I
> needed. Even RTAI/Xenomai can't do that." , and HRT is "Not accurate
> enough and way too much overhead" .. Given the hardware you're using,
> that's all difficult to believe.. You also said this code has been
> running on production systems for two years, which means it's at least
> two years old .. There's been some good sized leaps in real time linux
> in the past two years ..

I've actually been tracking the RT patches fairly closely. I can't say I tried all of them but I do try them from time to time. I just got the latest 2.6.24-rt1 running on an HP xw9300. Looks like it does not handle CPU hotplug very well; I managed to kill it by bringing cpu 1 off-line. So I cannot run any tests right now, will run some tomorrow.

For now let me mention that I have a simple test that sleeps for a millisecond and then does some bitbanging for 200 usec. It measures jitter caused by the periodic scheduler tick, IPIs and other kernel activities. With high-res timers disabled, on most of the machines I mentioned before it shows around 1-1.2 usec worst case. With high-res timers enabled it shows 5-6 usec. This is with 2.6.24 running on an isolated CPU. Forget about using a user-space timer (nanosleep(), etc). Even the scheduler tick itself is fairly heavy.
A gettimeofday() call on that machine takes on average 2-3 usec (it's not a vsyscall) and the SW MAC is all about precise timing. That's why I said that it's not practical for me to use that stuff. I do not see anything in the -rt kernel that would improve this.

This is btw not to say that the -rt kernel is not useful for my app in general. We have a bunch of soft-RT threads that talk to the MAC thread. Those would definitely benefit. I think cpu isolation + -rt would work beautifully for wireless basestations.

Max
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max wrote:
>> Paul, I actually mentioned at the beginning of my email that I did read that
>> thread started by Peter. I did learn quite a bit from it :)
>
> Ah - sorry - I missed that part. However, I'm still getting the feeling
> that there were some key points in that thread that we have not managed
> to communicate successfully.

I think you are assuming that I only need to deal with the RT scheduler and scheduler domains, which is not correct. See below.

>> Sounds like at this point we're in agreement that sched_load_balance is not
>> suitable for what I'd like to achieve.
>
> I don't think we're in agreement; I think we're in confusion ;)

Yeah. I don't believe I'm the confused side though ;-)

> Yes, sched_load_balance does not *directly* have anything to do with this.
>
> But indirectly it is a critical element in what I think you'd like to
> achieve. It affects how the cpuset code sets up sched_domains, and
> if I understand correctly, you require either (1) some sched_domains to
> only contain RT tasks, or (2) some CPUs to be in no sched_domain at all.
>
> Proper configuration of the cpuset hierarchy, including the setting of
> the per-cpuset sched_load_balance flag, can provide either of these
> sched_domain partitions, as desired.

Again you're assuming that scheduling domain partitioning satisfies my requirements or addresses my use case. It does not. See below for more details.

>> But how about making cpusets aware of the cpu_isolated_map ?
>
> No. That's confusing cpusets and the scheduler again.
>
> The cpu_isolated_map is a file static variable known only within
> the kernel/sched.c file; this should not change.

I completely disagree. In fact I think all the cpu_xxx_map (online, present, isolated) variables do not belong in the scheduler code. I'm thinking of submitting a patch that factors them out into kernel/cpumask.c. We already have cpumask.h.
> Presently, the boot parameter isolcpus= is just used to initialize
> what CPUs are isolated at boot, and then the sched_domain partitioning,
> as done in kernel/sched.c:partition_sched_domains() (the hook into
> the sched code that cpusets uses) determines which CPUs are isolated
> from that point forward. I doubt that this should change either.

Sure, I did not even touch that part. I just proposed to extend the meaning of the 'isolated' bit.

> In that thread referenced above, did you see the part where RT is
> achieved not by isolating CPUs from any scheduler, but rather by
> polymorphically having several schedulers available to operate on each
> sched_domain, and having RT threads self-select the RT scheduler?

Absolutely, yes, I saw that part. But it has nothing to do with my use case. Looks like I failed to explain what I'm trying to achieve. So let me try again.

I'd like to be able to run a CPU intensive (100%) RT task on one of the processors without adversely affecting or being affected by the other system activities. System activities here include _kernel_ activities as well. Hence the proposal is to extend the current CPU isolation feature. The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).
2. By default interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) explicitly.
3. In general kernel subsystems must avoid activity on the isolated CPU(s) as much as possible.
   Includes workqueues, per CPU threads, etc.
This feature is configurable and is disabled by default.
---
#1 affects the scheduler and scheduler domains. It's already supported either by using the isolcpus= boot option or by setting "sched_load_balance" in cpusets. I'm totally happy with the current behavior and my original patch did not mess with this functionality in any way.
#2 and #3 have _nothing_ to do with the scheduler or scheduler domains. I've been trying to explain that for a few days now ;-). When you saw my patches for #2 and #3 you told me that you'd be interested to see them implemented on top of the "sched_load_balance" flag. Here is your original reply:
http://marc.info/?l=linux-kernel&m=120153260217699&w=2
So I looked into that and provided an explanation of why it would not work, or would work but would add lots of complexity (access to internal cpuset structures, locking, etc). My email on that is here:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2
Now, I felt from the beginning that cpusets is not the right mechanism to address #2 and #3. The best mechanism IMO is to simply provide access to the cpu_isolated_map to the rest of the kernel. Again, the fact that cpu_isolated_map currently lives in the scheduler code does not change anything here because, as I explained, I'm proposing to extend the meaning of "CPU isolation". I provided dynamic access to the "isolated" bit only for
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max wrote:
>> Paul, I actually mentioned at the beginning of my email that I did read
>> that thread started by Peter. I did learn quite a bit from it :)
>
> Ah - sorry - I missed that part. However, I'm still getting the feeling
> that there were some key points in that thread that we have not managed
> to communicate successfully.

I think you are assuming that I only need to deal with the RT scheduler
and scheduler domains, which is not correct. See below.

>> Sounds like at this point we're in agreement that sched_load_balance
>> is not suitable for what I'd like to achieve.
>
> I don't think we're in agreement; I think we're in confusion ;)

Yeah. I don't believe I'm the confused side though ;-)

> Yes, sched_load_balance does not *directly* have anything to do with
> this. But indirectly it is a critical element in what I think you'd
> like to achieve. It affects how the cpuset code sets up sched_domains,
> and if I understand correctly, you require either (1) some
> sched_domains to only contain RT tasks, or (2) some CPUs to be in no
> sched_domain at all. Proper configuration of the cpuset hierarchy,
> including the setting of the per-cpuset sched_load_balance flag, can
> provide either of these sched_domain partitions, as desired.

Again you're assuming that scheduling domain partitioning satisfies my
requirements or addresses my use case. It does not. See below for more
details.

>> But how about making cpusets aware of the cpu_isolated_map ?
>
> No. That's confusing cpusets and the scheduler again. The
> cpu_isolated_map is a file static variable known only within the
> kernel/sched.c file; this should not change.

I completely disagree. In fact I think all the cpu_xxx_map (online,
present, isolated) variables do not belong in the scheduler code. I'm
thinking of submitting a patch that factors them out into
kernel/cpumask.c. We already have cpumask.h.
> Presently, the boot parameter isolcpus= is just used to initialize
> what CPUs are isolated at boot, and then the sched_domain
> partitioning, as done in kernel/sched.c:partition_sched_domains() (the
> hook into the sched code that cpusets uses) determines which CPUs are
> isolated from that point forward. I doubt that this should change
> either.

Sure, I did not even touch that part. I just proposed to extend the
meaning of the 'isolated' bit.

> In that thread referenced above, did you see the part where RT is
> achieved not by isolating CPUs from any scheduler, but rather by
> polymorphically having several schedulers available to operate on each
> sched_domain, and having RT threads self-select the RT scheduler?

Absolutely, I saw that part. But it has nothing to do with my use case.
Looks like I failed to explain what I'm trying to achieve, so let me try
again. I'd like to be able to run a CPU-intensive (100%) RT task on one
of the processors without adversely affecting, or being affected by, the
other system activities. System activities here include _kernel_
activities as well. Hence the proposal is to extend the current CPU
isolation feature. The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).

2. By default, interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) explicitly.

3. In general, kernel subsystems must avoid activity on the isolated
   CPU(s) as much as possible. This includes workqueues, per-CPU
   threads, etc. This feature is configurable and is disabled by
   default.
---
#1 affects the scheduler and scheduler domains. It's already supported,
either by using the isolcpus= boot option or by setting
"sched_load_balance" in cpusets. I'm totally happy with the current
behavior, and my original patch did not mess with this functionality in
any way.

#2 and #3 have _nothing_ to do with the scheduler or scheduler domains.
I've been trying to explain that for a few days now ;-). When you saw my
patches for #2 and #3 you told me that you'd be interested to see them
implemented on top of the "sched_load_balance" flag. Here is your
original reply:
   http://marc.info/?l=linux-kernel&m=120153260217699&w=2
So I looked into that and provided an explanation of why it would not
work, or would work but would add lots of complexity (access to internal
cpuset structures, locking, etc). My email on that is here:
   http://marc.info/?l=linux-kernel&m=120180692331461&w=2
Now, I felt from the beginning that cpusets is not the right mechanism
to address #2 and #3. The best mechanism IMO is to simply provide access
to the cpu_isolated_map to the rest of the kernel. Again, the fact that
cpu_isolated_map currently lives in the scheduler code does not change
anything here because, as I explained, I'm proposing to extend the
meaning of "CPU isolation". I provided dynamic access to the "isolated"
bit only for convenience; it does _not_ change existing
Re: [CPUISOL] CPU isolation extensions
Hi Daniel,

Sorry for not replying right away.

Daniel Walker wrote:
> On Mon, 2008-01-28 at 16:12 -0800, Max Krasnyanskiy wrote:
>> Not accurate enough and way too much overhead for what I need. I know
>> at this point it probably sounds like I'm talking BS :). I wish I'd
>> released the engine and examples by now. Anyway let me just say that
>> SW MAC has crazy tight deadlines with lots of small tasks. Using
>> nanosleep() & gettimeofday() is simply not practical. So it's all TSC
>> based with clever time sync logic between HW and SW.
>
> I don't know if it's BS or not, you clearly fixed your own problem
> which is good .. Although when you say "RT patches cannot achieve what
> I needed. Even RTAI/Xenomai can't do that.", and HRT is "Not accurate
> enough and way too much overhead" .. Given the hardware you're using,
> that's all difficult to believe .. You also said this code has been
> running on production systems for two years, which means it's at least
> two years old .. There's been some good sized leaps in real time linux
> in the past two years ..

I've actually been tracking the RT patches fairly closely. I can't say
I've tried all of them, but I do try them from time to time. I just got
the latest 2.6.24-rt1 running on an HP xw9300. Looks like it does not
handle CPU hotplug very well; I managed to kill it by bringing CPU 1
off-line. So I cannot run any tests right now; I will run some tomorrow.
For now let me mention that I have a simple test that sleeps for a
millisecond and then does some bitbanging for 200 usec. It measures
jitter caused by the periodic scheduler tick, IPIs and other kernel
activities. With high-res timers disabled, on most of the machines I
mentioned before it shows around 1-1.2 usec worst case. With high-res
timers enabled it shows 5-6 usec. This is with 2.6.24 running on an
isolated CPU. Forget about using a user-space timer (nanosleep(), etc);
even the scheduler tick itself is fairly heavy. A gettimeofday() call on
that machine takes on average 2-3 usec (it's not a vsyscall), and SW MAC
is all about precise timing. That's why I said that it's not practical
for me to use that stuff. I do not see anything in the -rt kernel that
would improve this. This is btw not to say that the -rt kernel is not
useful for my app in general. We have a bunch of soft-RT threads that
talk to the MAC thread; those would definitely benefit. I think CPU
isolation + -rt would work beautifully for wireless basestations.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]
Paul Jackson wrote:
> Max wrote:
>> Here is the list of issues with the sched_load_balance flag from a CPU
>> isolation perspective:
>
> A separate thread happened to start up on lkml.org, shortly after
> yours, that went into this in considerable detail.
>
> For example, the interaction of cpusets, sched_load_balance,
> sched_domains and real time scheduling is examined in some detail on
> this thread. Everyone participating on that thread learned something
> (we all came into it with less than a full picture of what's there.)
>
> I would encourage you to read it closely. For example, the scheduler
> code should not be trying to access per-cpuset attributes such as
> the sched_load_balance flag (you are correct that this would be
> difficult to do because of the locking; however by design, that is
> not to be done.)
>
> This thread begins at:
>
>     scheduler scalability - cgroups, cpusets and load-balancing
>     http://lkml.org/lkml/2008/1/29/60
>
> Too bad we didn't think to include you in the CC list of that
> thread from the beginning.

Paul, I actually mentioned at the beginning of my email that I did read
that thread started by Peter. I did learn quite a bit from it :) You
guys did not discuss isolation stuff though. The thread was only about
scheduling, and my CPU isolation extension patches deal with other
aspects.

Sounds like at this point we're in agreement that sched_load_balance is
not suitable for what I'd like to achieve. But how about making cpusets
aware of the cpu_isolated_map? Even without my patches it's somewhat of
an issue right now. I mean, if you use the isolcpus= boot option to put
CPUs into the null domain, cpusets will not be aware of it. The result
may be a bit confusing if an isolated CPU is added to some cpuset.

Max
Re: Strange freezes (seems like SATA related)
Robert Hancock wrote:
> Can you post the full dmesg output? What kind of drive is this?

Sorry for the delay. I'm on vacation and have sporadic email access.
Full dmesg is pretty long; here is the SATA-related section.

sata_nv :00:07.0: version 3.4
ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 23
ACPI: PCI Interrupt :00:07.0[A] -> Link [LSA0] -> GSI 23 (level, high) -> IRQ 23
sata_nv :00:07.0: Using ADMA mode
PCI: Setting latency timer of device :00:07.0 to 64
scsi0 : sata_nv
scsi1 : sata_nv
ata1: SATA max UDMA/133 cmd 0xc2a16480 ctl 0xc2a164a0 bmdma 0x000158b0 irq 23
ata2: SATA max UDMA/133 cmd 0xc2a16580 ctl 0xc2a165a0 bmdma 0x000158b8 irq 23
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: SAMSUNG HD080HJ, WT100-33, max UDMA/100
ata1.00: 156301488 sectors, multi 16: LBA48
ata1.00: configured for UDMA/100
ata2: SATA link down (SStatus 0 SControl 300)
scsi 0:0:0:0: Direct-Access ATA SAMSUNG HD080HJ WT10 PQ: 0 ANSI: 5
ata1: bounce limit 0x, segment boundary 0x, hw segs 61
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 22
ACPI: PCI Interrupt :00:08.0[A] -> Link [LSA1] -> GSI 22 (level, high) -> IRQ 22
sata_nv :00:08.0: Using ADMA mode

Max
Re: Strange freezes (seems like SATA related)
Andrew Morton wrote:
> On Mon, 29 Oct 2007 09:54:27 -0700
> Max Krasnyansky <[EMAIL PROTECTED]> wrote:
>
>> A couple of HP xw9300 machines (dual Opterons) started freezing up.
>> We're running 2.6.22.1 on them. The freezes are somewhat weird: the
>> VGA console is alive (I can switch vts, etc) but everything else is
>> dead (network, etc). Unfortunately SYSRQ was not enabled and I could
>> not get backtraces and stuff.
>>
>> Hooked up a serial console and the only error that shows up is this.
>>
>> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
>> status 0x1540 next cpb count 0x0 next cpb idx 0x0
>> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Descriptor sense data with sense descriptors (in hex):
>> end_request: I/O error, dev sda, sector 8388695
>> Buffer I/O error on device sda1, logical block 1048579
>> lost page write due to I/O error on sda1
>> sd 0:0:0:0: [sda] Write Protect is off
>>
>> I see a bunch of those and then the box just sits there spewing this
>> periodically
>>
>> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
>> status 0x1540 next cpb count 0x0 next cpb idx 0x0
>> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
>> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>
>> SMART selftest on the drive passed without errors.
>>
>> Here is what this machine looks like
>>
>> ...
>
> So this happens on more than one machine?

Yep.

> The kernel shouldn't freeze, so even if both machines have magically
> identical hardware faults, there's a kernel bug there somewhere.
>
> I guess it would be useful to test a 2.6.23 kernel if poss. We've seen
> a very large number of reports like this one in recent months (many of
> which have not been responded to, btw) and perhaps someone has done
> something about them.

I may not be able to run an identical workload on 2.6.23. Will try to
give it a shot sometime next week. Also, I upgraded to 2.6.22.10 last
week. There are a few fixes in there that may potentially affect those
boxes.

Max
Strange freezes (seems like SATA related)
A couple of HP xw9300 machines (dual Opterons) started freezing up.
We're running 2.6.22.1 on them. The freezes are somewhat weird: the VGA
console is alive (I can switch vts, etc) but everything else is dead
(network, etc). Unfortunately SYSRQ was not enabled and I could not get
backtraces and stuff.

Hooked up a serial console, and the only error that shows up is this:

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
status 0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Descriptor sense data with sense descriptors (in hex):
end_request: I/O error, dev sda, sector 8388695
Buffer I/O error on device sda1, logical block 1048579
lost page write due to I/O error on sda1
sd 0:0:0:0: [sda] Write Protect is off

I see a bunch of those, and then the box just sits there spewing this
periodically:

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000
status 0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

SMART selftest on the drive passed without errors.

Here is what this machine looks like:

00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
05:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
0a:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
61:04.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:09.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
62:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
63:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
81:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

As I mentioned: dual Opteron, NUMA. Nothing fancy in the kernel config.
Any ideas what the problem might be?

Max
Re: TUN/TAP driver - MAINTAINERS - bad mailing list entry?
Joe Perches wrote:
> MAINTAINERS currently has:
>
> TUN/TAP driver
> P:   Maxim Krasnyansky
> M:   [EMAIL PROTECTED]
> L:   [EMAIL PROTECTED]
>
> [EMAIL PROTECTED] doesn't seem to be a valid email address. Should it
> be removed or modified?

Sorry for the late response, I just noticed this. Yes, it's an ancient
mailing list and should be removed. I totally forgot about it.

Max
Re: [PATCH] Allow group ownership of TUN/TAP devices
Jeff Dike wrote:
> I received from Guido Guenther the patch below to the TUN/TAP driver
> which allows group ownerships to be effective. It seems reasonable to me.

Looks good to me too. I'll add it to my tree. In the meantime I don't mind if one of the net driver maintainers pushes it upstream.

Thanx
Max
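For context, here is a minimal userspace sketch of the access rule such a patch implies. This is an illustration, not the kernel source: the helper name `tun_open_allowed` and the use of `(uid_t)-1`/`(gid_t)-1` for "not configured" (mirroring the TUNSETOWNER/TUNSETGROUP ioctl convention) are my assumptions.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical illustration (not the kernel code): opening a TUN/TAP
 * device is allowed unless a configured owner uid or group gid does not
 * match the caller's effective credentials. (uid_t)-1 / (gid_t)-1 mean
 * "not configured". In the kernel, a mismatch can additionally be
 * overridden by CAP_NET_ADMIN, which this sketch omits. */
static bool tun_open_allowed(uid_t euid, gid_t egid, uid_t owner, gid_t group)
{
    bool owner_mismatch = (owner != (uid_t)-1) && (euid != owner);
    bool group_mismatch = (group != (gid_t)-1) && (egid != group);
    return !(owner_mismatch || group_mismatch);
}
```

With group ownership configured, members of that group can open the device without matching the owner uid, which is what the patch enables.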
Re: SLAB cache reaper on isolated cpus
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>
>> Suppose I need to isolate a CPU. We already support it at the scheduler
>> and irq levels (irq affinity). But I want to go a bit further and avoid
>> doing kernel work on isolated cpus as much as possible. For example I
>> would not want to schedule work queues and stuff on them. Currently
>> there are just a few users of schedule_delayed_work_on(): cpufreq
>> (don't care for isolation purposes), oprofile (same here) and slab. For
>> the slab it'd be nice to run the reaper on some other CPU. But you're
>> saying that locking depends on CPU pinning. Is there any other option
>> besides disabling cache reap? Is there a way for example to constrain
>> the slabs on CPU X to not exceed N megs?
>
> There is no way to constrain the amount of slab work. In order to make the
> above work we would have to disable the per cpu caches for a certain cpu.
> Then there would be no need to run the cache reaper at all.
>
> To some extent such functionality already exists. F.e. kmalloc_node()
> already bypasses the per cpu caches (most of the time). kmalloc_node will
> have to take a spinlock on a shared cacheline on each invocation. kmalloc
> does only touch per cpu data during regular operations. Thus kmalloc() is
> much faster than kmalloc_node() and the cachelines for kmalloc() can be
> kept in the per cpu cache.
>
> If we could disable all per cpu caches for certain cpus then you could
> make this work. All slab OS interference would be off the processor.

Hmm. That's an idea. I'll play with it later today or tomorrow.

Thanks
Max
SLAB cache reaper on isolated cpus
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>> Ok. Sounds like disabling cache_reaper is a better option for now. Like
>> you said, it's unlikely that slabs will grow much if that cpu is not
>> heavily used by the kernel.
> Running for prolonged times without cache_reaper is no good. What we are
> talking about here is to disable the cache_reaper during cpu shutdown.
> The slab cpu shutdown will clean the per cpu caches anyways so we really
> do not need the slab_reaper running during cpu shutdown.

Ok. Let me restart the thread so that we're not confusing two issues :). I'm not talking about CPU shutdown or CPU hotplug in general. My proposal seemed related to the CPU shutdown issue that you guys were discussing, but it turns out it's not.

Suppose I need to isolate a CPU. We already support it at the scheduler and irq levels (irq affinity). But I want to go a bit further and avoid doing kernel work on isolated cpus as much as possible. For example I would not want to schedule work queues and stuff on them. Currently there are just a few users of schedule_delayed_work_on(): cpufreq (don't care for isolation purposes), oprofile (same here) and slab. For the slab it'd be nice to run the reaper on some other CPU. But you're saying that locking depends on CPU pinning. Is there any other option besides disabling cache reap? Is there a way for example to constrain the slabs on CPU X to not exceed N megs?

Max
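On the userspace side, an isolation setup like the one described above usually starts by pinning the application threads to the isolated CPU. A minimal sketch (my illustration, assuming Linux and that the requested CPU is online; `pin_to_cpu` is a hypothetical helper name):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU. This is the userspace half of
 * the isolation scheme discussed above; keeping irqs and deferred kernel
 * work (work queues, cache_reap, ...) off that CPU is the kernel half.
 * Returns the CPU the thread is now running on, or -1 on failure. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    /* after a successful setaffinity, we must be running on 'cpu' */
    return sched_getcpu();
}
```

A thread restricted this way never migrates off the isolated CPU, which is what makes per-CPU tricks like the TSC usage elsewhere in this thread viable.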
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>> I guess I kind of hijacked the thread. The second part of my first email
>> was dropped. Basically I was saying that I'm working on CPU isolation
>> extensions, where an isolated CPU is not supposed to do much kernel
>> work. In which case you'd want to run the slab cache reaper on some
>> other CPU on behalf of the isolated one. Hence the proposal to
>> explicitly pass cpu_id to the reaper. I guess now that you guys fixed
>> the hotplug case it does not help in that scenario.
> A cpu must have a per cpu cache in order to do slab allocations. The
> locking in the slab allocator depends on it. If the isolated cpus have no
> need for slab allocations then you will also not need to run the
> slab_reaper().

Ok. Sounds like disabling cache_reaper is a better option for now. Like you said, it's unlikely that slabs will grow much if that cpu is not heavily used by the kernel.

Thanks
Max
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Oleg Nesterov wrote:
> On 02/20, Christoph Lameter wrote:
>> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>>> Well, it seems that we have a set of unresolved issues with workqueues
>>> and cpu hotplug. How about storing 'cpu' explicitly in the work queue
>>> instead of relying on smp_processor_id() and friends? That way there
>>> is no ambiguity when threads/timers get moved around.
>> The slab functionality is designed to work on the processor with the
>> queue. These tricks will only cause more trouble in the future. The
>> cache_reaper needs to be either disabled or run on the right processor.
>> It should never run on the wrong processor.
> I personally agree. Besides, cache_reaper is not alone. Note the comment
> in debug_smp_processor_id() about cpu-bound processes. The slab does the
> correct thing right now, stops the timer on CPU_DEAD. Other problems imho
> should be solved by fixing cpu-hotplug. Gautham and Srivatsa are working
> on that.

I guess I kind of hijacked the thread. The second part of my first email was dropped. Basically I was saying that I'm working on CPU isolation extensions, where an isolated CPU is not supposed to do much kernel work. In which case you'd want to run the slab cache reaper on some other CPU on behalf of the isolated one. Hence the proposal to explicitly pass cpu_id to the reaper. I guess now that you guys fixed the hotplug case it does not help in that scenario.

Max
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Christoph Lameter wrote:
> On Tue, 20 Feb 2007, Max Krasnyansky wrote:
>> Well, it seems that we have a set of unresolved issues with workqueues
>> and cpu hotplug. How about storing 'cpu' explicitly in the work queue
>> instead of relying on smp_processor_id() and friends? That way there is
>> no ambiguity when threads/timers get moved around.
> The slab functionality is designed to work on the processor with the
> queue. These tricks will only cause more trouble in the future. The
> cache_reaper needs to be either disabled or run on the right processor.
> It should never run on the wrong processor.
> The cache_reaper() is of no importance to hotplug. You just need to make
> sure that it is not in the way (disable it and if it's running wait until
> the cache_reaper has finished).

I agree that running the reaper on the wrong CPU is not the best way to go about it. But it seems like disabling it is even worse (i.e. wasting memory), unless I'm missing something. Btw, what kind of troubles were you talking about? Performance or robustness? As I said, performance-wise it does not make sense to run the reaper on the wrong CPU, but it does seem to work just fine from the correctness (locking, etc.) perspective. Again, it's totally possible that I'm missing something here :).

Thanks
Max
Re: slab: start_cpu_timer/cache_reap CONFIG_HOTPLUG_CPU problems
Folks,

Oleg Nesterov wrote:
> Even if smp_processor_id() was stable during the execution of
> cache_reap(), this work_struct can be moved to another CPU if CPU_DEAD
> happens. We can't avoid this, and this is correct.

Uhh. This may not be correct in terms of how the slab operates.

> But this is practically impossible to avoid. We can't delay CPU_DOWN
> until all workqueues flush their cwq->worklist. This is livelockable,
> the work can re-queue itself, and new works can be added since the dying
> CPU is still on cpu_online_map. This means that some pending works will
> be processed on another CPU. delayed_work is even worse, the timer can
> migrate as well.
>
> The first problem (smp_processor_id() is not stable) could be solved if
> we use freezer or with the help of not-yet-implemented scalable
> lock_cpu_hotplug.
>
> This means that __get_cpu_var(reap_work) returns a "wrong" struct
> delayed_work. This is absolutely harmless right now, but may be it's
> better to use container_of(unused, struct delayed_work, work).

Well, it seems that we have a set of unresolved issues with workqueues and cpu hotplug. How about storing 'cpu' explicitly in the work queue instead of relying on smp_processor_id() and friends? That way there is no ambiguity when threads/timers get moved around.

I'm cooking a set of patches to extend the cpu isolation concept a bit, in which case I'd like one CPU to run the cache_reap timer on behalf of another cpu. See the patch below.
diff --git a/mm/slab.c b/mm/slab.c
index c610062..0f46d11 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -766,7 +766,17 @@ int slab_is_available(void)
 	return g_cpucache_up == FULL;
 }
 
-static DEFINE_PER_CPU(struct delayed_work, reap_work);
+struct slab_work {
+	struct delayed_work dw;
+	unsigned int cpu;
+};
+
+static DEFINE_PER_CPU(struct slab_work, reap_work);
+
+static inline struct array_cache *cpu_cache_get_on(struct kmem_cache *cachep, unsigned int cpu)
+{
+	return cachep->array[cpu];
+}
 
 static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
 {
@@ -915,9 +925,9 @@ static void init_reap_node(int cpu)
 	per_cpu(reap_node, cpu) = node;
 }
 
-static void next_reap_node(void)
+static void next_reap_node(unsigned int cpu)
 {
-	int node = __get_cpu_var(reap_node);
+	int node = per_cpu(reap_node, cpu);
 
 	/*
 	 * Also drain per cpu pages on remote zones
@@ -928,12 +938,12 @@ static void next_reap_node(void)
 	node = next_node(node, node_online_map);
 	if (unlikely(node >= MAX_NUMNODES))
 		node = first_node(node_online_map);
-	__get_cpu_var(reap_node) = node;
+	per_cpu(reap_node, cpu) = node;
 }
 
 #else
 #define init_reap_node(cpu) do { } while (0)
-#define next_reap_node(void) do { } while (0)
+#define next_reap_node(cpu) do { } while (0)
 #endif
 
 /*
@@ -945,17 +955,18 @@ static void next_reap_node(void)
  */
 static void __devinit start_cpu_timer(int cpu)
 {
-	struct delayed_work *reap_work = &per_cpu(reap_work, cpu);
+	struct slab_work *reap_work = &per_cpu(reap_work, cpu);
 
 	/*
 	 * When this gets called from do_initcalls via cpucache_init(),
 	 * init_workqueues() has already run, so keventd will be setup
 	 * at that time.
 	 */
-	if (keventd_up() && reap_work->work.func == NULL) {
+	if (keventd_up() && reap_work->dw.work.func == NULL) {
 		init_reap_node(cpu);
-		INIT_DELAYED_WORK(reap_work, cache_reap);
-		schedule_delayed_work_on(cpu, reap_work,
+		INIT_DELAYED_WORK(&reap_work->dw, cache_reap);
+		reap_work->cpu = cpu;
+		schedule_delayed_work_on(cpu, &reap_work->dw,
 					__round_jiffies_relative(HZ, cpu));
 	}
 }
@@ -1004,7 +1015,7 @@ static int transfer_objects(struct array_cache *to,
 #ifndef CONFIG_NUMA
 
 #define drain_alien_cache(cachep, alien) do { } while (0)
-#define reap_alien(cachep, l3) do { } while (0)
+#define reap_alien(cachep, l3, cpu) do { } while (0)
 
 static inline struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -1099,9 +1110,9 @@ static void __drain_alien_cache(struct kmem_cache *cachep,
 /*
  * Called from cache_reap() to regularly drain alien caches round robin.
  */
-static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
+static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3, unsigned int cpu)
 {
-	int node = __get_cpu_var(reap_node);
+	int node = per_cpu(reap_node, cpu);
 
 	if (l3->alien) {
 		struct array_cache *ac = l3->alien[node];
@@ -4017,16 +4028,17 @@ void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3,
 * If we cannot acquire the cache chain mutex then just give up - we'll try
 * again on the next iteration.
 */
-static void cache_reap(struct work_struct *unused)
+static void cache_reap(struct work_struct *_work)
 {
 	struct kmem_cache *searchp;
 	struct
Re: [patch] x86: unify/rewrite SMP TSC sync code
> Using gtod() can amount to a substantial disturbance of the thing to be
> measured. Using rdtsc, things seem reliable so far, and we have an FPGA
> (accessed through the PCI bus) that has been programmed to give access
> to an 8MHz clock and we do some checks against that.

Same here. gettimeofday() is way too slow (dual Opteron box) for the frequency I need to call it at. HPET is not available. But TSC is doing just fine. Plus in my case I don't care about sync between CPUs (the thread that uses TSC is running on the isolated CPU) and I have an external time source that takes care of the drift. So please no trapping of RDTSC. Making it clear (bold kernel message during boot :-) that TSC(s) are not in sync or unstable (from the GTOD point of view) is of course perfectly fine.

Thanx
Max
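A sketch of the cheap-timestamp approach described above (my illustration, not anyone's actual code; it assumes x86's rdtsc where available with a clock_gettime fallback elsewhere, and, as noted in the discussion, that the calling thread stays on one CPU and drift is corrected against an external source):

```c
#include <stdint.h>
#include <time.h>

/* Low-overhead timestamp: raw TSC on x86 (the approach described above),
 * CLOCK_MONOTONIC nanoseconds as a portable fallback. The raw TSC value
 * is in CPU cycles, not wall time, and is only meaningful if the caller
 * stays pinned to one CPU and handles calibration/drift itself. */
static inline uint64_t cheap_timestamp(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}
```

Reading the TSC this way costs a few dozen cycles, versus the syscall (or vsyscall plus clocksource read) that gettimeofday() performs, which is the gap being complained about here.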
CD/DVD drive access hangs when media is not inserted
Hi Folks,

I've got an ASUS DVD-E616P2 drive and it seems that media detection is broken with it. Processes that try to access the drive when a CD or DVD is not inserted simply hang until the machine is rebooted. So for example if I do 'cat /dev/cdrom', the first few attempts fail with a 'No medium found' error and dmesg shows 'cdrom: open failed'. But then it hangs in ide_do_drive_cmd:

4435 D+ cat /dev/cdrom ide_do_drive_cmd

From then on the drive is dead. Inserting a CD does not help. Reboot is the only way to bring it back to life. Everything else works just fine. Actually, almost everything. Another annoying problem is that if I pause DVD playback for too long (let's say 10-15 minutes) and then hit play again, DVD access hangs just like in the 'no medium' case. Any ideas? I tried a bunch of kernels: 2.6.8.1, 2.6.9, 2.6.10.

Here is how the drive is recognized at boot:

ICH5: IDE controller at PCI slot 0000:00:1f.1
ACPI: PCI interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 177
ICH5: chipset revision 2
ICH5: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:pio
Probing IDE interface ide0...
hda: SAMSUNG SP1614N, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hdc: ASUS DVD-E616P2, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15

And this is what lspci has to say about my system:

00:00.0 Host bridge: Intel Corp. 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corp. 82865G/PE/P PCI to AGP Controller (rev 02)
00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #3 (rev 02)
00:1d.3 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV34 [GeForce FX 5200] (rev a1)
02:04.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
02:04.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
02:07.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5788 Gigabit Ethernet (rev 03)
02:0d.0 FireWire (IEEE 1394): Agere Systems (former Lucent Microelectronics) FW323 (rev 61)

Max