Re: [PATCH] [RFC] Throttle swappiness for interactive tasks

2007-04-18 Thread Chris Snook

अभिजित भोपटकर (Abhijit Bhopatkar) wrote:

The mm structures of interactive tasks are marked and
the pages belonging to them are never shifted to the inactive
list in the LRU algorithm, thus keeping interactive tasks in
memory as long as possible.
The interactivity is already determined by the scheduler, so
we reuse that knowledge to mark the mm structures.

Signed-off-by: Abhijit Bhopatkar [EMAIL PROTECTED]
---


Lying to the VM doesn't seem like the best way to handle this.  A lot of tasks, 
including interactive ones, have some/many pages that they touch once during 
startup and don't touch again for a very long time, if ever.  We want these 
pages swapped out long before the box swaps out the working set of our 
non-interactive processes.


I like the general idea of swap priority influenced by scheduler priority, but 
if we're going to do that, we should do it in a general way that's independent 
of scheduler implementation, so it'll be useful to soft real-time users and 
still relevant if (when?) we replace the current scheduler with something else 
lacking a special interactive flag.


-- Chris


Re: SMP lockup in virtualized environment

2007-04-24 Thread Chris Snook

LAPLACE Cyprien wrote:

An example: in kernel/pid.c:alloc_pid(), if one of the guest CPUs is
descheduled when holding the pidmap_lock, what happens to the other
guest CPUs who want to alloc/free pids?  Are they blocked too?


Yup.  This is where it's really nice to have directed yields, where you tell the 
hypervisor to give your physical CPU time to the vcpu that's holding the lock 
you're blocking on.  I know s390 can do this.  Perhaps it's something worth 
generalizing in paravirt_ops?
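
To illustrate, here's a purely hypothetical sketch of what such a hook
might look like; pv_yield_to() and the owner_vcpu field don't exist
anywhere today, and s390's directed-yield diagnose is the model:

/* Hypothetical sketch only: pv_yield_to() and owner_vcpu are made up. */
#include <linux/spinlock.h>

struct pv_lock {
	spinlock_t lock;
	int owner_vcpu;		/* recorded by the lock holder */
};

static inline void pv_lock_acquire(struct pv_lock *l)
{
	while (!spin_trylock(&l->lock))
		/* Donate our physical CPU to the vcpu holding the lock,
		 * so it can finish its critical section sooner. */
		pv_yield_to(l->owner_vcpu);
}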


-- Chris


Re: [PATCH 14/17] atl1 trivial endianness misannotations

2007-03-15 Thread Chris Snook

Al Viro wrote:

NB: driver is chock-full of code that will break on big-endian; as long
as the hardware is onboard-only we can live with that, but sooner or
later that'll need fixing.

Signed-off-by: Al Viro [EMAIL PROTECTED]
---
 drivers/net/atl1/atl1_main.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 88d4f70..dee3638 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -1328,7 +1328,7 @@ static int atl1_tx_csum(struct atl1_adapter *adapter,
struct sk_buff *skb,
 
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
 		cso = skb->h.raw - skb->data;
-		css = (skb->h.raw + skb->csum) - skb->data;
+		css = (skb->h.raw + skb->csum_offset) - skb->data;
 		if (unlikely(cso & 0x1)) {
 			printk(KERN_DEBUG "%s: payload offset != even number\n",
 				atl1_driver_name);


This could certainly explain some checksumming problems we've seen.


@@ -1562,7 +1562,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct
net_device *netdev)
 	/* mss will be nonzero if we're doing segment offload (TSO/GSO) */
 	mss = skb_shinfo(skb)->gso_size;
 	if (mss) {
-		if (skb->protocol == ntohs(ETH_P_IP)) {
+		if (skb->protocol == htons(ETH_P_IP)) {
 			proto_hdr_len = ((skb->h.raw - skb->data) +
 					 (skb->h.th->doff << 2));
 			if (unlikely(proto_hdr_len > len)) {


ACK.


Re: [PATCH] fix atl1 braino

2007-02-13 Thread Chris Snook

Al Viro wrote:

Spot the bug...

Signed-off-by: Al Viro [EMAIL PROTECTED]
---

diff --git a/drivers/net/atl1/atl1_hw.c b/drivers/net/atl1/atl1_hw.c
index 08b2d78..e28707a 100644
--- a/drivers/net/atl1/atl1_hw.c
+++ b/drivers/net/atl1/atl1_hw.c
@@ -357,7 +357,7 @@ void atl1_hash_set(struct atl1_hw *hw, u32 hash_value)
 	 */
 	hash_reg = (hash_value >> 31) & 0x1;
 	hash_bit = (hash_value >> 26) & 0x1F;
-	mta = ioread32((hw + REG_RX_HASH_TABLE) + (hash_reg << 2));
+	mta = ioread32((hw->hw_addr + REG_RX_HASH_TABLE) + (hash_reg << 2));
 	mta |= (1 << hash_bit);
 	iowrite32(mta, (hw->hw_addr + REG_RX_HASH_TABLE) + (hash_reg << 2));
 }


ACK.

Thanks for catching this.


Re: GPL vs non-GPL device drivers

2007-02-15 Thread Chris Snook

v j wrote:

You don't get it, do you?  Our source code is meaningless to the Open
Source community at large. It is only useful to our tiny set of
competitors that have nothing to do with Linux. The Embedded space is
very specific. We are only _using_ Linux. Just as we could have used
VxWorks or OSE. Using our source code would not benefit anybody but
our competitors. Sure we could make our drivers open-source. This is a
decision that is made FIRST when evaluating an OS. If we were
required to make our drivers/HW open, we would just not have chosen
Linux. It is as simple as that.


Collaborating with the competition ("coopetition") on a common 
technology platform reduces costs for anyone who chooses to get 
involved, giving them a collective competitive edge against anyone who 
doesn't.  This is why there is so much industry interest in F/OSS, and 
mortal enemies in the business world happily work together on technical 
issues in Linux.


If you choose to actively participate in the community, you will benefit 
from this phenomenon, as well as the patches you will receive from very 
smart kernel hackers who don't even own your hardware, and the pool of 
mature GPL code you can use to improve your drivers.


If you do not choose to actively participate in the community, you can 
still keep using existing versions of the kernel that work fine for you, 
even if future versions do not.  There are plenty of embedded devices 
out there using 2.4 or even 2.2 kernels that do what they need.


Your competitors who do participate in the community (and there are a 
lot in the embedded space) enjoy reduced development costs, more stable 
and better-reviewed code, continuous compatibility with the latest 
versions, and influence in the community over the direction of future 
development.  If you want to cede this advantage to your competitors, 
that's between you and your investors.


-- Chris


Re: init's children list is long and slows reaping children.

2007-04-05 Thread Chris Snook

Linus Torvalds wrote:


On Thu, 5 Apr 2007, Robin Holt wrote:

For testing, Jack Steiner created the following patch.  All it does
is move tasks which are transitioning to the zombie state from where
they are in the children list to the head of the list.  In this way,
they will be the first found and reaping does speed up.  We will still
do a full scan of the list once the rearranged tasks are all removed.
This does not seem to be a significant problem.


I'd almost prefer to just put the zombie children on a separate list. I 
wonder how painful that would be..


That would still make it expensive for people who use WUNTRACED to get 
stopped children (since they'd have to look at all lists), but maybe 
that's not a big deal.


Shouldn't be any worse than it already is.

Another thing we could do is to just make sure that kernel threads simply 
don't end up as children of init. That whole thing is silly, they're 
really not children of the user-space init anyway. Comments?


Linus


Does anyone remember why we started doing this in the first place?  I'm sure 
there are some tools that expect a process tree, rather than a forest, and 
making it a forest could make them unhappy.


The support angel on my shoulder says we should just put all the kernel threads 
under a kthread subtree to shorten init's child list and minimize impact.  The 
hacker devil on my other shoulder says that with usermode helpers, containers, 
etc. it's about time we treat it as a tree, and any tools that have a problem 
with that need to be fixed.


-- Chris


Re: init's children list is long and slows reaping children.

2007-04-05 Thread Chris Snook

Chris Snook wrote:

Linus Torvalds wrote:


On Thu, 5 Apr 2007, Robin Holt wrote:

For testing, Jack Steiner created the following patch.  All it does
is move tasks which are transitioning to the zombie state from where
they are in the children list to the head of the list.  In this way,
they will be the first found and reaping does speed up.  We will still
do a full scan of the list once the rearranged tasks are all removed.
This does not seem to be a significant problem.


I'd almost prefer to just put the zombie children on a separate list. 
I wonder how painful that would be..


That would still make it expensive for people who use WUNTRACED to get 
stopped children (since they'd have to look at all lists), but maybe 
that's not a big deal.


Shouldn't be any worse than it already is.

Another thing we could do is to just make sure that kernel threads 
simply don't end up as children of init. That whole thing is silly, 
they're really not children of the user-space init anyway. Comments?


Linus


Does anyone remember why we started doing this in the first place?  I'm 
sure there are some tools that expect a process tree, rather than a 
forest, and making it a forest could make them unhappy.


The support angel on my shoulder says we should just put all the kernel 
threads under a kthread subtree to shorten init's child list and 
minimize impact.  The hacker devil on my other shoulder says that with 
usermode helpers, containers, etc. it's about time we treat it as a 
tree, and any tools that have a problem with that need to be fixed.


-- Chris


Err, that should have been "about time we treat it as a forest."

-- Chris


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Chris Snook

Eric W. Biederman wrote:

Linus Torvalds [EMAIL PROTECTED] writes:
I'm not sure anybody would really be unhappy with pptr pointing to some 
magic and special task that has pid 0 (which makes it clear to everybody 
that the parent is something special), and that has SIGCHLD set to SIG_IGN 
(which should make the exit case not even go through the zombie phase).


I can't even imagine *how* you'd make a tool unhappy with that, since even 
tools like ps (and even more pstree) won't read all the process states 
atomically, so they invariably will see parent pointers that don't even 
exist any more, because by the time they get to the parent, it has exited 
already.


Right.  pid == 1 being missing might cause some confusion, but having
ppid == 0 should be fine.  Heck, pid == 1 already has ppid == 0, so it
is a value user space has had to deal with for a while.

In addition there was a period in 2.6 where most kernel threads
and init had a pgid == 0 and a session == 0, and nothing seemed
to complain.

We should probably make all of the kernel threads children of
init_task, the initial idle thread on the first cpu that is the
parent of pid == 1.  That will give the ppid == 0 naturally because
the idle thread has pid == 0.


Linus, Eric, thanks for the history lesson.  I think it's safe to say 
that anything that breaks because of this sort of change was already 
broken anyway.


If we're going to scale to an obscene number of CPUs (which I believe 
was the original motivation on this thread) then putting the dead 
children on their own list will probably scale better.


-- Chris


[PATCH 0/2] use symbolic constants in generic lseek code

2007-02-20 Thread Chris Snook
The generic lseek code in fs/read_write.c uses hardcoded values for
SEEK_{SET,CUR,END}.

Patch 1 fixes the case statements to use the symbolic constants in
include/linux/fs.h, and should not be at all controversial.

Patch 2 adds a SEEK_MAX and uses it to validate user arguments.  This makes
the code a little cleaner and also enables future extensions (such as
SEEK_DATA and SEEK_HOLE).  If anyone has a problem with this, please speak up.
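
For reference, a quick sketch of what the validation means from
userspace; assume fd is an open file descriptor:

#include <unistd.h>
#include <errno.h>
#include <stdio.h>

/* Returns 0 if the kernel rejects an out-of-range whence, as the
 * SEEK_MAX check guarantees. */
static int check_whence_limit(int fd)
{
	if (lseek(fd, 0, 3) == (off_t)-1 && errno == EINVAL)
		return 0;	/* 3 > SEEK_MAX (== SEEK_END == 2) */
	fprintf(stderr, "kernel accepted bogus whence\n");
	return 1;
}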

-- Chris



[PATCH 1/2] use symbolic constants in generic lseek code

2007-02-20 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Convert magic numbers to SEEK_* values from fs.h

Signed-off-by: Chris Snook [EMAIL PROTECTED]
--
--- a/fs/read_write.c   2007-02-20 14:49:45.0 -0500
+++ b/fs/read_write.c   2007-02-20 16:48:39.0 -0500
@@ -37,10 +37,10 @@ loff_t generic_file_llseek(struct file *

 	mutex_lock(&inode->i_mutex);
 	switch (origin) {
-		case 2:
+		case SEEK_END:
 			offset += inode->i_size;
 			break;
-		case 1:
+		case SEEK_CUR:
 			offset += file->f_pos;
 	}
 	retval = -EINVAL;
retval = -EINVAL;
@@ -63,10 +63,10 @@ loff_t remote_llseek(struct file *file,

 	lock_kernel();
 	switch (origin) {
-		case 2:
+		case SEEK_END:
 			offset += i_size_read(file->f_path.dentry->d_inode);
 			break;
-		case 1:
+		case SEEK_CUR:
 			offset += file->f_pos;
 	}
 	retval = -EINVAL;
@@ -94,10 +94,10 @@ loff_t default_llseek(struct file *file,

 	lock_kernel();
 	switch (origin) {
-		case 2:
+		case SEEK_END:
 			offset += i_size_read(file->f_path.dentry->d_inode);
 			break;
-		case 1:
+		case SEEK_CUR:
 			offset += file->f_pos;
 	}
 	retval = -EINVAL;



[PATCH 2/2] use SEEK_MAX to validate user lseek arguments

2007-02-20 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Add SEEK_MAX and use it to validate lseek arguments from userspace.

Signed-off-by: Chris Snook [EMAIL PROTECTED]
--
diff -urp b/fs/read_write.c c/fs/read_write.c
--- b/fs/read_write.c   2007-02-20 16:48:39.0 -0500
+++ c/fs/read_write.c   2007-02-20 16:55:46.0 -0500
@@ -139,7 +139,7 @@ asmlinkage off_t sys_lseek(unsigned int
goto bad;

retval = -EINVAL;
-	if (origin <= 2) {
+	if (origin <= SEEK_MAX) {
loff_t res = vfs_llseek(file, offset, origin);
retval = res;
if (res != (loff_t)retval)
@@ -166,7 +166,7 @@ asmlinkage long sys_llseek(unsigned int
goto bad;

retval = -EINVAL;
-	if (origin > 2)
+	if (origin > SEEK_MAX)
 		goto out_putf;
 
 	offset = vfs_llseek(file, ((loff_t) offset_high << 32) | offset_low,
diff -urp b/include/linux/fs.h c/include/linux/fs.h
--- b/include/linux/fs.h	2007-02-20 14:49:46.0 -0500
+++ c/include/linux/fs.h	2007-02-20 16:54:30.0 -0500
@@ -30,6 +30,7 @@
 #define SEEK_SET   0   /* seek relative to beginning of file */
 #define SEEK_CUR   1   /* seek relative to current file position */
 #define SEEK_END   2   /* seek relative to end of file */
+#define SEEK_MAX   SEEK_END

 /* And dynamically-tunable limits and defaults: */
 struct files_stat_struct {



Re: Lower HD transfer rate with NCQ enabled?

2007-04-03 Thread Chris Snook

Paa Paa wrote:
I'm using Linux 2.6.20.4. I noticed that I get lower SATA hard drive 
throughput with 2.6.20.4 than with 2.6.19. The reason was that 2.6.20 
enables NCQ by default (queue_depth = 31/32 instead of 0/32). Transfer 
rate was measured using hdparm -t:


With NCQ (queue_depth == 31): 50MB/s.
Without NCQ (queue_depth == 0): 60MB/s.

20% difference is quite a lot. This is with an Intel ICH8R controller and 
a Western Digital WD1600YS hard disk in AHCI mode. I also used the 
following command to cat-copy a biggish (540MB) file and time it:


rm temp && sync && time sh -c 'cat quite_big_file > temp && sync'

Here I noticed no differences at all with and without NCQ. The times 
(real time) were basically the same in many successive runs. Around 19s.


Q: What conclusions can I make from the hdparm -t results, or can I make 
any conclusions at all? Do I really have lower performance with NCQ or 
not? If I do, is this because of my HD or because of the kernel?


hdparm -t is a perfect example of a synthetic benchmark.  NCQ was 
designed to optimize real-world workloads.  The overhead gets hidden 
pretty well when there are multiple requests in flight simultaneously, 
as tends to be the case when you have a user thread reading data while a 
kernel thread is asynchronously flushing the user thread's buffered 
writes.  Given that you're breaking even with one user thread and one 
kernel thread doing I/O, you'll probably get performance improvements 
with higher thread counts.


-- Chris


Re: Usage semantics of atomic_set ( )

2008-01-11 Thread Chris Snook

Vineet Gupta wrote:

I'm trying to implement atomic ops for a CPU which has no inherent
support for Read-Modify-Write Ops. Instead of using a global spin lock
which protects all the atomic APIs, I want to use a spin lock per
instance of atomic_t.


What operations are you using to implement spinlocks?

A few architectures use arrays of spinlocks to implement atomic_t.  I believe 
sparc and parisc are among them.  Assuming your spinlock implementation is sound 
and efficient, the same technique should work for you.
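
For reference, here's a minimal sketch of the hashed-spinlock
technique, along the lines of what the sparc32 port does; the hash size
and names are illustrative, and it assumes your spinlock and irqsave
primitives already work:

#include <linux/spinlock.h>
#include <linux/cache.h>
#include <asm/atomic.h>

#define ATOMIC_HASH_SIZE	4
#define ATOMIC_HASH(v)	(&__atomic_hash[(((unsigned long)(v)) >> \
			L1_CACHE_SHIFT) & (ATOMIC_HASH_SIZE - 1)])

static spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] = {
	[0 ... ATOMIC_HASH_SIZE - 1] = SPIN_LOCK_UNLOCKED
};

int atomic_add_return(int i, atomic_t *v)
{
	int ret;
	unsigned long flags;

	/* Any atomic_t hashing to the same bucket serializes with us,
	 * which is correct, just occasionally slower than necessary. */
	spin_lock_irqsave(ATOMIC_HASH(v), flags);
	ret = (v->counter += i);
	spin_unlock_irqrestore(ATOMIC_HASH(v), flags);
	return ret;
}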


-- Chris


Re: irq load balancing

2007-09-12 Thread Chris Snook

Venkat Subbiah wrote:

Most of the load in my system is triggered by a single ethernet IRQ.
Essentially the IRQ schedules a tasklet and most of the work is done in the
tasklet, which is scheduled in the IRQ. From what I read, it looks like the
tasklet would be executed on the same CPU on which it was scheduled. So this
means even in an SMP system it will be one processor which is overloaded.

So will using the user space IRQ load balancer really help?


A little bit.  It'll keep other IRQs on different CPUs, which will prevent other 
interrupts from causing cache and TLB evictions that could slow down the 
interrupt handler for the NIC.



What I am doubtful
about is that the user space load balancer comes along and changes the
affinity once in a while. But really what I need is every interrupt to go to
a different CPU in a round-robin fashion.


Doing it in a round-robin fashion will be disastrous for performance.  Your 
cache miss rate will go through the roof and you'll hit the slow paths in the 
network stack most of the time.



Looks like the APIC can distribute IRQs dynamically? Is this supported in
the kernel, and is there any config or proc interface to turn this on/off?


/proc/irq/$FOO/smp_affinity is a bitmask.  You can mask an irq to multiple 
processors.  Of course, this will absolutely kill your performance.  That's why 
irqbalance never does this.
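
For example, here's a trivial userspace sketch of pinning an IRQ to a
single CPU; the irq number and mask are made up, and it's just the
programmatic equivalent of echoing a hex mask into the proc file:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/irq/24/smp_affinity", "w");

	if (!f) {
		perror("smp_affinity");
		return 1;
	}
	fprintf(f, "2\n");	/* hex bitmask: CPU1 only */
	return fclose(f) ? 1 : 0;
}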


-- Chris


Re: Lossy interrupts on x86_64

2007-09-13 Thread Chris Snook

Jesse Barnes wrote:
I just narrowed down a weird problem where I was losing more than 50% of 
my vblank interrupts to what seems to be the hires timers patch.  Stock 
2.6.23-rc5 works fine, but the latest (171) kernel from rawhide drops 
most of my interrupts unless I also have another interrupt source 
running (e.g. if I hold down a key or move the mouse I get the expected 
number of vblank interrupts, otherwise I get between 3 and 30 instead 
of the expected 60 per second).


Any ideas?  It seems like it might be bad APIC programming, but I 
haven't gone through those mods to look for suspects...


What happens if you boot with 'noapic' or 'pci=nomsi'?  Please post dmesg as 
well so we can see how the kernel is initializing the relevant hardware.


-- Chris


[PATCH] x86_64: make atomic64_t semantics consistent with atomic_t

2007-09-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

The volatile keyword has already been removed from the declaration of atomic_t
on x86_64.  For consistency, remove it from atomic64_t as well.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- a/include/asm-x86_64/atomic.h   2007-07-08 19:32:17.0 -0400
+++ b/include/asm-x86_64/atomic.h   2007-09-13 11:30:51.0 -0400
@@ -206,7 +206,7 @@ static __inline__ int atomic_sub_return(
 
 /* An 64bit atomic type */
 
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i)   { (i) }
 


Re: irq load balancing

2007-09-13 Thread Chris Snook

Venkat Subbiah wrote:

Since most network devices have a single status register for both
receiver and transmit (and errors and the like), which needs a lock to
protect access, you will likely end up with serious thrashing of moving
the lock between cpus.

Any ways to measure the thrashing of locks?


Since most network devices have a single status register for both
receiver and transmit (and errors and the like)

These register accesses will be mostly within the irq handler, which I
plan on keeping on the same processor. The network driver is actually
tg3. Will look closely into the driver.


Why are you trying to do this, anyway?  This is a classic example of fairness 
hurting both performance and efficiency.  Unbalanced distribution of a single 
IRQ gives superior performance.  There are cases when this is a worthwhile 
tradeoff, but the network stack is not one of them.  In the HPC world, people 
generally want to squeeze maximum performance out of CPU/cache/RAM so they just 
accept the imbalance because it performs better than balancing it, and 
irqbalance can keep things fair over longer intervals if that's important.  In 
the realtime world, people generally bind everything they can to one or two 
CPUs, and bind their realtime applications to the remaining ones to minimize 
contention.


Distributing your network interrupts in a round-robin fashion will make your 
computer do exactly one thing faster: heat up the room.


-- Chris


Re: CPU usage for 10Gbps UDP transfers

2007-09-17 Thread Chris Snook

Lukas Hejtmanek wrote:

Hello,

is it expected that an application sending 8900-byte datagrams through a 10Gbps NIC
utilizes the CPU to 100%, and that the receiver similarly utilizes its CPU to 100%?
Is something wrong, or is this quite OK?

(The box is dual single core Opteron 2.4GHz with Myricom 10GE NIC.)


Every time a new generation of ethernet comes out, its peak throughput exceeds 
the memory/CPU/IO capacity of commodity hardware available at the time.  This is 
normal.  Of course, you may not be saturating the link, and it may be possible 
to tune the driver to improve your throughput, but you'll still be saturating a 
CPU on that hardware.


-- Chris


Re: patch/option to wipe memory at boot?

2007-09-19 Thread Chris Snook

David Madore wrote:

On Mon, Sep 17, 2007 at 11:11:52AM -0700, Jeremy Fitzhardinge wrote:

Boot memtest86 for a little while before booting the kernel?  And if you
haven't already run it for a while, then that would be your first step
anyway.


Indeed, that does the trick, thanks for the suggestion.  So I can be
quite confident, now, that my RAM is sane and it's just that the BIOS
doesn't initialize it properly.

But I'd still like some way of filling the RAM when Linux starts (or
perhaps in the bootloader), because letting memtest86 run after every
cold reboot isn't a very satisfactory solution.


Bootloaders like to do things like run in 16-bit or 32-bit mode on boxes where 
higher bitness is necessary to access all the memory.  It may be possible to do 
this in the bootloader, but the BIOS is clearly the correct place to fix this 
problem.


-- Chris


Re: PAGE_SIZE on 64bit and 32bit machines

2007-11-12 Thread Chris Snook

Yoav Artzi wrote:
According to my knowledge the PAGE_SIZE on 32bit architectures is 4KB. 
Logically, the PAGE_SIZE on 64bit architectures should be 8KB. That's at 
least the way I understand it. However, looking at the kernel code of 
x86_64, I see the PAGE_SIZE is 4KB.



Can anyone explain to me what I am missing here?


PAGE_SIZE is highly architecture-dependent.  While it is true that 4K pages are 
typical on 32-bit architectures, and 64-bit architectures have historically 
introduced 8K pages, this is by no means a requirement.  x86_64 uses the same 
page sizes that are available on i686+PAE, so you get 4K base pages.  alpha and 
sparc64 typically use 8K base pages, though they have other options as well. 
ia64 defaults to 16K, though it can do 4K, 8K, and a bunch of larger base sizes. 
 ppc64 does 4K and 64K.  s390 uses 4K base pages in both 31-bit and 64-bit 
kernels.  If x86_64 processors are released with TLBs that can handle 8K pages, 
it'll be straightforward to add that feature, but otherwise it would require 
faking it in software, which has lots of pitfalls and does nothing to improve 
TLB efficiency.
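
If you just want to see what your own box uses, a trivial userspace
check:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* 4096 on i386/x86_64, 8192 on sparc64, 16384 on default ia64 */
	printf("base page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
	return 0;
}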


-- Chris


Re: Strange delays / what usually happens every 10 min?

2007-11-13 Thread Chris Snook

Florian Boelstler wrote:

While running that test driver a delay of about 10ms _exactly_ occurs
every 10 minutes.


This is precisely the sort of thing that BIOS/firmware-level SMI handlers do, 
particularly those that have monitoring or management features.  Try to 
determine if the kernel is doing anything during this time.  If the entire 
kernel seems to be frozen, talk to the people who wrote the firmware.


-- Chris


Re: PROBLEM: IM Kernel Failure 12/11/07

2007-11-14 Thread Chris Snook

[EMAIL PROTECTED] wrote:
	Linux version 2.4.9-e.38smp ([EMAIL PROTECTED]) (gcc 
version 2.96 20000731 (Red Hat Linux 7.2 2.96-124.7.2)) #1 SMP Wed Feb 
11 00:09:01 EST 2004


Ancient vendor kernels are very out of scope for this mailing list.  The 
following links may be useful:


https://bugzilla.redhat.com/
https://www.redhat.com/apps/support/
http://www.redhat.com/mailman/listinfo


Re: [PATCH] drivers/net/: Spelling fixes

2007-12-17 Thread Chris Snook

Joe Perches wrote:

 drivers/net/atl1/atl1_hw.c |2 +-
 drivers/net/atl1/atl1_main.c   |2 +-


The atl1 code will be heavily reworked in the 2.6.25 merge window, so this may 
cause headaches.  Please remove these chunks before merging.


The spelling corrections themselves are fine, and I will ensure that the revised 
driver includes them, if the comments in question are still present at all once 
we're done with all the changes and cleanups.


-- Chris


Re: [PATCH] Avoid overflows in kernel/time.c

2007-11-29 Thread Chris Snook

H. Peter Anvin wrote:

NOTE: This patch uses a bc(1) script to compute the appropriate
constants.


Perhaps dc would be more appropriate?  That's included in busybox.

-- Chris


Re: Kernel Development Objective-C

2007-11-30 Thread Chris Snook

Ben Crowhurst wrote:

Has Objective-C ever been considered for kernel development?


No.  Kernel programming requires what is essentially assembly language with a 
lot of syntactic sugar, which C provides.  Higher-level languages abstract away 
too much detail to be suitable for the sort of bit-perfect control you need when 
you're directly controlling bare metal.  You can still use object-oriented 
programming techniques in C, and we do this all the time in the kernel, but we 
do so with more fine-grained explicit control than a language like Objective-C 
would give us.  More to the point, if we tried to use Objective-C, we'd find 
ourselves needing to fall back to C-style explicitness so often that it wouldn't 
be worth the trouble.


In other news, I hear Hurd boots again!

-- Chris


Re: Linux Kernel - Future works

2007-12-04 Thread Chris Snook

Muhammad Nowbuth wrote:

Hi all,

Could anyone give some ideas of future pending work that is needed
on the Linux kernel?


http://kernelnewbies.org/KernelHacking


Re: [2.6.22.y][PATCH] atl1: disable broken 64-bit DMA

2007-11-26 Thread Chris Snook

Jay Cliburn wrote:

atl1: disable broken 64-bit DMA

[ Upstream commit: 5f08e46b621a769e52a9545a23ab1d5fb2aec1d4 ]

The L1 network chip can DMA to 64-bit addresses, but multiple descriptor
rings share a single register for the high 32 bits of their address, so
only a single, aligned, 4 GB physical address range can be used at a time.
As a result, we need to confine the driver to a 32-bit DMA mask, otherwise
we see occasional data corruption errors in systems containing 4 or more
gigabytes of RAM.

Signed-off-by: Jay Cliburn [EMAIL PROTECTED]
Cc: Luca Tettamanti [EMAIL PROTECTED]
Cc: Chris Snook [EMAIL PROTECTED]


Acked-By: Chris Snook [EMAIL PROTECTED]


Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2007-12-28 Thread Chris Snook

Martin Knoblauch wrote:

Hi,

currently I am tracking down an interesting effect when writing to a
Solaris-10/Sparc based server. The server exports two filesystems. One UFS,
one VXFS. The filesystems are mounted NFS3/TCP, no special options. Linux
kernel in question is 2.6.24-rc6, but it happens with earlier kernels
(2.6.19.2, 2.6.22.6) as well. The client is x86_64 with 8 GB of ram.

The problem: when writing to the VXFS based filesystem, performance drops
dramatically when the filesize reaches or exceeds dirty_ratio. For a
dirty_ratio of 10% (about 800MB) files below 750 MB are transferred with about
30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I perform
the same tests on the UFS based FS, performance stays at about 30 MB/sec
until 3GB and likely larger (I just stopped at 3 GB).

Any ideas what could cause this difference? Any suggestions on debugging it?


1) Try normal NFS tuning, such as rsize/wsize tuning.

2) You're entering synchronous writeback mode, so you can delay the problem by 
raising dirty_ratio to 100, or reduce the size of the problem by lowering 
dirty_ratio to 1.  Either one could help.


3) It sounds like the bottleneck is the vxfs filesystem.  It only *appears* on 
the client side because writes up until dirty_ratio get buffered on the client. 
 If you can confirm that the server is actually writing stuff to disk slower 
when the client is in writeback mode, then it's possible the Linux NFS client is 
doing something inefficient in writeback mode.


-- Chris


Re: Quad core CPU detected but shows as single core in 2.6.23.1

2007-10-24 Thread Chris Snook

Zurk Tech wrote:

Hi guys,
I have a tyan s3992 h2000 with a single Barcelona AMD quad core cpu (the
other cpu socket is empty). cat /proc/cpuinfo shows an AMD quad core
processor, but core : 1.  I've compiled the kernel from scratch with SMP
and amd64 + the NUMA stuff. I also tried Debian etch's amd64 SMP kernel,
with the same result.
Is the AMD Barcelona quad core cpu not yet supported, or is it something
else?  Thanks for any insight. I'm completely stumped. I've dealt with
multiprocessing machines before and have a couple of dual cores which
are fine with the exact same kernel configs. My AMD TK-53 X2 Turions
show 2 cores in cpuinfo.


The bootstrap protocol for Barcelona is a little different from older Opterons, 
so an older BIOS that doesn't know the new protocol won't be able to bring up 
any CPU other than the bootstrap processor.  My wild guess is that this is 
what's happening and a BIOS update will fix it, but as Arjan said, please post 
dmesg when reporting bugs like this.


-- Chris


[PATCH] x86: mostly merge types.h

2007-10-19 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Most of types_32.h and types_64.h are the same.  Merge the common definitions
into types.h, keeping the differences in their own files.  Also #error if
types_{32,64}.h is included directly.  Tested with allmodconfig on x86_64.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

 types.h|   45 +
 types_32.h |   48 ++--
 types_64.h |   47 +++
 3 files changed, 58 insertions(+), 82 deletions(-)

diff -urp a/include/asm-x86/types_32.h b/include/asm-x86/types_32.h
--- a/include/asm-x86/types_32.h	2007-10-18 04:23:36.0 -0400
+++ b/include/asm-x86/types_32.h	2007-10-18 07:03:05.0 -0400
@@ -1,64 +1,28 @@
 #ifndef _I386_TYPES_H
 #define _I386_TYPES_H
 
-#ifndef __ASSEMBLY__
-
-typedef unsigned short umode_t;
-
-/*
- * __xx is ok: it doesn't pollute the POSIX namespace. Use these in the
- * header files exported to user space
- */
-
-typedef __signed__ char __s8;
-typedef unsigned char __u8;
-
-typedef __signed__ short __s16;
-typedef unsigned short __u16;
-
-typedef __signed__ int __s32;
-typedef unsigned int __u32;
+#ifndef _X86_TYPES_H
+#error Do not include this file directly.  Use asm/types.h instead.
+#endif
 
-#if defined(__GNUC__)
+#if !defined(__ASSEMBLY__) && defined(__GNUC__)
 __extension__ typedef __signed__ long long __s64;
 __extension__ typedef unsigned long long __u64;
 #endif
 
-#endif /* __ASSEMBLY__ */
-
-/*
- * These aren't exported outside the kernel to avoid name space clashes
- */
 #ifdef __KERNEL__
 
 #define BITS_PER_LONG 32
 
 #ifndef __ASSEMBLY__
 
-
-typedef signed char s8;
-typedef unsigned char u8;
-
-typedef signed short s16;
-typedef unsigned short u16;
-
-typedef signed int s32;
-typedef unsigned int u32;
-
-typedef signed long long s64;
-typedef unsigned long long u64;
-
-/* DMA addresses come in generic and 64-bit flavours.  */
-
+/* DMA addresses come in generic and 64-bit flavours. */
 #ifdef CONFIG_HIGHMEM64G
 typedef u64 dma_addr_t;
 #else
 typedef u32 dma_addr_t;
 #endif
-typedef u64 dma64_addr_t;
 
 #endif /* __ASSEMBLY__ */
-
 #endif /* __KERNEL__ */
-
-#endif
+#endif /* _I386_TYPES_H */
diff -urp a/include/asm-x86/types_64.h b/include/asm-x86/types_64.h
--- a/include/asm-x86/types_64.h	2007-10-18 04:23:36.0 -0400
+++ b/include/asm-x86/types_64.h	2007-10-18 07:03:11.0 -0400
@@ -1,55 +1,22 @@
 #ifndef _X86_64_TYPES_H
 #define _X86_64_TYPES_H
 
-#ifndef __ASSEMBLY__
-
-typedef unsigned short umode_t;
-
-/*
- * __xx is ok: it doesn't pollute the POSIX namespace. Use these in the
- * header files exported to user space
- */
-
-typedef __signed__ char __s8;
-typedef unsigned char __u8;
-
-typedef __signed__ short __s16;
-typedef unsigned short __u16;
-
-typedef __signed__ int __s32;
-typedef unsigned int __u32;
+#ifndef _X86_TYPES_H
+#error Do not include this file directly.  Use asm/types.h instead.
+#endif
 
+#ifndef __ASSEMBLY__
 typedef __signed__ long long __s64;
 typedef unsigned long long  __u64;
+#endif
 
-#endif /* __ASSEMBLY__ */
-
-/*
- * These aren't exported outside the kernel to avoid name space clashes
- */
 #ifdef __KERNEL__
 
 #define BITS_PER_LONG 64
 
 #ifndef __ASSEMBLY__
-
-typedef signed char s8;
-typedef unsigned char u8;
-
-typedef signed short s16;
-typedef unsigned short u16;
-
-typedef signed int s32;
-typedef unsigned int u32;
-
-typedef signed long long s64;
-typedef unsigned long long u64;
-
-typedef u64 dma64_addr_t;
 typedef u64 dma_addr_t;
-
-#endif /* __ASSEMBLY__ */
+#endif
 
 #endif /* __KERNEL__ */
-
-#endif
+#endif /* _X86_64_TYPES_H */
diff -urp a/include/asm-x86/types.h b/include/asm-x86/types.h
--- a/include/asm-x86/types.h	2007-10-18 04:23:36.0 -0400
+++ b/include/asm-x86/types.h	2007-10-18 06:59:37.0 -0400
@@ -1,3 +1,46 @@
+#ifndef _X86_TYPES_H
+#define _X86_TYPES_H
+
+#ifndef __ASSEMBLY__
+
+typedef unsigned short umode_t;
+
+/*
+ * __xx is ok: it doesn't pollute the POSIX namespace. Use these in the
+ * header files exported to user space
+ */
+
+typedef __signed__ char __s8;
+typedef unsigned char __u8;
+
+typedef __signed__ short __s16;
+typedef unsigned short __u16;
+
+typedef __signed__ int __s32;
+typedef unsigned int __u32;
+
+/*
+ * These aren't exported outside the kernel to avoid name space clashes
+ */
+#ifdef __KERNEL__
+
+typedef signed char s8;
+typedef unsigned char u8;
+
+typedef signed short s16;
+typedef unsigned short u16;
+
+typedef signed int s32;
+typedef unsigned int u32;
+
+typedef signed long long s64;
+typedef unsigned long long u64;
+
+typedef u64 dma64_addr_t;
+
+#endif /* __KERNEL__ */
+#endif /* __ASSEMBLY__ */
+
 #ifdef __KERNEL__
 # ifdef CONFIG_X86_32
 #  include "types_32.h"
@@ -11,3 +54,5 @@
 #  include "types_64.h"
 # endif
 #endif
+
+#endif /* _X86_TYPES_H */


[PATCH] x86: merge mmu{,_32,_64}.h

2007-10-20 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Merge mmu_32.h and mmu_64.h into mmu.h.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

diff -Nurp a/include/asm-x86/mmu_32.h b/include/asm-x86/mmu_32.h
--- a/include/asm-x86/mmu_32.h  2007-10-20 02:42:24.0 -0400
+++ b/include/asm-x86/mmu_32.h  1969-12-31 19:00:00.0 -0500
@@ -1,18 +0,0 @@
-#ifndef __i386_MMU_H
-#define __i386_MMU_H
-
-#include <linux/mutex.h>
-/*
- * The i386 doesn't have a mmu context, but
- * we put the segment information here.
- *
- * cpu_vm_mask is used to optimize ldt flushing.
- */
-typedef struct { 
-   int size;
-   struct mutex lock;
-   void *ldt;
-   void *vdso;
-} mm_context_t;
-
-#endif
diff -Nurp a/include/asm-x86/mmu_64.h b/include/asm-x86/mmu_64.h
--- a/include/asm-x86/mmu_64.h  2007-10-20 02:42:24.0 -0400
+++ b/include/asm-x86/mmu_64.h  1969-12-31 19:00:00.0 -0500
@@ -1,21 +0,0 @@
-#ifndef __x86_64_MMU_H
-#define __x86_64_MMU_H
-
-#include <linux/spinlock.h>
-#include <linux/mutex.h>
-
-/*
- * The x86_64 doesn't have a mmu context, but
- * we put the segment information here.
- *
- * cpu_vm_mask is used to optimize ldt flushing.
- */
-typedef struct { 
-   void *ldt;
-   rwlock_t ldtlock; 
-   int size;
-   struct mutex lock;
-   void *vdso;
-} mm_context_t;
-
-#endif
diff -Nurp a/include/asm-x86/mmu.h b/include/asm-x86/mmu.h
--- a/include/asm-x86/mmu.h 2007-10-20 02:42:24.0 -0400
+++ b/include/asm-x86/mmu.h 2007-10-20 02:38:36.0 -0400
@@ -1,5 +1,23 @@
-#ifdef CONFIG_X86_32
-# include "mmu_32.h"
-#else
-# include "mmu_64.h"
+#ifndef _ASM_X86_MMU_H
+#define _ASM_X86_MMU_H
+
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+/*
+ * The x86 doesn't have a mmu context, but
+ * we put the segment information here.
+ *
+ * cpu_vm_mask is used to optimize ldt flushing.
+ */
+typedef struct { 
+   void *ldt;
+#ifdef CONFIG_X86_64
+   rwlock_t ldtlock; 
 #endif
+   int size;
+   struct mutex lock;
+   void *vdso;
+} mm_context_t;
+
+#endif /* _ASM_X86_MMU_H */


[PATCH] x86: unify a.out{,_32,_64}.h

2007-10-20 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Unify x86 a.out_32.h and a.out_64.h

Signed-off-by: Chris Snook [EMAIL PROTECTED]

diff -Nurp a/include/asm-x86/a.out_32.h b/include/asm-x86/a.out_32.h
--- a/include/asm-x86/a.out_32.h	2007-10-20 06:20:01.0 -0400
+++ b/include/asm-x86/a.out_32.h	1969-12-31 19:00:00.0 -0500
@@ -1,27 +0,0 @@
-#ifndef __I386_A_OUT_H__
-#define __I386_A_OUT_H__
-
-struct exec
-{
-  unsigned long a_info;/* Use macros N_MAGIC, etc for access */
-  unsigned a_text; /* length of text, in bytes */
-  unsigned a_data; /* length of data, in bytes */
-  unsigned a_bss;  /* length of uninitialized data area for file, 
in bytes */
-  unsigned a_syms; /* length of symbol table data in file, in 
bytes */
-  unsigned a_entry;/* start address */
-  unsigned a_trsize;   /* length of relocation info for text, in bytes 
*/
-  unsigned a_drsize;   /* length of relocation info for data, in bytes 
*/
-};
-
-#define N_TRSIZE(a)((a).a_trsize)
-#define N_DRSIZE(a)((a).a_drsize)
-#define N_SYMSIZE(a)   ((a).a_syms)
-
-#ifdef __KERNEL__
-
-#define STACK_TOP  TASK_SIZE
-#define STACK_TOP_MAX  STACK_TOP
-
-#endif
-
-#endif /* __A_OUT_GNU_H__ */
diff -Nurp a/include/asm-x86/a.out_64.h b/include/asm-x86/a.out_64.h
--- a/include/asm-x86/a.out_64.h	2007-10-20 06:20:01.0 -0400
+++ b/include/asm-x86/a.out_64.h	1969-12-31 19:00:00.0 -0500
@@ -1,28 +0,0 @@
-#ifndef __X8664_A_OUT_H__
-#define __X8664_A_OUT_H__
-
-/* 32bit a.out */
-
-struct exec
-{
-  unsigned int a_info; /* Use macros N_MAGIC, etc for access */
-  unsigned a_text; /* length of text, in bytes */
-  unsigned a_data; /* length of data, in bytes */
-  unsigned a_bss;  /* length of uninitialized data area for file, 
in bytes */
-  unsigned a_syms; /* length of symbol table data in file, in 
bytes */
-  unsigned a_entry;/* start address */
-  unsigned a_trsize;   /* length of relocation info for text, in bytes 
*/
-  unsigned a_drsize;   /* length of relocation info for data, in bytes 
*/
-};
-
-#define N_TRSIZE(a)((a).a_trsize)
-#define N_DRSIZE(a)((a).a_drsize)
-#define N_SYMSIZE(a)   ((a).a_syms)
-
-#ifdef __KERNEL__
-#include <linux/thread_info.h>
-#define STACK_TOP  TASK_SIZE
-#define STACK_TOP_MAX  TASK_SIZE64
-#endif
-
-#endif /* __A_OUT_GNU_H__ */
diff -Nurp a/include/asm-x86/a.out.h b/include/asm-x86/a.out.h
--- a/include/asm-x86/a.out.h   2007-10-20 06:20:01.0 -0400
+++ b/include/asm-x86/a.out.h   2007-10-20 06:14:26.0 -0400
@@ -1,13 +1,32 @@
+#ifndef _ASM_X86_A_OUT_H
+#define _ASM_X86_A_OUT_H
+
+/* 32bit a.out */
+
+struct exec
+{
+  unsigned int a_info; /* Use macros N_MAGIC, etc for access */
+  unsigned a_text; /* length of text, in bytes */
+  unsigned a_data; /* length of data, in bytes */
+  unsigned a_bss;  /* length of uninitialized data area for file, 
in bytes */
+  unsigned a_syms; /* length of symbol table data in file, in 
bytes */
+  unsigned a_entry;/* start address */
+  unsigned a_trsize;   /* length of relocation info for text, in bytes 
*/
+  unsigned a_drsize;   /* length of relocation info for data, in bytes 
*/
+};
+
+#define N_TRSIZE(a)((a).a_trsize)
+#define N_DRSIZE(a)((a).a_drsize)
+#define N_SYMSIZE(a)   ((a).a_syms)
+
 #ifdef __KERNEL__
+# include <linux/thread_info.h>
+# define STACK_TOP	TASK_SIZE
 # ifdef CONFIG_X86_32
-#  include "a.out_32.h"
+#  define STACK_TOP_MAX	STACK_TOP
 # else
-#  include "a.out_64.h"
-# endif
-#else
-# ifdef __i386__
-#  include "a.out_32.h"
-# else
-#  include "a.out_64.h"
+#  define STACK_TOP_MAX	TASK_SIZE64
 # endif
 #endif
+
+#endif /* _ASM_X86_A_OUT_H */


[PATCH] x86: unify div64{,_32,_64}.h

2007-10-20 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Unify x86 div64.h headers.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

diff -Nurp a/include/asm-x86/div64_32.h b/include/asm-x86/div64_32.h
--- a/include/asm-x86/div64_32.h	2007-10-20 07:33:53.0 -0400
+++ b/include/asm-x86/div64_32.h	1969-12-31 19:00:00.0 -0500
@@ -1,52 +0,0 @@
-#ifndef __I386_DIV64
-#define __I386_DIV64
-
-#include <linux/types.h>
-
-/*
- * do_div() is NOT a C function. It wants to return
- * two values (the quotient and the remainder), but
- * since that doesn't work very well in C, what it
- * does is:
- *
- * - modifies the 64-bit dividend _in_place_
- * - returns the 32-bit remainder
- *
- * This ends up being the most efficient calling
- * convention on x86.
- */
-#define do_div(n,base) ({ \
-	unsigned long __upper, __low, __high, __mod, __base; \
-	__base = (base); \
-	asm("":"=a" (__low), "=d" (__high):"A" (n)); \
-	__upper = __high; \
-	if (__high) { \
-		__upper = __high % (__base); \
-		__high = __high / (__base); \
-	} \
-	asm("divl %2":"=a" (__low), "=d" (__mod):"rm" (__base), "0" (__low), "1" (__upper)); \
-	asm("":"=A" (n):"a" (__low),"d" (__high)); \
-	__mod; \
-})
-
-/*
- * (long)X = ((long long)divs) / (long)div
- * (long)rem = ((long long)divs) % (long)div
- *
- * Warning, this will do an exception if X overflows.
- */
-#define div_long_long_rem(a,b,c) div_ll_X_l_rem(a,b,c)
-
-static inline long
-div_ll_X_l_rem(long long divs, long div, long *rem)
-{
-   long dum2;
-	__asm__("divl %2":"=a"(dum2), "=d"(*rem)
-		:"rm"(div), "A"(divs));
-
-   return dum2;
-
-}
-
-extern uint64_t div64_64(uint64_t dividend, uint64_t divisor);
-#endif
diff -Nurp a/include/asm-x86/div64_64.h b/include/asm-x86/div64_64.h
--- a/include/asm-x86/div64_64.h	2007-10-20 07:33:53.0 -0400
+++ b/include/asm-x86/div64_64.h	1969-12-31 19:00:00.0 -0500
@@ -1 +0,0 @@
-#include <asm-generic/div64.h>
diff -Nurp a/include/asm-x86/div64.h b/include/asm-x86/div64.h
--- a/include/asm-x86/div64.h   2007-10-20 07:33:53.0 -0400
+++ b/include/asm-x86/div64.h   2007-10-20 07:32:34.0 -0400
@@ -1,5 +1,58 @@
+#ifndef _ASM_X86_DIV64_H
+#define _ASM_X86_DIV64_H
+
 #ifdef CONFIG_X86_32
-# include "div64_32.h"
-#else
-# include "div64_64.h"
-#endif
+
+#include <linux/types.h>
+
+/*
+ * do_div() is NOT a C function. It wants to return
+ * two values (the quotient and the remainder), but
+ * since that doesn't work very well in C, what it
+ * does is:
+ *
+ * - modifies the 64-bit dividend _in_place_
+ * - returns the 32-bit remainder
+ *
+ * This ends up being the most efficient calling
+ * convention on x86.
+ */
+#define do_div(n,base) ({ \
+	unsigned long __upper, __low, __high, __mod, __base; \
+	__base = (base); \
+	asm("":"=a" (__low), "=d" (__high):"A" (n)); \
+	__upper = __high; \
+	if (__high) { \
+		__upper = __high % (__base); \
+		__high = __high / (__base); \
+	} \
+	asm("divl %2":"=a" (__low), "=d" (__mod):"rm" (__base), "0" (__low), "1" (__upper)); \
+	asm("":"=A" (n):"a" (__low),"d" (__high)); \
+	__mod; \
+})
+
+/*
+ * (long)X = ((long long)divs) / (long)div
+ * (long)rem = ((long long)divs) % (long)div
+ *
+ * Warning, this will do an exception if X overflows.
+ */
+#define div_long_long_rem(a,b,c) div_ll_X_l_rem(a,b,c)
+
+static inline long
+div_ll_X_l_rem(long long divs, long div, long *rem)
+{
+   long dum2;
+	__asm__("divl %2":"=a"(dum2), "=d"(*rem)
+		:"rm"(div), "A"(divs));
+
+   return dum2;
+
+}
+
+extern uint64_t div64_64(uint64_t dividend, uint64_t divisor);
+
+# else
+#  include <asm-generic/div64.h>
+# endif /* CONFIG_X86_32 */
+#endif /* _ASM_X86_DIV64_H */
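
For anyone unfamiliar with the in-place convention described in the
comment above, a usage sketch (the values are arbitrary):

	u64 bytes = 1000000ULL;
	u32 rem;

	/* do_div() divides in place: bytes becomes the quotient and
	 * the 32-bit remainder is returned. */
	rem = do_div(bytes, 4096);
	/* now bytes == 244 and rem == 576 */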


Re: 2.6.25-rc1 panics on boot

2008-02-13 Thread Chris Snook

Dhaval Giani wrote:

I am getting the following oops on bootup on 2.6.25-rc1

...

I am booting using kexec with maxcpus=1. It does not have any problems
with maxcpus=2 or higher.


Sounds like another (the same?) kexec cpu numbering bug.  Can you 
post/link the entire dmesg from both a cold boot and a kexec boot so we 
can compare?


-- Chris


Re: linux-next build status

2008-02-14 Thread Chris Snook

Stephen Rothwell wrote:

Hi all,

Initial status can be seen here
http://kisskb.ellerman.id.au/kisskb/branch/9/ (I hope to make a better
URL soon).  Suggestions for more compiler/config combinations are
welcome, but we can't necessarily commit to fulfilling all your
wishes.  :-)



i386 allmodconfig please.

Also, I highly recommend adding some randconfig builds, at least one 32-bit arch 
and one 64-bit arch.  Any given randconfig build is not particularly likely to 
catch bugs that would be missed elsewhere, but doing them daily for two months 
will catch a lot of things before they get released.  The catch, of course, is 
that you have to actually save the .config for this to be useful, which might 
require a slight modification to your scripts.


-- Chris


Re: linux-next build status

2008-02-14 Thread Chris Snook

Tony Breeds wrote:

On Thu, Feb 14, 2008 at 08:24:27PM -0500, Chris Snook wrote:

Stephen Rothwell wrote:

Hi all,

Initial status can be seen here
http://kisskb.ellerman.id.au/kisskb/branch/9/ (I hope to make a better
URL soon).  Suggestions for more compiler/config combinations are
welcome, but we can't necessarily commit to fulfilling all your
wishes.  :-)


i386 allmodconfig please.


Won't i386 allmodconfig be equivalent to x86_64 allmodconfig?


Only if there are no bugs.

Driver code is most likely to trip over bitness/endianness bugs, and 
you've already got allmodconfig builds for be32, be64, and le64 
architectures.  Adding an le32 architecture (i386) completes the 
coverage of these basic categories.


-- Chris


[PATCH] make LKDTM depend on BLOCK

2008-02-15 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Make LKDTM depend on BLOCK to prevent build failures with certain configs.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a370fe8..24b327c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -524,6 +524,7 @@ config LKDTM
	tristate "Linux Kernel Dump Test Tool Module"
depends on DEBUG_KERNEL
depends on KPROBES
+   depends on BLOCK
default n
help
This module enables testing of the different dumping mechanisms by


[PATCH RESEND] x86_64: make atomic64_t work like atomic_t

2007-09-26 Thread Chris Snook
Regardless of the greater controversy about the semantics of atomic_t, I think
we can all agree that atomic_t and atomic64_t should have the same semantics.
This is presently not the case on x86_64, where the volatile keyword was
removed from the declaration of atomic_t, but it was not removed from the
declaration of atomic64_t.  The following patch fixes that inconsistency,
without delving into anything more controversial.

From: Chris Snook [EMAIL PROTECTED]

The volatile keyword has already been removed from the declaration of atomic_t
on x86_64.  For consistency, remove it from atomic64_t as well.

Signed-off-by: Chris Snook [EMAIL PROTECTED]
CC: Andi Kleen [EMAIL PROTECTED]

--- a/include/asm-x86_64/atomic.h   2007-07-08 19:32:17.0 -0400
+++ b/include/asm-x86_64/atomic.h   2007-09-13 11:30:51.0 -0400
@@ -206,7 +206,7 @@ static __inline__ int atomic_sub_return(
 
 /* An 64bit atomic type */
 
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i)   { (i) }
 


Re: Bonnie++ with 1024k stripe SW/RAID5 causes kernel to goto D-state

2007-09-29 Thread Chris Snook

Justin Piszcz wrote:

Kernel: 2.6.23-rc8 (older kernels do this as well)

When running the following command:
/usr/bin/time /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 
16:10:16:64


It hangs unless I increase various md/raid parameters, such as the 
stripe_cache_size etc.


# ps auxww | grep D
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root   276  0.0  0.0  0 0 ?D12:14   0:00 [pdflush]
root   277  0.0  0.0  0 0 ?D12:14   0:00 [pdflush]
root  1639  0.0  0.0  0 0 ?D   12:14   0:00 [xfsbufd]
root  1767  0.0  0.0   8100   420 ?Ds   12:14   0:00 
root  2895  0.0  0.0   5916   632 ?Ds   12:15   0:00 
/sbin/syslogd -r


See the bottom for more details.

Is this normal?  Does md only work without tuning up to a certain stripe 
size? I use a RAID 5 with a 1024k stripe, which works fine with many 
optimizations, but if I just boot the system and run bonnie++ on it 
without applying the optimizations, it will hang in D-state.  When I 
apply the optimizations, it exits D-state.  Pretty weird?


Not at all.  1024k stripes are way outside the norm.  If you do something way 
outside the norm, and don't tune for it in advance, don't be terribly surprised 
when something like bonnie++ brings your box to its knees.


That's not to say we couldn't make md auto-tune itself more intelligently, but 
this isn't really a bug.  With a sufficiently huge amount of RAM, you'd be able 
to dynamically allocate the buffers that you're not pre-allocating with 
stripe_cache_size, but bonnie++ is eating that up in this case.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: One process with multiple user ids.

2007-10-02 Thread Chris Snook

Giuliano Gagliardi wrote:

Hello,

I have a server that has to switch to different user ids, but because it does 
other complex things, I would rather not have it run as root.


Well, it's probably going to have to *start* as root, or use something like 
sudo.  It's probably easiest to have it start as root and drop privileges as 
soon as possible, certainly before handling any untrusted data.


 I only need the

server to be able to switch to certain pre-defined user ids.


This is a very easy special case.  Just start a process for each user ID and 
drop root privileges.  They can communicate via sockets or even shared memory. 
If you wanted to switch between arbitrary UIDs at runtime, it might be worth 
doing something exotic, but it's really not in this case.  Also, if you do it 
this way, it's rather easy to verify the correctness of your design, and you 
never have to touch kernel code.
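
A minimal sketch of that model, assuming made-up UIDs and with error
handling trimmed (the IPC between parent and workers is left out):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <grp.h>

/* hypothetical pre-defined UIDs; in real life these come from config */
static const uid_t worker_uids[] = { 1001, 1002, 1003 };

static void drop_to(uid_t uid, gid_t gid)
{
    if (setgroups(0, NULL) < 0 ||   /* clear supplementary groups */
        setgid(gid) < 0 ||          /* gid first, while still root */
        setuid(uid) < 0) {          /* then uid: no way back to root */
        perror("drop privileges");
        exit(1);
    }
}

int main(void)
{
    unsigned int i;

    for (i = 0; i < sizeof(worker_uids) / sizeof(worker_uids[0]); i++) {
        if (fork() == 0) {
            drop_to(worker_uids[i], (gid_t)worker_uids[i]);
            /* worker code runs here, unprivileged */
            _exit(0);
        }
    }
    return 0;   /* a real parent would wait() on the workers */
}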


I have seen that two possible solutions have already been suggested here on 
the LKML, but it was some years ago, and nothing like it has been 
implemented.


(1) Having supplementary user ids like there are supplementary group ids and 
system calls getuids() and setuids() that work like getgroups() and 
setgroups()


But you can already accomplish this with ACLs and SELinux.  You're trying to 
make this problem harder than it really is.



(2) Allowing processes to pass user and group ids via sockets.


And do what with them?  You can already pass arbitrary data via sockets.  It 
sounds like you need (1) to use (2).


Both (1) and (2) would solve my problem. Now my question is whether there are 
any fundamental flaws with (1) or (2), or whether the right way to solve my 
problem is another one.


(1) doesn't accomplish anything you can't already do, but it would make a huge 
mess of a lot of code.


(2) is silly.  Sockets are for communicating between userspace processes.  If 
you want to be granting/revoking credentials, you should be using system calls, 
and even then only if you absolutely must.  Having the kernel snoop traffic on 
sockets between processes would be disastrous for performance, and without that, 
any process could claim that it had been granted privileges over a socket and 
the kernel would just have to trust it.


Don't overthink this.  You don't need to touch the kernel at all to do this. 
Just use a multi-process model, like qmail does, for example.  You can start 
with root privileges and drop them, or use sudo to help you out.  It's fast, 
secure, takes advantage of modern multi-core CPUs, and is much simpler.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: gigabit ethernet power consumption

2007-10-08 Thread Chris Snook

Pavel Machek wrote:

Hi!

I've found that gbit vs. 100mbit power consumption difference is about
1W -- pretty significant. (Maybe powertop should include it in the
tips section? :).

Energy Star people insist that machines should switch down to 100mbit
when network is idle, and I guess that makes a lot of sense -- you
save 1W locally and 1W on the router.

Question is, how to implement it correctly? Daemon that would watch
data rates and switch speeds using mii-tool would be simple, but is
that enough?


I believe you misspelled ethtool.

While you're at it, why stop at 100Mb?  I believe you save even more power at 
10Mb, which is why WOL puts the card in 10Mb mode.  In my experience, you 
generally want either the maximum setting or the minimum setting when going for 
power savings, because of the race-to-idle effect.  Workloads that have a 
sustained fractional utilization are rare.  Right now I'm at home, hooked up to 
a cable modem, so anything over 4Mb is wasted, unless I'm talking to the box 
across the room, which is rare.
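
If someone does want to write that daemon, the mechanism is the SIOCETHTOOL 
ioctl that the ethtool binary wraps.  A rough sketch, assuming eth0, forcing 
100Mb full duplex, run as root, with minimal error handling:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ifreq ifr;
    struct ethtool_cmd ecmd;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    memset(&ecmd, 0, sizeof(ecmd));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ecmd;

    ecmd.cmd = ETHTOOL_GSET;            /* read current settings */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_GSET");
        return 1;
    }

    ecmd.cmd = ETHTOOL_SSET;            /* write modified settings */
    ecmd.speed = SPEED_100;
    ecmd.duplex = DUPLEX_FULL;
    ecmd.autoneg = AUTONEG_DISABLE;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_SSET");
        return 1;
    }
    return 0;
}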


Talk to the NetworkManager folks.  This is right up their alley.

-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE] DeskOpt - on fly task, i/o scheduler optimization

2007-08-31 Thread Chris Snook

Michal Piotrowski wrote:

Hi,

Here is something that might be useful for gamers and audio/video editors
http://www.stardust.webpages.pl/files/tools/deskopt/

You can easily tune CFS/CFQ scheduler params


I would think that gamers and AV editors would want to be using deadline 
(or maybe even as, the anticipatory scheduler), not cfq.  How well does it 
work with other I/O schedulers?


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: HIMEM calculation

2007-09-03 Thread Chris Snook

James C. Georgas wrote:

I'm not sure I understand how the kernel calculates the amount of
physical RAM it can map during the boot process.

I've quoted two blocks of kernel messages below, one for a kernel with
NOHIGHMEM and another for a kernel with HIGHMEM4G.

If I do the math on the BIOS provided physical RAM map, there is less
than 5MiB of the address space reserved. Since I only have 1GiB of
physical RAM in the board, I figured that it would still be possible to
physically map 1019MiB, even with the 3GiB/1GiB split between user space
and kernel space that occurs with NOHIGHMEM.

However, What actually happens is that I'm 127MiB short of a full GiB.

What am I missing here? Why does that last 127MiB have to go in HIGHMEM?


That's the vmalloc address space.  You only get 896 MB in the NORMAL 
zone on i386, to leave room for vmalloc; 1024 MiB minus that 896 MiB 
lowmem limit accounts for the ~127 MiB you're missing.  If you don't 
like it, go 64-bit.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: HIMEM calculation

2007-09-04 Thread Chris Snook

James Georgas wrote:

  That's the vmalloc address space.  You only get 896 MB in the NORMAL
  zone on i386, to leave room for vmalloc.  If you don't like it, go 
64-bit.

 
-- Chris

I like it fine. I just didn't understand it. Thanks for answering.

So, basically, the vmalloc address space is not backed by physical RAM,
right? Rather, the virtual address space associated with vmalloc is
mapped to physical pages by page tables?


Basically, yes, but that's an oversimplification.  We actually use page tables 
everywhere, but the conversion is simply +/- 0xC0000000 (PAGE_OFFSET) for the 
NORMAL zone, so 
we can skip most of the fancy VM work and just use a trivial macro.  vmalloc can 
allocate large chunks of virtually contiguous memory even when the physical 
memory is heavily fragmented, and since we've set aside address space for it, 
it's visible in all process contexts.
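
For reference, the trivial macro on i386 looks roughly like this (abridged 
from include/asm-i386/page.h):

/* lowmem virt<->phys translation is a constant offset, no table walk */
#define PAGE_OFFSET 0xC0000000UL

#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))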


vmalloc is handy sometimes because it can complete even if there's no memory 
free when it's called, since the VM will swap out user pages and then return 
those remapped into the vmalloc address space.  Unfortunately, we can't use 
vmalloc anywhere we want to use DMA because it will be accessed without the MMU. 
 Worse, we also can't use it in any path that could be called while trying to 
free memory, due to recursion issues, which substantially limits its utility in 
the kernel.  Some people *cough*OpenAFS*cough* use it carelessly and get all 
kinds of exciting panics under rare and difficult-to-reproduce load conditions.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mutex vs cache coherency protocol(for multiprocessor )

2007-09-04 Thread Chris Snook

Xu Yang wrote:

Hello everyone,

Just got a rough question in my head.

I don't know whether anyone is interested.

mutex vs cache coherency protocol(for multiprocessor)

Both of these can be used to protect a shared resource in memory.

Are both of them necessary?

For example:

in a multiprocessor system, if there is only a mutex and no cache
coherency, this would obviously cause problems.

What about the reverse: no mutex mechanism, only a cache coherency
protocol, in a multiprocessor system? After some consideration, I found
this could also cause problems when the processors are multithreading
processors, which means more than one thread can be running on one
processor. In this case, if we only have cache coherency and no mutex,
this would cause problems: because all the threads running on one
processor share one cache, the cache coherency protocol cannot protect
them anymore, and the shared resource could be corrupted by different
threads.

Then, if all the processors in the multiprocessor system are
single-threaded processors, only one thread can be running on each
processor. Is it OK in that case if we only have a cache coherency
protocol and no mutex mechanism?

Does anyone have any ideas? All comments are welcome and appreciated,
including criticism.


Cache coherency is necessary for SMP locking primitives (and thus Linux SMP 
support), but it is hardly sufficient.  Take a look at all the exciting inline 
assembly in include/asm/ for spinlocks, atomic operations, etc.
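
For a taste of it, here's a userspace rendition of i386 atomic_add() 
(abridged from include/asm-i386/atomic.h).  The lock prefix is the part 
that cache coherency alone doesn't give you: it makes the whole 
read-modify-write atomic with respect to the other processors.

#include <stdio.h>

typedef struct { volatile int counter; } atomic_t;

static inline void atomic_add(int i, atomic_t *v)
{
    __asm__ __volatile__(
        "lock; addl %1,%0"
        : "+m" (v->counter)
        : "ir" (i));
}

int main(void)
{
    atomic_t v = { 0 };

    atomic_add(5, &v);
    printf("%d\n", v.counter);  /* prints 5 */
    return 0;
}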


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: modinfo modulename question

2007-09-05 Thread Chris Snook

Justin Piszcz wrote:
Is there any way to get/see what parameters were passed to a kernel 
module? Running modinfo -p module will show the defaults, but for 
example, with st, the scsi tape driver, is there a way to see what it is 
currently using? I know dmesg shows this when you load the module 
initially, but what if dmesg has been cleared or the buffer has filled up?


/sys/module/$MODULENAME/parameters/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Health monitor of a multi-threaded process

2007-09-10 Thread Chris Snook

Yishai Hadas wrote:

Hi List,

I'm looking for a mechanism in a multi-threaded process to monitor the
health of its running threads, either by a specific monitor thread or by
any other mechanism.

It includes the following aspects:

1) Threads are running and not stuck on any lock. 


If you're using POSIX locking, you'll never find yourself busy-waiting for 
very long.  Use ps or top.


2) Threads are running and have not died accidentally. 


Use ps or top.

3) Threads are not consuming too much CPU/Memory. 


Use ps or top.  You'll have to decide how much is too much.

4) Threads are not in any infinite loop. 


This requires solving the Halting Problem.  If your management is demanding this 
feature, I suggest informing them that it is mathematically impossible.


Just use top or ps.  Don't reinvent the wheel.  We've got a really good wheel. 
If you don't like top or ps as is, read the ps man page to see all the fancy 
formatting it can do, and parse it with a simple script in your favorite 
scripting language.
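
If you'd rather do it in C than parse ps output, the same information is 
one directory read away.  A minimal sketch, without error reporting:

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/types.h>

/* list the thread IDs of a process, the same way ps -L finds them */
static int list_threads(pid_t pid)
{
    char path[64];
    struct dirent *de;
    DIR *dir;

    snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
    dir = opendir(path);
    if (!dir)
        return 1;

    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')   /* skip . and .. */
            continue;
        /* per-thread state lives in /proc/<pid>/task/<tid>/stat */
        printf("tid %s\n", de->d_name);
    }

    closedir(dir);
    return 0;
}

int main(int argc, char **argv)
{
    return (argc > 1) ? list_threads(atoi(argv[1])) : 1;
}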


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Document non-semantics of atomic_read() and atomic_set()

2007-09-10 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Unambiguously document the fact that atomic_read() and atomic_set()
do not imply any ordering or memory access, and that callers are
obligated to explicitly invoke barriers as needed to ensure that
changes to atomic variables are visible in all contexts that need
to see them.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- a/Documentation/atomic_ops.txt  2007-07-08 19:32:17.0 -0400
+++ b/Documentation/atomic_ops.txt  2007-09-10 19:02:50.0 -0400
@@ -12,7 +12,11 @@
 C integer type will fail.  Something like the following should
 suffice:
 
-   typedef struct { volatile int counter; } atomic_t;
+   typedef struct { int counter; } atomic_t;
+
+   Historically, counter has been declared volatile.  This is now
+discouraged.  See Documentation/volatile-considered-harmful.txt for the
+complete rationale.
 
The first operations to implement for atomic_t's are the
 initializers and plain reads.
@@ -42,6 +46,22 @@
 
 which simply reads the current value of the counter.
 
+*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
+
+Some architectures may choose to use the volatile keyword, barriers, or
+inline assembly to guarantee some degree of immediacy for atomic_read()
+and atomic_set().  This is not uniformly guaranteed, and may change in
+the future, so all users of atomic_t should treat atomic_read() and
+atomic_set() as simple C assignment statements that may be reordered or
+optimized away entirely by the compiler or processor, and explicitly
+invoke the appropriate compiler and/or memory barrier for each use case.
+Failure to do so will result in code that may suddenly break when used with
+different architectures or compiler optimizations, or even changes in
+unrelated code which changes how the compiler optimizes the section
+accessing atomic_t variables.
+
+*** YOU HAVE BEEN WARNED! ***
+
 Now, we move onto the actual atomic operation interfaces.
 
void atomic_add(int i, atomic_t *v);
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/4] CONFIG_STABLE: Define it

2007-07-20 Thread Chris Snook

Satyam Sharma wrote:

[ Just cleaning up my inbox, and stumbled across this thread ... ]


On 5/31/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

Introduce CONFIG_STABLE to control checks only useful for development.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
[...]
 menu General setup

+config STABLE
+   bool "Stable kernel"
+   help
+ If the kernel is configured to be a stable kernel then various
+ checks that are only of interest to kernel development will be
+ omitted.
+



A programmer who uses assertions during testing and turns them off
during production is like a sailor who wears a life vest while drilling
on shore and takes it off at sea.
   - Tony Hoare


Probably you meant to turn off debug _output_ (and not _checks_)
with this config option? But we already have CONFIG_FOO_DEBUG_BAR
for those situations ...


There are plenty of validation and debugging features in the kernel that go WAY 
beyond mere assertions, often imposing significant overhead (particularly when 
you scale up) or creating interfaces you'd never use unless you were doing 
kernel development work.  You really do want these features completely removed 
from production kernels.


The point of this is not to remove one-line WARN_ON and BUG_ON checks (though we 
might remove a few from fast paths), but rather to disable big chunks of 
debugging code that don't implement anything visible to a production workload.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/4] CONFIG_STABLE: Define it

2007-07-20 Thread Chris Snook

Satyam Sharma wrote:

On 7/20/07, Chris Snook [EMAIL PROTECTED] wrote:

Satyam Sharma wrote:
 [ Just cleaning up my inbox, and stumbled across this thread ... ]


 On 5/31/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Introduce CONFIG_STABLE to control checks only useful for development.

 Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
 [...]
  menu General setup

 +config STABLE
 +   bool Stable kernel
 +   help
 + If the kernel is configured to be a stable kernel then 
various
 + checks that are only of interest to kernel development 
will be

 + omitted.
 +


 A programmer who uses assertions during testing and turns them off
 during production is like a sailor who wears a life vest while drilling
 on shore and takes it off at sea.
- Tony Hoare


 Probably you meant to turn off debug _output_ (and not _checks_)
 with this config option? But we already have CONFIG_FOO_DEBUG_BAR
 for those situations ...

There are plenty of validation and debugging features in the kernel 
that go WAY
beyond mere assertions, often imposing significant overhead 
(particularly when
you scale up) or creating interfaces you'd never use unless you were 
doing
kernel development work.  You really do want these features completely 
removed

from production kernels.


As for entire such development/debugging-related features, most (all, 
really)

should anyway have their own config options.


They do.  With kconfig dependencies, we can ensure that those config options are 
off when CONFIG_STABLE is set.  That way you only have to set one option to 
ensure that all these expensive checks are disabled.


The point of this is not to remove one-line WARN_ON and BUG_ON checks 
(though we

might remove a few from fast paths), but rather to disable big chunks of
debugging code that don't implement anything visible to a production 
workload.


Oh yes, but it's still not clear to me why or how a kernel-wide 
CONFIG_STABLE

or CONFIG_RELEASE would help ... what's wrong with finer granularity
CONFIG_xxx_DEBUG_xxx kind of knobs?


With kconfig dependencies, we can keep the fine granularity, but not have to 
spend a half hour digging through the configuration to make sure we have a 
production-suitable kernel.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/4] CONFIG_STABLE: Define it

2007-07-20 Thread Chris Snook

Satyam Sharma wrote:

On 7/20/07, Chris Snook [EMAIL PROTECTED] wrote:

Satyam Sharma wrote:
 On 7/20/07, Chris Snook [EMAIL PROTECTED] wrote:
 Satyam Sharma wrote:
  [ Just cleaning up my inbox, and stumbled across this thread ... ]
 
 
  On 5/31/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  Introduce CONFIG_STABLE to control checks only useful for 
development.

 
  Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
  [...]
   menu General setup
 
  +config STABLE
  +   bool Stable kernel
  +   help
  + If the kernel is configured to be a stable kernel then
 various
  + checks that are only of interest to kernel development
 will be
  + omitted.
  +
 
 
  A programmer who uses assertions during testing and turns them off
  during production is like a sailor who wears a life vest while 
drilling

  on shore and takes it off at sea.
 - Tony Hoare
 
 
  Probably you meant to turn off debug _output_ (and not _checks_)
  with this config option? But we already have CONFIG_FOO_DEBUG_BAR
  for those situations ...

 There are plenty of validation and debugging features in the kernel
 that go WAY
 beyond mere assertions, often imposing significant overhead
 (particularly when
 you scale up) or creating interfaces you'd never use unless you were
 doing
 kernel development work.  You really do want these features completely
 removed
 from production kernels.

 As for entire such development/debugging-related features, most (all,
 really)
 should anyway have their own config options.

They do.  With kconfig dependencies, we can ensure that those config 
options are
off when CONFIG_STABLE is set.  That way you only have to set one 
option to

ensure that all these expensive checks are disabled.


Oh, so you mean use this (the negation of this, actually) as a universal
kconfig dependency of all other such development/debugging related stuff?
Hmm, the name is quite misleading in that case.


There are many different ways you can use it.  If I'm writing a configurable 
feature, I could make it depend on !CONFIG_STABLE, or I could ifdef my code out 
if CONFIG_STABLE is set, unless a more granular option is also set.  The 
maintainer of the code that uses the config option has a lot of flexibility, at 
least until we start enforcing standards.
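
For instance (names invented, purely illustrative), a heavyweight debugging 
interface might be compiled away like this unless its own finer-grained 
option re-enables it:

#if defined(CONFIG_STABLE) && !defined(CONFIG_FOO_DEBUG)
static inline void foo_validate_all(void) { }   /* compiled out entirely */
#else
extern void foo_validate_all(void);             /* expensive full check */
#endif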


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-23 Thread Chris Snook

Tong Li wrote:
This patch extends CFS to achieve better fairness for SMPs. For example, 
with 10 tasks (same priority) on 8 CPUs, it enables each task to receive 
equal CPU time (80%). The code works on top of CFS and provides SMP 
fairness at a coarser time grainularity; local on each CPU, it relies on 
CFS to provide fine-grained fairness and good interactivity.


The code is based on the distributed weighted round-robin (DWRR) 
algorithm. It keeps two RB trees on each CPU: one is the original 
cfs_rq, referred to as active, and one is a new cfs_rq, called 
round-expired. Each CPU keeps a round number, initially zero. The 
scheduler works exactly the same way as in CFS, but only runs tasks from 
the active tree. Each task is assigned a round slice, equal to its 
weight times a system constant (e.g., 100ms), controlled by 
sysctl_base_round_slice. When a task uses up its round slice, it moves 
to the round-expired tree on the same CPU and stops running. Thus, at 
any time on each CPU, the active tree contains all tasks that are 
running in the current round, while tasks in round-expired have all 
finished the current round and await to start the next round. When an 
active tree becomes empty, it calls idle_balance() to grab tasks of the 
same round from other CPUs. If none can be moved over, it switches its 
active and round-expired trees, thus unleashing round-expired tasks and 
advancing the local round number by one. An invariant it maintains is 
that the round numbers of any two CPUs in the system differ by at most 
one. This property ensures fairness across CPUs. The variable 
sysctl_base_round_slice controls fairness-performance tradeoffs: a 
smaller value leads to better cross-CPU fairness at the potential cost 
of performance; on the other hand, the larger the value is, the closer 
the system behavior is to the default CFS without the patch.


Any comments and suggestions would be highly appreciated.


This patch is massive overkill.  Maybe you're not seeing the overhead on your 
8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT 
workload.  Do you have any data justifying the need for this patch?


Doing anything globally is expensive, and should be avoided at all costs.  The 
scheduler already rebalances when a CPU is idle, so you're really just 
rebalancing the overload here.  On a server workload, we don't necessarily want 
to do that, since the overload may be multiple threads spawned to service a 
single request, and could be sharing a lot of data.


Instead of an explicit system-wide fairness invariant (which will get very hard 
to enforce when you throw SCHED_FIFO processes into the mix and the scheduler 
isn't running on some CPUs), try a simpler invariant.  If we guarantee that the 
load on CPU X does not differ from the load on CPU (X+1)%N by more than some 
small constant, then we know that the system is fairly balanced.  We can achieve 
global fairness with local balancing, and avoid all this overhead.  This has the 
added advantage of keeping most of the migrations core/socket/node-local on 
SMT/multicore/NUMA systems.
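
To illustrate the shape of the idea (cpu_load() and migrate_tasks() are 
invented stand-ins here, not real kernel interfaces):

#define LOAD_TOLERANCE 2    /* max allowed neighbor imbalance, in tasks */

extern long cpu_load(int cpu);                          /* hypothetical */
extern void migrate_tasks(int src, int dst, long n);    /* hypothetical */

/*
 * Each CPU only ever compares itself with its ring neighbor, so a
 * bounded local imbalance implies a bounded global imbalance, with no
 * global locking required.
 */
static void balance_with_neighbor(int cpu, int ncpus)
{
    int neighbor = (cpu + 1) % ncpus;
    long diff = cpu_load(cpu) - cpu_load(neighbor);

    if (diff > LOAD_TOLERANCE)
        migrate_tasks(cpu, neighbor, diff / 2);     /* shed load forward */
    else if (diff < -LOAD_TOLERANCE)
        migrate_tasks(neighbor, cpu, -diff / 2);    /* pull load back */
}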


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Chris Snook wrote:

Tong Li wrote:
This patch extends CFS to achieve better fairness for SMPs. For 
example, with 10 tasks (same priority) on 8 CPUs, it enables each task 
to receive equal CPU time (80%). The code works on top of CFS and 
provides SMP fairness at a coarser time grainularity; local on each 
CPU, it relies on CFS to provide fine-grained fairness and good 
interactivity.


The code is based on the distributed weighted round-robin (DWRR) 
algorithm. It keeps two RB trees on each CPU: one is the original 
cfs_rq, referred to as active, and one is a new cfs_rq, called 
round-expired. Each CPU keeps a round number, initially zero. The 
scheduler works exactly the same way as in CFS, but only runs tasks 
from the active tree. Each task is assigned a round slice, equal to 
its weight times a system constant (e.g., 100ms), controlled by 
sysctl_base_round_slice. When a task uses up its round slice, it moves 
to the round-expired tree on the same CPU and stops running. Thus, at 
any time on each CPU, the active tree contains all tasks that are 
running in the current round, while tasks in round-expired have all 
finished the current round and await to start the next round. When an 
active tree becomes empty, it calls idle_balance() to grab tasks of 
the same round from other CPUs. If none can be moved over, it switches 
its active and round-expired trees, thus unleashing round-expired 
tasks and advancing the local round number by one. An invariant it 
maintains is that the round numbers of any two CPUs in the system 
differ by at most one. This property ensures fairness across CPUs. The 
variable sysctl_base_round_slice controls fairness-performance 
tradeoffs: a smaller value leads to better cross-CPU fairness at the 
potential cost of performance; on the other hand, the larger the value 
is, the closer the system behavior is to the default CFS without the 
patch.


Any comments and suggestions would be highly appreciated.


This patch is massive overkill.  Maybe you're not seeing the overhead on 
your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a 
partially-RT workload.  Do you have any data justifying the need for 
this patch?


Doing anything globally is expensive, and should be avoided at all 
costs.  The scheduler already rebalances when a CPU is idle, so you're 
really just rebalancing the overload here.  On a server workload, we 
don't necessarily want to do that, since the overload may be multiple 
threads spawned to service a single request, and could be sharing a lot 
of data.


Instead of an explicit system-wide fairness invariant (which will get 
very hard to enforce when you throw SCHED_FIFO processes into the mix 
and the scheduler isn't running on some CPUs), try a simpler invariant.  
If we guarantee that the load on CPU X does not differ from the load on 
CPU (X+1)%N by more than some small constant, then we know that the 
system is fairly balanced.  We can achieve global fairness with local 
balancing, and avoid all this overhead.  This has the added advantage of 
keeping most of the migrations core/socket/node-local on 
SMT/multicore/NUMA systems.


-- Chris


To clarify, I'm not suggesting that the balance with cpu (x+1)%n only 
algorithm is the only way to do this.  Rather, I'm pointing out that 
even an extremely simple algorithm can give you fair loading when you 
already have CFS managing the runqueues.  There are countless more 
sophisticated ways we could do this without using global locking, or 
possibly without any locking at all, other than the locking we already 
use during migration.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Tong Li wrote:

On Mon, 23 Jul 2007, Chris Snook wrote:

This patch is massive overkill.  Maybe you're not seeing the overhead 
on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a 
partially-RT workload.  Do you have any data justifying the need for 
this patch?


Doing anything globally is expensive, and should be avoided at all 
costs. The scheduler already rebalances when a CPU is idle, so you're 
really just rebalancing the overload here.  On a server workload, we 
don't necessarily want to do that, since the overload may be multiple 
threads spawned to service a single request, and could be sharing a 
lot of data.


Instead of an explicit system-wide fairness invariant (which will get 
very hard to enforce when you throw SCHED_FIFO processes into the mix 
and the scheduler isn't running on some CPUs), try a simpler 
invariant.  If we guarantee that the load on CPU X does not differ 
from the load on CPU (X+1)%N by more than some small constant, then we 
know that the system is fairly balanced.  We can achieve global 
fairness with local balancing, and avoid all this overhead.  This has 
the added advantage of keeping most of the migrations 
core/socket/node-local on SMT/multicore/NUMA systems.




Chris,

These are all good comments. Thanks. I see three concerns and I'll try 
to address each.


1. Unjustified effort/cost

My view is that fairness (or proportional fairness) is a first-order 
metric and necessary in many cases even at the cost of performance.


In the cases where it's critical, we have realtime.  In the cases where it's 
important, this implementation won't keep latency low enough to make people 
happier.  If you've got a test case to prove me wrong, I'd like to see it.


A 
server running multiple client apps certainly doesn't want the clients 
to see that they are getting different amounts of service, assuming the 
clients are of equal importance (priority).


A conventional server receives client requests, does a brief amount of work, and 
then gives a response.  This patch doesn't help that workload.  This patch helps 
the case where you've got batch jobs running on a slightly overloaded compute 
server, and unfairness means you end up waiting for a couple threads to finish 
at the end while CPUs sit idle.  I don't think it's that big of a problem, and 
if it is, I think we can solve it in a more elegant way than reintroducing 
expired queues.


When the clients have 
different priorities, the server also wants to give them service time 
proportional to their priority/weight. The same is true for desktops, 
where users want to nice tasks and see an effect that's consistent with 
what they expect, i.e., task CPU time should be proportional to their 
nice values. The point is that it's important to enforce fairness 
because it enables users to control the system in a deterministic way 
and it helps each task get good response time. CFS achieves this on 
local CPUs and this patch makes the support stronger for SMPs. It's 
overkill to enforce unnecessary degree of fairness, but it is necessary 
to enforce an error bound, even if large, such that the user can 
reliably know what kind of CPU time (even performance) he'd get after 
making a nice value change.


Doesn't CFS already do this?

This patch ensures an error bound of (max 
task weight currently in system) * sysctl_base_round_slice compared to 
an idealized fair system.


The thing that bugs me about this is the diminishing returns.  It looks like it 
will only give a substantial benefit when system load is somewhere between 1.0 
and 2.0.  On a heavily-loaded system, CFS will do the right thing within a good 
margin of error, and on an underloaded system, even a naive scheduler will do 
the right thing.  If you want to optimize smp fairness in this range, that's 
great, but there's probably a lighter-weight way to do it.



2. High performance overhead

Two sources of overhead: (1) the global rw_lock, and (2) task 
migrations. I agree they can be problems on NUMA, but I'd argue they are 
not on SMPs. Any global lock can cause two performance problems: (1) 
serialization, and (2) excessive remote cache accesses and traffic. IMO 
(1) is not a problem since this is a rw_lock and a write_lock occurs 
infrequently only when all tasks in the system finish the current round. 
(2) could be a problem as every read/write lock causes an invalidation. 
It could be improved by using Nick's ticket lock. On the other hand, 
this is a single cache line and it's invalidated only when a CPU 
finishes all tasks in its local active RB tree, where each nice 0 task 
takes sysctl_base_round_slice (e.g., 30ms) to finish, so it looks to me 
the invalidations would be infrequent enough and could be noise in the 
whole system.


Task migrations don't bother me all that much.  Since we're migrating the 
*overload*, I expect those processes to be fairly cache-cold whenever we get 
around to them anyway.  It'd be nice to be SMT/multicore

Re: miserable performance of 2.6.21 under network load

2007-07-24 Thread Chris Snook

Aaron Porter wrote:

I'm in the process up upgrading a pool of apache servers from
2.6.17.8 to 2.6.21.5, and we're seeing a pretty major change in behavior.
Under identical network load, 2.6.21 has a load average more than 3 times
higher, cpu 0 spends well over 90% of its time in interrupts (vs ~30%
under 2.6.17). When we hit 3k apache sessions, ksoftirqd eats 100% of cpu0
and our network traffic drops off rapidly. The end result is that 2.6.17
performs twice as well under this load.


Is it always CPU 0, or does it move?  Are you running irqbalance?  If you're 
running irqbalance, you can run a script that alternates between 'cat 
/proc/interrupts' and 'mpstat -P ALL 5 10' and watch the offending interrupt 
jump around between processors.  It's not as informative as oprofile, as Andi 
suggested, but it's really easy to set up.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: miserable performance of 2.6.21 under network load

2007-07-24 Thread Chris Snook

Aaron Porter wrote:

On Tue, Jul 24, 2007 at 08:48:00PM +0200, Andi Kleen wrote:

Aaron Porter [EMAIL PROTECTED] writes:


I'm in the process up upgrading a pool of apache servers from
2.6.17.8 to 2.6.21.5, and we're seeing a pretty major change in behavior.
Under identical network load, 2.6.21 has a load average more than 3 times
higher, cpu 0 spends well over 90% of its time in interrupts (vs ~30%
under 2.6.17). When we hit 3k apache sessions, ksoftirqd eats 100% of cpu0
and our network traffic drops off rapidly. The end result is that 2.6.17
performs twice as well under this load.

Can you oprofile it?


# opreport -l
CPU: AMD64 processors, speed 1994.52 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask 
of 0x00 (No unit mask) count 10
samples  %app name symbol name
914379   48.8404  vmlinux-2.6.21.5 check_poison_obj
341920   18.2632  vmlinux-2.6.21.5 poison_obj


I bet you have CONFIG_DEBUG_SLAB turned off in your 2.6.17 kernel, and turned on 
in your 2.6.21 kernel.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Chris Friesen wrote:

Chris Snook wrote:

Concerns aside, I agree that fairness is important, and I'd really 
like to see a test case that demonstrates the problem.


One place that might be useful is the case of fairness between resource 
groups, where the load balancer needs to consider each group separately.


You mean like the CFS group scheduler patches?  I don't see how this patch is 
related to that, besides working on top of it.


Now it may be the case that trying to keep the load of each class within 
X% of the other cpus is sufficient, but it's not trivial.


I agree.  My suggestion is that we try being fair from the bottom-up, rather 
than top-down.  If most of the rebalancing is local, we can minimize expensive 
locking and cross-node migrations, and scale very nicely on large NUMA boxes.


Consider the case where you have a resource group that is allocated 50% 
of each cpu in a dual cpu system, and only have a single task in that 
group.  This means that in order to make use of the full group 
allocation, that task needs to be load-balanced to the other cpu as soon 
as it gets scheduled out.  Most load-balancers can't handle that kind of 
granularity, but I have guys in our engineering team that would really 
like this level of performance.


Divining the intentions of the administrator is an AI-complete problem and we're 
not going to try to solve that in the kernel.  An intelligent administrator 
could also allocate 50% of each CPU to a resource group containing all the 
*other* processes.  Then, when the other processes are scheduled out, your 
single task will run on whichever CPU is idle.  This will very quickly 
equilibrate to the scheduling ping-pong you seem to want.  The scheduler 
deliberately avoids this kind of migration by default because it hurts cache and 
TLB performance, so if you want to override this very sane default behavior, 
you're going to have to explicitly configure it yourself.


We currently use CKRM on an SMP machine, but the only way we can get 
away with it is because our main app is affined to one cpu and just 
about everything else is affined to the other.


If you're not explicitly allocating resources, you're just low-latency, not 
truly realtime.  Realtime requires guaranteed resources, so messing with 
affinities is a necessary evil.


We have another SMP box that would benefit from group scheduling, but we 
can't use it because the load balancer is not nearly good enough.


Which scheduler?  Have you tried the CFS group scheduler patches?

-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Li, Tong N wrote:

On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote:

Divining the intentions of the administrator is an AI-complete problem and we're 
not going to try to solve that in the kernel.  An intelligent administrator 
could also allocate 50% of each CPU to a resource group containing all the 
*other* processes.  Then, when the other processes are scheduled out, your 
single task will run on whichever CPU is idle.  This will very quickly 
equilibrate to the scheduling ping-pong you seem to want.  The scheduler 
deliberately avoids this kind of migration by default because it hurts cache and 
TLB performance, so if you want to override this very sane default behavior, 
you're going to have to explicitly configure it yourself.




Well, the admin wouldn't specifically ask for 50% of each CPU. He would
just allocate 50% of total CPU time---it's up to the scheduler to
fulfill that. If a task is entitled to one CPU, then it'll stay there
and have no migration. Migration occurs only if there's overload, in
which case I think you agree in your last email that the cache and TLB
impact is not an issue (at least in SMP).


I don't think Chris's scenario has much bearing on your patch.  What he wants is 
to have a task that will always be running, but can't monopolize either CPU. 
This is useful for certain realtime workloads, but as I've said before, realtime 
requires explicit resource allocation.  I don't think this is very relevant to 
SCHED_FAIR balancing.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Bill Huey (hui) wrote:

On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote:

Chris Friesen wrote:
We currently use CKRM on an SMP machine, but the only way we can get away 
with it is because our main app is affined to one cpu and just about 
everything else is affined to the other.
If you're not explicitly allocating resources, you're just low-latency, not 
truly realtime.  Realtime requires guaranteed resources, so messing with 
affinities is a necessary evil.


You've mentioned this twice in this thread. If you're going to talk about this
you should characterize it more specifically, because resource allocation is
a rather incomplete area in Linux.


Well, you need enough CPU time to meet your deadlines.  You need pre-allocated 
memory, or to be able to guarantee that you can allocate memory fast enough to 
meet your deadlines.  This principle extends to any other shared resource, such 
as disk or network.  I'm being vague because it's open-ended.  If a medical 
device fails to meet realtime guarantees because the battery fails, the 
patient's family isn't going to care how correct the software is.  Realtime 
engineering is hard.



Rebalancing was still an open research
problem the last time I looked.


Actually, it's worse than merely an open problem.  A clairvoyant fair scheduler 
with perfect future knowledge can underperform a heuristic fair scheduler, 
because the heuristic scheduler can guess the future incorrectly resulting in 
unfair but higher-throughput behavior.  This is a perfect example of why we only 
try to be as fair as is beneficial.



Tong's previous trio patch is an attempt at resolving this using a generic
grouping mechanism and some constructive discussion should come of it.


Sure, but it seems to me to be largely orthogonal to this patch.

-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Chris Friesen wrote:

Chris Snook wrote:

I don't think Chris's scenario has much bearing on your patch.  What 
he wants is to have a task that will always be running, but can't 
monopolize either CPU. This is useful for certain realtime workloads, 
but as I've said before, realtime requires explicit resource 
allocation.  I don't think this is very relevant to SCHED_FAIR balancing.


I'm not actually using the scenario I described; it's just sort of a 
worst-case load-balancing thought experiment.


What we want to be able to do is to specify a fraction of each cpu for 
each task group.  We don't want to have to affine tasks to particular cpus.


A fraction of *each* CPU, or a fraction of *total* CPU?  Per-cpu granularity 
doesn't make anything more fair.  You've got a big bucket of MIPS you want to 
divide between certain groups, but it shouldn't make a difference which CPUs 
those MIPS come from, other than the fact that we try to minimize overhead 
induced by migration.


This means that the load balancer must be group-aware, and must trigger 
a re-balance (possibly just for a particular group) as soon as the cpu 
allocation for that group is used up on a particular cpu.


If I have two threads with the same priority, and two CPUs, the scheduler will 
put one on each CPU, and they'll run happily without any migration or balancing. 
 It sounds like you're saying that every X milliseconds, you want both to 
expire, be forbidden from running on the current CPU for the next X 
milliseconds, and then migrated to the other CPU.  There's no gain in fairness 
here, and there's a big drop in performance.


I suggested local fairness as a means to achieve global fairness because it 
could reduce overhead, and by adding the margin of error at each level in the 
locality hierarchy, you can get an algorithm which naturally tolerates the level 
of unfairness beyond which it is impossible to optimize.  Strict local fairness 
for its own sake doesn't accomplish anything that's better than global fairness.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Chris Snook

Chris Friesen wrote:

Ingo Molnar wrote:

the 3s is the problem: change that to 60s! We no way want to 
over-migrate for SMP fairness, the change i did gives us reasonable 
long-term SMP fairness without the need for high-rate rebalancing.


Actually, I do have requirements from our engineering guys for 
short-term fairness.  They'd actually like decent fairness over even 
shorter intervals...1 second would be nice, 2 is acceptable.


They are willing to trade off random peak performance for predictability.

Chris



The sysctls for CFS have nanosecond resolution.  They default to 
millisecond-order values, but you can set them much lower.  See sched_fair.c for 
the knobs and their explanations.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Chris Snook

Li, Tong N wrote:

On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote:

Chris Friesen wrote:

Ingo Molnar wrote:

the 3s is the problem: change that to 60s! We no way want to 
over-migrate for SMP fairness, the change i did gives us reasonable 
long-term SMP fairness without the need for high-rate rebalancing.
Actually, I do have requirements from our engineering guys for 
short-term fairness.  They'd actually like decent fairness over even 
shorter intervals...1 second would be nice, 2 is acceptable.


They are willing to trade off random peak performance for predictability.

Chris

The sysctls for CFS have nanosecond resolution.  They default to 
millisecond-order values, but you can set them much lower.  See sched_fair.c for 
the knobs and their explanations.


-- Chris


This is incorrect. Those knobs control local-CPU fairness granularity
but have no control over fairness across CPUs.

I'll do some benchmarking as Ingo suggested.

  tong


CFS naturally enforces cross-CPU fairness anyway, as Ingo demonstrated. 
Lowering the local CPU parameters should cause system-wide fairness to converge 
faster.  It might be worthwhile to create a more explicit knob for this, but I'm 
inclined to believe we could do it in much less than 700 lines.  Ingo's 
one-liner to improve the 10/8 balancing case, and the resulting improvement, 
were exactly what I was saying should be possible and desirable.  TCP Nagle 
aside, it generally shouldn't take 700 lines of code to speed up the rate of 
convergence of something that already converges.


Until now I've been watching the scheduler rewrite from the sidelines, but I'm 
digging into it now.  I'll try to give some more constructive criticism soon.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread Chris Snook

Tong Li wrote:
I'd like to clarify that I'm not trying to push this particular code to 
the kernel. I'm a researcher. My intent was to point out that we have a 
problem in the scheduler and my dwrr algorithm can potentially help fix 
it. The patch itself was merely a proof-of-concept. I'd be thrilled if 
the algorithm can be proven useful in the real world. I appreciate the 
people who have given me comments. Since then, I've revised my 
algorithm/code. Now it doesn't require global locking but retains strong 
fairness properties (which I was able to prove mathematically).


Thanks for doing this work.  Please don't take the implementation criticism as a 
lack of appreciation for the work.  I'd like to see dwrr in the scheduler, but 
I'm skeptical that re-introducing expired runqueues is the most efficient way to 
do it.


Given the inherently controversial nature of scheduler code, particularly that 
which attempts to enforce fairness, perhaps a concise design document would help 
us come to an agreement about what we think the scheduler should do and what 
tradeoffs we're willing to make to do those things.  Do you have a design 
document we could discuss?


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread Chris Snook

Tong Li wrote:

On Fri, 27 Jul 2007, Chris Snook wrote:


Tong Li wrote:
I'd like to clarify that I'm not trying to push this particular code 
to the kernel. I'm a researcher. My intent was to point out that we 
have a problem in the scheduler and my dwrr algorithm can potentially 
help fix it. The patch itself was merely a proof-of-concept. I'd be 
thrilled if the algorithm can be proven useful in the real world. I 
appreciate the people who have given me comments. Since then, I've 
revised my algorithm/code. Now it doesn't require global locking but 
retains strong fairness properties (which I was able to prove 
mathematically).


Thanks for doing this work.  Please don't take the implementation 
criticism as a lack of appreciation for the work.  I'd like to see 
dwrr in the scheduler, but I'm skeptical that re-introducing expired 
runqueues is the most efficient way to do it.


Given the inherently controversial nature of scheduler code, 
particularly that which attempts to enforce fairness, perhaps a 
concise design document would help us come to an agreement about what 
we think the scheduler should do and what tradeoffs we're willing to 
make to do those things.  Do you have a design document we could discuss?


-- Chris



Thanks for the interest. Attached is a design doc I wrote several months 
ago (with small modifications). It talks about the two pieces of my 
design: group scheduling and dwrr. The description was based on the 
original O(1) scheduler, but as my CFS patch showed, the algorithm is 
applicable to other underlying schedulers as well. It's interesting that 
I started working on this in January for the purpose of eventually 
writing a paper about it. So I knew reasonably well the related research 
work but was totally unaware that people in the Linux community were 
also working on similar things. This is good. If you are interested, I'd 
like to help with the algorithms and theory side of the things.


  tong

---
Overview:

Trio extends the existing Linux scheduler with support for 
proportional-share scheduling. It uses a scheduling algorithm, called 
Distributed Weighted Round-Robin (DWRR), which retains the existing 
scheduler design as much as possible, and extends it to achieve 
proportional fairness with O(1) time complexity and a constant error 
bound, compared to the ideal fair scheduling algorithm. The goal of Trio 
is not to improve interactive performance; rather, it relies on the 
existing scheduler for interactivity and extends it to support MP 
proportional fairness.


Trio has two unique features: (1) it enables users to control shares of 
CPU time for any thread or group of threads (e.g., a process, an 
application, etc.), and (2) it enables fair sharing of CPU time across 
multiple CPUs. For example, with ten tasks running on eight CPUs, Trio 
allows each task to take an equal fraction of the total CPU time. These 
features enable Trio to complement the existing Linux scheduler to 
enable greater user flexibility and stronger fairness.


Background:

Over the years, there has been a lot of criticism that conventional Unix 
priorities and the nice interface provide insufficient support for users 
to accurately control CPU shares of different threads or applications. 
Many have studied scheduling algorithms that achieve proportional 
fairness. Assuming that each thread has a weight that expresses its 
desired CPU share, informally, a scheduler is proportionally fair if (1) 
it is work-conserving, and (2) it allocates CPU time to threads in exact 
proportion to their weights in any time interval. Ideal proportional 
fairness is impractical since it requires that all runnable threads be 
running simultaneously and scheduled with infinitesimally small quanta. 
In practice, every proportional-share scheduling algorithm approximates 
the ideal algorithm with the goal of achieving a constant error bound. 
For more theoretical background, please refer to the following papers:


I don't think that achieving a constant error bound is always a good thing.  We 
all know that fairness has overhead.  If I have 3 threads and 2 processors, and 
I have a choice between fairly giving each thread 1.0 billion cycles during the 
next second, or unfairly giving two of them 1.1 billion cycles and giving the 
other 0.9 billion cycles, then we can have a useful discussion about where we 
want to draw the line on the fairness/performance tradeoff.  On the other hand, 
if we can give two of them 1.1 billion cycles and still give the other one 1.0 
billion cycles, it's madness to waste those 0.2 billion cycles just to avoid 
user jealousy.  The more complex the memory topology of a system, the more 
free cycles you'll get by tolerating short-term unfairness.  As a crude 
heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, 
but eventually we should take the boot-time computed migration costs into 
consideration.
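
Something as dumb as this would be a starting point; the base percentage is 
invented, and only the log2 scaling is the point:

#define BASE_TOLERANCE_PCT 2    /* hypothetical per-CPU-pair slack */

static unsigned int ilog2_uint(unsigned int n)
{
    unsigned int r = 0;

    while (n >>= 1)
        r++;
    return r;
}

/*
 * Tolerated cross-CPU unfairness grows with log2(NCPUS), so big NUMA
 * boxes aren't forced into constant migration chasing perfect fairness.
 */
static unsigned int fairness_tolerance_pct(unsigned int ncpus)
{
    return BASE_TOLERANCE_PCT * (1 + ilog2_uint(ncpus));
}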



[1] A. K

Re: Volanomark slows by 80% under CFS

2007-07-27 Thread Chris Snook

Tim Chen wrote:

Ingo,

Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
Benchmark was run on a 2 socket Core2 machine.


The change in scheduler treatment of sched_yield 
could play a part in changing Volanomark behavior.

In CFS, sched_yield is implemented
by dequeueing and requeueing a process.  The time a process
has spent running probably reduces the CPU time due to it
by only a bit. The process could get re-queued pretty close
to the head of the queue, and may get scheduled again pretty
quickly if there is still a lot of CPU time due.


It may make sense to queue the
yielding process a bit further back in the queue.
I made a slight change for experimentation: zeroing out
wait_runtime (i.e., having the process give up the CPU
time due to it).
Let's put aside for a second the gripe that Volanomark
should have used a better mechanism than sched_yield to
coordinate threads.  With this change, Volanomark runs
better and is only 40% (instead of 80%) down from the
old scheduler.


Of course we should not tune for Volanomark and this is
reference data. 
What are your views on how CFS's sched_yield should behave?


Regards,
Tim


The primary purpose of sched_yield is for SCHED_FIFO realtime processes, where
nothing else will run, ever, unless the running thread blocks or yields the CPU.
Under CFS, the yielding process will still be leftmost in the rbtree,
otherwise it would have already been scheduled out.


Zeroing out wait_runtime on sched_yield strikes me as completely appropriate.
If the process wanted to sleep for a finite duration, it should actually call a
sleep function, but sched_yield is essentially saying "I don't have anything
else to do right now", so it's hardly fair to claim you've been waiting for your
chance when you just gave it up.
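
As a toy model of that semantic (plain C for illustration; none of these names
are from the actual CFS code):

#include <stdio.h>

/* Each runnable task tracks how much CPU time the scheduler still owes it.
 * On yield, zero that credit and requeue: the yielder cannot claim to have
 * been waiting for its chance when it just gave it up. */
struct toy_task {
	const char *name;
	long wait_runtime_ns;	/* CPU time still due to this task */
};

static void toy_yield(struct toy_task *t)
{
	t->wait_runtime_ns = 0;	/* forfeit the accumulated credit... */
	/* ...then requeue t wherever zero credit places it */
}

int main(void)
{
	struct toy_task t = { "volano-worker", 1500000 };

	toy_yield(&t);
	printf("%s now has %ld ns due\n", t.name, t.wait_runtime_ns);
	return 0;
}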


As for the remaining 40% degradation, if Volanomark is using it for 
synchronization, the scheduler is probably cycling through threads until it gets 
to the one that actually wants to do work.  The O(1) scheduler will do this very 
 quickly, whereas CFS has a bit more overhead.  Interactivity boosting may have 
also helped the old scheduler find the right thread faster.


I think Volanomark is being pretty stupid, and deserves to run slowly, but there 
are legitimate reasons to want to call sched_yield in a non-SCHED_FIFO process. 
 If I'm performing multiple different calculations on the same set of data in 
multiple threads, and accessing the shared data in a linear fashion, I'd like to 
be able to have one thread give the other some CPU time so they can stay at the 
same point in the stream and improve cache hit rates, but this is only an 
optimization if I can do it without wasting CPU or gradually nicing myself into 
oblivion.  Having sched_yield zero out wait_runtime seems like an appropriate 
way to make this use case work to the extent possible.  Any user attempting such 
an optimization should have the good sense to do real work between sched_yield 
calls, to avoid calling the scheduler in a tight loop.


-- Chris


Re: swap-prefetch: A smart way to make good use of idle resources (was: updatedb)

2007-07-27 Thread Chris Snook

Al Boldi wrote:

People wrote:

I believe the users who say their apps really do get paged back in
though, so suspect that's not the case.

Stopping the bush-circumference beating, I do not. -ck (and gentoo) have
this massive Calimero thing going among their users where people are
much less interested in technology than in how the nasty big kernel
meanies are keeping them down (*).

I think the problem is elsewhere. Users don't say: "My apps get paged
back in." They say: "My system is more responsive." They really don't
care *why* the reaction to a mouse click that takes three seconds with
a mainline kernel is instantaneous with -ck. Nasty big kernel meanies,
OTOH, want to understand *why* a patch helps in order to decide whether
it is really a good idea to merge it. So you've got a bunch of patches
(aka -ck) which visibly improve the overall responsiveness of a desktop
system, but apparently no one can conclusively explain why or how they
achieve that, and therefore they cannot be merged into mainline.

I don't have a solution to that dilemma either.


IMHO, what everybody agrees on, is that swap-prefetch has a positive effect 
in some cases, and nobody can prove an adverse effect (excluding power 
consumption).  The reason for this positive effect is also crystal clear:
it prefetches from swap on idle into free memory, i.e. it doesn't force
anybody out, and the prefetched pages are the first to be dropped without
further swap-out, which sounds really smart.


Conclusion:  Either prove swap-prefetch is broken, or get this merged quick.


If you can't prove why it helps and doesn't hurt, then it's a hack, by 
definition.  Behind any performance hack is some fundamental truth that can be 
exploited to greater effect if we reason about it.  So let's reason about it. 
I'll start.


Resource size has been outpacing processing latency since the dawn of time. 
Disks get bigger much faster than seek times shrink.  Main memory and cache keep 
growing, while single-threaded processing speed has nearly ground to a halt.


In the old days, it made lots of sense to manage resource allocation in pages 
and blocks.  In the past few years, we started reserving blocks in ext3 
automatically because it saves more in seek time than it costs in disk space. 
Now we're taking preallocation and antifragmentation to the next level with 
extent-based allocation in ext4.


Well, we're still using bitmap-style allocation for pages, and the prefetch-less 
swap mechanism adheres to this design as well.  Maybe it's time to start 
thinking about memory in a somewhat more extent-like fashion.


With swap prefetch, we're only optimizing the case when the box isn't loaded and 
there's RAM free, but we're not optimizing the case when the box is heavily 
loaded and we need RAM to be freed.  This is a complete reversal of sane
development priorities.  If swap batching is an optimization at all (and we have 
empirical evidence that it is) then it should also be an optimization to swap 
out chunks of pages when we need to free memory.


So, how do we go about this grouping?  I suggest that if we keep per-VMA 
reference/fault/dirty statistics, we can tell which logically distinct chunks of 
memory are being regularly used.  This would also allow us to apply different page
replacement policies to chunks of memory that are being used in different fashions.
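
To sketch what that bookkeeping might look like (purely hypothetical; none of
these names exist in the kernel):

/* Per-VMA activity counters that page replacement could consult to treat
 * logically distinct chunks of memory differently. */
struct vma_activity {
	unsigned long refs;	/* pages referenced since the last scan */
	unsigned long faults;	/* faults taken against this VMA */
	unsigned long dirties;	/* pages dirtied in this VMA */
};

/* One possible policy hook: a large VMA with few recent references is a
 * candidate for being paged out as a contiguous chunk. */
static int vma_is_cold(const struct vma_activity *a, unsigned long nr_pages)
{
	return a->refs * 64 < nr_pages;	/* under 1/64 of pages recently touched */
}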


With such statistics, we could then page out VMAs in 2MB chunks when we're under 
memory pressure, also giving us the option of transparently paging them back in 
to hugepages when we have the memory free, once anonymous hugepage support is in 
place.


I'm inclined to view swap prefetch as a successful scientific experiment, and 
use that data to inform a more reasoned engineering effort.  If we can design 
something intelligent which happens to behave more or less like swap prefetch 
does under the circumstances where swap prefetch helps, and does something else 
smart under the circumstances where swap prefetch makes no discernible
difference, it'll be a much bigger improvement.


Because we cannot prove why the existing patch helps, we cannot say what impact 
it will have when things like virtualization and solid state drives radically 
change the coefficients of the equation we have not solved.  Providing a sysctl 
to turn off a misbehaving feature is a poor substitute for doing it right the 
first time, and leaving it off by default will ensure that it only gets used by 
the handful of people who know enough to rebuild with the patch anyway.


Let's talk about how we can make page replacement smarter, so it naturally 
accomplishes what swap prefetch accomplishes, as part of a design we can reason 
about.


CC-ing linux-mm, since that's where I think we should take this next.

-- Chris

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread Chris Snook

Bill Huey (hui) wrote:

On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote:
I don't think that achieving a constant error bound is always a good thing. 
 We all know that fairness has overhead.  If I have 3 threads and 2 
processors, and I have a choice between fairly giving each thread 1.0 
billion cycles during the next second, or unfairly giving two of them 1.1 
billion cycles and giving the other 0.9 billion cycles, then we can have a 
useful discussion about where we want to draw the line on the 
fairness/performance tradeoff.  On the other hand, if we can give two of 
them 1.1 billion cycles and still give the other one 1.0 billion cycles, 
it's madness to waste those 0.2 billion cycles just to avoid user jealousy. 
 The more complex the memory topology of a system, the more free cycles 
you'll get by tolerating short-term unfairness.  As a crude heuristic, 
scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but 
eventually we should take the boot-time computed migration costs into 
consideration.


You have to consider the target for this kind of code. There are applications
where you need something that falls within a constant error bound. According
to the numbers, the current CFS rebalancing logic doesn't achieve that to
any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more
strict than that.


I've said from the beginning that I think that anyone who desperately needs 
perfect fairness should be explicitly enforcing it with the aid of realtime 
priorities.  The problem is that configuring and tuning a realtime application 
is a pain, and people want to be able to approximate this behavior without doing 
a whole lot of dirty work themselves.  I believe that CFS can and should be 
enhanced to ensure SMP-fairness over potentially short, user-configurable 
intervals, even for SCHED_OTHER.  I do not, however, believe that we should take 
it to the extreme of wasting CPU cycles on migrations that will not improve 
performance for *any* task, just to avoid letting some tasks get ahead of 
others.  We should be as fair as possible but no fairer.  If we've already made 
it as fair as possible, we should account for the margin of error and correct 
for it the next time we rebalance.  We should not burn the surplus just to get 
rid of it.


On a non-NUMA box with single-socket, non-SMT processors, a constant error bound 
is fine.  Once we add SMT, go multi-core, go NUMA, and add inter-chassis 
interconnects on top of that, we need to multiply this error bound at each stage 
in the hierarchy, or else we'll end up wasting CPU cycles on migrations that 
actually hurt the processes they're supposed to be helping, and hurt everyone 
else even more.  I believe we should enforce an error bound that is proportional 
to migration cost.



Even the rt overload code (from my memory) is subject to these limitations
as well until it's moved to use a single global queue while using CPU
binding to turn off that logic. It's the price you pay for accuracy.

If we allow a little short-term fairness (and I think we should) we can 
still account for this unfairness and compensate for it (again, with the 
same tolerance) at the next rebalancing.


Again, it's a function of *when* and depends on that application.

Adding system calls, while great for research, is not something which is 
done lightly in the published kernel.  If we're going to implement a user 
interface beyond simply interpreting existing priorities more precisely, it 
would be nice if this was part of a framework with a broader vision, such 
as a scheduler economy.


I'm not sure what you mean by scheduler economy, but CFS can and should
be extended to handle proportional scheduling which is outside of the
traditional Unix priority semantics. Having a new API to get at this is
unavoidable if you want it to eventually support -rt oriented applications
that have bandwidth semantics.


A scheduler economy is basically a credit scheduler, augmented to allow 
processes to exchange credits with each other.  If you want to get more 
sophisticated with fairness, you could price CPU time proportional to load on 
that CPU.
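
A toy sketch of the idea (illustrative only; the pricing curve and all names
are made up):

#include <stdio.h>

/* Each task holds a credit balance, and a CPU's price per tick rises with
 * its load, so time on a busy CPU costs more than time on an idle one. */
struct toy_cpu  { unsigned int nr_running; };
struct toy_task { long credits; };

static long tick_price(const struct toy_cpu *cpu)
{
	return 10 + 5 * cpu->nr_running;	/* arbitrary pricing curve */
}

/* Charge a task for one tick on a CPU; returns 0 if it cannot pay. */
static int charge_tick(struct toy_task *t, const struct toy_cpu *cpu)
{
	long price = tick_price(cpu);

	if (t->credits < price)
		return 0;
	t->credits -= price;
	return 1;
}

int main(void)
{
	struct toy_cpu busy = { 4 }, idle = { 0 };
	struct toy_task t = { 100 };
	int ok_busy = charge_tick(&t, &busy);	/* price 30 */
	int ok_idle = charge_tick(&t, &idle);	/* price 10 */

	printf("busy: %d, idle: %d, credits left: %ld\n",
	       ok_busy, ok_idle, t.credits);	/* 1, 1, 60 */
	return 0;
}

The real-estate analogy below then falls out naturally: busier CPUs are
pricier neighborhoods.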


I've been house-hunting lately, so I like to think of it in real estate terms. 
If you're comfortable with your standard of living and you have enough money, 
you can rent the apartment in the chic part of town, right next to the subway 
station.  If you want to be more frugal because you're saving for retirement, 
you can get a place out in the suburbs, but the commute will be more of a pain. 
 If you can't make up your mind and keep moving back and forth, you spend a lot 
on moving and all your stuff gets dented and scratched.



All deadline based schedulers have API mechanisms like this to support
extended semantics. This is no different.

I had a feeling this patch was originally designed for the O(1) scheduler, 
and this is why.  The old scheduler had expired arrays, so adding a 
round-expired

pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS)

2007-07-27 Thread Chris Snook

Andrea Arcangeli wrote:

On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote:
I think Volanomark is being pretty stupid, and deserves to run slowly, but 


Indeed, any app doing what volanomark does is pretty inefficient.

But this is not the point. I/O schedulers are pluggable to help for
inefficient apps too. If apps were extremely smart they would all
use async-io for their reads, and there wouldn't be any need for the
anticipatory scheduler, just as an example.


I'm pretty sure the point of posting a patch that triples CFS performance on a 
certain benchmark and arguably improves the semantics of sched_yield was to 
improve CFS.  You have a point, but it is a point for a different thread.  I 
have taken the liberty of starting this thread for you.



The fact is there's no technical explanation for why we're forbidden
from choosing between CFS and O(1) at least at boot time.


Sure there is.  We can run a fully-functional POSIX OS without using any block 
devices at all.  We cannot run a fully-functional POSIX OS without a scheduler. 
 Any feature without which the OS cannot execute userspace code is sufficiently 
primitive that somewhere there is a device on which it will be impossible to 
debug if that feature fails to initialize.  It is quite reasonable to insist on 
only having one implementation of such features in any given kernel build.


Whether or not these alternatives belong in the source tree as config-time 
options is a political question, but preserving boot-time debugging capability 
is a perfectly reasonable technical motivation.


-- Chris


Re: [PATCH 11/23] make atomic_read() and atomic_set() behavior consistent on m32r

2007-08-22 Thread Chris Snook

Hirokazu Takata wrote:
I think the parameter of atomic_read() should have a const
qualifier to avoid these warnings, and IMHO this modification might be
worth applying on other archs.


I agree.


Here is an additional patch to revise the previous one for m32r.


I'll incorporate this change if we get enough consensus to justify a 
re-re-re-submit.  Since the patch is intended to be a functional no-op on m32r, 
I'm inclined to leave it alone at the moment.
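
For concreteness, the const-qualified variant under discussion would look
something like this (a sketch in the style of the rest of the series; note
that the parisc patch below already takes a const parameter):

static __inline__ int atomic_read(const atomic_t *v)
{
	return *(volatile int *)&v->counter;
}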



I also tried to rewrite it with inline asm code, but the kernel text size
became roughly 2kB larger. So, I prefer the C version.


You're not the only arch maintainer who prefers doing it in C.  If you trust 
your compiler (a big if, apparently), inline asm only improves code generation 
if you have a whole bunch of general purpose registers for the optimizer to play 
with.


-- Chris


Re: Fork Bombing Patch

2007-08-23 Thread Chris Snook

Krzysztof Halasa wrote:

Hi,

Anand Jahagirdar [EMAIL PROTECTED] writes:


   I am forwarding one more improved patch which I have modified as
per your suggestions. Instead of KERN_INFO I have used KERN_NOTICE and
I have added one more if block to check the hard limit. How good is it?


Not very, still lacks #ifdef CONFIG_something and the required
Kconfig change (or other runtime thing defaulting to no printk).


Wrapping a single printk that's unrelated to debugging in an #ifdef 
CONFIG_* or a sysctl strikes me as abuse of those configuration 
facilities.  Where would we draw the line for other patches wanting to 
do similar things?


I realized that even checking the hard limit is insufficient, because
that can be lowered (but not raised) by unprivileged processes.  If we 
can't do this unconditionally (and we can't, because the log pollution 
would be intolerable for many people) then we shouldn't do it at all.


Anand -- I appreciate the effort, but I think you should reconsider 
precisely what problem you're trying to solve here.  This approach can't 
tell the difference between legitimate self-regulation of resource 
utilization and a real attack.  Worse, in the event of a real attack, it 
could be used to make it more difficult for the administrator to notice 
something much more serious than a forkbomb.


I suspect that userspace might be a better place to solve this problem. 
 You could run your monitoring app with elevated or even realtime 
priority to ensure it will still function, and you have much more 
freedom in making the reporting configurable.  You can also look at much 
more data than we could ever allow in fork.c, and possibly detect 
attacks that this patch would miss if a clever attacker stayed just 
below the limit.


-- Chris


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Chris Snook

Denys Vlasenko wrote:

On Friday 24 August 2007 18:06, Christoph Lameter wrote:

On Fri, 24 Aug 2007, Satyam Sharma wrote:

But if people do seem to have a mixed / confused notion of atomicity
and barriers, and if there's consensus, then as I'd said earlier, I
have no issues in going with the consensus (eg. having API variants).
Linus would be more difficult to convince, however, I suspect :-)

The confusion may be the result of us having barrier semantics in
atomic_read. If we take that out then we may avoid future confusions.


I think a better name may help. Nuke atomic_read() altogether.

n = atomic_value(x);	// doesn't hint as strongly at reading as atomic_read
n = atomic_fetch(x);	// yes, we _do_ touch RAM
n = atomic_read_uncached(x);	// or this

How does that sound?


atomic_value() vs. atomic_fetch() should be rather unambiguous. 
atomic_read_uncached() begs the question of precisely which cache we are 
avoiding, and could itself cause confusion.


So, if I were writing atomic.h from scratch, knowing what I know now, I think 
I'd use atomic_value() and atomic_fetch().  The problem is that there are a lot 
of existing users of atomic_read(), and we can't write a script to correctly 
guess their intent.  I'm not sure auditing all uses of atomic_read() is really 
worth the comparatively minuscule benefits.


We could play it safe and convert them all to atomic_fetch(), or we could 
acknowledge that changing the semantics 8 months ago was not at all disastrous, 
and make them all atomic_value(), allowing people to use atomic_fetch() where 
they really care.


-- Chris


Re: Fork Bombing Patch

2007-08-29 Thread Chris Snook

Anand Jahagirdar wrote:

Hi
   Consider a case:
if a non-root user asks the admin for a higher process limit than the
root user's, the admin needs to modify settings in the
/etc/security/limits.conf file, and if that user is not trustworthy and
launches a fork bombing attack, it will kill the box.


If root is dumb enough to give the user whatever privileges they ask for, 
fork-bombing is the least of your problems.



(I have already tried this attack.)  In that case this loop will work,
but by that time the attack might have killed the box (because so many
processes have already been created), so in that case the
admin won't come to know what has happened.


On large multi-user SMP systems, the default ulimits will keep the box 
responsive, if sluggish.  Perhaps you should file a bug with your distribution 
if you believe the default settings in limits.conf are too high.  There's no way 
to algorithmically distinguish a forkbomb from a legitimate highly-threaded 
workload.



There are many cases like this (actually these cases were already
discussed on LKML two months ago in my thread named "fork bombing
attack").
In all these cases this printk helps the administrator a lot.


What exactly does this patch help the administrator do?  If a box is thrashing, 
you still have sysrq.  You can also use cpusets and taskset to put your root 
login session on a dedicated processor, which is getting to be pretty cheap on 
modern many-core, many-thread systems.  Group scheduling is in the oven, which 
will allow you to prioritize classes of users in a more general manner, even on 
UP systems.



On 8/29/07, Simon Arlott [EMAIL PROTECTED] wrote:

On Wed, August 29, 2007 10:48, Anand Jahagirdar wrote:

Hi
The printk_ratelimit function takes care of flooding the
syslog. Due to the printk_ratelimit function, the syslog will not be flooded


Um, no.  printk_ratelimit is on the order of *seconds*.  This prevents error 
conditions from causing the system to spend all of its CPU and I/O time logging. 
 It does very little to prevent log spamming.  If I sent you an email every 
second, it would make it much more difficult for you to find other messages in 
your inbox.  It's possible (easy, even) to write a forkbomber that doesn't 
actually harm system responsiveness, but will still trigger this printk as fast 
as possible.  If we merge this patch, every cracking toolkit in existence will 
add such a feature, because log spamming makes it harder for the administrator 
to find more important messages, and even if the administrator uses grep 
judiciously to filter them out, that doesn't help if logrotate has already 
deleted the log containing the information they need to keep /var/log from 
filling up.
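
For reference, the pattern under discussion is the standard rate-limited
printk (the message text here is made up; the interval and burst are tunable
via /proc/sys/kernel/printk_ratelimit and printk_ratelimit_burst):

	if (printk_ratelimit())
		printk(KERN_NOTICE "fork limit hit by uid %u\n", uid);	/* hypothetical message */

Even fully rate-limited, a patient forkbomber can keep a steady drip of these
in the log.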



anymore. As soon as the administrator gets this message, he can take
action against that user (maybe block the user's access on the server). I
think my fork patch is very useful and helps the administrator a lot.


You still haven't explained why this can't be done in userspace.  If forkbombing 
is a serious threat (and it's not) you can run a forkbomb monitor with realtime 
priority that won't be severely impacted by thrashing among normal priority 
processes.  Userspace has room for much more sophisticated processing anyway, so 
doing this in the kernel doesn't make much sense.



I would also like to mention that in some of the cases
the ulimit solution won't work. In that case fork bombing takes down the
machine and the server needs a reboot. I am sure in that situation this
printk statement helps the administrator know what has happened.


SysRq-t makes it quite obvious that the system has been forkbombed, allowing the 
administrator to lower ulimits if the box can't handle the load permitted by the 
default settings.  Sometimes SysRq is inconvenient due to lack of physical 
access, which is why I wrote hangwatch[1].


Hangwatch monitors /proc/loadavg and writes the specified set of SysRq triggers 
into /proc/sysrq-trigger when the specified load average is exceeded, with the 
specified frequency.  It doesn't require forks or dynamic memory allocation, so 
it works basically any time the box isn't locked up enough to trigger NMI 
watchdog, though realtime users may want to run it with chrt priority.  It's 
very simple, but it's proven so effective that there really hasn't been much 
need to develop it further since I initially wrote it a year ago.
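
A minimal sketch of the approach, assuming only /proc/loadavg and
/proc/sysrq-trigger (an illustration, not hangwatch itself; the threshold,
trigger set, and poll interval are placeholders):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
	const double threshold = 20.0;
	const char triggers[] = "t";	/* dump task state, as SysRq-t would */

	for (;;) {
		char buf[64];
		double load1 = 0.0;
		int fd = open("/proc/loadavg", O_RDONLY);

		if (fd >= 0) {
			ssize_t n = read(fd, buf, sizeof(buf) - 1);

			close(fd);
			if (n > 0) {
				buf[n] = '\0';
				load1 = strtod(buf, NULL);	/* 1-minute average */
			}
		}
		if (load1 > threshold) {
			int out = open("/proc/sysrq-trigger", O_WRONLY);

			if (out >= 0) {
				write(out, triggers, strlen(triggers));
				close(out);
			}
		}
		sleep(10);	/* poll; no forks, no dynamic allocation */
	}
}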


Given how much we can already do in userspace, I don't really see a need to 
implement this in the kernel.  If you'd like me to add features to hangwatch, 
let's talk about that.  You can even fork it yourself, since it's GPL.


-- Chris

[1] http://people.redhat.com/csnook/hangwatch/


Re: [PATCH 9/24] make atomic_read() behave consistently on ia64

2007-08-13 Thread Chris Snook

Paul Mackerras wrote:

Chris Snook writes:


I'll do this for the whole patchset.  Stay tuned for the resubmit.


Could you incorporate Segher's patch to turn atomic_{read,set} into
asm on powerpc?  Segher claims that using asm is really the only
reliable way to ensure that gcc does what we want, and he seems to
have a point.

Paul.


I haven't seen a patch yet.  I'm going to resubmit with inline volatile-cast 
atomic[64]_[read|set] on all architectures as a reference point, and if anyone 
wants to go and implement some of them in assembly, that's between them and the 
relevant arch maintainers.  I have no problem with (someone else) doing it in 
assembly.  I just don't think it's necessary and won't let it hold up the effort 
to get consistent behavior on all architectures.


-- Chris


Re: [PATCH] [74/2many] MAINTAINERS - ATL1 ETHERNET DRIVER

2007-08-13 Thread Chris Snook

[EMAIL PROTECTED] wrote:

Add file pattern to MAINTAINER entry

Signed-off-by: Joe Perches [EMAIL PROTECTED]

diff --git a/MAINTAINERS b/MAINTAINERS
index b8bb108..d9d1bcc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -752,6 +752,7 @@ L:  [EMAIL PROTECTED]
 W: http://sourceforge.net/projects/atl1
 W: http://atl1.sourceforge.net
 S: Maintained
+F: drivers/net/atl1*
 
 ATM

 P: Chas Williams


atl1 has its own directory, so this one should read

F:  drivers/net/atl1/*

-- Chris


Re: [PATCH 6/24] make atomic_read() behave consistently on frv

2007-08-13 Thread Chris Snook

David Howells wrote:

Chris Snook [EMAIL PROTECTED] wrote:


cpu_relax() contains a barrier, so it should do the right thing.  For non-smp
architectures, I'm concerned about interacting with interrupt handlers.  Some
drivers do use atomic_* operations.


I'm not sure that actually answers my question.  Why not smp_rmb()?

David


I would assume because we want to waste time efficiently even on non-smp 
architectures, rather than frying the CPU or draining the battery.  Certain 
looping execution patterns can cause the CPU to operate above thermal design 
power.  I have fans on my workstation that only ever come on when running 
LINPACK, and that's generally memory bandwidth-bound.  Just imagine what happens 
when you're executing the same few non-serializing instructions in a tight loop 
without ever stalling on memory fetches, or being scheduled out.


If there's another reason, I'd like to hear it too, because I'm just guessing 
here.
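
To illustrate the tradeoff (a toy, not kernel code; on x86, cpu_relax()
amounts to the pause instruction plus a compiler barrier, which the inline
asm below imitates):

static inline void toy_cpu_relax(void)
{
	__asm__ __volatile__("pause" ::: "memory");	/* x86 only */
}

static void spin_until_set(volatile int *flag)
{
	while (!*flag)
		toy_cpu_relax();	/* wastes time efficiently: cooler, cheaper */
}

On a uniprocessor build, smp_rmb() degrades to a pure compiler barrier, which
does nothing to throttle a loop like this.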

-- Chris


Re: [PATCH] [74/2many] MAINTAINERS - ATL1 ETHERNET DRIVER

2007-08-13 Thread Chris Snook

Chris Snook wrote:

[EMAIL PROTECTED] wrote:

Add file pattern to MAINTAINER entry

Signed-off-by: Joe Perches [EMAIL PROTECTED]

diff --git a/MAINTAINERS b/MAINTAINERS
index b8bb108..d9d1bcc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -752,6 +752,7 @@ L:[EMAIL PROTECTED]
 W:http://sourceforge.net/projects/atl1
 W:http://atl1.sourceforge.net
 S:Maintained
+F:drivers/net/atl1*
 
 ATM

 P:Chas Williams


atl1 has its own directory, so this one should read

F:drivers/net/atl1/*

-- Chris



Actually, now that I've seen the format in the intro patch, it would be simpler 
just to use this:


F:  drivers/net/atl1/

-- Chris


[PATCH 0/23] make atomic_read() and atomic_set() behavior consistent across all architectures

2007-08-13 Thread Chris Snook
By popular demand, I've redone the patchset to include volatile casts in 
atomic_set as well.  I've also converted the macros to inline functions, to help 
catch type mismatches at compile time.
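
As a toy illustration of the type-checking benefit (not part of the patchset;
all names are local to this example):

typedef struct { int counter; } atomic_t;
typedef struct { long counter; } atomic64_t;

#define atomic_read_macro(v)	((v)->counter)	/* old style: no type check */

static inline int atomic_read(atomic_t *v)	/* new style: typed parameter */
{
	return *(volatile int *)&v->counter;
}

void example(atomic64_t *v64)
{
	long a = atomic_read_macro(v64);	/* compiles silently with the wrong type */
	/* atomic_read(v64); */			/* would draw an incompatible-pointer-type warning */
	(void)a;
}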


This will do weird things on ia64 without Andreas Schwab's fix:

http://lkml.org/lkml/2007/8/10/410

Notably absent is a patch for powerpc.  I expect Segher Boessenkool's assembly 
implementation should suffice there:


http://lkml.org/lkml/2007/8/10/470

Thanks to all who commented on previous incarnations.

-- Chris Snook


[PATCH 1/23] document preferred use of volatile with atomic_t

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Document proper use of volatile for atomic_t operations.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/Documentation/atomic_ops.txt  2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/Documentation/atomic_ops.txt   2007-08-13 
03:36:43.0 -0400
@@ -12,13 +12,20 @@
 C integer type will fail.  Something like the following should
 suffice:
 
-   typedef struct { volatile int counter; } atomic_t;
+   typedef struct { int counter; } atomic_t;
+
+   Historically, counter has been declared as a volatile int.  This
+is now discouraged in favor of explicitly casting it as volatile where
+volatile behavior is required.  Most architectures will only require such
+a cast in atomic_read() and atomic_set(), as well as their 64-bit versions
+if applicable, since the more complex atomic operations directly or
+indirectly use assembly that results in volatile behavior.
 
The first operations to implement for atomic_t's are the
 initializers and plain reads.
 
#define ATOMIC_INIT(i)  { (i) }
-	#define atomic_set(v, i)	((v)->counter = (i))
+	#define atomic_set(v, i)	(*(volatile int *)&(v)->counter = (i))
 
 The first macro is used in definitions, such as:
 
@@ -38,7 +45,7 @@
 
 Next, we have:
 
-	#define atomic_read(v)	((v)->counter)
+	#define atomic_read(v)	(*(volatile int *)&(v)->counter)
 
 which simply reads the current value of the counter.
 


[PATCH 2/23] make atomic_read() and atomic_set() behavior consistent on alpha

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on alpha.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-alpha/atomic.h2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-alpha/atomic.h 2007-08-13 05:00:36.0 
-0400
@@ -14,21 +14,35 @@
 
 
 /*
- * Counter is volatile to make sure gcc doesn't try to be clever
- * and move things around on us. We need to use _exactly_ the address
- * the user gave us, not some alias that contains the same information.
+ * Make sure gcc doesn't try to be clever and move things around
+ * on us. We need to use _exactly_ the address the user gave us,
+ * not some alias that contains the same information.
  */
-typedef struct { volatile int counter; } atomic_t;
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { int counter; } atomic_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC_INIT(i) ( (atomic_t) { (i) } )
 #define ATOMIC64_INIT(i)   ( (atomic64_t) { (i) } )
 
-#define atomic_read(v)	((v)->counter + 0)
-#define atomic64_read(v)	((v)->counter + 0)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter + 0;
+}
+
+static __inline__ long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long *)&v->counter + 0;
+}
 
-#define atomic_set(v,i)	((v)->counter = (i))
-#define atomic64_set(v,i)	((v)->counter = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
+
+static __inline__ void atomic64_set(atomic64_t *v, long i)
+{
+	*(volatile long *)&v->counter = i;
+}
 
 /*
  * To get proper branch prediction for the main line, we must branch


[PATCH 3/23] make atomic_read() and atomic_set() behavior consistent on arm

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on arm.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-arm/atomic.h  2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-arm/atomic.h   2007-08-13 04:44:50.0 
-0400
@@ -14,13 +14,16 @@
 #include <linux/compiler.h>
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i) { (i) }
 
 #ifdef __KERNEL__
 
-#define atomic_read(v)	((v)->counter)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 #if __LINUX_ARM_ARCH__ >= 6
 


[PATCH 4/23] make atomic_read() and atomic_set() behavior consistent on avr32

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on avr32.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-avr32/atomic.h2007-08-13 
03:14:13.0 -0400
+++ linux-2.6.23-rc3/include/asm-avr32/atomic.h 2007-08-13 04:48:25.0 
-0400
@@ -16,11 +16,18 @@
 
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 #define ATOMIC_INIT(i)  { (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /*
  * atomic_sub_return - subtract the atomic variable


[PATCH 5/23] make atomic_read() and atomic_set() behavior consistent on blackfin

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on blackfin.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-blackfin/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-blackfin/atomic.h  2007-08-13 
05:21:07.0 -0400
@@ -18,8 +18,15 @@ typedef struct {
 } atomic_t;
 #define ATOMIC_INIT(i) { (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static __inline__ void atomic_add(int i, atomic_t * v)
 {


[PATCH 6/23] make atomic_read() and atomic_set() behavior consistent on cris

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on cris.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-cris/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-cris/atomic.h  2007-08-13 05:23:37.0 
-0400
@@ -11,12 +11,19 @@
  * resource counting etc..
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)  { (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /* These should be written in asm but we do it in C for now. */
 


[PATCH 7/23] make atomic_read() and atomic_set() behavior consistent on frv

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on frv.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-frv/atomic.h  2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-frv/atomic.h   2007-08-13 05:27:08.0 
-0400
@@ -40,8 +40,16 @@ typedef struct {
 } atomic_t;
 
 #define ATOMIC_INIT(i) { (i) }
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = (i))
+
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #ifndef CONFIG_FRV_OUTOFLINE_ATOMIC_OPS
 static inline int atomic_add_return(int i, atomic_t *v)


[PATCH 8/23] make atomic_read() and atomic_set() behavior consistent on h8300

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on h8300.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-h8300/atomic.h2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-h8300/atomic.h 2007-08-13 05:29:05.0 
-0400
@@ -9,8 +9,15 @@
 typedef struct { int counter; } atomic_t;
 #define ATOMIC_INIT(i) { (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #include <asm/system.h>
 #include <linux/kernel.h>


[PATCH 9/23] make atomic_read() and atomic_set() behavior consistent on i386

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on i386.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-i386/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-i386/atomic.h  2007-08-13 05:31:45.0 
-0400
@@ -25,7 +25,10 @@ typedef struct { int counter; } atomic_t
  * 
  * Atomically reads the value of @v.
  */ 
-#define atomic_read(v)	((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -34,7 +37,10 @@ typedef struct { int counter; } atomic_t
  * 
  * Atomically sets the value of @v to @i.
  */ 
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add - add integer to atomic variable


[PATCH 10/23] make atomic_read() and atomic_set() behavior consistent on ia64

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on ia64.
This will do weird things without Andreas Schwab's fix:
http://lkml.org/lkml/2007/8/10/410

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-ia64/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-ia64/atomic.h  2007-08-13 05:38:27.0 
-0400
@@ -19,19 +19,34 @@
 
 /*
 * On IA-64, counter must always be volatile to ensure that the
- * memory accesses are ordered.
+ * memory accesses are ordered.  This must be enforced each time that
+ * counter is read or written.
  */
-typedef struct { volatile __s32 counter; } atomic_t;
-typedef struct { volatile __s64 counter; } atomic64_t;
+typedef struct { __s32 counter; } atomic_t;
+typedef struct { __s64 counter; } atomic64_t;
 
 #define ATOMIC_INIT(i) ((atomic_t) { (i) })
 #define ATOMIC64_INIT(i)   ((atomic64_t) { (i) })
 
-#define atomic_read(v)	((v)->counter)
-#define atomic64_read(v)	((v)->counter)
+static inline __s32 atomic_read(atomic_t *v)
+{
+	return *(volatile __s32 *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, __s32 i)
+{
+	*(volatile __s32 *)&v->counter = i;
+}
 
-#define atomic_set(v,i)	(((v)->counter) = (i))
-#define atomic64_set(v,i)	(((v)->counter) = (i))
+static inline __s64 atomic64_read(atomic64_t *v)
+{
+	return *(volatile __s64 *)&v->counter;
+}
+
+static inline void atomic64_set(atomic64_t *v, __s64 i)
+{
+	*(volatile __s64 *)&v->counter = i;
+}
 
 static __inline__ int
 ia64_atomic_add (int i, atomic_t *v)


[PATCH 11/23] make atomic_read() and atomic_set() behavior consistent on m32r

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on m32r.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-m32r/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-m32r/atomic.h  2007-08-13 05:42:09.0 
-0400
@@ -22,7 +22,7 @@
  * on us. We need to use _exactly_ the address the user gave us,
  * not some alias that contains the same information.
  */
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i) { (i) }
 
@@ -32,7 +32,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically reads the value of @v.
  */
-#define atomic_read(v)	((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -41,7 +44,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add_return - add integer to atomic variable and return it


[PATCH 12/23] make atomic_read() and atomic_set() behavior consistent on m68knommu

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on m68knommu.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-m68knommu/atomic.h2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-m68knommu/atomic.h 2007-08-13 
05:47:46.0 -0400
@@ -15,8 +15,15 @@
 typedef struct { int counter; } atomic_t;
 #define ATOMIC_INIT(i) { (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static __inline__ void atomic_add(int i, atomic_t *v)
 {


[PATCH 13/23] make atomic_read() and atomic_set() behavior consistent on m68k

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on m68k.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-m68k/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-m68k/atomic.h  2007-08-13 05:45:43.0 
-0400
@@ -16,8 +16,15 @@
 typedef struct { int counter; } atomic_t;
 #define ATOMIC_INIT(i) { (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static inline void atomic_add(int i, atomic_t *v)
 {


[PATCH 15/23] make atomic_read() and atomic_set() behavior consistent on parisc

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on parisc.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-parisc/atomic.h   2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-parisc/atomic.h2007-08-13 
05:59:35.0 -0400
@@ -128,7 +128,7 @@ __cmpxchg(volatile void *ptr, unsigned l
  * Cache-line alignment would conflict with, for example, linux/module.h
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 /* It's possible to reduce all atomic operations to either
  * __atomic_add_return, atomic_set and atomic_read (the latter
@@ -159,7 +159,7 @@ static __inline__ void atomic_set(atomic
 
 static __inline__ int atomic_read(const atomic_t *v)
 {
-	return v->counter;
+	return *(volatile int *)&v->counter;
 }
 
 /* exported interface */
@@ -227,7 +227,7 @@ static __inline__ int atomic_add_unless(
 
 #ifdef CONFIG_64BIT
 
-typedef struct { volatile s64 counter; } atomic64_t;
+typedef struct { s64 counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i) ((atomic64_t) { (i) })
 
@@ -258,7 +258,7 @@ atomic64_set(atomic64_t *v, s64 i)
 static __inline__ s64
 atomic64_read(const atomic64_t *v)
 {
-	return v->counter;
+	return *(volatile s64 *)&v->counter;
 }
 
 #define atomic64_add(i,v)	((void)(__atomic64_add_return( ((s64)i),(v))))


[PATCH 16/23] make atomic_read() and atomic_set() behavior consistent on s390

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on s390.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-s390/atomic.h 2007-08-13 
03:14:13.0 -0400
+++ linux-2.6.23-rc3/include/asm-s390/atomic.h  2007-08-13 06:04:58.0 
-0400
@@ -67,8 +67,15 @@ typedef struct {
 
 #endif /* __GNUC__ */
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static __inline__ int atomic_add_return(int i, atomic_t * v)
 {
@@ -182,8 +189,15 @@ typedef struct {
 
 #endif /* __GNUC__ */
 
-#define atomic64_read(v)	((v)->counter)
-#define atomic64_set(v,i)	(((v)->counter) = (i))
+static __inline__ long long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long long *)&v->counter;
+}
+
+static __inline__ void atomic64_set(atomic64_t *v, long long i)
+{
+	*(volatile long long *)&v->counter = i;
+}
 
 static __inline__ long long atomic64_add_return(long long i, atomic64_t * v)
 {


[PATCH 14/23] make atomic_read() and atomic_set() behavior consistent on mips

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on mips.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-mips/atomic.h 2007-08-13 
03:14:13.0 -0400
+++ linux-2.6.23-rc3/include/asm-mips/atomic.h  2007-08-13 05:52:14.0 
-0400
@@ -20,7 +20,7 @@
 #include <asm/war.h>
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i){ (i) }
 
@@ -30,7 +30,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically reads the value of @v.
  */
-#define atomic_read(v)	((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /*
  * atomic_set - set atomic variable
@@ -39,7 +42,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic_set(v,i)	((v)->counter = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /*
  * atomic_add - add integer to atomic variable
@@ -404,7 +410,7 @@ static __inline__ int atomic_add_unless(
 
 #ifdef CONFIG_64BIT
 
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i){ (i) }
 
@@ -413,14 +419,20 @@ typedef struct { volatile long counter; 
  * @v: pointer of type atomic64_t
  *
  */
-#define atomic64_read(v)	((v)->counter)
+static __inline__ long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long *)&v->counter;
+}
 
 /*
  * atomic64_set - set atomic variable
  * @v: pointer of type atomic64_t
  * @i: required value
  */
-#define atomic64_set(v,i)	((v)->counter = (i))
+static __inline__ void atomic64_set(atomic64_t *v, long i)
+{
+	*(volatile long *)&v->counter = i;
+}
 
 /*
  * atomic64_add - add integer to atomic variable


[PATCH 21/23] make atomic_read() and atomic_set() behavior consistent on v850

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on v850.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-v850/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-v850/atomic.h  2007-08-13 06:19:32.0 
-0400
@@ -27,8 +27,15 @@ typedef struct { int counter; } atomic_t
 
 #ifdef __KERNEL__
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static inline int atomic_add_return (int i, volatile atomic_t *v)
 {


[PATCH 20/23] make atomic_read() and atomic_set() behavior consistent on sparc

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sparc.
Leave sparc-internal atomic24_t type alone.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sparc/atomic.h2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sparc/atomic.h 2007-08-13 06:12:49.0 
-0400
@@ -13,7 +13,7 @@
 
 #include <linux/types.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #ifdef __KERNEL__
 
@@ -61,7 +61,10 @@ extern int atomic_cmpxchg(atomic_t *, in
 extern int atomic_add_unless(atomic_t *, int, int);
 extern void atomic_set(atomic_t *, int);
 
-#define atomic_read(v)	((v)->counter)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 #define atomic_add(i, v)   ((void)__atomic_add_return( (int)(i), (v)))
 #define atomic_sub(i, v)   ((void)__atomic_add_return(-(int)(i), (v)))


[PATCH 19/23] make atomic_read() and atomic_set() behavior consistent on sparc64

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sparc64.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sparc64/atomic.h  2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sparc64/atomic.h   2007-08-13 
06:17:01.0 -0400
@@ -11,17 +11,31 @@
 #include <linux/types.h>
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
-typedef struct { volatile __s64 counter; } atomic64_t;
+typedef struct { int counter; } atomic_t;
+typedef struct { __s64 counter; } atomic64_t;
 
 #define ATOMIC_INIT(i) { (i) }
 #define ATOMIC64_INIT(i)   { (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic64_read(v)	((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static __inline__ __s64 atomic64_read(atomic64_t *v)
+{
+	return *(volatile __s64 *)&v->counter;
+}
 
-#define atomic_set(v, i)	(((v)->counter) = i)
-#define atomic64_set(v, i)	(((v)->counter) = i)
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
+
+static __inline__ void atomic64_set(atomic64_t *v, __s64 i)
+{
+	*(volatile __s64 *)&v->counter = i;
+}
 
 extern void atomic_add(int, atomic_t *);
 extern void atomic64_add(int, atomic64_t *);


[PATCH 17/23] make atomic_read() and atomic_set() behavior consistent on sh64

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sh64.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sh64/atomic.h 2007-07-08 
19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sh64/atomic.h  2007-08-13 06:08:37.0 
-0400
@@ -19,12 +19,19 @@
  *
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i) ( (atomic_t) { (i) } )
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	((v)->counter = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #include <asm/system.h>
 


[PATCH 18/23] make atomic_read() and atomic_set() behavior consistent on sh

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sh.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sh/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sh/atomic.h	2007-08-13 06:07:16.0 -0400
@@ -7,12 +7,19 @@
  *
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i) ( (atomic_t) { (i) } )
 
-#define atomic_read(v) ((v)->counter)
-#define atomic_set(v,i)	((v)->counter = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #include <linux/compiler.h>
 #include <asm/system.h>
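
A side note, sketched below with hypothetical names (users,
snapshot_users; not part of the patch): because the accessor is now a
real function rather than a macro, its argument is type-checked
against atomic_t *, so a wrong pointer type draws a clear diagnostic
instead of a confusing field-access error from the expanded macro.

/*
 * Hypothetical example: the inline version type-checks its argument.
 */
static atomic_t users = ATOMIC_INIT(0);

static int snapshot_users(void)
{
	return atomic_read(&users);	/* ok: atomic_t *, as declared */
}

/*
 * With the old macro, atomic_read(&some_int) expanded to
 * (&some_int)->counter and failed with a "not a structure" error;
 * the inline function reports the pointer-type mismatch directly.
 */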


[PATCH 22/23] make atomic_read() and atomic_set() behavior consistent on x86_64

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on x86_64.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-x86_64/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-x86_64/atomic.h	2007-08-13 06:22:43.0 -0400
@@ -32,7 +32,10 @@ typedef struct { int counter; } atomic_t
  * 
  * Atomically reads the value of @v.
  */ 
-#define atomic_read(v) ((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -41,7 +44,10 @@ typedef struct { int counter; } atomic_t
  * 
  * Atomically sets the value of @v to @i.
  */ 
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add - add integer to atomic variable
@@ -206,7 +212,7 @@ static __inline__ int atomic_sub_return(
 
 /* An 64bit atomic type */
 
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i)   { (i) }
 
@@ -217,7 +223,10 @@ typedef struct { volatile long counter; 
  * Atomically reads the value of @v.
  * Doesn't imply a read memory barrier.
  */
-#define atomic64_read(v)   ((v)->counter)
+static __inline__ long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long *)&v->counter;
+}
 
 /**
  * atomic64_set - set atomic64 variable
@@ -226,7 +235,10 @@ typedef struct { volatile long counter; 
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic64_set(v,i)  (((v)->counter) = (i))
+static __inline__ void atomic64_set(atomic64_t *v, long i)
+{
+	*(volatile long *)&v->counter = i;
+}
 
 /**
  * atomic64_add - add integer to atomic64 variable
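
As the kerneldoc above notes, these accessors imply no memory barrier.
A sketch (illustrative only, not part of the patch; the names payload,
ready, producer, and consumer are hypothetical) of pairing them with
explicit barriers for cross-CPU ordering:

/*
 * Hypothetical example: the volatile cast guarantees real loads and
 * stores, but ordering between CPUs still needs explicit barriers.
 */
static atomic64_t payload = ATOMIC64_INIT(0);
static atomic64_t ready = ATOMIC64_INIT(0);

static void producer(long val)
{
	atomic64_set(&payload, val);
	smp_mb();			/* publish payload before ready */
	atomic64_set(&ready, 1);
}

static long consumer(void)
{
	while (atomic64_read(&ready) == 0)
		cpu_relax();
	smp_mb();			/* pairs with the producer's smp_mb() */
	return atomic64_read(&payload);
}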


[PATCH 23/23] make atomic_read() and atomic_set() behavior consistent on xtensa

2007-08-13 Thread Chris Snook
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on xtensa.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-xtensa/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-xtensa/atomic.h	2007-08-13 06:31:58.0 -0400
@@ -15,7 +15,7 @@
 
 #include <linux/stringify.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #ifdef __KERNEL__
 #include <asm/processor.h>
@@ -47,7 +47,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically reads the value of @v.
  */
-#define atomic_read(v) ((v)->counter)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -56,7 +59,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic_set(v,i)	((v)->counter = (i))
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add - add integer to atomic variable

