Re: [PATCH] [RFC] Throttle swappiness for interactive tasks
अभिजित भोपटकर (Abhijit Bhopatkar) wrote: The mm structures of interactive tasks are marked and the pages belonging to them are never shifted to the inactive list in the LRU algorithm, thus keeping interactive tasks in memory as long as possible. The interactivity is already determined by the scheduler, so we reuse that knowledge to mark the mm structures. Signed-off-by: Abhijit Bhopatkar [EMAIL PROTECTED] --- Lying to the VM doesn't seem like the best way to handle this. A lot of tasks, including interactive ones, have some/many pages that they touch once during startup and don't touch again for a very long time, if ever. We want these pages swapped out long before the box swaps out the working set of our non-interactive processes. I like the general idea of swap priority influenced by scheduler priority, but if we're going to do that, we should do it in a general way that's independent of scheduler implementation, so it'll be useful to soft real-time users and still relevant if (when?) we replace the current scheduler with something else lacking a special interactive flag. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP lockup in virtualized environment
LAPLACE Cyprien wrote: An example: in kernel/pid.c:alloc_pid(), if one of the guest CPUs is descheduled when holding the pidmap_lock, what happens to the other guest CPUs who want to alloc/free pids ? Are they blocked too ? Yup. This is where it's really nice to have directed yields, where you tell the hypervisor to give your physical CPU time to the vcpu that's holding the lock you're blocking on. I know s390 can do this. Perhaps it's something worth generalizing in paravirt_ops? -- Chris
Re: [PATCH 14/17] atl1 trivial endianness misannotations
Al Viro wrote: NB: driver is chock-full of code that will break on big-endian; as long as the hardware is onboard-only we can live with that, but sooner or later that'll need fixing. Signed-off-by: Al Viro [EMAIL PROTECTED] --- drivers/net/atl1/atl1_main.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c index 88d4f70..dee3638 100644 --- a/drivers/net/atl1/atl1_main.c +++ b/drivers/net/atl1/atl1_main.c @@ -1328,7 +1328,7 @@ static int atl1_tx_csum(struct atl1_adapter *adapter, struct sk_buff *skb, if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) { cso = skb->h.raw - skb->data; - css = (skb->h.raw + skb->csum) - skb->data; + css = (skb->h.raw + skb->csum_offset) - skb->data; if (unlikely(cso & 0x1)) { printk(KERN_DEBUG "%s: payload offset != even number\n", atl1_driver_name); This could certainly explain some checksumming problems we've seen. @@ -1562,7 +1562,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev) /* mss will be nonzero if we're doing segment offload (TSO/GSO) */ mss = skb_shinfo(skb)->gso_size; if (mss) { - if (skb->protocol == ntohs(ETH_P_IP)) { + if (skb->protocol == htons(ETH_P_IP)) { proto_hdr_len = ((skb->h.raw - skb->data) + (skb->h.th->doff << 2)); if (unlikely(proto_hdr_len > len)) { ACK.
Re: [PATCH] fix atl1 braino
Al Viro wrote: Spot the bug... Signed-off-by: Al Viro [EMAIL PROTECTED] --- diff --git a/drivers/net/atl1/atl1_hw.c b/drivers/net/atl1/atl1_hw.c index 08b2d78..e28707a 100644 --- a/drivers/net/atl1/atl1_hw.c +++ b/drivers/net/atl1/atl1_hw.c @@ -357,7 +357,7 @@ void atl1_hash_set(struct atl1_hw *hw, u32 hash_value) */ hash_reg = (hash_value >> 31) & 0x1; hash_bit = (hash_value >> 26) & 0x1F; - mta = ioread32((hw + REG_RX_HASH_TABLE) + (hash_reg << 2)); + mta = ioread32((hw->hw_addr + REG_RX_HASH_TABLE) + (hash_reg << 2)); mta |= (1 << hash_bit); iowrite32(mta, (hw->hw_addr + REG_RX_HASH_TABLE) + (hash_reg << 2)); } ACK. Thanks for catching this.
Re: GPL vs non-GPL device drivers
v j wrote: You don't get it, do you? Our source code is meaningless to the Open Source community at large. It is only useful to our tiny set of competitors that have nothing to do with Linux. The Embedded space is very specific. We are only _using_ Linux. Just as we could have used VxWorks or OSE. Using our source code would not benefit anybody but our competitors. Sure we could make our drivers open-source. This is a decision that is made FIRST when evaluating an OS. If we were required to make our drivers/HW open, we would just not have chosen Linux. It is as simple as that. Collaborating with the competition (coopetition) on a common technology platform reduces costs for anyone who chooses to get involved, giving them a collective competitive edge against anyone who doesn't. This is why there is so much industry interest in F/OSS, and mortal enemies in the business world happily work together on technical issues in Linux. If you choose to actively participate in the community, you will benefit from this phenomenon, as well as from the patches you will receive from very smart kernel hackers who don't even own your hardware, and the pool of mature GPL code you can use to improve your drivers. If you do not choose to actively participate in the community, you can still keep using existing versions of the kernel that work fine for you, even if future versions do not. There are plenty of embedded devices out there using 2.4 or even 2.2 kernels that do what they need. Your competitors who do participate in the community (and there are a lot in the embedded space) enjoy reduced development costs, more stable and better-reviewed code, continuous compatibility with the latest versions, and influence in the community over the direction of future development. If you want to cede this advantage to your competitors, that's between you and your investors.
-- Chris
Re: init's children list is long and slows reaping children.
Linus Torvalds wrote: On Thu, 5 Apr 2007, Robin Holt wrote: For testing, Jack Steiner created the following patch. All it does is move tasks which are transitioning to the zombie state from where they are in the children list to the head of the list. In this way, they will be the first found and reaping does speed up. We will still do a full scan of the list once the rearranged tasks are all removed. This does not seem to be a significant problem. I'd almost prefer to just put the zombie children on a separate list. I wonder how painful that would be.. That would still make it expensive for people who use WUNTRACED to get stopped children (since they'd have to look at all lists), but maybe that's not a big deal. Shouldn't be any worse than it already is. Another thing we could do is to just make sure that kernel threads simply don't end up as children of init. That whole thing is silly, they're really not children of the user-space init anyway. Comments? Linus Does anyone remember why we started doing this in the first place? I'm sure there are some tools that expect a process tree, rather than a forest, and making it a forest could make them unhappy. The support angel on my shoulder says we should just put all the kernel threads under a kthread subtree to shorten init's child list and minimize impact. The hacker devil on my other shoulder says that with usermode helpers, containers, etc. it's about time we treat it as a tree, and any tools that have a problem with that need to be fixed. -- Chris
Re: init's children list is long and slows reaping children.
Chris Snook wrote: Linus Torvalds wrote: On Thu, 5 Apr 2007, Robin Holt wrote: For testing, Jack Steiner created the following patch. All it does is move tasks which are transitioning to the zombie state from where they are in the children list to the head of the list. In this way, they will be the first found and reaping does speed up. We will still do a full scan of the list once the rearranged tasks are all removed. This does not seem to be a significant problem. I'd almost prefer to just put the zombie children on a separate list. I wonder how painful that would be.. That would still make it expensive for people who use WUNTRACED to get stopped children (since they'd have to look at all lists), but maybe that's not a big deal. Shouldn't be any worse than it already is. Another thing we could do is to just make sure that kernel threads simply don't end up as children of init. That whole thing is silly, they're really not children of the user-space init anyway. Comments? Linus Does anyone remember why we started doing this in the first place? I'm sure there are some tools that expect a process tree, rather than a forest, and making it a forest could make them unhappy. The support angel on my shoulder says we should just put all the kernel threads under a kthread subtree to shorten init's child list and minimize impact. The hacker devil on my other shoulder says that with usermode helpers, containers, etc. it's about time we treat it as a tree, and any tools that have a problem with that need to be fixed. -- Chris Err, that should have been "about time we treat it as a forest". -- Chris
Re: init's children list is long and slows reaping children.
Eric W. Biederman wrote: Linus Torvalds [EMAIL PROTECTED] writes: I'm not sure anybody would really be unhappy with pptr pointing to some magic and special task that has pid 0 (which makes it clear to everybody that the parent is something special), and that has SIGCHLD set to SIG_IGN (which should make the exit case not even go through the zombie phase). I can't even imagine *how* you'd make a tool unhappy with that, since even tools like ps (and even more so pstree) won't read all the process states atomically, so they invariably will see parent pointers that don't even exist any more, because by the time they get to the parent, it has exited already. Right. pid == 1 being missing might cause some confusion, but having ppid == 0 should be fine. Heck, pid == 1 already has ppid == 0, so it is a value user space has had to deal with for a while. In addition there was a period in 2.6 where most kernel threads and init had a pgid == 0 and a session == 0, and nothing seemed to complain. We should probably make all of the kernel threads children of init_task, the initial idle thread on the first cpu that is the parent of pid == 1. That will give the ppid == 0 naturally because the idle thread has pid == 0. Linus, Eric, thanks for the history lesson. I think it's safe to say that anything that breaks because of this sort of change was already broken anyway. If we're going to scale to an obscene number of CPUs (which I believe was the original motivation on this thread) then putting the dead children on their own list will probably scale better. -- Chris
[PATCH 0/2] use symbolic constants in generic lseek code
The generic lseek code in fs/read_write.c uses hardcoded values for SEEK_{SET,CUR,END}. Patch 1 fixes the case statements to use the symbolic constants in include/linux/fs.h, and should not be at all controversial. Patch 2 adds a SEEK_MAX and uses it to validate user arguments. This makes the code a little cleaner and also enables future extensions (such as SEEK_DATA and SEEK_HOLE). If anyone has a problem with this, please speak up. -- Chris
[PATCH 1/2] use symbolic constants in generic lseek code
From: Chris Snook [EMAIL PROTECTED] Convert magic numbers to SEEK_* values from fs.h Signed-off-by: Chris Snook [EMAIL PROTECTED] -- --- a/fs/read_write.c 2007-02-20 14:49:45.0 -0500 +++ b/fs/read_write.c 2007-02-20 16:48:39.0 -0500 @@ -37,10 +37,10 @@ loff_t generic_file_llseek(struct file * mutex_lock(&inode->i_mutex); switch (origin) { - case 2: + case SEEK_END: offset += inode->i_size; break; - case 1: + case SEEK_CUR: offset += file->f_pos; } retval = -EINVAL; @@ -63,10 +63,10 @@ loff_t remote_llseek(struct file *file, lock_kernel(); switch (origin) { - case 2: + case SEEK_END: offset += i_size_read(file->f_path.dentry->d_inode); break; - case 1: + case SEEK_CUR: offset += file->f_pos; } retval = -EINVAL; @@ -94,10 +94,10 @@ loff_t default_llseek(struct file *file, lock_kernel(); switch (origin) { - case 2: + case SEEK_END: offset += i_size_read(file->f_path.dentry->d_inode); break; - case 1: + case SEEK_CUR: offset += file->f_pos; } retval = -EINVAL;
[PATCH 2/2] use SEEK_MAX to validate user lseek arguments
From: Chris Snook [EMAIL PROTECTED] Add SEEK_MAX and use it to validate lseek arguments from userspace. Signed-off-by: Chris Snook [EMAIL PROTECTED] -- diff -urp b/fs/read_write.c c/fs/read_write.c --- b/fs/read_write.c 2007-02-20 16:48:39.0 -0500 +++ c/fs/read_write.c 2007-02-20 16:55:46.0 -0500 @@ -139,7 +139,7 @@ asmlinkage off_t sys_lseek(unsigned int goto bad; retval = -EINVAL; - if (origin <= 2) { + if (origin <= SEEK_MAX) { loff_t res = vfs_llseek(file, offset, origin); retval = res; if (res != (loff_t)retval) @@ -166,7 +166,7 @@ asmlinkage long sys_llseek(unsigned int goto bad; retval = -EINVAL; - if (origin > 2) + if (origin > SEEK_MAX) goto out_putf; offset = vfs_llseek(file, ((loff_t) offset_high << 32) | offset_low, diff -urp b/include/linux/fs.h c/include/linux/fs.h --- b/include/linux/fs.h 2007-02-20 14:49:46.0 -0500 +++ c/include/linux/fs.h 2007-02-20 16:54:30.0 -0500 @@ -30,6 +30,7 @@ #define SEEK_SET 0 /* seek relative to beginning of file */ #define SEEK_CUR 1 /* seek relative to current file position */ #define SEEK_END 2 /* seek relative to end of file */ +#define SEEK_MAX SEEK_END /* And dynamically-tunable limits and defaults: */ struct files_stat_struct {
Re: Lower HD transfer rate with NCQ enabled?
Paa Paa wrote: I'm using Linux 2.6.20.4. I noticed that I get lower SATA hard drive throughput with 2.6.20.4 than with 2.6.19. The reason was that 2.6.20 enables NCQ by default (queue_depth = 31/32 instead of 0/32). Transfer rate was measured using hdparm -t: With NCQ (queue_depth == 31): 50MB/s. Without NCQ (queue_depth == 0): 60MB/s. 20% difference is quite a lot. This is with Intel ICH8R controller and Western Digital WD1600YS hard disk in AHCI mode. I also used the next command to cat-copy a biggish (540MB) file and time it: rm temp; sync; time sh -c 'cat quite_big_file > temp; sync' Here I noticed no differences at all with and without NCQ. The times (real time) were basically the same in many successive runs. Around 19s. Q: What conclusion can I make on hdparm -t results, or can I make any conclusions? Do I really have lower performance with NCQ or not? If I do, is this because of my HD or because of kernel? hdparm -t is a perfect example of a synthetic benchmark. NCQ was designed to optimize real-world workloads. The overhead gets hidden pretty well when there are multiple requests in flight simultaneously, as tends to be the case when you have a user thread reading data while a kernel thread is asynchronously flushing the user thread's buffered writes. Given that you're breaking even with one user thread and one kernel thread doing I/O, you'll probably get performance improvements with higher thread counts. -- Chris
Re: Usage semantics of atomic_set ( )
Vineet Gupta wrote: I'm trying to implement atomic ops for a CPU which has no inherent support for Read-Modify-Write Ops. Instead of using a global spin lock which protects all the atomic APIs, I want to use a spin lock per instance of atomic_t. What operations are you using to implement spinlocks? A few architectures use arrays of spinlocks to implement atomic_t. I believe sparc and parisc are among them. Assuming your spinlock implementation is sound and efficient, the same technique should work for you. -- Chris
Re: irq load balancing
Venkat Subbiah wrote: Most of the load in my system is triggered by a single ethernet IRQ. Essentially the IRQ schedules a tasklet and most of the work is done in the tasklet, which is scheduled in the IRQ. From what I read it looks like the tasklet would be executed on the same CPU on which it was scheduled. So this means even in an SMP system it will be one processor which is overloaded. So will using the user space IRQ loadbalancer really help? A little bit. It'll keep other IRQs on different CPUs, which will prevent other interrupts from causing cache and TLB evictions that could slow down the interrupt handler for the NIC. What I am doubtful about is that the user space load balancer comes along and changes the affinity once in a while. But really what I need is every interrupt to go to a different CPU in a round robin fashion. Doing it in a round-robin fashion will be disastrous for performance. Your cache miss rate will go through the roof and you'll hit the slow paths in the network stack most of the time. Looks like the APIC can distribute IRQs dynamically? Is this supported in the kernel, and is there any config or proc interface to turn this on/off? /proc/irq/$FOO/smp_affinity is a bitmask. You can mask an irq to multiple processors. Of course, this will absolutely kill your performance. That's why irqbalance never does this. -- Chris
Re: Lossy interrupts on x86_64
Jesse Barnes wrote: I just narrowed down a weird problem where I was losing more than 50% of my vblank interrupts to what seems to be the hires timers patch. Stock 2.6.23-rc5 works fine, but the latest (171) kernel from rawhide drops most of my interrupts unless I also have another interrupt source running (e.g. if I hold down a key or move the mouse I get the expected number of vblank interrupts, otherwise I get between 3 and 30 instead of the expected 60 per second). Any ideas? It seems like it might be bad APIC programming, but I haven't gone through those mods to look for suspects... What happens if you boot with 'noapic' or 'pci=nomsi'? Please post dmesg as well so we can see how the kernel is initializing the relevant hardware. -- Chris
[PATCH] x86_64: make atomic64_t semantics consistent with atomic_t
From: Chris Snook [EMAIL PROTECTED] The volatile keyword has already been removed from the declaration of atomic_t on x86_64. For consistency, remove it from atomic64_t as well. Signed-off-by: Chris Snook [EMAIL PROTECTED] --- a/include/asm-x86_64/atomic.h 2007-07-08 19:32:17.0 -0400 +++ b/include/asm-x86_64/atomic.h 2007-09-13 11:30:51.0 -0400 @@ -206,7 +206,7 @@ static __inline__ int atomic_sub_return( /* An 64bit atomic type */ -typedef struct { volatile long counter; } atomic64_t; +typedef struct { long counter; } atomic64_t; #define ATOMIC64_INIT(i) { (i) }
Re: irq load balancing
Venkat Subbiah wrote: Since most network devices have a single status register for both receiver and transmit (and errors and the like), which needs a lock to protect access, you will likely end up with serious thrashing of moving the lock between cpus. Any ways to measure the thrashing of locks? Since most network devices have a single status register for both receiver and transmit (and errors and the like) These register accesses will be mostly within the irq handler, which I plan on keeping on the same processor. The network driver is actually tg3. Will look closely into the driver. Why are you trying to do this, anyway? This is a classic example of fairness hurting both performance and efficiency. Unbalanced distribution of a single IRQ gives superior performance. There are cases when this is a worthwhile tradeoff, but the network stack is not one of them. In the HPC world, people generally want to squeeze maximum performance out of CPU/cache/RAM so they just accept the imbalance because it performs better than balancing it, and irqbalance can keep things fair over longer intervals if that's important. In the realtime world, people generally bind everything they can to one or two CPUs, and bind their realtime applications to the remaining ones to minimize contention. Distributing your network interrupts in a round-robin fashion will make your computer do exactly one thing faster: heat up the room. -- Chris
Re: CPU usage for 10Gbps UDP transfers
Lukas Hejtmanek wrote: Hello, is it expected that an application sending 8900-byte datagrams through a 10Gbps NIC utilizes the CPU to 100%, and similarly the receiver also utilizes the CPU to 100%? Is something wrong or is this quite OK? (The box is a dual single-core Opteron 2.4GHz with a Myricom 10GE NIC.) Every time a new generation of ethernet comes out, its peak throughput exceeds the memory/CPU/IO capacity of commodity hardware available at the time. This is normal. Of course, you may not be saturating the link, and it may be possible to tune the driver to improve your throughput, but you'll still be saturating a CPU on that hardware. -- Chris
Re: patch/option to wipe memory at boot?
David Madore wrote: On Mon, Sep 17, 2007 at 11:11:52AM -0700, Jeremy Fitzhardinge wrote: Boot memtest86 for a little while before booting the kernel? And if you haven't already run it for a while, then that would be your first step anyway. Indeed, that does the trick, thanks for the suggestion. So I can be quite confident, now, that my RAM is sane and it's just that the BIOS doesn't initialize it properly. But I'd still like some way of filling the RAM when Linux starts (or perhaps in the bootloader), because letting memtest86 run after every cold reboot isn't a very satisfactory solution. Bootloaders like to do things like run in 16-bit or 32-bit mode on boxes where higher bitness is necessary to access all the memory. It may be possible to do this in the bootloader, but the BIOS is clearly the correct place to fix this problem. -- Chris
Re: PAGE_SIZE on 64bit and 32bit machines
Yoav Artzi wrote: According to my knowledge the PAGE_SIZE on 32bit architectures is 4KB. Logically, the PAGE_SIZE on 64bit architectures should be 8KB. That's at least the way I understand it. However, looking at the kernel code of x86_64, I see the PAGE_SIZE is 4KB. Can anyone explain to me what am I missing here? PAGE_SIZE is highly architecture-dependent. While it is true that 4K pages are typical on 32-bit architectures, and 64-bit architectures have historically introduced 8K pages, this is by no means a requirement. x86_64 uses the same page sizes that are available on i686+PAE, so you get 4K base pages. alpha and sparc64 typically use 8K base pages, though they have other options as well. ia64 defaults to 16K, though it can do 4K, 8K, and a bunch of larger base sizes. ppc64 does 4K and 64K. s390 uses 4K base pages in both 31-bit and 64-bit kernels. If x86_64 processors are released with TLBs that can handle 8K pages, it'll be straightforward to add that feature, but otherwise it would require faking it in software, which has lots of pitfalls and does nothing to improve TLB efficiency. -- Chris
Re: Strange delays / what usually happens every 10 min?
Florian Boelstler wrote: While running that test driver a delay of about 10ms _exactly_ occurs every 10 minutes. This is precisely the sort of thing that BIOS/firmware-level SMI handlers do, particularly those that have monitoring or management features. Try to determine if the kernel is doing anything during this time. If the entire kernel seems to be frozen, talk to the people who wrote the firmware. -- Chris
Re: PROBLEM: IM Kernel Failure 12/11/07
[EMAIL PROTECTED] wrote: Linux version 2.4.9-e.38smp ([EMAIL PROTECTED]) (gcc version 2.96 2731 (Red Hat Linux 7.2 2.96-124.7.2)) #1 SMP Wed Feb 11 00:09:01 EST 2004 Ancient vendor kernels are very out of scope for this mailing list. The following links may be useful: https://bugzilla.redhat.com/ https://www.redhat.com/apps/support/ http://www.redhat.com/mailman/listinfo
Re: [PATCH] drivers/net/: Spelling fixes
Joe Perches wrote: drivers/net/atl1/atl1_hw.c | 2 +- drivers/net/atl1/atl1_main.c | 2 +- The atl1 code will be heavily reworked in the 2.6.25 merge window, so this may cause headaches. Please remove these chunks before merging. The spelling corrections themselves are fine, and I will ensure that the revised driver includes them, if the comments in question are still present at all once we're done with all the changes and cleanups. -- Chris
Re: [PATCH] Avoid overflows in kernel/time.c
H. Peter Anvin wrote: NOTE: This patch uses a bc(1) script to compute the appropriate constants. Perhaps dc would be more appropriate? That's included in busybox. -- Chris
Re: Kernel Development Objective-C
Ben Crowhurst wrote: Has Objective-C ever been considered for kernel development? No. Kernel programming requires what is essentially assembly language with a lot of syntactic sugar, which C provides. Higher-level languages abstract away too much detail to be suitable for the sort of bit-perfect control you need when you're directly controlling bare metal. You can still use object-oriented programming techniques in C, and we do this all the time in the kernel, but we do so with more fine-grained explicit control than a language like Objective-C would give us. More to the point, if we tried to use Objective-C, we'd find ourselves needing to fall back to C-style explicitness so often that it wouldn't be worth the trouble. In other news, I hear Hurd boots again! -- Chris
Re: Linux Kernel - Future works
Muhammad Nowbuth wrote: Hi all, Could anyone give some ideas of future pending works which are needed on the linux kernel? http://kernelnewbies.org/KernelHacking
Re: [2.6.22.y][PATCH] atl1: disable broken 64-bit DMA
Jay Cliburn wrote: atl1: disable broken 64-bit DMA [ Upstream commit: 5f08e46b621a769e52a9545a23ab1d5fb2aec1d4 ] The L1 network chip can DMA to 64-bit addresses, but multiple descriptor rings share a single register for the high 32 bits of their address, so only a single, aligned, 4 GB physical address range can be used at a time. As a result, we need to confine the driver to a 32-bit DMA mask, otherwise we see occasional data corruption errors in systems containing 4 or more gigabytes of RAM. Signed-off-by: Jay Cliburn [EMAIL PROTECTED] Cc: Luca Tettamanti [EMAIL PROTECTED] Cc: Chris Snook [EMAIL PROTECTED] Acked-by: Chris Snook [EMAIL PROTECTED]
Re: Strange NFS write performance Linux-Solaris-10/VXFS, maybe VM related
Martin Knoblauch wrote: Hi, currently I am tracking down an interesting effect when writing to a Solaris-10/Sparc based server. The server exports two filesystems. One UFS, one VXFS. The filesystems are mounted NFS3/TCP, no special options. Linux kernel in question is 2.6.24-rc6, but it happens with earlier kernels (2.6.19.2, 2.6.22.6) as well. The client is x86_64 with 8 GB of ram. The problem: when writing to the VXFS based filesystem, performance drops dramatically when the filesize reaches or exceeds dirty_ratio. For a dirty_ratio of 10% (about 800MB) files below 750 MB are transferred with about 30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I perform the same tests on the UFS based FS, performance stays at about 30 MB/sec until 3GB and likely larger (I just stopped at 3 GB). Any ideas what could cause this difference? Any suggestions on debugging it? 1) Try normal NFS tuning, such as rsize/wsize tuning. 2) You're entering synchronous writeback mode, so you can delay the problem by raising dirty_ratio to 100, or reduce the size of the problem by lowering dirty_ratio to 1. Either one could help. 3) It sounds like the bottleneck is the vxfs filesystem. It only *appears* on the client side because writes up until dirty_ratio get buffered on the client. If you can confirm that the server is actually writing stuff to disk slower when the client is in writeback mode, then it's possible the Linux NFS client is doing something inefficient in writeback mode. -- Chris
Re: Quad core CPU detected but shows as single core in 2.6.23.1
Zurk Tech wrote: Hi guys, I have a tyan s3992 h2000 with a single Barcelona AMD quad core cpu (the other cpu socket is empty). cat /proc/cpuinfo shows an AMD quad core processor but core : 1. I've compiled the kernel from scratch with smp and amd64 + the numa stuff. I also tried Debian etch's amd64 smp kernel with the same result. Is the AMD Barcelona quad core cpu not yet supported, or is it something else? Thanks for any insight. I'm completely stumped. I've dealt with multiprocessing machines before and have a couple of dual cores which are fine with the exact same kernel configs. My AMD TK-53 X2 Turions show 2 cores in cpuinfo. The bootstrap protocol for Barcelona is a little different from older Opterons, so an older BIOS that doesn't know the new protocol won't be able to bring up any CPU other than the bootstrap processor. My wild guess is that this is what's happening and a BIOS update will fix it, but as Arjan said, please post dmesg when reporting bugs like this. -- Chris
[PATCH] x86: mostly merge types.h
From: Chris Snook [EMAIL PROTECTED] Most of types_32.h and types_64.h are the same. Merge the common definitions into types.h, keeping the differences in their own files. Also #error if types_{32,64}.h is included directly. Tested with allmodconfig on x86_64. Signed-off-by: Chris Snook [EMAIL PROTECTED] types.h| 45 + types_32.h | 48 ++-- types_64.h | 47 +++ 3 files changed, 58 insertions(+), 82 deletions(-) diff -urp a/include/asm-x86/types_32.h b/include/asm-x86/types_32.h --- a/include/asm-x86/types_32.h2007-10-18 04:23:36.0 -0400 +++ b/include/asm-x86/types_32.h2007-10-18 07:03:05.0 -0400 @@ -1,64 +1,28 @@ #ifndef _I386_TYPES_H #define _I386_TYPES_H -#ifndef __ASSEMBLY__ - -typedef unsigned short umode_t; - -/* - * __xx is ok: it doesn't pollute the POSIX namespace. Use these in the - * header files exported to user space - */ - -typedef __signed__ char __s8; -typedef unsigned char __u8; - -typedef __signed__ short __s16; -typedef unsigned short __u16; - -typedef __signed__ int __s32; -typedef unsigned int __u32; +#ifndef _X86_TYPES_H +#error Do not include this file directly. Use asm/types.h instead. +#endif -#if defined(__GNUC__) +#if !defined(__ASSEMBLY__) && defined(__GNUC__) __extension__ typedef __signed__ long long __s64; __extension__ typedef unsigned long long __u64; #endif -#endif /* __ASSEMBLY__ */ - -/* - * These aren't exported outside the kernel to avoid name space clashes - */ #ifdef __KERNEL__ #define BITS_PER_LONG 32 #ifndef __ASSEMBLY__ - -typedef signed char s8; -typedef unsigned char u8; - -typedef signed short s16; -typedef unsigned short u16; - -typedef signed int s32; -typedef unsigned int u32; - -typedef signed long long s64; -typedef unsigned long long u64; - -/* DMA addresses come in generic and 64-bit flavours. */ - +/* DMA addresses come in generic and 64-bit flavours. 
*/ #ifdef CONFIG_HIGHMEM64G typedef u64 dma_addr_t; #else typedef u32 dma_addr_t; #endif -typedef u64 dma64_addr_t; #endif /* __ASSEMBLY__ */ - #endif /* __KERNEL__ */ - -#endif +#endif /* _I386_TYPES_H */ diff -urp a/include/asm-x86/types_64.h b/include/asm-x86/types_64.h --- a/include/asm-x86/types_64.h2007-10-18 04:23:36.0 -0400 +++ b/include/asm-x86/types_64.h2007-10-18 07:03:11.0 -0400 @@ -1,55 +1,22 @@ #ifndef _X86_64_TYPES_H #define _X86_64_TYPES_H -#ifndef __ASSEMBLY__ - -typedef unsigned short umode_t; - -/* - * __xx is ok: it doesn't pollute the POSIX namespace. Use these in the - * header files exported to user space - */ - -typedef __signed__ char __s8; -typedef unsigned char __u8; - -typedef __signed__ short __s16; -typedef unsigned short __u16; - -typedef __signed__ int __s32; -typedef unsigned int __u32; +#ifndef _X86_TYPES_H +#error Do not include this file directly. Use asm/types.h instead. +#endif +#ifndef __ASSEMBLY__ typedef __signed__ long long __s64; typedef unsigned long long __u64; +#endif -#endif /* __ASSEMBLY__ */ - -/* - * These aren't exported outside the kernel to avoid name space clashes - */ #ifdef __KERNEL__ #define BITS_PER_LONG 64 #ifndef __ASSEMBLY__ - -typedef signed char s8; -typedef unsigned char u8; - -typedef signed short s16; -typedef unsigned short u16; - -typedef signed int s32; -typedef unsigned int u32; - -typedef signed long long s64; -typedef unsigned long long u64; - -typedef u64 dma64_addr_t; typedef u64 dma_addr_t; - -#endif /* __ASSEMBLY__ */ +#endif #endif /* __KERNEL__ */ - -#endif +#endif /* _X86_64_TYPES_H */ diff -urp a/include/asm-x86/types.h b/include/asm-x86/types.h --- a/include/asm-x86/types.h 2007-10-18 04:23:36.0 -0400 +++ b/include/asm-x86/types.h 2007-10-18 06:59:37.0 -0400 @@ -1,3 +1,46 @@ +#ifndef _X86_TYPES_H +#define _X86_TYPES_H + +#ifndef __ASSEMBLY__ + +typedef unsigned short umode_t; + +/* + * __xx is ok: it doesn't pollute the POSIX namespace. 
Use these in the + * header files exported to user space + */ + +typedef __signed__ char __s8; +typedef unsigned char __u8; + +typedef __signed__ short __s16; +typedef unsigned short __u16; + +typedef __signed__ int __s32; +typedef unsigned int __u32; + +/* + * These aren't exported outside the kernel to avoid name space clashes + */ +#ifdef __KERNEL__ + +typedef signed char s8; +typedef unsigned char u8; + +typedef signed short s16; +typedef unsigned short u16; + +typedef signed int s32; +typedef unsigned int u32; + +typedef signed long long s64; +typedef unsigned long long u64; + +typedef u64 dma64_addr_t; + +#endif /* __KERNEL__ */ +#endif /* __ASSEMBLY__ */ + #ifdef __KERNEL__ # ifdef CONFIG_X86_32 # include types_32.h @@ -11,3 +54,5 @@ # include types_64.h # endif #endif + +#endif /* _X86_TYPES_H */ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message
[PATCH] x86: merge mmu{,_32,_64}.h
From: Chris Snook [EMAIL PROTECTED] Merge mmu_32.h and mmu_64.h into mmu.h. Signed-off-by: Chris Snook [EMAIL PROTECTED] diff -Nurp a/include/asm-x86/mmu_32.h b/include/asm-x86/mmu_32.h --- a/include/asm-x86/mmu_32.h 2007-10-20 02:42:24.0 -0400 +++ b/include/asm-x86/mmu_32.h 1969-12-31 19:00:00.0 -0500 @@ -1,18 +0,0 @@ -#ifndef __i386_MMU_H -#define __i386_MMU_H - -#include linux/mutex.h -/* - * The i386 doesn't have a mmu context, but - * we put the segment information here. - * - * cpu_vm_mask is used to optimize ldt flushing. - */ -typedef struct { - int size; - struct mutex lock; - void *ldt; - void *vdso; -} mm_context_t; - -#endif diff -Nurp a/include/asm-x86/mmu_64.h b/include/asm-x86/mmu_64.h --- a/include/asm-x86/mmu_64.h 2007-10-20 02:42:24.0 -0400 +++ b/include/asm-x86/mmu_64.h 1969-12-31 19:00:00.0 -0500 @@ -1,21 +0,0 @@ -#ifndef __x86_64_MMU_H -#define __x86_64_MMU_H - -#include linux/spinlock.h -#include linux/mutex.h - -/* - * The x86_64 doesn't have a mmu context, but - * we put the segment information here. - * - * cpu_vm_mask is used to optimize ldt flushing. - */ -typedef struct { - void *ldt; - rwlock_t ldtlock; - int size; - struct mutex lock; - void *vdso; -} mm_context_t; - -#endif diff -Nurp a/include/asm-x86/mmu.h b/include/asm-x86/mmu.h --- a/include/asm-x86/mmu.h 2007-10-20 02:42:24.0 -0400 +++ b/include/asm-x86/mmu.h 2007-10-20 02:38:36.0 -0400 @@ -1,5 +1,23 @@ -#ifdef CONFIG_X86_32 -# include mmu_32.h -#else -# include mmu_64.h +#ifndef _ASM_X86_MMU_H +#define _ASM_X86_MMU_H + +#include linux/spinlock.h +#include linux/mutex.h + +/* + * The x86 doesn't have a mmu context, but + * we put the segment information here. + * + * cpu_vm_mask is used to optimize ldt flushing. 
+ */ +typedef struct { + void *ldt; +#ifdef CONFIG_X86_64 + rwlock_t ldtlock; +#endif + int size; + struct mutex lock; + void *vdso; +} mm_context_t; + +#endif /* _ASM_X86_MMU_H */
[PATCH] x86: unify a.out{,_32,_64}.h
From: Chris Snook [EMAIL PROTECTED] Unify x86 a.out_32.h and a.out_64.h Signed-off-by: Chris Snook [EMAIL PROTECTED] diff -Nurp a/include/asm-x86/a.out_32.h b/include/asm-x86/a.out_32.h --- a/include/asm-x86/a.out_32.h2007-10-20 06:20:01.0 -0400 +++ b/include/asm-x86/a.out_32.h1969-12-31 19:00:00.0 -0500 @@ -1,27 +0,0 @@ -#ifndef __I386_A_OUT_H__ -#define __I386_A_OUT_H__ - -struct exec -{ - unsigned long a_info;/* Use macros N_MAGIC, etc for access */ - unsigned a_text; /* length of text, in bytes */ - unsigned a_data; /* length of data, in bytes */ - unsigned a_bss; /* length of uninitialized data area for file, in bytes */ - unsigned a_syms; /* length of symbol table data in file, in bytes */ - unsigned a_entry;/* start address */ - unsigned a_trsize; /* length of relocation info for text, in bytes */ - unsigned a_drsize; /* length of relocation info for data, in bytes */ -}; - -#define N_TRSIZE(a)((a).a_trsize) -#define N_DRSIZE(a)((a).a_drsize) -#define N_SYMSIZE(a) ((a).a_syms) - -#ifdef __KERNEL__ - -#define STACK_TOP TASK_SIZE -#define STACK_TOP_MAX STACK_TOP - -#endif - -#endif /* __A_OUT_GNU_H__ */ diff -Nurp a/include/asm-x86/a.out_64.h b/include/asm-x86/a.out_64.h --- a/include/asm-x86/a.out_64.h2007-10-20 06:20:01.0 -0400 +++ b/include/asm-x86/a.out_64.h1969-12-31 19:00:00.0 -0500 @@ -1,28 +0,0 @@ -#ifndef __X8664_A_OUT_H__ -#define __X8664_A_OUT_H__ - -/* 32bit a.out */ - -struct exec -{ - unsigned int a_info; /* Use macros N_MAGIC, etc for access */ - unsigned a_text; /* length of text, in bytes */ - unsigned a_data; /* length of data, in bytes */ - unsigned a_bss; /* length of uninitialized data area for file, in bytes */ - unsigned a_syms; /* length of symbol table data in file, in bytes */ - unsigned a_entry;/* start address */ - unsigned a_trsize; /* length of relocation info for text, in bytes */ - unsigned a_drsize; /* length of relocation info for data, in bytes */ -}; - -#define N_TRSIZE(a)((a).a_trsize) -#define N_DRSIZE(a)((a).a_drsize) 
-#define N_SYMSIZE(a) ((a).a_syms) - -#ifdef __KERNEL__ -#include linux/thread_info.h -#define STACK_TOP TASK_SIZE -#define STACK_TOP_MAX TASK_SIZE64 -#endif - -#endif /* __A_OUT_GNU_H__ */ diff -Nurp a/include/asm-x86/a.out.h b/include/asm-x86/a.out.h --- a/include/asm-x86/a.out.h 2007-10-20 06:20:01.0 -0400 +++ b/include/asm-x86/a.out.h 2007-10-20 06:14:26.0 -0400 @@ -1,13 +1,32 @@ +#ifndef _ASM_X86_A_OUT_H +#define _ASM_X86_A_OUT_H + +/* 32bit a.out */ + +struct exec +{ + unsigned int a_info; /* Use macros N_MAGIC, etc for access */ + unsigned a_text; /* length of text, in bytes */ + unsigned a_data; /* length of data, in bytes */ + unsigned a_bss; /* length of uninitialized data area for file, in bytes */ + unsigned a_syms; /* length of symbol table data in file, in bytes */ + unsigned a_entry;/* start address */ + unsigned a_trsize; /* length of relocation info for text, in bytes */ + unsigned a_drsize; /* length of relocation info for data, in bytes */ +}; + +#define N_TRSIZE(a)((a).a_trsize) +#define N_DRSIZE(a)((a).a_drsize) +#define N_SYMSIZE(a) ((a).a_syms) + #ifdef __KERNEL__ +# include linux/thread_info.h +# define STACK_TOP TASK_SIZE # ifdef CONFIG_X86_32 -# include a.out_32.h +# define STACK_TOP_MAXSTACK_TOP # else -# include a.out_64.h -# endif -#else -# ifdef __i386__ -# include a.out_32.h -# else -# include a.out_64.h +# define STACK_TOP_MAXTASK_SIZE64 # endif #endif + +#endif /* _ASM_X86_A_OUT_H */ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86: unify div64{,_32,_64}.h
From: Chris Snook [EMAIL PROTECTED] Unify x86 div64.h headers. Signed-off-by: Chris Snook [EMAIL PROTECTED] diff -Nurp a/include/asm-x86/div64_32.h b/include/asm-x86/div64_32.h --- a/include/asm-x86/div64_32.h2007-10-20 07:33:53.0 -0400 +++ b/include/asm-x86/div64_32.h1969-12-31 19:00:00.0 -0500 @@ -1,52 +0,0 @@ -#ifndef __I386_DIV64 -#define __I386_DIV64 - -#include linux/types.h - -/* - * do_div() is NOT a C function. It wants to return - * two values (the quotient and the remainder), but - * since that doesn't work very well in C, what it - * does is: - * - * - modifies the 64-bit dividend _in_place_ - * - returns the 32-bit remainder - * - * This ends up being the most efficient calling - * convention on x86. - */ -#define do_div(n,base) ({ \ - unsigned long __upper, __low, __high, __mod, __base; \ - __base = (base); \ - asm(:=a (__low), =d (__high):A (n)); \ - __upper = __high; \ - if (__high) { \ - __upper = __high % (__base); \ - __high = __high / (__base); \ - } \ - asm(divl %2:=a (__low), =d (__mod):rm (__base), 0 (__low), 1 (__upper)); \ - asm(:=A (n):a (__low),d (__high)); \ - __mod; \ -}) - -/* - * (long)X = ((long long)divs) / (long)div - * (long)rem = ((long long)divs) % (long)div - * - * Warning, this will do an exception if X overflows. 
- */ -#define div_long_long_rem(a,b,c) div_ll_X_l_rem(a,b,c) - -static inline long -div_ll_X_l_rem(long long divs, long div, long *rem) -{ - long dum2; - __asm__(divl %2:=a(dum2), =d(*rem) - :rm(div), A(divs)); - - return dum2; - -} - -extern uint64_t div64_64(uint64_t dividend, uint64_t divisor); -#endif diff -Nurp a/include/asm-x86/div64_64.h b/include/asm-x86/div64_64.h --- a/include/asm-x86/div64_64.h2007-10-20 07:33:53.0 -0400 +++ b/include/asm-x86/div64_64.h1969-12-31 19:00:00.0 -0500 @@ -1 +0,0 @@ -#include asm-generic/div64.h diff -Nurp a/include/asm-x86/div64.h b/include/asm-x86/div64.h --- a/include/asm-x86/div64.h 2007-10-20 07:33:53.0 -0400 +++ b/include/asm-x86/div64.h 2007-10-20 07:32:34.0 -0400 @@ -1,5 +1,58 @@ +#ifndef _ASM_X86_DIV64_H +#define _ASM_X86_DIV64_H + #ifdef CONFIG_X86_32 -# include div64_32.h -#else -# include div64_64.h -#endif + +#include linux/types.h + +/* + * do_div() is NOT a C function. It wants to return + * two values (the quotient and the remainder), but + * since that doesn't work very well in C, what it + * does is: + * + * - modifies the 64-bit dividend _in_place_ + * - returns the 32-bit remainder + * + * This ends up being the most efficient calling + * convention on x86. + */ +#define do_div(n,base) ({ \ + unsigned long __upper, __low, __high, __mod, __base; \ + __base = (base); \ + asm(:=a (__low), =d (__high):A (n)); \ + __upper = __high; \ + if (__high) { \ + __upper = __high % (__base); \ + __high = __high / (__base); \ + } \ + asm(divl %2:=a (__low), =d (__mod):rm (__base), 0 (__low), 1 (__upper)); \ + asm(:=A (n):a (__low),d (__high)); \ + __mod; \ +}) + +/* + * (long)X = ((long long)divs) / (long)div + * (long)rem = ((long long)divs) % (long)div + * + * Warning, this will do an exception if X overflows. 
+ */ +#define div_long_long_rem(a,b,c) div_ll_X_l_rem(a,b,c) + +static inline long +div_ll_X_l_rem(long long divs, long div, long *rem) +{ + long dum2; + __asm__("divl %2":"=a"(dum2), "=d"(*rem) + :"rm"(div), "A"(divs)); + + return dum2; + +} + +extern uint64_t div64_64(uint64_t dividend, uint64_t divisor); + +# else +# include <asm-generic/div64.h> +# endif /* CONFIG_X86_32 */ +#endif /* _ASM_X86_DIV64_H */
Re: 2.6.25-rc1 panics on boot
Dhaval Giani wrote: I am getting the following oops on bootup on 2.6.25-rc1 ... I am booting using kexec with maxcpus=1. It does not have any problems with maxcpus=2 or higher. Sounds like another (the same?) kexec cpu numbering bug. Can you post/link the entire dmesg from both a cold boot and a kexec boot so we can compare? -- Chris
Re: linux-next build status
Stephen Rothwell wrote: Hi all, Initial status can be seen here http://kisskb.ellerman.id.au/kisskb/branch/9/ (I hope to make a better URL soon). Suggestions for more compiler/config combinations are welcome, but we can't necessarily commit to fulfilling all you wishes. :-) i386 allmodconfig please. Also, I highly recommend adding some randconfig builds, at least one 32-bit arch and one 64-bit arch. Any given randconfig build is not particularly likely to catch bugs that would be missed elsewhere, but doing them daily for two months will catch a lot of things before they get released. The catch, of course, is that you have to actually save the .config for this to be useful, which might require a slight modification to your scripts. -- Chris
Re: linux-next build status
Tony Breeds wrote: On Thu, Feb 14, 2008 at 08:24:27PM -0500, Chris Snook wrote: Stephen Rothwell wrote: Hi all, Initial status can be seen here http://kisskb.ellerman.id.au/kisskb/branch/9/ (I hope to make a better URL soon). Suggestions for more compiler/config combinations are welcome, but we can't necessarily commit to fulfilling all you wishes. :-) i386 allmodconfig please. Won't i386 allmodconfig be equivalent to x86_64 allmodconfig? Only if there are no bugs. Driver code is most likely to trip over bitness/endianness bugs, and you've already got allmodconfig builds for be32, be64, and le64 architectures. Adding an le32 architecture (i386) completes the coverage of these basic categories. -- Chris
[PATCH] make LKDTM depend on BLOCK
From: Chris Snook [EMAIL PROTECTED] Make LKDTM depend on BLOCK to prevent build failures with certain configs. Signed-off-by: Chris Snook [EMAIL PROTECTED] diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a370fe8..24b327c 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -524,6 +524,7 @@ config LKDTM tristate Linux Kernel Dump Test Tool Module depends on DEBUG_KERNEL depends on KPROBES + depends on BLOCK default n help This module enables testing of the different dumping mechanisms by
[PATCH RESEND] x86_64: make atomic64_t work like atomic_t
Regardless of the greater controversy about the semantics of atomic_t, I think we can all agree that atomic_t and atomic64_t should have the same semantics. This is presently not the case on x86_64, where the volatile keyword was removed from the declaration of atomic_t, but it was not removed from the declaration of atomic64_t. The following patch fixes that inconsistency, without delving into anything more controversial. From: Chris Snook [EMAIL PROTECTED] The volatile keyword has already been removed from the declaration of atomic_t on x86_64. For consistency, remove it from atomic64_t as well. Signed-off-by: Chris Snook [EMAIL PROTECTED] CC: Andi Kleen [EMAIL PROTECTED] --- a/include/asm-x86_64/atomic.h 2007-07-08 19:32:17.0 -0400 +++ b/include/asm-x86_64/atomic.h 2007-09-13 11:30:51.0 -0400 @@ -206,7 +206,7 @@ static __inline__ int atomic_sub_return( /* An 64bit atomic type */ -typedef struct { volatile long counter; } atomic64_t; +typedef struct { long counter; } atomic64_t; #define ATOMIC64_INIT(i) { (i) }
Re: Bonnie++ with 1024k stripe SW/RAID5 causes kernel to goto D-state
Justin Piszcz wrote: Kernel: 2.6.23-rc8 (older kernels do this as well) When running the following command: /usr/bin/time /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:10:16:64 It hangs unless I increase various md/raid parameters such as the stripe_cache_size etc.. # ps auxww | grep D USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 276 0.0 0.0 0 0 ?D12:14 0:00 [pdflush] root 277 0.0 0.0 0 0 ?D12:14 0:00 [pdflush] root 1639 0.0 0.0 0 0 ?D 12:14 0:00 [xfsbufd] root 1767 0.0 0.0 8100 420 ?Ds 12:14 0:00 root 2895 0.0 0.0 5916 632 ?Ds 12:15 0:00 /sbin/syslogd -r See the bottom for more details. Is this normal? Does md only work without tuning up to a certain stripe size? I use a RAID 5 with 1024k stripe which works fine with many optimizations, but if I just boot the system and run bonnie++ on it without applying the optimizations, it will hang in D-state. When I run the optimizations, then it exits out of D-state, pretty weird? Not at all. 1024k stripes are way outside the norm. If you do something way outside the norm, and don't tune for it in advance, don't be terribly surprised when something like bonnie++ brings your box to its knees. That's not to say we couldn't make md auto-tune itself more intelligently, but this isn't really a bug. With a sufficiently huge amount of RAM, you'd be able to dynamically allocate the buffers that you're not pre-allocating with stripe_cache_size, but bonnie++ is eating that up in this case. -- Chris
Re: One process with multiple user ids.
Giuliano Gagliardi wrote: Hello, I have a server that has to switch to different user ids, but because it does other complex things, I would rather not have it run as root. Well, it's probably going to have to *start* as root, or use something like sudo. It's probably easiest to have it start as root and drop privileges as soon as possible, certainly before handling any untrusted data. I only need the server to be able to switch to certain pre-defined user ids. This is a very easy special case. Just start a process for each user ID and drop root privileges. They can communicate via sockets or even shared memory. If you wanted to switch between arbitrary UIDs at runtime, it might be worth doing something exotic, but it's really not in this case. Also, if you do it this way, it's rather easy to verify the correctness of your design, and you never have to touch kernel code. I have seen that two possible solutions have already been suggested here on the LKML, but it was some years ago, and nothing like it has been implemented. (1) Having supplementary user ids like there are supplementary group ids and system calls getuids() and setuids() that work like getgroups() and setgroups() But you can already accomplish this with ACLs and SELinux. You're trying to make this problem harder than it really is. (2) Allowing processes to pass user and group ids via sockets. And do what with them? You can already pass arbitrary data via sockets. It sounds like you need (1) to use (2). Both (1) and (2) would solve my problem. Now my question is whether there are any fundamental flaws with (1) or (2), or whether the right way to solve my problem is another one. (1) doesn't accomplish anything you can't already do, but it would make a huge mess of a lot of code. (2) is silly. Sockets are for communicating between userspace processes. If you want to be granting/revoking credentials, you should be using system calls, and even then only if you absolutely must. 
Having the kernel snoop traffic on sockets between processes would be disastrous for performance, and without that, any process could claim that it had been granted privileges over a socket and the kernel would just have to trust it. Don't overthink this. You don't need to touch the kernel at all to do this. Just use a multi-process model, like qmail does, for example. You can start with root privileges and drop them, or use sudo to help you out. It's fast, secure, takes advantage of modern multi-core CPUs, and is much simpler. -- Chris
Re: gigabit ethernet power consumption
Pavel Machek wrote: Hi! I've found that gbit vs. 100mbit power consumption difference is about 1W -- pretty significant. (Maybe powertop should include it in the tips section? :). Energy Star people insist that machines should switch down to 100mbit when network is idle, and I guess that makes a lot of sense -- you save 1W locally and 1W on the router. Question is, how to implement it correctly? Daemon that would watch data rates and switch speeds using mii-tool would be simple, but is that enough? I believe you misspelled ethtool. While you're at it, why stop at 100Mb? I believe you save even more power at 10Mb, which is why WOL puts the card in 10Mb mode. In my experience, you generally want either the maximum setting or the minimum setting when going for power savings, because of the race-to-idle effect. Workloads that have a sustained fractional utilization are rare. Right now I'm at home, hooked up to a cable modem, so anything over 4Mb is wasted, unless I'm talking to the box across the room, which is rare. Talk to the NetworkManager folks. This is right up their alley. -- Chris
Re: [ANNOUNCE] DeskOpt - on fly task, i/o scheduler optimization
Michal Piotrowski wrote: Hi, Here is something that might be useful for gamers and audio/video editors http://www.stardust.webpages.pl/files/tools/deskopt/ You can easily tune CFS/CFQ scheduler params I would think that gamers and AV editors would want to be using deadline (or maybe even as), not cfq. How well does it work with other I/O schedulers? -- Chris
Re: HIMEM calculation
James C. Georgas wrote: I'm not sure I understand how the kernel calculates the amount of physical RAM it can map during the boot process. I've quoted two blocks of kernel messages below, one for a kernel with NOHIGHMEM and another for a kernel with HIGHMEM4G. If I do the math on the BIOS provided physical RAM map, there is less than 5MiB of the address space reserved. Since I only have 1GiB of physical RAM in the board, I figured that it would still be possible to physically map 1019MiB, even with the 3GiB/1GiB split between user space and kernel space that occurs with NOHIGHMEM. However, what actually happens is that I'm 127MiB short of a full GiB. What am I missing here? Why does that last 127MiB have to go in HIGHMEM? That's the vmalloc address space. You only get 896 MB in the NORMAL zone on i386, to leave room for vmalloc. If you don't like it, go 64-bit. -- Chris
Re: HIMEM calculation
James Georgas wrote: That's the vmalloc address space. You only get 896 MB in the NORMAL zone on i386, to leave room for vmalloc. If you don't like it, go 64-bit. -- Chris I like it fine. I just didn't understand it. Thanks for answering. So, basically, the vmalloc address space is not backed by physical RAM, right? Rather, the virtual address space associated with vmalloc is mapped to physical pages by page tables? Basically, yes, but that's an oversimplification. We actually use page tables everywhere, but the conversion is simply +/- 0xC0000000 for the NORMAL zone, so we can skip most of the fancy VM work and just use a trivial macro. vmalloc can allocate large chunks of virtually contiguous memory even when the physical memory is heavily fragmented, and since we've set aside address space for it, it's visible in all process contexts. vmalloc is handy sometimes because it can complete even if there's no memory free when it's called, since the VM will swap out user pages and then return those remapped into the vmalloc address space. Unfortunately, we can't use vmalloc anywhere we want to use DMA because it will be accessed without the MMU. Worse, we also can't use it in any path that could be called while trying to free memory, due to recursion issues, which substantially limits its utility in the kernel. Some people *cough*OpenAFS*cough* use it carelessly and get all kinds of exciting panics under rare and difficult-to-reproduce load conditions. -- Chris
Re: mutex vs cache coherency protocol(for multiprocessor )
Xu Yang wrote: Hello everyone, Just got a rough question in my head; don't know whether anyone is interested. mutex vs cache coherency protocol (for multiprocessor): both of these can be used to protect shared resources in memory. Are both of them necessary? For example: in a multiprocessor system, if there is only mutex and no cache coherency, obviously this would cause problems. What about when there is no mutex mechanism, only a cache coherency protocol? After consideration, I found this could also cause problems when the processors are multithreading processors, meaning more than one thread can be running on one processor. In this case, if we only have cache coherency and no mutex, this would cause problems, because all the threads running on one processor share one cache, so the cache coherency protocol cannot help; the shared resource could be corrupted by different threads. Then, if all the processors in the multiprocessor system are single-thread processors, so only one thread can be running on each processor, is it OK if we only have a cache coherency protocol and no mutex mechanism? Anyone have any idea? All comments are welcome and appreciated, including criticism. Cache coherency is necessary for SMP locking primitives (and thus Linux SMP support), but it is hardly sufficient. Take a look at all the exciting inline assembly in include/asm/ for spinlocks, atomic operations, etc. -- Chris
Re: modinfo modulename question
Justin Piszcz wrote: Is there any way to get/see what parameters were passed to a kernel module? Running modinfo -p module will show the defaults, but for example, st, the scsi tape driver, is there a way to see what it is currently using? I know dmesg shows this when you load it initially (but what if, say, dmesg has been cleared or the buffer has filled up)? /sys/module/$MODULENAME/parameters/
Re: Health monitor of a multi-threaded process
Yishai Hadas wrote: Hi List, I'm looking for any mechanism in a multi-threaded process to monitor the health of its running threads, either by a dedicated monitor thread or by any other mechanism. It includes the following aspects: 1) Threads are running and not stuck on any lock. If you're using posix locking, you'll never find yourself busy-waiting for very long. Use ps or top. 2) Threads are running and have not died accidentally. Use ps or top. 3) Threads are not consuming too much CPU/memory. Use ps or top. You'll have to decide how much is too much. 4) Threads are not in any infinite loop. This requires solving the Halting Problem. If your management is demanding this feature, I suggest informing them that it is mathematically impossible. Just use top or ps. Don't reinvent the wheel. We've got a really good wheel. If you don't like top or ps as is, read the ps man page to see all the fancy formatting it can do, and parse it with a simple script in your favorite scripting language. -- Chris
[PATCH] Document non-semantics of atomic_read() and atomic_set()
From: Chris Snook [EMAIL PROTECTED]

Unambiguously document the fact that atomic_read() and atomic_set() do not imply any ordering or memory access, and that callers are obligated to explicitly invoke barriers as needed to ensure that changes to atomic variables are visible in all contexts that need to see them.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- a/Documentation/atomic_ops.txt	2007-07-08 19:32:17.0 -0400
+++ b/Documentation/atomic_ops.txt	2007-09-10 19:02:50.0 -0400
@@ -12,7 +12,11 @@
 C integer type will fail. Something like the following should suffice:
 
-	typedef struct { volatile int counter; } atomic_t;
+	typedef struct { int counter; } atomic_t;
+
+	Historically, counter has been declared volatile. This is now
+discouraged. See Documentation/volatile-considered-harmful.txt for the
+complete rationale.
 
 The first operations to implement for atomic_t's are the initializers
 and plain reads.
@@ -42,6 +46,22 @@
 which simply reads the current value of the counter.
 
+*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
+
+Some architectures may choose to use the volatile keyword, barriers, or
+inline assembly to guarantee some degree of immediacy for atomic_read()
+and atomic_set(). This is not uniformly guaranteed, and may change in
+the future, so all users of atomic_t should treat atomic_read() and
+atomic_set() as simple C assignment statements that may be reordered or
+optimized away entirely by the compiler or processor, and explicitly
+invoke the appropriate compiler and/or memory barrier for each use case.
+Failure to do so will result in code that may suddenly break when used with
+different architectures or compiler optimizations, or even changes in
+unrelated code which changes how the compiler optimizes the section
+accessing atomic_t variables.
+
+*** YOU HAVE BEEN WARNED! ***
+
 Now, we move onto the actual atomic operation interfaces.
void atomic_add(int i, atomic_t *v);
Re: [RFC 1/4] CONFIG_STABLE: Define it
Satyam Sharma wrote: [ Just cleaning up my inbox, and stumbled across this thread ... ] On 5/31/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Introduce CONFIG_STABLE to control checks only useful for development. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] [...] menu General setup +config STABLE + bool Stable kernel + help + If the kernel is configured to be a stable kernel then various + checks that are only of interest to kernel development will be + omitted. + A programmer who uses assertions during testing and turns them off during production is like a sailor who wears a life vest while drilling on shore and takes it off at sea. - Tony Hoare Probably you meant to turn off debug _output_ (and not _checks_) with this config option? But we already have CONFIG_FOO_DEBUG_BAR for those situations ... There are plenty of validation and debugging features in the kernel that go WAY beyond mere assertions, often imposing significant overhead (particularly when you scale up) or creating interfaces you'd never use unless you were doing kernel development work. You really do want these features completely removed from production kernels. The point of this is not to remove one-line WARN_ON and BUG_ON checks (though we might remove a few from fast paths), but rather to disable big chunks of debugging code that don't implement anything visible to a production workload. -- Chris
Re: [RFC 1/4] CONFIG_STABLE: Define it
Satyam Sharma wrote: On 7/20/07, Chris Snook [EMAIL PROTECTED] wrote: Satyam Sharma wrote: [ Just cleaning up my inbox, and stumbled across this thread ... ] On 5/31/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Introduce CONFIG_STABLE to control checks only useful for development. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] [...] menu General setup +config STABLE + bool Stable kernel + help + If the kernel is configured to be a stable kernel then various + checks that are only of interest to kernel development will be + omitted. + A programmer who uses assertions during testing and turns them off during production is like a sailor who wears a life vest while drilling on shore and takes it off at sea. - Tony Hoare Probably you meant to turn off debug _output_ (and not _checks_) with this config option? But we already have CONFIG_FOO_DEBUG_BAR for those situations ... There are plenty of validation and debugging features in the kernel that go WAY beyond mere assertions, often imposing significant overhead (particularly when you scale up) or creating interfaces you'd never use unless you were doing kernel development work. You really do want these features completely removed from production kernels. As for entire such development/debugging-related features, most (all, really) should anyway have their own config options. They do. With kconfig dependencies, we can ensure that those config options are off when CONFIG_STABLE is set. That way you only have to set one option to ensure that all these expensive checks are disabled. The point of this is not to remove one-line WARN_ON and BUG_ON checks (though we might remove a few from fast paths), but rather to disable big chunks of debugging code that don't implement anything visible to a production workload. Oh yes, but it's still not clear to me why or how a kernel-wide CONFIG_STABLE or CONFIG_RELEASE would help ... what's wrong with finer granularity CONFIG_xxx_DEBUG_xxx kind of knobs? 
With kconfig dependencies, we can keep the fine granularity, but not have to spend a half hour digging through the configuration to make sure we have a production-suitable kernel. -- Chris
Re: [RFC 1/4] CONFIG_STABLE: Define it
Satyam Sharma wrote: On 7/20/07, Chris Snook [EMAIL PROTECTED] wrote: Satyam Sharma wrote: On 7/20/07, Chris Snook [EMAIL PROTECTED] wrote: Satyam Sharma wrote: [ Just cleaning up my inbox, and stumbled across this thread ... ] On 5/31/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Introduce CONFIG_STABLE to control checks only useful for development. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] [...] menu General setup +config STABLE + bool Stable kernel + help + If the kernel is configured to be a stable kernel then various + checks that are only of interest to kernel development will be + omitted. + A programmer who uses assertions during testing and turns them off during production is like a sailor who wears a life vest while drilling on shore and takes it off at sea. - Tony Hoare Probably you meant to turn off debug _output_ (and not _checks_) with this config option? But we already have CONFIG_FOO_DEBUG_BAR for those situations ... There are plenty of validation and debugging features in the kernel that go WAY beyond mere assertions, often imposing significant overhead (particularly when you scale up) or creating interfaces you'd never use unless you were doing kernel development work. You really do want these features completely removed from production kernels. As for entire such development/debugging-related features, most (all, really) should anyway have their own config options. They do. With kconfig dependencies, we can ensure that those config options are off when CONFIG_STABLE is set. That way you only have to set one option to ensure that all these expensive checks are disabled. Oh, so you mean use this (the negation of this, actually) as a universal kconfig dependency of all other such development/debugging related stuff? Hmm, the name is quite misleading in that case. There are many different ways you can use it. 
If I'm writing a configurable feature, I could make it depend on !CONFIG_STABLE, or I could ifdef my code out if CONFIG_STABLE is set, unless a more granular option is also set. The maintainer of the code that uses the config option has a lot of flexibility, at least until we start enforcing standards. -- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: This patch extends CFS to achieve better fairness for SMPs. For example, with 10 tasks (same priority) on 8 CPUs, it enables each task to receive equal CPU time (80%). The code works on top of CFS and provides SMP fairness at a coarser time granularity; locally, on each CPU, it relies on CFS to provide fine-grained fairness and good interactivity. The code is based on the distributed weighted round-robin (DWRR) algorithm. It keeps two RB trees on each CPU: one is the original cfs_rq, referred to as active, and one is a new cfs_rq, called round-expired. Each CPU keeps a round number, initially zero. The scheduler works exactly the same way as in CFS, but only runs tasks from the active tree. Each task is assigned a round slice, equal to its weight times a system constant (e.g., 100ms), controlled by sysctl_base_round_slice. When a task uses up its round slice, it moves to the round-expired tree on the same CPU and stops running. Thus, at any time on each CPU, the active tree contains all tasks that are running in the current round, while tasks in round-expired have all finished the current round and await the start of the next round. When an active tree becomes empty, it calls idle_balance() to grab tasks of the same round from other CPUs. If none can be moved over, it switches its active and round-expired trees, thus unleashing round-expired tasks and advancing the local round number by one. An invariant it maintains is that the round numbers of any two CPUs in the system differ by at most one. This property ensures fairness across CPUs. The variable sysctl_base_round_slice controls fairness-performance tradeoffs: a smaller value leads to better cross-CPU fairness at the potential cost of performance; on the other hand, the larger the value is, the closer the system behavior is to the default CFS without the patch. Any comments and suggestions would be highly appreciated. This patch is massive overkill.
Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch? Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data. Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems. -- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Snook wrote: Tong Li wrote: This patch extends CFS to achieve better fairness for SMPs. For example, with 10 tasks (same priority) on 8 CPUs, it enables each task to receive equal CPU time (80%). The code works on top of CFS and provides SMP fairness at a coarser time granularity; locally, on each CPU, it relies on CFS to provide fine-grained fairness and good interactivity. The code is based on the distributed weighted round-robin (DWRR) algorithm. It keeps two RB trees on each CPU: one is the original cfs_rq, referred to as active, and one is a new cfs_rq, called round-expired. Each CPU keeps a round number, initially zero. The scheduler works exactly the same way as in CFS, but only runs tasks from the active tree. Each task is assigned a round slice, equal to its weight times a system constant (e.g., 100ms), controlled by sysctl_base_round_slice. When a task uses up its round slice, it moves to the round-expired tree on the same CPU and stops running. Thus, at any time on each CPU, the active tree contains all tasks that are running in the current round, while tasks in round-expired have all finished the current round and await the start of the next round. When an active tree becomes empty, it calls idle_balance() to grab tasks of the same round from other CPUs. If none can be moved over, it switches its active and round-expired trees, thus unleashing round-expired tasks and advancing the local round number by one. An invariant it maintains is that the round numbers of any two CPUs in the system differ by at most one. This property ensures fairness across CPUs. The variable sysctl_base_round_slice controls fairness-performance tradeoffs: a smaller value leads to better cross-CPU fairness at the potential cost of performance; on the other hand, the larger the value is, the closer the system behavior is to the default CFS without the patch. Any comments and suggestions would be highly appreciated. This patch is massive overkill.
Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch? Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data. Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems. -- Chris To clarify, I'm not suggesting that the balance with cpu (x+1)%n only algorithm is the only way to do this. Rather, I'm pointing out that even an extremely simple algorithm can give you fair loading when you already have CFS managing the runqueues. There are countless more sophisticated ways we could do this without using global locking, or possibly without any locking at all, other than the locking we already use during migration. -- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: On Mon, 23 Jul 2007, Chris Snook wrote: This patch is massive overkill. Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch? Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data. Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems. Chris, These are all good comments. Thanks. I see three concerns and I'll try to address each. 1. Unjustified effort/cost My view is that fairness (or proportional fairness) is a first-order metric and necessary in many cases even at the cost of performance. In the cases where it's critical, we have realtime. In the cases where it's important, this implementation won't keep latency low enough to make people happier. If you've got a test case to prove me wrong, I'd like to see it. A server running multiple client apps certainly doesn't want the clients to see that they are getting different amounts of service, assuming the clients are of equal importance (priority). A conventional server receives client requests, does a brief amount of work, and then gives a response. 
This patch doesn't help that workload. This patch helps the case where you've got batch jobs running on a slightly overloaded compute server, and unfairness means you end up waiting for a couple threads to finish at the end while CPUs sit idle. I don't think it's that big of a problem, and if it is, I think we can solve it in a more elegant way than reintroducing expired queues. When the clients have different priorities, the server also wants to give them service time proportional to their priority/weight. The same is true for desktops, where users want to nice tasks and see an effect that's consistent with what they expect, i.e., task CPU time should be proportional to their nice values. The point is that it's important to enforce fairness because it enables users to control the system in a deterministic way and it helps each task get good response time. CFS achieves this on local CPUs and this patch makes the support stronger for SMPs. It's overkill to enforce unnecessary degree of fairness, but it is necessary to enforce an error bound, even if large, such that the user can reliably know what kind of CPU time (even performance) he'd get after making a nice value change. Doesn't CFS already do this? This patch ensures an error bound of (max task weight currently in system) * sysctl_base_round_slice compared to an idealized fair system. The thing that bugs me about this is the diminishing returns. It looks like it will only give a substantial benefit when system load is somewhere between 1.0 and 2.0. On a heavily-loaded system, CFS will do the right thing within a good margin of error, and on an underloaded system, even a naive scheduler will do the right thing. If you want to optimize smp fairness in this range, that's great, but there's probably a lighter-weight way to do it. 2. High performance overhead Two sources of overhead: (1) the global rw_lock, and (2) task migrations. I agree they can be problems on NUMA, but I'd argue they are not on SMPs. 
Any global lock can cause two performance problems: (1) serialization, and (2) excessive remote cache accesses and traffic. IMO (1) is not a problem since this is a rw_lock and a write_lock occurs infrequently only when all tasks in the system finish the current round. (2) could be a problem as every read/write lock causes an invalidation. It could be improved by using Nick's ticket lock. On the other hand, this is a single cache line and it's invalidated only when a CPU finishes all tasks in its local active RB tree, where each nice 0 task takes sysctl_base_round_slice (e.g., 30ms) to finish, so it looks to me the invalidations would be infrequent enough and could be noise in the whole system. Task migrations don't bother me all that much. Since we're migrating the *overload*, I expect those processes to be fairly cache-cold whenever we get around to them anyway. It'd be nice to be SMT/multicore
Re: miserable performance of 2.6.21 under network load
Aaron Porter wrote: I'm in the process of upgrading a pool of apache servers from 2.6.17.8 to 2.6.21.5, and we're seeing a pretty major change in behavior. Under identical network load, 2.6.21 has a load average more than 3 times higher, cpu 0 spends well over 90% of its time in interrupts (vs ~30% under 2.6.17). When we hit 3k apache sessions, ksoftirqd eats 100% of cpu0 and our network traffic drops off rapidly. The end result is that 2.6.17 performs twice as well under this load. Is it always CPU 0, or does it move? Are you running irqbalance? If you're running irqbalance, you can run a script that alternates between 'cat /proc/interrupts' and 'mpstat -P ALL 5 10' and watch the offending interrupt jump around between processors. It's not as informative as oprofile, as Andi suggested, but it's really easy to set up. -- Chris
Re: miserable performance of 2.6.21 under network load
Aaron Porter wrote: On Tue, Jul 24, 2007 at 08:48:00PM +0200, Andi Kleen wrote: Aaron Porter [EMAIL PROTECTED] writes: I'm in the process of upgrading a pool of apache servers from 2.6.17.8 to 2.6.21.5, and we're seeing a pretty major change in behavior. Under identical network load, 2.6.21 has a load average more than 3 times higher, cpu 0 spends well over 90% of its time in interrupts (vs ~30% under 2.6.17). When we hit 3k apache sessions, ksoftirqd eats 100% of cpu0 and our network traffic drops off rapidly. The end result is that 2.6.17 performs twice as well under this load. Can you oprofile it?

# opreport -l
CPU: AMD64 processors, speed 1994.52 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 10
samples  %        app name          symbol name
914379   48.8404  vmlinux-2.6.21.5  check_poison_obj
341920   18.2632  vmlinux-2.6.21.5  poison_obj

I bet you have CONFIG_DEBUG_SLAB turned off in your 2.6.17 kernel, and turned on in your 2.6.21 kernel. -- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Friesen wrote: Chris Snook wrote: Concerns aside, I agree that fairness is important, and I'd really like to see a test case that demonstrates the problem. One place that might be useful is the case of fairness between resource groups, where the load balancer needs to consider each group separately. You mean like the CFS group scheduler patches? I don't see how this patch is related to that, besides working on top of it. Now it may be the case that trying to keep the load of each class within X% of the other cpus is sufficient, but it's not trivial. I agree. My suggestion is that we try being fair from the bottom-up, rather than top-down. If most of the rebalancing is local, we can minimize expensive locking and cross-node migrations, and scale very nicely on large NUMA boxes. Consider the case where you have a resource group that is allocated 50% of each cpu in a dual cpu system, and only have a single task in that group. This means that in order to make use of the full group allocation, that task needs to be load-balanced to the other cpu as soon as it gets scheduled out. Most load-balancers can't handle that kind of granularity, but I have guys in our engineering team that would really like this level of performance. Divining the intentions of the administrator is an AI-complete problem and we're not going to try to solve that in the kernel. An intelligent administrator could also allocate 50% of each CPU to a resource group containing all the *other* processes. Then, when the other processes are scheduled out, your single task will run on whichever CPU is idle. This will very quickly equilibrate to the scheduling ping-pong you seem to want. The scheduler deliberately avoids this kind of migration by default because it hurts cache and TLB performance, so if you want to override this very sane default behavior, you're going to have to explicitly configure it yourself. 
We currently use CKRM on an SMP machine, but the only way we can get away with it is because our main app is affined to one cpu and just about everything else is affined to the other. If you're not explicitly allocating resources, you're just low-latency, not truly realtime. Realtime requires guaranteed resources, so messing with affinities is a necessary evil. We have another SMP box that would benefit from group scheduling, but we can't use it because the load balancer is not nearly good enough. Which scheduler? Have you tried the CFS group scheduler patches? -- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Li, Tong N wrote: On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote: Divining the intentions of the administrator is an AI-complete problem and we're not going to try to solve that in the kernel. An intelligent administrator could also allocate 50% of each CPU to a resource group containing all the *other* processes. Then, when the other processes are scheduled out, your single task will run on whichever CPU is idle. This will very quickly equilibrate to the scheduling ping-pong you seem to want. The scheduler deliberately avoids this kind of migration by default because it hurts cache and TLB performance, so if you want to override this very sane default behavior, you're going to have to explicitly configure it yourself. Well, the admin wouldn't specifically ask for 50% of each CPU. He would just allocate 50% of total CPU time---it's up to the scheduler to fulfill that. If a task is entitled to one CPU, then it'll stay there and have no migration. Migration occurs only if there's overload, in which case I think you agree in your last email that the cache and TLB impact is not an issue (at least in SMP). I don't think Chris's scenario has much bearing on your patch. What he wants is to have a task that will always be running, but can't monopolize either CPU. This is useful for certain realtime workloads, but as I've said before, realtime requires explicit resource allocation. I don't think this is very relevant to SCHED_FAIR balancing. -- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Bill Huey (hui) wrote: On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote: Chris Friesen wrote: We currently use CKRM on an SMP machine, but the only way we can get away with it is because our main app is affined to one cpu and just about everything else is affined to the other. If you're not explicitly allocating resources, you're just low-latency, not truly realtime. Realtime requires guaranteed resources, so messing with affinities is a necessary evil. You've mentioned this twice in this thread. If you're going to talk about this you should characterize this more specifically because resource allocation is a rather incomplete area in the Linux. Well, you need enough CPU time to meet your deadlines. You need pre-allocated memory, or to be able to guarantee that you can allocate memory fast enough to meet your deadlines. This principle extends to any other shared resource, such as disk or network. I'm being vague because it's open-ended. If a medical device fails to meet realtime guarantees because the battery fails, the patient's family isn't going to care how correct the software is. Realtime engineering is hard. Rebalancing is still an open research problem the last time I looked. Actually, it's worse than merely an open problem. A clairvoyant fair scheduler with perfect future knowledge can underperform a heuristic fair scheduler, because the heuristic scheduler can guess the future incorrectly resulting in unfair but higher-throughput behavior. This is a perfect example of why we only try to be as fair as is beneficial. Tong's previous trio patch is an attempt at resolving this using a generic grouping mechanism and some constructive discussion should come of it. Sure, but it seems to me to be largely orthogonal to this patch. 
-- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Friesen wrote: Chris Snook wrote: I don't think Chris's scenario has much bearing on your patch. What he wants is to have a task that will always be running, but can't monopolize either CPU. This is useful for certain realtime workloads, but as I've said before, realtime requires explicit resource allocation. I don't think this is very relevant to SCHED_FAIR balancing. I'm not actually using the scenario I described, its just sort of a worst-case load-balancing thought experiment. What we want to be able to do is to specify a fraction of each cpu for each task group. We don't want to have to affine tasks to particular cpus. A fraction of *each* CPU, or a fraction of *total* CPU? Per-cpu granularity doesn't make anything more fair. You've got a big bucket of MIPS you want to divide between certain groups, but it shouldn't make a difference which CPUs those MIPS come from, other than the fact that we try to minimize overhead induced by migration. This means that the load balancer must be group-aware, and must trigger a re-balance (possibly just for a particular group) as soon as the cpu allocation for that group is used up on a particular cpu. If I have two threads with the same priority, and two CPUs, the scheduler will put one on each CPU, and they'll run happily without any migration or balancing. It sounds like you're saying that every X milliseconds, you want both to expire, be forbidden from running on the current CPU for the next X milliseconds, and then migrated to the other CPU. There's no gain in fairness here, and there's a big drop in performance. I suggested local fairness as a means to achieve global fairness because it could reduce overhead, and by adding the margin of error at each level in the locality hierarchy, you can get an algorithm which naturally tolerates the level of unfairness beyond which it is impossible to optimize. Strict local fairness for its own sake doesn't accomplish anything that's better than global fairness. 
-- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Friesen wrote: Ingo Molnar wrote: the 3s is the problem: change that to 60s! We no way want to over-migrate for SMP fairness, the change i did gives us reasonable long-term SMP fairness without the need for high-rate rebalancing. Actually, I do have requirements from our engineering guys for short-term fairness. They'd actually like decent fairness over even shorter intervals...1 second would be nice, 2 is acceptable. They are willing to trade off random peak performance for predictability. Chris The sysctls for CFS have nanosecond resolution. They default to millisecond-order values, but you can set them much lower. See sched_fair.c for the knobs and their explanations. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Li, Tong N wrote: On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote: Chris Friesen wrote: Ingo Molnar wrote: the 3s is the problem: change that to 60s! We no way want to over-migrate for SMP fairness, the change i did gives us reasonable long-term SMP fairness without the need for high-rate rebalancing. Actually, I do have requirements from our engineering guys for short-term fairness. They'd actually like decent fairness over even shorter intervals...1 second would be nice, 2 is acceptable. They are willing to trade off random peak performance for predictability. Chris The sysctls for CFS have nanosecond resolution. They default to millisecond-order values, but you can set them much lower. See sched_fair.c for the knobs and their explanations. -- Chris This is incorrect. Those knobs control local-CPU fairness granularity but have no control over fairness across CPUs. I'll do some benchmarking as Ingo suggested. tong CFS naturally enforces cross-CPU fairness anyway, as Ingo demonstrated. Lowering the local CPU parameters should cause system-wide fairness to converge faster. It might be worthwhile to create a more explicit knob for this, but I'm inclined to believe we could do it in much less than 700 lines. Ingo's one-liner to improve the 10/8 balancing case, and the resulting improvement, were exactly what I was saying should be possible and desirable. TCP Nagle aside, it generally shouldn't take 700 lines of code to speed up the rate of convergence of something that already converges. Until now I've been watching the scheduler rewrite from the sidelines, but I'm digging into it now. I'll try to give some more constructive criticism soon. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: I'd like to clarify that I'm not trying to push this particular code to the kernel. I'm a researcher. My intent was to point out that we have a problem in the scheduler and my dwrr algorithm can potentially help fix it. The patch itself was merely a proof-of-concept. I'd be thrilled if the algorithm can be proven useful in the real world. I appreciate the people who have given me comments. Since then, I've revised my algorithm/code. Now it doesn't require global locking but retains strong fairness properties (which I was able to prove mathematically). Thanks for doing this work. Please don't take the implementation criticism as a lack of appreciation for the work. I'd like to see dwrr in the scheduler, but I'm skeptical that re-introducing expired runqueues is the most efficient way to do it. Given the inherently controversial nature of scheduler code, particularly that which attempts to enforce fairness, perhaps a concise design document would help us come to an agreement about what we think the scheduler should do and what tradeoffs we're willing to make to do those things. Do you have a design document we could discuss? -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: On Fri, 27 Jul 2007, Chris Snook wrote: Tong Li wrote: I'd like to clarify that I'm not trying to push this particular code to the kernel. I'm a researcher. My intent was to point out that we have a problem in the scheduler and my dwrr algorithm can potentially help fix it. The patch itself was merely a proof-of-concept. I'd be thrilled if the algorithm can be proven useful in the real world. I appreciate the people who have given me comments. Since then, I've revised my algorithm/code. Now it doesn't require global locking but retains strong fairness properties (which I was able to prove mathematically). Thanks for doing this work. Please don't take the implementation criticism as a lack of appreciation for the work. I'd like to see dwrr in the scheduler, but I'm skeptical that re-introducing expired runqueues is the most efficient way to do it. Given the inherently controversial nature of scheduler code, particularly that which attempts to enforce fairness, perhaps a concise design document would help us come to an agreement about what we think the scheduler should do and what tradeoffs we're willing to make to do those things. Do you have a design document we could discuss? -- Chris Thanks for the interest. Attached is a design doc I wrote several months ago (with small modifications). It talks about the two pieces of my design: group scheduling and dwrr. The description was based on the original O(1) scheduler, but as my CFS patch showed, the algorithm is applicable to other underlying schedulers as well. It's interesting that I started working on this in January for the purpose of eventually writing a paper about it. So I knew reasonably well the related research work but was totally unaware that people in the Linux community were also working on similar things. This is good. If you are interested, I'd like to help with the algorithms and theory side of the things. 
tong --- Overview: Trio extends the existing Linux scheduler with support for proportional-share scheduling. It uses a scheduling algorithm, called Distributed Weighted Round-Robin (DWRR), which retains the existing scheduler design as much as possible, and extends it to achieve proportional fairness with O(1) time complexity and a constant error bound, compared to the ideal fair scheduling algorithm. The goal of Trio is not to improve interactive performance; rather, it relies on the existing scheduler for interactivity and extends it to support MP proportional fairness. Trio has two unique features: (1) it enables users to control shares of CPU time for any thread or group of threads (e.g., a process, an application, etc.), and (2) it enables fair sharing of CPU time across multiple CPUs. For example, with ten tasks running on eight CPUs, Trio allows each task to take an equal fraction of the total CPU time. These features enable Trio to complement the existing Linux scheduler to enable greater user flexibility and stronger fairness. Background: Over the years, there has been a lot of criticism that conventional Unix priorities and the nice interface provide insufficient support for users to accurately control CPU shares of different threads or applications. Many have studied scheduling algorithms that achieve proportional fairness. Assuming that each thread has a weight that expresses its desired CPU share, informally, a scheduler is proportionally fair if (1) it is work-conserving, and (2) it allocates CPU time to threads in exact proportion to their weights in any time interval. Ideal proportional fairness is impractical since it requires that all runnable threads be running simultaneously and scheduled with infinitesimally small quanta. In practice, every proportional-share scheduling algorithm approximates the ideal algorithm with the goal of achieving a constant error bound. 
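Stated a bit more formally (this is a standard formulation from the fair-queueing literature, not quoted from the Trio document): let $S_i(t_1, t_2)$ be the CPU time thread $i$ receives during $[t_1, t_2]$, $w_i$ its weight, $R$ the set of runnable threads, and $P$ the number of CPUs. A proportional-share scheduler with a constant error bound guarantees, for any interval in which $i$ is continuously runnable,

\[
\left|\, S_i(t_1, t_2) \;-\; \frac{w_i}{\sum_{j \in R} w_j} \, P \, (t_2 - t_1) \right| \;\le\; B
\]

where the "constant" claim means $B$ is independent of the number of threads and of the interval length.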
For more theoretical background, please refer to the following papers: I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more free cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration. [1] A. K
Re: Volanomark slows by 80% under CFS
Tim Chen wrote: Ingo, Volanomark slows by 80% with the CFS scheduler on 2.6.23-rc1. The benchmark was run on a 2-socket Core2 machine. The change in the scheduler's treatment of sched_yield could play a part in changing Volanomark behavior. In CFS, sched_yield is implemented by dequeueing and requeueing a process. The time a process has spent running probably reduces the CPU time due to it by only a bit, so the process can get re-queued pretty close to the head of the queue, and may get scheduled again pretty quickly if there is still a lot of CPU time due to it. It may make sense to queue the yielding process a bit further back in the queue. I made a slight change by zeroing out wait_runtime (i.e. having the process give up the CPU time due to it) for experimentation. Let's put aside gripes that Volanomark should have used a better mechanism than sched_yield to coordinate threads for a second. With this change, Volanomark runs better and is only 40% (instead of 80%) down from the old scheduler. Of course we should not tune for Volanomark, and this is reference data. What are your views on how CFS's sched_yield should behave? Regards, Tim

The primary purpose of sched_yield is for SCHED_FIFO realtime processes, where nothing else will run, ever, unless the running thread blocks or yields the CPU. Under CFS, the yielding process will still be leftmost in the rbtree; otherwise it would have already been scheduled out. Zeroing out wait_runtime on sched_yield strikes me as completely appropriate. If the process wanted to sleep a finite duration, it should actually call a sleep function, but sched_yield is essentially saying "I don't have anything else to do right now", so it's hardly fair to claim you've been waiting for your chance when you just gave it up. As for the remaining 40% degradation, if Volanomark is using sched_yield for synchronization, the scheduler is probably cycling through threads until it gets to the one that actually wants to do work.
The O(1) scheduler will do this very quickly, whereas CFS has a bit more overhead. Interactivity boosting may have also helped the old scheduler find the right thread faster. I think Volanomark is being pretty stupid, and deserves to run slowly, but there are legitimate reasons to want to call sched_yield in a non-SCHED_FIFO process. If I'm performing multiple different calculations on the same set of data in multiple threads, and accessing the shared data in a linear fashion, I'd like to be able to have one thread give the other some CPU time so they can stay at the same point in the stream and improve cache hit rates, but this is only an optimization if I can do it without wasting CPU or gradually nicing myself into oblivion. Having sched_yield zero out wait_runtime seems like an appropriate way to make this use case work to the extent possible. Any user attempting such an optimization should have the good sense to do real work between sched_yield calls, to avoid calling the scheduler in a tight loop. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
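The cache-friendly pipelining pattern described above can be sketched as follows. This is an illustrative sketch, not code from the thread; the chunk sizes and worker function are invented, and it is shown single-threaded so the structure is clear (in the real use case a sibling thread would run a different calculation over the same stream).

```c
#include <sched.h>

#define CHUNKS 4
#define CHUNK_SIZE 1024

/* Shared read-only data stream; in the real use case, multiple
 * threads walk it in parallel, each performing a different
 * calculation on the same data. */
static int data[CHUNKS * CHUNK_SIZE];

static void init_stream(int val)
{
    for (int i = 0; i < CHUNKS * CHUNK_SIZE; i++)
        data[i] = val;
}

/* Process the stream one chunk at a time, yielding between chunks so
 * a sibling thread can catch up to the same point in the stream while
 * the chunk is still cache-hot.  Real work happens between the
 * sched_yield() calls, so the scheduler is not hammered in a tight
 * loop. */
static long process_stream(int (*op)(int))
{
    long sum = 0;
    for (int c = 0; c < CHUNKS; c++) {
        for (int i = 0; i < CHUNK_SIZE; i++)
            sum += op(data[c * CHUNK_SIZE + i]);
        sched_yield();  /* offer the CPU to the sibling thread */
    }
    return sum;
}

static int dbl(int x) { return 2 * x; }
```

Whether this is a win depends entirely on sched_yield not penalizing the caller beyond forfeiting its current lead, which is exactly the wait_runtime question at issue.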
Re: swap-prefetch: A smart way to make good use of idle resources (was: updatedb)
Al Boldi wrote: People wrote: I believe the users who say their apps really do get paged back in though, so suspect that's not the case. Stopping the bush-circumference beating, I do not. -ck (and gentoo) have this massive Calimero thing going among their users where people are much less interested in technology than in how the nasty big kernel meanies are keeping them down (*). I think the problem is elsewhere. Users don't say: My apps get paged back in. They say: My system is more responsive. They really don't care *why* the reaction to a mouse click that takes three seconds with a mainline kernel is instantaneous with -ck. Nasty big kernel meanies, OTOH, want to understand *why* a patch helps in order to decide whether it is really a good idea to merge it. So you've got a bunch of patches (aka -ck) which visibly improve the overall responsiveness of a desktop system, but apparently no one can conclusively explain why or how they achieve that, and therefore they cannot be merged into mainline. I don't have a solution to that dilemma either. IMHO, what everybody agrees on, is that swap-prefetch has a positive effect in some cases, and nobody can prove an adverse effect (excluding power consumption). The reason for this positive effect is also crystal clear: It prefetches from swap on idle into free memory, ie: it doesn't force anybody out, and they are the first to be dropped without further swap-out, which sounds really smart. Conclusion: Either prove swap-prefetch is broken, or get this merged quick. If you can't prove why it helps and doesn't hurt, then it's a hack, by definition. Behind any performance hack is some fundamental truth that can be exploited to greater effect if we reason about it. So let's reason about it. I'll start. Resource size has been outpacing processing latency since the dawn of time. Disks get bigger much faster than seek times shrink. Main memory and cache keep growing, while single-threaded processing speed has nearly ground to a halt. 
In the old days, it made lots of sense to manage resource allocation in pages and blocks. In the past few years, we started reserving blocks in ext3 automatically because it saves more in seek time than it costs in disk space. Now we're taking preallocation and antifragmentation to the next level with extent-based allocation in ext4. Well, we're still using bitmap-style allocation for pages, and the prefetch-less swap mechanism adheres to this design as well. Maybe it's time to start thinking about memory in a somewhat more extent-like fashion.

With swap prefetch, we're only optimizing the case when the box isn't loaded and there's RAM free, but we're not optimizing the case when the box is heavily loaded and we need RAM to be free. This is a complete reversal of sane development priorities. If swap batching is an optimization at all (and we have empirical evidence that it is) then it should also be an optimization to swap out chunks of pages when we need to free memory.

So, how do we go about this grouping? I suggest that if we keep per-VMA reference/fault/dirty statistics, we can tell which logically distinct chunks of memory are being regularly used. This would also allow us to apply different page replacement policies to chunks of memory that are being used in different fashions. With such statistics, we could then page out VMAs in 2MB chunks when we're under memory pressure, also giving us the option of transparently paging them back in to hugepages when we have the memory free, once anonymous hugepage support is in place.

I'm inclined to view swap prefetch as a successful scientific experiment, and use that data to inform a more reasoned engineering effort. If we can design something intelligent which happens to behave more or less like swap prefetch does under the circumstances where swap prefetch helps, and does something else smart under the circumstances where swap prefetch makes no discernible difference, it'll be a much bigger improvement.
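As a toy model of what per-VMA statistics might enable: every name, field, and weighting below is hypothetical, invented purely to illustrate the shape of the idea, not taken from the kernel.

```c
/* Hypothetical per-VMA usage counters.  In a real implementation
 * these would live in the VMA and be updated on reference, fault,
 * and dirtying events. */
struct vma_stats {
    unsigned long nr_refs;    /* recent reference count */
    unsigned long nr_faults;  /* faults taken in this VMA */
    unsigned long nr_dirty;   /* pages dirtied recently */
};

/* Score a VMA's "hotness".  The weights are arbitrary placeholders:
 * dirty pages are the most expensive to evict (they must be written
 * out), plain references the cheapest signal. */
static unsigned long vma_hotness(const struct vma_stats *s)
{
    return s->nr_refs + 2 * s->nr_faults + 4 * s->nr_dirty;
}

/* Pick the coldest VMA as the next candidate chunk to page out
 * under memory pressure. */
static int pick_eviction_victim(const struct vma_stats v[], int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++)
        if (vma_hotness(&v[i]) < vma_hotness(&v[victim]))
            victim = i;
    return victim;
}
```

The point is only that once statistics exist at VMA granularity, the replacement policy can operate on logically distinct chunks rather than individual pages.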
Because we cannot prove why the existing patch helps, we cannot say what impact it will have when things like virtualization and solid state drives radically change the coefficients of the equation we have not solved. Providing a sysctl to turn off a misbehaving feature is a poor substitute for doing it right the first time, and leaving it off by default will ensure that it only gets used by the handful of people who know enough to rebuild with the patch anyway. Let's talk about how we can make page replacement smarter, so it naturally accomplishes what swap prefetch accomplishes, as part of a design we can reason about. CC-ing linux-mm, since that's where I think we should take this next. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at
Re: [RFC] scheduler: improve SMP fairness in CFS
Bill Huey (hui) wrote: On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote: I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more free cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration. You have to consider the target for this kind of code. There are applications where you need something that falls within a constant error bound. According to the numbers, the current CFS rebalancing logic doesn't achieve that to any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more strict than that. I've said from the beginning that I think that anyone who desperately needs perfect fairness should be explicitly enforcing it with the aid of realtime priorities. The problem is that configuring and tuning a realtime application is a pain, and people want to be able to approximate this behavior without doing a whole lot of dirty work themselves. I believe that CFS can and should be enhanced to ensure SMP-fairness over potentially short, user-configurable intervals, even for SCHED_OTHER. 
I do not, however, believe that we should take it to the extreme of wasting CPU cycles on migrations that will not improve performance for *any* task, just to avoid letting some tasks get ahead of others. We should be as fair as possible but no fairer. If we've already made it as fair as possible, we should account for the margin of error and correct for it the next time we rebalance. We should not burn the surplus just to get rid of it. On a non-NUMA box with single-socket, non-SMT processors, a constant error bound is fine. Once we add SMT, go multi-core, go NUMA, and add inter-chassis interconnects on top of that, we need to multiply this error bound at each stage in the hierarchy, or else we'll end up wasting CPU cycles on migrations that actually hurt the processes they're supposed to be helping, and hurt everyone else even more. I believe we should enforce an error bound that is proportional to migration cost.

Even the rt overload code (from my memory) is subject to these limitations as well until it's moved to use a single global queue while using CPU binding to turn off that logic. It's the price you pay for accuracy.

If we allow a little short-term unfairness (and I think we should) we can still account for this unfairness and compensate for it (again, with the same tolerance) at the next rebalancing.

Again, it's a function of *when* and depends on that application. Adding system calls, while great for research, is not something which is done lightly in the published kernel. If we're going to implement a user interface beyond simply interpreting existing priorities more precisely, it would be nice if this was part of a framework with a broader vision, such as a scheduler economy.

I'm not sure what you mean by scheduler economy, but CFS can and should be extended to handle proportional scheduling which is outside of the traditional Unix priority semantics.
Having a new API to get at this is unavoidable if you want it to eventually support -rt oriented applications that have bandwidth semantics.

A scheduler economy is basically a credit scheduler, augmented to allow processes to exchange credits with each other. If you want to get more sophisticated with fairness, you could price CPU time proportional to load on that CPU. I've been house-hunting lately, so I like to think of it in real estate terms. If you're comfortable with your standard of living and you have enough money, you can rent the apartment in the chic part of town, right next to the subway station. If you want to be more frugal because you're saving for retirement, you can get a place out in the suburbs, but the commute will be more of a pain. If you can't make up your mind and keep moving back and forth, you spend a lot on moving and all your stuff gets dented and scratched.

All deadline-based schedulers have API mechanisms like this to support extended semantics. This is no different.

I had a feeling this patch was originally designed for the O(1) scheduler, and this is why. The old scheduler had expired arrays, so adding a round-expired
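A toy model of the credit-scheduler idea, with the load-proportional pricing mentioned above, might look like the following; the names, the refill scheme, and the pricing rule are all invented for illustration, not taken from any real scheduler.

```c
struct account {
    long credits;  /* spendable CPU credits */
    long weight;   /* refill amount per accounting period */
};

/* Pick the runnable task with the most credits; ties go to the
 * lowest index. */
static int pick_next(struct account acct[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (acct[i].credits > acct[best].credits)
            best = i;
    return best;
}

/* Charge the running task for one tick.  The price rises with the
 * load on this CPU, so time on a busy CPU costs more -- the "chic
 * apartment next to the subway station". */
static void charge(struct account *a, long load)
{
    a->credits -= 1 + load;
}

/* Periodically top accounts up in proportion to their weights;
 * exchanging credits between accounts is what turns this into an
 * "economy" rather than a plain credit scheduler. */
static void refill(struct account acct[], int n)
{
    for (int i = 0; i < n; i++)
        acct[i].credits += acct[i].weight;
}
```

With weights 2:1 and a uniform price, the heavier account naturally receives twice the ticks per refill period, which is the proportional-share behavior under discussion.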
pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS)
Andrea Arcangeli wrote: On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote: I think Volanomark is being pretty stupid, and deserves to run slowly, but Indeed, any app doing what volanomark does is pretty inefficient. But this is not the point. I/O schedulers are pluggable to help for inefficient apps too. If apps would be extremely smart they would all use async-io for their reads, and there wouldn't be the need of anticipatory scheduler just for an example. I'm pretty sure the point of posting a patch that triples CFS performance on a certain benchmark and arguably improves the semantics of sched_yield was to improve CFS. You have a point, but it is a point for a different thread. I have taken the liberty of starting this thread for you. The fact is there's no technical explanation for which we're forbidden to be able to choose between CFS and O(1) at least at boot time. Sure there is. We can run a fully-functional POSIX OS without using any block devices at all. We cannot run a fully-functional POSIX OS without a scheduler. Any feature without which the OS cannot execute userspace code is sufficiently primitive that somewhere there is a device on which it will be impossible to debug if that feature fails to initialize. It is quite reasonable to insist on only having one implementation of such features in any given kernel build. Whether or not these alternatives belong in the source tree as config-time options is a political question, but preserving boot-time debugging capability is a perfectly reasonable technical motivation. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 11/23] make atomic_read() and atomic_set() behavior consistent on m32r
Hirokazu Takata wrote: I think the parameter of atomic_read() should have a const qualifier to avoid these warnings, and IMHO this modification might be worth applying on other archs.

I agree.

Here is an additional patch to revise the previous one for m32r.

I'll incorporate this change if we get enough consensus to justify a re-re-re-submit. Since the patch is intended to be a functional no-op on m32r, I'm inclined to leave it alone at the moment.

I also tried to rewrite it with inline asm code, but the kernel text size became roughly 2kB larger. So I prefer the C version.

You're not the only arch maintainer who prefers doing it in C. If you trust your compiler (a big if, apparently), inline asm only improves code generation if you have a whole bunch of general-purpose registers for the optimizer to play with. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fork Bombing Patch
Krzysztof Halasa wrote: Hi, Anand Jahagirdar [EMAIL PROTECTED] writes: I am forwarding one more improved patch which I have modified as per your suggestions. Instead of KERN_INFO I have used KERN_NOTICE, and I have added one more if block to check the hard limit. How good is it? Not very; it still lacks #ifdef CONFIG_something and the required Kconfig change (or other runtime thing defaulting to no printk).

Wrapping a single printk that's unrelated to debugging in an #ifdef CONFIG_* or a sysctl strikes me as abuse of those configuration facilities. Where would we draw the line for other patches wanting to do similar things? I realized that even checking the hard limit is insufficient, because that can be lowered (but not raised) by unprivileged processes. If we can't do this unconditionally (and we can't, because the log pollution would be intolerable for many people) then we shouldn't do it at all.

Anand --

I appreciate the effort, but I think you should reconsider precisely what problem you're trying to solve here. This approach can't tell the difference between legitimate self-regulation of resource utilization and a real attack. Worse, in the event of a real attack, it could be used to make it more difficult for the administrator to notice something much more serious than a forkbomb. I suspect that userspace might be a better place to solve this problem. You could run your monitoring app with elevated or even realtime priority to ensure it will still function, and you have much more freedom in making the reporting configurable. You can also look at much more data than we could ever allow in fork.c, and possibly detect attacks that this patch would miss if a clever attacker stayed just below the limit. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()
Denys Vlasenko wrote: On Friday 24 August 2007 18:06, Christoph Lameter wrote: On Fri, 24 Aug 2007, Satyam Sharma wrote: But if people do seem to have a mixed / confused notion of atomicity and barriers, and if there's consensus, then as I'd said earlier, I have no issues in going with the consensus (eg. having API variants). Linus would be more difficult to convince, however, I suspect :-)

The confusion may be the result of us having barrier semantics in atomic_read. If we take that out then we may avoid future confusions.

I think a better name may help. Nuke atomic_read() altogether.

n = atomic_value(x);         // doesn't hint as strongly at reading as atomic_read
n = atomic_fetch(x);         // yes, we _do_ touch RAM
n = atomic_read_uncached(x); // or this

How does that sound?

atomic_value() vs. atomic_fetch() should be rather unambiguous. atomic_read_uncached() begs the question of precisely which cache we are avoiding, and could itself cause confusion. So, if I were writing atomic.h from scratch, knowing what I know now, I think I'd use atomic_value() and atomic_fetch(). The problem is that there are a lot of existing users of atomic_read(), and we can't write a script to correctly guess their intent. I'm not sure auditing all uses of atomic_read() is really worth the comparatively minuscule benefits. We could play it safe and convert them all to atomic_fetch(), or we could acknowledge that changing the semantics 8 months ago was not at all disastrous, and make them all atomic_value(), allowing people to use atomic_fetch() where they really care. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
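The distinction being named here is concrete in C: a plain read may be cached in a register by the compiler, while a volatile-cast read forces a fresh load from memory on every call (with no memory barrier implied either way). The sketch below illustrates the two flavors; the names and helper form are chosen for this example, not taken from any particular arch's atomic.h.

```c
typedef struct { int counter; } atomic_t;

/* Plain read: the compiler is free to cache the value in a register,
 * e.g. hoisting it out of a polling loop. */
static inline int atomic_value(const atomic_t *v)
{
    return v->counter;
}

/* Volatile-cast read: every call performs an actual load from
 * memory ("yes, we _do_ touch RAM"), but this is not a barrier and
 * orders nothing else. */
static inline int atomic_fetch_(const atomic_t *v)
{
    return *(volatile const int *)&v->counter;
}
```

In single-threaded code both return the same value; the difference only shows up in what the optimizer is permitted to do when another thread or an interrupt handler is modifying the counter concurrently.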
Re: Fork Bombing Patch
Anand Jahagirdar wrote: Hi. Consider a case: a non-root user requests a higher process limit from the admin than the root user has, so the admin needs to modify the settings in /etc/security/limits.conf; if that user is not trustworthy and launches a fork-bombing attack, it will kill the box.

If root is dumb enough to give the user whatever privileges they ask for, fork-bombing is the least of your problems.

(I have already tried this attack.) In that case this loop will work, but by then the attack might have killed the box (because so many processes have already been created), so the admin won't come to know what has happened.

On large multi-user SMP systems, the default ulimits will keep the box responsive, if sluggish. Perhaps you should file a bug with your distribution if you believe the default settings in limits.conf are too high. There's no way to algorithmically distinguish a forkbomb from a legitimate highly-threaded workload.

Like this there are many cases... (actually these cases were discussed on LKML two months ago in my thread named "fork bombing attack"). In all these cases this printk helps the administrator a lot.

What exactly does this patch help the administrator do? If a box is thrashing, you still have sysrq. You can also use cpusets and taskset to put your root login session on a dedicated processor, which is getting to be pretty cheap on modern many-core, many-thread systems. Group scheduling is in the oven, which will allow you to prioritize classes of users in a more general manner, even on UP systems.

On 8/29/07, Simon Arlott [EMAIL PROTECTED] wrote: On Wed, August 29, 2007 10:48, Anand Jahagirdar wrote: Hi, the printk_ratelimit function takes care of syslog flooding; due to the printk_ratelimit function, syslog will not be flooded

Um, no. printk_ratelimit is on the order of *seconds*. This prevents error conditions from causing the system to spend all of its CPU and I/O time logging. It does very little to prevent log spamming.
If I sent you an email every second, it would make it much more difficult for you to find other messages in your inbox. It's possible (easy, even) to write a forkbomber that doesn't actually harm system responsiveness, but will still trigger this printk as fast as possible. If we merge this patch, every cracking toolkit in existence will add such a feature, because log spamming makes it harder for the administrator to find more important messages, and even if the administrator uses grep judiciously to filter them out, that doesn't help if logrotate has already deleted the log containing the information they need to keep /var/log from filling up.

anymore. As soon as the administrator gets this message, he can take action against that user (maybe block the user's access on the server). I think my fork patch is very useful and helps the administrator a lot.

You still haven't explained why this can't be done in userspace. If forkbombing is a serious threat (and it's not) you can run a forkbomb monitor with realtime priority that won't be severely impacted by thrashing among normal priority processes. Userspace has room for much more sophisticated processing anyway, so doing this in the kernel doesn't make much sense.

I would also like to mention that in some cases the ulimit solution won't work. In that case fork bombing takes down the machine and the server needs a reboot. I am sure that in that situation this printk statement helps the administrator know what has happened.

SysRq-t makes it quite obvious that the system has been forkbombed, allowing the administrator to lower ulimits if the box can't handle the load permitted by the default settings. Sometimes SysRq is inconvenient due to lack of physical access, which is why I wrote hangwatch[1]. Hangwatch monitors /proc/loadavg and writes the specified set of SysRq triggers into /proc/sysrq-trigger when the specified load average is exceeded, with the specified frequency.
It doesn't require forks or dynamic memory allocation, so it works basically any time the box isn't locked up badly enough to trigger the NMI watchdog, though realtime users may want to run it under chrt at realtime priority. It's very simple, but it's proven so effective that there really hasn't been much need to develop it further since I initially wrote it a year ago. Given how much we can already do in userspace, I don't really see a need to implement this in the kernel. If you'd like me to add features to hangwatch, let's talk about that. You can even fork it yourself, since it's GPL. -- Chris [1] http://people.redhat.com/csnook/hangwatch/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 9/24] make atomic_read() behave consistently on ia64
Paul Mackerras wrote: Chris Snook writes: I'll do this for the whole patchset. Stay tuned for the resubmit. Could you incorporate Segher's patch to turn atomic_{read,set} into asm on powerpc? Segher claims that using asm is really the only reliable way to ensure that gcc does what we want, and he seems to have a point. Paul. I haven't seen a patch yet. I'm going to resubmit with inline volatile-cast atomic[64]_[read|set] on all architectures as a reference point, and if anyone wants to go and implement some of them in assembly, that's between them and the relevant arch maintainers. I have no problem with (someone else) doing it in assembly. I just don't think it's necessary and won't let it hold up the effort to get consistent behavior on all architectures. -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [74/2many] MAINTAINERS - ATL1 ETHERNET DRIVER
[EMAIL PROTECTED] wrote: Add file pattern to MAINTAINER entry Signed-off-by: Joe Perches [EMAIL PROTECTED] diff --git a/MAINTAINERS b/MAINTAINERS index b8bb108..d9d1bcc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -752,6 +752,7 @@ L: [EMAIL PROTECTED] W: http://sourceforge.net/projects/atl1 W: http://atl1.sourceforge.net S: Maintained +F: drivers/net/atl1* ATM P: Chas Williams atl1 has its own directory, so this one should read F: drivers/net/atl1/* -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/24] make atomic_read() behave consistently on frv
David Howells wrote: Chris Snook [EMAIL PROTECTED] wrote: cpu_relax() contains a barrier, so it should do the right thing. For non-SMP architectures, I'm concerned about interacting with interrupt handlers. Some drivers do use atomic_* operations. I'm not sure that actually answers my question. Why not smp_rmb()? David I would assume because we want to waste time efficiently even on non-SMP architectures, rather than frying the CPU or draining the battery. Certain looping execution patterns can cause the CPU to operate above thermal design power. I have fans on my workstation that only ever come on when running LINPACK, and that's generally memory-bandwidth-bound. Just imagine what happens when you're executing the same few non-serializing instructions in a tight loop without ever stalling on memory fetches or being scheduled out. If there's another reason, I'd like to hear it too, because I'm just guessing here. -- Chris
Re: [PATCH] [74/2many] MAINTAINERS - ATL1 ETHERNET DRIVER
Chris Snook wrote: [EMAIL PROTECTED] wrote: Add file pattern to MAINTAINER entry Signed-off-by: Joe Perches [EMAIL PROTECTED] diff --git a/MAINTAINERS b/MAINTAINERS index b8bb108..d9d1bcc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -752,6 +752,7 @@ L:[EMAIL PROTECTED] W:http://sourceforge.net/projects/atl1 W:http://atl1.sourceforge.net S:Maintained +F:drivers/net/atl1* ATM P:Chas Williams atl1 has its own directory, so this one should read F:drivers/net/atl1/* -- Chris Actually, now that I've seen the format in the intro patch, it would be simpler just to use this: F: drivers/net/atl1/ -- Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/23] make atomic_read() and atomic_set() behavior consistent across all architectures
By popular demand, I've redone the patchset to include volatile casts in atomic_set as well. I've also converted the macros to inline functions, to help catch type mismatches at compile time. This will do weird things on ia64 without Andreas Schwab's fix: http://lkml.org/lkml/2007/8/10/410 Notably absent is a patch for powerpc. I expect Segher Boessenkool's assembly implementation should suffice there: http://lkml.org/lkml/2007/8/10/470 Thanks to all who commented on previous incarnations. -- Chris Snook - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/23] document preferred use of volatile with atomic_t
From: Chris Snook [EMAIL PROTECTED]

Document proper use of volatile for atomic_t operations.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/Documentation/atomic_ops.txt	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/Documentation/atomic_ops.txt	2007-08-13 03:36:43.0 -0400
@@ -12,13 +12,20 @@
 C integer type will fail.  Something like the following should
 suffice:
 
-	typedef struct { volatile int counter; } atomic_t;
+	typedef struct { int counter; } atomic_t;
+
+	Historically, counter has been declared as a volatile int.  This
+is now discouraged in favor of explicitly casting it as volatile where
+volatile behavior is required.  Most architectures will only require such
+a cast in atomic_read() and atomic_set(), as well as their 64-bit versions
+if applicable, since the more complex atomic operations directly or
+indirectly use assembly that results in volatile behavior.
 
 	The first operations to implement for atomic_t's are the
 initializers and plain reads.
 
 	#define ATOMIC_INIT(i)		{ (i) }
-	#define atomic_set(v, i)	((v)->counter = (i))
+	#define atomic_set(v, i)	(*(volatile int *)&(v)->counter = (i))
 
 	The first macro is used in definitions, such as:
@@ -38,7 +45,7 @@
 
 	Next, we have:
 
-	#define atomic_read(v)		((v)->counter)
+	#define atomic_read(v)		(*(volatile int *)&(v)->counter)
 
 which simply reads the current value of the counter.
[PATCH 2/23] make atomic_read() and atomic_set() behavior consistent on alpha
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on alpha.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-alpha/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-alpha/atomic.h	2007-08-13 05:00:36.0 -0400
@@ -14,21 +14,35 @@
 
 /*
- * Counter is volatile to make sure gcc doesn't try to be clever
- * and move things around on us. We need to use _exactly_ the address
- * the user gave us, not some alias that contains the same information.
+ * Make sure gcc doesn't try to be clever and move things around
+ * on us. We need to use _exactly_ the address the user gave us,
+ * not some alias that contains the same information.
  */
-typedef struct { volatile int counter; } atomic_t;
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { int counter; } atomic_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC_INIT(i)		( (atomic_t) { (i) } )
 #define ATOMIC64_INIT(i)	( (atomic64_t) { (i) } )
 
-#define atomic_read(v)		((v)->counter + 0)
-#define atomic64_read(v)	((v)->counter + 0)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter + 0;
+}
+
+static __inline__ long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long *)&v->counter + 0;
+}
 
-#define atomic_set(v,i)		((v)->counter = (i))
-#define atomic64_set(v,i)	((v)->counter = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
+
+static __inline__ void atomic64_set(atomic64_t *v, long i)
+{
+	*(volatile long *)&v->counter = i;
+}
 
 /*
  * To get proper branch prediction for the main line, we must branch
[PATCH 3/23] make atomic_read() and atomic_set() behavior consistent on arm
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on arm.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-arm/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-arm/atomic.h	2007-08-13 04:44:50.0 -0400
@@ -14,13 +14,16 @@
 
 #include <linux/compiler.h>
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
 #ifdef __KERNEL__
 
-#define atomic_read(v)	((v)->counter)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 #if __LINUX_ARM_ARCH__ >= 6
[PATCH 4/23] make atomic_read() and atomic_set() behavior consistent on avr32
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on avr32.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-avr32/atomic.h	2007-08-13 03:14:13.0 -0400
+++ linux-2.6.23-rc3/include/asm-avr32/atomic.h	2007-08-13 04:48:25.0 -0400
@@ -16,11 +16,18 @@
 
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define atomic_read(v)		((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /*
  * atomic_sub_return - subtract the atomic variable
[PATCH 5/23] make atomic_read() and atomic_set() behavior consistent on blackfin
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on blackfin.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-blackfin/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-blackfin/atomic.h	2007-08-13 05:21:07.0 -0400
@@ -18,8 +18,15 @@ typedef struct { } atomic_t;
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define atomic_read(v)		((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static __inline__ void atomic_add(int i, atomic_t * v)
 {
[PATCH 6/23] make atomic_read() and atomic_set() behavior consistent on cris
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on cris.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-cris/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-cris/atomic.h	2007-08-13 05:23:37.0 -0400
@@ -11,12 +11,19 @@
  * resource counting etc..
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /* These should be written in asm but we do it in C for now. */
[PATCH 7/23] make atomic_read() and atomic_set() behavior consistent on frv
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on frv.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-frv/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-frv/atomic.h	2007-08-13 05:27:08.0 -0400
@@ -40,8 +40,16 @@ typedef struct { } atomic_t;
 #define ATOMIC_INIT(i)		{ (i) }
 
-#define atomic_read(v)		((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = (i))
+
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #ifndef CONFIG_FRV_OUTOFLINE_ATOMIC_OPS
 static inline int atomic_add_return(int i, atomic_t *v)
[PATCH 8/23] make atomic_read() and atomic_set() behavior consistent on h8300
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on h8300.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-h8300/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-h8300/atomic.h	2007-08-13 05:29:05.0 -0400
@@ -9,8 +9,15 @@
 typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define atomic_read(v)		((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #include <asm/system.h>
 #include <linux/kernel.h>
[PATCH 9/23] make atomic_read() and atomic_set() behavior consistent on i386
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on i386.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-i386/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-i386/atomic.h	2007-08-13 05:31:45.0 -0400
@@ -25,7 +25,10 @@ typedef struct { int counter; } atomic_t
  *
  * Atomically reads the value of @v.
  */
-#define atomic_read(v)		((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -34,7 +37,10 @@ typedef struct { int counter; } atomic_t
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic_set(v,i)		(((v)->counter) = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add - add integer to atomic variable
[PATCH 10/23] make atomic_read() and atomic_set() behavior consistent on ia64
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on ia64. This will do weird things without Andreas Schwab's fix: http://lkml.org/lkml/2007/8/10/410

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-ia64/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-ia64/atomic.h	2007-08-13 05:38:27.0 -0400
@@ -19,19 +19,34 @@
 
 /*
  * On IA-64, counter must always be volatile to ensure that that the
- * memory accesses are ordered.
+ * memory accesses are ordered.  This must be enforced each time that
+ * counter is read or written.
  */
-typedef struct { volatile __s32 counter; } atomic_t;
-typedef struct { volatile __s64 counter; } atomic64_t;
+typedef struct { __s32 counter; } atomic_t;
+typedef struct { __s64 counter; } atomic64_t;
 
 #define ATOMIC_INIT(i)		((atomic_t) { (i) })
 #define ATOMIC64_INIT(i)	((atomic64_t) { (i) })
 
-#define atomic_read(v)		((v)->counter)
-#define atomic64_read(v)	((v)->counter)
+static inline __s32 atomic_read(atomic_t *v)
+{
+	return *(volatile __s32 *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, __s32 i)
+{
+	*(volatile __s32 *)&v->counter = i;
+}
 
-#define atomic_set(v,i)		(((v)->counter) = (i))
-#define atomic64_set(v,i)	(((v)->counter) = (i))
+static inline __s64 atomic64_read(atomic64_t *v)
+{
+	return *(volatile __s64 *)&v->counter;
+}
+
+static inline void atomic64_set(atomic64_t *v, __s64 i)
+{
+	*(volatile __s64 *)&v->counter = i;
+}
 
 static __inline__ int
 ia64_atomic_add (int i, atomic_t *v)
[PATCH 11/23] make atomic_read() and atomic_set() behavior consistent on m32r
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on m32r.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-m32r/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-m32r/atomic.h	2007-08-13 05:42:09.0 -0400
@@ -22,7 +22,7 @@
  * on us. We need to use _exactly_ the address the user gave us,
  * not some alias that contains the same information.
  */
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
@@ -32,7 +32,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically reads the value of @v.
 */
-#define atomic_read(v)	((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -41,7 +44,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically sets the value of @v to @i.
 */
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add_return - add integer to atomic variable and return it
[PATCH 12/23] make atomic_read() and atomic_set() behavior consistent on m68knommu
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on m68knommu.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-m68knommu/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-m68knommu/atomic.h	2007-08-13 05:47:46.0 -0400
@@ -15,8 +15,15 @@
 typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define atomic_read(v)		((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static __inline__ void atomic_add(int i, atomic_t *v)
 {
[PATCH 13/23] make atomic_read() and atomic_set() behavior consistent on m68k
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on m68k.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-m68k/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-m68k/atomic.h	2007-08-13 05:45:43.0 -0400
@@ -16,8 +16,15 @@
 typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define atomic_read(v)		((v)->counter)
-#define atomic_set(v, i)	(((v)->counter) = i)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static inline void atomic_add(int i, atomic_t *v)
 {
[PATCH 15/23] make atomic_read() and atomic_set() behavior consistent on parisc
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on parisc.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-parisc/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-parisc/atomic.h	2007-08-13 05:59:35.0 -0400
@@ -128,7 +128,7 @@ __cmpxchg(volatile void *ptr, unsigned l
  * Cache-line alignment would conflict with, for example, linux/module.h
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 /* It's possible to reduce all atomic operations to either
  * __atomic_add_return, atomic_set and atomic_read (the latter
@@ -159,7 +159,7 @@ static __inline__ void atomic_set(atomic
 static __inline__ int atomic_read(const atomic_t *v)
 {
-	return v->counter;
+	return *(volatile int *)&v->counter;
 }
 
 /* exported interface */
@@ -227,7 +227,7 @@ static __inline__ int atomic_add_unless(
 
 #ifdef CONFIG_64BIT
 
-typedef struct { volatile s64 counter; } atomic64_t;
+typedef struct { s64 counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i) ((atomic64_t) { (i) })
 
@@ -258,7 +258,7 @@ atomic64_set(atomic64_t *v, s64 i)
 static __inline__ s64 atomic64_read(const atomic64_t *v)
 {
-	return v->counter;
+	return *(volatile s64 *)&v->counter;
 }
 
 #define atomic64_add(i,v)	((void)(__atomic64_add_return( ((s64)i),(v
[PATCH 16/23] make atomic_read() and atomic_set() behavior consistent on s390
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on s390.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-s390/atomic.h	2007-08-13 03:14:13.0 -0400
+++ linux-2.6.23-rc3/include/asm-s390/atomic.h	2007-08-13 06:04:58.0 -0400
@@ -67,8 +67,15 @@ typedef struct {
 
 #endif /* __GNUC__ */
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static __inline__ int atomic_add_return(int i, atomic_t * v)
 {
@@ -182,8 +189,15 @@ typedef struct {
 
 #endif /* __GNUC__ */
 
-#define atomic64_read(v)	((v)->counter)
-#define atomic64_set(v,i)	(((v)->counter) = (i))
+static __inline__ long long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long long *)&v->counter;
+}
+
+static __inline__ void atomic64_set(atomic64_t *v, long long i)
+{
+	*(volatile long long *)&v->counter = i;
+}
 
 static __inline__ long long atomic64_add_return(long long i, atomic64_t * v)
 {
[PATCH 14/23] make atomic_read() and atomic_set() behavior consistent on mips
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on mips.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-mips/atomic.h	2007-08-13 03:14:13.0 -0400
+++ linux-2.6.23-rc3/include/asm-mips/atomic.h	2007-08-13 05:52:14.0 -0400
@@ -20,7 +20,7 @@
 #include <asm/war.h>
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	{ (i) }
 
@@ -30,7 +30,10 @@ typedef struct { volatile int counter; }
 *
 * Atomically reads the value of @v.
 */
-#define atomic_read(v)		((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /*
 * atomic_set - set atomic variable
@@ -39,7 +42,10 @@ typedef struct { volatile int counter; }
 *
 * Atomically sets the value of @v to @i.
 */
-#define atomic_set(v,i)		((v)->counter = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /*
 * atomic_add - add integer to atomic variable
@@ -404,7 +410,7 @@ static __inline__ int atomic_add_unless(
 
 #ifdef CONFIG_64BIT
 
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i)	{ (i) }
 
@@ -413,14 +419,20 @@ typedef struct { volatile long counter;
 * @v: pointer of type atomic64_t
 *
 */
-#define atomic64_read(v)	((v)->counter)
+static __inline__ long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long *)&v->counter;
+}
 
 /*
 * atomic64_set - set atomic variable
 * @v: pointer of type atomic64_t
 * @i: required value
 */
-#define atomic64_set(v,i)	((v)->counter = (i))
+static __inline__ void atomic64_set(atomic64_t *v, long i)
+{
+	*(volatile long *)&v->counter = i;
+}
 
 /*
 * atomic64_add - add integer to atomic variable
[PATCH 21/23] make atomic_read() and atomic_set() behavior consistent on v850
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on v850.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-v850/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-v850/atomic.h	2007-08-13 06:19:32.0 -0400
@@ -27,8 +27,15 @@ typedef struct { int counter; } atomic_t
 
 #ifdef __KERNEL__
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	(((v)->counter) = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 static inline int atomic_add_return (int i, volatile atomic_t *v)
 {
[PATCH 20/23] make atomic_read() and atomic_set() behavior consistent on sparc
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sparc. Leave the sparc-internal atomic24_t type alone.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sparc/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sparc/atomic.h	2007-08-13 06:12:49.0 -0400
@@ -13,7 +13,7 @@
 
 #include <linux/types.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #ifdef __KERNEL__
 
@@ -61,7 +61,10 @@ extern int atomic_cmpxchg(atomic_t *, in
 extern int atomic_add_unless(atomic_t *, int, int);
 extern void atomic_set(atomic_t *, int);
 
-#define atomic_read(v)	((v)->counter)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 #define atomic_add(i, v)	((void)__atomic_add_return( (int)(i), (v)))
 #define atomic_sub(i, v)	((void)__atomic_add_return(-(int)(i), (v)))
[PATCH 19/23] make atomic_read() and atomic_set() behavior consistent on sparc64
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sparc64.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sparc64/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sparc64/atomic.h	2007-08-13 06:17:01.0 -0400
@@ -11,17 +11,31 @@
 
 #include <linux/types.h>
 #include <asm/system.h>
 
-typedef struct { volatile int counter; } atomic_t;
-typedef struct { volatile __s64 counter; } atomic64_t;
+typedef struct { int counter; } atomic_t;
+typedef struct { __s64 counter; } atomic64_t;
 
 #define ATOMIC_INIT(i)		{ (i) }
 #define ATOMIC64_INIT(i)	{ (i) }
 
-#define atomic_read(v)		((v)->counter)
-#define atomic64_read(v)	((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static __inline__ __s64 atomic64_read(atomic64_t *v)
+{
+	return *(volatile __s64 *)&v->counter;
+}
 
-#define atomic_set(v, i)	(((v)->counter) = i)
-#define atomic64_set(v, i)	(((v)->counter) = i)
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
+
+static __inline__ void atomic64_set(atomic64_t *v, __s64 i)
+{
+	*(volatile __s64 *)&v->counter = i;
+}
 
 extern void atomic_add(int, atomic_t *);
 extern void atomic64_add(int, atomic64_t *);
[PATCH 17/23] make atomic_read() and atomic_set() behavior consistent on sh64
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sh64.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sh64/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sh64/atomic.h	2007-08-13 06:08:37.0 -0400
@@ -19,12 +19,19 @@
  *
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	( (atomic_t) { (i) } )
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	((v)->counter = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #include <asm/system.h>
[PATCH 18/23] make atomic_read() and atomic_set() behavior consistent on sh
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on sh.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-sh/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-sh/atomic.h	2007-08-13 06:07:16.0 -0400
@@ -7,12 +7,19 @@
  *
  */
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #define ATOMIC_INIT(i)	( (atomic_t) { (i) } )
 
-#define atomic_read(v)	((v)->counter)
-#define atomic_set(v,i)	((v)->counter = (i))
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
+
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 #include <linux/compiler.h>
 #include <asm/system.h>
[PATCH 22/23] make atomic_read() and atomic_set() behavior consistent on x86_64
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on x86_64.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-x86_64/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-x86_64/atomic.h	2007-08-13 06:22:43.0 -0400
@@ -32,7 +32,10 @@ typedef struct { int counter; } atomic_t
  *
  * Atomically reads the value of @v.
  */
-#define atomic_read(v)		((v)->counter)
+static __inline__ int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -41,7 +44,10 @@ typedef struct { int counter; } atomic_t
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic_set(v,i)		(((v)->counter) = (i))
+static __inline__ void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add - add integer to atomic variable
@@ -206,7 +212,7 @@ static __inline__ int atomic_sub_return(
 
 /* An 64bit atomic type */
 
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { long counter; } atomic64_t;
 
 #define ATOMIC64_INIT(i)	{ (i) }
 
@@ -217,7 +223,10 @@ typedef struct { volatile long counter;
  * Atomically reads the value of @v.
  * Doesn't imply a read memory barrier.
  */
-#define atomic64_read(v)		((v)->counter)
+static __inline__ long atomic64_read(atomic64_t *v)
+{
+	return *(volatile long *)&v->counter;
+}
 
 /**
  * atomic64_set - set atomic64 variable
@@ -226,7 +235,10 @@ typedef struct { volatile long counter;
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic64_set(v,i)	(((v)->counter) = (i))
+static __inline__ void atomic64_set(atomic64_t *v, long i)
+{
+	*(volatile long *)&v->counter = i;
+}
 
 /**
  * atomic64_add - add integer to atomic64 variable
[PATCH 23/23] make atomic_read() and atomic_set() behavior consistent on xtensa
From: Chris Snook [EMAIL PROTECTED]

Use volatile consistently in atomic.h on xtensa.

Signed-off-by: Chris Snook [EMAIL PROTECTED]

--- linux-2.6.23-rc3-orig/include/asm-xtensa/atomic.h	2007-07-08 19:32:17.0 -0400
+++ linux-2.6.23-rc3/include/asm-xtensa/atomic.h	2007-08-13 06:31:58.0 -0400
@@ -15,7 +15,7 @@
 
 #include <linux/stringify.h>
 
-typedef struct { volatile int counter; } atomic_t;
+typedef struct { int counter; } atomic_t;
 
 #ifdef __KERNEL__
 #include <asm/processor.h>
@@ -47,7 +47,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically reads the value of @v.
  */
-#define atomic_read(v)		((v)->counter)
+static inline int atomic_read(atomic_t *v)
+{
+	return *(volatile int *)&v->counter;
+}
 
 /**
  * atomic_set - set atomic variable
@@ -56,7 +59,10 @@ typedef struct { volatile int counter; }
  *
  * Atomically sets the value of @v to @i.
  */
-#define atomic_set(v,i)		((v)->counter = (i))
+static inline void atomic_set(atomic_t *v, int i)
+{
+	*(volatile int *)&v->counter = i;
+}
 
 /**
  * atomic_add - add integer to atomic variable