Re: [patch] oom: kill all threads that share mm with killed task
On Mon, 23 Apr 2007, Christoph Lameter wrote: Obvious fix. It was broken by http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584 Dec 7. So it's in 2.6.20 and later. Candidate for stable? I agree it's obvious enough that it should be included in stable. Otherwise the entire iteration becomes a big no-op and it won't alleviate the OOM condition in one call to out_of_memory() because there may be outstanding tasks with the shared ->mm. David - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Transparently handle .symbol lookup for kprobes
Srinivasa Ds writes:

+	} else {						\
+		char dot_name[KSYM_NAME_LEN+1];			\
+		dot_name[0] = '.';				\
+		dot_name[1] = '\0';				\
+		strncat(dot_name, name, KSYM_NAME_LEN);		\

Assuming the kernel strncat works like the userspace one does, there is a possibility that dot_name[] won't be properly null-terminated here. If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch dot_name[KSYM_NAME_LEN]. Paul.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
Christoph Hellwig writes: The first question is obviously: is this really something we want? Spawning kernel threads on demand without reaping them properly seems quite dangerous. What specifically has to be done to reap a kernel thread? Are you concerned about the number of threads, or about having zombies hanging around? Paul.
SOME STUFF ABOUT REISER4
On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper [EMAIL PROTECTED] said: I know that this whole effort has been put in disarray by the prosecution of Hans Reiser, but I'm curious as to its status. Is Reiser4 going to be going into the Linus kernel anytime soon? Is there somewhere I should be looking to find this out without wasting bandwidth here? There was a thread the other day, that talked about Reiser4. It took a while but I have found it (actually two) http://lkml.org/lkml/2007/4/5/360 http://lkml.org/lkml/2007/4/9/4 You may want to check them out. -- [EMAIL PROTECTED] -- http://www.fastmail.fm - Access your email from home and the web
Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes
Roland McGrath wrote: I have to admit I still don't really understand all this. Is it documented somewhere? I have explained it in public more than once, but I don't know off hand anywhere that was helpfully recorded. Thanks very much. I'd been poking about, but the closest I came to an actual description was various patches fixing bugs, so it was a little incomplete. For example, a Xen-enabled kernel can use a single vDSO image (or a single pair of int80/sysenter images), containing the nosegneg hwcap note. When there is no need for it (native or hvm or 64-bit hv or whatever), it just clears the mask word. If you actually do this, you'll want to modify the NOTE_KERNELCAP_BEGIN macro to define a global label you can use with VDSO_SYM. Thanks for the pointer. I'd been getting a bit of heat for enabling the nosegneg flag unconditionally. If I can make it Xen-specific then that will be one less source of complaints. J
Re: [REPORT] cfs-v4 vs sd-0.44
Arjan van de Ven wrote: Within reason, it's not the number of clients that X has that causes its CPU bandwidth use to sky rocket and cause problems. It's more to do with what type of clients they are. Most GUIs (even ones that are constantly updating visual data (e.g. gkrellm -- I can open quite a large number of these without increasing X's CPU usage very much)) cause very little load on the X server. The exceptions to this are... there are actually 2, and not just 1, kinds of X server here, and they are VERY VERY different in behavior. Case 1: Accelerated driver. If X talks to a decent enough card that it supports well with acceleration, it will be very rare for X itself to spend any kind of significant amount of CPU time; all the really heavy stuff is done in hardware, and asynchronously at that. A bit of batching will greatly improve system performance in this case. Case 2: Unaccelerated VESA. Some drivers in X, especially the VESA and NV drivers (which are quite common; vesa is used on all hardware without a special driver nowadays), have no or not enough acceleration to matter for modern desktops. This means the CPU is doing all the heavy lifting, in the X program. In this case even a simple "move the window a bit" becomes quite a bit of a CPU hog already. Mine's a: SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according to X's display settings tool. Which category does that fall into? It's not a special adapter and is just the one that came with the motherboard. It doesn't use much CPU unless I grab a window and wiggle it all over the screen or do something like ls -lR / in an xterm. The cases are fundamentally different in behavior, because in the first case, X hardly consumes the time it would get in any scheme, while in the second case X really is CPU bound and will happily consume any CPU time it can get. Which still doesn't justify an elaborate points sharing scheme.
Whichever way you look at that that's just another way of giving X more CPU bandwidth and there are simpler ways to give X more CPU if it needs it. However, I think there's something seriously wrong if it needs the -19 nice that I've heard mentioned. You might as well just run it as a real time process. Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce
NonExecutable Bit in 32Bit
Hey, is it right that the NX bit is not used under the i386 arch but is used under the x86_64 arch? If so, is there a specific reason for it not being used? Ciao Thilo
Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.
On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote: Currently because vmlinux does not reflect that the kernel is relocatable we still have to support CONFIG_PHYSICAL_START. So this patch adds a small c program to do what we cannot do with a linker script, set the elf header type to ET_DYN. This should remove the last obstacle to removing CONFIG_PHYSICAL_START on x86_64. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] [Dropping fastboot mailing list from CC as kexec mailing list is new list for this discussion] [..]

+void file_open(const char *name)
+{
+	if ((fd = open(name, O_RDWR, 0)) < 0)
+		die("Unable to open `%s': %m", name);
+}
+
+static void mketrel(void)
+{
+	unsigned char e_type[2];
+
+	if (read(fd, e_ident, sizeof(e_ident)) != sizeof(e_ident))
+		die("Cannot read ELF header: %s\n", strerror(errno));
+
+	if (memcmp(e_ident, ELFMAG, 4) != 0)
+		die("No ELF magic\n");
+
+	if ((e_ident[EI_CLASS] != ELFCLASS64) &&
+	    (e_ident[EI_CLASS] != ELFCLASS32))
+		die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]);
+
+	if ((e_ident[EI_DATA] != ELFDATA2LSB) &&
+	    (e_ident[EI_DATA] != ELFDATA2MSB))
+		die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]);
+
+	if (e_ident[EI_VERSION] != EV_CURRENT)
+		die("Unknown ELF version: %d\n", e_ident[EI_VERSION]);
+
+	if (e_ident[EI_DATA] == ELFDATA2LSB) {
+		e_type[0] = ET_REL & 0xff;
+		e_type[1] = ET_REL >> 8;
+	} else {
+		e_type[1] = ET_REL & 0xff;
+		e_type[0] = ET_REL >> 8;
+	}

Hi Eric, Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux as it does not find it to be an executable type. I am not well versed with the various conventions, but if I go through the Executable and Linking Format document, this is what it says about the various file types.

• A relocatable file holds code and data suitable for linking with other object files to create an executable or a shared object file.
• An executable file holds a program suitable for execution.
• A shared object file holds code and data suitable for linking in two contexts.
First, the link editor may process it with other relocatable and shared object files to create another object file. Second, the dynamic linker combines it with an executable file and other shared objects to create a process image. So the above does not seem to fit the ET_REL type. We can't relink this vmlinux? And it does not seem to fit the ET_DYN definition either. We are not relinking this vmlinux with another executable or other relocatable files. I remember once you mentioned the term "dynamic executable", which can be loaded at a non-compiled address and run without requiring any relocation processing. This vmlinux would fall in that category, but I can't relate it to the standard ELF file type definitions. Thanks Vivek
Re: [REPORT] cfs-v4 vs sd-0.44
* Peter Williams [EMAIL PROTECTED] wrote: The cases are fundamentally different in behavior, because in the first case, X hardly consumes the time it would get in any scheme, while in the second case X really is CPU bound and will happily consume any CPU time it can get. Which still doesn't justify an elaborate points sharing scheme. Whichever way you look at that that's just another way of giving X more CPU bandwidth and there are simpler ways to give X more CPU if it needs it. However, I think there's something seriously wrong if it needs the -19 nice that I've heard mentioned. Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. So CFS does not 'need' a reniced X. There are simply advantages to negative nice levels: for example screen refreshes are smoother on any scheduler i tried. BUT, there is a caveat: on non-CFS schedulers i tried X is much more prone to get into 'overscheduling' scenarios that visibly hurt X's performance, while on CFS there's a max of 1000-1500 context switches a second at nice -10. (which, considering the cost of a context switch is well under 1% overhead.) So, my point is, the nice level of X for desktop users should not be set lower than a low limit suggested by that particular scheduler's author. That limit is scheduler-specific. Con i think recommends a nice level of -1 for X when using SD [Con, can you confirm?], while my tests show that if you want you can go as low as -10 under CFS, without any bad side-effects. (-19 was a bit too much) [...] You might as well just run it as a real time process. hm, that would be a bad idea under any scheduler (including CFS), because real time processes can starve other processes indefinitely. Ingo
Re: NonExecutable Bit in 32Bit
On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote: Hey, is it right, that the NX Bit is not used under i386-Arch but under x86_64-Arch? When yes, is there a special argument for it not to be used? Ciao Thilo I don't think so - some i386 cpus definitely have support for the NX bit. Would having this be supported in i386 help debugging (and security) significantly? William Heimbigner [EMAIL PROTECTED]
Re: [patch v2] Fixes and cleanups for earlyprintk aka boot console.
On Thu, 15 Mar 2007 16:46:39 +0100 Gerd Hoffmann [EMAIL PROTECTED] wrote: The console subsystem already has an idea of a boot console, using the CON_BOOT flag. The implementation has some flaws though. The major problem is that presence of a boot console makes register_console() ignore any other console devices (unless explicitly specified on the kernel command line). This patch fixes the console selection code to *not* consider a boot console a full-featured one, so the first non-boot console registering will become the default console instead. This way the unregister call for the boot console in the register_console() function actually triggers and the handover from the boot console to the real console device works smoothly. Added a printk for the handover, so you know which console device the output goes to when the boot console stops printing messages. The disable_early_printk() call is obsolete with that patch, explicitly disabling the early console isn't needed any more as it works automagically with that patch. I've walked through the tree, dropped all disable_early_printk() instances found below arch/ and tagged the consoles with CON_BOOT if needed. The code is tested on x86, sh (thanks to Paul) and mips (thanks to Ralf). Changes to last version: Rediffed against -rc3, adapted to mips cleanups by Ralf, fixed udbg-immortal cmd line arg on powerpc. I get this, across netconsole: [17179569.184000] console handover: boot [earlyvga_f_0] -> real [tty0] wanna take a look at why there's cruft in bootconsole-name please? in grub.conf I have kernel /boot/bzImage-2.6.21-rc7-mm1 ro root=LABEL=/ rhgb vga=0x263 [EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:0D:56:C6:C6:CC profile=1 earlyprintk=vga resume=8:5 time and I'm using http://userweb.kernel.org/~akpm/config-sony.txt Thanks.
Re: [PATCH] Transparently handle .symbol lookup for kprobes
Paul Mackerras wrote: Srinivasa Ds writes:

+	} else {						\
+		char dot_name[KSYM_NAME_LEN+1];			\
+		dot_name[0] = '.';				\
+		dot_name[1] = '\0';				\
+		strncat(dot_name, name, KSYM_NAME_LEN);		\

Assuming the kernel strncat works like the userspace one does, there is a possibility that dot_name[] won't be properly null-terminated here. If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch dot_name[KSYM_NAME_LEN]. Irrespective of the length of the string, the kernel implementation of strncat (lib/string.c) ensures that the last character of the string is set to null. So dot_name[] is always null terminated.

char *strncat(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	if (count) {
		while (*dest)
			dest++;
		while ((*dest++ = *src++) != 0) {
			if (--count == 0) {
				*dest = '\0';
				break;
			}
		}
	}
	return tmp;
}
EXPORT_SYMBOL(strncat);

===

Is this OK then? Thanks Srinivasa DS
Re: [patch 1/4] Ignore stolen time in the softlockup watchdog
On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] wrote: The softlockup watchdog is currently a nuisance in a virtual machine, since the whole system could have the CPU stolen from it for a long period of time. While it would be unlikely for a guest domain to be denied timer interrupts for over 10s, it could happen and any softlockup message would be completely spurious. Earlier I proposed that sched_clock() return time in unstolen nanoseconds, which is how Xen and VMI currently implement it. If the softlockup watchdog uses sched_clock() to measure time, it would automatically ignore stolen time, and therefore only report when the guest itself locked up. When running native, sched_clock() returns real-time nanoseconds, so the behaviour would be unchanged. Note that sched_clock() used this way is inherently per-cpu, so this patch makes sure that the per-processor watchdog thread initialized its own timestamp. This patch (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch) causes six failures in the locking self-tests, which I must say is rather clever of it. Here's the first one: [17179569.184000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar [17179569.184000] ... MAX_LOCKDEP_SUBCLASSES:8 [17179569.184000] ... MAX_LOCK_DEPTH: 30 [17179569.184000] ... MAX_LOCKDEP_KEYS:2048 [17179569.184000] ... CLASSHASH_SIZE: 1024 [17179569.184000] ... MAX_LOCKDEP_ENTRIES: 8192 [17179569.184000] ... MAX_LOCKDEP_CHAINS: 16384 [17179569.184000] ... 
CHAINHASH_SIZE: 8192 [17179569.184000] memory used by lock dependency info: 992 kB [17179569.184000] per task-struct memory footprint: 1200 bytes [17179569.184000] [17179569.184000] | Locking API testsuite: [17179569.184000] [17179569.184000] | spin |wlock |rlock |mutex | wsem | rsem | [17179569.184000] -- [17179569.184000] A-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184000] A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184000] A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184001] A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | [17179569.184002] A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184003] A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184004] A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184005] double unlock: ok | ok | ok | ok | ok | ok | [17179569.184006] initialize held: ok | ok | ok | ok | ok | ok | [17179569.184006] bad unlock order: ok | ok | ok | ok | ok | ok | [17179569.184006] -- [17179569.184006] recursive read-lock: | ok | | ok | [17179569.184006]recursive read-lock #2: | ok | | ok | [17179569.184007] mixed read-write-lock: | ok | | ok | [17179569.184007] mixed write-read-lock: | ok | | ok | [17179569.184007] -- [17179569.184007] hard-irqs-on + irq-safe-A/12: ok | ok | ok | [17179569.184007] soft-irqs-on + irq-safe-A/12: ok | ok | ok | [17179569.184007] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [17179569.184007] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [17179569.184007]sirq-safe-A = hirqs-on/12: ok | ok |irq event stamp: 458 [17179569.184007] hardirqs last enabled at (458): [c01e4116] irqsafe2A_rlock_12+0x96/0xa3 [17179569.184007] hardirqs last disabled at (457): [c01095b9] sched_clock+0x5e/0xe9 [17179569.184007] softirqs last enabled at (454): [c01e4101] irqsafe2A_rlock_12+0x81/0xa3 [17179569.184007] softirqs last disabled at (450): [c01e408b] irqsafe2A_rlock_12+0xb/0xa3 [17179569.184007] FAILED| [c0104cf0] dump_trace+0x63/0x1ec 
[17179569.184007] [c0104e93] show_trace_log_lvl+0x1a/0x30 [17179569.184007] [c01059ec] show_trace+0x12/0x14 [17179569.184007] [c0105a45] dump_stack+0x16/0x18 [17179569.184007] [c01e1eb5] dotest+0x6b/0x3d0 [17179569.184007] [c01eb249] locking_selftest+0x915/0x1a58 [17179569.184007] [c048c979] start_kernel+0x1d0/0x2a2 [17179569.184007] === [17179569.184007] [17179569.184007]sirq-safe-A = hirqs-on/21:irq event stamp: 462
Re: [REPORT] First glitch1 results, 2.6.21-rc7-git6-CFSv5 + SD 0.46
* Ed Tomlinson [EMAIL PROTECTED] wrote:

  SD 0.46:          1-2 FPS
  cfs v5 nice -19:  219-233 FPS
  cfs v5 nice 0:    1000-1996 FPS
  cfs v5 nice -10:  60-65 FPS

the problem is, the glxgears portion of this test is an _inverse_ testcase. The reason? glxgears on true 3D hardware will _not_ use X, it will directly use the 3D driver of the kernel. So by renicing X to -19 you give the xterms more chance to show stuff - the performance of the glxgears will 'degrade' - but that is what you asked for: glxgears is 'just another CPU hog' that competes with X, it's not a true X client. if you are after glxgears performance in this test then you'll get the best performance out of this by renicing X to +19 or even SCHED_BATCH. Ingo
Re: [patch 1/4] Ignore stolen time in the softlockup watchdog
Andrew Morton wrote: On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] wrote: The softlockup watchdog is currently a nuisance in a virtual machine, since the whole system could have the CPU stolen from it for a long period of time. While it would be unlikely for a guest domain to be denied timer interrupts for over 10s, it could happen and any softlockup message would be completely spurious. Earlier I proposed that sched_clock() return time in unstolen nanoseconds, which is how Xen and VMI currently implement it. If the softlockup watchdog uses sched_clock() to measure time, it would automatically ignore stolen time, and therefore only report when the guest itself locked up. When running native, sched_clock() returns real-time nanoseconds, so the behaviour would be unchanged. Note that sched_clock() used this way is inherently per-cpu, so this patch makes sure that the per-processor watchdog thread initialized its own timestamp. This patch (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch) causes six failures in the locking self-tests, which I must say is rather clever of it. Interesting. Which variation of sched_clock do you have in your tree at the moment? J
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: * Peter Williams [EMAIL PROTECTED] wrote: The cases are fundamentally different in behavior, because in the first case, X hardly consumes the time it would get in any scheme, while in the second case X really is CPU bound and will happily consume any CPU time it can get. Which still doesn't justify an elaborate points sharing scheme. Whichever way you look at that that's just another way of giving X more CPU bandwidth and there are simpler ways to give X more CPU if it needs it. However, I think there's something seriously wrong if it needs the -19 nice that I've heard mentioned. Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. As a data point here, and probably nothing to do with X, but I did manage to lock it up, solid, reset button time tonight, by wanting 'smart' to get done with an update session after amanda had started. I took both smart processes I could see in htop all the way to -19, but when it was about done about 3 minutes later, everything came to an instant, frozen, reset button required lockup. I should have stopped at -17 I guess. :( So CFS does not 'need' a reniced X. There are simply advantages to negative nice levels: for example screen refreshes are smoother on any scheduler i tried. BUT, there is a caveat: on non-CFS schedulers i tried X is much more prone to get into 'overscheduling' scenarios that visibly hurt X's performance, while on CFS there's a max of 1000-1500 context switches a second at nice -10. (which, considering the cost of a context switch is well under 1% overhead.) So, my point is, the nice level of X for desktop users should not be set lower than a low limit suggested by that particular scheduler's author. That limit is scheduler-specific. Con i think recommends a nice level of -1 for X when using SD [Con, can you confirm?], while my tests show that if you want you can go as low as -10 under CFS, without any bad side-effects. 
(-19 was a bit too much) [...] You might as well just run it as a real time process. hm, that would be a bad idea under any scheduler (including CFS), because real time processes can starve other processes indefinitely. Ingo -- Cheers, Gene There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author) I have discovered that all human evil comes from this, man's being unable to sit still in a room. -- Blaise Pascal
Re: [REPORT] cfs-v4 vs sd-0.44
Ingo Molnar wrote:

static void yield_task_fair(struct rq *rq, struct task_struct *p,
			    struct task_struct *p_to)
{
	struct rb_node *curr, *next, *first;
	struct task_struct *p_next;

	/*
	 * yield-to support: if we are on the same runqueue then
	 * give half of our wait_runtime (if it's positive) to the other task:
	 */
	if (p_to && p->wait_runtime > 0) {
		p->wait_runtime >>= 1;
		p_to->wait_runtime += p->wait_runtime;
	}

the above is the basic expression of: charge a positive bank balance. [..] [note, due to the nanoseconds unit there's no rounding loss to worry about.] Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss? Ingo Rogan
Re: [REPORT] cfs-v4 vs sd-0.44
* Gene Heskett [EMAIL PROTECTED] wrote: Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. As a data point here, and probably nothing to do with X, but I did manage to lock it up, solid, reset button time tonight, by wanting 'smart' to get done with an update session after amanda had started. I took both smart processes I could see in htop all the way to -19, but when it was about done about 3 minutes later, everything came to an instant, frozen, reset button required lockup. I should have stopped at -17 I guess. :( yeah, i guess this has little to do with X. I think in your scenario it might have been smarter to either stop, or to renice the workloads that took away CPU power from others to _positive_ nice levels. Negative nice levels can indeed be dangerous. (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.) Ingo
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 12:58 +1000, Neil Brown wrote: On Friday April 20, [EMAIL PROTECTED] wrote: Scale writeback cache per backing device, proportional to its writeout speed. So it works like this: We account for writeout in full pages. When a page has the Writeback flag cleared, we account that as a successfully retired write for the relevant bdi. By using floating averages we keep track of how many writes each bdi has retired 'recently' where the unit of time in which we understand 'recently' is a single page written. That is actually that period I keep referring to. So recently is the last 'period' number of writeout completions. We keep a floating average for each bdi, and a floating average for the total writeouts (that 'average' is, of course, 1.) 1 in the sense of unity, yes :-) Using these numbers we can calculate what fraction of 'recently' retired writes were retired by each bdi (get_writeout_scale). Multiplying this fraction by the system-wide number of pages that are allowed to be dirty before write-throttling, we get the number of pages that the bdi can have dirty before write-throttling the bdi. I note that the same fraction is *not* applied to background_thresh. Should it be? I guess not - there would be interesting starting transients, as a bdi which had done no writeout would not be allowed any dirty pages, so background writeout would start immediately, which isn't what you want... or is it? This is something I have not been able to come to a conclusive answer yet,... For each bdi we also track the number of (dirty, writeback, unstable) pages and do not allow this to exceed the limit set for this bdi. The calculations involving 'reserve' in get_dirty_limits are a little confusing. It looks like you are calculating how much total head-room there is for the bdi (pages that the system can still dirty - pages this bdi has dirty) and making sure the number returned in pbdi_dirty doesn't allow more than that to be used.
Yes, it limits the earned share of the total dirty limit to the possible share, ensuring that the total dirty limit is never exceeded. This is especially relevant when the proportions change faster than the pages get written out, ie. when the period < the total dirty limit. This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test. Ok, that makes sense I guess. Currently you have a rather odd situation where

+	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+		break;

might include numbers obtained with bdi_stat_sum being compared with numbers obtained with bdi_stat. Yes, I was aware of that. The bdi_thresh is based on bdi_stat() numbers, whereas the others could be bdi_stat_sum(). I think this is ok, since the threshold is a 'guess' anyway, we just _need_ to ensure we do not get trapped by writeouts not arriving (due to getting stuck in the per cpu deltas). -- I have all this commented in the new version. With these patches, the VM still (I think) assumes that each BDI has a reasonable queue limit, so that writeback_inodes will block on a full queue. If a BDI has a very large queue, balance_dirty_pages will simply turn lots of DIRTY pages into WRITEBACK pages and then think We've done our duty without actually blocking at all. It will block once we exceed the total number of dirty pages allowed for that BDI. But yes, this does not take away the need for queue limits. This work was primarily aimed at allowing multiple queues to not interfere as much, so they all can make progress and not get starved. With the extra accounting that we now have, I would like to see balance_dirty_pages wait until RECLAIMABLE+WRITEBACK is actually less than 'threshold'. This would probably mean that we would need to support per-bdi background_writeout to smooth things out.
Maybe that is fodder for another patch-set. Indeed, I still have to wrap my mind around the background thing. Your input is appreciated. You set: + vm_cycle_shift = 1 + ilog2(vm_total_pages); Can you explain that? You found the one random knob I hid :-) My experience is that scaling dirty limits with main memory isn't what we really want. When you get machines with very large memory, the amount that you want to be dirty is more a function of the speed of your IO devices, rather than the amount of memory, otherwise you can sometimes see large filesystem lags ('sync' taking minutes?) I wonder if it makes sense to try to limit the dirty data for a bdi to the amount that it can write out in some period of time - maybe 3 seconds. Probably configurable. You seem to have almost all the infrastructure in place to do that, and I think it
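The proportional split described in this thread - each bdi earns a share of the global dirty threshold equal to its fraction of 'recently' retired writeouts - can be sketched as a small userspace C model. This is an illustrative assumption, not the kernel's code: `struct bdi_sketch` and `bdi_dirty_limit` are hypothetical names standing in for the real per-bdi accounting.

```c
#include <assert.h>

/* Hypothetical sketch (not the actual kernel code): each bdi's share of
 * the global dirty threshold is the fraction of 'recent' writeout
 * completions it retired. */
struct bdi_sketch {
    unsigned long completions;   /* writeouts retired 'recently' */
};

/* Pages this bdi may dirty before being write-throttled:
 * global_thresh * (bdi completions / total completions). */
static unsigned long bdi_dirty_limit(const struct bdi_sketch *bdi,
                                     unsigned long total_completions,
                                     unsigned long global_thresh)
{
    if (total_completions == 0)
        return global_thresh;    /* startup transient: no history yet */
    return (unsigned long)((unsigned long long)global_thresh *
                           bdi->completions / total_completions);
}
```

Note how the model exhibits exactly the starting transient Neil raises: a bdi that has retired no writeouts gets a limit of zero until it earns some share.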
Re: [patch 1/4] Ignore stolen time in the softlockup watchdog
On Mon, 23 Apr 2007 23:58:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] wrote: The softlockup watchdog is currently a nuisance in a virtual machine, since the whole system could have the CPU stolen from it for a long period of time. While it would be unlikely for a guest domain to be denied timer interrupts for over 10s, it could happen and any softlockup message would be completely spurious. Earlier I proposed that sched_clock() return time in unstolen nanoseconds, which is how Xen and VMI currently implement it. If the softlockup watchdog uses sched_clock() to measure time, it would automatically ignore stolen time, and therefore only report when the guest itself locked up. When running native, sched_clock() returns real-time nanoseconds, so the behaviour would be unchanged. Note that sched_clock() used this way is inherently per-cpu, so this patch makes sure that the per-processor watchdog thread initialized its own timestamp. This patch (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch) causes six failures in the locking self-tests, which I must say is rather clever of it. Interesting. I'll say. Which variation of sched_clock do you have in your tree at the moment? Andi's, plus the below fix. Sigh. I thought I was only two more bugs away from a release, then... 
[18014389.347124] BUG: unable to handle kernel paging request at virtual address 6b6b7193 [18014389.347142] printing eip: [18014389.347149] c029a80c [18014389.347156] *pde = [18014389.347166] Oops: [#1] [18014389.347174] Modules linked in: i915 drm ipw2200 sonypi ipv6 autofs4 hidp l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables cpufreq_ondemand video sbs button battery asus_acpi ac nvram ohci1394 ieee1394 ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm sr_mod cdrom snd_timer ieee80211 i2c_i801 piix ieee80211_crypt i2c_core generic snd soundcore snd_page_alloc ext3 jbd ide_disk ide_core [18014389.347520] CPU:0 [18014389.347521] EIP:0060:[c029a80c]Tainted: G D VLI [18014389.347522] EFLAGS: 00010296 (2.6.21-rc7-mm1 #35) [18014389.347547] EIP is at input_release_device+0x8/0x4e [18014389.347555] eax: c99709a8 ebx: 6b6b6b6b ecx: 0286 edx: [18014389.347563] esi: 6b6b6b6b edi: c99709cc ebp: c21e3d40 esp: c21e3d38 [18014389.347571] ds: 007b es: 007b fs: 00d8 gs: ss: 0068 [18014389.347580] Process khubd (pid: 159, ti=c21e2000 task=c20a62f0 task.ti=c21e2000) [18014389.347588] Stack: 6b6b6b6b c99709a8 c21e3d60 c029b489 c2014ec8 c9182000 c96b167c c9970954 [18014389.347655]c9970954 c99709cc c21e3d80 c029d401 c9977a6c c96b1000 c21e3d90 c9970954 [18014389.347708]c99709a8 c9164000 c21e3d90 c029d4b5 c96b1000 c9970564 c21e3db0 c029c50b [18014389.347771] Call Trace: [18014389.347792] [c029b489] input_close_device+0x13/0x51 [18014389.347810] [c029d401] mousedev_destroy+0x29/0x7e [18014389.347827] [c029d4b5] mousedev_disconnect+0x5f/0x63 [18014389.347842] [c029c50b] input_unregister_device+0x6a/0x100 [18014389.347858] [c02abf9c] hidinput_disconnect+0x24/0x41 [18014389.347874] [c02aef29] hid_disconnect+0x79/0xc9 [18014389.347889] [c028e1db] usb_unbind_interface+0x47/0x8f [18014389.347916] 
[c0256852] __device_release_driver+0x74/0x90 [18014389.347933] [c0256c5f] device_release_driver+0x37/0x4e [18014389.347957] [c02561c6] bus_remove_device+0x73/0x82 [18014389.347977] [c02547c1] device_del+0x214/0x28c [18014389.348132] [c028bb72] usb_disable_device+0x62/0xc2 [18014389.348148] [c0288893] usb_disconnect+0x99/0x126 [18014389.348163] [c0288d2c] hub_thread+0x3a5/0xb07 [18014389.348178] [c012cbe5] kthread+0x6e/0x79 [18014389.348194] [c0104917] kernel_thread_helper+0x7/0x10 [18014389.348210] === [18014389.348218] INFO: lockdep is turned off. [18014389.348224] Code: 5b 5d c3 55 b9 f0 ff ff ff 8b 50 0c 89 e5 83 ba 28 06 00 00 00 75 08 89 82 28 06 00 00 31 c9 5d 89 c8 c3 55 89 e5 56 53 8b 70 0c 39 86 28 06 00 00 75 3a 8b 9e e4 08 00 00 c7 86 28 06 00 00 00 I dunno. I'll keep plugging for another couple hours then I'll shove out what I have as a -mm snapshot whatsit. Things are just ridiculous. I'm thinking of having a hard-disk crash and accidentally losing everything. From: Andrew Morton [EMAIL PROTECTED] WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to .init.text:sc_cpu_event from .data between 'sc_cpu_notifier' (at offset 0x2110) and 'mcelog' Use hotcpu_notifier(). This takes care of making sure that the unused code
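Setting the oops aside, the watchdog proposal from earlier in the thread reduces to a simple per-cpu comparison: the watchdog thread "touches" a timestamp, and the timer tick warns when the gap exceeds the threshold. If sched_clock() counts only unstolen nanoseconds, stolen time never widens that gap. A userspace model (an illustrative assumption, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Softlockup threshold: 10 seconds in nanoseconds. */
#define SOFTLOCKUP_THRESH_NS (10ULL * 1000000000ULL)

/* Fires when the per-cpu touch timestamp is too stale.  With an
 * unstolen-time sched_clock(), 'now_ns' does not advance while the CPU
 * is stolen, so no spurious warning is produced for a guest. */
static int softlockup_fired(uint64_t now_ns, uint64_t touched_ns)
{
    return now_ns - touched_ns > SOFTLOCKUP_THRESH_NS;
}
```

This also shows why each per-processor watchdog thread must initialize its own timestamp: sched_clock() used this way is per-cpu, so comparing against another CPU's clock would be meaningless.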
How do you send a reply to an email you have deleted.
How do you send a reply to an email you have deleted? -- [EMAIL PROTECTED] -- http://www.fastmail.fm - I mean, what is it about a decent email service? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: * Gene Heskett [EMAIL PROTECTED] wrote: Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. As a data point here, and probably nothing to do with X, but I did manage to lock it up, solid, reset button time tonight, by wanting 'smart' to get done with an update session after amanda had started. I took both smart processes I could see in htop all the way to -19, but when it was about done about 3 minutes later, everything came to an instant, frozen, reset button required lockup. I should have stopped at -17 I guess. :( yeah, i guess this has little to do with X. I think in your scenario it might have been smarter to either stop, or to renice the workloads that took away CPU power from others to _positive_ nice levels. Negative nice levels can indeed be dangerous. (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.) Ingo That sounds handy, particularly with idiots like me at the wheel... -- Cheers, Gene There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author) When a Banker jumps out of a window, jump after him--that's where the money is. -- Robespierre - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [REPORT] cfs-v4 vs sd-0.44
* Gene Heskett [EMAIL PROTECTED] wrote: (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.) That sounds handy, particularly with idiots like me at the wheel... by that standard i guess we tinkerers are all idiots ;) Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
On Tue, 24 Apr 2007, Ingo Molnar wrote: * Gene Heskett [EMAIL PROTECTED] wrote: Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. As a data point here, and probably nothing to do with X, but I did manage to lock it up, solid, reset button time tonight, by wanting 'smart' to get done with an update session after amanda had started. I took both smart processes I could see in htop all the way to -19, but when it was about done about 3 minutes later, everything came to an instant, frozen, reset button required lockup. I should have stopped at -17 I guess. :( yeah, i guess this has little to do with X. I think in your scenario it might have been smarter to either stop, or to renice the workloads that took away CPU power from others to _positive_ nice levels. Negative nice levels can indeed be dangerous. (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.) if you are trying to unwedge a system it may be a good idea to renice all tasks to 0, it could be that a task at +19 is holding a lock that something else is waiting for. David Lang - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.
Vivek Goyal [EMAIL PROTECTED] writes: On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote: Currently because vmlinux does not reflect that the kernel is relocatable we still have to support CONFIG_PHYSICAL_START. So this patch adds a small C program to do what we cannot do with a linker script, set the elf header type to ET_DYN. This should remove the last obstacle to removing CONFIG_PHYSICAL_START on x86_64. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] [Dropping fastboot mailing list from CC as kexec mailing list is new list for this discussion] [..] +void file_open(const char *name) +{ + if ((fd = open(name, O_RDWR, 0)) < 0) + die("Unable to open `%s': %m", name); +} + +static void mketrel(void) +{ + unsigned char e_type[2]; + if (read(fd, e_ident, sizeof(e_ident)) != sizeof(e_ident)) + die("Cannot read ELF header: %s\n", strerror(errno)); + + if (memcmp(e_ident, ELFMAG, 4) != 0) + die("No ELF magic\n"); + + if ((e_ident[EI_CLASS] != ELFCLASS64) && (e_ident[EI_CLASS] != ELFCLASS32)) + die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]); + + if ((e_ident[EI_DATA] != ELFDATA2LSB) && (e_ident[EI_DATA] != ELFDATA2MSB)) + die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]); + + if (e_ident[EI_VERSION] != EV_CURRENT) + die("Unknown ELF version: %d\n", e_ident[EI_VERSION]); + + if (e_ident[EI_DATA] == ELFDATA2LSB) { + e_type[0] = ET_REL & 0xff; + e_type[1] = ET_REL >> 8; + } else { + e_type[1] = ET_REL & 0xff; + e_type[0] = ET_REL >> 8; + } Hi Eric, Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux as it does not find it to be executable type. Doh. It should be ET_DYN. I had relocatable much too much on the brain, and so I stuffed in the wrong type. I am not well versed with various conventions but if I go through Executable and Linking Format document, this is what it says about various file types. • A relocatable file holds code and data suitable for linking with other object files to create an executable or a shared object file. 
• An executable file holds a program suitable for execution. • A shared object file holds code and data suitable for linking in two contexts. First, the link editor may process it with other relocatable and shared object files to create another object file. Second, the dynamic linker combines it with an executable file and other shared objects to create a process image. So above does not seem to fit in the ET_REL type. We can't relink this vmlinux? And it does not seem to fit in ET_DYN definition too. We are not relinking this vmlinux with another executable or other relocatable files. I remember once you mentioned the term dynamic executable which can be loaded at a non-compiled address and let run without requiring any relocation processing. This vmlinux will fall in that category but can't relate it to standard elf file definitions. Sorry about that. ET_DYN without a PT_DYNAMIC segment, without a PT_INTERP segment, and with a valid entry point is exactly that. Loaders never perform relocation processing on a ET_DYN executable but they are allowed to shift all of the addresses by a single delta so long as all of the alignment restrictions are honored. Relocation processing when it happens comes from the dynamic linker, which is set in PT_INTERP and the dynamic linker looks a PT_DYNAMIC to figure out what relocations are available for processing. The basic issue is that ld don't really comprehend what we are doing since we are building a position independent executable in a way that the normal tools don't allow, so we have to poke the header. If we had compiled with -fPIC we could have specified -pie or --pic-executable to ld and it would have done the right thing. But as it is our executable only changes physical addresses and not virtual addresses something completely foreign to ld. 
Eric
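The fix Eric describes amounts to writing ET_DYN rather than ET_REL into the two e_type bytes that immediately follow e_ident, honoring the file's own byte order. A standalone sketch of that header fixup (not the actual patch; the function name is invented for illustration):

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>

/* Rewrite the ELF header's e_type field in place, in the byte order
 * declared by e_ident[EI_DATA].  Per the ELF spec, e_type is the
 * 2 bytes at offset EI_NIDENT (16), right after the ident bytes. */
static void set_elf_type(unsigned char *ehdr, uint16_t type)
{
    if (ehdr[EI_DATA] == ELFDATA2LSB) {
        ehdr[EI_NIDENT]     = type & 0xff;   /* little endian */
        ehdr[EI_NIDENT + 1] = type >> 8;
    } else {
        ehdr[EI_NIDENT]     = type >> 8;     /* big endian */
        ehdr[EI_NIDENT + 1] = type & 0xff;
    }
}
```

Calling `set_elf_type(ehdr, ET_DYN)` produces exactly the "position-independent executable without PT_DYNAMIC or PT_INTERP" that Eric describes: loaders may shift all addresses by one delta but never perform relocation processing on it.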
Re: [REPORT] cfs-v4 vs sd-0.44
* David Lang [EMAIL PROTECTED] wrote: (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.) if you are trying to unwedge a system it may be a good idea to renice all tasks to 0, it could be that a task at +19 is holding a lock that something else is waiting for. Yeah, that's possible too, but +19 tasks are getting a small but guaranteed share of the CPU so eventually it ought to release it. It's still a possibility, but i think i'll wait for a specific incident to happen first, and then react to that incident :-) Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar [EMAIL PROTECTED] wrote: yeah, i guess this has little to do with X. I think in your scenario it might have been smarter to either stop, or to renice the workloads that took away CPU power from others to _positive_ nice levels. Negative nice levels can indeed be dangerous. btw., was X itself at nice 0 or nice -10 when the lockup happened? Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Rogan Dawes [EMAIL PROTECTED] wrote: if (p_to && p->wait_runtime > 0) { p->wait_runtime >>= 1; p_to->wait_runtime += p->wait_runtime; } the above is the basic expression of: charge a positive bank balance. [..] [note, due to the nanoseconds unit there's no rounding loss to worry about.] Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss? yes. But note that we'll only truly have to worry about that when we'll have context-switching performance in that range - currently it's at least 2-3 orders of magnitude above that. Microseconds seemed to me to be too coarse already, that's why i picked nanoseconds and 64-bit arithmetics for CFS. Ingo
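The charge-back quoted above halves a positive wait_runtime balance with a right shift. A tiny sketch makes Ingo's point concrete: with 64-bit nanosecond units the truncation loss per division is at most 1 ns - e.g. 5 ns / 2 yields 2 ns - which is negligible next to context-switch costs that are orders of magnitude larger. (The function name here is illustrative, not CFS code.)

```c
#include <assert.h>
#include <stdint.h>

/* Halve a non-negative nanosecond balance, as in the quoted CFS
 * snippet.  For values >= 0 the shift truncates toward zero, losing at
 * most 1 ns per division. */
static int64_t halve_balance(int64_t wait_runtime_ns)
{
    return wait_runtime_ns >> 1;
}
```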
Re: [PATCH]Fix parsing kernelcore boot option for ia64
Mel-san. I tested your patch (Thanks!). It worked. But.. In my understanding, why ia64 doesn't use early_param() macro for mem= at el. is that it has to use mem= option at efi handling which is called before parse_early_param(). Current ia64's boot path is setup_arch() - efi handling - parse_early_param() - numa handling - pgdat/zone init kernelcore= option is just used at pgdat/zone initialization. (no arch dependent part...) So I think just adding == early_param(kernelcore,cmpdline_parse_kernelcore) == to ia64 is ok. Then, it can be common code. How is this patch? I confirmed this can work well too. When kernelcore boot option is specified, kernel can't boot up on ia64. It is cause of eternal loop. In addition, its code can be common code. This is fix for it. I tested this patch on my ia64 box. Signed-off-by: Yasunori Goto [EMAIL PROTECTED] - arch/i386/kernel/setup.c |1 - arch/ia64/kernel/efi.c |2 -- arch/powerpc/kernel/prom.c |1 - arch/ppc/mm/init.c |2 -- arch/x86_64/kernel/e820.c |1 - include/linux/mm.h |1 - mm/page_alloc.c|3 +++ 7 files changed, 3 insertions(+), 8 deletions(-) Index: kernelcore/arch/ia64/kernel/efi.c === --- kernelcore.orig/arch/ia64/kernel/efi.c 2007-04-24 15:09:37.0 +0900 +++ kernelcore/arch/ia64/kernel/efi.c 2007-04-24 15:25:22.0 +0900 @@ -423,8 +423,6 @@ efi_init (void) mem_limit = memparse(cp + 4, cp); } else if (memcmp(cp, max_addr=, 9) == 0) { max_addr = GRANULEROUNDDOWN(memparse(cp + 9, cp)); - } else if (memcmp(cp, kernelcore=,11) == 0) { - cmdline_parse_kernelcore(cp+11); } else if (memcmp(cp, min_addr=, 9) == 0) { min_addr = GRANULEROUNDDOWN(memparse(cp + 9, cp)); } else { Index: kernelcore/arch/i386/kernel/setup.c === --- kernelcore.orig/arch/i386/kernel/setup.c2007-04-24 15:29:20.0 +0900 +++ kernelcore/arch/i386/kernel/setup.c 2007-04-24 15:29:39.0 +0900 @@ -195,7 +195,6 @@ static int __init parse_mem(char *arg) return 0; } early_param(mem, parse_mem); -early_param(kernelcore, cmdline_parse_kernelcore); #ifdef CONFIG_PROC_VMCORE 
/* elfcorehdr= specifies the location of elf core header Index: kernelcore/arch/powerpc/kernel/prom.c === --- kernelcore.orig/arch/powerpc/kernel/prom.c 2007-04-24 15:04:47.0 +0900 +++ kernelcore/arch/powerpc/kernel/prom.c 2007-04-24 15:30:25.0 +0900 @@ -431,7 +431,6 @@ static int __init early_parse_mem(char * return 0; } early_param(mem, early_parse_mem); -early_param(kernelcore, cmdline_parse_kernelcore); /* * The device tree may be allocated below our memory limit, or inside the Index: kernelcore/arch/ppc/mm/init.c === --- kernelcore.orig/arch/ppc/mm/init.c 2007-04-24 15:04:47.0 +0900 +++ kernelcore/arch/ppc/mm/init.c 2007-04-24 15:30:56.0 +0900 @@ -214,8 +214,6 @@ void MMU_setup(void) } } -early_param(kernelcore, cmdline_parse_kernelcore); - /* * MMU_init sets up the basic memory mappings for the kernel, * including both RAM and possibly some I/O regions, Index: kernelcore/arch/x86_64/kernel/e820.c === --- kernelcore.orig/arch/x86_64/kernel/e820.c 2007-04-24 15:04:47.0 +0900 +++ kernelcore/arch/x86_64/kernel/e820.c2007-04-24 15:34:02.0 +0900 @@ -604,7 +604,6 @@ static int __init parse_memopt(char *p) return 0; } early_param(mem, parse_memopt); -early_param(kernelcore, cmdline_parse_kernelcore); static int userdef __initdata; Index: kernelcore/include/linux/mm.h === --- kernelcore.orig/include/linux/mm.h 2007-04-24 15:09:37.0 +0900 +++ kernelcore/include/linux/mm.h 2007-04-24 15:35:52.0 +0900 @@ -1051,7 +1051,6 @@ extern unsigned long find_max_pfn_with_a extern void free_bootmem_with_active_regions(int nid, unsigned long max_low_pfn); extern void sparse_memory_present_with_active_regions(int nid); -extern int cmdline_parse_kernelcore(char *p); #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID extern int early_pfn_to_nid(unsigned long pfn); #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ Index: kernelcore/mm/page_alloc.c === --- kernelcore.orig/mm/page_alloc.c 2007-04-24 15:09:37.0 +0900 +++ kernelcore/mm/page_alloc.c 2007-04-24 16:00:21.0 +0900 @@
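The patch above replaces each arch's hand-rolled string matching (as in ia64's efi_init()) with the single common early_param() hook in mm/page_alloc.c. A userspace model of that kind of table-driven "name=value" dispatch - names and the table walk are illustrative assumptions, not the kernel's implementation:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the value cmdline_parse_kernelcore would record. */
static unsigned long kernelcore_pages;

static int parse_kernelcore(const char *val)
{
    kernelcore_pages = strtoul(val, NULL, 10);
    return 0;
}

/* Model of an early_param() registration table: each entry pairs a
 * parameter name with its setup handler. */
struct early_param_entry {
    const char *name;
    int (*setup)(const char *val);
};

static const struct early_param_entry early_params[] = {
    { "kernelcore", parse_kernelcore },
};

/* Model of parse_early_param(): match "name=value" and dispatch. */
static void parse_early_args(const char *arg)
{
    size_t i, n;
    for (i = 0; i < sizeof(early_params) / sizeof(early_params[0]); i++) {
        n = strlen(early_params[i].name);
        if (strncmp(arg, early_params[i].name, n) == 0 && arg[n] == '=')
            early_params[i].setup(arg + n + 1);
    }
}
```

The design point of the patch is exactly this centralization: because kernelcore= is only consumed at pgdat/zone initialization, after parse_early_param() runs, no arch needs its own matching code.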
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar [EMAIL PROTECTED] wrote: [...] That way you'd only have had to hit SysRq-N to get the system out of the wedge.) small correction: Alt-SysRq-N. Ingo
[PATCH] i802.11: fixed memory leak on multicasts
Hi, socket buffers were not always freed when receiving multicasts Bye, -- Markus Pietrek Lead Software Engineer Phone: +49-7667-908-501, Fax: +49-7667-908-200 mailto:[EMAIL PROTECTED] FS Forth-Systeme GmbH A Digi International Company Kueferstr. 8, 79206 Breisach, Germany Tax: 07008/12000 / VAT: DE142208834 / Reg. Amtsgericht Freiburg HRB 290212 Directors: Klaus Flesch, Subramanian Krishnan, Dieter Vesper http://www.digi.com Index: net/ieee80211/ieee80211_rx.c === RCS file: /data/vcs/cvs/fsforth_products/LxNETES/linux/net/ieee80211/ieee80211_rx.c,v retrieving revision 1.5 retrieving revision 1.6 diff -c -r1.5 -r1.6 *** net/ieee80211/ieee80211_rx.c13 Apr 2007 12:39:38 - 1.5 --- net/ieee80211/ieee80211_rx.c23 Apr 2007 15:51:28 - 1.6 *** *** 860,868 break; } ! if (is_packet_for_us) if (!ieee80211_rx(ieee, skb, stats)) dev_kfree_skb_irq(skb); return; drop_free: --- 860,871 break; } ! if (is_packet_for_us) { if (!ieee80211_rx(ieee, skb, stats)) dev_kfree_skb_irq(skb); + } else + dev_kfree_skb_irq(skb); + return; drop_free:
cfs works fine for me
Hello, I have tried the cfs patches with 2.6.20.7 in the last days. I am using KDE 3.5.6, gentoo unstable and have a dual core AMD64 system with 1GB ram and a nvidia card (using the closed source drivers, yes I suck, but I love playing 3d games once in a while). I don't have interactivity problems with plain kernel.org kernels (except when swapping a lot, swapping really sucks) My system works well and is stable. With the cfs patches, my system continues to work well. I have not seen any regressions, desktop is snappy, emerge'ing stuff (niced to +19), does not hurt and unreal tournament 2004 is as fast (or slow, depends on the situation) as always. It even looks like FPS under heavy stress (like onslaught torlan when lots of bots and me are fighting at a powernode), don't go down as low as with the mainline scheduler. Not a big difference, but it is there (20-25 with plain kernel.org kernel in extrem situations compared to 30 with the cfs patches). Maybe I did not hit the worst case, playing is a little bit restricted at the moment - my wrist and ellbow hate me, but it looks promising. Apart from the worst case scenrios, FPS are more or less the same. My usage consisted of surfing the web with konqueror, watching videos with xine and mplayer, using kmail (with tens of thousands of mails in different folders), looking at pictures with kuickshow, installing XFCE, asorted updates, typing lots and lots of stuff in kate and web forums, listening to mp3/ogg with amarok, playing pysol/kpat/lgeneral/wesnoth/ut2004/freecol, a lot of that parallel (not ut2004... I don't want to hurt my precious fps...). Again, my system worked fine with the 'normal' scheduler, from the stuff I read in the lkml archives I must be some special kind of guy, so there was no improvement on the 'feels snappy or not' front, but there are also no regressions. So from my point of view, everything is fine with cfs and I would not mind having it as default scheduler. 
If you want specs of my hardware, my kernel config or any other information, just send me an email. I am not subscribed to lkml, nor can I read any of its archives in the next couple of days, which is one reason why I don't answer to one of the existing threads (I don't even know if there are some at the moment), so in case of an answer cc'ing me would be nice. Glück Auf Volker
[REPORT] cfs-v5 vs sd-0.46
Hi list, with cfs-v5 finally booting on my machine I have run my daily numbercrunching jobs on both cfs-v5 and sd-0.46, 2.6.21-v7 on top of a stock openSUSE 10.2 (X86_64). Config for both kernel is the same except for the X boost option in cfs-v5 which on my system didn't work (X still was @ -19; I understand this will be fixed in -v6). HZ is 250 in both. System is a Dell XPS M1710, Intel Core2 2.33GHz, 4GB, NVIDIA GeForce Go 7950 GTX with proprietary driver 1.0-9755 I'm running three single threaded perl scripts that do double precision floating point math with little i/o after initially loading the data. Both cfs and sd showed very similar behavior when monitored in top. I'll show more or less representative excerpt from a 10 minutes log, delay 3sec. sd-0.46 top - 00:14:24 up 1:17, 9 users, load average: 4.79, 4.95, 4.80 Tasks: 3 total, 3 running, 0 sleeping, 0 stopped, 0 zombie Cpu(s): 99.8%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.2%hi, 0.0%si, 0.0%st Mem: 3348628k total, 1648560k used, 1700068k free,64392k buffers Swap: 2097144k total,0k used, 2097144k free, 828204k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 6671 mgd 33 0 95508 22m 3652 R 100 0.7 44:28.11 perl 6669 mgd 31 0 95176 22m 3652 R 50 0.7 43:50.02 perl 6674 mgd 31 0 95368 22m 3652 R 50 0.7 47:55.29 perl cfs-v5 top - 08:07:50 up 21 min, 9 users, load average: 4.13, 4.16, 3.23 Tasks: 3 total, 3 running, 0 sleeping, 0 stopped, 0 zombie Cpu(s): 99.5%us, 0.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Mem: 3348624k total, 1193500k used, 2155124k free,32516k buffers Swap: 2097144k total,0k used, 2097144k free, 545568k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 6357 mgd 20 0 92024 19m 3652 R 100 0.6 8:54.21 perl 6356 mgd 20 0 91652 18m 3652 R 50 0.6 10:35.52 perl 6359 mgd 20 0 91700 18m 3652 R 50 0.6 8:47.32 perl What did surprise me is that cpu utilization had been spread 100/50/50 (round robin) most of the time. I did expect 66/66/66 or so. 
What I also don't understand is the difference in load average, sd constantly had higher values, the above figures are representative for the whole log. I don't know which is better though. Here are excerpts from a concurrently run vmstat 3 200: sd-0.46 procs ---memory-- ---swap-- -io -system-- cpu r b swpd free buff cache si sobibo in cs us sy id wa 5 0 0 1702928 63664 82787600 067 458 1350 100 0 0 0 3 0 0 1702928 63684 82787600 089 468 1362 100 0 0 0 5 0 0 1702680 63696 82787600 0 132 461 1598 99 1 0 0 8 0 0 1702680 63712 82789200 080 465 1180 99 1 0 0 3 0 0 1702712 63732 82788400 067 453 1005 100 0 0 0 4 0 0 1702792 63744 82792000 041 461 1138 100 0 0 0 3 0 0 1702792 63760 82791600 057 456 1073 100 0 0 0 3 0 0 1702808 63776 82792800 0 111 473 1095 100 0 0 0 3 0 0 1702808 63788 82792800 081 461 1092 99 1 0 0 3 0 0 1702188 63808 82792800 0 160 463 1437 99 1 0 0 3 0 0 1702064 63884 82790000 0 229 479 1125 99 0 0 0 4 0 0 1702064 63912 82797200 177 460 1108 100 0 0 0 7 0 0 1702032 63920 82800000 040 463 1068 100 0 0 0 4 0 0 1702048 63928 82800800 068 454 1114 100 0 0 0 11 0 0 1702048 63928 82800800 0 0 458 1001 100 0 0 0 3 0 0 1701500 63960 82802000 0
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
On Tue, 24 Apr 2007 15:00:42 +1000, Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: Like anything else, modules should have separated the entrypoints for - Initiating a removal request - Releasing the module The former is "user did rmmod", can unregister things from subsystems, etc... (and can fail if the driver decides to refuse removal requests when it's busy doing things or whatever policy that module wants to implement). The latter is called when all references to the module have been dropped, it's a bit like the kref release (and could be implemented as one). That sounds quite similar to the problems we have with kobject refcounting vs. module unloading. The patchset I posted at http://marc.info/?l=linux-kernel&m=117679014404994&w=2 exposes the refcount of the kobject embedded in the module. Maybe the kthread code could use that reference as well?
Re: NonExecutable Bit in 32Bit
On 4/24/07, William Heimbigner [EMAIL PROTECTED] wrote: On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote: Hey, is it right, that the NX Bit is not used under i386-Arch but under x86_64-Arch? When yes, is there a special argument for it not to be used? Ciao Thilo I don't think so - some i386 cpus definitely have support for the NX bit. In detail: 1) if your CPU has NX support (some 32bit Xeons do) 2) it is not disabled in the BIOS 3) you see 'nx' in the 'flags' line in /proc/cpuinfo 4) and you have a kernel with the following config options CONFIG_HIGHMEM64G=y CONFIG_HIGHMEM=y CONFIG_X86_PAE=y NX should just work. [snip]
Re: [ofa-general] [PATCH] eHCA: Add Modify Port verb
Hi Hal, you are correct, with the current firmware version it will fail later. Christoph R. [EMAIL PROTECTED] wrote on 23.04.2007 18:55:59: Hi Joachim, On Mon, 2007-04-23 at 12:23, Joachim Fenkes wrote: Add Modify Port verb support to eHCA driver. ib_cm needs this to initialize properly. I didn't think IB_PORT_SM was allowed (as QP0 is not exposed) or does this just fail later when it is attempted to be actually set ? -- Hal
Re: [PATCH 0/9] Kconfig: cleanup s390 v2.
On Mon, 2007-04-23 at 10:45 -0700, Andrew Morton wrote: Andrew: I plan to add patches 1-5 to the for-andrew branch of the git390 repository if that is fine with you. The only thing that will be missing in the tree is the patch that disables wireless for s390. The code does compile but without hardware it is mute to have the config options. I'll wait until the git-wireless.patch is upstream. Patches 7-9 depend on patches found in -mm. umm, OK. If it's Ok I think I'll duck it for now: -mm is full. Over-full, really: I've been working basically continuously since Friday getting the current dungpile to compile and boot, and it's still miles away from that. I understand. I'll wait until -mm is a little bit smaller again. It is just that someday I want to finish with the Kconfig cleanup, it has been sitting on my harddriver for ages now. -- blue skies, IBM Deutschland Entwicklung GmbH MartinVorsitzender des Aufsichtsrats: Johann Weihen Geschäftsführung: Herbert Kircher Martin Schwidefsky Sitz der Gesellschaft: Böblingen Linux on zSeries Registergericht: Amtsgericht Stuttgart, Development HRB 243294 Reality continues to ruin my life. - Calvin. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [REPORT] cfs-v5 vs sd-0.46
* Michael Gerdau [EMAIL PROTECTED] wrote:

I'm running three single threaded perl scripts that do double precision floating point math with little i/o after initially loading the data.

thanks for the testing!

What I also don't understand is the difference in load average, sd constantly had higher values, the above figures are representative for the whole log. I don't know which is better though.

hm, it's hard to tell that from here. What load average does the vanilla kernel report? I'd take that as a reference.

Here are excerpts from a concurrently run 'vmstat 3 200':

sd-0.46
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 5  0      0 1702928  63664 827876    0    0     0    67  458  1350 100  0  0  0
 3  0      0 1702928  63684 827876    0    0     0    89  468  1362 100  0  0  0
 5  0      0 1702680  63696 827876    0    0     0   132  461  1598  99  1  0  0
 8  0      0 1702680  63712 827892    0    0     0    80  465  1180  99  1  0  0

cfs-v5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 6  0      0 2157728  31816 545236    0    0     0   103  543   748 100  0  0  0
 4  0      0 2157780  31828 545256    0    0     0    63  435   752 100  0  0  0
 4  0      0 2157928  31852 545256    0    0     0   105  424   770 100  0  0  0
 4  0      0 2157928  31868 545268    0    0     0   261  457   763 100  0  0  0

interesting - CFS has half the context-switch rate of SD. That is probably because on your workload CFS defaults to longer 'timeslices' than SD. You can influence the 'timeslice length' under SD via /proc/sys/kernel/rr_interval (millisecond units) and under CFS via /proc/sys/kernel/sched_granularity_ns. On CFS the value is not necessarily the timeslice length you will observe - for example in your workload above the granularity is set to 5 msec, but your rescheduling rate is 13 msecs. SD defaults to an rr_interval value of 8 msecs, which in your workload produces a timeslice length of 6-7 msecs. So to be totally 'fair' and get the same rescheduling 'granularity' you should probably lower CFS's sched_granularity_ns to 2 msecs.
Last but not least I'd like to add that, at least on my system, having X niced to -19 does result in kind of erratic (for lack of a better word) desktop behavior. I will reevaluate this with -v6, but for now IMO nicing X to -19 is a regression, at least on my machine, despite the claim that cfs doesn't suffer from it.

indeed, with -19 the rescheduling limit is so high under CFS that it does not throttle X's scheduling rate enough, and so it will make CFS behave as badly as other schedulers. I retested this with -10 and it should work better with that. In -v6 i changed the default to -10 too.

PS: Only learning how to test these things, I'm happy to have the shortcomings of what I tested above pointed out. Of course suggestions for improvements are welcome.

your report was perfectly fine and useful. 'no visible regressions' is valuable feedback too. [ In fact, such type of feedback is the one i find the easiest to resolve ;-) ] Since you are running number-crunchers you might be able to give performance feedback too: do you have any reliable 'performance metric' available for your number cruncher jobs (ops per minute, runtime, etc.) so that it would be possible to compare number-crunching performance of mainline to SD and to CFS as well? If that value is easy to get and reliable/stable enough to be meaningful. (And it would be nice to also establish some ballpark figure of how much noise there is in any performance metric, so that we can see whether any differences between schedulers are systematic or not.)

Ingo
cpufreq default governor
Question: is there some reason that kconfig does not allow for default governors of conservative/ondemand/powersave? I'm not aware of any reason why one of those governors could not be used as default. William Heimbigner [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
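As an illustration of what the question is asking for, a Kconfig choice block could express a selectable default governor. This is only a sketch: the option names below (CPU_FREQ_DEFAULT_GOV_*) are modeled on what such an interface might look like, not on what was in the tree at the time of this mail.

```kconfig
# Hypothetical sketch of a "default governor" choice for drivers/cpufreq/Kconfig.
# Names are illustrative assumptions, not the actual symbols of this era.
choice
	prompt "Default CPUFreq governor"
	default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
	help
	  Selects the governor that cpufreq uses at boot.

config CPU_FREQ_DEFAULT_GOV_PERFORMANCE
	bool "performance"
	select CPU_FREQ_GOV_PERFORMANCE

config CPU_FREQ_DEFAULT_GOV_ONDEMAND
	bool "ondemand"
	select CPU_FREQ_GOV_ONDEMAND

config CPU_FREQ_DEFAULT_GOV_POWERSAVE
	bool "powersave"
	select CPU_FREQ_GOV_POWERSAVE
endchoice
```

Each default-governor option simply selects the corresponding governor so the chosen default is always built in.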
Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523
On Tue, 24 Apr 2007, Herbert Xu wrote:

Hmm, *sigh*. I guess the patch below fixes the problem, but it is a masterpiece in the field of ugliness. And I am not sure whether it is completely correct either. Are there any immediate ideas for a better solution with respect to how struct sock locking works?

Please cc such patches to netdev. Thanks.

Hi Herbert, well it's pretty much bluetooth-specific, and bluez-devel was CCed, but OK.

diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 71f5cfb..c5c93cd 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block *this, unsigned long event,
 	/* Detach sockets from device */
 	read_lock(&hci_sk_list.lock);
 	sk_for_each(sk, node, &hci_sk_list.head) {
-		lock_sock(sk);
+		if (in_atomic())
+			bh_lock_sock(sk);
+		else
+			lock_sock(sk);

This doesn't do what you think it does. bh_lock_sock can still succeed even with lock_sock held by someone else.

I know; this was precisely the reason why I converted the bh_lock_sock() to lock_sock() here some time ago (as it was racy with l2cap_connect_cfm()).

Does this need to occur immediately when an event occurs? If not I'd suggest moving this into a workqueue.

I will have to check whether this will be processed properly in time when going to suspend. Thanks,

-- Jiri Kosina
Re: [patch 1/7] libata: check for AN support
Hello,

Kristen Carlson Accardi wrote:

 static unsigned int ata_print_id = 1;
@@ -1744,6 +1745,23 @@ int ata_dev_configure(struct ata_device
 		}
 		dev->cdb_len = (unsigned int) rc;

+		/*
+		 * check to see if this ATAPI device supports
+		 * Asynchronous Notification
+		 */
+		if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
+		{
+			/* issue SET feature command to turn this on */
+			rc = ata_dev_set_AN(dev);

Please don't store err_mask into int rc. Please store it in a separate err_mask variable and report it when printing the error message.

+			if (rc) {
+				ata_dev_printk(dev, KERN_ERR,
+					"unable to set AN\n");
+				rc = -EINVAL;

Wouldn't -EIO be more appropriate?

+				goto err_out_nosup;
+			}
+			dev->flags |= ATA_DFLAG_AN;
+		}
+

Not NACKing. Just notes for future improvements. We need to be more careful here. The ATA/ATAPI world is filled with braindamaged devices and I bet there are devices which advertise that they can do AN but choke when AN is enabled. This should be handled similarly to ACPI failure. Currently ACPI does the following.

1. try once; if it fails, record that ACPI failed and return an error to trigger a retry.
2. try again; if it fails again, ignore the error if possible (!FROZEN) and turn off ACPI.

This fallback mechanism for optional features can probably be generalized and used for both ACPI and AN.

-- tejun
Re: [patch 1/7] libata: check for AN support
+	/*
+	 * check to see if this ATAPI device supports
+	 * Asynchronous Notification
+	 */
+	if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id))
+	{

Bracketing police ^^^

+		/* issue SET feature command to turn this on */
+		rc = ata_dev_set_AN(dev);
+		if (rc) {
+			ata_dev_printk(dev, KERN_ERR,
+				"unable to set AN\n");
+			rc = -EINVAL;
+			goto err_out_nosup;

How fatal is this - do we need to ignore the device at this point, or should we just pretend (possibly correctly) that the device itself does not support notification?

@@ -299,6 +305,8 @@ struct ata_taskfile {
 #define ata_id_queue_depth(id)	(((id)[75] & 0x1f) + 1)
 #define ata_id_removeable(id)	((id)[0] & (1 << 7))
 #define ata_id_has_dword_io(id)	((id)[50] & (1 << 0))
+#define ata_id_has_AN(id)	\
+	((id[76] & (~id[76])) && ((id)[78] & (1 << 5)))

Might be nice to check the ATA version as well, to be paranoid, but this all looks OK as it's been a reserved field since way back when.
Re: [patch 2/7] genhd: expose AN to user space
Kristen Carlson Accardi wrote:

+static struct disk_attribute disk_attr_capability = {
+	.attr = {.name = "capability_flags", .mode = S_IRUGO },
+	.show = disk_capability_read
+};

How about just "capability"? I think that would be more consistent with other attributes.

-- tejun
Re: [patch 7/7] libata: send event when AN received
+	/* check the 'N' bit in word 0 of the FIS */
+	if (f[0] & (1 << 15)) {
+		int port_addr = ((f[0] & 0x0f00) >> 8);
+		struct ata_device *adev = ap->device[port_addr];

You can't be sure that the port_addr returned will be in range if a device is malfunctioning...
Re: [mmc] alternative TI FM MMC/SD driver for 2.6.21-rc7
Hi,

If you add support for, let's say, [tifm_8xx2] in the future, which would have port offsets different from [tifm_7xx1], you would also need completely new modules for the slots (sd, ms, etc).

Doesn't this constitute unbounded speculation? Only time will tell :)

And then, what would you propose to do with adapters that have SD support disabled? There are quite a few of those in the wild as of right now (SD support is provided by a bundled SDHCI on such systems, if at all). A similar argument goes for other media types as well - many controllers have xD support disabled too (I think you have one of those - Sony really values its customers).

After all, it is not healthy to have dead code in the kernel.

A typical kernel config is an allmodconfig, which has tons of dead code: just look at the 'General setup' part of your distro '.config'. There are items like 'SMP' selected by default for 686+ CPUs. And this is far more overhead than a single check of the card type on insert. To allow customization, boolean module options that disable certain card types may suffice.

And again, you are doing great work with the driver.

-- Sergey Yanovich
[PATCH -mm take2] 64bit-futex - provide new commands instead of new syscall
Ulrich Drepper a écrit : It looks mostly good. I wouldn't use the high bit to differentiate the 64-bit operations, though. Since we do not allow to apply it to all operations the only effect will be that the compiler has a harder time generating the code for the switch statement. If you use continuous values a simple jump table can be used and no conditionals. Smaller and faster. Something like that may be... Signed-off-by: Pierre Peiffer [EMAIL PROTECTED] -- Pierre --- include/asm-ia64/futex.h|8 - include/asm-powerpc/futex.h |6 - include/asm-s390/futex.h|8 - include/asm-sparc64/futex.h |8 - include/asm-um/futex.h |9 - include/asm-x86_64/futex.h | 86 -- include/asm-x86_64/unistd.h |2 include/linux/futex.h |6 + include/linux/syscalls.h|3 kernel/futex.c | 203 ++-- kernel/futex_compat.c |2 kernel/sys_ni.c |1 12 files changed, 95 insertions(+), 247 deletions(-) Index: b/include/asm-ia64/futex.h === --- a/include/asm-ia64/futex.h +++ b/include/asm-ia64/futex.h @@ -124,13 +124,7 @@ futex_atomic_cmpxchg_inatomic(int __user static inline u64 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval) { - return 0; -} - -static inline int -futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr) -{ - return 0; + return -ENOSYS; } #endif /* _ASM_FUTEX_H */ Index: b/include/asm-powerpc/futex.h === --- a/include/asm-powerpc/futex.h +++ b/include/asm-powerpc/futex.h @@ -119,11 +119,5 @@ futex_atomic_cmpxchg_inatomic64(u64 __us return 0; } -static inline int -futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr) -{ - return 0; -} - #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_FUTEX_H */ Index: b/include/asm-s390/futex.h === --- a/include/asm-s390/futex.h +++ b/include/asm-s390/futex.h @@ -51,13 +51,7 @@ static inline int futex_atomic_cmpxchg_i static inline u64 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval) { - return 0; -} - -static inline int -futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr) -{ - return 0; + 
return -ENOSYS; } #endif /* __KERNEL__ */ Index: b/include/asm-sparc64/futex.h === --- a/include/asm-sparc64/futex.h +++ b/include/asm-sparc64/futex.h @@ -108,13 +108,7 @@ futex_atomic_cmpxchg_inatomic(int __user static inline u64 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval) { - return 0; -} - -static inline int -futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr) -{ - return 0; + return -ENOSYS; } #endif /* !(_SPARC64_FUTEX_H) */ Index: b/include/asm-um/futex.h === --- a/include/asm-um/futex.h +++ b/include/asm-um/futex.h @@ -6,14 +6,7 @@ static inline u64 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval) { - return 0; + return -ENOSYS; } -static inline int -futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr) -{ - return 0; -} - - #endif Index: b/include/asm-x86_64/futex.h === --- a/include/asm-x86_64/futex.h +++ b/include/asm-x86_64/futex.h @@ -41,38 +41,6 @@ =r (tem) \ : r (oparg), i (-EFAULT), m (*uaddr), 1 (0)) -#define __futex_atomic_op1_64(insn, ret, oldval, uaddr, oparg) \ - __asm__ __volatile ( \ -1: insn \n \ -2: .section .fixup,\ax\\n\ -3: movq %3, %1\n\ - jmp 2b\n\ - .previous\n\ - .section __ex_table,\a\\n\ - .align 8\n\ - .quad 1b,3b\n\ - .previous \ - : =r (oldval), =r (ret), =m (*uaddr) \ - : i (-EFAULT), m (*uaddr), 0 (oparg), 1 (0)) - -#define __futex_atomic_op2_64(insn, ret, oldval, uaddr, oparg) \ - __asm__ __volatile ( \ -1: movq %2, %0\n\ - movq %0, %3\n \ - insn \n \ -2: LOCK_PREFIX cmpxchgq %3, %2\n\ - jnz 1b\n\ -3: .section .fixup,\ax\\n\ -4: movq %5, %1\n\ - jmp 3b\n\ - .previous\n\ - .section __ex_table,\a\\n\ - .align 8\n\ - .quad 1b,4b,2b,4b\n\ - .previous \ - : =a (oldval), =r (ret), =m (*uaddr), \ - =r (tem) \ - : r (oparg), i (-EFAULT), m (*uaddr), 1 (0)) static inline int futex_atomic_op_inuser (int encoded_op, int __user *uaddr) @@ -128,60 +96,6 @@ futex_atomic_op_inuser (int encoded_op, } static inline int -futex_atomic_op_inuser64 (int encoded_op, u64 __user 
*uaddr)
-{
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	u64 oparg = (encoded_op << 8) >> 20;
-	u64 cmparg = (encoded_op << 20) >> 20;
-	u64 oldval = 0, ret, tem;
-
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u64)))
-		return -EFAULT;
-
-
Re: 2.6.21-rc6-mm1
On Sun, 8 Apr 2007 14:35:59 -0700, Andrew Morton [EMAIL PROTECTED] wrote:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/ - Lots of x86 updates

Has something related to PTYs changed in this kernel? I have to enable legacy PTY handling on a couple of boxes to get ssh working. Without it, I had openpty() errors and neither sshd nor virtual terminals (aterm) were able to get a terminal. User space (udev) is the same on all three boxes, yet one works and two fail. I had /dev/ptmx everywhere and /dev/pts mounted. Any ideas? TIA

-- J.A. Magallon jamagallon()ono!com \ Software is like sex: \ It's better when it's free Mandriva Linux release 2008.0 (Cooker) for i586 Linux 2.6.20-jam10 (gcc 4.1.2 20070302 (prerelease) (4.1.2-1mdv2007.1)) #1 SMP PREEMPT
Re: [PATCH] mm: PageLRU can be non-atomic bit operation
At 11:47 07/04/24, Nick Piggin wrote:

As Hugh points out, we must have atomic ops here, so changing the generic code to use the __ version is wrong. However, if there is a faster way that i386 can perform the atomic variant, then doing so will speed up the generic code without breaking other architectures.

Do you mean writing an i386-specific page-flags.h, so that the generic code is improved without breaking other architectures?
Re: [RFC][PATCH -mm take4 2/6] support multiple logging
On Fri, 20 Apr 2007 18:51:13 +0900 Keiichi KII [EMAIL PROTECTED] wrote:

I started to do some cleanups and fixups here, but abandoned it when it was all getting a bit large. Here are some fixes against this patch:

I'm going to fix my patches following your reviews and send new patches to the LKML and the netdev ML in a few days.

Well.. before you can finish this work we need to decide upon what the interface to userspace will be. - The miscdev isn't appropriate

Why isn't a miscdev appropriate? Is it just that, by convention, miscdev shouldn't be used for networking?

-- Keiichi KII NEC Corporation OSS Promotion Center E-mail: [EMAIL PROTECTED]
Re: [RFC][PATCH -mm take4 2/6] support multiple logging
We don't really have anything that corresponds to netpoll's connections at higher levels. I'm tempted to say we should make this work more like the dummy network device, ie:

modprobe netconsole -o netcon1 [params]
modprobe netconsole -o netcon2 [params]

The configuration of netconsoles looks like the configuration of routes. Granted, you probably have more routes than netconsoles, but the interface issues are similar. Netlink with a small application would be nice. And having /proc/net/netconsole (read-only) would be good for the netlink-impaired.

Do you mean we should use procfs instead of sysfs to show the configuration of netconsole? If so, I have a question: I thought procfs was supposed to be used for process-related things as far as possible. Is it really no problem to use procfs here?

-- Keiichi KII NEC Corporation OSS Promotion Center E-mail: [EMAIL PROTECTED]
[PATCH 0/15] CFQ IO scheduler patch series
Hi,

I have a series of patches for the CFQ IO scheduler that I'd like to get some more testing on. The patch series is also scheduled to enter the next -mm, but I'd like people to consciously give it a spin on its own as well. The patches are also available from the 'cfq' branch of the block layer tree:

git://git.kernel.dk/data/git/linux-2.6-block.git

and I've uploaded a rolled up version here as well:

http://brick.kernel.dk/snaps/cfq-update-20070424

The patch series is essentially a series of cleanups and smaller optimizations, but there's also a larger change in there (patches 4 to 7) that completely reworks how CFQ selects which queue to process. It's an experimental approach similar to the CFS CPU scheduler, in which management lists are converted to a single rbtree instead.

So give it a spin if you have the time, and let me know how it performs and/or feels for your workload and hardware.

 cfq-iosched.c | 676 ++
 1 file changed, 357 insertions(+), 319 deletions(-)

-- Jens Axboe
[PATCH 1/15] cfq-iosched: improve preemption for cooperating tasks
When testing the syslet async io approach, I discovered that CFQ sometimes didn't perform as well as expected. cfq_should_preempt() needs to better check for cooperating tasks, so fix that by allowing preemption of an equal priority queue if the recently queued request is as good a candidate for IO as the one we are currently waiting for.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c | 26 --
 1 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9e37971..a683d00 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -861,15 +861,11 @@ static int cfq_arm_slice_timer(struct cfq_data *cfqd)
 
 static void cfq_dispatch_insert(request_queue_t *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
 	cfq_remove_request(rq);
 	cfqq->on_dispatch[rq_is_sync(rq)]++;
 	elv_dispatch_sort(q, rq);
-
-	rq = list_entry(q->queue_head.prev, struct request, queuelist);
-	cfqd->last_sector = rq->sector + rq->nr_sectors;
 }
 
 /*
@@ -1579,6 +1575,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 		   struct request *rq)
 {
 	struct cfq_queue *cfqq = cfqd->active_queue;
+	sector_t dist;
 
 	if (cfq_class_idle(new_cfqq))
 		return 0;
@@ -1588,14 +1585,14 @@
 	if (cfq_class_idle(cfqq))
 		return 1;
 
-	if (!cfq_cfqq_wait_request(new_cfqq))
-		return 0;
+
 	/*
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
 	 */
 	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
 		return 1;
+
 	/*
 	 * So both queues are sync. Let the new request get disk time if
 	 * it's a metadata request and the current queue is doing regular IO.
@@ -1603,6 +1600,21 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
+	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+		return 0;
+
+	/*
+	 * if this request is as-good as one we would expect from the
+	 * current cfqq, let it preempt
+	 */
+	if (rq->sector > cfqd->last_sector)
+		dist = rq->sector - cfqd->last_sector;
+	else
+		dist = cfqd->last_sector - rq->sector;
+
+	if (dist <= cfqd->active_cic->seek_mean)
+		return 1;
+
 	return 0;
 }
 
@@ -1719,6 +1731,8 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq)
 	cfqq->on_dispatch[sync]--;
 	cfqq->service_last = now;
 
+	cfqd->last_sector = rq->hard_sector + rq->hard_nr_sectors;
+
 	if (!cfq_class_idle(cfqq))
 		cfqd->last_end_request = now;
-- 
1.5.1.1.190.g74474
[PATCH 2/15] cfq-iosched: development update
- Implement logic for detecting cooperating processes, so we choose the best available queue whenever possible. - Improve residual slice time accounting. - Remove dead code: we no longer see async requests coming in on sync queues. That part was removed a long time ago. That means that we can also remove the difference between cfq_cfqq_sync() and cfq_cfqq_class_sync(), they are now indentical. And we can kill the on_dispatch array, just make it a counter. - Allow a process to go into the current list, if it hasn't been serviced in this scheduler tick yet. Possible future improvements including caching the cfqq lookup in cfq_close_cooperator(), so we don't have to look it up twice. cfq_get_best_queue() should just use that last decision instead of doing it again. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 381 +++ 1 files changed, 261 insertions(+), 120 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index a683d00..3883ba8 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -56,13 +56,7 @@ static struct completion *ioc_gone; #define ASYNC (0) #define SYNC (1) -#define cfq_cfqq_dispatched(cfqq) \ - ((cfqq)-on_dispatch[ASYNC] + (cfqq)-on_dispatch[SYNC]) - -#define cfq_cfqq_class_sync(cfqq) ((cfqq)-key != CFQ_KEY_ASYNC) - -#define cfq_cfqq_sync(cfqq)\ - (cfq_cfqq_class_sync(cfqq) || (cfqq)-on_dispatch[SYNC]) +#define cfq_cfqq_sync(cfqq)((cfqq)-key != CFQ_KEY_ASYNC) #define sample_valid(samples) ((samples) 80) @@ -79,6 +73,7 @@ struct cfq_data { struct list_head busy_rr; struct list_head cur_rr; struct list_head idle_rr; + unsigned long cur_rr_tick; unsigned int busy_queues; /* @@ -98,11 +93,12 @@ struct cfq_data { struct cfq_queue *active_queue; struct cfq_io_context *active_cic; int cur_prio, cur_end_prio; + unsigned long prio_time; unsigned int dispatch_slice; struct timer_list idle_class_timer; - sector_t last_sector; + sector_t last_position; unsigned long last_end_request; /* @@ -117,6 +113,9 @@ struct 
cfq_data { unsigned int cfq_slice_idle; struct list_head cic_list; + + sector_t new_seek_mean; + u64 new_seek_total; }; /* @@ -133,6 +132,8 @@ struct cfq_queue { unsigned int key; /* member of the rr/busy/cur/idle cfqd list */ struct list_head cfq_list; + /* in what tick we were last serviced */ + unsigned long rr_tick; /* sorted list of pending requests */ struct rb_root sort_list; /* if fifo isn't expired, next request to serve */ @@ -148,10 +149,11 @@ struct cfq_queue { unsigned long slice_end; unsigned long service_last; + unsigned long slice_start; long slice_resid; - /* number of requests that are on the dispatch list */ - int on_dispatch[2]; + /* number of requests that are on the dispatch list or inside driver */ + int dispatched; /* io prio of this group */ unsigned short ioprio, org_ioprio; @@ -159,6 +161,8 @@ struct cfq_queue { /* various state flags, see below */ unsigned int flags; + + sector_t last_request_pos; }; enum cfqq_state_flags { @@ -259,6 +263,8 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq) * easily introduce oscillations. */ cfqq-slice_resid = 0; + + cfqq-slice_start = jiffies; } /* @@ -307,7 +313,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2) s1 = rq1-sector; s2 = rq2-sector; - last = cfqd-last_sector; + last = cfqd-last_position; /* * by definition, 1KiB is 2 sectors @@ -398,39 +404,42 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq, return cfq_choose_req(cfqd, next, prev); } -static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted) +/* + * This function finds out where to insert a BE queue in the service hierarchy + */ +static void cfq_resort_be_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq, + int preempted) { - struct cfq_data *cfqd = cfqq-cfqd; struct list_head *list, *n; struct cfq_queue *__cfqq; + int add_tail = 0; /* -* Resorting requires the cfqq to be on the RR list already. 
+* if cfqq has requests in flight, don't allow it to be +* found in cfq_set_active_queue before it has finished them. +* this is done to increase fairness between a process that +* has lots of io pending vs one that only generates one +* sporadically or synchronously */ - if (!cfq_cfqq_on_rr(cfqq)) - return; - -
Re: [REPORT] cfs-v5 vs sd-0.46
What I also don't understand is the difference in load average, sd constantly had higher values, the above figures are representative for the whole log. I don't know which is better though.

hm, it's hard to tell that from here. What load average does the vanilla kernel report? I'd take that as a reference.

I will redo this test with sd-0.46, cfs-v5 and mainline later today.

interesting - CFS has half the context-switch rate of SD. That is probably because on your workload CFS defaults to longer 'timeslices' than SD. You can influence the 'timeslice length' under SD via /proc/sys/kernel/rr_interval (millisecond units) and under CFS via /proc/sys/kernel/sched_granularity_ns. On CFS the value is not necessarily the timeslice length you will observe - for example in your workload above the granularity is set to 5 msec, but your rescheduling rate is 13 msecs. SD defaults to an rr_interval value of 8 msecs, which in your workload produces a timeslice length of 6-7 msecs. So to be totally 'fair' and get the same rescheduling 'granularity' you should probably lower CFS's sched_granularity_ns to 2 msecs.

I'll change the default nice in cfs to -10. I'm also happy to adjust /proc/sys/kernel/sched_granularity_ns to 2 msec. However, checking /proc/sys/kernel/rr_interval reveals it is 16 (msec) on my system. Anyway, I'll have to do some urgent other work and won't be able to do lots of testing until tonight (but then I will).

Best, Michael

-- Technosis GmbH, Geschäftsführer: Michael Gerdau, Tobias Dittmar; Sitz Hamburg; HRB 89145 Amtsgericht Hamburg. Vote against SPAM - see http://www.politik-digital.de/spam/ Michael Gerdau email: [EMAIL PROTECTED] GPG-keys available on request or at public keyserver
Re: [RFC] another scheduler beater
* Bill Davidsen [EMAIL PROTECTED] wrote: The small attached script does a nice job of showing animation glitches in the glxgears animation. I have run one set of tests, and will have several more tomorrow. I'm off to a poker game, and would like to let people draw their own conclusions. Based on just this script as load I would say renice on X isn't a good thing. Based on one small test, I would say that renice of X in conjunction with heavy disk i/o and a single fast scrolling xterm (think kernel compile) seems to slow the raid6 thread measurably. Results late tomorrow, it will be an early and long day :-( hm, i'm wondering what you would expect the scheduler to do here? for this particular test you'll get the best result by renicing X to +19! Why? Because, as far as i can see this is a partially 'inverted' test of X's scheduling. While the script is definitely useful (you taught me that nice xterm -geom trick to automate the placing of busy xterms :), some caveats do apply when interpreting the results: If you have a kernel 3D driver (which you seem to have, judging by the glxgears numbers you are getting) then running 'glxgears' wont involve X at all. glxgears just gets its own window and then the kernel driver draws straight into it, without any side-trips to X. You can see this for yourself by starting glitch1.sh from an ssh terminal, and then _totally stop_ the X server via kill -STOP 12345 - all the xterms will stop, the X desktop freezes, but the glxgears instance will still happily draw its stuff and wheels are happily turning on the screen. So in this sense glxgears is a 'CPU hog' workload, largely independent of X. now, by renicing X to -10 and running the xterms you'll definitely hurt CPU hogs - even if it happens to be a glxgears process that draws 3D graphics in a window provided by X. But this is precisely what is supposed to happen in this case. 
You should get the best glxgears performance by renicing X to _+19_, and that seems to be happening according to your numbers - and that's what happens in my own testing too. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/15] cfq-iosched: minor updates
- Move the queue_new flag clear to when the queue is selected - Only select the non-first queue in cfq_get_best_queue(), if there's a substantial difference between the best and first. - Get rid of -busy_rr - Only select a close cooperator, if the current queue is known to take a while to think. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 81 +++--- 1 files changed, 18 insertions(+), 63 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 3883ba8..04fea76 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -70,7 +70,6 @@ struct cfq_data { * rr list of queues with requests and the count of them */ struct list_head rr_list[CFQ_PRIO_LISTS]; - struct list_head busy_rr; struct list_head cur_rr; struct list_head idle_rr; unsigned long cur_rr_tick; @@ -410,59 +409,18 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq, static void cfq_resort_be_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq, int preempted) { - struct list_head *list, *n; - struct cfq_queue *__cfqq; - int add_tail = 0; - - /* -* if cfqq has requests in flight, don't allow it to be -* found in cfq_set_active_queue before it has finished them. -* this is done to increase fairness between a process that -* has lots of io pending vs one that only generates one -* sporadically or synchronously -*/ - if (cfqq-dispatched) - list = cfqd-busy_rr; - else if (cfqq-ioprio == (cfqd-cur_prio + 1) -cfq_cfqq_sync(cfqq) -(time_before(cfqd-prio_time, cfqq-service_last) || - cfq_cfqq_queue_new(cfqq) || preempted)) { - list = cfqd-cur_rr; - add_tail = 1; - } else - list = cfqd-rr_list[cfqq-ioprio]; - - if (!cfq_cfqq_sync(cfqq) || add_tail) { - /* -* async queue always goes to the end. this wont be overly -* unfair to writes, as the sort of the sync queue wont be -* allowed to pass the async queue again. 
-*/ - list_add_tail(cfqq-cfq_list, list); - } else if (preempted || cfq_cfqq_queue_new(cfqq)) { - /* -* If this queue was preempted or is new (never been serviced), -* let it be added first for fairness but beind other new -* queues. -*/ - n = list; - while (n-next != list) { - __cfqq = list_entry_cfqq(n-next); - if (!cfq_cfqq_queue_new(__cfqq)) - break; + if (!cfq_cfqq_sync(cfqq)) + list_add_tail(cfqq-cfq_list, cfqd-rr_list[cfqq-ioprio]); + else { + struct list_head *n = cfqd-rr_list[cfqq-ioprio]; - n = n-next; - } - list_add(cfqq-cfq_list, n); - } else { /* * sort by last service, but don't cross a new or async -* queue. we don't cross a new queue because it hasn't been -* service before, and we don't cross an async queue because -* it gets added to the end on expire. +* queue. we don't cross a new queue because it hasn't +* been service before, and we don't cross an async +* queue because it gets added to the end on expire. */ - n = list; - while ((n = n-prev) != list) { + while ((n = n-prev) != cfqd-rr_list[cfqq-ioprio]) { struct cfq_queue *__c = list_entry_cfqq(n); if (!cfq_cfqq_sync(__c) || !__c-service_last) @@ -719,6 +677,7 @@ __cfq_set_active_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq) cfq_clear_cfqq_must_alloc_slice(cfqq); cfq_clear_cfqq_fifo_expire(cfqq); cfq_mark_cfqq_slice_new(cfqq); + cfq_clear_cfqq_queue_new(cfqq); cfqq-rr_tick = cfqd-cur_rr_tick; } @@ -737,7 +696,6 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq, cfq_clear_cfqq_must_dispatch(cfqq); cfq_clear_cfqq_wait_request(cfqq); - cfq_clear_cfqq_queue_new(cfqq); /* * store what was left of this slice, if the queue idled out @@ -839,13 +797,15 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, static struct cfq_queue *cfq_get_best_queue(struct cfq_data *cfqd) { struct cfq_queue *cfqq = NULL, *__cfqq; - sector_t best = -1, dist; + sector_t best = -1, first = -1, dist; list_for_each_entry(__cfqq, cfqd-cur_rr, cfq_list) { if (!__cfqq-next_rq ||
[PATCH 12/15] cfq-iosched: get rid of -dispatch_slice
We can track it fairly accurately locally, let the slice handling take care of the rest. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c |6 +- 1 files changed, 1 insertions(+), 5 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index b680002..8f76aed 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -106,7 +106,6 @@ struct cfq_data { struct cfq_queue *active_queue; struct cfq_io_context *active_cic; - unsigned int dispatch_slice; struct timer_list idle_class_timer; @@ -769,8 +768,6 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq, put_io_context(cfqd-active_cic-ioc); cfqd-active_cic = NULL; } - - cfqd-dispatch_slice = 0; } static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out) @@ -1020,7 +1017,6 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq, */ cfq_dispatch_insert(cfqd-queue, rq); - cfqd-dispatch_slice++; dispatched++; if (!cfqd-active_cic) { @@ -1038,7 +1034,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq, * queue always expire after 1 dispatch round. */ if (cfqd-busy_queues 1 ((!cfq_cfqq_sync(cfqq) - cfqd-dispatch_slice = cfq_prio_to_maxrq(cfqd, cfqq)) || + dispatched = cfq_prio_to_maxrq(cfqd, cfqq)) || cfq_class_idle(cfqq))) { cfqq-slice_end = jiffies + 1; cfq_slice_expired(cfqd, 0); -- 1.5.1.1.190.g74474 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/15] cfq-iosched: speed up rbtree handling
For cases where the rbtree is mainly used for sorting and min retrieval, a nice speedup of the rbtree code is to maintain a cache of the leftmost node in the tree. Also spotted in the CFS CPU scheduler code. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 62 +++--- 1 files changed, 48 insertions(+), 14 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index ad29a99..7f964ee 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -70,6 +70,18 @@ static struct completion *ioc_gone; #define sample_valid(samples) ((samples) 80) /* + * Most of our rbtree usage is for sorting with min extraction, so + * if we cache the leftmost node we don't have to walk down the tree + * to find it. Idea borrowed from Ingo Molnars CFS scheduler. We should + * move this into the elevator for the rq sorting as well. + */ +struct cfq_rb_root { + struct rb_root rb; + struct rb_node *left; +}; +#define CFQ_RB_ROOT(struct cfq_rb_root) { RB_ROOT, NULL, } + +/* * Per block device queue structure */ struct cfq_data { @@ -78,7 +90,7 @@ struct cfq_data { /* * rr list of queues with requests and the count of them */ - struct rb_root service_tree; + struct cfq_rb_root service_tree; struct list_head cur_rr; struct list_head idle_rr; unsigned int busy_queues; @@ -378,6 +390,23 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2) } } +static struct rb_node *cfq_rb_first(struct cfq_rb_root *root) +{ + if (root-left) + return root-left; + + return rb_first(root-rb); +} + +static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root) +{ + if (root-left == n) + root-left = NULL; + + rb_erase(n, root-rb); + RB_CLEAR_NODE(n); +} + /* * would be nice to take fifo expire time into account as well */ @@ -417,10 +446,10 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd, static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq) { - struct rb_node **p = cfqd-service_tree.rb_node; + struct rb_node 
**p = cfqd-service_tree.rb.rb_node; struct rb_node *parent = NULL; - struct cfq_queue *__cfqq; unsigned long rb_key; + int left = 1; rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; rb_key += cfqq-slice_resid; @@ -433,22 +462,29 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, if (rb_key == cfqq-rb_key) return; - rb_erase(cfqq-rb_node, cfqd-service_tree); + cfq_rb_erase(cfqq-rb_node, cfqd-service_tree); } while (*p) { + struct cfq_queue *__cfqq; + parent = *p; __cfqq = rb_entry(parent, struct cfq_queue, rb_node); if (rb_key __cfqq-rb_key) p = (*p)-rb_left; - else + else { p = (*p)-rb_right; + left = 0; + } } + if (left) + cfqd-service_tree.left = cfqq-rb_node; + cfqq-rb_key = rb_key; rb_link_node(cfqq-rb_node, parent, p); - rb_insert_color(cfqq-rb_node, cfqd-service_tree); + rb_insert_color(cfqq-rb_node, cfqd-service_tree.rb); } static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted) @@ -509,10 +545,8 @@ cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) cfq_clear_cfqq_on_rr(cfqq); list_del_init(cfqq-cfq_list); - if (!RB_EMPTY_NODE(cfqq-rb_node)) { - rb_erase(cfqq-rb_node, cfqd-service_tree); - RB_CLEAR_NODE(cfqq-rb_node); - } + if (!RB_EMPTY_NODE(cfqq-rb_node)) + cfq_rb_erase(cfqq-rb_node, cfqd-service_tree); BUG_ON(!cfqd-busy_queues); cfqd-busy_queues--; @@ -758,8 +792,8 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) * if current list is non-empty, grab first entry. 
*/ cfqq = list_entry_cfqq(cfqd-cur_rr.next); - } else if (!RB_EMPTY_ROOT(cfqd-service_tree)) { - struct rb_node *n = rb_first(cfqd-service_tree); + } else if (!RB_EMPTY_ROOT(cfqd-service_tree.rb)) { + struct rb_node *n = cfq_rb_first(cfqd-service_tree); cfqq = rb_entry(n, struct cfq_queue, rb_node); } else if (!list_empty(cfqd-idle_rr)) { @@ -1030,7 +1064,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd) int dispatched = 0; struct rb_node *n; - while ((n = rb_first(cfqd-service_tree)) != NULL) { + while ((n = cfq_rb_first(cfqd-service_tree)) != NULL) { struct cfq_queue *cfqq = rb_entry(n, struct cfq_queue, rb_node);
[PATCH 8/15] cfq-iosched: style cleanups and comments
Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 66 ++ 1 files changed, 50 insertions(+), 16 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index e6cc77f..f86ff4d 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -222,7 +222,7 @@ CFQ_CFQQ_FNS(slice_new); static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *, unsigned int, unsigned short); static void cfq_dispatch_insert(request_queue_t *, struct request *); -static struct cfq_queue *cfq_get_queue(struct cfq_data *cfqd, unsigned int key, struct task_struct *tsk, gfp_t gfp_mask); +static struct cfq_queue *cfq_get_queue(struct cfq_data *, unsigned int, struct task_struct *, gfp_t); /* * scheduler run of queue, if there are requests pending and no one in the @@ -389,6 +389,9 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2) } } +/* + * The below is leftmost cache rbtree addon + */ static struct rb_node *cfq_rb_first(struct cfq_rb_root *root) { if (root-left) @@ -442,13 +445,18 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd, return ((cfqd-busy_queues - 1) * cfq_prio_slice(cfqd, 1, 0)); } +/* + * The cfqd-service_tree holds all pending cfq_queue's that have + * requests waiting to be processed. It is sorted in the order that + * we will service the queues. 
+ */ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq) { struct rb_node **p = cfqd-service_tree.rb.rb_node; struct rb_node *parent = NULL; unsigned long rb_key; - int left = 1; + int left; rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; rb_key += cfqq-slice_resid; @@ -464,6 +472,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, cfq_rb_erase(cfqq-rb_node, cfqd-service_tree); } + left = 1; while (*p) { struct cfq_queue *__cfqq; struct rb_node **n; @@ -503,17 +512,16 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, rb_insert_color(cfqq-rb_node, cfqd-service_tree.rb); } +/* + * Update cfqq's position in the service tree. + */ static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted) { - struct cfq_data *cfqd = cfqq-cfqd; - /* * Resorting requires the cfqq to be on the RR list already. */ - if (!cfq_cfqq_on_rr(cfqq)) - return; - - cfq_service_tree_add(cfqd, cfqq); + if (cfq_cfqq_on_rr(cfqq)) + cfq_service_tree_add(cfqq-cfqd, cfqq); } /* @@ -530,6 +538,10 @@ cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) cfq_resort_rr_list(cfqq, 0); } +/* + * Called when the cfqq no longer has requests pending, remove it from + * the service tree. + */ static inline void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) { @@ -648,8 +660,7 @@ static void cfq_remove_request(struct request *rq) } } -static int -cfq_merge(request_queue_t *q, struct request **req, struct bio *bio) +static int cfq_merge(request_queue_t *q, struct request **req, struct bio *bio) { struct cfq_data *cfqd = q-elevator-elevator_data; struct request *__rq; @@ -775,6 +786,10 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, int preempted, __cfq_slice_expired(cfqd, cfqq, preempted, timed_out); } +/* + * Get next queue for service. Unless we have a queue preemption, + * we'll simply select the first cfqq in the service tree. 
+ */ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) { struct cfq_queue *cfqq = NULL; @@ -786,10 +801,11 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) cfqq = list_entry_cfqq(cfqd-cur_rr.next); } else if (!RB_EMPTY_ROOT(cfqd-service_tree.rb)) { struct rb_node *n = cfq_rb_first(cfqd-service_tree); - unsigned long end; cfqq = rb_entry(n, struct cfq_queue, rb_node); if (cfq_class_idle(cfqq)) { + unsigned long end; + /* * if we have idle queues and no rt or be queues had * pending requests, either allow immediate service if @@ -807,6 +823,9 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) return cfqq; } +/* + * Get and set a new active queue for service. + */ static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd) { struct cfq_queue *cfqq; @@ -892,6 +911,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) mod_timer(cfqd-idle_slice_timer, jiffies + sl); } +/* + * Move request from internal lists to the request queue dispatch list. + */ static void cfq_dispatch_insert(request_queue_t *q, struct request *rq) {
[PATCH 14/15] cfq-iosched: improve sync vs async workloads
Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 31 ++- 1 files changed, 18 insertions(+), 13 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index f920527..772df89 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -96,6 +96,7 @@ struct cfq_data { struct hlist_head *cfq_hash; int rq_in_driver; + int sync_flight; int hw_tag; /* @@ -905,11 +906,15 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) */ static void cfq_dispatch_insert(request_queue_t *q, struct request *rq) { + struct cfq_data *cfqd = q-elevator-elevator_data; struct cfq_queue *cfqq = RQ_CFQQ(rq); cfq_remove_request(rq); cfqq-dispatched++; elv_dispatch_sort(q, rq); + + if (cfq_cfqq_sync(cfqq)) + cfqd-sync_flight++; } /* @@ -1094,27 +1099,24 @@ static int cfq_dispatch_requests(request_queue_t *q, int force) while ((cfqq = cfq_select_queue(cfqd)) != NULL) { int max_dispatch; - if (cfqd-busy_queues 1) { - /* -* So we have dispatched before in this round, if the -* next queue has idling enabled (must be sync), don't -* allow it service until the previous have completed. 
-*/ - if (cfqd-rq_in_driver cfq_cfqq_idle_window(cfqq) - dispatched) + max_dispatch = cfqd-cfq_quantum; + if (cfq_class_idle(cfqq)) + max_dispatch = 1; + + if (cfqq-dispatched = max_dispatch) { + if (cfqd-busy_queues 1) break; - if (cfqq-dispatched = cfqd-cfq_quantum) + if (cfqq-dispatched = 4 * max_dispatch) break; } + if (cfqd-sync_flight !cfq_cfqq_sync(cfqq)) + break; + cfq_clear_cfqq_must_dispatch(cfqq); cfq_clear_cfqq_wait_request(cfqq); del_timer(cfqd-idle_slice_timer); - max_dispatch = cfqd-cfq_quantum; - if (cfq_class_idle(cfqq)) - max_dispatch = 1; - dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch); } @@ -1767,6 +1769,9 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq) cfqd-rq_in_driver--; cfqq-dispatched--; + if (cfq_cfqq_sync(cfqq)) + cfqd-sync_flight--; + if (!cfq_class_idle(cfqq)) cfqd-last_end_request = now; -- 1.5.1.1.190.g74474 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/15] cfq-iosched: rework the whole round-robin list concept
Drawing on some inspiration from the CFS CPU scheduler design, overhaul the pending cfq_queue concept list management. Currently CFQ uses a doubly linked list per priority level for sorting and service uses. Kill those lists and maintain an rbtree of cfq_queue's, sorted by when to service them. This unfortunately means that the ionice levels aren't as strong anymore, will work on improving those later. We only scale the slice time now, not the number of times we service. This means that latency is better (for all priority levels), but that the distinction between the highest and lower levels aren't as big. The diffstat speaks for itself. cfq-iosched.c | 363 +- 1 file changed, 125 insertions(+), 238 deletions(-) Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 361 +- 1 files changed, 123 insertions(+), 238 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 04fea76..ad29a99 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -26,7 +26,16 @@ static int cfq_slice_async = HZ / 25; static const int cfq_slice_async_rq = 2; static int cfq_slice_idle = HZ / 125; +/* + * grace period before allowing idle class to get disk access + */ #define CFQ_IDLE_GRACE (HZ / 10) + +/* + * below this threshold, we consider thinktime immediate + */ +#define CFQ_MIN_TT (2) + #define CFQ_SLICE_SCALE(5) #define CFQ_KEY_ASYNC (0) @@ -69,10 +78,9 @@ struct cfq_data { /* * rr list of queues with requests and the count of them */ - struct list_head rr_list[CFQ_PRIO_LISTS]; + struct rb_root service_tree; struct list_head cur_rr; struct list_head idle_rr; - unsigned long cur_rr_tick; unsigned int busy_queues; /* @@ -91,8 +99,6 @@ struct cfq_data { struct cfq_queue *active_queue; struct cfq_io_context *active_cic; - int cur_prio, cur_end_prio; - unsigned long prio_time; unsigned int dispatch_slice; struct timer_list idle_class_timer; @@ -131,8 +137,10 @@ struct cfq_queue { unsigned int key; /* member of the rr/busy/cur/idle cfqd list */ struct 
list_head cfq_list; - /* in what tick we were last serviced */ - unsigned long rr_tick; + /* service_tree member */ + struct rb_node rb_node; + /* service_tree key */ + unsigned long rb_key; /* sorted list of pending requests */ struct rb_root sort_list; /* if fifo isn't expired, next request to serve */ @@ -147,8 +155,6 @@ struct cfq_queue { struct list_head fifo; unsigned long slice_end; - unsigned long service_last; - unsigned long slice_start; long slice_resid; /* number of requests that are on the dispatch list or inside driver */ @@ -240,30 +246,26 @@ static inline pid_t cfq_queue_pid(struct task_struct *task, int rw, int is_sync) * if a queue is marked sync and has sync io queued. A sync queue with async * io only, should not get full sync slice length. */ -static inline int -cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq) +static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync, +unsigned short prio) { - const int base_slice = cfqd-cfq_slice[cfq_cfqq_sync(cfqq)]; + const int base_slice = cfqd-cfq_slice[sync]; - WARN_ON(cfqq-ioprio = IOPRIO_BE_NR); + WARN_ON(prio = IOPRIO_BE_NR); + + return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio)); +} - return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - cfqq-ioprio)); +static inline int +cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq) +{ + return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq-ioprio); } static inline void cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq) { cfqq-slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies; - cfqq-slice_end += cfqq-slice_resid; - - /* -* Don't carry over residual for more than one slice, we only want -* to slightly correct the fairness. Carrying over forever would -* easily introduce oscillations. 
-*/ - cfqq-slice_resid = 0; - - cfqq-slice_start = jiffies; } /* @@ -403,33 +405,50 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq, return cfq_choose_req(cfqd, next, prev); } -/* - * This function finds out where to insert a BE queue in the service hierarchy - */ -static void cfq_resort_be_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq, - int preempted) +static unsigned long cfq_slice_offset(struct cfq_data *cfqd, + struct cfq_queue *cfqq) { - if (!cfq_cfqq_sync(cfqq)) - list_add_tail(cfqq-cfq_list,
[PATCH 10/15] cfq-iosched: get rid of -cur_rr and -cfq_list
It's only used for preemption now that the IDLE and RT queues also use the rbtree. If we pass an 'add_front' variable to cfq_service_tree_add(), we can set -rb_key to 0 to force insertion at the front of the tree. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 87 +++ 1 files changed, 32 insertions(+), 55 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 251131a..2d0e9c5 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -45,9 +45,6 @@ static int cfq_slice_idle = HZ / 125; */ #define CFQ_QHASH_SHIFT6 #define CFQ_QHASH_ENTRIES (1 CFQ_QHASH_SHIFT) -#define list_entry_qhash(entry)hlist_entry((entry), struct cfq_queue, cfq_hash) - -#define list_entry_cfqq(ptr) list_entry((ptr), struct cfq_queue, cfq_list) #define RQ_CIC(rq) ((struct cfq_io_context*)(rq)-elevator_private) #define RQ_CFQQ(rq)((rq)-elevator_private2) @@ -91,7 +88,6 @@ struct cfq_data { * rr list of queues with requests and the count of them */ struct cfq_rb_root service_tree; - struct list_head cur_rr; unsigned int busy_queues; /* @@ -146,8 +142,6 @@ struct cfq_queue { struct hlist_node cfq_hash; /* hash key */ unsigned int key; - /* member of the rr/busy/cur/idle cfqd list */ - struct list_head cfq_list; /* service_tree member */ struct rb_node rb_node; /* service_tree key */ @@ -452,16 +446,19 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd, * we will service the queues. 
*/ static void cfq_service_tree_add(struct cfq_data *cfqd, - struct cfq_queue *cfqq) + struct cfq_queue *cfqq, int add_front) { struct rb_node **p = cfqd-service_tree.rb.rb_node; struct rb_node *parent = NULL; unsigned long rb_key; int left; - rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; - rb_key += cfqq-slice_resid; - cfqq-slice_resid = 0; + if (!add_front) { + rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; + rb_key += cfqq-slice_resid; + cfqq-slice_resid = 0; + } else + rb_key = 0; if (!RB_EMPTY_NODE(cfqq-rb_node)) { /* @@ -516,13 +513,13 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, /* * Update cfqq's position in the service tree. */ -static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted) +static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq) { /* * Resorting requires the cfqq to be on the RR list already. */ if (cfq_cfqq_on_rr(cfqq)) - cfq_service_tree_add(cfqq-cfqd, cfqq); + cfq_service_tree_add(cfqd, cfqq, 0); } /* @@ -536,7 +533,7 @@ cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) cfq_mark_cfqq_on_rr(cfqq); cfqd-busy_queues++; - cfq_resort_rr_list(cfqq, 0); + cfq_resort_rr_list(cfqd, cfqq); } /* @@ -548,7 +545,6 @@ cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) { BUG_ON(!cfq_cfqq_on_rr(cfqq)); cfq_clear_cfqq_on_rr(cfqq); - list_del_init(cfqq-cfq_list); if (!RB_EMPTY_NODE(cfqq-rb_node)) cfq_rb_erase(cfqq-rb_node, cfqd-service_tree); @@ -765,7 +761,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq, if (timed_out !cfq_cfqq_slice_new(cfqq)) cfqq-slice_resid = cfqq-slice_end - jiffies; - cfq_resort_rr_list(cfqq, preempted); + cfq_resort_rr_list(cfqd, cfqq); if (cfqq == cfqd-active_queue) cfqd-active_queue = NULL; @@ -793,31 +789,28 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, int preempted, */ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) { - struct cfq_queue *cfqq = NULL; + struct cfq_queue *cfqq; + struct rb_node 
*n; - if (!list_empty(cfqd-cur_rr)) { - /* -* if current list is non-empty, grab first entry. -*/ - cfqq = list_entry_cfqq(cfqd-cur_rr.next); - } else if (!RB_EMPTY_ROOT(cfqd-service_tree.rb)) { - struct rb_node *n = cfq_rb_first(cfqd-service_tree); + if (RB_EMPTY_ROOT(cfqd-service_tree.rb)) + return NULL; - cfqq = rb_entry(n, struct cfq_queue, rb_node); - if (cfq_class_idle(cfqq)) { - unsigned long end; + n = cfq_rb_first(cfqd-service_tree); + cfqq = rb_entry(n, struct cfq_queue, rb_node); - /* -* if we have idle queues and no rt or be queues had -* pending requests, either allow immediate service if -
[PATCH 6/15] cfq-iosched: sort RT queues into the rbtree
Currently CFQ does a linked insert into the current list for RT queues. We can just factor the class into the rb insertion, and then we don't have to treat RT queues in a special way. It's faster, too. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 27 --- 1 files changed, 12 insertions(+), 15 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 7f964ee..38ac492 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -471,7 +471,16 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, parent = *p; __cfqq = rb_entry(parent, struct cfq_queue, rb_node); - if (rb_key __cfqq-rb_key) + /* +* sort RT queues first, we always want to give +* preference to them. after that, sort on the next +* service time. +*/ + if (cfq_class_rt(cfqq) cfq_class_rt(__cfqq)) + p = (*p)-rb_left; + else if (cfq_class_rt(cfqq) cfq_class_rt(__cfqq)) + p = (*p)-rb_right; + else if (rb_key __cfqq-rb_key) p = (*p)-rb_left; else { p = (*p)-rb_right; @@ -490,7 +499,6 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted) { struct cfq_data *cfqd = cfqq-cfqd; - struct list_head *n; /* * Resorting requires the cfqq to be on the RR list already. @@ -500,25 +508,14 @@ static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted) list_del_init(cfqq-cfq_list); - if (cfq_class_rt(cfqq)) { - /* -* At to the front of the current list, but behind other -* RT queues. 
-*/ - n = cfqd-cur_rr; - while (n-next != cfqd-cur_rr) - if (!cfq_class_rt(cfqq)) - break; - - list_add(cfqq-cfq_list, n); - } else if (cfq_class_idle(cfqq)) { + if (cfq_class_idle(cfqq)) { /* * IDLE goes to the tail of the idle list */ list_add_tail(cfqq-cfq_list, cfqd-idle_rr); } else { /* -* So we get here, ergo the queue is a regular best-effort queue +* RT and BE queues, sort into the rbtree */ cfq_service_tree_add(cfqd, cfqq); } -- 1.5.1.1.190.g74474 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test. Ok, that makes sense I guess. Well, my narrow-minded world view says it's not such a good idea, because it would again introduce the deadlock scenario we're trying to avoid. In a sense, allowing a queue to go over the global limit just a little bit is a good thing. Actually, the very original code does that: if writeback was started for write_chunk number of pages, then we allow ratelimit (8) _new_ pages to be dirtied, effectively ignoring the global limit. That's why I've been saying that the current code is so unfair: if there are lots of dirty pages to be written back to a particular device, then balance_dirty_pages() allows the dirty producer to make even more pages dirty, but if there are _no_ dirty pages for a device, and we are over the limit, then that dirty producer is allowed absolutely no new dirty pages until the global counts subside. I'm still not quite sure what purpose the above soft limiting serves. It seems to just give an advantage to writers that managed to accumulate lots of dirty pages, and can then convert that into even more dirtying. Would it make sense to remove this behavior, and ensure that balance_dirty_pages() doesn't return until the per-queue limits have been complied with? Miklos - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 7/15] cfq-iosched: sort IDLE queues into the rbtree
Same treatment as the RT conversion, just put the sorted idle branch at the end of the tree. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c | 67 +++--- 1 files changed, 31 insertions(+), 36 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 38ac492..e6cc77f 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -92,7 +92,6 @@ struct cfq_data { */ struct cfq_rb_root service_tree; struct list_head cur_rr; - struct list_head idle_rr; unsigned int busy_queues; /* @@ -467,25 +466,33 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, while (*p) { struct cfq_queue *__cfqq; + struct rb_node **n; parent = *p; __cfqq = rb_entry(parent, struct cfq_queue, rb_node); /* * sort RT queues first, we always want to give -* preference to them. after that, sort on the next -* service time. +* preference to them. IDLE queues goes to the back. +* after that, sort on the next service time. */ if (cfq_class_rt(cfqq) cfq_class_rt(__cfqq)) - p = (*p)-rb_left; + n = (*p)-rb_left; else if (cfq_class_rt(cfqq) cfq_class_rt(__cfqq)) - p = (*p)-rb_right; + n = (*p)-rb_right; + else if (cfq_class_idle(cfqq) cfq_class_idle(__cfqq)) + n = (*p)-rb_left; + else if (cfq_class_idle(cfqq) cfq_class_idle(__cfqq)) + n = (*p)-rb_right; else if (rb_key __cfqq-rb_key) - p = (*p)-rb_left; - else { - p = (*p)-rb_right; + n = (*p)-rb_left; + else + n = (*p)-rb_right; + + if (n == (*p)-rb_right) left = 0; - } + + p = n; } if (left) @@ -506,19 +513,7 @@ static void cfq_resort_rr_list(struct cfq_queue *cfqq, int preempted) if (!cfq_cfqq_on_rr(cfqq)) return; - list_del_init(cfqq-cfq_list); - - if (cfq_class_idle(cfqq)) { - /* -* IDLE goes to the tail of the idle list -*/ - list_add_tail(cfqq-cfq_list, cfqd-idle_rr); - } else { - /* -* RT and BE queues, sort into the rbtree -*/ - cfq_service_tree_add(cfqd, cfqq); - } + cfq_service_tree_add(cfqd, cfqq); } /* @@ -791,20 +786,22 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) cfqq = 
list_entry_cfqq(cfqd-cur_rr.next); } else if (!RB_EMPTY_ROOT(cfqd-service_tree.rb)) { struct rb_node *n = cfq_rb_first(cfqd-service_tree); + unsigned long end; cfqq = rb_entry(n, struct cfq_queue, rb_node); - } else if (!list_empty(cfqd-idle_rr)) { - /* -* if we have idle queues and no rt or be queues had pending -* requests, either allow immediate service if the grace period -* has passed or arm the idle grace timer -*/ - unsigned long end = cfqd-last_end_request + CFQ_IDLE_GRACE; - - if (time_after_eq(jiffies, end)) - cfqq = list_entry_cfqq(cfqd-idle_rr.next); - else - mod_timer(cfqd-idle_class_timer, end); + if (cfq_class_idle(cfqq)) { + /* +* if we have idle queues and no rt or be queues had +* pending requests, either allow immediate service if +* the grace period has passed or arm the idle grace +* timer +*/ + end = cfqd-last_end_request + CFQ_IDLE_GRACE; + if (time_before(jiffies, end)) { + mod_timer(cfqd-idle_class_timer, end); + cfqq = NULL; + } + } } return cfqq; @@ -1068,7 +1065,6 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd) } dispatched += cfq_forced_dispatch_cfqqs(cfqd-cur_rr); - dispatched += cfq_forced_dispatch_cfqqs(cfqd-idle_rr); cfq_slice_expired(cfqd, 0, 0); @@ -2047,7 +2043,6 @@ static void *cfq_init_queue(request_queue_t *q) cfqd-service_tree = CFQ_RB_ROOT; INIT_LIST_HEAD(cfqd-cur_rr); - INIT_LIST_HEAD(cfqd-idle_rr); INIT_LIST_HEAD(cfqd-cic_list); cfqd-cfq_hash = kmalloc_node(sizeof(struct hlist_head) * CFQ_QHASH_ENTRIES,
[PATCH 9/15] cfq-iosched: slice offset should take ioprio into account
Use (max_slice - cur_slice) as the multiplier for the insertion offset. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index f86ff4d..251131a 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -442,7 +442,8 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd, /* * just an approximation, should be ok. */ - return ((cfqd->busy_queues - 1) * cfq_prio_slice(cfqd, 1, 0)); + return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) - + cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); } /* -- 1.5.1.1.190.g74474 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 13/15] cfq-iosched: never allow an async queue idling
We don't enable it by default, don't let it get enabled during runtime. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- block/cfq-iosched.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 8f76aed..f920527 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -1597,7 +1597,12 @@ static void cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, struct cfq_io_context *cic) { - int enable_idle = cfq_cfqq_idle_window(cfqq); + int enable_idle; + + if (!cfq_cfqq_sync(cfqq)) + return; + + enable_idle = cfq_cfqq_idle_window(cfqq); if (!cic-ioc-task || !cfqd-cfq_slice_idle || (cfqd-hw_tag CIC_SEEKY(cic))) -- 1.5.1.1.190.g74474 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [REPORT] cfs-v5 vs sd-0.46
* Michael Gerdau [EMAIL PROTECTED] wrote: so to be totally 'fair' and get the same rescheduling 'granularity' you should probably lower CFS's sched_granularity_ns to 2 msecs. I'll change default nice in cfs to -10. I'm also happy to adjust /proc/sys/kernel/sched_granularity_ns to 2msec. However checking /proc/sys/kernel/rr_interval reveals it is 16 (msec) on my system. ah, yeah - there due to the SMP rule in SD: rr_interval *= 1 + ilog2(num_online_cpus()); and you have a 2-CPU system, so you get 8msec*2 == 16 msecs default interval. I find this a neat solution and i have talked to Con about this already and i'll adopt Con's idea in CFS too. Nevertheless, despite the settings, SD seems to be rescheduling every 6-7 msecs, while CFS reschedules only every 13 msecs. Here i'm assuming that the vmstats are directly comparable: that your number-crunchers behave the same during the full runtime - is that correct? (If not then the vmstat result should be run at roughly the same type of stage of the workload, on all the schedulers.) Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 11/15] cfq-iosched: don't pass unused preemption variable around
We don't use it anymore in the slice expiry handling.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |   28 +++++++++++++---------------
 1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2d0e9c5..b680002 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -746,7 +746,7 @@ __cfq_set_active_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  */
 static void
 __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int preempted, int timed_out)
+		    int timed_out)
 {
 	if (cfq_cfqq_wait_request(cfqq))
 		del_timer(&cfqd->idle_slice_timer);
@@ -755,8 +755,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfq_clear_cfqq_wait_request(cfqq);

 	/*
-	 * store what was left of this slice, if the queue idled out
-	 * or was preempted
+	 * store what was left of this slice, if the queue idled/timed out
 	 */
 	if (timed_out && !cfq_cfqq_slice_new(cfqq))
 		cfqq->slice_resid = cfqq->slice_end - jiffies;
@@ -774,13 +773,12 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfqd->dispatch_slice = 0;
 }

-static inline void cfq_slice_expired(struct cfq_data *cfqd, int preempted,
-				     int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
 {
 	struct cfq_queue *cfqq = cfqd->active_queue;

 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, preempted, timed_out);
+		__cfq_slice_expired(cfqd, cfqq, timed_out);
 }

 /*
@@ -989,7 +987,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	}

 expire:
-	cfq_slice_expired(cfqd, 0, 0);
+	cfq_slice_expired(cfqd, 0);
 new_queue:
 	cfqq = cfq_set_active_queue(cfqd);
 keep_queue:
@@ -1043,7 +1041,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	    cfqd->dispatch_slice >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
 		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0, 0);
+		cfq_slice_expired(cfqd, 0);
 	}

 	return dispatched;
@@ -1077,7 +1075,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 	}

-	cfq_slice_expired(cfqd, 0, 0);
+	cfq_slice_expired(cfqd, 0);

 	BUG_ON(cfqd->busy_queues);

@@ -1147,7 +1145,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	BUG_ON(cfq_cfqq_on_rr(cfqq));

 	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0, 0);
+		__cfq_slice_expired(cfqd, cfqq, 0);
 		cfq_schedule_dispatch(cfqd);
 	}

@@ -1204,7 +1202,7 @@ static void cfq_free_io_context(struct io_context *ioc)
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0, 0);
+		__cfq_slice_expired(cfqd, cfqq, 0);
 		cfq_schedule_dispatch(cfqd);
 	}

@@ -1677,7 +1675,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_slice_expired(cfqd, 1, 1);
+	cfq_slice_expired(cfqd, 1);

 	/*
 	 * Put the new queue at the front of the current list,
@@ -1784,7 +1782,7 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq)
 			cfq_clear_cfqq_slice_new(cfqq);
 		}
 		if (cfq_slice_used(cfqq))
-			cfq_slice_expired(cfqd, 0, 1);
+			cfq_slice_expired(cfqd, 1);
 		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
 			cfq_arm_slice_timer(cfqd);
 	}
@@ -1979,7 +1977,7 @@ static void cfq_idle_slice_timer(unsigned long data)
 		}
 	}
 expire:
-	cfq_slice_expired(cfqd, 0, timed_out);
+	cfq_slice_expired(cfqd, timed_out);
 out_kick:
 	cfq_schedule_dispatch(cfqd);
 out_cont:
@@ -2025,7 +2023,7 @@ static void cfq_exit_queue(elevator_t *e)
 	spin_lock_irq(q->queue_lock);

 	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0, 0);
+		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);

 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
--
1.5.1.1.190.g74474
[PATCH 15/15] cfq-iosched: tighten queue request overlap condition
For tagged devices, allow overlap of requests if the idle window isn't
enabled on the current active queue.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/cfq-iosched.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 772df89..8093733 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -983,7 +983,8 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
-	if (cfqq->dispatched || timer_pending(&cfqd->idle_slice_timer)) {
+	if (timer_pending(&cfqd->idle_slice_timer) ||
+	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
 		cfqq = NULL;
 		goto keep_queue;
 	}
--
1.5.1.1.190.g74474
Re: [patch 5/7] genhd: send async notification on media change
Kristen Carlson Accardi wrote:
Send an uevent to user space to indicate that a media change event has
occurred.

Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Index: 2.6-git/block/genhd.c
===
--- 2.6-git.orig/block/genhd.c
+++ 2.6-git/block/genhd.c
@@ -643,6 +643,25 @@ struct seq_operations diskstats_op = {
 	.show	= diskstats_show
 };

+static void media_change_notify_thread(struct work_struct *work)
+{
+	struct gendisk *gd = container_of(work, struct gendisk, async_notify);
+	char event[] = "MEDIA_CHANGE=1";
+	char *envp[] = { event, NULL };
+
+	/*
+	 * set environment vars to indicate which event this is for
+	 * so that user space will know to go check the media status.
+	 */
+	kobject_uevent_env(&gd->kobj, KOBJ_CHANGE, envp);
+}
+
+void genhd_media_change_notify(struct gendisk *disk)
+{
+	schedule_work(&disk->async_notify);
+}
+EXPORT_SYMBOL_GPL(genhd_media_change_notify);

genhd might go away while async_notify work is in-flight. You'll need to
either grab a reference or wait for the work to finish in the release
routine.

--
tejun
Re: [REPORT] cfs-v4 vs sd-0.44
Ingo Molnar wrote:
* Rogan Dawes [EMAIL PROTECTED] wrote:

	if (p_to && p->wait_runtime > 0) {
		p->wait_runtime >>= 1;
		p_to->wait_runtime += p->wait_runtime;
	}

the above is the basic expression of: charge a positive bank balance.
[..] [note, due to the nanoseconds unit there's no rounding loss to
worry about.]

Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?

yes. But note that we'll only truly have to worry about that when we'll
have context-switching performance in that range - currently it's at
least 2-3 orders of magnitude above that. Microseconds seemed to me to
be too coarse already, that's why i picked nanoseconds and 64-bit
arithmetic for CFS.

	Ingo

I guess my point was if we somehow get to an odd number of nanoseconds,
we'd end up with rounding errors. I'm not sure if your algorithm will
ever allow that.

Rogan
Re: [RFC][PATCH -mm take4 2/6] support multiple logging
On Tue, 24 Apr 2007 17:14:28 +0900 Keiichi KII [EMAIL PROTECTED] wrote: On Fri, 20 Apr 2007 18:51:13 +0900 Keiichi KII [EMAIL PROTECTED] wrote: I started to do some cleanups and fixups here, but abandoned it when it was all getting a bit large. Here are some fixes against this patch: I'm going to fix my patches by following your reviews and send new patches on the LKML and the netdev ML in a few days. Well.. before you can finish this work we need to decide upon what the interface to userspace will be. - The miscdev isn't appropriate Why isn't miscdev appropriate? We just shouldn't use miscdev for networking conventionally? Yes it's rather odd, especially for networking. What does the miscdev _do_ anyway? Is it purely a target for the ioctls? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 10:19 +0200, Miklos Szeredi wrote: This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test. Ok, that makes sense I guess. Well, my narrow minded world view says it's not such a good idea, because it would again introduce the deadlock scenario, we're trying to avoid. I was only referring to the placement of the clipping; and exactly where that happens does not affect the deadlock. In a sense allowing a queue to go over the global limit just a little bit is a good thing. Actually the very original code does that: if writeback was started for write_chunk number of pages, then we allow ratelimit (8) _new_ pages to be dirtied, effectively ignoring the global limit. It might be time to get rid of that rate-limiting. balance_dirty_pages()'s fast path is not nearly as heavy as it used to be. All these fancy counter systems have removed quite a bit of iteration from there. That's why I've been saying, that the current code is so unfair: if there are lots of dirty pages to be written back to a particular device, then balance_dirty_pages() allows the dirty producer to make even more pages dirty, but if there are _no_ dirty pages for a device, and we are over the limit, then that dirty producer is allowed absolutely no new dirty pages until the global counts subside. Well, that got fixed on a per device basis with this patch, it is still true for multiple tasks writing to the same device. I'm still not quite sure what purpose the above soft limiting serves. It seems to just give advantage to writers, which managed to accumulate lots of dirty pages, and then can convert that into even more dirtyings. The queues only limit the actual in-flight writeback pages, balance_dirty_pages() considers all pages that might become writeback as well as those that are. 
Would it make sense to remove this behavior, and ensure that balance_dirty_pages() doesn't return until the per-queue limits have been complied with? I don't think that will help, balance_dirty_pages drives the queues. That is, it converts pages from mere dirty to writeback. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cpufreq default governor
Hi William, On 24/04/07, William Heimbigner [EMAIL PROTECTED] wrote: Question: is there some reason that kconfig does not allow for default governors of conservative/ondemand/powersave? Performance? I'm not aware of any reason why one of those governors could not be used as default. My hardware doesn't work properly with ondemand governor. I hear strange noises when frequency is changed. William Heimbigner [EMAIL PROTECTED] Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (PL) (http://www.stardust.webpages.pl/ltg/) LTG - Linux Testers Group (EN) (http://www.stardust.webpages.pl/linux_testers_group_en/) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
On Tue, Apr 24, 2007 at 03:55:06PM +1000, Paul Mackerras wrote: Christoph Hellwig writes: The first question is obviously, is this really something we want? spawning kernel thread on demand without reaping them properly seems quite dangerous. What specifically has to be done to reap a kernel thread? Are you concerned about the number of threads, or about having zombies hanging around? I'm mostly concerned about number of threads and possible leakage of threads. Linas already explained it's not a problem in this case, so it's covered. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [REPORT] cfs-v5 vs sd-0.46
Here i'm assuming that the vmstats are directly comparable: that your number-crunchers behave the same during the full runtime - is that correct? Yes, basically it does (disregarding small fluctuations) I'll see whether I can produce some type of absolute performance measure as well. Thinking about it I guess this should be fairly simple to implement. Best, Michael -- Technosis GmbH, Geschäftsführer: Michael Gerdau, Tobias Dittmar Sitz Hamburg; HRB 89145 Amtsgericht Hamburg Vote against SPAM - see http://www.politik-digital.de/spam/ Michael Gerdau email: [EMAIL PROTECTED] GPG-keys available on request or at public keyserver pgprODjr3hqXe.pgp Description: PGP signature
Re: [REPORT] cfs-v5 vs sd-0.46
* Michael Gerdau [EMAIL PROTECTED] wrote:

Here i'm assuming that the vmstats are directly comparable: that your
number-crunchers behave the same during the full runtime - is that
correct?

Yes, basically it does (disregarding small fluctuations)

ok, good.

I'll see whether I can produce some type of absolute performance
measure as well. Thinking about it I guess this should be fairly simple
to implement.

oh, you are writing the number-cruncher?

In general the 'best' performance metrics for scheduler validation are
the ones where you have immediate feedback: i.e. some ops/sec (or ops
per minute) value in some readily accessible place, or some
milliseconds-per-100,000-ops type of metric - whichever lends itself
better to the workload at hand. If you measure time then the best is to
use long long and nanoseconds and the monotonic clocksource:

unsigned long long rdclock(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

(link to librt via -lrt to pick up clock_gettime())

The cost of a clock_gettime() (or of a gettimeofday()) can be a couple
of microseconds on some systems, so it shouldn't be done too frequently.
Plus an absolute metric of "the whole workload took X.Y seconds" is
useful too.

	Ingo
[PATCH -mm] utrace: fix double free re __rcu_process_callbacks()
The following patch fixes a double free manifesting itself as a crash in
__rcu_process_callbacks():

	http://marc.info/?l=linux-kernel&m=117518764517017&w=2
	https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229112

The problem is with check_dead_utrace() conditionally scheduling struct
utrace for freeing but not clearing the struct task_struct::utrace
pointer, leaving it reachable:

	tsk->utrace_flags = flags;
	if (flags)
		spin_unlock(&utrace->lock);
	else
		rcu_utrace_free(utrace);

OTOH, utrace_release_task() first clears the ->utrace pointer, then
frees struct utrace itself.

Roland inserted some debugging into 2.6.21-rc6-mm1 so that the
aforementioned double free couldn't be reproduced without seeing "BUG at
kernel/utrace.c:176" first. It triggers if one struct utrace were passed
to rcu_utrace_free() a second time.

With the patch applied I no longer see¹ the BUG message and double frees
on 2-way P3, 8-way ia64, Core 2 Duo boxes. Testcase is at the first
link.

I _think_ it adds a leak if utrace_reap() takes the branch without
freeing but, well, I hope Roland will give me some clue on how to fix it
too.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 kernel/utrace.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

¹ But I see a whole can of other bugs! I think they were already lurking
  but weren't easily reproducible without hitting the double-free first.
  FWIW, it's

	BUG_ON(!list_empty(&tsk->ptracees));
	oops at the beginning of remove_engine()
	NULL ->report_quiesce call which is absent in ptrace utrace ops
	BUG_ON(tracehook_check_released(p));

--- a/kernel/utrace.c
+++ b/kernel/utrace.c
@@ -205,7 +205,6 @@ utrace_clear_tsk(struct task_struct *tsk
 	if (utrace->u.live.signal == NULL) {
 		task_lock(tsk);
 		if (likely(tsk->utrace != NULL)) {
-			rcu_assign_pointer(tsk->utrace, NULL);
 			tsk->utrace_flags &= UTRACE_ACTION_NOREAP;
 		}
 		task_unlock(tsk);
@@ -305,10 +304,7 @@ check_dead_utrace(struct task_struct *ts
 	}

 	tsk->utrace_flags = flags;
-	if (flags)
-		spin_unlock(&utrace->lock);
-	else
-		rcu_utrace_free(utrace);
+	spin_unlock(&utrace->lock);

 	/*
 	 * Now we're finished updating the utrace state.
Re: [REPORT] cfs-v5 vs sd-0.46
oh, you are writing the number-cruncher? Yep. In general the 'best' performance metrics for scheduler validation are the ones where you have immediate feedback: i.e. some ops/sec (or ops per minute) value in some readily accessible place, or some milliseconds-per-100,000 ops type of metric - whichever lends itself better to the workload at hand. I'll have to see whether that works out. I don't have an easily available ops/sec but I guess I could create something similar. If you measure time then the best is to use long long and nanoseconds and the monotonic clocksource: [snip] Thanks, I will implement that, for Linux anyway. Plus an absolute metric of the whole workload took X.Y seconds is useful too. That's the easiest to come by and is already available. Best, Michael -- Technosis GmbH, Geschäftsführer: Michael Gerdau, Tobias Dittmar Sitz Hamburg; HRB 89145 Amtsgericht Hamburg Vote against SPAM - see http://www.politik-digital.de/spam/ Michael Gerdau email: [EMAIL PROTECTED] GPG-keys available on request or at public keyserver pgpwqhqmZDVz7.pgp Description: PGP signature
Re: [PATCH 10/10] mm: per device dirty threshold
This is probably a reasonable thing to do but it doesn't feel like the
right place. I think get_dirty_limits should return the raw threshold,
and balance_dirty_pages should do both tests - the bdi-local test and
the system-wide test.

Ok, that makes sense I guess.

Well, my narrow minded world view says it's not such a good idea,
because it would again introduce the deadlock scenario we're trying to
avoid.

I was only referring to the placement of the clipping; and exactly
where that happens does not affect the deadlock.

OK.

In a sense allowing a queue to go over the global limit just a little
bit is a good thing. Actually the very original code does that: if
writeback was started for write_chunk number of pages, then we allow
ratelimit (8) _new_ pages to be dirtied, effectively ignoring the
global limit.

It might be time to get rid of that rate-limiting.
balance_dirty_pages()'s fast path is not nearly as heavy as it used to
be. All these fancy counter systems have removed quite a bit of
iteration from there.

Hmm. The rate limiting probably makes lots of sense for
dirty_exceeded==0, when ratelimit can be a nice large value. For
dirty_exceeded==1 it may make sense to disable ratelimiting, OTOH
having a granularity of 8 pages probably doesn't matter, because the
granularity of the percpu counter is usually larger (except on UP).

That's why I've been saying, that the current code is so unfair: if
there are lots of dirty pages to be written back to a particular
device, then balance_dirty_pages() allows the dirty producer to make
even more pages dirty, but if there are _no_ dirty pages for a device,
and we are over the limit, then that dirty producer is allowed
absolutely no new dirty pages until the global counts subside.

Well, that got fixed on a per device basis with this patch, it is
still true for multiple tasks writing to the same device.

Yes, this is the part of this patchset I'm personally interested in ;)

I'm still not quite sure what purpose the above soft limiting serves.
It seems to just give advantage to writers, which managed to accumulate
lots of dirty pages, and then can convert that into even more
dirtyings.

The queues only limit the actual in-flight writeback pages,
balance_dirty_pages() considers all pages that might become writeback
as well as those that are.

Would it make sense to remove this behavior, and ensure that
balance_dirty_pages() doesn't return until the per-queue limits have
been complied with?

I don't think that will help, balance_dirty_pages drives the queues.
That is, it converts pages from mere dirty to writeback.

Yes. But current logic says, that if you convert write_chunk dirty to
writeback, you are allowed to dirty ratelimit more.

 D: number of dirty pages
 W: number of writeback pages
 L: global limit
 C: write_chunk = ratelimit_pages * 1.5
 R: ratelimit

If D+W > L, then R = 8

Let's assume, that D == L and W == 0. And that all of the dirty pages
belong to a single device. Also for simplicity, let's assume an
infinite length queue, and a slow device.

Then while converting the dirty pages to writeback, D / C * R new dirty
pages can be created. So when all existing dirty have been converted:

 D = L / C * R
 W = L
 D + W = L * (1 + R / C)

So we see, that we're now even more above the limit than before the
conversion. This means, that we starve writers to other devices, which
don't have as many dirty pages, because until the slow device doesn't
finish these writes they will not get to do anything.

Your patch helps this in that if the other writers have an empty queue
and no dirty, they will be allowed to slowly start writing. But they
will not gain their full share until the slow dirty-hog goes below the
global limit, which may take some time.

So I think the logical thing to do, is if the dirty-hog is over its
queue limit, don't let it dirty any more until its dirty+writeback go
below the limit. That allows other devices to more quickly gain their
share of dirty pages.

Miklos
RE: sendfile to nonblocking socket
David Schwartz writes:

You have a misunderstanding about the semantics of 'sendfile'. The
'sendfile' function is just a more efficient version of a read followed
by a write. If you did a read followed by a write, it would block as
well (in the read).

DS

The sendfile function is not just a more efficient version of a read
followed by a write. It reads from one fd and writes to another at the
same time. Please try to read 2G, and then write 2G - and see how much
memory you would need and how much time you would lose reading 2G from
disk without writing it to the socket.

You are correct. What I meant to say was that it's just a more
efficient version of 'mmap'ing a file and then 'write'ing from the
'mmap'. The 'write' to a non-blocking socket can still 'block' on disk
I/O.

If you know a more efficient method to transfer a file from disk to
network - please advise. Now all I want is a really non-blocking
sendfile. Currently sendfile is non-blocking on the network side, but
not on disk I/O. And when I have a network faster than the disk - I get
blocked.

There are many different techniques and which is correct depends on
what direction you want to go. POSIX asynchronous I/O is one
possibility. Threads plus epoll is another. It really depends upon how
much performance you need, how much complexity you can tolerate, and
how portable you need to be.

DS
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 11:14 +0200, Miklos Szeredi wrote:

I'm still not quite sure what purpose the above soft limiting serves.
It seems to just give advantage to writers, which managed to accumulate
lots of dirty pages, and then can convert that into even more
dirtyings.

The queues only limit the actual in-flight writeback pages,
balance_dirty_pages() considers all pages that might become writeback
as well as those that are.

Would it make sense to remove this behavior, and ensure that
balance_dirty_pages() doesn't return until the per-queue limits have
been complied with?

I don't think that will help, balance_dirty_pages drives the queues.
That is, it converts pages from mere dirty to writeback.

Yes. But current logic says, that if you convert write_chunk dirty to
writeback, you are allowed to dirty ratelimit more.

 D: number of dirty pages
 W: number of writeback pages
 L: global limit
 C: write_chunk = ratelimit_pages * 1.5
 R: ratelimit

If D+W > L, then R = 8

Let's assume, that D == L and W == 0. And that all of the dirty pages
belong to a single device. Also for simplicity, let's assume an
infinite length queue, and a slow device.

Then while converting the dirty pages to writeback, D / C * R new dirty
pages can be created. So when all existing dirty have been converted:

 D = L / C * R
 W = L
 D + W = L * (1 + R / C)

So we see, that we're now even more above the limit than before the
conversion. This means, that we starve writers to other devices, which
don't have as many dirty pages, because until the slow device doesn't
finish these writes they will not get to do anything.

Your patch helps this in that if the other writers have an empty queue
and no dirty, they will be allowed to slowly start writing. But they
will not gain their full share until the slow dirty-hog goes below the
global limit, which may take some time.

So I think the logical thing to do, is if the dirty-hog is over its
queue limit, don't let it dirty any more until its dirty+writeback go
below the limit. That allows other devices to more quickly gain their
share of dirty pages.

Ahh, now I see; I had totally blocked out these few lines:

		pages_written += write_chunk - wbc.nr_to_write;
		if (pages_written >= write_chunk)
			break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I
think he agrees.

Shall we just kill those?
Re: [1/3] 2.6.21-rc7: known regressions (v2)
On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:

This email lists some known regressions in Linus' tree compared to
2.6.20.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affected subsystem or driver, a patch of
yours caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.

Subject    : gammu no longer works
References : http://lkml.org/lkml/2007/4/20/84
Submitter  : Wolfgang Erig [EMAIL PROTECTED]
Status     : unknown

I've asked for more information about this, and so far am not sure it's
a real problem.

It is a real problem for me. I tried this on 2 different boxes with the
same behaviour. No sync between my Nokia mobile and Linux with the
latest kernel :( Which additional information is useful for this
problem?

Wolfgang

$ gammu textall --backup backup
Press Ctrl+C to break...
[Gammu      - 1.10.0 built 10:15:07 Mar 13 2007 in gcc 4.1]
[Connection - fbuspl2303]
[Model type - 3100]
[Device     - /dev/ttyUSB0]
[Run on     - Linux, kernel 2.6.21-rc7-g80d74d51 (#9 SMP Wed Apr 18 21:41:41 CEST 2007)]
[Module     - 1100|1100a|1100b|2650|3100|3100b|3105|3108|3200|3200a|3205|3220|3300|3510|3510i|3530|3589i|3590|3595|5100|5140|5140i|6020|6021|6030|6100|6101|6103|6111|6125|6131|6170|6200|6220|6230|6230i|6233|6234|6270|6280|6310|6310i|6385|6510|6610|6610i|6800|6810|6820|6822|7200|7210|7250|7250i|7260|7270|7360|7370|7600|8310|8390|8910|8910i]
Setting speed to 19200
I/O possible
Re: [PATCH 8/8] Per-container pages reclamation
Pavel Emelianov wrote:
Implement try_to_free_pages_in_container() to free the pages in a
container that has run out of memory. The scan_control->isolate_pages()
function isolates the container pages only.

Pavel, I've just started playing around with these patches, I preferred
the approach of v1. Please see below

+static unsigned long isolate_container_pages(unsigned long nr_to_scan,
+		struct list_head *src, struct list_head *dst,
+		unsigned long *scanned, struct zone *zone)
+{
+	unsigned long nr_taken = 0;
+	struct page *page;
+	struct page_container *pc;
+	unsigned long scan;
+	LIST_HEAD(pc_list);
+
+	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+		pc = list_entry(src->prev, struct page_container, list);
+		page = pc->page;
+		if (page_zone(page) != zone)
+			continue;

shrink_zone() will walk all pages looking for pages belonging to this
container and this slows down the reclaim quite a bit. Although we've
reused code, we've ended up walking the entire list of the zone to find
pages belonging to a particular container, this was the same problem I
had with my RSS controller patches.

+
+		list_move(&pc->list, &pc_list);
+

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH 10/10] mm: per device dirty threshold
Ahh, now I see; I had totally blocked out these few lines:

		pages_written += write_chunk - wbc.nr_to_write;
		if (pages_written >= write_chunk)
			break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I
think he agrees.

Shall we just kill those?

I think we should. Although I'm a little afraid, that Akpm will tell me
again, that I'm a stupid git, and that those lines are in fact vitally
important ;)

Miklos
Re: [Devel] [PATCH -mm] utrace: fix double free re __rcu_process_callbacks()
Roland, can you please help with it? The current utrace state is far from being stable; RHEL5 and -mm kernels can be quite easily crashed with some of the exploits we have collected so far. Alexey can help you with any information needed - call traces, test cases - but without your help we can't fix it all ourselves :/

Thanks, Kirill

Alexey Dobriyan wrote:

The following patch fixes a double free manifesting itself as a crash in __rcu_process_callbacks():

http://marc.info/?l=linux-kernel&m=117518764517017&w=2
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229112

The problem is with check_dead_utrace() conditionally scheduling struct utrace for freeing but not clearing the struct task_struct::utrace pointer, leaving it reachable:

	tsk->utrace_flags = flags;
	if (flags)
		spin_unlock(&utrace->lock);
	else
		rcu_utrace_free(utrace);

OTOH, utrace_release_task() first clears the ->utrace pointer, then frees struct utrace itself.

Roland inserted some debugging into 2.6.21-rc6-mm1 so that the aforementioned double free couldn't be reproduced without seeing "BUG at kernel/utrace.c:176" first. It triggers if one struct utrace is passed to rcu_utrace_free() a second time. With the patch applied I no longer see¹ the BUG message and double frees on 2-way P3, 8-way ia64, and Core 2 Duo boxes. Testcase is at the first link.

I _think_ it adds a leak if utrace_reap() takes the branch without freeing but, well, I hope Roland will give me some clue on how to fix that too.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---
 kernel/utrace.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

¹ But I see a whole can of other bugs! I think they were already lurking but weren't easily reproducible without hitting the double-free first.
FWIW, it's

	BUG_ON(!list_empty(&tsk->ptracees));

oops at the beginning of remove_engine()

NULL ->report_quiesce call which is absent in ptrace utrace ops

	BUG_ON(tracehook_check_released(p));

--- a/kernel/utrace.c
+++ b/kernel/utrace.c
@@ -205,7 +205,6 @@ utrace_clear_tsk(struct task_struct *tsk
 	if (utrace->u.live.signal == NULL) {
 		task_lock(tsk);
 		if (likely(tsk->utrace != NULL)) {
-			rcu_assign_pointer(tsk->utrace, NULL);
 			tsk->utrace_flags &= UTRACE_ACTION_NOREAP;
 		}
 		task_unlock(tsk);
@@ -305,10 +304,7 @@ check_dead_utrace(struct task_struct *ts
 	}
 
 	tsk->utrace_flags = flags;
-	if (flags)
-		spin_unlock(&utrace->lock);
-	else
-		rcu_utrace_free(utrace);
+	spin_unlock(&utrace->lock);
 
 	/*
 	 * Now we're finished updating the utrace state.

___
Devel mailing list
[EMAIL PROTECTED]
https://openvz.org/mailman/listinfo/devel
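[Editor's note: the bug pattern fixed above - one path conditionally frees an object but never clears the owner's pointer, so a later release path frees it again - can be shown in a standalone toy model. This is not the utrace code; "freeing" is modeled with a counter so the double free is observable without undefined behavior, and all names are illustrative.]

```c
#include <assert.h>
#include <stddef.h>

struct utrace_like { int free_calls; };

/* Model free() with a counter instead of really freeing. */
static void fake_free(struct utrace_like *u) { u->free_calls++; }

/* Buggy shape: conditionally frees but leaves *slot pointing at the
 * freed object, so it stays reachable. */
static void check_dead_buggy(struct utrace_like **slot, int flags)
{
	if (!flags)
		fake_free(*slot);	/* *slot not cleared! */
}

/* Fixed shape (as in the patch): never free here; only the release
 * path below owns the object's lifetime. */
static void check_dead_fixed(struct utrace_like **slot, int flags)
{
	(void)slot;
	(void)flags;			/* would just drop the lock */
}

/* Release path: frees whatever the slot still references, then clears it. */
static void release_task(struct utrace_like **slot)
{
	if (*slot != NULL) {
		fake_free(*slot);
		*slot = NULL;
	}
}
```

With the buggy variant the object is "freed" twice; with the fixed variant exactly once.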
Re: [PATCH]Fix parsing kernelcore boot option for ia64
Subject: Check zone boundaries when freeing bootmem

Zone boundaries do not have to be aligned to MAX_ORDER_NR_PAGES.

Hmm. I don't understand here yet... Could you explain more?

This issue occurs only when ZONE_MOVABLE is specified. If its boundary is aligned to MAX_ORDER automatically, I guess the user will not mind it. From the memory hotplug view, I prefer section size alignment to make the code simple. :-P

However, during boot, there is an implicit assumption that they are aligned to a BITS_PER_LONG boundary when freeing pages as quickly as possible.

This patch checks the zone boundaries when freeing pages from the bootmem allocator.

Anyway, the patch works well.

Bye.

-- 
Yasunori Goto
[1/2] w1: allow bus master to have reset and byte ops.
Signed-off-by: Matt Reimer [EMAIL PROTECTED]
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
---
 drivers/w1/w1_int.c | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/w1/w1_int.c b/drivers/w1/w1_int.c
index 357a2e0..258defd 100644
--- a/drivers/w1/w1_int.c
+++ b/drivers/w1/w1_int.c
@@ -100,7 +100,8 @@ int w1_add_master_device(struct w1_bus_master *master)
 	/* validate minimum functionality */
 	if (!(master->touch_bit && master->reset_bus) &&
-	    !(master->write_bit && master->read_bit)) {
+	    !(master->write_bit && master->read_bit) &&
+	    !(master->write_byte && master->read_byte && master->reset_bus)) {
 		printk(KERN_ERR "w1_add_master_device: invalid function set\n");
 		return(-EINVAL);
 	}

-- 
Evgeniy Polyakov
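[Editor's note: the validation rule this patch extends - a bus master is acceptable if it provides at least one complete set of operations - can be restated in a standalone sketch. The struct below only mirrors the fields the check looks at; it is not the kernel's struct w1_bus_master.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the function-pointer fields being checked. */
struct bus_ops {
	void *touch_bit, *reset_bus;
	void *write_bit, *read_bit;
	void *write_byte, *read_byte;
};

/* A master is usable if it offers any one complete interface. */
static bool ops_valid(const struct bus_ops *m)
{
	if (m->touch_bit && m->reset_bus)
		return true;	/* touch-level interface */
	if (m->write_bit && m->read_bit)
		return true;	/* bit-level interface */
	if (m->write_byte && m->read_byte && m->reset_bus)
		return true;	/* byte-level interface (the case the patch adds) */
	return false;
}
```

Before the patch, a byte-only master (write_byte/read_byte/reset_bus but no bit ops) would have been rejected; the added clause accepts it.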
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

Ahh, now I see; I had totally blocked out these few lines:

	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Although I'm a little afraid that Akpm will tell me again that I'm a stupid git, and that those lines are in fact vitally important ;)

It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes.

hm, we ask the process to write write_chunk pages each go around the loop. So if it wrote write_chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare and doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think.
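[Editor's note: Andrew's bound can be checked with a toy model of the exit condition under discussion. This is not the kernel loop, just its arithmetic: each pass "writes back" some pages and the loop exits once pages_written reaches write_chunk.]

```c
#include <assert.h>

/* first_pass: pages written on the first pass; later passes write a
 * full chunk.  Returns the total written before the break fires. */
static int total_written(int write_chunk, int first_pass)
{
	int pages_written = 0;

	pages_written += first_pass;
	while (pages_written < write_chunk)	/* i.e. !(pages_written >= write_chunk) */
		pages_written += write_chunk;
	return pages_written;
}
```

The worst case is a first pass that leaves us one page short of the threshold, followed by one full pass: write_chunk*2 - 1 total, matching the bound in the mail; a half-chunk first pass gives the write_chunk*1.5 example.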
[2/2] Driver for the Maxim DS1WM, a 1-wire bus master ASIC core.
Signed-off-by: Matt Reimer [EMAIL PROTECTED]
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
---
 drivers/w1/masters/Kconfig  |   8 +
 drivers/w1/masters/Makefile |   2 +-
 drivers/w1/masters/ds1wm.c  | 463 +++
 include/linux/ds1wm.h       |  13 ++
 4 files changed, 485 insertions(+), 1 deletions(-)
 create mode 100644 drivers/w1/masters/ds1wm.c
 create mode 100644 include/linux/ds1wm.h

diff --git a/drivers/w1/masters/Kconfig b/drivers/w1/masters/Kconfig
index 2fb4255..ca44f9e 100644
--- a/drivers/w1/masters/Kconfig
+++ b/drivers/w1/masters/Kconfig
@@ -35,5 +35,13 @@ config W1_MASTER_DS2482
 	  This driver can also be built as a module. If so, the module
 	  will be called ds2482.
 
+config W1_DS1WM
+	tristate "Maxim DS1WM 1-wire busmaster"
+	depends on W1
+	help
+	  Say Y here to enable the DS1WM 1-wire driver, such as that
+	  in HP iPAQ devices like h5xxx, h2200, and ASIC3-based like
+	  hx4700.
+
 endmenu

diff --git a/drivers/w1/masters/Makefile b/drivers/w1/masters/Makefile
index 4cee256..a9e45fb 100644
--- a/drivers/w1/masters/Makefile
+++ b/drivers/w1/masters/Makefile
@@ -5,4 +5,4 @@
 obj-$(CONFIG_W1_MASTER_MATROX)	+= matrox_w1.o
 obj-$(CONFIG_W1_MASTER_DS2490)	+= ds2490.o
 obj-$(CONFIG_W1_MASTER_DS2482)	+= ds2482.o
-
+obj-$(CONFIG_W1_DS1WM)		+= ds1wm.o

diff --git a/drivers/w1/masters/ds1wm.c b/drivers/w1/masters/ds1wm.c
new file mode 100644
index 000..cea74e1
--- /dev/null
+++ b/drivers/w1/masters/ds1wm.c
@@ -0,0 +1,463 @@
+/*
+ * 1-wire busmaster driver for DS1WM and ASICs with embedded DS1WMs
+ * such as HP iPAQs (including h5xxx, h2200, and devices with ASIC3
+ * like hx4700).
+ *
+ * Copyright (c) 2004-2005, Szabolcs Gyurko [EMAIL PROTECTED]
+ * Copyright (c) 2004-2007, Matt Reimer [EMAIL PROTECTED]
+ *
+ * Use consistent with the GNU GPL is permitted,
+ * provided that this copyright notice is
+ * preserved in its entirety in all copies and derived works.
+ */
+
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/pm.h>
+#include <linux/platform_device.h>
+#include <linux/clk.h>
+#include <linux/delay.h>
+#include <linux/ds1wm.h>
+
+#include <asm/io.h>
+
+#include "../w1.h"
+#include "../w1_int.h"
+
+
+#define DS1WM_CMD	0x00	/* R/W 4 bits command */
+#define DS1WM_DATA	0x01	/* R/W 8 bits, transmit/receive buffer */
+#define DS1WM_INT	0x02	/* R/W interrupt status */
+#define DS1WM_INT_EN	0x03	/* R/W interrupt enable */
+#define DS1WM_CLKDIV	0x04	/* R/W 5 bits of divisor and pre-scale */
+
+#define DS1WM_CMD_1W_RESET  (1 << 0)	/* force reset on 1-wire bus */
+#define DS1WM_CMD_SRA	    (1 << 1)	/* enable Search ROM accelerator mode */
+#define DS1WM_CMD_DQ_OUTPUT (1 << 2)	/* write only - forces bus low */
+#define DS1WM_CMD_DQ_INPUT  (1 << 3)	/* read only - reflects state of bus */
+
+#define DS1WM_INT_PD	(1 << 0)	/* presence detect */
+#define DS1WM_INT_PDR	(1 << 1)	/* presence detect result */
+#define DS1WM_INT_TBE	(1 << 2)	/* tx buffer empty */
+#define DS1WM_INT_TSRE	(1 << 3)	/* tx shift register empty */
+#define DS1WM_INT_RBF	(1 << 4)	/* rx buffer full */
+#define DS1WM_INT_RSRF	(1 << 5)	/* rx shift register full */
+
+#define DS1WM_INTEN_EPD	  (1 << 0)	/* enable presence detect int */
+#define DS1WM_INTEN_IAS	  (1 << 1)	/* INTR active state */
+#define DS1WM_INTEN_ETBE  (1 << 2)	/* enable tx buffer empty int */
+#define DS1WM_INTEN_ETMT  (1 << 3)	/* enable tx shift register empty int */
+#define DS1WM_INTEN_ERBF  (1 << 4)	/* enable rx buffer full int */
+#define DS1WM_INTEN_ERSRF (1 << 5)	/* enable rx shift register full int */
+#define DS1WM_INTEN_DQO	  (1 << 6)	/* enable direct bus driving ops
+					   (undocumented), Szabolcs Gyurko */
+
+
+#define DS1WM_TIMEOUT (HZ * 5)
+
+static struct {
+	unsigned long freq;
+	unsigned long divisor;
+} freq[] = {
+	{ 400, 0x8 },
+	{ 500, 0x2 },
+	{ 600, 0x5 },
+	{ 700, 0x3 },
+	{ 800, 0xc },
+	{ 1000, 0x6 },
+	{ 1200, 0x9 },
+	{ 1400, 0x7 },
+	{ 1600, 0x10 },
+	{ 2000, 0xa },
+	{ 2400, 0xd },
+	{ 2800, 0xb },
+	{ 3200, 0x14 },
+	{ 4000, 0xe },
+	{ 4800, 0x11 },
+	{ 5600, 0xf },
+	{ 6400, 0x18 },
+	{ 8000, 0x12 },
+	{ 9600, 0x15 },
+	{ 11200, 0x13 },
+	{ 12800, 0x1c },
+};
+
+struct ds1wm_data {
+	void		*map;
+	int		bus_shift;	/* # of shifts to calc register offsets */
+	struct platform_device *pdev;
+	struct ds1wm_platform_data *pdata;
+	int		irq;
+	struct clk	*clk;
+	int		slave_present;
+	void		*reset_complete;
+	void
Re: [PATCH]Fix parsing kernelcore boot option for ia64
On Tue, 24 Apr 2007, Yasunori Goto wrote:

Subject: Check zone boundaries when freeing bootmem

Zone boundaries do not have to be aligned to MAX_ORDER_NR_PAGES.

Hmm. I don't understand here yet... Could you explain more?

Nodes are required to be MAX_ORDER_NR_PAGES-aligned for the buddy algorithm to work, but zones can be at any alignment because the page_is_buddy() check compares the zone_id of two buddies when merging. As zones are generally aligned anyway, it was never noticed that the bootmem allocator assumes zones are at least order-5 aligned on 32 bit and order-6 aligned on 64 bit.

This issue occurs only when ZONE_MOVABLE is specified.

Yes, because it can be sized to any value. At the moment, zones are aligned to MAX_ORDER_NR_PAGES, so it was not noticed that bootmem makes assumptions on zone alignment.

If its boundary is aligned to MAX_ORDER automatically, I guess user will not mind it.

Probably not. They will get a different amount of memory usable by the kernel than they asked for, but it doesn't really matter. Huge pages generally need MAX_ORDER_NR_PAGES base pages as well, so the alignment doesn't hurt there.

From memory hotplug view, I prefer section size alignment to make simple code. :-P

That's fair. I'll roll up a patch that aligns to MAX_ORDER_NR_PAGES to begin with and then decide if it should align to section size on SPARSEMEM or not.

However, during boot, there is an implicit assumption that they are aligned to a BITS_PER_LONG boundary when freeing pages as quickly as possible.

This patch checks the zone boundaries when freeing pages from the bootmem allocator.

Anyway, the patch works well.

Right, I'll resend it to linux-mm as a standalone patch later because it fixes a correctness issue, albeit one that is easily avoided.

Bye.
Thanks
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
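[Editor's note: the BITS_PER_LONG assumption Mel describes can be sketched in userspace C. This is a toy model of the bootmem-style fast path, not kernel code: pages are freed a whole word of pfns at a time only when the word is aligned and lies entirely inside the zone; a zone boundary that is not word-aligned forces the remainder onto the page-by-page path. 'zone_end' and the helper name are illustrative.]

```c
#include <assert.h>

#define BITS_PER_LONG 64

/* Returns how many pages in [pfn, zone_end) had to be freed one at a
 * time because the word-at-a-time fast path could not be used. */
static int slow_frees(unsigned long pfn, unsigned long zone_end)
{
	int singles = 0;

	/* fast path: free a whole word of pages at once, but only while
	 * pfn is word-aligned and the whole word fits inside the zone */
	while ((pfn % BITS_PER_LONG) == 0 && pfn + BITS_PER_LONG <= zone_end)
		pfn += BITS_PER_LONG;

	/* boundary check forces the rest onto the page-by-page path */
	while (pfn < zone_end) {
		singles++;
		pfn++;
	}
	return singles;
}
```

A word-aligned zone needs no single-page frees; a boundary two pages past a word needs exactly two; a start pfn that is not word-aligned never enters the fast path at all.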
Re: [PATCH] kthread: Enhance kthread_stop to abort interruptible sleeps
On Fri, 13 Apr 2007 21:13:13 -0600 [EMAIL PROTECTED] (Eric W. Biederman) wrote:

This patch reworks kthread_stop so it is more flexible and it causes the target kthread to abort interruptible sleeps, allowing a larger class of kernel threads to use the kthread API.

The changes start by defining TIF_KTHREAD_STOP on all architectures. TIF_KTHREAD_STOP is a per-process flag that I can set from another process to indicate that a kernel thread should stop.

wake_up_process in kthread_stop has been replaced by signal_wake_up, ensuring that the kernel thread, if sleeping, is woken up in a timely manner and with TIF_SIGNAL_PENDING set, which causes us to break out of interruptible sleeps. recalc_signal_pending was modified to keep TIF_SIGNAL_PENDING set for as long as TIF_KTHREAD_STOP is set.

Arbitrary paths to do_exit are now allowed. I have placed a completion on the thread stack and pointed vfork_done at it; when mm_release is called from do_exit the completion will be called. Since the completion is stored on the stack it is important that kthread() now calls do_exit, ensuring the stack frame that holds the completion is never released, and so that our exit_code is certain to make it unchanged all the way to do_exit.

To allow kthread_stop to read the process exit code when exit_mm wakes it up, I have moved the setting of exit_code to the beginning of do_exit.

This patch causes this oops:

http://userweb.kernel.org/~akpm/s5000508.jpg

with this config:

http://userweb.kernel.org/~akpm/config-x.txt
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 03:00 -0700, Andrew Morton wrote:
On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

Ahh, now I see; I had totally blocked out these few lines:

	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Although I'm a little afraid that Akpm will tell me again that I'm a stupid git, and that those lines are in fact vitally important ;)

It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes.

hm, we ask the process to write write_chunk pages each go around the loop. So if it wrote write_chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare and doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think.

Right, but I think the problem is that it's dirty -> writeback, not dirty -> writeback completed. Ie. they don't guarantee progress; it could be that the total nr_reclaimable + nr_writeback will steadily increase due to this break.

How about ensuring that vm_writeout_total increases by at least 2*sync_writeback_pages() during our stay in balance_dirty_pages(). That way we have the guarantee that more pages get written out than can be dirtied.
Re: [PATCH] mm: PageLRU can be non-atomic bit operation
Hisashi Hifumi wrote:

At 11:47 07/04/24, Nick Piggin wrote:

As Hugh points out, we must have atomic ops here, so changing the generic code to use the __ version is wrong. However, if there is a faster way that i386 can perform the atomic variant, then doing so will speed up the generic code without breaking other architectures.

Do you mean writing page-flags.h specific to i386, improving generic code without breaking other architectures?

I meant improving the i386-specific bitops code. However, if there is some variant of operation that is not captured by the current bitop API but could provide a useful speedup of common page flag manipulations, then you might consider extending the bitop API and making page-flags.h use that new operation.

-- 
SUSE Labs, Novell Inc.
Re: [PATCH 10/10] mm: per device dirty threshold
Ahh, now I see; I had totally blocked out these few lines:

	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Although I'm a little afraid that Akpm will tell me again that I'm a stupid git, and that those lines are in fact vitally important ;)

It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes.

hm, we ask the process to write write_chunk pages each go around the loop. So if it wrote write_chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare and doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think.

Right, but I think the problem is that it's dirty -> writeback, not dirty -> writeback completed. Ie. they don't guarantee progress; it could be that the total nr_reclaimable + nr_writeback will steadily increase due to this break.

How about ensuring that vm_writeout_total increases by at least 2*sync_writeback_pages() during our stay in balance_dirty_pages(). That way we have the guarantee that more pages get written out than can be dirtied.

No, because that's a global counter, which many writers could be looking at. We'd need a per-task writeout counter, but when finishing the write we don't know anymore which task it was performed for.

Miklos
Re: [patch 1/7] libata: check for AN support
Sorry for replying to Alan's reply, I missed the original mail.

+#define ata_id_has_AN(id)	\
+	((id[76] && (~id[76])) & ((id)[78] & (1 << 5)))

(a && ~a) & (b & 32)

I don't think that does what you think it does, because at that point it's a funny way to write 0 ((0 or 1) binary-and (0 or 32)). I'm not even sure what it is you want. If for the first part you wanted (id[76] != 0x00 && id[76] != 0xff), please write just that, thanks :-)

OG.
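[Editor's note: the pitfall Olivier points out can be demonstrated in isolation. A logical test yields 0 or 1, and bitwise-ANDing that with a masked higher bit such as (b & 32) is always 0, because bit 0 and bit 5 never overlap. This sketch uses illustrative helper names, not the libata macro.]

```c
#include <assert.h>

/* Mirrors the shape under review: (0 or 1) & (0 or 32) == 0 always. */
static int buggy_check(unsigned short w76, unsigned short w78)
{
	return (w76 && ~w76) & (w78 & (1 << 5));
}

/* What the reviewer suggests writing instead: an explicit validity
 * test on word 76 combined with the bit-5 test on word 78. */
static int fixed_check(unsigned short w76, unsigned short w78)
{
	return (w76 != 0x0000 && w76 != 0xffff) && (w78 & (1 << 5));
}
```

Even with a valid word 76 and bit 5 set in word 78, the buggy form evaluates to 0.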
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 12:19 +0200, Miklos Szeredi wrote:

Ahh, now I see; I had totally blocked out these few lines:

	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Although I'm a little afraid that Akpm will tell me again that I'm a stupid git, and that those lines are in fact vitally important ;)

It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes.

hm, we ask the process to write write_chunk pages each go around the loop. So if it wrote write_chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare and doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think.

Right, but I think the problem is that it's dirty -> writeback, not dirty -> writeback completed. Ie. they don't guarantee progress; it could be that the total nr_reclaimable + nr_writeback will steadily increase due to this break.

How about ensuring that vm_writeout_total increases by at least 2*sync_writeback_pages() during our stay in balance_dirty_pages(). That way we have the guarantee that more pages get written out than can be dirtied.

No, because that's a global counter, which many writers could be looking at. We'd need a per-task writeout counter, but when finishing the write we don't know anymore which task it was performed for.

Yeah, just reached that conclusion myself too - again, I ran into that when trying to figure out how to do the per-task balancing right.
Re: [PATCH -mm 3/3] PM: Introduce suspend notifiers (rev. 2)
On Sun, 22 Apr 2007 20:48:08 +0200 Rafael J. Wysocki [EMAIL PROTECTED] wrote:

Make it possible to register suspend notifiers so that subsystems can perform suspend-related operations that should not be carried out by device drivers' .suspend() and .resume() routines.

x86_64 allnoconfig:

arch/x86_64/kernel/e820.c: In function 'e820_mark_nosave_regions':
arch/x86_64/kernel/e820.c:279: warning: implicit declaration of function 'register_nosave_region'
arch/x86_64/kernel/built-in.o: In function `e820_mark_nosave_regions':
: undefined reference to `register_nosave_region'
arch/x86_64/kernel/built-in.o: In function `e820_mark_nosave_regions':
: undefined reference to `register_nosave_region'
Re: [PATCH] kthread: Enhance kthread_stop to abort interruptible sleeps
Andrew Morton [EMAIL PROTECTED] writes:

On Fri, 13 Apr 2007 21:13:13 -0600 [EMAIL PROTECTED] (Eric W. Biederman) wrote:

This patch reworks kthread_stop so it is more flexible and it causes the target kthread to abort interruptible sleeps, allowing a larger class of kernel threads to use the kthread API.

The changes start by defining TIF_KTHREAD_STOP on all architectures. TIF_KTHREAD_STOP is a per-process flag that I can set from another process to indicate that a kernel thread should stop.

wake_up_process in kthread_stop has been replaced by signal_wake_up, ensuring that the kernel thread, if sleeping, is woken up in a timely manner and with TIF_SIGNAL_PENDING set, which causes us to break out of interruptible sleeps. recalc_signal_pending was modified to keep TIF_SIGNAL_PENDING set for as long as TIF_KTHREAD_STOP is set.

Arbitrary paths to do_exit are now allowed. I have placed a completion on the thread stack and pointed vfork_done at it; when mm_release is called from do_exit the completion will be called. Since the completion is stored on the stack it is important that kthread() now calls do_exit, ensuring the stack frame that holds the completion is never released, and so that our exit_code is certain to make it unchanged all the way to do_exit.

To allow kthread_stop to read the process exit code when exit_mm wakes it up, I have moved the setting of exit_code to the beginning of do_exit.

This patch causes this oops:

http://userweb.kernel.org/~akpm/s5000508.jpg

with this config:

http://userweb.kernel.org/~akpm/config-x.txt

Thanks. If I am reading the oops properly, this happened during bootup and vfork_done was set to NULL?

The NULL vfork_done is really weird, as exec is the only thing that sets vfork_done to NULL. Either I've got a stupid bug in there somewhere or we have just found the weirdest memory stomp.

I will take a look and see if I can reproduce this shortly.
Eric