Re: [PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open
[EMAIL PROTECTED] wrote:
> On Tue, 27 Nov 2007 08:09:19 +0100, Eric Dumazet said:
> > Changing NR_OPEN is not considered safe because of vmalloc space potential
> > exhaust.
>
> Verbiage about this point...
>
> > +nr_open
> > +---
> > +
> > +Denotes the maximum number of file-handles a process can
> > +allocate. Default value is 1024*1024 (1048576) which should be
> > +enough for most machines. Actual limit depends on RLIMIT_NOFILE
> > +resource limit.
> > +
>
> should probably be in here - can you add something of the form "Setting this
> too high can cause vmalloc failures, especially on smaller-RAM machines",
> and/or *say* how much RAM the default takes? Sure, it's 1M entries, but my
> tuning on a 2G-RAM machine will differ if these are byte-sized, or 128-byte
> sized - one is off in a corner, the other is 1/16th of my entire memory.

vmalloc failures can already happen if you start 32 processes on i386 kernels, each of them wanting to open file handle number 600.000 (if their RLIMIT_NOFILE >= 60)

fcntl(0, F_DUPFD, 60);

We are not going to add warnings about vmalloc on every sysctl around there that could allow a root user to exhaust vmalloc space. This is a vmalloc issue on 32-bit kernels, and quite frankly I never hit this limit.

If you take a look at the vmalloc() implementation, the fact that it uses a 'struct vm_struct *vmlist;' to track all active zones shows that vmalloc() is not used that much.

> Also, would it be useful to *lower* the value drastically, if you know a
> priori that no process should get up to 1K file handles, much less 1M? Does
> that buy me anything different than setting RLIMIT_NOFILE=1024?

NR_OPEN is the max value that RLIMIT_NOFILE can reach, nothing more.

You can set it to 256*1024*1024 or 4*1024; it won't change memory needs on your machine, unless you raise RLIMIT_NOFILE and one of your programs leaks file handles, or really wants to open many of them simultaneously.
Most programs won't open more than 500 files, so their file table is allocated via kmalloc()

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
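The F_DUPFD remark in the reply above can be illustrated from user space. This is a minimal sketch, assuming a POSIX system; dup_at_least() is a hypothetical helper name, not something from the thread. A single fcntl(fd, F_DUPFD, min_fd) call asks for the lowest free descriptor that is >= min_fd, so the descriptor table has to cover min_fd even if the process holds only a handful of open files.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: duplicate fd onto the lowest free descriptor
 * that is >= min_fd. Returns the new descriptor, or -1 with errno set
 * (for example when min_fd is not below the RLIMIT_NOFILE soft limit).
 */
int dup_at_least(int fd, int min_fd)
{
	assert(min_fd >= 0);
	return fcntl(fd, F_DUPFD, min_fd);
}
```

Calling dup_at_least(0, N) with a large N is the kind of request the reply describes; whether it succeeds depends on the caller's RLIMIT_NOFILE.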
Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820
On Mon, 26 Nov 2007 23:27:03 PST, Andrew Morton said:
> > git-x86.patch
> > git-x86-fixup.patch
> > git-x86-thread_order-borkage.patch
> > git-x86-thread_order-borkage-fix.patch
> > git-x86-identify_cpu-fix.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> > git-x86-inlining-borkage.patch
> > x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> > x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD
>
> You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
> fixed it.

I suspect that trying -rc3-mm1 but refreshing just the 10 patches above from -mmotm would be far less likely to pull in other heartburn?

> Otherwise, please proceed to work out which diff I need to drop and hope like
> hell that it isn't git-x86..

That's a 41,240 line diff, the rest *total* to about 400 lines. I don't have warm-n-fuzzies about my odds here. ;)

I'm a git-idiot, but *do* know how to git-bisect through Linus tree - what would I need to do to git-bisect through git-x86.patch? (I do *not* know how to deal with more than 1 source git tree, so if the magic is just 'get a linus tree, merge git-x86, then bisect as usual", I'm stuck on "merge git-x86")..
Re: [RFC] Documentation about unaligned memory access
On Nov 23, 2007, at 5:43 AM, Heikki Orsila wrote:
> On Fri, Nov 23, 2007 at 12:15:53AM +, Daniel Drake wrote:
> > Why unaligned access is bad
> > ===
> >
> > Most architectures are unable to perform unaligned memory accesses. Any
> > unaligned access causes a processor exception.
>
> "Some architectures are unable to perform unaligned memory accesses,
> either an exception is generated, or the data access is silently
> invalid. In architectures that allow unaligned access, natural aligned
> accesses are usually faster than non-aligned."
>
> > In summary: if your code causes unaligned memory accesses to happen,
> > your code will not work on some platforms, and will perform *very* badly
> > on others.
>
> *very* -> *slower*
>
> > Natural alignment
> > =
>
> Please move this definition before "Why unaligned access is bad".
>
> Also, it would be nice to have a table of ISAs:
>
> ISA            Need natural alignment   Need alignment by x
> m68k           No                       2
> powerpc/ppc    Yes                      Word size

On ppc, whether misaligned data is fixed up or causes an exception varies from processor to processor. However, natural alignment is highly recommended.

I'm not sure I follow what is meant by the second column (need alignment by x).

- k
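The two ideas discussed above, natural alignment and avoiding unaligned loads, can be sketched in portable C. This is an illustration only; is_naturally_aligned() and load_u32_unaligned() are hypothetical names, not helpers from the draft document or the kernel. The first checks whether an address is a multiple of the access size; the second reads a 32-bit value through memcpy(), which lets the compiler pick a sequence that is safe on the target instead of dereferencing a possibly misaligned pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Natural alignment: the address is a multiple of the access size.
 * size must be a power of two (1, 2, 4, 8, ...).
 */
int is_naturally_aligned(const void *p, size_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return ((uintptr_t)p & (size - 1)) == 0;
}

/*
 * Read a 32-bit value from an address that may be unaligned.
 * memcpy() copies byte-wise as far as the language is concerned, so no
 * misaligned 32-bit load is ever requested from the hardware directly.
 */
uint32_t load_u32_unaligned(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

On architectures that handle unaligned access in hardware, compilers typically reduce the memcpy() to a single load, so the portable form usually costs nothing there.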
Re: [patch 11/14] Powerpc: Use generic per cpu
On Nov 26, 2007, at 6:14 PM, Christoph Lameter wrote:
> Powerpc has a way to determine the address of the per cpu area of the
> currently executing processor via the paca and the array of per cpu
> offsets is avoided by looking up the per cpu area from the remote
> paca's (copying x86_64).
>
> Cc: Paul Mackerras <[EMAIL PROTECTED]>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> ---
>  include/asm-powerpc/percpu.h | 19 ---
>  1 file changed, 19 deletions(-)
>
> Index: linux-2.6/include/asm-powerpc/percpu.h
> ===
> --- linux-2.6.orig/include/asm-powerpc/percpu.h 2007-11-24 10:27:31.088350556 -0800
> +++ linux-2.6/include/asm-powerpc/percpu.h 2007-11-24 10:29:20.752350757 -0800
> @@ -16,25 +16,6 @@
>  #define __my_cpu_offset() get_paca()->data_offset
>  #define per_cpu_offset(x) (__per_cpu_offset(x))

This concerns me. paca doesn't exist on all PPC platforms.

- k
Re: Booting latest linux kernel(2.6.20) on MPC8548ECDS
On Nov 26, 2007, at 11:54 PM, rajendra prasad wrote:
> Hi,
>
> I am using MPC8548ECDS board from CDS for my telecom application. I am
> able to build 2.6.10 linux kernel and boot 2.6.10 kernel on MPC8548ECDS
> board. When I take the same configuration file, the build succeeds but I
> am not able to boot on the MPC8548E CDS board. I am using u-boot-1.1.6 as
> boot loader. I came to know that the latest kernel is booted with a new
> procedure. Please tell me the boot procedure.

Ask this question on the linuxppc-dev list. You're more likely to get an answer.

It's unclear, but are you trying to use the latest kernel on an MPC8548E CDS?

- k
Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820
On Tue, 27 Nov 2007 02:16:26 -0500 [EMAIL PROTECTED] wrote: > On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said: > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ > > Finally got both time and motivation to at least start a bisect.. > > 2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200) > > 24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub > prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam* > dead. No serial console output, no pair of penguins on the monitor, no > netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does > *anything* is "hold the power button for 5 seconds". Whatever it is, it > happens *very* early (before we get as far as the 'Linux version 2.6.mumble' > banner), and happens *hard*. > > I've bisected it down this far: > > git-ipwireless_cs.patch GOOD > git-x86.patch > git-x86-fixup.patch > git-x86-thread_order-borkage.patch > git-x86-thread_order-borkage-fix.patch > git-x86-identify_cpu-fix.patch > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch > git-x86-inlining-borkage.patch > x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch > x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD > > Anybody got any good debugging ideas before I go through and do the final > 3 or 4 bisects? I suspect I'll need them once I find the offending patch > to tell *why* said patch dies on my box - I've seen enough traffic regarding > -rc3-mm1 dying *later* to know it's probably a subtle issue and not one > that will be obvious once I finger a specific patch. For example, it's > probably not the IO-APIC panic that people are seeing, because their kernels > live long enough to panic. ;) > You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already fixed it. 
Otherwise, please proceed to work out which diff I need to drop and hope like hell that it isn't git-x86..
Re: [PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open
On Tue, 27 Nov 2007 08:09:19 +0100, Eric Dumazet said:
> Changing NR_OPEN is not considered safe because of vmalloc space potential
> exhaust.

Verbiage about this point...

> +nr_open
> +---
> +
> +Denotes the maximum number of file-handles a process can
> +allocate. Default value is 1024*1024 (1048576) which should be
> +enough for most machines. Actual limit depends on RLIMIT_NOFILE
> +resource limit.
> +

should probably be in here - can you add something of the form "Setting this too high can cause vmalloc failures, especially on smaller-RAM machines", and/or *say* how much RAM the default takes? Sure, it's 1M entries, but my tuning on a 2G-RAM machine will differ if these are byte-sized, or 128-byte sized - one is off in a corner, the other is 1/16th of my entire memory.

Also, would it be useful to *lower* the value drastically, if you know a priori that no process should get up to 1K file handles, much less 1M? Does that buy me anything different than setting RLIMIT_NOFILE=1024?
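The sizing question above (byte-sized versus 128-byte-sized entries) can be answered roughly with a little arithmetic. This is a back-of-the-envelope sketch, not a statement from the thread: it assumes each descriptor slot costs one pointer plus one bit in each of two bitmaps (open descriptors and close-on-exec), which is my understanding of the fd table layout in kernels of this era. fdtable_bytes() is a hypothetical name.

```c
#include <assert.h>

/*
 * Rough per-process cost of a descriptor table with nr_fds slots.
 * Assumption for this sketch: one pointer per slot, plus one bit per
 * slot in each of two bitmaps (open descriptors, close-on-exec).
 */
unsigned long fdtable_bytes(unsigned long nr_fds, unsigned long ptr_size)
{
	assert(ptr_size == 4 || ptr_size == 8);
	return nr_fds * ptr_size + 2 * (nr_fds / 8);
}
```

Under that assumption, a table grown to the full 1024*1024 slots would take about 8.25 MiB per process with 8-byte pointers and about 4.25 MiB with 4-byte pointers. The cost is only paid by a process that actually grows its table that far, not by every process on the machine.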
Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820
On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/

Finally got both time and motivation to at least start a bisect..

2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)

24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam* dead. No serial console output, no pair of penguins on the monitor, no netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does *anything* is "hold the power button for 5 seconds". Whatever it is, it happens *very* early (before we get as far as the 'Linux version 2.6.mumble' banner), and happens *hard*.

I've bisected it down this far:

git-ipwireless_cs.patch GOOD
git-x86.patch
git-x86-fixup.patch
git-x86-thread_order-borkage.patch
git-x86-thread_order-borkage-fix.patch
git-x86-identify_cpu-fix.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
git-x86-inlining-borkage.patch
x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

Anybody got any good debugging ideas before I go through and do the final 3 or 4 bisects? I suspect I'll need them once I find the offending patch to tell *why* said patch dies on my box - I've seen enough traffic regarding -rc3-mm1 dying *later* to know it's probably a subtle issue and not one that will be obvious once I finger a specific patch. For example, it's probably not the IO-APIC panic that people are seeing, because their kernels live long enough to panic. ;)
patch driver-core-fix-race-in-__device_release_driver.patch added to gregkh-2.6 tree
This is a note to let you know that I've just added the patch titled Subject: Driver core: fix race in __device_release_driver to my gregkh-2.6 tree. Its filename is driver-core-fix-race-in-__device_release_driver.patch This tree can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/ >From [EMAIL PROTECTED] Mon Nov 26 22:49:20 2007 From: Alan Stern <[EMAIL PROTECTED]> Date: Fri, 16 Nov 2007 11:57:28 -0500 (EST) Subject: Driver core: fix race in __device_release_driver To: Greg KH <[EMAIL PROTECTED]>, David Woodhouse <[EMAIL PROTECTED]> Cc: USB development list <[EMAIL PROTECTED]>, Kernel development list Message-ID: <[EMAIL PROTECTED]> This patch (as1013) was suggested by David Woodhouse; it fixes a race in the driver core. If a device is unregistered at the same time as its driver is unloaded, the driver's code pages may be unmapped while the remove method is still running. The calls to get_driver() and put_driver() were intended to prevent this, but they don't work if the driver's module count has already dropped to 0. Instead, the patch keeps the device on the driver's list until after the remove method has returned. This forces the necessary synchronization to occur. 
Signed-off-by: Alan Stern <[EMAIL PROTECTED]> Signed-off-by: David Woodhouse <[EMAIL PROTECTED]> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]> --- drivers/base/dd.c |5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -289,11 +289,10 @@ static void __device_release_driver(stru { struct device_driver * drv; - drv = get_driver(dev->driver); + drv = dev->driver; if (drv) { driver_sysfs_remove(dev); sysfs_remove_link(>kobj, "driver"); - klist_remove(>knode_driver); if (dev->bus) blocking_notifier_call_chain(>bus->p->bus_notifier, @@ -306,7 +305,7 @@ static void __device_release_driver(stru drv->remove(dev); devres_release_all(dev); dev->driver = NULL; - put_driver(drv); + klist_remove(>knode_driver); } } Patches currently in gregkh-2.6 which might be from [EMAIL PROTECTED] are driver/pm-acquire-device-locks-prior-to-suspending.patch driver/create-sys-...-power-when-config_pm-is-set.patch driver/driver-core-fix-race-in-__device_release_driver.patch usb/usb-add-support-for-an-older-firmware-revision-for-the-nikon-d200.patch usb/usb-fix-priority-mistakes-in-drivers-usb-core-hub.c.patch usb/usb-fix-signr-comment-in-usbdevice_fs.h.patch usb/usb-mailing-lists-have-changed.patch usb/usb-power-management-documenation-update.patch usb/usb-hcd-avoid-duplicate-local_irq_disable.patch usb/usb-usb-mon-mon_bin.c-cleanups.patch usb/usb-keep-track-of-whether-interface-sysfs-files-exist.patch usb/usb-uevent-environment-key-fix.patch usb/usb-autosuspend-for-cdc-acm.patch usb/usb-fix-up-ehci-startup-synchronization.patch usb/usb-usb-storage-new-lockable-subclass-0x07.patch usb/usb-don-t-change-hc-power-state-for-a-freeze.patch usb/usb-dummy_hcd-don-t-register-drivers-on-the-platform-bus.patch usb/usb-force-handover-port-to-companion-when-hub_port_connect_change-fails.patch usb/usb-make-ksuspend_usbd-thread-non-freezable.patch usb/usb-usb-storage-unusual_devs-entry-for-jetflash-ts1gjf2a.patch 
usb/usb-storage-always-set-the-allow_restart-flag.patch
[PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open
As changing NR_OPEN from 1024*1024 to 16*1024*1024 was considered a little bit dangerous, just let it default to 1024*1024 but add a new sysctl to let sysadmins change this value.

Thank you

[PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open

NR_OPEN (historically set to 1024*1024) actually forbids processes to open more than 1024*1024 handles.

Unfortunately some production servers hit the not so 'ridiculously high value' of 1024*1024 file descriptors per process.

Changing NR_OPEN is not considered safe because of vmalloc space potential exhaust.

This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to 1024*1024, so that admins can decide to change this limit if their workload needs it.

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
Cc: Alan Cox <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>

 Documentation/filesystems/proc.txt |8
 Documentation/sysctl/fs.txt| 10 ++
 fs/file.c |8 +---
 include/linux/fs.h |2 +-
 kernel/sys.c |2 +-
 kernel/sysctl.c|8
 6 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dec9945..9b390d7 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -989,6 +989,14 @@ nr_inodes
 Denotes the number of inodes the system has allocated. This number will grow and shrink dynamically.
+nr_open
+---
+
+Denotes the maximum number of file-handles a process can
+allocate. Default value is 1024*1024 (1048576) which should be
+enough for most machines. Actual limit depends on RLIMIT_NOFILE
+resource limit.
+ nr_free_inodes -- diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index aa986a3..f992543 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -23,6 +23,7 @@ Currently, these files are in /proc/sys/fs: - inode-max - inode-nr - inode-state +- nr_open - overflowuid - overflowgid - suid_dumpable @@ -91,6 +92,15 @@ usage of file handles and you don't need to increase the maximum. == +nr_open: + +This denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + +== + inode-max, inode-nr & inode-state: As with file handles, the kernel allocates the inode structures diff --git a/fs/file.c b/fs/file.c index c5575de..5110acb 100644 --- a/fs/file.c +++ b/fs/file.c @@ -24,6 +24,8 @@ struct fdtable_defer { struct fdtable *next; }; +int sysctl_nr_open __read_mostly = 1024*1024; + /* * We use this list to defer free fdtables that have vmalloced * sets/arrays. By keeping a per-cpu list, we avoid having to embed @@ -147,8 +149,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr) nr /= (1024 / sizeof(struct file *)); nr = roundup_pow_of_two(nr + 1); nr *= (1024 / sizeof(struct file *)); - if (nr > NR_OPEN) - nr = NR_OPEN; + if (nr > sysctl_nr_open) + nr = sysctl_nr_open; fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL); if (!fdt) @@ -233,7 +235,7 @@ int expand_files(struct files_struct *files, int nr) if (nr < fdt->max_fds) return 0; /* Can we expand? 
*/ - if (nr >= NR_OPEN) + if (nr >= sysctl_nr_open) return -EMFILE; /* All good, so we try */ diff --git a/include/linux/fs.h b/include/linux/fs.h index b3ec4a4..1cda287 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -21,7 +21,7 @@ /* Fixed constants first: */ #undef NR_OPEN -#define NR_OPEN (1024*1024)/* Absolute upper limit on fd num */ +extern int sysctl_nr_open; #define INR_OPEN 1024 /* Initial setting for nfile rlimits */ #define BLOCK_SIZE_BITS 10 diff --git a/kernel/sys.c b/kernel/sys.c index d1fe71e..99c6ce1 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1472,7 +1472,7 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim) if ((new_rlim.rlim_max > old_rlim->rlim_max) && !capable(CAP_SYS_RESOURCE)) return -EPERM; - if (resource == RLIMIT_NOFILE && new_rlim.rlim_max > NR_OPEN) + if (resource == RLIMIT_NOFILE && new_rlim.rlim_max > sysctl_nr_open) return -EPERM; retval = security_task_setrlimit(resource, _rlim); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 0deed82..de22f7b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1127,6 +1127,14 @@ static struct ctl_table fs_table[] = { .proc_handler = _dointvec, }, { +
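The sizing arithmetic in the alloc_fdtable() hunk of the patch above can be tried out in user space. This is a sketch under stated assumptions: roundup_pow_of_two_ul() is a stand-in written for this example (the kernel has its own roundup_pow_of_two()), the pointer size is passed in as a parameter instead of using sizeof(struct file *), and nr_open_limit plays the role of sysctl_nr_open. It shows that a request is rounded up to a power-of-two number of 1 KiB pointer chunks and then clamped to the limit.

```c
#include <assert.h>

/* User-space stand-in for the kernel's roundup_pow_of_two(). */
unsigned long roundup_pow_of_two_ul(unsigned long n)
{
	unsigned long p = 1;

	assert(n != 0);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Number of descriptor slots that would be allocated for a request of
 * nr slots, following the arithmetic shown in the patch:
 * divide into 1 KiB chunks of pointers, round up to a power of two,
 * scale back, then clamp to the configured limit.
 */
unsigned long fdtable_slots(unsigned long nr, unsigned long ptr_size,
			    unsigned long nr_open_limit)
{
	unsigned long per_kb;

	assert(ptr_size == 4 || ptr_size == 8);
	per_kb = 1024 / ptr_size;

	nr /= per_kb;
	nr = roundup_pow_of_two_ul(nr + 1);
	nr *= per_kb;
	if (nr > nr_open_limit)
		nr = nr_open_limit;
	return nr;
}
```

With 8-byte pointers, a request for 1000 slots yields 1024, a request for 1024 yields 2048, and a request for two million is clamped to the 1024*1024 default.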
Re: [PATCH] dmaengine: Driver for the AVR32 DMACA controller
This:

> Subject: Re: [PATCH] dmaengine: Driver for the AVR32 DMACA controller

in no way describes this:

> This patch corrects recently changed (and now invalid) Kconfig
> descriptions for the DMA engine framework:

grr.
[PATCH] [VIDEO]: Complement va_start() with va_end().
Complement va_start() with va_end().

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---
Compile-tested on i386 with allyesconfig and allmodconfig.

diff --git a/drivers/media/video/saa5246a.c b/drivers/media/video/saa5246a.c
index ad02329..996b494 100644
--- a/drivers/media/video/saa5246a.c
+++ b/drivers/media/video/saa5246a.c
@@ -187,12 +187,14 @@ static int i2c_senddata(struct saa5246a_device *t, ...)
 {
 	unsigned char buf[64];
 	int v;
-	int ct=0;
+	int ct = 0;
 	va_list argp;
-	va_start(argp,t);
+	va_start(argp, t);
-	while((v=va_arg(argp,int))!=-1)
-		buf[ct++]=v;
+	while ((v = va_arg(argp, int)) != -1)
+		buf[ct++] = v;
+
+	va_end(argp);
 	return i2c_sendbuf(t, buf[0], ct-1, buf+1);
 }
diff --git a/drivers/media/video/saa5249.c b/drivers/media/video/saa5249.c
index 94bb59a..da5ca30 100644
--- a/drivers/media/video/saa5249.c
+++ b/drivers/media/video/saa5249.c
@@ -282,12 +282,14 @@ static int i2c_senddata(struct saa5249_device *t, ...)
 {
 	unsigned char buf[64];
 	int v;
-	int ct=0;
+	int ct = 0;
 	va_list argp;
 	va_start(argp,t);
-	while((v=va_arg(argp,int))!=-1)
-		buf[ct++]=v;
+	while ((v = va_arg(argp, int)) != -1)
+		buf[ct++] = v;
+
+	va_end(argp);
 	return i2c_sendbuf(t, buf[0], ct-1, buf+1);
 }
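The pattern this patch completes can be shown in a stand-alone form. This is a minimal sketch modelled on the sentinel loop in i2c_senddata(); collect_bytes() is a hypothetical name, and the capacity check is an addition for the example that the patch itself does not make. The point is the pairing: C requires each va_start() to be matched by a va_end() in the same function before it returns.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Copy int arguments into buf until the -1 sentinel is seen or buf is
 * full, and return how many bytes were stored.
 */
size_t collect_bytes(unsigned char *buf, size_t cap, ...)
{
	va_list argp;
	size_t ct = 0;
	int v;

	assert(buf != NULL);

	va_start(argp, cap);
	while ((v = va_arg(argp, int)) != -1) {
		if (ct == cap)
			break;		/* sketch-only guard against overflow */
		buf[ct++] = (unsigned char)v;
	}
	va_end(argp);			/* matches the va_start() above */

	return ct;
}
```

A call such as collect_bytes(buf, sizeof(buf), 0x10, 0x20, 0x30, -1) stores three bytes. On many ABIs va_end() expands to nothing, which is why its absence tends to go unnoticed, but omitting it is still undefined behaviour.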
Re: [2.6 patch] make I/O schedulers non-modular
On 25-11-2007 18:22, Jens Axboe wrote:
> On Sun, Nov 25 2007, Adrian Bunk wrote:
...
>> Is there any technical reason why we need 4 different schedulers at all?
>
> Until we have the perfect scheduler :-)

IMHO this is not enough yet. There is something called "the right of choice", and, it seems, things are usually far from perfect where this right is not respected.

Regards,
Jarek P.
RE: [PATCH] [NET]: Fix TX bug VLAN in VLAN
2007/11/27, Herbert Xu <[EMAIL PROTECTED]>:
> On Tue, Nov 27, 2007 at 02:32:49PM +0900, Joonwoo Park wrote:
> >
> > Thanks Herbert.
> > Well.. I think patch would work properly for AF_PACKET also.
> > (I did not insert BUG() macro in my patch)
> > What do you think?
>
> Are you sure? I thought you need to check both in the xmit function.
> That is,
>
>    if (veth->h_vlan_proto != htons(ETH_P_8021Q) ||
>        VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR) {
>
> Otherwise you'll miss AF_PACKET packets when REORDER is off.

Thanks Herbert!
I agree with you.

Thanks.
Joonwoo

[NET]: Fix TX bug VLAN in VLAN
Fix misbehavior of vlan_dev_hard_start_xmit() for recursive encapsulations.

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>

---
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 7a36878..4f99bb8 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -462,7 +462,8 @@ int vlan_dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * OTHER THINGS LIKE FDDI/TokenRing/802.3 SNAPs...
 	 */
-	if (veth->h_vlan_proto != htons(ETH_P_8021Q)) {
+	if (veth->h_vlan_proto != htons(ETH_P_8021Q) ||
+		VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR) {
 		int orig_headroom = skb_headroom(skb);
 		unsigned short veth_TCI;
---
Re: [PATCH] utsns: Restore proper namespace handling.
On Mon, 26 Nov 2007 09:19:17 -0600 "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:
> Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> >
> > When CONFIG_UTS_NS was removed it seems that we also deleted
> > the code for handling sysctls in other than the initial
> > uts namespace. This patch restores that code.
> >
> > Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
>
> Thanks, Eric.
>
> Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
>
> > ---
> >  kernel/utsname_sysctl.c |2 ++
> >  1 files changed, 2 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/utsname_sysctl.c b/kernel/utsname_sysctl.c
> > index c76c064..71f58c3 100644
> > --- a/kernel/utsname_sysctl.c
> > +++ b/kernel/utsname_sysctl.c
> > @@ -18,6 +18,8 @@
> >  static void *get_uts(ctl_table *table, int write)
> >  {
> >  	char *which = table->data;
> > +	struct uts_namespace *uts_ns = current->nsproxy->uts_ns;
> > +	which = (which - (char *)&init_uts_ns) + (char *)uts_ns;
> >
> >  	if (!write)
> >  		down_read(&uts_sem);

I already have a (more codingstylely attractive) version of this from Pavel, for which I shall steal your ack.

--- a/kernel/utsname_sysctl.c~isolate-the-uts-namespaces-domainname-and-hostname-back
+++ a/kernel/utsname_sysctl.c
@@ -18,6 +18,10 @@
 static void *get_uts(ctl_table *table, int write)
 {
 	char *which = table->data;
+	struct uts_namespace *uts_ns;
+
+	uts_ns = current->nsproxy->uts_ns;
+	which = (which - (char *)&init_uts_ns) + (char *)uts_ns;

 	if (!write)
 		down_read(&uts_sem);
_

Those pointer tricksies are revolting. What's going on in there?
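As a hedged reading of the pointer arithmetic asked about above: the sysctl table entry stores a pointer to a field inside the initial namespace structure, and the expression keeps that field's byte offset while swapping the base address for the current task's namespace. The sketch below shows the same rebasing in isolation; struct ns_names and rebase_field() are hypothetical stand-ins invented for this example, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-namespace structure. */
struct ns_names {
	char nodename[65];
	char domainname[65];
};

/*
 * field_in_template points at some member of *template_ns. Return the
 * address of the same member inside *target_ns by keeping the byte
 * offset and swapping the base address.
 */
char *rebase_field(char *field_in_template, struct ns_names *template_ns,
		   struct ns_names *target_ns)
{
	ptrdiff_t off = field_in_template - (char *)template_ns;

	assert(off >= 0 && (size_t)off < sizeof(*template_ns));
	return (char *)target_ns + off;
}
```

Given a pointer to the template's domainname, rebase_field() returns the address of the target's domainname. This only works because both objects have the same type and layout, which is the property the kernel expression relies on as well.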
Re: [PATCH 2.6.24-rc3-mm1] IPC: consolidate sem_exit_ns(), msg_exit_ns and shm_exit_ns()
On Mon, 26 Nov 2007 22:44:38 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Fri, 23 Nov 2007 17:52:50 +0100 Pierre Peiffer <[EMAIL PROTECTED]> wrote:
>
> > sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
> > ipc_namespace is released to free all ipcs of each type.
> > But in fact, they do the same thing: they loop around all ipcs to free them
> > individually by calling a specific routine.
> >
> > This patch proposes to consolidate this by introducing a common function,
> > free_ipcs(), that do the job. The specific routine to call on each
> > individual ipcs is passed as parameter. For this, these ipc-specific
> > 'free' routines are reworked to take a generic 'struct ipc_perm' as
> > parameter.
>
> This conflicts in more-than-trivial ways with Pavel's
> move-the-ipc-namespace-under-ipc_ns-option.patch, which was in
> 2.6.24-rc3-mm1.
>

err, no, it wasn't that patch. For some reason your change assumes that msg_exit_ns() (for example) doesn't have these lines:

	kfree(ns->ids[IPC_MSG_IDS]);
	ns->ids[IPC_MSG_IDS] = NULL;

in it.
Re: [PATCH 2.6.24-rc3-mm1] IPC: consolidate sem_exit_ns(), msg_exit_ns and shm_exit_ns()
On Fri, 23 Nov 2007 17:52:50 +0100 Pierre Peiffer <[EMAIL PROTECTED]> wrote:
> sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
> ipc_namespace is released to free all ipcs of each type.
> But in fact, they do the same thing: they loop around all ipcs to free them
> individually by calling a specific routine.
>
> This patch proposes to consolidate this by introducing a common function,
> free_ipcs(), that do the job. The specific routine to call on each
> individual ipcs is passed as parameter. For this, these ipc-specific
> 'free' routines are reworked to take a generic 'struct ipc_perm' as
> parameter.

This conflicts in more-than-trivial ways with Pavel's move-the-ipc-namespace-under-ipc_ns-option.patch, which was in 2.6.24-rc3-mm1.
Re: [PATCH] [NET]: Fix TX bug VLAN in VLAN
On Tue, Nov 27, 2007 at 02:32:49PM +0900, Joonwoo Park wrote:
>
> Thanks Herbert.
> Well.. I think patch would work properly for AF_PACKET also.
> (I did not insert BUG() macro in my patch)
> What do you think?

Are you sure? I thought you need to check both in the xmit function. That is,

	if (veth->h_vlan_proto != htons(ETH_P_8021Q) ||
	    VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR) {

Otherwise you'll miss AF_PACKET packets when REORDER is off.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
out of office
I am responding to the oil spill in San Francisco Bay and will have very limited email contact for the foreseeable future. If this is urgent, please call my cell phone (415-717-6348), and I'll get back to you as soon as possible.

Thanks for your patience,
Christine Abraham

Christine Abraham
Marine Ecology Division
PRBO Conservation Science
3820 Cypress Dr. #11
Petaluma, California 94954
707-781-2555 ext. 334
[GIT PULL] please pull infiniband.git
Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will pull some small fixes for 2.6.24: Erez Zilber (1): IB/iser: Add missing counter increment in iser_data_buf_aligned_len() Jack Morgenstein (1): mlx4_core: Fix state check in mlx4_qp_modify() Joachim Fenkes (1): IB/ehca: Fix static rate regression Ralph Campbell (4): IB/ipath: Fix offset returned to ibv_resize_cq() IB/ipath: Fix error path in QP creation IB/ipath: Fix offset returned to ibv_modify_srq() IB/ipath: Normalize error return codes for posting work requests drivers/infiniband/hw/ehca/ehca_qp.c |4 +- drivers/infiniband/hw/ipath/ipath_cq.c| 19 +--- drivers/infiniband/hw/ipath/ipath_qp.c| 15 ++ drivers/infiniband/hw/ipath/ipath_srq.c | 44 + drivers/infiniband/hw/ipath/ipath_verbs.c |8 +++-- drivers/infiniband/ulp/iser/iser_memory.c |6 ++- drivers/net/mlx4/qp.c |2 +- 7 files changed, 61 insertions(+), 37 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 2e3e654..dd12668 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -1203,7 +1203,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, mqpcb->service_level = attr->ah_attr.sl; update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL, 1); - if (ehca_calc_ipd(shca, my_qp->init_attr.port_num, + if (ehca_calc_ipd(shca, mqpcb->prim_phys_port, attr->ah_attr.static_rate, >max_static_rate)) { ret = -EINVAL; @@ -1302,7 +1302,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, mqpcb->source_path_bits_al = attr->alt_ah_attr.src_path_bits; mqpcb->service_level_al = attr->alt_ah_attr.sl; - if (ehca_calc_ipd(shca, my_qp->init_attr.port_num, + if (ehca_calc_ipd(shca, mqpcb->alt_phys_port, attr->alt_ah_attr.static_rate, >max_static_rate_al)) { ret = -EINVAL; diff --git 
a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 08d8ae1..d1380c7 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -395,12 +395,9 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) goto bail; } - /* -* Return the address of the WC as the offset to mmap. -* See ipath_mmap() for details. -*/ + /* Check that we can write the offset to mmap. */ if (udata && udata->outlen >= sizeof(__u64)) { - __u64 offset = (__u64) wc; + __u64 offset = 0; ret = ib_copy_to_udata(udata, , sizeof(offset)); if (ret) @@ -450,6 +447,18 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) struct ipath_mmap_info *ip = cq->ip; ipath_update_mmap_info(dev, ip, sz, wc); + + /* +* Return the offset to mmap. +* See ipath_mmap() for details. +*/ + if (udata && udata->outlen >= sizeof(__u64)) { + ret = ib_copy_to_udata(udata, >offset, + sizeof(ip->offset)); + if (ret) + goto bail; + } + spin_lock_irq(>pending_lock); if (list_empty(>pending_mmaps)) list_add(>pending_mmaps, >pending_mmaps); diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 6a41fdb..b997ff8 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -835,7 +835,8 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, init_attr->qp_type); if (err) { ret = ERR_PTR(err); - goto bail_rwq; + vfree(qp->r_rq.wq); + goto bail_qp; } qp->ip = NULL; ipath_reset_qp(qp); @@ -863,7 +864,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, sizeof(offset)); if (err) { ret = ERR_PTR(err); - goto bail_rwq; + goto bail_ip; } } else {
Re: 2.6.24-rc3-mm1
On Fri, 23 Nov 2007 06:55:41 +0100 Gabriel C <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Fri, 23 Nov 2007 02:39:08 +0100 Gabriel C <[EMAIL PROTECTED]> wrote: > > > >> I have some warnings on each SCSI disc: > >> > >> > >> ... > >> > >> [ 30.724410] scsi 0:0:0:0: Direct-Access SEAGATE ST318406LW > >> 0109 PQ: 0 ANSI: 3 > >> [ 30.724419] scsi0:A:0:0: Tagged Queuing enabled. Depth 32 > >> [ 30.724435] target0:0:0: Beginning Domain Validation > >> [ 30.724446] target0:0:0: Domain Validation Initial Inquiry Failed <-- > >> [ 30.724572] target0:0:0: Ending Domain Validation > >> [ 30.729747] scsi 0:0:1:0: Direct-Access FUJITSU MAH3182MP > >> 0114 PQ: 0 ANSI: 4 > >> [ 30.729754] scsi0:A:1:0: Tagged Queuing enabled. Depth 32 > >> [ 30.729771] target0:0:1: Beginning Domain Validation > >> [ 30.729780] target0:0:1: Domain Validation Initial Inquiry Failed <-- > >> [ 30.729908] target0:0:1: Ending Domain Validation > >> > > > > Don't know what would have caused that. But yes, something is wrong in > > scsi land. > > Actually I'm lucky the author didn't fix that FIXME in scsi_transport_spi.c > and I still can boot ;) > > > > >> no idea whatever this is related but buffered disk reads are 2.XX MB/sec > >> and the box is somewhat laggy. > >> > >> hdparm -t on sda and sdb reports : > >> > >> /dev/sda: > >> Timing buffered disk reads:8 MB in 3.26 seconds = 2.46 MB/sec > >> > >> /dev/sdb: > >> Timing buffered disk reads:8 MB in 3.56 seconds = 2.25 MB/sec > >> > >> My IDE discs are fine. > >> > >> Please let me know if you need my config or any other informations. > >> > > > > And you're the second to report very slow scsi throughput in 2.6.24-rc3-mm1. > > > > I found the commit which cause these problems , it is in git-scsi-misc patch > and reverting it fixes both problems for me. > > http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=8655a546c83fc43f0a73416bbd126d02de7ad6c0;hp=5bc717b6bdaaf52edf365eb7d9d8c89fec79df5d > OK, thanks. 
I'll assume that James and Hannes have this in hand (or will have, by mid-week) and I won't do anything here. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] remove CONFIG_EXPERIMENTAL
On Mon, 26 Nov 2007 23:34:11 EST, Dave Jones said: > On Mon, Nov 26, 2007 at 10:44:44PM -0500, [EMAIL PROTECTED] wrote: > > > I suspect that given the "once it escapes, it's cast in stone" view we take > > towards user-visible API/etc, there isn't much *real* room for an > > 'EXPERIMENTAL' flag anymore. Most of the usage should probably be > confined to > > individual drivers, where all we should need is a 'default n' and suitable > > warning verbiage in the Kconfig file warning about the driver eating your > > filesystems and small animals for breakfast. > > Potential corruptors are usually flagged with (DANGEROUS) in the text, Right, and given that, an additional EXPERIMENTAL flag seems superfluous. > (One may argue that they shouldn't have escaped -mm) > > > We certainly shouldn't have > > one big flag for *all* in-progress drivers - I don't need to accidentally > > enable a busticated ethernet driver because I want a USB widget. > > So no ethernet driver at all is better than a broken but mostly working one? > Again if it isn't mostly working, it shouldn't have escaped -mm No. The point is that using the *same* flag to control whether I can select a mostly-working USB widget and a mostly-working Ethernet driver is Just Wrong. Those of us who live in the US may have seen the insurance commercial where Joe Sixpack is asking "Honey, what does this switch do?" "I don't know" flip, flip, flip with no obvious impact. Meanwhile, 3 houses down, somebody's car is being beat up by a garage door opener going open/close/open/close... I enable EXPERIMENTAL to enable my USB widget. When the next release comes out, I then go and do something like a 'make [foo]config'. What indication do I get that now-selectable device drivers are 'depends on EXPERIMENTAL' and *not* safe for selection? (Yes, in menuconfig, you can ask it to show the 'depends on' list, *if* you suspect that it might be an issue. But why would I suspect that?) 
In no case should we be creating a situation where users are thinking "Damn, every driver may or may not be bodgy, I have to *check* if it's experimental before I enable it, just because there was one that I *asked* for".

Particularly fun if you're migrating to new hardware and you don't *know* yet which drivers you need, and you're getting prompted for possibly-dodgy ALSA modules because you asked for a USB module. (And yes, trying to wade through all the ALSA/Intel HDA/AC97/Sigmatel *was* painful enough when I moved from a Dell Latitude C840 to a D820 - fortunately enough, I didn't have to deal with EXPERIMENTAL ALSA drivers adding to the mix.)

EXPERIMENTAL in a mainline kernel as a *single* switch for a *lot* of totally unrelated code is even more broken than EMBEDDED (which at least had a common rationale). And over in the -mm kernel where it *should* be, it's superfluous at best, because a -mm kernel might as well just add -DCONFIG_EXPERIMENTAL=y to CFLAGS and save you the effort. ;)
Re: [PATCH 38/54] efivars: remove new_var and del_var files from sysfs
On Fri, Nov 16, 2007 at 09:01:16AM -0600, Matt Domsch wrote: > On Fri, Nov 02, 2007 at 04:59:16PM -0700, Greg Kroah-Hartman wrote: > > WTF? Passing binary structures into a sysfs file, expecting it to be in > > the correct format/endianness? That's just wrong on so many levels. > > > > So, these files are deleted. If you want to add them back, please do so > > in configfs, or in debugfs. Or use text strings, which is what sysfs is > > only for. > > > I have tested gregkh's patches tree, which includes this patch, the > patch to put these back as binary blob interfaces, as well as other > cleanups, on an Itanium2 system. The efibootmgr userspace application > continues to work as it did before this patch series, which I claim is > success. For the patches that touch drivers/firmware/efivars.c I can > say: > > Tested-by: Matt Domsch <[EMAIL PROTECTED]> Great, thanks for doing this, I appreciate it. greg k-h - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: WARNING: at kernel/resource.c:189 __release_resource
On Thu, 22 Nov 2007 22:41:16 +0100 Jiri Slaby <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Step aside. What's the purpose of having two similar patches for one issue,
> it then warns about the same thing twice:
> make-sure-nobodys-leaking-resources.patch
> releasing-resources-with-children.patch

Oh well. It's better than having none. Matthew, could you have a think about something for mainline please?

> Ok, I hit the bug, suspend of 00:06 device complains about it:
> WARNING: at .../kernel/resource.c:185 __release_resource()
>
> Call Trace:
> [] release_resource+0xb5/0xf0
> [] pnp_release_resources+0x70/0x130
> [] pnp_stop_dev+0x45/0x90
> [] pnp_bus_suspend+0x92/0xb0
> [] suspend_device+0x113/0x180
> [] device_suspend+0x200/0x320
> [] suspend_devices_and_enter+0xa5/0x170
> [] enter_state+0x209/0x270
> [] state_store+0xaf/0xf0
> [] kobj_attr_store+0x17/0x20
> [] sysfs_write_file+0xce/0x140
> [] vfs_write+0xc7/0x170
> [] sys_write+0x50/0x90
> [] system_call+0x7e/0x83
>
> # LANG=en ll /sys/devices/pnp0/00:06/
> total 0
> lrwxrwxrwx 1 root root    0 Nov 22 22:35 driver -> ../../../bus/pnp/drivers/serial
> -r--r--r-- 1 root root 4096 Nov 22 22:35 id
> -r--r--r-- 1 root root 4096 Nov 22 22:35 options
> drwxr-xr-x 2 root root    0 Nov 22 22:35 power
> -rw-r--r-- 1 root root 4096 Nov 22 22:35 resources
> lrwxrwxrwx 1 root root    0 Nov 22 22:35 subsystem -> ../../../bus/pnp
> drwxr-xr-x 3 root root    0 Nov 22 22:35 tty
> -rw-r--r-- 1 root root 4096 Nov 22 22:35 uevent

I suppose that's a genuine leak, presumably in 8250_pnp.
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
On Tue, 2007-11-27 at 15:49 +1100, Rusty Russell wrote: > On Monday 26 November 2007 17:15:44 Roland Dreier wrote: > > > Except C doesn't have namespaces and this mechanism doesn't create them. > > > So this is just complete and utter makework; as I said before, noone's > > > going to confuse all those udp_* functions if they're not in the udp > > > namespace. > > > > I don't understand why you're so opposed to organizing the kernel's > > exported symbols in a more self-documenting way. > > No, I was the one who moved exports near their declarations. That's > organised. I just don't see how this new "organization" will help: oh good, > I won't accidentally use the udp functions any more?!? > > > It seems pretty > > clear to me that having a mechanism that requires modules to make > > explicit which (semi-)internal APIs makes reviewing easier > > Perhaps you've got lots of patches were people are using internal APIs they > shouldn't? > Maybe the issue is "who can tell" since what is external and what is internal is not explicitly defined? > > , makes it > > easier to communicate "please don't use that API" to module authors, > > Well, introduce an EXPORT_SYMBOL_INTERNAL(). It's a lot less code. But > you'd > still need to show that people are having trouble knowing what APIs to use. > > and takes at least a small step towards bringing the kernel's exported > > API under control. > > There is no "exported API" to bring under control. Hmm...apparently, there are those that are struggling... > There are symbols we > expose for the kernel's own use which can be used by external modules at > their own risk. > > > What's the real downside? > > No. That's the wrong question. What's the real upside? Explicitly documenting what comprises the kernel API (external, supported) and what comprises the kernel implementation (internal, not supported). > > Let's not put code in the core because "it doesn't seem to hurt". > agreed. 
> I'm sure you think there's a real problem, but I'm still waiting for someone
> to *show* it to me. Then we can look at solutions.

I think the benefits should include:

- forcing developers to identify their exports as part of the implementation or as part of the kernel API
- making it easier for reviewers to identify when developers are adding to the kernel API and thereby focusing the appropriate level of review to the new function
- making it obvious to developers when they are binding their implementation to a particular kernel release

> Rusty.
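To make the idea being argued over a little more concrete, here is a small userspace sketch of what "symbols grouped into namespaces that a module must explicitly ask for" could look like. It is not the patch from this thread and not kernel code: the table, the symbol names, the `sketch_` types and the result codes are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an exported symbol; ns == NULL means "general API". */
struct sketch_export {
	const char *name;
	const char *ns;
};

/* Illustrative table only; these symbol names are invented for the example. */
static const struct sketch_export sketch_exports[] = {
	{ "generic_helper", NULL },
	{ "udp_internal_helper", "udp" },
};

/* A module lists the namespaces it explicitly asks to import. */
struct sketch_module {
	const char *name;
	const char *const *imports;
	size_t n_imports;
};

enum sketch_result {
	SKETCH_OK = 0,
	SKETCH_NOT_EXPORTED = -1,
	SKETCH_NS_NOT_IMPORTED = -2,
};

/* Decide whether 'mod' may bind to 'sym' under this toy policy. */
static int sketch_resolve(const struct sketch_module *mod, const char *sym)
{
	size_t i, j;

	for (i = 0; i < sizeof(sketch_exports) / sizeof(sketch_exports[0]); i++) {
		const struct sketch_export *e = &sketch_exports[i];

		if (strcmp(e->name, sym) != 0)
			continue;
		if (e->ns == NULL)
			return SKETCH_OK;	/* general API: always allowed */
		for (j = 0; j < mod->n_imports; j++)
			if (strcmp(mod->imports[j], e->ns) == 0)
				return SKETCH_OK;
		return SKETCH_NS_NOT_IMPORTED;	/* internal symbol, no import */
	}
	return SKETCH_NOT_EXPORTED;
}
```

In this toy model the reviewer-visible part is the module's import list: a module that binds to an internal symbol has to say so up front, which is roughly the "makes reviewing easier" argument quoted above.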
[git pull] Input updates for 2.6.24-rc3
Hi Linus, Please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus or master.kernel.org:/pub/scm/linux/kernel/git/dtor/input.git for-linus to receive updates for the input subsystem. Changelog: - Aristeu Rozanski (2): Input: add definitions for frame forward and frame back keys Input: adds the context menu key (HUT GenDesc 0x84) Dmitry Torokhov (3): sony-laptop: fit input devices into sysfs tree sonypi: fit input devices into sysfs tree Sonypi: use synchronize_irq instead of sycnronize_sched Herbert Valerio Riedel (1): Input: gpio-keys - request and configure GPIOs Jiri Kosina (1): Input: i8042 - add i8042.noloop quirk for MS Virtual Machine Mike Frysinger (1): Input: bf54x-keys - keypad does not exist on BF544 parts Diffstat: drivers/char/sonypi.c |8 -- drivers/input/keyboard/Kconfig|2 +- drivers/input/keyboard/gpio_keys.c| 38 drivers/input/serio/i8042-x86ia64io.h |8 +++ drivers/misc/sony-laptop.c| 10 +--- include/linux/input.h |5 6 files changed, 53 insertions(+), 18 deletions(-) -- Dmitry - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] [INPUT/KEYPAD] Blackfin BF54x keypad driver: keypad does not exist on BF544 parts
On Friday 23 November 2007, Bryan Wu wrote: > From: Mike Frysinger <[EMAIL PROTECTED]> > > Signed-off-by: Mike Frysinger <[EMAIL PROTECTED]> > Signed-off-by: Bryan Wu <[EMAIL PROTECTED]> Applied, thank you Mike & Bryan. -- Dmitry
Re: [PATCH] mmc: Add missing sg_init_table() call
On Mon, 26 Nov 2007 21:29:55 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> Pierre, I can queue this up but if you merge it into your tree I shall drop
> it and shall lose track of it. So it's then all down to you to remember to
> get the fix into 2.6.24.
>
> (Except this particular bug looks like a post-2.6.23 regression, so I can cc
> the Rafael which never forgets, so it will then get tracked all the way into
> Linus's tree)
>

Jens said he applied it, so I figured the issue was handled. Jens, what happened to it?

Rgds
--
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  PulseAudio, core developer          http://pulseaudio.org
  rdesktop, core developer            http://www.rdesktop.org
Booting latest linux kernel(2.6.20) on MPC8548ECDS
Hi,

I am using an MPC8548ECDS board from CDS for my telecom application. I am able to build a 2.6.10 Linux kernel and boot it on the MPC8548ECDS board. When I take the same configuration file to the latest kernel (2.6.20), the build succeeds, but the kernel does not boot on the MPC8548E CDS board. I am using u-boot-1.1.6 as the boot loader. I have learned that the latest kernels are booted with a new procedure. Please tell me what that boot procedure is.

Regards,
RAJ
Re: [PATCH 1/1] mm: add dirty_highmem option
On Tue, 27 Nov 2007 16:24:24 +1100 "Bron Gondwana" <[EMAIL PROTECTED]> wrote: > On Mon, 26 Nov 2007 20:54:28 -0800, "Andrew Morton" <[EMAIL PROTECTED]> said: > > On Thu, 22 Nov 2007 14:42:04 +1100 Bron Gondwana <[EMAIL PROTECTED]> > > wrote: > > > > > /* > > > + * free highmem will not be subtracted from the total free memory > > > + * for calculating free ratios if vm_dirty_highmem is true > > > + */ > > > +int vm_dirty_highmem; > > > > One would expect that setting dirty_highmem to true would cause highmem > > to > > be accounted in dirty-memory calculations. However with this change > > reality is in fact the inverse of that. > > > > So how about this? > > Actually, I'm confused now. Maybe I chose a bad name to begin with. > Does it mean "I am allowed to dirty high memory" or "my high memory > will be dirty if this is on"? But we're always allowed to dirty highmem - there'd be no point in having it otherwise. Hence the term dirty_highmem is confusing. umm, really you want /proc/sys/vm/dont-account-highmem-in-dirty-memory-calculations, only shorter. Do you agree? If so, then it's still not a very pleasing interface - setting something to "true" to disable a particular piece of kernel behaviour implies a single negation which we don't really need. It would be simpler to have /proc/sys/vm/do-account-highmem-in-dirty-memory-calculations, defaulting to "true" - this has no negations. So... how about /proc/sys/vm/, umm. OK, I give up. Please see if you can think of something less confusing which involves no negations? Thanks. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: freeze vs freezer
On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:
> On Monday, 26 of November 2007, David Chinner wrote:
> > So how do you handle threads that are blocked on I/O or a lock during
> > the system freeze process, then?
>
> We wait until they can continue.

So if I have a process blocked on an unavailable NFS mount, I can't suspend?

--
Matthew Garrett | [EMAIL PROTECTED]
Re: Dynticks Causing High Context Switch Rate in ksoftirqd
On Mon, 26 Nov 2007 22:36:17 -0600 Robert Hancock <[EMAIL PROTECTED]> wrote: > [EMAIL PROTECTED] wrote: > > Question: Why is ksoftirqd eating about 5 to 10 percent of my CPU > > on an idle system? The problem occurs if I config the kernel with > > tickless support (i.e. CONFIG_TICK_ONESHOT=y). (Thanks to > > "oprofile" for putting me onto this.) > > > > I have noted this same problem on kernel versions: 2.6.23.1, > > 2.6.23.8 and 2.6.23.9 > > > > ** > > *** Output from "vmstat -n 1 10" -- Note very high context switch > > rate *** *** This is on a idle > > machine! *** > > ** > > > > procs ---memory-- ---swap-- -io --system-- > > cpu > > r b swpd free buff cache si sobibo incs > > us sy id wa > > 0 0 0 1925556 4768 11610400 124 26 > > 7538 1 2 96 1 > > 0 0 0 1925556 4768 11610400 0 02 > > 147329 0 1 99 0 > > What did oprofile show? It should be able to narrow down what > function(s) are responsible for the CPU usage.. > or better, what does powertop version 1.9 show? that tends to show tickless wakeup artifacts quite nicely -- If you want to reach me at my work email, use [EMAIL PROTECTED] For development, discussion and tips for power savings, visit http://www.lesswatts.org - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Error returns not handled correctly by sysfs.c:subsys_attr_store()
Greg KH wrote: > On Mon, Nov 26, 2007 at 08:31:16PM -0800, Andrew Morton wrote: >> On Wed, 21 Nov 2007 15:16:59 -0700 Andrew Patterson <[EMAIL PROTECTED]> >> wrote: >> >>> The buf in fs/sysfs.c:subsys_attr_store() does not seem to be updated >>> correctly when returning a negative value (indicating that an error >>> condition has occurred) is returned. If a negative value is returned, >>> the next subsequent call to subsys_attr_store will have the contents of >>> buf appended to the previous call. >> subsys_attr_store() gets deleted by >> http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-01-driver/kset-kill-subsys-attr.patch >> >> So maybe we will soon accidentally fix whatever-this-is? Or maybe we will >> faithfully maintain it. > > Yes, subsys attributes go away, but this is showing a bug in the sysfs > core with attributes, not in the "middle" layers of attributes. > > I bounced the original bug report to Tejun, who has been changing the > logic around this area to see if he sees anything that might be > different now. > > Tejun? (groaning buried under ATA bugs) Will take a look soon. Thanks. -- tejun - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Error returns not handled correctly by sysfs.c:subsys_attr_store()
On Mon, Nov 26, 2007 at 08:31:16PM -0800, Andrew Morton wrote: > On Wed, 21 Nov 2007 15:16:59 -0700 Andrew Patterson <[EMAIL PROTECTED]> wrote: > > > The buf in fs/sysfs.c:subsys_attr_store() does not seem to be updated > > correctly when returning a negative value (indicating that an error > > condition has occurred) is returned. If a negative value is returned, > > the next subsequent call to subsys_attr_store will have the contents of > > buf appended to the previous call. > > subsys_attr_store() gets deleted by > http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-01-driver/kset-kill-subsys-attr.patch > > So maybe we will soon accidentally fix whatever-this-is? Or maybe we will > faithfully maintain it. Yes, subsys attributes go away, but this is showing a bug in the sysfs core with attributes, not in the "middle" layers of attributes. I bounced the original bug report to Tejun, who has been changing the logic around this area to see if he sees anything that might be different now. Tejun? thanks, greg k-h - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
On Mon, Nov 26, 2007 at 11:50:10PM -0500, Konrad Rzeszutek wrote:
> > >
> > > sysfs files have ONE VALUE PER FILE, not a whole bunch of different
> > > things in a single file. Please fix this.
> >
> > The subparameters _are_ actually part of a single value, that value being
> > associated with the initiator instance.
> >
> > Konrad is trying to implement a "work-alike" for what open firmware does.
> > open-iscsi already has the ability to extract the same format
> > bits from real OFW.
> >
> > See open-iscsi.git/utils/fwparam_ppc.
>
>
> Greg,
>
> In light of what Doug says (which is all true), should I go ahead with a new
> version of this module which would export one value per file? The problem
> that will be encountered is that a ethernetX sysfs directory would have (for
> example):
>
> /sys/firmware/ibft/ethernet0/pci-bdf
> 5:1:0
> /sys/firmware/ibft/ethernet0/mac
> 00:11:25:9d:8b:00
> /sys/firmware/ibft/ethernet0/vlan
> 0
> /sys/firmware/ibft/ethernet0/gateway
> 192.168.79.254
> /sys/firmware/ibft/ethernet0/origin
> 0
> /sys/firmware/ibft/ethernet0/subnet-mask
> 22
> /sys/firmware/ibft/ethernet0/ip-addr
> 192.168.77.41
> /sys/firmware/ibft/ethernet0/flags
> 7

Yes, that is the proper way to do this kind of thing in sysfs.

> And the flag would contain the value "7" which would mean the user would have
> to parse what each bit means? (the v0.3 of the module does not export this
> flag but uses it to figure out which is the boot iSCSI target).

Sure, as long as it means something to userspace, and is a single value, and is documented, that's fine.

thanks,

greg k-h
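For what it is worth, decoding a single-value "flags" attribute in userspace is not much work. The sketch below parses the text a tool might read from such a file (for example "7" followed by a newline) and leaves the bit tests to the caller. The bit names are placeholders invented for this example; the thread does not say what each bit means, so real code would take them from the module's documentation.

```c
#include <errno.h>
#include <stdlib.h>

/* Placeholder bit names; their real meaning is NOT given in this thread. */
enum {
	SKETCH_FLAG_BIT0 = 1u << 0,
	SKETCH_FLAG_BIT1 = 1u << 1,
	SKETCH_FLAG_BIT2 = 1u << 2,
};

/*
 * Parse the contents of a one-value-per-file attribute holding a bitmask.
 * Accepts an optional trailing newline or spaces; returns 0 on success,
 * -1 if the text is not exactly one number.
 */
static int sketch_parse_flags(const char *text, unsigned long *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(text, &end, 0);
	if (errno != 0 || end == text)
		return -1;
	while (*end == '\n' || *end == ' ')
		end++;
	if (*end != '\0')
		return -1;	/* more than one value in the file */
	*out = v;
	return 0;
}
```

A caller would read the file into a buffer, pass it to sketch_parse_flags(), and then test individual bits with expressions such as (flags & SKETCH_FLAG_BIT1). Rejecting trailing junk also acts as a cheap check that the file really holds one value.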
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
On Mon, Nov 26, 2007 at 11:23:31PM -0500, Konrad Rzeszutek wrote:
> On Monday 26 November 2007 22:31:38 Greg KH wrote:
> > > +#if defined(CONFIG_ISCSI_IBFT) || defined(CONFIG_ISCSI_IBFT_MODULE)
> ..snip..
> > > +static ssize_t find_ibft(void)
> > > +{
> ..snip..
> > > +}
> >
> > What is a function (not even an inline one) doing in a .h file?
>
> I was not sure where to put it. This function (find_ibft) is used by the
> setup_[32|64].c and the iscsi_ibft.c code. Randy suggested I put in .c file,
> but I am not sure exactly where? Should I make a new file in called
> libs/iscsi_ibft_helper.c ?

Put it in your .c file and make it a global function to be called by someone else if they need it.

thanks,

greg k-h
Re: [PATCH] [NET]: Fix TX bug VLAN in VLAN
2007/11/26, Herbert Xu <[EMAIL PROTECTED]>:
> On Fri, Nov 23, 2007 at 12:12:52PM +, Joonwoo Park wrote:
> > This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=8766
> >
> > Is it possible?
> > BUG((veth->h_vlan_proto != htons(ETH_P_8021Q)) &&
> > !(VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR))
> > I'm afraid, queued packet before vconfig set_flag would do that.
>
> Yes, AF_PACKET would do that. So you should check both.
>

Thanks Herbert.

Well.. I think the patch would work properly for AF_PACKET as well. (I did not insert the BUG() macro in my patch.) What do you think?

Thanks
Joonwoo
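The header test in the quoted expression can be tried out in userspace. The sketch below is only that test, not the patch: it checks whether a raw Ethernet frame already carries an 802.1Q tag by looking at the type field that follows the two MAC addresses. The struct mirrors the field names used in the quoted expression, and the `sketch_`/`SKETCH_` names are introduced here for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define SKETCH_ETH_ALEN    6
#define SKETCH_ETH_P_8021Q 0x8100	/* 802.1Q tag protocol identifier */

/* Userspace stand-in for a VLAN-tagged Ethernet header (18 bytes). */
struct sketch_vlan_ethhdr {
	uint8_t  h_dest[SKETCH_ETH_ALEN];
	uint8_t  h_source[SKETCH_ETH_ALEN];
	uint16_t h_vlan_proto;			/* network byte order */
	uint16_t h_vlan_TCI;			/* network byte order */
	uint16_t h_vlan_encapsulated_proto;	/* network byte order */
};

/* Return 1 if the frame starts with an 802.1Q-tagged header, else 0. */
static int sketch_frame_has_vlan_tag(const uint8_t *frame, size_t len)
{
	struct sketch_vlan_ethhdr hdr;

	if (len < sizeof(hdr))
		return 0;
	memcpy(&hdr, frame, sizeof(hdr));	/* avoid unaligned access */
	return hdr.h_vlan_proto == htons(SKETCH_ETH_P_8021Q);
}
```

For a VLAN-in-VLAN frame this test is true before the outer tag has been added, because the inner tag is already in place. That seems to be why the thread talks about checking the header together with VLAN_FLAG_REORDER_HDR instead of relying on the header test alone.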
Re: [PATCH] mmc: Add missing sg_init_table() call
On Thu, 22 Nov 2007 20:32:51 +0100 Haavard Skinnemoen <[EMAIL PROTECTED]> wrote: > mmc_init_queue only initializes the scatterlists with sg_init_table() > when using a bounce buffer. This leads to a BUG() when CONFIG_DEBUG_SG > is set. > I assume that 2.6.23 is not afflicted in this way? > --- > drivers/mmc/card/queue.c |3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > > diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c > index 1b9c9b6..30cd13b 100644 > --- a/drivers/mmc/card/queue.c > +++ b/drivers/mmc/card/queue.c > @@ -180,12 +180,13 @@ int mmc_init_queue(struct mmc_queue *mq, struct > mmc_card *card, spinlock_t *lock > blk_queue_max_hw_segments(mq->queue, host->max_hw_segs); > blk_queue_max_segment_size(mq->queue, host->max_seg_size); > > - mq->sg = kzalloc(sizeof(struct scatterlist) * > + mq->sg = kmalloc(sizeof(struct scatterlist) * > host->max_phys_segs, GFP_KERNEL); > if (!mq->sg) { > ret = -ENOMEM; > goto cleanup_queue; > } > + sg_init_table(mq->sg, host->max_phys_segs); > } > > init_MUTEX(>thread_sem); Pierre, I can queue this up but if you merge it into your tree I shall drop it and shall lose track of it. So it's then all down to you to remember to get the fix into 2.6.24. (Except this particular bug looks like a post-2.6.23 regression, so I can cc the Rafael which never forgets, so it will then get tracked all the way into Linus's tree) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] mm: add dirty_highmem option
On Mon, 26 Nov 2007 20:54:28 -0800, "Andrew Morton" <[EMAIL PROTECTED]> said:
> On Thu, 22 Nov 2007 14:42:04 +1100 Bron Gondwana <[EMAIL PROTECTED]> wrote:
>
> >  /*
> > + * free highmem will not be subtracted from the total free memory
> > + * for calculating free ratios if vm_dirty_highmem is true
> > + */
> > +int vm_dirty_highmem;
>
> One would expect that setting dirty_highmem to true would cause highmem to
> be accounted in dirty-memory calculations.  However with this change
> reality is in fact the inverse of that.
>
> So how about this?

Actually, I'm confused now. Maybe I chose a bad name to begin with.
Does it mean "I am allowed to dirty high memory" or "my high memory will
be dirty if this is on"?

Hmm... I'm even having trouble articulating what's odd about it. I guess
my internal model was: "if this flag is set then you are allowed to make
high memory dirty without needing to flush it immediately", which is why
I made it that way around.

No - you're wrong. My patch _did_ include high memory in the dirty
memory calculations when dirty_highmem was true.

> 	x = global_page_state(NR_FREE_PAGES)
> 		+ global_page_state(NR_INACTIVE)
> 		+ global_page_state(NR_ACTIVE);

This is the total memory, _including_ high memory.

> 		x -= highmem_dirtyable_memory(x);

This removes the high memory from the total count.

I think I got it right. If dirty_highmem is set to true, then don't
subtract highmem from the total memory count before calculating the
percentages. That's what I meant, and that's what the toggle did.
Removed the subtraction.

Bron.

>  Documentation/filesystems/proc.txt |    4 ++--
>  mm/page-writeback.c                |    8
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff -puN Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix
> +++ a/Documentation/filesystems/proc.txt
> @@ -1265,8 +1265,8 @@ Contains, as a boolean, a switch to allo
>  part of the "available" memory against which the dirty ratios will be
>  applied.
>
> -Setting this to 1 can be useful on 32 bit machines where you want to make
> -random changes within an MMAPed file that is larger than your available
> +Setting this to 0 (false) can be useful on 32 bit machines where you wish to
> +make random changes within an MMAPed file that is larger than your available
>  lowmem, however it is potentially dangerous and has serious bounce-buffer
>  issues.
>
> diff -puN mm/page-writeback.c~mm-add-dirty_highmem-option-fix mm/page-writeback.c
> --- a/mm/page-writeback.c~mm-add-dirty_highmem-option-fix
> +++ a/mm/page-writeback.c
> @@ -69,10 +69,10 @@ static inline long sync_writeback_pages(
>  int dirty_background_ratio = 5;
>
>  /*
> - * free highmem will not be subtracted from the total free memory
> - * for calculating free ratios if vm_dirty_highmem is true
> + * free highmem will be subtracted from the total free memory for calculating
> + * free ratios if vm_dirty_highmem is true
>   */
> -int vm_dirty_highmem;
> +int vm_dirty_highmem = 1;
>
>  /*
>   * The generator of dirty data starts writeback at this percentage
> @@ -293,7 +293,7 @@ static unsigned long determine_dirtyable
>  	x = global_page_state(NR_FREE_PAGES)
>  		+ global_page_state(NR_INACTIVE)
>  		+ global_page_state(NR_ACTIVE);
> -	if (!vm_dirty_highmem)
> +	if (vm_dirty_highmem)
>  		x -= highmem_dirtyable_memory(x);
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
> _
>
> (I dropped the already-merged part of your patch)
>
> (I fixed a build error in kernel/sysctl.c: "one" was defined twice when
> suitable config options were set).
>
> (It's an unpleasing patch, btw.  But it's an unpleasant problem and at least
> this way people can tell us "hey, I did  and it started to work")

--
  Bron Gondwana
  [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
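The polarity being argued over can be modelled in userspace. The following is only a sketch under stated assumptions: `struct mem_counts` and `dirtyable_memory()` are hypothetical stand-ins for the kernel's page counters and `determine_dirtyable_memory()`, and the clamp stands in for `highmem_dirtyable_memory()`. It uses the polarity of the original patch, where highmem is subtracted only when the flag is false.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's global page counters. */
struct mem_counts {
	unsigned long free_pages;
	unsigned long inactive_pages;
	unsigned long active_pages;
	unsigned long highmem_pages;	/* part of the total that lives in highmem */
};

/*
 * Userspace model of determine_dirtyable_memory() with the polarity of
 * the original patch: highmem stays in the total when dirty_highmem is
 * true and is subtracted when it is false.
 */
static unsigned long dirtyable_memory(const struct mem_counts *m,
				      bool dirty_highmem)
{
	unsigned long x = m->free_pages + m->inactive_pages + m->active_pages;

	if (!dirty_highmem) {
		/* never subtract more than the total (assumed clamp) */
		unsigned long high = m->highmem_pages < x ? m->highmem_pages : x;

		x -= high;
	}
	return x + 1;	/* ensure that we never return 0 */
}
```

With 600 pages in total, 400 of them in highmem, the model returns 601 when the flag is true and 201 when it is false. The follow-up patch quoted above inverts the test and defaults the flag to 1, so there highmem is subtracted when the flag is set.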
Re: Linux 2.6.23.9
On Tue, 27 Nov 2007 02:39:08 +0100, Patrick McHardy said:
> Tomasz K wrote:
> > On Mon, 26 Nov 2007, Greg Kroah-Hartman wrote:
> > [..]
> >
> > There is still no officially released iptables tarball around with
> > support for rules for the new xt_{connlimit,time,u32} modules.
> > Does anyone know where the patches for managing connlimit, time and
> > u32 rules, which will be included in the next release, can be found?
>
> A well chosen thread to ask this question. xt_time is not even
> included in 2.6.23.
>
> http://netfilter.org/news.html#2007-10-15

And I don't see any mention of connlimit, time, or u32 in the Changelog
for that... I admit I haven't peeked inside the actual tarball to see if
they added stuff and didn't Changelog it...
Re: [PATCH][SHMEM] Factor out sbi->free_inodes manipulations
On Fri, 23 Nov 2007 13:41:55 +0000 (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote:

> Looks good, but we can save slightly more there (depending on config),
> and I found your inc/dec names a little confusing, since the count is
> going the other way: how do you feel about this version?  (I'd like it
> better if those helpers could take a struct inode *, but they cannot.)
> Hugh
>
>
> From: Pavel Emelyanov <[EMAIL PROTECTED]>
>
> The shmem_sb_info structure has a number of free_inodes. This
> value is altered in appropriate places under spinlock and with
> the sbi->max_inodes != 0 check.
>
> Consolidate these manipulations into two helpers.
>
> This is minus 42 bytes of shmem.o and minus 4 :) lines of code.
>
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
> ---
>
>  mm/shmem.c |   72 ---
>  1 file changed, 34 insertions(+), 38 deletions(-)
>
> --- 2.6.24-rc3/mm/shmem.c	2007-11-07 04:21:45.000000000 +0000
> +++ linux/mm/shmem.c	2007-11-23 12:43:28.000000000 +0000
> @@ -207,6 +207,31 @@ static void shmem_free_blocks(struct ino
>  	}
>  }
>
> +static int shmem_reserve_inode(struct super_block *sb)
> +{
> +	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
> +	if (sbinfo->max_inodes) {
> +		spin_lock(&sbinfo->stat_lock);
> +		if (!sbinfo->free_inodes) {
> +			spin_unlock(&sbinfo->stat_lock);
> +			return -ENOMEM;
> +		}
> +		sbinfo->free_inodes--;
> +		spin_unlock(&sbinfo->stat_lock);
> +	}
> +	return 0;
> +}

It is peculiar to (wrongly) return -ENOMEM

> +	if (shmem_reserve_inode(inode->i_sb))
> +		return -ENOSPC;

and to then correct it in the caller..

Something boringly conventional such as the below, perhaps?

--- a/mm/shmem.c~shmem-factor-out-sbi-free_inodes-manipulations-fix
+++ a/mm/shmem.c
@@ -212,7 +212,7 @@ static int shmem_reserve_inode(struct su
 		spin_lock(&sbinfo->stat_lock);
 		if (!sbinfo->free_inodes) {
 			spin_unlock(&sbinfo->stat_lock);
-			return -ENOMEM;
+			return -ENOSPC;
 		}
 		sbinfo->free_inodes--;
 		spin_unlock(&sbinfo->stat_lock);
@@ -1679,14 +1679,16 @@ static int shmem_create(struct inode *di
 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 {
 	struct inode *inode = old_dentry->d_inode;
+	int ret;
 
 	/*
 	 * No ordinary (disk based) filesystem counts links as inodes;
 	 * but each new link needs a new dentry, pinning lowmem, and
 	 * tmpfs dentries cannot be pruned until they are unlinked.
 	 */
-	if (shmem_reserve_inode(inode->i_sb))
-		return -ENOSPC;
+	ret = shmem_reserve_inode(inode->i_sb);
+	if (ret)
+		goto out;
 
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
@@ -1694,7 +1696,8 @@ static int shmem_link(struct dentry *old
 	atomic_inc(&inode->i_count);	/* New dentry reference */
 	dget(dentry);		/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
-	return 0;
+out:
+	return ret;
 }
 
 static int shmem_unlink(struct inode *dir, struct dentry *dentry)
_
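The convention suggested above (the helper returns the final error code and the caller only propagates it) can be tried out in userspace. This is a minimal sketch, not the shmem code: `struct sb_info`, `reserve_inode()`, `release_inode()` and `link_file()` are hypothetical names, the release helper is an assumed counterpart that the quoted hunks do not show, and the spinlock is left out.

```c
#include <errno.h>

/* Hypothetical userspace stand-in for struct shmem_sb_info. */
struct sb_info {
	unsigned long max_inodes;	/* 0 means the count is not limited */
	unsigned long free_inodes;
};

/* Reserve one inode: returns 0 on success or -ENOSPC when none is left. */
static int reserve_inode(struct sb_info *sbinfo)
{
	if (sbinfo->max_inodes) {
		/* the kernel helper holds sbinfo->stat_lock around this */
		if (!sbinfo->free_inodes)
			return -ENOSPC;
		sbinfo->free_inodes--;
	}
	return 0;
}

/* Assumed counterpart that hands the inode back. */
static void release_inode(struct sb_info *sbinfo)
{
	if (sbinfo->max_inodes)
		sbinfo->free_inodes++;
}

/* The caller passes the helper's error through instead of rewriting it. */
static int link_file(struct sb_info *sbinfo)
{
	int ret;

	ret = reserve_inode(sbinfo);
	if (ret)
		goto out;
	/* ... directory and inode bookkeeping would happen here ... */
out:
	return ret;
}
```

With this shape the error code lives in exactly one place, so a later change to the helper's failure mode does not need a matching change in every caller.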
Re: [patch 04/14] ia64: Remove the __SMALL_ADDR_AREA attribute for per cpu access
On 11/26/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> The model(small) attribute is not supported by gcc 4.X. The tests
> will always be negative today.

What was the rationale for removing this attribute?

--david
[Patch 5/5] sched: Improve fairness of cpu bandwidth allocation for task groups
The current load balancing scheme isn't good for group fairness.

For ex: on a 8-cpu system, I created 3 groups as under:

	a = 8 tasks (cpu.shares = 1024)
	b = 4 tasks (cpu.shares = 1024)
	c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

Col1 | Col2    | Col3  | Col4
-----|---------|-------|-------------------------------------------------
a    | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b    | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c    |  86.326 | 18.0% | 47.5% 47.9% 48.5%

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of
	that group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window,
	expressed in percentage. Col3 data is derived as:
		Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
		Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

	a = +24.47%
	b = -9.13%
	c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

Col1 | Col2    | Col3  | Col4
-----|---------|-------|-------------------------------------------------
a    | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b    | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c    | 160.653 | 33.5% | 85.8% 90.6% 91.4%

Deviation from desired group fairness is as below:

	a = +0.67%
	b = -0.83%
	c = +0.17%

which is far better IMO. Most of other runs have yielded a deviation
within +-2% at the most, which is good.

Why do we see bad (group) fairness with the current scheduler?
==============================================================

Currently a cpu's weight is just the summation of individual task
weights. This can yield incorrect results. For ex: consider three groups
as below on a 2-cpu system:

	CPU0		CPU1
	---------------------------
	A (10)		B(5)
			C(5)
	---------------------------

Group A has 10 tasks, all on CPU0. Groups B and C have 5 tasks each, all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each
cpu, and the load balancer will think both CPUs are perfectly balanced
and won't move around any tasks. This, however, would yield this
bandwidth:

	A = 50%
	B = 25%
	C = 25%

which is not the desired result.

What's changing in the patch?
=============================

	- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
	  defined (see below)
	- API Change
	- Two tunables introduced in sysfs (under SCHED_DEBUG) to
	  control the frequency at which the load balance monitor
	  thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight)
is calculated. It's now calculated as the summation of group weights on
a cpu, rather than the summation of task weights. The weight exerted by
a group on a cpu depends on the shares allocated to it and also on the
number of tasks the group has on that cpu compared to the total number
of (runnable) tasks the group has in the system.

Let,
	W(K,i)  = Weight of group K on cpu i
	T(K,i)  = Task load present in group K's cfs_rq on cpu i
	T(K)    = Total task load of group K across various cpus
	S(K)    = Shares allocated to group K
	NRCPUS  = Number of online cpus in the scheduler domain to
		  which group K is assigned.

Then,
	W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)

A load balance monitor thread is created at bootup, which periodically
runs and adjusts each group's weight on each cpu. To avoid its overhead,
two min/max tunables are introduced (under SCHED_DEBUG) to control the
rate at which it runs.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/sched.h |    4
 kernel/sched.c        |  259 --
 kernel/sched_fair.c   |   88 ++--
 kernel/sysctl.c       |   18 +++
 4 files changed, 330 insertions(+), 39 deletions(-)

Index: current/include/linux/sched.h
===================================================================
---
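To make the formula concrete, here is a small userspace sketch that applies W(K,i) = S(K) * NRCPUS * T(K,i) / T(K) to the 2-cpu example from the description above (A with 10 tasks on CPU0, B and C with 5 tasks each on CPU1, all shares 1024). The names `group_weight()`, `cpu_weight()` and the `demo_*` tables are assumptions made for illustration and are not taken from the patch.

```c
#define NR_GROUPS	3
#define NR_CPUS_DEMO	2

/* W(K,i) = S(K) * NRCPUS * T(K,i) / T(K); 0 for a group with no load. */
static unsigned long group_weight(unsigned long shares, unsigned int nr_cpus,
				  unsigned long load_on_cpu,
				  unsigned long total_load)
{
	if (!total_load)
		return 0;
	return shares * nr_cpus * load_on_cpu / total_load;
}

/* Shares and per-cpu task load of groups A, B and C from the example. */
static const unsigned long demo_shares[NR_GROUPS] = { 1024, 1024, 1024 };
static const unsigned long demo_load[NR_GROUPS][NR_CPUS_DEMO] = {
	{ 10 * 1024, 0 },	/* A: 10 tasks on CPU0 */
	{ 0, 5 * 1024 },	/* B: 5 tasks on CPU1 */
	{ 0, 5 * 1024 },	/* C: 5 tasks on CPU1 */
};

/* A cpu's weight is the sum of the group weights present on it. */
static unsigned long cpu_weight(unsigned int cpu)
{
	unsigned long sum = 0;
	unsigned int k, i;

	for (k = 0; k < NR_GROUPS; k++) {
		unsigned long total = 0;

		for (i = 0; i < NR_CPUS_DEMO; i++)
			total += demo_load[k][i];
		sum += group_weight(demo_shares[k], NR_CPUS_DEMO,
				    demo_load[k][cpu], total);
	}
	return sum;
}
```

Summing task weights gives 10240 on both cpus, so they look balanced. Under the group-weight formula, `cpu_weight(0)` is 2048 and `cpu_weight(1)` is 4096, so CPU1 shows up as twice as loaded and the balancer has a reason to move tasks.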
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
On Mon, 26 Nov 2007 19:31:38 PST, Greg KH wrote: > On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote: > > +/* > > + * Routines for reading of the iBFT data in a human readable fashion. > > + */ > > +ssize_t ibft_attr_show_initiator(struct ibft_kobject *entry, > > +struct ibft_attribute *attr, > > +char *buf) > > +{ > > + struct ibft_initiator *initiator = attr->initiator; > > + void *ibft_loc = entry->data->hdr; > > + char *str = buf; > > + > > + if (!initiator) > > + return 0; > > + > > + str += sprintf_ipaddr(str, "isns", initiator->isns_server); > > + str += sprintf_ipaddr(str, "slp", initiator->slp_server); > > + str += sprintf_ipaddr(str, "primary_radius_server", > > + initiator->pri_radius_server); > > + str += sprintf_ipaddr(str, "secondary_radius_server", > > + initiator->sec_radius_server); > > + str += sprintf_string(str, "itname", initiator->initiator_name_len, > > + (char *)ibft_loc + initiator->initiator_name_off); > > + str--; > > + > > + return str-buf; > > +} > > sysfs files have ONE VALUE PER FILE, not a whole bunch of different > things in a single file. Please fix this. The subparameters _are_ actually part of a single value, that value being associated with the initiator instance. Konrad is trying to implement a "work-alike" for what open firmware does. open-iscsi already has the ability to extract the same format bits from real OFW. See open-iscsi.git/utils/fwparam_ppc. > > > > + > > +ssize_t ibft_attr_show_nic(struct ibft_kobject *entry, > > + struct ibft_attribute *attr, > > + char *buf) > > +{ > > + struct ibft_nic *nic = attr->nic; > > + void *ibft_loc = entry->data->hdr; > > + char *str = buf; > > + > > + if (!nic) > > + return 0; > > + /* > > +* Assume dhcp if any non-zero portions of its address are set. 
> > +*/ > > + if (memcmp(nic->dhcp, nulls, sizeof(nic->dhcp))) { > > + str += sprintf_ipaddr(str, "dhcp", nic->dhcp); > > + } else { > > + str += sprintf_ipaddr(str, "ciaddr", nic->ip_addr); > > + str += sprintf_ipaddr(str, "giaddr", nic->gateway); > > + str += sprintf_ipaddr(str, "dnsaddr1", nic->primary_dns); > > + str += sprintf_ipaddr(str, "dnsaddr2", nic->secondary_dns); > > + } > > + if (nic->hostname_len) > > + str += sprintf_string(str, "hostname", nic->hostname_len, > > + (char *)ibft_loc + nic->hostname_off); > > + /* Cut off the comma. */ > > + str--; > > + > > + return str-buf; > > +} > > Same here. > > > +ssize_t ibft_attr_show_target(struct ibft_kobject *entry, > > + struct ibft_attribute *attr, > > + char *buf) > > +{ > > + struct ibft_tgt *tgt = attr->tgt; > > + void *ibft_loc = entry->data->hdr; > > + char *str = buf; > > + int i; > > + > > + if (!tgt) > > + return 0; > > + > > + str += sprintf_ipaddr(str, "siaddr", tgt->ip_addr); > > + str += sprintf(str, "iport=%d,", tgt->port); > > + str += sprintf(str, "ilun="); > > + for (i = 0; i < 8; i++) > > + str += sprintf(str, "%x", (u8)tgt->lun[i]); > > + str += sprintf(str, ","); > > + > > + if (tgt->tgt_name_len) > > + str += sprintf_string(str, "iname", tgt->tgt_name_len, > > + (void *)ibft_loc + tgt->tgt_name_off); > > + > > + if (tgt->chap_name_len) > > + str += sprintf_string(str, "chapid", tgt->chap_name_len, > > + (char *)ibft_loc + tgt->chap_name_off); > > + if (tgt->chap_secret_len) > > + str += sprintf_string(str, "chappw", tgt->chap_secret_len, > > + (char *)ibft_loc + tgt->chap_secret_off); > > + if (tgt->rev_chap_name_len) > > + str += sprintf_string(str, "ichapid", tgt->rev_chap_name_len, > > + (char *)ibft_loc + tgt->rev_chap_name_off); > > + if (tgt->rev_chap_secret_len) > > + str += sprintf_string(str, "ichappw", tgt->rev_chap_secret_len, > > + (char *)ibft_loc + tgt->rev_chap_secret_off); > > + > > + /* Cut off the comma. 
*/ > > + str--; > > + > > + return str-buf; > > +} > > Same here, are we writing a novella or something to userspace? :) Yep. Just like real OFW. > > > +ssize_t ibft_attr_show_disk(struct ibft_kobject *dev, > > + struct ibft_attribute *ibft_attr, > > + char *buf) > > +{ > > + char *str = buf; > > + > > + str += sprintf(str, "//[EMAIL PROTECTED],%d:iscsi,", dev->data->index); > > + str += ibft_attr_show_initiator(dev, ibft_attr, str); > > + str += sprintf(str, ","); > > + str += ibft_attr_show_target(dev, ibft_attr, str); > > + str += sprintf(str, ","); > > + str += ibft_attr_show_nic(dev, ibft_attr, str); > > + > > + return str-buf; > > +} > > And here, do I need to go
[Patch 4/5] sched: introduce a mutex and corresponding API to serialize access to doms_cur[] array
The doms_cur[] array represents the various scheduling domains, which
are mutually exclusive. Currently the cpusets code can modify this array
(by calling partition_sched_domains()) as a result of a user modifying
the sched_load_balance flag for various cpusets.

This patch introduces a mutex and a corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read
the doms_cur[] array without worrying about concurrent modifications to
the array.

The fair group scheduler code (introduced in the next patch of this
series) makes use of this mutex to walk through the doms_cur[] array
while rebalancing shares of task groups across cpus.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   19 +++
 1 files changed, 19 insertions(+)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -186,6 +186,9 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
  */
 static DEFINE_MUTEX(task_group_mutex);
 
+/* doms_cur_mutex serializes access to doms_cur[] array */
+static DEFINE_MUTEX(doms_cur_mutex);
+
 /* Default task group.
  *	Every task in system belong to this group at bootup.
  */
@@ -236,11 +239,23 @@ static inline void unlock_task_group_lis
 	mutex_unlock(&task_group_mutex);
 }
 
+static inline void lock_doms_cur(void)
+{
+	mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+	mutex_unlock(&doms_cur_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
 static inline void lock_task_group_list(void) { }
 static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6547,6 +6562,8 @@ void partition_sched_domains(int ndoms_n
 {
 	int i, j;
 
+	lock_doms_cur();
+
 	/* always unregister in case we don't destroy any domains */
 	unregister_sched_domain_sysctl();
 
@@ -6587,6 +6604,8 @@ match2:
 	ndoms_cur = ndoms_new;
 
 	register_sched_domain_sysctl();
+
+	unlock_doms_cur();
 }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)

--
Regards,
vatsa
Re: [PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead
On Nov 27, 2007 10:58 AM, Richard Knutsson <[EMAIL PROTECTED]> wrote:
...
> > +	print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
> > +			16, 1,
> > +			buf, len, 0);
> >
> Not important, but why use '0' instead of 'false'?

After reading http://lkml.org/lkml/2006/7/27/281, I agree with you.

This is a refreshed patch against the latest cryptodev tree.

Cc: Randy Dunlap <[EMAIL PROTECTED]>
Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---
 crypto/tcrypt.c |    9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 1e12b86..ae762c2 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -87,12 +87,11 @@ static char *check[] = {
 	"camellia", "seed", "salsa20", NULL
 };
 
-static void hexdump(unsigned char *buf, unsigned int len)
+static inline void hexdump(unsigned char *buf, unsigned int len)
 {
-	while (len--)
-		printk("%02x", *buf++);
-
-	printk("\n");
+	print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
+			16, 1,
+			buf, len, false);
 }
 
 static void tcrypt_complete(struct crypto_async_request *req, int err)
--
Denis Cheng
Re: time accounting problem (powerpc only?)
On Mon, Nov 26, 2007 at 05:23:13PM +0100, Johannes Berg wrote:
> Contrary to what I claimed later in the thread, my 64-bit powerpc box
> (quad-core G5) doesn't suffer from this problem.
>
> Does anybody have any idea? I don't even know how to debug it further.

I'll see if I can grab an appropriate machine tomorrow and have a look
at it. I think it's just an accounting bug, which is probably my
fault :)

Yours
Tony

  linux.conf.au    http://linux.conf.au/ || http://lca2008.linux.org.au/
  Jan 28 - Feb 02 2008    The Australian Linux Technical Conference!
[Patch 3/5 v2] sched: change how cpu load is calculated
This patch changes how the cpu load exerted by fair_sched_class tasks is calculated. Load exerted by fair_sched_class tasks on a cpu is now a summation of the group weights, rather than summation of task weights. Weight exerted by a group on a cpu is dependent on the shares allocated to it. This version of patch (v2 of Patch 3/5) has a minor impact on code size (but should have no runtime/functional impact) for !CONFIG_FAIR_GROUP_SCHED case, but the overall code, IMHO, is neater compared to v1 of Patch 3/5 (because of lesser #ifdefs). I prefer v2 of Patch 3/5. Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]> --- kernel/sched.c | 27 +++ kernel/sched_fair.c | 31 +++ kernel/sched_rt.c |2 ++ 3 files changed, 40 insertions(+), 20 deletions(-) Index: current/kernel/sched.c === --- current.orig/kernel/sched.c +++ current/kernel/sched.c @@ -870,6 +870,16 @@ iter_move_one_task(struct rq *this_rq, i struct rq_iterator *iterator); #endif +static inline void inc_cpu_load(struct rq *rq, unsigned long load) +{ + update_load_add(>load, load); +} + +static inline void dec_cpu_load(struct rq *rq, unsigned long load) +{ + update_load_sub(>load, load); +} + #include "sched_stats.h" #include "sched_idletask.c" #include "sched_fair.c" @@ -880,26 +890,14 @@ iter_move_one_task(struct rq *this_rq, i #define sched_class_highest (_sched_class) -static inline void inc_load(struct rq *rq, const struct task_struct *p) -{ - update_load_add(>load, p->se.load.weight); -} - -static inline void dec_load(struct rq *rq, const struct task_struct *p) -{ - update_load_sub(>load, p->se.load.weight); -} - static void inc_nr_running(struct task_struct *p, struct rq *rq) { rq->nr_running++; - inc_load(rq, p); } static void dec_nr_running(struct task_struct *p, struct rq *rq) { rq->nr_running--; - dec_load(rq, p); } static void set_load_weight(struct task_struct *p) @@ -4071,10 +4069,8 @@ void set_user_nice(struct task_struct *p goto out_unlock; } on_rq = p->se.on_rq; - if (on_rq) { + if (on_rq) 
dequeue_task(rq, p, 0); - dec_load(rq, p); - } p->static_prio = NICE_TO_PRIO(nice); set_load_weight(p); @@ -4084,7 +4080,6 @@ void set_user_nice(struct task_struct *p if (on_rq) { enqueue_task(rq, p, 0); - inc_load(rq, p); /* * If the task increased its priority or is running and * lowered its priority, then reschedule its CPU: Index: current/kernel/sched_fair.c === --- current.orig/kernel/sched_fair.c +++ current/kernel/sched_fair.c @@ -755,15 +755,26 @@ static inline struct sched_entity *paren static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup) { struct cfs_rq *cfs_rq; - struct sched_entity *se = >se; + struct sched_entity *se = >se, *topse = NULL; + int incload = 1; for_each_sched_entity(se) { - if (se->on_rq) + topse = se; + if (se->on_rq) { + incload = 0; break; + } cfs_rq = cfs_rq_of(se); enqueue_entity(cfs_rq, se, wakeup); wakeup = 1; } + /* +* Increment cpu load if we just enqueued the first task of a group on +* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs +* at the highest grouping level. +*/ + if (incload) + inc_cpu_load(rq, topse->load.weight); } /* @@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep) { struct cfs_rq *cfs_rq; - struct sched_entity *se = >se; + struct sched_entity *se = >se, *topse = NULL; + int decload = 1; for_each_sched_entity(se) { + topse = se; cfs_rq = cfs_rq_of(se); dequeue_entity(cfs_rq, se, sleep); /* Don't dequeue parent if it has other entities besides us */ - if (cfs_rq->load.weight) + if (cfs_rq->load.weight) { + if (parent_entity(se)) + decload = 0; break; + } sleep = 1; } + /* +* Decrement cpu load if we just dequeued the last task of a group on +* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs +* at the highest grouping level. +*/ + if (decload) + dec_cpu_load(rq, topse->load.weight); } /* Index: current/kernel/sched_rt.c
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
On Monday 26 November 2007 17:15:44 Roland Dreier wrote:
>  > Except C doesn't have namespaces and this mechanism doesn't create them.
>  > So this is just complete and utter makework; as I said before, no one's
>  > going to confuse all those udp_* functions if they're not in the udp
>  > namespace.
>
> I don't understand why you're so opposed to organizing the kernel's
> exported symbols in a more self-documenting way.

No, I was the one who moved exports near their declarations. That's
organised. I just don't see how this new "organization" will help: oh
good, I won't accidentally use the udp functions any more?!?

> It seems pretty
> clear to me that having a mechanism that requires modules to make
> explicit which (semi-)internal APIs makes reviewing easier

Perhaps you've got lots of patches where people are using internal APIs
they shouldn't?

> , makes it
> easier to communicate "please don't use that API" to module authors,

Well, introduce an EXPORT_SYMBOL_INTERNAL(). It's a lot less code. But
you'd still need to show that people are having trouble knowing what
APIs to use.

> and takes at least a small step towards bringing the kernel's exported
> API under control.

There is no "exported API" to bring under control. There are symbols we
expose for the kernel's own use which can be used by external modules at
their own risk.

> What's the real downside?

No. That's the wrong question. What's the real upside?

Let's not put code in the core because "it doesn't seem to hurt".

I'm sure you think there's a real problem, but I'm still waiting for
someone to *show* it to me. Then we can look at solutions.

Rusty.
[Patch 3/5 v1] sched: change how cpu load is calculated
This patch changes how the cpu load exerted by fair_sched_class tasks is calculated. Load exerted by fair_sched_class tasks on a cpu is now a summation of the group weights, rather than summation of task weights. Weight exerted by a group on a cpu is dependent on the shares allocated to it. This version of patch (v1 of Patch 3/5) has zero impact for !CONFIG_FAIR_GROUP_SCHED case. Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]> --- kernel/sched.c | 38 ++ kernel/sched_fair.c | 31 +++ kernel/sched_rt.c |2 ++ 3 files changed, 59 insertions(+), 12 deletions(-) Index: current/kernel/sched.c === --- current.orig/kernel/sched.c +++ current/kernel/sched.c @@ -870,15 +870,25 @@ iter_move_one_task(struct rq *this_rq, i struct rq_iterator *iterator); #endif -#include "sched_stats.h" -#include "sched_idletask.c" -#include "sched_fair.c" -#include "sched_rt.c" -#ifdef CONFIG_SCHED_DEBUG -# include "sched_debug.c" -#endif +#ifdef CONFIG_FAIR_GROUP_SCHED -#define sched_class_highest (_sched_class) +static inline void inc_cpu_load(struct rq *rq, unsigned long load) +{ + update_load_add(>load, load); +} + +static inline void dec_cpu_load(struct rq *rq, unsigned long load) +{ + update_load_sub(>load, load); +} + +static inline void inc_load(struct rq *rq, const struct task_struct *p) { } +static inline void dec_load(struct rq *rq, const struct task_struct *p) { } + +#else /* CONFIG_FAIR_GROUP_SCHED */ + +static inline void inc_cpu_load(struct rq *rq, unsigned long load) { } +static inline void dec_cpu_load(struct rq *rq, unsigned long load) { } static inline void inc_load(struct rq *rq, const struct task_struct *p) { @@ -890,6 +900,18 @@ static inline void dec_load(struct rq *r update_load_sub(>load, p->se.load.weight); } +#endif /* CONFIG_FAIR_GROUP_SCHED */ + +#include "sched_stats.h" +#include "sched_idletask.c" +#include "sched_fair.c" +#include "sched_rt.c" +#ifdef CONFIG_SCHED_DEBUG +# include "sched_debug.c" +#endif + +#define sched_class_highest (_sched_class) + static 
void inc_nr_running(struct task_struct *p, struct rq *rq) { rq->nr_running++; Index: current/kernel/sched_fair.c === --- current.orig/kernel/sched_fair.c +++ current/kernel/sched_fair.c @@ -755,15 +755,26 @@ static inline struct sched_entity *paren static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup) { struct cfs_rq *cfs_rq; - struct sched_entity *se = >se; + struct sched_entity *se = >se, *topse = NULL; + int incload = 1; for_each_sched_entity(se) { - if (se->on_rq) + topse = se; + if (se->on_rq) { + incload = 0; break; + } cfs_rq = cfs_rq_of(se); enqueue_entity(cfs_rq, se, wakeup); wakeup = 1; } + /* +* Increment cpu load if we just enqueued the first task of a group on +* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs +* at the highest grouping level. +*/ + if (incload) + inc_cpu_load(rq, topse->load.weight); } /* @@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep) { struct cfs_rq *cfs_rq; - struct sched_entity *se = >se; + struct sched_entity *se = >se, *topse = NULL; + int decload = 1; for_each_sched_entity(se) { + topse = se; cfs_rq = cfs_rq_of(se); dequeue_entity(cfs_rq, se, sleep); /* Don't dequeue parent if it has other entities besides us */ - if (cfs_rq->load.weight) + if (cfs_rq->load.weight) { + if (parent_entity(se)) + decload = 0; break; + } sleep = 1; } + /* +* Decrement cpu load if we just dequeued the last task of a group on +* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs +* at the highest grouping level. 
+*/ + if (decload) + dec_cpu_load(rq, topse->load.weight); } /* Index: current/kernel/sched_rt.c === --- current.orig/kernel/sched_rt.c +++ current/kernel/sched_rt.c @@ -31,6 +31,7 @@ static void enqueue_task_rt(struct rq *r list_add_tail(>run_list, array->queue + p->prio); __set_bit(p->prio, array->bitmap); + inc_cpu_load(rq, p->se.load.weight); } /* @@ -45,6 +46,7 @@ static void dequeue_task_rt(struct rq *r list_del(>run_list); if
[Patch 2/5] sched: minor fixes for group scheduler
Minor bug fixes for group scheduler: - Use a mutex to serialize add/remove of task groups and also when changing shares of a task group. Use the same mutex when printing cfs_rq stats for various task groups. - Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when walking task group list) Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]> --- kernel/sched.c | 34 ++ kernel/sched_fair.c |4 +++- 2 files changed, 29 insertions(+), 9 deletions(-) Index: current/kernel/sched.c === --- current.orig/kernel/sched.c +++ current/kernel/sched.c @@ -169,8 +169,6 @@ struct task_group { /* runqueue "owned" by this group on each cpu */ struct cfs_rq **cfs_rq; unsigned long shares; - /* spinlock to serialize modification to shares */ - spinlock_t lock; struct rcu_head rcu; }; @@ -182,6 +180,12 @@ static DEFINE_PER_CPU(struct cfs_rq, ini static struct sched_entity *init_sched_entity_p[NR_CPUS]; static struct cfs_rq *init_cfs_rq_p[NR_CPUS]; +/* + * task_group_mutex serializes add/remove of task groups and also changes to + * a task group's cpu shares. + */ +static DEFINE_MUTEX(task_group_mutex); + /* Default task group. * Every task in system belong to this group at bootup. 
*/ @@ -222,9 +226,21 @@ static inline void set_task_cfs_rq(struc p->se.parent = task_group(p)->se[cpu]; } +static inline void lock_task_group_list(void) +{ + mutex_lock(_group_mutex); +} + +static inline void unlock_task_group_list(void) +{ + mutex_unlock(_group_mutex); +} + #else static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { } +static inline void lock_task_group_list(void) { } +static inline void unlock_task_group_list(void) { } #endif /* CONFIG_FAIR_GROUP_SCHED */ @@ -6747,7 +6763,6 @@ void __init sched_init(void) se->parent = NULL; } init_task_group.shares = init_task_group_load; - spin_lock_init(_task_group.lock); #endif for (j = 0; j < CPU_LOAD_IDX_MAX; j++) @@ -6987,14 +7002,15 @@ struct task_group *sched_create_group(vo se->parent = NULL; } + tg->shares = NICE_0_LOAD; + + lock_task_group_list(); for_each_possible_cpu(i) { rq = cpu_rq(i); cfs_rq = tg->cfs_rq[i]; list_add_rcu(_rq->leaf_cfs_rq_list, >leaf_cfs_rq_list); } - - tg->shares = NICE_0_LOAD; - spin_lock_init(>lock); + unlock_task_group_list(); return tg; @@ -7040,10 +7056,12 @@ void sched_destroy_group(struct task_gro struct cfs_rq *cfs_rq = NULL; int i; + lock_task_group_list(); for_each_possible_cpu(i) { cfs_rq = tg->cfs_rq[i]; list_del_rcu(_rq->leaf_cfs_rq_list); } + unlock_task_group_list(); BUG_ON(!cfs_rq); @@ -7117,7 +7135,7 @@ int sched_group_set_shares(struct task_g { int i; - spin_lock(>lock); + lock_task_group_list(); if (tg->shares == shares) goto done; @@ -7126,7 +7144,7 @@ int sched_group_set_shares(struct task_g set_se_shares(tg->se[i], shares); done: - spin_unlock(>lock); + unlock_task_group_list(); return 0; } Index: current/kernel/sched_fair.c === --- current.orig/kernel/sched_fair.c +++ current/kernel/sched_fair.c @@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq( /* Iterate thr' all leaf cfs_rq's on a runqueue */ #define for_each_leaf_cfs_rq(rq, cfs_rq) \ - list_for_each_entry(cfs_rq, >leaf_cfs_rq_list, leaf_cfs_rq_list) + 
list_for_each_entry_rcu(cfs_rq, >leaf_cfs_rq_list, leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? */ static inline int @@ -1126,7 +1126,9 @@ static void print_cfs_stats(struct seq_f #ifdef CONFIG_FAIR_GROUP_SCHED print_cfs_rq(m, cpu, _rq(cpu)->cfs); #endif + lock_task_group_list(); for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) print_cfs_rq(m, cpu, cfs_rq); + unlock_task_group_list(); } #endif -- Regards, vatsa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
> >
> > sysfs files have ONE VALUE PER FILE, not a whole bunch of different
> > things in a single file.  Please fix this.
>
> The subparameters _are_ actually part of a single value, that value being
> associated with the initiator instance.
>
> Konrad is trying to implement a "work-alike" for what open firmware does.
> open-iscsi already has the ability to extract the same format
> bits from real OFW.
>
> See open-iscsi.git/utils/fwparam_ppc.

Greg,

In light of what Doug says (which is all true), should I go ahead with a
new version of this module which would export one value per file?

The problem that will be encountered is that an ethernetX sysfs
directory would have (for example):

/sys/firmware/ibft/ethernet0/pci-bdf      5:1:0
/sys/firmware/ibft/ethernet0/mac          00:11:25:9d:8b:00
/sys/firmware/ibft/ethernet0/vlan         0
/sys/firmware/ibft/ethernet0/gateway      192.168.79.254
/sys/firmware/ibft/ethernet0/origin       0
/sys/firmware/ibft/ethernet0/subnet-mask  22
/sys/firmware/ibft/ethernet0/ip-addr      192.168.77.41
/sys/firmware/ibft/ethernet0/flags        7

And the flags file would contain the value "7", which would mean the
user would have to parse what each bit means? (v0.3 of the module does
not export this flag, but uses it to figure out which is the boot iSCSI
target.)
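For comparison with the comma-separated strings built in v0.3, a one-value-per-file layout means each attribute gets its own small formatting routine. The following is a userspace sketch only: `struct nic_info` and the `show_*()` functions are hypothetical and just mimic the shape of a sysfs show routine (fill a buffer with one value and a newline, return the length). They are not the module's code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical subset of the NIC data parsed from the iBFT. */
struct nic_info {
	unsigned char mac[6];
	unsigned int vlan;
	unsigned int flags;
};

/* Each routine writes exactly one value, followed by a newline. */
static int show_mac(const struct nic_info *nic, char *buf, size_t len)
{
	return snprintf(buf, len, "%02x:%02x:%02x:%02x:%02x:%02x\n",
			nic->mac[0], nic->mac[1], nic->mac[2],
			nic->mac[3], nic->mac[4], nic->mac[5]);
}

static int show_vlan(const struct nic_info *nic, char *buf, size_t len)
{
	return snprintf(buf, len, "%u\n", nic->vlan);
}

/* The raw flags value is exported as-is; decoding is left to the reader. */
static int show_flags(const struct nic_info *nic, char *buf, size_t len)
{
	return snprintf(buf, len, "%u\n", nic->flags);
}
```

Reading the hypothetical `mac` attribute for the example above would yield the single line `00:11:25:9d:8b:00`, and `flags` would yield `7`, which is the parsing burden the question is about.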
[Patch 1/5] sched: code cleanup
Minor cleanups:

	- Fix coding style
	- remove obsolete comment

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   21 +++------------------
 1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
 };
 
 #ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD	2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD	2*NICE_0_LOAD
 #else
-# define INIT_TASK_GRP_LOAD	NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD	NICE_0_LOAD
 #endif
 
-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
 static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
 	update_load_add(&rq->load, p->se.load.weight);

-- 
Regards,
vatsa
Re: [PATCH 1/1] mm: add dirty_highmem option
On Thu, 22 Nov 2007 14:42:04 +1100 Bron Gondwana <[EMAIL PROTECTED]> wrote:

>  /*
> + * free highmem will not be subtracted from the total free memory
> + * for calculating free ratios if vm_dirty_highmem is true
> + */
> +int vm_dirty_highmem;

One would expect that setting dirty_highmem to true would cause highmem to
be accounted in dirty-memory calculations. However with this change reality
is in fact the inverse of that.

So how about this?

 Documentation/filesystems/proc.txt |    4 ++--
 mm/page-writeback.c                |    8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff -puN Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix
+++ a/Documentation/filesystems/proc.txt
@@ -1265,8 +1265,8 @@ Contains, as a boolean, a switch to allo
 part of the "available" memory against which the dirty ratios will be
 applied.
 
-Setting this to 1 can be useful on 32 bit machines where you want to make
-random changes within an MMAPed file that is larger than your available
+Setting this to 0 (false) can be useful on 32 bit machines where you wish to
+make random changes within an MMAPed file that is larger than your available
 lowmem, however it is potentially dangerous and has serious bounce-buffer
 issues.
 
diff -puN mm/page-writeback.c~mm-add-dirty_highmem-option-fix mm/page-writeback.c
--- a/mm/page-writeback.c~mm-add-dirty_highmem-option-fix
+++ a/mm/page-writeback.c
@@ -69,10 +69,10 @@ static inline long sync_writeback_pages(
 int dirty_background_ratio = 5;
 
 /*
- * free highmem will not be subtracted from the total free memory
- * for calculating free ratios if vm_dirty_highmem is true
+ * free highmem will be subtracted from the total free memory for calculating
+ * free ratios if vm_dirty_highmem is true
  */
-int vm_dirty_highmem;
+int vm_dirty_highmem = 1;
 
 /*
  * The generator of dirty data starts writeback at this percentage
@@ -293,7 +293,7 @@ static unsigned long determine_dirtyable
 	x = global_page_state(NR_FREE_PAGES)
 		+ global_page_state(NR_INACTIVE)
 		+ global_page_state(NR_ACTIVE);
-	if (!vm_dirty_highmem)
+	if (vm_dirty_highmem)
 		x -= highmem_dirtyable_memory(x);
 	return x + 1;	/* Ensure that we never return 0 */
 }
_

(I dropped the already-merged part of your patch)

(I fixed a build error in kernel/sysctl.c: "one" was defined twice when
suitable config options were set).

(It's an unpleasing patch, btw. But it's an unpleasant problem and at least
this way people can tell us "hey, I did and it started to work")
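For anyone trying to follow which way the switch points after the fix-up, here is a rough userspace model of the last hunk, not the kernel function itself. The name dirtyable_pages() and its parameters are made up for illustration, and clamping highmem to the total is an assumption about what highmem_dirtyable_memory() does:

```c
#include <assert.h>

/*
 * Hypothetical model of the patched determine_dirtyable_memory():
 * the total is free + inactive + active pages, and the highmem share
 * is subtracted only when the switch is non-zero, mirroring the
 * "if (vm_dirty_highmem)" line in the hunk above.
 */
unsigned long dirtyable_pages(unsigned long nr_free,
                              unsigned long nr_inactive,
                              unsigned long nr_active,
                              unsigned long highmem,
                              int vm_dirty_highmem)
{
    unsigned long x = nr_free + nr_inactive + nr_active;

    assert(vm_dirty_highmem == 0 || vm_dirty_highmem == 1);

    /* Assumption: the highmem share can never exceed the total. */
    if (highmem > x)
        highmem = x;

    if (vm_dirty_highmem)
        x -= highmem;

    return x + 1;   /* Ensure that we never return 0 */
}
```

With 600 pages in total and 400 of them in highmem, the model returns 201 when the switch is 1 and 601 when it is 0, which is the difference the proc.txt text is describing.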
[Patch 0/5] sched: group scheduler related patches (V4)
On Mon, Nov 26, 2007 at 09:28:36PM +0100, Ingo Molnar wrote:
> the first SCHED_RR priority is 1, not 0 - so this call will always fail.

Thanks for spotting this bug and for the rest of your review comments.

Here's V4 of the patchset, aimed at improving fairness of cpu bandwidth
allocation for task groups.

Changes since V3 (http://marc.info/?l=linux-kernel&m=119605252303359):

- Fix bug in setting SCHED_RR priority for load_balance_monitor thread
- Fix coding style related issues
- Separate "introduction of lock_doms_cur() API" into a separate patch

I have also tested this patchset against your latest git tree as of this
morning.

Please apply if there are no major concerns.

-- 
Regards,
vatsa
[PATCH] [libata] Set proper ATA UDMA mode for bf548 according to system clock.
UDMA Mode - Frequency compatibility

UDMA5 - 100 MB/s   - SCLK  = 133 MHz
UDMA4 - 66 MB/s    - SCLK >=  80 MHz
UDMA3 - 44.4 MB/s  - SCLK >=  50 MHz
UDMA2 - 33 MB/s    - SCLK >=  40 MHz

Signed-off-by: Sonic Zhang <[EMAIL PROTECTED]>
---
 drivers/ata/pata_bf54x.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/ata/pata_bf54x.c b/drivers/ata/pata_bf54x.c
index 81db405..088a41f 100644
--- a/drivers/ata/pata_bf54x.c
+++ b/drivers/ata/pata_bf54x.c
@@ -1489,6 +1489,8 @@ static int __devinit bfin_atapi_probe(st
 	int board_idx = 0;
 	struct resource *res;
 	struct ata_host *host;
+	unsigned int fsclk = get_sclk();
+	int udma_mode = 5;
 	const struct ata_port_info *ppi[] =
 		{ &bfin_port_info[board_idx], NULL };
 
@@ -1507,6 +1509,11 @@ static int __devinit bfin_atapi_probe(st
 	if (res == NULL)
 		return -EINVAL;
 
+	while (bfin_port_info[board_idx].udma_mask>0 && udma_fsclk[udma_mode] > fsclk) {
+		udma_mode--;
+		bfin_port_info[board_idx].udma_mask >>= 1;
+	}
+
 	/*
 	 * Now that that's out of the way, wire up the port..
 	 */
-- 
1.4.3.4
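The mask-trimming loop in this patch can be tried outside the driver with a small userspace sketch. This is only an illustration: limit_udma_mask() and udma_min_sclk[] are invented names, the real udma_fsclk[] table is not shown in the patch, and the entries for modes 0 and 1 below are placeholders. Like the patch, it assumes the top bit of the incoming mask stands for UDMA5.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical table: minimum system clock (Hz) per UDMA mode.
 * Modes 2..5 follow the commit message; modes 0 and 1 are placeholders.
 */
static const unsigned int udma_min_sclk[6] = {
    0u, 0u, 40000000u, 50000000u, 80000000u, 133000000u
};

/*
 * Walk down from UDMA5 until the clock is fast enough for the current
 * mode, dropping the top bit of the mask for every mode given up.
 * Returns the highest usable mode, or -1 if the mask ends up empty.
 */
int limit_udma_mask(unsigned int *mask, unsigned int sclk)
{
    int mode = 5;

    assert(mask != NULL);
    *mask &= 0x3fu;     /* only modes 0..5 are modelled here */

    while (*mask > 0 && udma_min_sclk[mode] > sclk) {
        mode--;
        *mask >>= 1;
    }
    return *mask ? mode : -1;
}
```

Starting from a mask of 0x3f, a 100 MHz clock leaves 0x1f (UDMA4 and below) and a 45 MHz clock leaves 0x07 (UDMA2 and below). Because the mask is checked first, the index never drops below 0 even for very slow clocks.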
Re: Dynticks Causing High Context Switch Rate in ksoftirqd
[EMAIL PROTECTED] wrote:
> Question: Why is ksoftirqd eating about 5 to 10 percent of my CPU on an
> idle system? The problem occurs if I config the kernel with tickless
> support (i.e. CONFIG_TICK_ONESHOT=y). (Thanks to "oprofile" for putting
> me onto this.)
>
> I have noted this same problem on kernel versions: 2.6.23.1, 2.6.23.8 and
> 2.6.23.9
>
> *** Output from "vmstat -n 1 10" -- Note very high context switch rate ***
> *** This is on a idle machine! ***
>
> procs ---memory-- ---swap-- -io --system-- cpu
>  r b swpd free buff cache si so bi bo in cs us sy id wa
>  0 0 0 1925556 4768 11610400 124 26 7538 1 2 96 1
>  0 0 0 1925556 4768 11610400 0 02 147329 0 1 99 0

What did oprofile show? It should be able to narrow down what function(s)
are responsible for the CPU usage..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: [2.6 patch] remove CONFIG_EXPERIMENTAL
On Mon, Nov 26, 2007 at 10:44:44PM -0500, [EMAIL PROTECTED] wrote:

 > I suspect that given the "once it escapes, it's cast in stone" view we take
 > towards user-visible API/etc, there isn't much *real* room for an
 > 'EXPERIMENTAL' flag anymore. Most of the usage should probably be confined
 > to individual drivers, where all we should need is a 'default n' and
 > suitable warning verbiage in the Kconfig file warning about the driver
 > eating your filesystems and small animals for breakfast.

Potential corruptors are usually flagged with (DANGEROUS) in the text,
(One may argue that they shouldn't have escaped -mm)

 > We certainly shouldn't have
 > one big flag for *all* in-progress drivers - I don't need to accidentally
 > enable a busticated ethernet driver because I want a USB widget.

So no ethernet driver at all is better than a broken but mostly
working one? Again if it isn't mostly working, it shouldn't have
escaped -mm

	Dave

-- 
http://www.codemonkey.org.uk
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
.. snip.. > > +#else > > +static void __init reserve_ibft_region(void) { }; > > No ending ; above. Fixed. > ..snip.. > > +static void __init reserve_ibft_region(void) { }; > > Ditto. Fixed. .. snip.. > > +#include > > + > > No blank line here, please. Why that creeps back in the code I am not sure myself. In your first review you mentioned this, I fixed it in my tree, and now it is back!? Either way, it is fixed. > > > +#include ..snip.. > > + printk(KERN_INFO \ > > Looks like this should use KERN_ERROR or KERN_WARNING? Yes! Thanks for catching that. > > > + "error, in IBFT structure (%s) expected %d but" \ > > + return str-buf; > > preferred form: > return str - buf; Fixed. > ..snip.. > > + > > + return str-buf; > > Ditto. Fixed. > ..snip.. > > + return str-buf; > > Ditto. Fixed. ..snip.. > > + return str-buf; > > Ditto. Fixed. ..snip.. > > + int len = 6; > > Could you just use ETH_ALEN instead of and 6? > and #include Yes. That makes much more sense. > > Or add a define for IBFT_ALEN (of 6) and use that? Either one works. The first suggestion is much better. ..snip.. > > > + > > + /* Based on the header index value find the data tuple, > > + if possibly. */ > > if possible. */ > > or better: > /* >* Based on the header index value, find the data tuple >* if possible. >*/ Yes, much more understandable - and now that I read I realized this was not a proper assumption. One of the data structures (struct ibft_tgt) has a 'nic_assoc' value which makes a N-to-1 mapping to the NIC data structure, so this will re-work. Thanks for catching a bug that early in the cycle! ..snip.. > > + struct carry it for convience. */ > > convenience. Fixed. ..snip.. > > + * Scan the IBFT table structure for the NIC and Target fields. When > > + * found add them on the passed in list. > > passed-in list. Fixed. > > > + */ > > +static int ibft_scan_device(struct ibft_table_header *header, > > + struct list_head *list) > > +{ > > + > > + /* We can have multiple NICs and multiple targets. 
The index in > > + their header defines their 1-to-1 correlation. Not true. I will have to re-work this code to do a 1-to-N correlation. > > + */ > > + for (ptr = >nic0_off; ptr <= end; ptr += sizeof(u16)) { > > In many searches, would be the first address beyond the end of the > table, so the loop-terminating condition test would be: > > ptr < end; Yes. That is correct. It did actually check the next offset, which fortunately had nothing in it. > > It looks like that should be the case here also To check the offset to make sure it is within the full IBFT data structure? Yes, that is a good check - will implement. > ..snip.. > > + if (rc) break; > > break; > on a separate line. > > Did you check this patch with scripts/checkpatch.pl ? Yes. I ran it with check-patch-0.99.pl that I downloaded somewhere from Dave Jones web page. I hadn't realized that its home is now in scripts/checkpatch.pl - will make sure to use that improved-new version. > ..snip.. > > + printk(KERN_INFO "iBFT detected at 0x%lx.\n", > > + (unsigned long)ibft_phys); > > Use %p to print pointer values. This is actually not a pointer yet. It is a true physical address which I thought might be useful for troubleshooting purposes. > ..snip. > > + if (!rc) > > + return rc; > > Can't this always just be > return 0; > ? Yes, I was thinking that perhaps a more nicer way was to do "goto end;" where the end label is just "return rc;" But this definitely trumps it. > > > + ..snip.. 
> > + > > +struct ibft_tgt { > > + struct ibft_hdr hdr; > > + char ip_addr[16]; > > + u16 port; > > + char lun[8]; > > + u8 chap_type; > > + u8 nic_assoc; > > + u16 tgt_name_len; > > + u16 tgt_name_off; > > + u16 chap_name_len; > > + u16 chap_name_off; > > + u16 chap_secret_len; > > + u16 chap_secret_off; > > + u16 rev_chap_name_len; > > + u16 rev_chap_name_off; > > + u16 rev_chap_secret_len; > > + u16 rev_chap_secret_off; > > +} __attribute__((__packed__)); > > + > > +#if defined(CONFIG_ISCSI_IBFT) || defined(CONFIG_ISCSI_IBFT_MODULE) > > Why is this #if line here instead of nearer the top of this header file? My thought was that if other kernel users might want to include this header file they do not have to exposed to the semi-internal data structures of this header file. If that is not a concern then I think I can remove the conditional altogether. > > > +#define IBFT_SIGN "iBFT" > > +#define IBFT_SIGN_LEN 4 > > +#define IBFT_START 0x8 /* 512kB */ > > +#define IBFT_END 0x10 /* 1MB */ > > +#define VGA_MEM 0xA /* VGA buffer */ > > +#define VGA_SIZE 0x2 /* 132kB */ > > I'd say
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
On Monday 26 November 2007 22:31:38 Greg KH wrote: > On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote: > > +/* > > + * Routines for reading of the iBFT data in a human readable fashion. > > + */ > > +ssize_t ibft_attr_show_initiator(struct ibft_kobject *entry, > > +struct ibft_attribute *attr, > > +char *buf) > > +{ .. snip.. > > + > > + str += sprintf_ipaddr(str, "isns", initiator->isns_server); > > + str += sprintf_ipaddr(str, "slp", initiator->slp_server); .. snip .. > > sysfs files have ONE VALUE PER FILE, not a whole bunch of different > things in a single file. Please fix this. No problem. I will have that shortly posted. > > > + > > +ssize_t ibft_attr_show_nic(struct ibft_kobject *entry, > > + struct ibft_attribute *attr, > > + char *buf) .. snip.. > > + str += sprintf_ipaddr(str, "giaddr", nic->gateway); > > + str += sprintf_ipaddr(str, "dnsaddr1", nic->primary_dns); > > Same here. Yup. > > > +ssize_t ibft_attr_show_target(struct ibft_kobject *entry, > > + struct ibft_attribute *attr, > > + char *buf) > > +{ .. snip.. > > +} > > Same here, are we writing a novella or something to userspace? :) Hehe.. I will make it simpler :-) > > > +ssize_t ibft_attr_show_disk(struct ibft_kobject *dev, > > + struct ibft_attribute *ibft_attr, > > + char *buf) > > +{ .. snip .. > > +} > > And here, do I need to go on? I will have a new version posted quite shortly. > > > +ssize_t ibft_attr_show_mac(struct ibft_kobject *entry, > > + struct ibft_attribute *attr, > > + char *buf) > > +{ ..snip.. > > + > > + memcpy(buf, attr->nic->mac, len); > > + > > + return len; > > +} > > Is mac a user readable string? Then perhaps a simple sprintf would work > instead, as I doubt you are including a \n here... It was meant to be as a binary value. But that doesn't fit in sysfs directory, so let me make it use sprintf here. > > > +/* > > + * The main routine which allows the user to read the IBFT data. 
> > + */ > > +static ssize_t ibft_show_attribute(struct kobject *kobj, > > + struct attribute *attr, > > + char *buf) > > +{ ..snip.. > > + > > +static struct sysfs_ops ibft_attr_ops = { > > + .show = ibft_show_attribute, > > +}; > > I think this whole mess can go away in the new rework Kay and I have > done, please document this whole thing and I'll see what I can do. Absolutely. > > > +struct ibft_control { > > +struct ibft_hdr hdr; > > +u16 extensions; > > +u16 initiator_off; > > +u16 nic0_off; > > +u16 tgt0_off; > > +u16 nic1_off; > > +u16 tgt1_off; > > +} __attribute__((__packed__)); > > Did we loose tabs for some reason? I'm guessing your editor is not > showing them properly, nor did you use scripts/checkpatch.pl :( I did use checkpatch.pl v0.99 downloaded somewhere from the web. I hadn't realized it was now residing in scripts/checkpatch.pl - and from now on I will use that. > > > +#if defined(CONFIG_ISCSI_IBFT) || defined(CONFIG_ISCSI_IBFT_MODULE) ..snip.. > > +static ssize_t find_ibft(void) > > +{ ..snip.. > > +} > > What is a function (not even an inline one) doing in a .h file? I was not sure where to put it. This function (find_ibft) is used by the setup_[32|64].c and the iscsi_ibft.c code. Randy suggested I put in .c file, but I am not sure exactly where? Should I make a new file in called libs/iscsi_ibft_helper.c ? > ..snip.. > > +struct ibft_kobject { > > + struct ibft_data *data; > > + char name[IBFT_ISCSI_KOBJECT_MAX_LEN]; > > Why have this, > > > + u8 type; > > + struct kobject kobj; > > When the kobject itself has an unlimited size name associated with it? Absolutely no reason at all. It was a evolution vestige of the code that is not needed anymore. > ..snip.. > > + char name[IBFT_ISCSI_ATTR_MAX_LEN]; > > Same here, an attribute already has a pointer to a name, no need to have > another one in the same structure. Thanks. Will remove it. > > > + struct list_head node; > > +}; > > thanks, Thank you for taking your time to review the code. 
I will have the new version out shortly. > > greg k-h - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
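On Greg's point about the mac attribute: a text attribute would normally print the six bytes as a colon-separated string followed by a newline rather than memcpy() the raw bytes. A minimal userspace sketch of that formatting follows. format_mac() and MAC_LEN are illustrative names, not taken from the patch, and a real show() routine would write into the page-sized sysfs buffer instead of a caller-sized one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAC_LEN 6   /* same value as the kernel's ETH_ALEN */

/*
 * Format a 6-byte MAC address as "aa:bb:cc:dd:ee:ff\n" into buf.
 * Returns the number of characters written (not counting the NUL),
 * or -1 if buf is too small to hold the whole string.
 */
int format_mac(char *buf, size_t size, const unsigned char mac[MAC_LEN])
{
    int n;

    assert(buf != NULL && mac != NULL);

    n = snprintf(buf, size, "%02x:%02x:%02x:%02x:%02x:%02x\n",
                 mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    if (n < 0 || (size_t)n >= size)
        return -1;      /* output was truncated or failed */
    return n;
}
```

For the bytes 00 11 25 9d 8b 00 this writes 18 characters, the 17-character address plus the trailing newline that sysfs readers usually expect.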
Re: Error returns not handled correctly by sysfs.c:subsys_attr_store()
On Wed, 21 Nov 2007 15:16:59 -0700 Andrew Patterson <[EMAIL PROTECTED]> wrote: > The buf in fs/sysfs.c:subsys_attr_store() does not seem to be updated > correctly when returning a negative value (indicating that an error > condition has occurred) is returned. If a negative value is returned, > the next subsequent call to subsys_attr_store will have the contents of > buf appended to the previous call. subsys_attr_store() gets deleted by http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-01-driver/kset-kill-subsys-attr.patch So maybe we will soon accidentally fix whatever-this-is? Or maybe we will faithfully maintain it. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 05/14] percpu: Use a Kconfig variable to configure arch specific percpu setup
On Tuesday 27 November 2007 11:14:12 Christoph Lameter wrote: > The use of the __GENERIC_PERCPU is a bit problematic since arches > may want to run their own percpu setup while using the generic > percpu definitions. Replace it through a kconfig variable. Thanks for this Christoph! These patches are great: the early experiments are obviously over, and so this consolidation is overdue. Have you considered moving x86-64's setup_per_cpu_areas into generic code? It's a bit messier because some archs might not have set up NUMA stuff yet, but it's logically generic... Thanks! Rusty. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
On Monday 26 November 2007 16:58:08 Roland Dreier wrote: > > > I agree that we shouldn't make things too hard for out-of-tree > > > modules, but I disagree with your first statement: there clearly is a > > > large class of symbols that are used by multiple modules but which are > > > not generically useful -- they are only useful by a certain small > > > class of modules. > > > > If it is so clear, you should be able to easily provide examples? > > Sure -- Andi's example of symbols required only by TCP congestion > modules; Exactly. Why exactly should someone not write a new TCP congestion module? > the SCSI internals that Christoph wants to mark He didn't justify those though, either. > ; the symbols exported by my mlx4_core driver (which I admit are > currently only used > by the mlx4_ib driver, but which will also be used by at least the > ethernet NIC driver for the same hardware). Right. So presumably there will only ever be two drivers using this core code, so no new users will ever be written? Now we've found one use case, is it worth the complexity of namespaces? Is it worth the halfway point of export-to-module? What problem will it solve? > I thought this was > already covered repeatedly in the thread and indeed in Andi's code so > there was no need to repeat it... No, we've seen the solution and various people applying it. I'm still trying to discover the problem it's solving. Hope that helps, Rusty. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [rfc 08/45] cpu alloc: x86 support
Andi Kleen wrote: On Tuesday 20 November 2007 04:50, Christoph Lameter wrote: On Tue, 20 Nov 2007, Andi Kleen wrote: You could in theory move the modules, but then you would need to implement a full PIC dynamic linker for them first and also increase runtime overhead for them because they would need to use a GOT/PLT. On x86-64? The GOT/PLT should stay in cache due to temporal locality. The x86-64 instruction set itself handles GOT-relative addressing rather well; what's a 1% loss on x86 is like 0.01% on x86-64, so I'm thinking 100 times better? I think I got this by `-fpic -pie` compiling nbyte benchmark versus fixed position, each with and without on 32-bit (which made about a 1% difference) and on 64-bit (which made a 0.01% difference). It was a long time ago. Still, yeah I know. Complexity. (You have the ability to textrel these things too, and just rewrite non-PIC, depending on how you feel about that) -- Bring back the Firefox plushy! http://digg.com/linux_unix/Is_the_Firefox_plush_gone_for_good https://bugzilla.mozilla.org/show_bug.cgi?id=322367 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu
Neil Horman <[EMAIL PROTECTED]> writes: > Hey all- > I've been working on an issue lately involving multi socket x86_64 > systems connected via hypertransport bridges. It appears that some systems, > disable the hypertransport connections during a kdump operation when all but > the > crashing processor gets halted in machine_crash_shutdown. This becomes a > problem when the ioapic attempts to route interrupts to the only remaining > processor. Even though the active processor is targeted for interrupt > reception, the fact that the hypertransport connections are inactive result in > interrupts not getting delivered. The effective result is that timer > interrupts > are not delivered to the running cpu, and the system hangs on reboot into the > kdump kernel during calibrate_delay. I've found that I've been able to avoid > this hang, by forcing a transition to the bios defined boot cpu during the > crashing kernel shutdown. This patch accomplished that. Tested by myself and > the origional reporter with successful results. If you can get to calibrate_delay hypertransport is still routing traffic. Your diagnosis of the problem is wrong. Most likely it is just an ioapic programming error in restoring the system to PIC mode. I agree that there is a problem. The reliable fix is to totally skip the PIC interrupt mode and go directly to apic mode. To make the code kexec on panic code path reliable we need to remove code not add it. Frankly I think switching cpus is one of the least reliable things that we can do in general. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: unionfs: several more problems
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Mon, 26 Nov 2007, Erez Zadok wrote:
[...]
> > The small patch below fixed the problem. Let me know what you think.
>
> I've one issue with it: please move that wait_on_page_writeback before
> the clear_page_dirty_for_io instead of after it, then resubmit your 14/16.
[...]

Done, tested, and working. Here's the revised patch (pushed to unionfs.git
on korg).

Thanks,
Erez.

--

Unionfs: prevent multiple writers to lower_page

Without this patch, the LTP fs test "rwtest04" triggers a
BUG_ON(PageWriteback(page)) in fs/buffer.c:1706.

CC: Hugh Dickins <[EMAIL PROTECTED]>
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>

diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c
index 623a913..74f2e53 100644
--- a/fs/unionfs/mmap.c
+++ b/fs/unionfs/mmap.c
@@ -72,6 +72,7 @@ static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	BUG_ON(!lower_mapping->a_ops->writepage);
+	wait_on_page_writeback(lower_page); /* prevent multiple writers */
 	clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
 	err = lower_mapping->a_ops->writepage(lower_page, wbc);
 	if (err < 0)
Re: [PATCH] file capabilities: don't prevent signaling setuid root programs.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Serge,

I still feel a bit uneasy about this. Looking ahead, with filesystem
capabilities, one can simulate this same situation with a setuid
'non-root' program as follows:

[EMAIL PROTECTED] ~]$ cat > test.c
main()
{
	printf("sleeping (%u)\n", getpid());
	sleep(100);
	printf("woke up\n");
}
[EMAIL PROTECTED] ~]$ cc -o test test.c
[EMAIL PROTECTED] ~]$ chmod u+s ./test
[EMAIL PROTECTED] ~]$ ls -ltr test
- -rwsrwxr-x 1 morgan morgan 7090 Nov 26 20:01 test
[EMAIL PROTECTED] ~]$ setcap cap_net_raw+ep ~/test
[EMAIL PROTECTED] ~]$ getcap ~/test
/home/morgan/test = cap_net_raw+ep
[EMAIL PROTECTED] ~]$ su luser
Password:
[EMAIL PROTECTED] morgan]$ ./test
sleeping (5935)

[EMAIL PROTECTED] morgan]$ kill 5935
bash: kill: (5935) - Operation not permitted

Because of the euid=0 test, the piece of code you are adding will behave
differently in this situation. Is the root-behavior deserving of less
protection than this one? To my eye they seem equivalent.

Is there a compelling reason to include the euid==0 check?

Thanks

Andrew

Serge E. Hallyn wrote:
> This patch is needed to preserve legacy behavior when
> CONFIG_SECURITY_FILE_CAPABILITIES=y. Without this patch, xinit can't
> kill X, so manually starting X in runlevel 3 then exiting your window
> manager will not cause X to exit.
>
> thanks,
> -serge
>
>>From 81a6d780ad570f9a326fc27912ec0e373f5fa14f Mon Sep 17 00:00:00 2001
> From: Serge E. Hallyn <[EMAIL PROTECTED]>
> Date: Tue, 20 Nov 2007 08:47:35 +
> Subject: [PATCH] file capabilities: don't prevent signaling setuid root
>  programs.
>
> An unprivileged process must be able to kill a setuid root
> program started by the same user. This is legacy behavior
> needed for instance for xinit to kill X when the window manager
> exits.
>
> When an unprivileged user runs a setuid root program in !SECURE_NOROOT
> mode, fP, fI, and fE are set full on, so pP' and pE' are full on.
> Then cap_task_kill() prevents the user from signaling the setuid root
> task. This is a change in behavior compared to when
> !CONFIG_SECURITY_FILE_CAPABILITIES.
>
> This patch introduces a special check into cap_task_kill() just
> to check whether a non-root user is signaling a setuid root
> program started by the same user. If so, then signal is allowed.
>
> Changelog:
> 	Nov 26: move test up above CAP_KILL test as per Andrew
> 		Morgan's suggestion.
>
> Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
> ---
>  security/commoncap.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 302e8d0..5bc1895 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -526,6 +526,15 @@ int cap_task_kill(struct task_struct *p, struct siginfo *info,
>  	if (info != SEND_SIG_NOINFO && (is_si_special(info) || SI_FROMKERNEL(info)))
>  		return 0;
>  
> +	/*
> +	 * Running a setuid root program raises your capabilities.
> +	 * Killing your own setuid root processes was previously
> +	 * allowed.
> +	 * We must preserve legacy signal behavior in this case.
> +	 */
> +	if (p->euid == 0 && p->uid == current->uid)
> +		return 0;
> +
>  	/* sigcont is permitted within same session */
>  	if (sig == SIGCONT && (task_session_nr(current) == task_session_nr(p)))
>  		return 0;
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFHS5m/QheEq9QabfIRAmouAJkBBB0kXH57s9mvlgdG3XZhC0pZMwCfZUW3
L4vJUkR4tgAh33GTqEquIqw=
=sKCy
-----END PGP SIGNATURE-----
Re: profile code added to netif_receive_skb function
On Sun, 25 Nov 2007 21:46:26 PST, kernel coder said:
> hi,
>
> I have added some code to netif_receive_skb function. As linux kernel
> is multithreaded, there is no guarantee that my code is completely
> executed without being disturbed by any other process. Timer interrupt
> handler is an example of code which might interrupt execution of my
> code.

The trick is to write your code so it doesn't *matter* if other code runs.

For example - the timer interrupt almost certainly doesn't look at or modify
any of *your* code's variables. So for 98% of the kernel's code you don't
even have to *care* if it runs (as long as you aren't doing something
real-time or with similar response-time or throughput constraints).

And if you are worried about that other 2%, where related code, for example
the IRQ handler for a network interface, may have to look at and/or modify
some of your variables, that's when you should be using appropriate locking -
there's mutexes, semaphores, the whole RCU family, and more - none of which
I'll attempt to explain, because I'm not all that good at that stuff.

Basic rule of thumb - if you have something that will break if two things
access it at the same time, put a lock around it, so they take turns.
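The "put a lock around it" rule of thumb can be tried out in userspace. The sketch below is only an analogy using POSIX threads, not kernel code: inside the kernel the choice would be among spinlocks, mutexes, RCU and friends depending on context. The names run_locked_counter() and struct counter are made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

/* Shared state: the lock protects 'value'. */
struct counter {
    pthread_mutex_t lock;
    long value;
    int iters;
};

/* Each worker bumps the shared counter, taking the lock every time. */
static void *worker(void *arg)
{
    struct counter *c = arg;
    int i;

    for (i = 0; i < c->iters; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;                 /* the part that must not be interleaved */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/*
 * Start nthreads workers, wait for them, and return the final count.
 * If every thread starts, the result is nthreads * iters.
 */
long run_locked_counter(int nthreads, int iters)
{
    struct counter c;
    pthread_t tids[MAX_THREADS];
    int i, started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS && iters >= 0);

    c.value = 0;
    c.iters = iters;
    pthread_mutex_init(&c.lock, NULL);

    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &c) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&c.lock);
    return c.value;
}
```

Without the lock/unlock pair the increments from different threads can overwrite each other and the total usually comes up short; with it, the threads take turns and the count is exact.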
Re: Small System Paging Problem - OOM-killer goes nuts
> When you untar, which filesystem do you untar to?

I've untarred it to Ext3, Ext2, and Reiser filesystems. I've been
fighting with this for a while. I did manage to get it to happen again
doing a recursive chmod after untarring the kernel (I stopped the untar
a few times to let the system catch up). Interesting output below.

-J

top - 17:58:03 up  3:08,  1 user,  load average: 3.54, 4.09, 4.08
Tasks:  53 total,   2 running,  51 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.1%us, 11.4%sy,  0.6%ni,  0.0%id, 81.4%wa,  2.7%hi,  1.8%si,  0.0%st
Mem:     30352k total,    28252k used,     2100k free,    19448k buffers
Swap:   465876k total,    15736k used,   450140k free,     1072k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1357 root      30  15  1568  168   88 R  8.1  0.6   0:07.87 chmod
  168 root      10  -5     0    0    0 S  3.1  0.0   6:39.25 usb-storage
 1353 root      15   0  2408  540  400 R  2.2  1.8   0:14.29 top
  989 root      15   0  3600  292  192 S  1.2  1.0   0:37.81 sshd
    2 root      34  19     0    0    0 S  0.6  0.0   2:14.65 ksoftirqd/0
   56 root      15   0     0    0    0 S  0.3  0.0   0:23.85 pdflush
   58 root      10  -5     0    0    0 S  0.3  0.0   0:54.70 kswapd0
  950 root      15   0  3128  108   64 S  0.3  0.4   0:13.88 ntpd
    1 root      16   0  1440    0    0 S  0.0  0.0   0:10.40 init
    3 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 events/0
    4 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 khelper
    5 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kthread
   38 root      10  -5     0    0    0 S  0.0  0.0   0:00.04 kblockd/0
   41 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 khubd
   57 root      15   0     0    0    0 D  0.0  0.0   0:20.29 pdflush

And the first of the oom-killer syslog messages:

ntpd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Mem-info:
DMA per-cpu:
CPU0: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0
sshd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Active:2816 inactive:2778 dirty:0 writeback:0 unstable:0
 free:179 slab:858 mapped:1 pagetables:93 bounce:0
Re: [2.6 patch] remove CONFIG_EXPERIMENTAL
On Mon, 26 Nov 2007 12:27:07 GMT, Pavel Machek said: > I don't think this is good idea. But perhaps 'experimental' should be > removed from stuff that is really stable these days, like SATA? I suspect that given the "once it escapes, it's cast in stone" view we take towards user-visible API/etc, there isn't much *real* room for an 'EXPERIMENTAL' flag anymore. Most of the usage should probably be confined to individual drivers, where all we should need is a 'default n' and suitable warning verbiage in the Kconfig file warning about the driver eating your filesystems and small animals for breakfast. We certainly shouldn't have one big flag for *all* in-progress drivers - I don't need to accidentally enable a busticated ethernet driver because I want a USB widget. And if you're worried about people accidentally enabling it, then *each driver* should have a 'Do you really mean it?' flag with *opposite* sense (so that 'make allyesconfig' doesn't turn it on by accident). Anything bigger than that, we probably want to redefine 'experimental' as "it doesn't escape from -mm to mainline till it's ready".
Re: [PATCH] capabilities: introduce per-process capability bounding set (v10)
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 This looks good to me. [As you anticipated, there is a potential merge issue with Casey's recent addition of MAC capabilities - which will make CAP_MAC_ADMIN the highest allocated capability: ie., #define CAP_LAST_CAP CAP_MAC_ADMIN ]. Signed-off-by: Andrew G. Morgan <[EMAIL PROTECTED]> Cheers Andrew Serge E. Hallyn wrote: >>From 22da6ccb1a24d1b6fa481d990a26197c6bfdfa77 Mon Sep 17 00:00:00 2001 > From: Serge E. Hallyn <[EMAIL PROTECTED]> > Date: Mon, 19 Nov 2007 13:54:05 -0500 > Subject: [PATCH 1/1] capabilities: introduce per-process capability bounding > set (v10) > > The capability bounding set is a set beyond which capabilities > cannot grow. Currently cap_bset is per-system. It can be > manipulated through sysctl, but only init can add capabilities. > Root can remove capabilities. By default it includes all caps > except CAP_SETPCAP. > > This patch makes the bounding set per-process when file > capabilities are enabled. It is inherited at fork from parent. > Noone can add elements, CAP_SETPCAP is required to remove them. > > One example use of this is to start a safer container. For > instance, until device namespaces or per-container device > whitelists are introduced, it is best to take CAP_MKNOD away > from a container. > > The bounding set will not affect pP and pE immediately. It will > only affect pP' and pE' after subsequent exec()s. It also does > not affect pI, and exec() does not constrain pI'. So to really > start a shell with no way of regain CAP_MKNOD, you would do > > prctl(PR_CAPBSET_DROP, CAP_MKNOD); > cap_t cap = cap_get_proc(); > cap_value_t caparray[1]; > caparray[0] = CAP_MKNOD; > cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP); > cap_set_proc(cap); > cap_free(cap); > > The following test program will get and set the bounding > set (but not pI). 
For instance > > ./bset get > (lists capabilities in bset) > ./bset drop cap_net_raw > (starts shell with new bset) > (use capset, setuid binary, or binary with > file capabilities to try to increase caps) > > > cap_bound.c > > #include > #include > #include > #include > #include > #include > #include > > #ifndef PR_CAPBSET_READ > #define PR_CAPBSET_READ 23 > #endif > > #ifndef PR_CAPBSET_DROP > #define PR_CAPBSET_DROP 24 > #endif > > int usage(char *me) > { > printf("Usage: %s get\n", me); > printf(" %s drop \n", me); > return 1; > } > > #define numcaps 32 > char *captable[numcaps] = { > "cap_chown", > "cap_dac_override", > "cap_dac_read_search", > "cap_fowner", > "cap_fsetid", > "cap_kill", > "cap_setgid", > "cap_setuid", > "cap_setpcap", > "cap_linux_immutable", > "cap_net_bind_service", > "cap_net_broadcast", > "cap_net_admin", > "cap_net_raw", > "cap_ipc_lock", > "cap_ipc_owner", > "cap_sys_module", > "cap_sys_rawio", > "cap_sys_chroot", > "cap_sys_ptrace", > "cap_sys_pacct", > "cap_sys_admin", > "cap_sys_boot", > "cap_sys_nice", > "cap_sys_resource", > "cap_sys_time", > "cap_sys_tty_config", > "cap_mknod", > "cap_lease", > "cap_audit_write", > "cap_audit_control", > "cap_setfcap" > }; > > int getbcap(void) > { > int comma=0; > unsigned long i; > int ret; > > printf("i know of %d capabilities\n", numcaps); > printf("capability bounding set:"); > for (i=0; i ret = prctl(PR_CAPBSET_READ, i); > if (ret < 0) > perror("prctl"); > else if (ret==1) > printf("%s%s", (comma++) ? 
", " : " ", captable[i]); > } > printf("\n"); > return 0; > } > > int capdrop(char *str) > { > unsigned long i; > > int found=0; > for (i=0; i if (strcmp(captable[i], str) == 0) { > found=1; > break; > } > } > if (!found) > return 1; > if (prctl(PR_CAPBSET_DROP, i)) { > perror("prctl"); > return 1; > } > return 0; > } > > int main(int argc, char *argv[]) > { > if (argc<2) > return usage(argv[0]); > if (strcmp(argv[1], "get")==0) > return getbcap(); > if (strcmp(argv[1], "drop")!=0 || argc<3) > return usage(argv[0]); > if (capdrop(argv[2])) { > printf("unknown capability\n"); > return 1; > } > return execl("/bin/bash", "/bin/bash", NULL); > } > > >
Re: [PATCH] fix plip 1
On Thu, 22 Nov 2007, Mikulas Patocka wrote: > > netif_rx is meant to be called from interrupts because it doesn't wake up > ksoftirqd. For calling from outside interrupts, netif_rx_ni exists. Argh. Can you _please_ use more useful subject lines than "fix plip 1/2"? Those subject lines are what becomes the single-line description of the problem, used by visualizers like gitk and gitweb. So "fix plip 1" is a singularly bad such line! Which is why it should be something like Subject: [PATCH 1/2] plip: use netif_rx_ni() for packet receive or similar.. (My scripts will then get rid of the stuff in brackets, so all that is useful for giving information that is interesting while in *email*, but not when actually applied as a patch) Linus
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote: > +/* > + * Routines for reading of the iBFT data in a human readable fashion. > + */ > +ssize_t ibft_attr_show_initiator(struct ibft_kobject *entry, > + struct ibft_attribute *attr, > + char *buf) > +{ > + struct ibft_initiator *initiator = attr->initiator; > + void *ibft_loc = entry->data->hdr; > + char *str = buf; > + > + if (!initiator) > + return 0; > + > + str += sprintf_ipaddr(str, "isns", initiator->isns_server); > + str += sprintf_ipaddr(str, "slp", initiator->slp_server); > + str += sprintf_ipaddr(str, "primary_radius_server", > + initiator->pri_radius_server); > + str += sprintf_ipaddr(str, "secondary_radius_server", > + initiator->sec_radius_server); > + str += sprintf_string(str, "itname", initiator->initiator_name_len, > + (char *)ibft_loc + initiator->initiator_name_off); > + str--; > + > + return str-buf; > +} sysfs files have ONE VALUE PER FILE, not a whole bunch of different things in a single file. Please fix this. > + > +ssize_t ibft_attr_show_nic(struct ibft_kobject *entry, > +struct ibft_attribute *attr, > +char *buf) > +{ > + struct ibft_nic *nic = attr->nic; > + void *ibft_loc = entry->data->hdr; > + char *str = buf; > + > + if (!nic) > + return 0; > + /* > + * Assume dhcp if any non-zero portions of its address are set. > + */ > + if (memcmp(nic->dhcp, nulls, sizeof(nic->dhcp))) { > + str += sprintf_ipaddr(str, "dhcp", nic->dhcp); > + } else { > + str += sprintf_ipaddr(str, "ciaddr", nic->ip_addr); > + str += sprintf_ipaddr(str, "giaddr", nic->gateway); > + str += sprintf_ipaddr(str, "dnsaddr1", nic->primary_dns); > + str += sprintf_ipaddr(str, "dnsaddr2", nic->secondary_dns); > + } > + if (nic->hostname_len) > + str += sprintf_string(str, "hostname", nic->hostname_len, > + (char *)ibft_loc + nic->hostname_off); > + /* Cut off the comma. */ > + str--; > + > + return str-buf; > +} Same here. 
> +ssize_t ibft_attr_show_target(struct ibft_kobject *entry, > + struct ibft_attribute *attr, > + char *buf) > +{ > + struct ibft_tgt *tgt = attr->tgt; > + void *ibft_loc = entry->data->hdr; > + char *str = buf; > + int i; > + > + if (!tgt) > + return 0; > + > + str += sprintf_ipaddr(str, "siaddr", tgt->ip_addr); > + str += sprintf(str, "iport=%d,", tgt->port); > + str += sprintf(str, "ilun="); > + for (i = 0; i < 8; i++) > + str += sprintf(str, "%x", (u8)tgt->lun[i]); > + str += sprintf(str, ","); > + > + if (tgt->tgt_name_len) > + str += sprintf_string(str, "iname", tgt->tgt_name_len, > + (void *)ibft_loc + tgt->tgt_name_off); > + > + if (tgt->chap_name_len) > + str += sprintf_string(str, "chapid", tgt->chap_name_len, > + (char *)ibft_loc + tgt->chap_name_off); > + if (tgt->chap_secret_len) > + str += sprintf_string(str, "chappw", tgt->chap_secret_len, > + (char *)ibft_loc + tgt->chap_secret_off); > + if (tgt->rev_chap_name_len) > + str += sprintf_string(str, "ichapid", tgt->rev_chap_name_len, > + (char *)ibft_loc + tgt->rev_chap_name_off); > + if (tgt->rev_chap_secret_len) > + str += sprintf_string(str, "ichappw", tgt->rev_chap_secret_len, > + (char *)ibft_loc + tgt->rev_chap_secret_off); > + > + /* Cut off the comma. */ > + str--; > + > + return str-buf; > +} Same here, are we writing a novella or something to userspace? :) > +ssize_t ibft_attr_show_disk(struct ibft_kobject *dev, > + struct ibft_attribute *ibft_attr, > + char *buf) > +{ > + char *str = buf; > + > + str += sprintf(str, "//[EMAIL PROTECTED],%d:iscsi,", dev->data->index); > + str += ibft_attr_show_initiator(dev, ibft_attr, str); > + str += sprintf(str, ","); > + str += ibft_attr_show_target(dev, ibft_attr, str); > + str += sprintf(str, ","); > + str += ibft_attr_show_nic(dev, ibft_attr, str); > + > + return str-buf; > +} And here, do I need to go on? 
> +ssize_t ibft_attr_show_mac(struct ibft_kobject *entry, > +struct ibft_attribute *attr, > +char *buf) > +{ > + struct ibft_nic *nic = attr->nic; > + int len = 6; > + > + if (!nic) > + return 0; > + > + memcpy(buf, attr->nic->mac, len); > + > + return len; > +} Is mac a user readable string? Then perhaps a simple sprintf would work instead, as I doubt you are including a \n here... > +/*
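The "simple sprintf" suggestion for the MAC attribute could look roughly like the following. This is a userspace sketch only, not sysfs code: format_mac() is a hypothetical name, and the colon-separated lowercase hex layout with a trailing newline is an assumption about what a single-value text attribute would contain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a 6-byte MAC address into buf as one
 * human-readable value, "aa:bb:cc:dd:ee:ff\n".
 * Returns the number of characters written (not counting the NUL),
 * or -1 if buf is too small to hold the whole string.
 */
static int format_mac(char *buf, size_t size, const unsigned char mac[6])
{
	int len;

	assert(buf != NULL && mac != NULL);
	len = snprintf(buf, size, "%02x:%02x:%02x:%02x:%02x:%02x\n",
		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	if (len < 0 || (size_t)len >= size)
		return -1;
	return len;
}
```

The same shape, one formatted value and a newline per attribute, is what the "one value per file" comments above are asking for in place of the comma-separated strings.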
Re: [PATCH] Add iSCSI IBFT Support (v0.3)
On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote: > > This patch adds /sysfs/firmware/ibft/[chosen|aliases|[EMAIL > PROTECTED],X|[EMAIL PROTECTED],X] > directories along with text properties which export the iSCSI Boot > Firmware Table (iBFT) structure. The layout of the directories mirrors > how PowerPC OpenBoot exports this data. > > What is iSCSI Boot Firmware Table? It is a mechanism for the iSCSI > tools to extract from the machine NICs the iSCSI connection information > so that they can automagically mount the iSCSI share/target. Currently > the iSCSI information is hard-coded in the initrd. > > For full details of the IBFT structure please take a look at: > ftp://ftp.software.ibm.com/systems/support/system_x_pdf/ibm_iscsi_boot_firmware_table_v1.02.pdf As you are adding sysfs files in /sys/firmware, please add documentation to Documentation/ABI as to what these files are, what they do, what is in them, and what they are to be used for. > + rc = firmware_register(_subsys); > + if (rc) > + return rc; This function, as well as the whole decl_subsys() stuff is gone in my tree and in -mm. /sys/firmware is now just a simple kobject that you are free to chain off of. If you describe just what these sysfs subdirectories and files are for and how they are going to be used, I'd be glad to rework this patch to use the new interfaces. thanks, greg k-h
Re: [PATCH] -mm (2.4.26-rc3-mm1) v2 Smack using capabilities 32 and 33
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Signed-off-by: Andrew G. Morgan <[EMAIL PROTECTED]> Cheers Andrew Casey Schaufler wrote: > From: Casey Schaufler <[EMAIL PROTECTED]> > > This patch takes advantage of the increase in capability bits > to allocate capabilities for Mandatory Access Control. Whereas > Smack was overloading a previously allocated capability it is > now using a pair, one for overriding access control checks and > the other for changes to the MAC configuration. > > The two capabilities allocated should be obvious in their intent. > The comments in capability.h are intended to make it clear that > there is no intention that implementations of MAC LSM modules > be any more constrained by the presence of these capabilities > than an implementation of DAC LSM modules are by the analogous > DAC capabilities. > > > Signed-off-by: Casey Schaufler <[EMAIL PROTECTED]> > > --- > > The companion patch for libcap-2.02 is provided as an attachment. > The attachment is not a kernel patch, although it would be easy to > mistake it for one. > > Introduces CAP_FS_MASK_B1 and uses it as appropriate. I think that > I found all the places it needs to be used, but don't hesitate to > let me know if I missed something. > > Thank you. > > include/linux/capability.h | 24 ++-- > security/smack/smack.h |8 > security/smack/smack_lsm.c |8 > security/smack/smackfs.c | 12 ++-- > 4 files changed, 32 insertions(+), 20 deletions(-) > > diff -uprN -X linux-2.6.24-rc3-mm1-base/Documentation/dontdiff > linux-2.6.24-rc3-mm1-base/include/linux/capability.h > linux-2.6.24-rc3-mm1-smack/include/linux/capability.h > --- linux-2.6.24-rc3-mm1-base/include/linux/capability.h 2007-11-22 > 01:51:36.0 -0800 > +++ linux-2.6.24-rc3-mm1-smack/include/linux/capability.h 2007-11-25 > 21:38:34.0 -0800 > @@ -314,6 +314,23 @@ typedef struct kernel_cap_struct { > > #define CAP_SETFCAP 31 > > +/* Override MAC access. > + The base kernel enforces no MAC policy. 
> + An LSM may enforce a MAC policy, and if it does and it chooses > + to implement capability based overrides of that policy, this is > + the capability it should use to do so. */ > + > +#define CAP_MAC_OVERRIDE 32 > + > +/* Allow MAC configuration or state changes. > + The base kernel requires no MAC configuration. > + An LSM may enforce a MAC policy, and if it does and it chooses > + to implement capability based checks on modifications to that > + policy or the data required to maintain it, this is the > + capability it should use to do so. */ > + > +#define CAP_MAC_ADMIN33 > + > /* > * Bit location of each capability (used by user-space library and kernel) > */ > @@ -336,6 +353,8 @@ typedef struct kernel_cap_struct { > | CAP_TO_MASK(CAP_FOWNER) \ > | CAP_TO_MASK(CAP_FSETID)) > > +# define CAP_FS_MASK_B1 (CAP_TO_MASK(CAP_MAC_OVERRIDE)) > + > #if _LINUX_CAPABILITY_U32S != 2 > # error Fix up hand-coded capability macro initializers > #else /* HAND-CODED capability initializers */ > @@ -343,8 +362,9 @@ typedef struct kernel_cap_struct { > # define CAP_EMPTY_SET{{ 0, 0 }} > # define CAP_FULL_SET {{ ~0, ~0 }} > # define CAP_INIT_EFF_SET {{ ~CAP_TO_MASK(CAP_SETPCAP), ~0 }} > -# define CAP_FS_SET {{ CAP_FS_MASK_B0, 0 }} > -# define CAP_NFSD_SET {{ CAP_FS_MASK_B0|CAP_TO_MASK(CAP_SYS_RESOURCE), 0 > }} > +# define CAP_FS_SET {{ CAP_FS_MASK_B0, CAP_FS_MASK_B1 } } > +# define CAP_NFSD_SET {{ CAP_FS_MASK_B0|CAP_TO_MASK(CAP_SYS_RESOURCE), \ > + CAP_FS_MASK_B1 } } > > #endif /* _LINUX_CAPABILITY_U32S != 2 */ > > diff -uprN -X linux-2.6.24-rc3-mm1-base/Documentation/dontdiff > linux-2.6.24-rc3-mm1-base/security/smack/smackfs.c > linux-2.6.24-rc3-mm1-smack/security/smack/smackfs.c > --- linux-2.6.24-rc3-mm1-base/security/smack/smackfs.c2007-11-22 > 01:51:43.0 -0800 > +++ linux-2.6.24-rc3-mm1-smack/security/smack/smackfs.c 2007-11-24 > 11:29:29.0 -0800 > @@ -241,7 +241,7 @@ static ssize_t smk_write_load(struct fil >* No partial writes. >* Enough data must be present. 
>*/ > - if (!capable(CAP_MAC_OVERRIDE)) > + if (!capable(CAP_MAC_ADMIN)) > return -EPERM; > if (*ppos != 0) > return -EINVAL; > @@ -474,7 +474,7 @@ static ssize_t smk_write_cipso(struct fi >* No partial writes. >* Enough data must be present. >*/ > - if (!capable(CAP_MAC_OVERRIDE)) > + if (!capable(CAP_MAC_ADMIN)) > return -EPERM; > if (*ppos != 0) > return -EINVAL; > @@ -601,7 +601,7 @@ static ssize_t smk_write_doi(struct file > char temp[80]; > int i; > > - if (!capable(CAP_MAC_OVERRIDE)) > + if (!capable(CAP_MAC_ADMIN)) > return -EPERM; > >
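Why the new capabilities need a separate CAP_FS_MASK_B1 can be seen from how a capability number maps onto the two 32-bit words. The sketch below is a standalone userspace illustration: the SKETCH_ names and struct sketch_cap_set are invented here, while the index and mask arithmetic follows the kernel's CAP_TO_INDEX() and CAP_TO_MASK() helpers (shift by 5, mask with 31), written with an unsigned constant.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel helpers, renamed to avoid clashes. */
#define SKETCH_CAP_TO_INDEX(x) ((x) >> 5)          /* which 32-bit word */
#define SKETCH_CAP_TO_MASK(x)  (1u << ((x) & 31))  /* bit inside that word */

/* Capability numbers from the quoted patch. */
#define SKETCH_CAP_MAC_OVERRIDE 32
#define SKETCH_CAP_MAC_ADMIN    33

/* Two words, matching _LINUX_CAPABILITY_U32S == 2 in the patch. */
struct sketch_cap_set {
	uint32_t cap[2];
};

/* Set the bit for capability number `cap`. */
static void sketch_cap_raise(struct sketch_cap_set *s, unsigned int cap)
{
	assert(cap < 64);
	s->cap[SKETCH_CAP_TO_INDEX(cap)] |= SKETCH_CAP_TO_MASK(cap);
}

/* Return 1 if the bit for capability number `cap` is set, else 0. */
static int sketch_cap_raised(const struct sketch_cap_set *s, unsigned int cap)
{
	assert(cap < 64);
	return (s->cap[SKETCH_CAP_TO_INDEX(cap)] & SKETCH_CAP_TO_MASK(cap)) != 0;
}
```

Capabilities 32 and 33 land in bits 0 and 1 of the second word, which is why CAP_FS_SET and CAP_NFSD_SET in the patch gain a non-zero second initializer.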
Re: [PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead
Denis Cheng wrote: Cc: Randy Dunlap <[EMAIL PROTECTED]> Signed-off-by: Denis Cheng <[EMAIL PROTECTED]> --- this is against the latest cryptodev tree. crypto/tcrypt.c |9 - 1 files changed, 4 insertions(+), 5 deletions(-) diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c index 1e12b86..ae762c2 100644 --- a/crypto/tcrypt.c +++ b/crypto/tcrypt.c @@ -87,12 +87,11 @@ static char *check[] = { "camellia", "seed", "salsa20", NULL }; -static void hexdump(unsigned char *buf, unsigned int len) +static inline void hexdump(unsigned char *buf, unsigned int len) { - while (len--) - printk("%02x", *buf++); - - printk("\n"); + print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET, + 16, 1, + buf, len, 0); Not important, but why use '0' instead of 'false'? } static void tcrypt_complete(struct crypto_async_request *req, int err) cu Richard Knutsson
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [9/10] per zone lru for cgroup
This patch implements per-zone lru for memory cgroup. This patch makes use of mem_cgroup_per_zone struct for per zone lru. LRU can be accessed by mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone); >active_list >inactive_list or mz = page_cgroup_zoneinfo(page_cgroup); >active_list >inactive_list Changelog v1->v2 - merged to mem_cgroup_per_zone struct. - handle page migraiton. Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> mm/memcontrol.c | 63 ++-- 1 file changed, 39 insertions(+), 24 deletions(-) Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c === --- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-27 11:24:04.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/memcontrol.c2007-11-27 11:24:16.0 +0900 @@ -89,6 +89,8 @@ }; struct mem_cgroup_per_zone { + struct list_headactive_list; + struct list_headinactive_list; unsigned long count[NR_MEM_CGROUP_ZSTAT]; }; /* Macro for accessing counter */ @@ -122,10 +124,7 @@ /* * Per cgroup active and inactive list, similar to the * per zone LRU lists. -* TODO: Consider making these lists per zone */ - struct list_head active_list; - struct list_head inactive_list; struct mem_cgroup_lru_info info; /* * spin_lock to protect the per cgroup LRU @@ -367,10 +366,10 @@ if (!to) { MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1; - list_add(>lru, >mem_cgroup->inactive_list); + list_add(>lru, >inactive_list); } else { MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1; - list_add(>lru, >mem_cgroup->active_list); + list_add(>lru, >active_list); } mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true); } @@ -388,11 +387,11 @@ if (active) { MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1; pc->flags |= PAGE_CGROUP_FLAG_ACTIVE; - list_move(>lru, >mem_cgroup->active_list); + list_move(>lru, >active_list); } else { MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1; pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE; - list_move(>lru, >mem_cgroup->inactive_list); + list_move(>lru, >inactive_list); } } @@ -518,11 +517,16 @@ LIST_HEAD(pc_list); struct 
list_head *src; struct page_cgroup *pc, *tmp; + int nid = z->zone_pgdat->node_id; + int zid = zone_idx(z); + struct mem_cgroup_per_zone *mz; + mz = mem_cgroup_zoneinfo(mem_cont, nid, zid); if (active) - src = _cont->active_list; + src = >active_list; else - src = _cont->inactive_list; + src = >inactive_list; + spin_lock(_cont->lru_lock); scan = 0; @@ -544,13 +548,6 @@ continue; } - /* -* Reclaim, per zone -* TODO: make the active/inactive lists per zone -*/ - if (page_zone(page) != z) - continue; - scan++; list_move(>lru, _list); @@ -832,6 +829,8 @@ int count; unsigned long flags; + if (list_empty(list)) + return; retry: count = FORCE_UNCHARGE_BATCH; spin_lock_irqsave(>lru_lock, flags); @@ -867,20 +866,27 @@ int mem_cgroup_force_empty(struct mem_cgroup *mem) { int ret = -EBUSY; + int node, zid; css_get(>css); /* * page reclaim code (kswapd etc..) will move pages between ` * active_list <-> inactive_list while we don't take a lock. * So, we have to do loop here until all lists are empty. */ - while (!(list_empty(>active_list) && -list_empty(>inactive_list))) { + while (mem->res.usage > 0) { if (atomic_read(>css.cgroup->count) > 0) goto out; - /* drop all page_cgroup in active_list */ - mem_cgroup_force_empty_list(mem, >active_list); - /* drop all page_cgroup in inactive_list */ - mem_cgroup_force_empty_list(mem, >inactive_list); + for_each_node_state(node, N_POSSIBLE) + for (zid = 0; zid < MAX_NR_ZONES; zid++) { + struct mem_cgroup_per_zone *mz; + mz = mem_cgroup_zoneinfo(mem, node, zid); + /* drop all page_cgroup in active_list */ + mem_cgroup_force_empty_list(mem, + >active_list); + /* drop all
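The data layout this patch moves to, one active/inactive list pair per (node, zone) instead of one pair per cgroup, can be modelled in a few lines of userspace C. Everything named sketch_ below is invented for illustration, and the fixed two-dimensional array is an assumption made to keep the example small; the patch itself reaches the per-zone structure through mem_cgroup_zoneinfo().

```c
#include <assert.h>

#define SKETCH_MAX_NODES 2   /* assumption: tiny fixed topology */
#define SKETCH_MAX_ZONES 3

/* Minimal circular doubly linked list head, in the style of list_head. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *head)
{
	head->next = head;
	head->prev = head;
}

static int sketch_list_empty(const struct sketch_list *head)
{
	return head->next == head;
}

/* Insert item right after head. */
static void sketch_list_add(struct sketch_list *item, struct sketch_list *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

/* Per-zone LRU pair, analogous to the lists added to mem_cgroup_per_zone. */
struct sketch_per_zone {
	struct sketch_list active_list;
	struct sketch_list inactive_list;
};

struct sketch_cgroup {
	struct sketch_per_zone zoneinfo[SKETCH_MAX_NODES][SKETCH_MAX_ZONES];
};

/* Lookup by node id and zone id, in the spirit of mem_cgroup_zoneinfo(). */
static struct sketch_per_zone *sketch_zoneinfo(struct sketch_cgroup *cg,
					       int nid, int zid)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NODES);
	assert(zid >= 0 && zid < SKETCH_MAX_ZONES);
	return &cg->zoneinfo[nid][zid];
}

static void sketch_cgroup_init(struct sketch_cgroup *cg)
{
	int nid, zid;

	for (nid = 0; nid < SKETCH_MAX_NODES; nid++)
		for (zid = 0; zid < SKETCH_MAX_ZONES; zid++) {
			struct sketch_per_zone *mz = sketch_zoneinfo(cg, nid, zid);

			sketch_list_init(&mz->active_list);
			sketch_list_init(&mz->inactive_list);
		}
}

/* Walk every (node, zone) pair; return 1 only if all lists are empty. */
static int sketch_cgroup_all_empty(struct sketch_cgroup *cg)
{
	int nid, zid;

	for (nid = 0; nid < SKETCH_MAX_NODES; nid++)
		for (zid = 0; zid < SKETCH_MAX_ZONES; zid++) {
			struct sketch_per_zone *mz = sketch_zoneinfo(cg, nid, zid);

			if (!sketch_list_empty(&mz->active_list) ||
			    !sketch_list_empty(&mz->inactive_list))
				return 0;
		}
	return 1;
}
```

With this layout a per-zone scan only touches the lists of the zone being reclaimed, which is what lets the patch delete the "page_zone(page) != z" filter from the isolate loop.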
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [10/10] per-zone-lock for cgroup
Now, lru is per-zone. Then, lru_lock can be (should be) per-zone, too. This patch implementes per-zone lru lock. lru_lock is placed into mem_cgroup_per_zone struct. lock can be accessed by mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone); >lru_lock or mz = page_cgroup_zoneinfo(page_cgroup); >lru_lock Signed-off-by: KAMEZAWA hiroyuki <[EMAIL PROTECTED]> mm/memcontrol.c | 71 ++-- 1 file changed, 44 insertions(+), 27 deletions(-) Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c === --- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-27 11:24:16.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/memcontrol.c2007-11-27 11:24:22.0 +0900 @@ -89,6 +89,10 @@ }; struct mem_cgroup_per_zone { + /* +* spin_lock to protect the per cgroup LRU +*/ + spinlock_t lru_lock; struct list_headactive_list; struct list_headinactive_list; unsigned long count[NR_MEM_CGROUP_ZSTAT]; @@ -126,10 +130,7 @@ * per zone LRU lists. */ struct mem_cgroup_lru_info info; - /* -* spin_lock to protect the per cgroup LRU -*/ - spinlock_t lru_lock; + unsigned long control_type; /* control RSS or RSS+Pagecache */ int prev_priority; /* for recording reclaim priority */ /* @@ -410,15 +411,16 @@ */ void mem_cgroup_move_lists(struct page_cgroup *pc, bool active) { - struct mem_cgroup *mem; + struct mem_cgroup_per_zone *mz; + unsigned long flags; + if (!pc) return; - mem = pc->mem_cgroup; - - spin_lock(>lru_lock); + mz = page_cgroup_zoneinfo(pc); + spin_lock_irqsave(>lru_lock, flags); __mem_cgroup_move_lists(pc, active); - spin_unlock(>lru_lock); + spin_unlock_irqrestore(>lru_lock, flags); } /* @@ -528,7 +530,7 @@ src = >inactive_list; - spin_lock(_cont->lru_lock); + spin_lock(>lru_lock); scan = 0; list_for_each_entry_safe_reverse(pc, tmp, src, lru) { if (scan >= nr_to_scan) @@ -558,7 +560,7 @@ } list_splice(_list, src); - spin_unlock(_cont->lru_lock); + spin_unlock(>lru_lock); *scanned = scan; return nr_taken; @@ -577,6 +579,7 @@ struct page_cgroup *pc; unsigned long flags; unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES; + 
struct mem_cgroup_per_zone *mz; /* * Should page_cgroup's go to their own slab? @@ -688,10 +691,11 @@ goto retry; } - spin_lock_irqsave(>lru_lock, flags); + mz = page_cgroup_zoneinfo(pc); + spin_lock_irqsave(>lru_lock, flags); /* Update statistics vector */ __mem_cgroup_add_list(pc); - spin_unlock_irqrestore(>lru_lock, flags); + spin_unlock_irqrestore(>lru_lock, flags); done: return 0; @@ -733,6 +737,7 @@ void mem_cgroup_uncharge(struct page_cgroup *pc) { struct mem_cgroup *mem; + struct mem_cgroup_per_zone *mz; struct page *page; unsigned long flags; @@ -745,6 +750,7 @@ if (atomic_dec_and_test(>ref_cnt)) { page = pc->page; + mz = page_cgroup_zoneinfo(pc); /* * get page->cgroup and clear it under lock. * force_empty can drop page->cgroup without checking refcnt. @@ -753,9 +759,9 @@ mem = pc->mem_cgroup; css_put(>css); res_counter_uncharge(>res, PAGE_SIZE); - spin_lock_irqsave(>lru_lock, flags); + spin_lock_irqsave(>lru_lock, flags); __mem_cgroup_remove_list(pc); - spin_unlock_irqrestore(>lru_lock, flags); + spin_unlock_irqrestore(>lru_lock, flags); kfree(pc); } } @@ -794,24 +800,29 @@ struct page_cgroup *pc; struct mem_cgroup *mem; unsigned long flags; + struct mem_cgroup_per_zone *mz; retry: pc = page_get_page_cgroup(page); if (!pc) return; mem = pc->mem_cgroup; + mz = page_cgroup_zoneinfo(pc); if (clear_page_cgroup(page, pc) != pc) goto retry; - - spin_lock_irqsave(>lru_lock, flags); + spin_lock_irqsave(>lru_lock, flags); __mem_cgroup_remove_list(pc); + spin_unlock_irqrestore(>lru_lock, flags); + pc->page = newpage; lock_page_cgroup(newpage); page_assign_page_cgroup(newpage, pc); unlock_page_cgroup(newpage); - __mem_cgroup_add_list(pc); - spin_unlock_irqrestore(>lru_lock, flags); + mz = page_cgroup_zoneinfo(pc); +
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [8/10] modifies vmscan.c for isolate globa/cgroup lru activity
When using memory controller, there are 2 levels of memory reclaim. 1. zone memory reclaim because of system/zone memory shortage. 2. memory cgroup memory reclaim because of hitting limit. These two can be distinguished by sc->mem_cgroup parameter. (scan_global_lru() macro) This patch tries to make memory cgroup reclaim routine avoid affecting system/zone memory reclaim. This patch inserts if (scan_global_lru()) and hook to memory_cgroup reclaim support functions. This patch can be a help for isolating system lru activity and group lru activity and shows what additional functions are necessary. * mem_cgroup_calc_mapped_ratio() ... calculate mapped ratio for cgroup. * mem_cgroup_reclaim_imbalance() ... calculate active/inactive balance in cgroup. * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages to be scanned in this priority in mem_cgroup. * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive pages to be scanned in this priority in mem_cgroup. * mem_cgroup_all_unreclaimable() .. checks cgroup's page is all unreclaimable or not. * mem_cgroup_get_reclaim_priority() ... * mem_cgroup_note_reclaim_priority() ... record reclaim priority (temporal) * mem_cgroup_remember_reclaim_priority() record reclaim priority as zone->prev_priority. This value is used for calc reclaim_mapped. Changelog V1->V2: - merged calc_reclaim_mapped patch in previous version. 
Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> mm/vmscan.c | 326 1 file changed, 197 insertions(+), 129 deletions(-) Index: linux-2.6.24-rc3-mm1/mm/vmscan.c === --- linux-2.6.24-rc3-mm1.orig/mm/vmscan.c 2007-11-26 16:38:46.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/vmscan.c2007-11-26 16:42:38.0 +0900 @@ -863,7 +863,8 @@ __mod_zone_page_state(zone, NR_ACTIVE, -nr_active); __mod_zone_page_state(zone, NR_INACTIVE, -(nr_taken - nr_active)); - zone->pages_scanned += nr_scan; + if (scan_global_lru(sc)) + zone->pages_scanned += nr_scan; spin_unlock_irq(>lru_lock); nr_scanned += nr_scan; @@ -950,6 +951,113 @@ } /* + * Determine we should try to reclaim mapped pages. + * This is called only when sc->mem_cgroup is NULL. + */ +static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone, + int priority) +{ + long mapped_ratio; + long distress; + long swap_tendency; + long imbalance; + int reclaim_mapped; + int prev_priority; + + if (scan_global_lru(sc) && zone_is_near_oom(zone)) + return 1; + /* +* `distress' is a measure of how much trouble we're having +* reclaiming pages. 0 -> no problems. 100 -> great trouble. +*/ + if (scan_global_lru(sc)) + prev_priority = zone->prev_priority; + else + prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup); + + distress = 100 >> min(prev_priority, priority); + + /* +* The point of this algorithm is to decide when to start +* reclaiming mapped memory instead of just pagecache. Work out +* how much memory +* is mapped. +*/ + if (scan_global_lru(sc)) + mapped_ratio = ((global_page_state(NR_FILE_MAPPED) + + global_page_state(NR_ANON_PAGES)) * 100) / + vm_total_pages; + else + mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup); + + /* +* Now decide how much we really want to unmap some pages. The +* mapped ratio is downgraded - just because there's a lot of +* mapped memory doesn't necessarily mean that page reclaim +* isn't succeeding. +* +* The distress ratio is important - we don't want to start +* going oom. 
+* +* A 100% value of vm_swappiness overrides this algorithm +* altogether. +*/ + swap_tendency = mapped_ratio / 2 + distress + sc->swappiness; + + /* +* If there's huge imbalance between active and inactive +* (think active 100 times larger than inactive) we should +* become more permissive, or the system will take too much +* cpu before it start swapping during memory pressure. +* Distress is about avoiding early-oom, this is about +* making swappiness graceful despite setting it to low +* values. +* +
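The arithmetic in calc_reclaim_mapped() is easier to follow with numbers plugged in. The following is a small userspace sketch of just the two formulas visible in the quoted hunk; the sketch_ function names are invented, and the inputs are plain integers rather than values read from a zone or cgroup. The comparison against a threshold that decides reclaim_mapped is not part of the quoted text and is left out.

```c
#include <assert.h>

static long sketch_min(long a, long b)
{
	return a < b ? a : b;
}

/*
 * distress = 100 >> min(prev_priority, priority)
 * A small priority value (heavy scanning) gives a large distress;
 * a large priority value shifts 100 down to 0.
 */
static long sketch_distress(int prev_priority, int priority)
{
	long shift = sketch_min(prev_priority, priority);

	assert(shift >= 0 && shift < 31);
	return 100 >> shift;
}

/* swap_tendency = mapped_ratio / 2 + distress + swappiness */
static long sketch_swap_tendency(long mapped_ratio, long distress,
				 long swappiness)
{
	return mapped_ratio / 2 + distress + swappiness;
}
```

For example, with 40% of memory mapped, a priority of 3 and swappiness 60, distress is 100 >> 3 = 12 and the tendency is 20 + 12 + 60 = 92; at priority 0 distress alone reaches 100. In the patch, the cgroup case only changes where mapped_ratio and prev_priority come from, not this arithmetic.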
Dynticks Causing High Context Switch Rate in ksoftirqd
Question: Why is ksoftirqd eating about 5 to 10 percent of my CPU on an idle system?

The problem occurs if I config the kernel with tickless support (i.e. CONFIG_TICK_ONESHOT=y). (Thanks to "oprofile" for putting me onto this.)

I have noted this same problem on kernel versions: 2.6.23.1, 2.6.23.8 and 2.6.23.9

*** Output from "vmstat -n 1 10" -- Note very high context switch rate ***
*** This is on an idle machine! ***

procs ---memory-- ---swap-- -io --system-- cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 1925556 4768 11610400 124 26 7538 1 2 96 1
0 0 0 1925556 4768 11610400 0 02 147329 0 1 99 0
0 0 0 1925548 4768 11610400 0 00 154515 0 1 99 0
0 0 0 1925548 4768 11610400 0 01 153898 0 2 98 0
0 0 0 1925548 4780 11610400 0163 155216 0 1 99 0
0 0 0 1925548 4780 11610400 0 01 161718 0 1 99 0
0 0 0 1925548 4780 11610400 0 00 147587 0 2 98 0
0 0 0 1925548 4780 11610400 0 01 153524 0 2 98 0
0 0 0 1925448 4780 11610400 0 00 153434 0 1 99 0
0 0 0 1925448 4792 11609200 0164 153527 0 2 98 0

*** System Stats ***

Distro: Slackware 10.2
Mobo: MSI MasterX FA6R E7210
CPUs: Dual 2.4 GHz P4 Xeons 400 MHz FSB - Hyperthreading enabled
Mem: 2 GB ECC DDR PC 266

*** PCI Config ***

00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
00:03.0 PCI bridge: Intel Corporation 82875P/E7210 Processor to PCI to CSA Bridge (rev 02)
00:06.0 System peripheral: Intel Corporation 82875P/E7210 Processor to I/O Memory Interface (rev 02)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
02:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
03:09.0 Mass storage controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
03:0a.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller
03:0c.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
Re: [PATCH 0/4, v3] Physical PCI slot objects
On Mon, Nov 26, 2007 at 03:22:53PM -0700, Alex Chiang wrote: > Hi Gary, Kenji-san, et. al, > > * Gary Hade <[EMAIL PROTECTED]>: > > > > Alex, What I was trying to suggest is a boot-time kernel > > option, not a kernel configuration option. The basic idea is > > to give the user (with a single binary kernel) the ability to > > include your ACPI-PCI slot driver feature changes only when > > they are really needed. In addition to reducing the number of > > system/PCI hotplug driver combinations where your changes would > > need to be validated, I believe would also help alleviate other > > worries (e.g. Andi Kleen's memory consumption concern). I > > believe this goal could also be achieved with the kernel config > > option by making the pci_slot module runtime loadable with the > > PCI hotplug drivers only visiting your new code when the > > pci_slot driver is loaded, although I think this would be more > > difficult to implement. > > I have modified my patch series so that the final patch that > introduces my ACPI-PCI slot driver is a full-fledged module, that > has a tristate Kconfig option. > > It can be modprobe'd/rmmod'ed in any combination, and in any > order with other PCI hotplug modules. There is no ordering > dependency, even at module unload time, so you can safely rmmod > pci_slot, and safely continue using features provided by the PCI > hotplug drivers (acpiphp, pciehp, etc.). The opposite works too. Nice! I like the loadable module approach much better than my boot-time kernel option suggestion. > > The one limitation is that two separate hotplug drivers cannot > both claim the same device (2nd module loaded will get -EBUSY > errors), but I do not believe that is a regression from current > behavior. I cannot confirm this since the systems I am using only support a single hotplug driver (acpiphp). 
> > I have only tested with acpiphp and pciehp, as that's the only > hardware I have, but I believe my code will play nicely with the > other PCI hp drivers as well. I have only tested your changes with acpiphp. > > The patch series is fully bisectable, and the correct behavior > occurs no matter which patch you happen to have applied. Based on my testing (see below) this appears to be true. > > I'll be sending v5 of patches 3 and 4 shortly (patches 1 and 2 > did not change). It is still based on 2.6.24-rc2, because I was > too scared to do another git rebase while using stgit. :-/ I have been using 2.6.24-rc3 source for my testing. > > > Also, I notice that even with your current CONFIG_ACPI_PCI_SLOT > > implementation your numerous PCI hotplug driver changes (except > > for only two places in pci_hotplug_core.c where there is > > `#ifndef CONFIG_ACPI_PCI_SLOT` and `#ifdef CONFIG_ACPI_PCI_SLOT`) > > are _always_ exposed. So, even with CONFIG_ACPI_PCI_SLOT disabled > > there is IMO a need for testing of the affected PCI hotplug drivers > > on more than a small number of isolated systems. > > You are, of course, correct. > > In my opinion, though, I would say most of the changes to the PCI > hotplug drivers themselves are pretty straightforward, as in > removing the different ways of getting the PCI address. > > The scary part of the changes (aside from the ACPI-PCI slot > driver) revolve around the new struct pci_slot, which is > relatively self-contained, and only expose themselves via the > pci_create_slot/pci_destroy_slot interfaces which only the PCI > hotplug corecares about. I think this sounds like a reasonable argument for not doing what I was trying to suggest. > > Regardless, your point stands. How do you suggest I get more > testing time? I am only able to test with acpiphp. In addition to the testing on the x3850 described below I would also like to do some testing on an x3950 which has a mix of hotplug and non-hotplug slots. 
If this testing which I hope to complete this week goes well, I will be satisfied. I will let others speak for the other hotplug drivers and platforms. > Is this patchset appropriate for the -mm tree yet? I would defer to our illustrious maintainers on this one. :) > Or do you think it still needs more work? I am now much more comfortable with your changes with respect to acpiphp on the systems I worry about but others may have concerns with respect to the other hotplug drivers, or even acpiphp, on other systems. > > > The good news is that I was able to test your v3 changes > > (w/2.6.24-rc3 source) on our x3850 today with 'acpiphp' and, > > except for the above mentioned inability to run-time > > include/exclude them, they seemed to work fine. The previous > > boot-time ACPI error messages are gone and I was able to > > successfully hot-remove and hot-add both PCI-X and PCIe > > adapters. > > Thanks for testing. Please let me know how v5 works for you too. I just tried your v5 (1/4 v3, 2/4 v3, 3/4 v5, 4/4 v5) applied to 2.6.24-rc3 source with acpiphp on the x3850 and found nothing to complain about. About time, eh? :)
Re: Fw: Re: [PATCH 1/3] signal(i386): alternative signal stack wraparound occurs
cf http://lkml.org/lkml/2007/10/3/41

To summarize: on Linux, SA_ONSTACK decides whether you are already on the signal stack based on the value of the SP at the time of a signal. If you are not already inside the range, you are not "on the signal stack" and so the new signal handler frame starts over at the base of the signal stack.

sigaltstack (and sigstack before it) was invented in BSD. There, the SA_ONSTACK behavior has always been different. It uses a kernel state flag to decide, rather than the SP value. When you first take an SA_ONSTACK signal and switch to the alternate signal stack, it sets the SS_ONSTACK flag in the thread's sigaltstack state in the kernel. Thereafter you are "on the signal stack" and don't switch SP before pushing a handler frame no matter what the SP value is. Only when you sigreturn from the original handler context do you clear the SS_ONSTACK flag so that a new handler frame will start over at the base of the alternate signal stack.

The undesirable effect of the Linux behavior is that an overflow of the alternate signal stack can not only go undetected, but lead to a ring buffer effect of clobbering the original handler frame at the base of the signal stack for each successive signal that comes just after the overflow. This is what Shi Weihua's test case demonstrates. Normally this does not come up because of the signal mask, but the test case uses SA_NODEFER for its SIGSEGV handler.

The other subtle part of the existing Linux semantics is that a simple longjmp out of a signal handler serves to take you off the signal stack in a safe and reliable fashion without having used sigreturn (nor having just returned from the handler normally, which means the same). After the longjmp (or even informal stack switching not via any proper libc or kernel interface), the alternate signal stack stands ready to be used again.

A paranoid program would allocate a PROT_NONE red zone around its alternate signal stack. Then a small overflow would trigger a SIGSEGV in handler setup, and be fatal (core dump) whether or not SIGSEGV is blocked. As with thread stack red zones, that cannot catch all overflows (or underflows). For example, a local array as large as page size allocated in a function called from a handler, but not actually touched before more calls push more stack, could cause an overflow that silently pushes into some unrelated allocated pages.

The BSD behavior does not do anything in particular about overflow. But it does at least avoid the wraparound or "ring buffer effect", so you'll just get a straightforward all-out overflow down your address space past the low end of the alternate signal stack.

I don't know what the BSD behavior is for longjmp out of an SA_ONSTACK handler.

The POSIX wording relating to sigaltstack is pretty minimal. I don't think it speaks to this issue one way or another. (The program that overflows its stack is clearly in undefined behavior territory of one sort or another anyhow.)

Given the longjmp issue and the potential for highly subtle complications in existing programs relying on this in arcane ways deep in their code, I am very dubious about changing the behavior to the BSD style persistent flag. I think Shi Weihua's patches have a similar effect by tracking the SP used in the last handler setup.

I think it would be sensible for the signal handler setup code to detect when it would itself be causing a stack overflow. Maybe something like the following patch (untested). This issue exists in the same way on all machines, so ideally they would all do a similar check.

When it's the handler function itself or its callees that cause the overflow, rather than the signal handler frame setup alone crossing the boundary, this still won't help. But I don't see any way to distinguish that from the valid longjmp case.
Thanks,
Roland

---
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index d58d455..000 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -295,6 +295,13 @@ get_sigframe(struct k_sigaction *ka, str
 	/* Default to using normal stack */
 	esp = regs->esp;
 
+	/*
+	 * If we are on the alternate signal stack and would overflow it, don't.
+	 * Return an always-bogus address instead so we will die with SIGSEGV.
+	 */
+	if (on_sig_stack(esp) && !likely(on_sig_stack(esp - frame_size)))
+		return (void __user *) -1L;
+
 	/* This is the X/Open sanctioned signal stack switching. */
 	if (ka->sa.sa_flags & SA_ONSTACK) {
 		if (sas_ss_flags(esp) == 0)
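Editorial sketch: the PROT_NONE red zone described in the message above can be set up from userspace roughly as follows. This is only an illustration under stated assumptions, not part of the posted patch: install_guarded_altstack() is a hypothetical helper name, and one guard page on each side is an arbitrary choice. It uses only mmap(), mprotect() and sigaltstack().

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: install an alternate signal stack of at least
 * `usable` bytes with one PROT_NONE guard page below and above it.
 * Returns 0 on success and stores the lowest usable address in *out_base;
 * returns -1 on failure.
 */
int install_guarded_altstack(size_t usable, void **out_base)
{
	assert(out_base != NULL);

	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;
	size_t pg = (size_t)page;

	/* Never go below the platform minimum, then round up to whole pages. */
	if (usable < (size_t)MINSIGSTKSZ)
		usable = (size_t)MINSIGSTKSZ;
	size_t size = (usable + pg - 1) / pg * pg;
	size_t total = size + 2 * pg;

	/* Map everything inaccessible first; the two outer pages stay that way. */
	char *region = mmap(NULL, total, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return -1;

	/* Open up only the middle part for the signal stack itself. */
	if (mprotect(region + pg, size, PROT_READ | PROT_WRITE) != 0) {
		munmap(region, total);
		return -1;
	}

	stack_t ss;
	ss.ss_sp = region + pg;
	ss.ss_size = size;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) != 0) {
		munmap(region, total);
		return -1;
	}

	*out_base = ss.ss_sp;
	return 0;
}
```

A handler registered with SA_ONSTACK then runs on this stack. As the message notes, a guard page only catches overflows that actually touch it, so a large untouched local array can still step over it.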
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [7/10] calculate the number of pages to be scanned per cgroup
Define function for calculating the number of scan target on each Zone/LRU. Changelog V1->V2. - fixed types of variable. Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> include/linux/memcontrol.h | 15 +++ mm/memcontrol.c| 33 + 2 files changed, 48 insertions(+) Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h === --- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h2007-11-27 11:22:14.0 +0900 +++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-27 11:22:51.0 +0900 @@ -73,6 +73,10 @@ extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority); +extern long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem, + struct zone *zone, int priority); +extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem, + struct zone *zone, int priority); #else /* CONFIG_CGROUP_MEM_CONT */ static inline void mm_init_cgroup(struct mm_struct *mm, @@ -173,6 +177,17 @@ return 0; } +static inline long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem, + struct zone *zone, int priority) +{ + return 0; +} + +static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem, + struct zone *zone, int priority) +{ + return 0; +} #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c === --- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-27 11:22:14.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/memcontrol.c2007-11-27 11:24:04.0 +0900 @@ -472,6 +472,39 @@ mem->prev_priority = priority; } +/* + * Calculate # of pages to be scanned in this priority/zone. + * See also vmscan.c + * + * priority starts from "DEF_PRIORITY" and decremented in each loop. 
+ * (see include/linux/mmzone.h) + */ + +long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem, + struct zone *zone, int priority) +{ + long nr_active; + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); + + nr_active = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE); + return (nr_active >> priority); +} + +long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem, + struct zone *zone, int priority) +{ + long nr_inactive; + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); + + nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE); + + return (nr_inactive >> priority); +} + unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, struct list_head *dst, unsigned long *scanned, int order,
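Editorial sketch: the scan-target arithmetic in the patch above is a right shift of the per-zone page count by the current priority. The standalone function below only illustrates that arithmetic; calc_scan_target() is a hypothetical name, and DEF_PRIORITY is assumed to be 12 as in mainline include/linux/mmzone.h.

```c
#include <assert.h>

/* Assumed starting priority, as in mainline include/linux/mmzone.h. */
#define DEF_PRIORITY 12

/*
 * Hypothetical userspace stand-in for the shift used in the patch:
 * at priority p, 1/2^p of the list is considered for scanning.
 */
long calc_scan_target(long nr_pages, int priority)
{
	assert(nr_pages >= 0);
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return nr_pages >> priority;
}
```

With 1048576 pages on a list, the first pass at DEF_PRIORITY targets 256 pages, and the target doubles each time the priority is decremented. A list shorter than 4096 pages yields a target of 0 at DEF_PRIORITY.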
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [6/10] remember reclaim priority in memory cgroup
Functions to remember reclaim priority per cgroup (as zone->prev_priority) Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> include/linux/memcontrol.h | 23 +++ mm/memcontrol.c| 20 2 files changed, 43 insertions(+) Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c === --- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-27 11:19:51.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/memcontrol.c2007-11-27 11:22:14.0 +0900 @@ -132,6 +132,7 @@ */ spinlock_t lru_lock; unsigned long control_type; /* control RSS or RSS+Pagecache */ + int prev_priority; /* for recording reclaim priority */ /* * statistics. */ @@ -452,6 +453,25 @@ return (long) (active / (inactive + 1)); } +/* + * prev_priority control...this will be used in memory reclaim path. + */ +int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem) +{ + return mem->prev_priority; +} + +void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority) +{ + if (priority < mem->prev_priority) + mem->prev_priority = priority; +} + +void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority) +{ + mem->prev_priority = priority; +} + unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, struct list_head *dst, unsigned long *scanned, int order, Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h === --- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h2007-11-27 11:19:00.0 +0900 +++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-27 11:22:14.0 +0900 @@ -67,6 +67,11 @@ extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem); extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem); +extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem); +extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, + int priority); +extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, + int priority); #else /* CONFIG_CGROUP_MEM_CONT */ @@ -150,6 +155,24 @@ return 0; } +static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem, + int 
priority) +{ + return 0; +} + +static inline void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, + int priority) +{ + return 0; +} + +static inline void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, + int priority) +{ + return 0; +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */
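Editorial sketch: the difference between "note" and "record" in the patch above is that note only ever lowers the stored priority, while record overwrites it. The code below is an illustration with a hypothetical struct cgroup_prio standing in for the single prev_priority field of struct mem_cgroup. It is not the kernel code.

```c
#include <assert.h>

/* Hypothetical stand-in for the one field of struct mem_cgroup used here. */
struct cgroup_prio {
	int prev_priority;
};

/* Read back the remembered priority. */
int get_reclaim_priority(const struct cgroup_prio *c)
{
	assert(c);
	return c->prev_priority;
}

/* "note": remember the lowest (most urgent) priority seen so far. */
void note_reclaim_priority(struct cgroup_prio *c, int priority)
{
	assert(c);
	if (priority < c->prev_priority)
		c->prev_priority = priority;
}

/* "record": unconditionally overwrite the remembered priority. */
void record_reclaim_priority(struct cgroup_prio *c, int priority)
{
	assert(c);
	c->prev_priority = priority;
}
```

A reclaim pass would call the note variant as the priority drops and the record variant once at the end to store the final value.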
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [5/10] calculate active/inactive imbalance per cgroup
calculate active/inactive imbalance per memory cgroup. Changelog V1 -> V2: - removed "total" (just count inactive and active) - fixed comment - fixed return type to be "long". Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> include/linux/memcontrol.h |8 mm/memcontrol.c| 14 ++ 2 files changed, 22 insertions(+) Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c === --- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-27 10:44:19.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/memcontrol.c2007-11-27 11:19:51.0 +0900 @@ -437,6 +437,20 @@ rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS); return (int)((rss * 100L) / total); } +/* + * This function is called from vmscan.c. In page reclaiming loop. balance + * between active and inactive list is calculated. For memory controller + * page reclaiming, we should use using mem_cgroup's imbalance rather than + * zone's global lru imbalance. + */ +long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem) +{ + unsigned long active, inactive; + /* active and inactive are the number of pages. 'long' is ok.*/ + active = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_ACTIVE); + inactive = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_INACTIVE); + return (long) (active / (inactive + 1)); +} unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, struct list_head *dst, Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h === --- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h2007-11-27 10:44:19.0 +0900 +++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-27 11:19:00.0 +0900 @@ -65,6 +65,8 @@ * For memory reclaim.
*/ extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem); +extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem); + #else /* CONFIG_CGROUP_MEM_CONT */ @@ -142,6 +144,12 @@ { return 0; } + +static inline int mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem) +{ + return 0; +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */
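Editorial sketch: the imbalance value computed in the patch above is an integer ratio, with 1 added to the divisor so that an empty inactive list cannot cause a division by zero. The standalone function below illustrates only that arithmetic; reclaim_imbalance() is a hypothetical name and takes the two page counts directly instead of reading them from a mem_cgroup.

```c
#include <assert.h>

/*
 * Hypothetical userspace version of the ratio in the patch:
 * how many active pages exist per inactive page, rounded down.
 */
long reclaim_imbalance(unsigned long active, unsigned long inactive)
{
	/* inactive + 1 would wrap to 0 only for ULONG_MAX; rule that out. */
	assert(inactive + 1 != 0);
	return (long)(active / (inactive + 1));
}
```

The result is 0 whenever the inactive list is at least as long as the active list, and grows as the active list dominates.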
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [4/10] calculate mapper_ratio per cgroup
Define function for calculating mapped_ratio in memory cgroup. Changelog V1->V2 - Fixed possible divide-by-zero bug. - Use "long" to avoid 64bit division on 32 bit system. and does necessary type casts. - Added comments. Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> include/linux/memcontrol.h | 11 ++- mm/memcontrol.c| 17 + 2 files changed, 27 insertions(+), 1 deletion(-) Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c === --- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-26 16:39:02.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/memcontrol.c2007-11-26 16:41:34.0 +0900 @@ -421,6 +421,23 @@ spin_unlock(&mem->lru_lock); } +/* + * Calculate mapped_ratio under memory controller. This will be used in + * vmscan.c for determining we have to reclaim mapped pages. + */ +int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem) +{ + long total, rss; + + /* +* usage is recorded in bytes. But, here, we assume the number of +* physical pages can be represented by "long" on any arch. +*/ + total = (long) (mem->res.usage >> PAGE_SHIFT) + 1L; + rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS); + return (int)((rss * 100L) / total); +} + unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, struct list_head *dst, unsigned long *scanned, int order, Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h === --- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h2007-11-26 15:31:19.0 +0900 +++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-26 16:39:05.0 +0900 @@ -61,6 +61,12 @@ extern void mem_cgroup_end_migration(struct page *page); extern void mem_cgroup_page_migration(struct page *page, struct page *newpage); +/* + * For memory reclaim.
+ */ +extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem); + + #else /* CONFIG_CGROUP_MEM_CONT */ static inline void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p) @@ -132,7 +138,10 @@ { } - +static inline int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem) +{ + return 0; +} #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */
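Editorial sketch: the mapped-ratio calculation in the patch above converts the byte usage to pages, adds 1 to avoid dividing by zero, and returns RSS as a percentage of that total. The standalone function below shows the same arithmetic; calc_mapped_ratio() and its parameters are hypothetical stand-ins for mem->res.usage, the RSS statistic and PAGE_SHIFT.

```c
#include <assert.h>

/*
 * Hypothetical userspace version of the percentage in the patch.
 * usage_bytes: charged bytes, rss_pages: mapped pages,
 * page_shift: log2 of the page size (12 for 4 KiB pages).
 */
int calc_mapped_ratio(unsigned long long usage_bytes, long rss_pages,
		      unsigned int page_shift)
{
	assert(page_shift < 64);
	assert(rss_pages >= 0);

	/* The "+ 1" keeps the divisor non-zero for an empty cgroup. */
	long total = (long)(usage_bytes >> page_shift) + 1L;
	return (int)((rss_pages * 100L) / total);
}
```

Because of the "+ 1", a cgroup whose usage is entirely RSS reports slightly under 100, which is the price of the divide-by-zero fix mentioned in the changelog.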
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [3/10] per-zone active inactive counter
Counting active/inactive per-zone in memory controller. This patch adds per-zone status in memory cgroup. These values are often read (as per-zone value) by page reclaiming. In current design, per-zone stat is just a unsigned long value and not an atomic value because they are modified only under lru_lock. (So, atomic_ops is not necessary.) This patch adds ACTIVE and INACTIVE per-zone status values. For handling per-zone status, this patch adds struct mem_cgroup_per_zone { ... } and some helper functions. This will be useful to add per-zone objects in mem_cgroup. This patch turns memory controller's early_init to be 0 for calling kmalloc() in initialization. Changelog V2 -> V3 - fixed comments. Changelog V1 -> V2 - added mem_cgroup_per_zone struct. This will help following patches to implement per-zone objects and pack them into a struct. - added __mem_cgroup_add_list() and __mem_cgroup_remove_list() - fixed page migration handling. - renamed zstat to info (per-zone-info) This will be place for per-zone information(lru, lock, ..) - use page_cgroup_nid()/zid() funcs. Acked-by: Balbir Singh <[EMAIL PROTECTED]> Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> mm/memcontrol.c | 164 +--- 1 file changed, 157 insertions(+), 7 deletions(-) Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c === --- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-26 16:39:00.0 +0900 +++ linux-2.6.24-rc3-mm1/mm/memcontrol.c2007-11-26 16:39:02.0 +0900 @@ -78,6 +78,31 @@ } /* + * per-zone information in memory controller. 
+ */ + +enum mem_cgroup_zstat_index { + MEM_CGROUP_ZSTAT_ACTIVE, + MEM_CGROUP_ZSTAT_INACTIVE, + + NR_MEM_CGROUP_ZSTAT, +}; + +struct mem_cgroup_per_zone { + unsigned long count[NR_MEM_CGROUP_ZSTAT]; +}; +/* Macro for accessing counter */ +#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) + +struct mem_cgroup_per_node { + struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; +}; + +struct mem_cgroup_lru_info { + struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; +}; + +/* * The memory controller data structure. The memory controller controls both * page cache and RSS per cgroup. We would eventually like to provide * statistics based on the statistics developed by Rik Van Riel for clock-pro, @@ -101,6 +126,7 @@ */ struct list_head active_list; struct list_head inactive_list; + struct mem_cgroup_lru_info info; /* * spin_lock to protect the per cgroup LRU */ @@ -158,6 +184,7 @@ MEM_CGROUP_CHARGE_TYPE_MAPPED, }; + /* * Always modified under lru lock. Then, not necessary to preempt_disable() */ @@ -173,7 +200,39 @@ MEM_CGROUP_STAT_CACHE, val); else __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val); +} +static inline struct mem_cgroup_per_zone * +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid) +{ + if (!mem->info.nodeinfo[nid]) + return NULL; + return >info.nodeinfo[nid]->zoneinfo[zid]; +} + +static inline struct mem_cgroup_per_zone * +page_cgroup_zoneinfo(struct page_cgroup *pc) +{ + struct mem_cgroup *mem = pc->mem_cgroup; + int nid = page_cgroup_nid(pc); + int zid = page_cgroup_zid(pc); + + return mem_cgroup_zoneinfo(mem, nid, zid); +} + +static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem, + enum mem_cgroup_zstat_index idx) +{ + int nid, zid; + struct mem_cgroup_per_zone *mz; + u64 total = 0; + + for_each_online_node(nid) + for (zid = 0; zid < MAX_NR_ZONES; zid++) { + mz = mem_cgroup_zoneinfo(mem, nid, zid); + total += MEM_CGROUP_ZSTAT(mz, idx); + } + return total; } static struct mem_cgroup init_mem_cgroup; @@ -286,12 
+345,51 @@ return ret; } +static void __mem_cgroup_remove_list(struct page_cgroup *pc) +{ + int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE; + struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc); + + if (from) + MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1; + else + MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1; + + mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false); + list_del_init(>lru); +} + +static void __mem_cgroup_add_list(struct page_cgroup *pc) +{ + int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE; + struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc); + + if (!to) { + MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1; + list_add(>lru, >mem_cgroup->inactive_list); + } else { + MEM_CGROUP_ZSTAT(mz,
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [2/10] nid/zid helper function for cgroup
Add macro to get node_id and zone_id of page_cgroup. Will be used in per-zone-xxx patches and others.

Changelog:
 - returns zone_type instead of int.

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 mm/memcontrol.c | 10 ++
 1 file changed, 10 insertions(+)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c 2007-11-26 15:31:19.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c 2007-11-26 16:39:00.0 +0900
@@ -135,6 +135,16 @@
 #define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
 
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+	return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+	return page_zonenum(pc->page);
+}
+
 enum {
 	MEM_CGROUP_TYPE_UNSPEC = 0,
 	MEM_CGROUP_TYPE_MAPPED,
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [1/10] add scan_global_lru macro
Add macro scan_global_lru(). This is used to detect whether a scan_control scans the global lru or a mem_cgroup lru. It compiles to a static value (1) when the memory controller is not configured. This may make the meaning obvious.

Acked-by: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 mm/vmscan.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

Index: linux-2.6.24-rc3-mm1/mm/vmscan.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/vmscan.c 2007-11-26 15:31:19.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/vmscan.c 2007-11-26 16:38:46.0 +0900
@@ -127,6 +127,12 @@
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+#ifdef CONFIG_CGROUP_MEM_CONT
+#define scan_global_lru(sc)	(!(sc)->mem_cgroup)
+#else
+#define scan_global_lru(sc)	(1)
+#endif
+
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -1290,11 +1296,12 @@
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
 		 */
-		if (sc->mem_cgroup == NULL)
+		if (scan_global_lru(sc)) {
 			shrink_slab(sc->nr_scanned, gfp_mask, lru_pages);
-		if (reclaim_state) {
-			nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
+			if (reclaim_state) {
+				nr_reclaimed += reclaim_state->reclaimed_slab;
+				reclaim_state->reclaimed_slab = 0;
+			}
 		}
 		total_scanned += sc->nr_scanned;
 		if (nr_reclaimed >= sc->swap_cluster_max) {
@@ -1321,7 +1328,7 @@
 			congestion_wait(WRITE, HZ/10);
 	}
 	/* top priority shrink_caches still had more to do? don't OOM, then */
-	if (!sc->all_unreclaimable && sc->mem_cgroup == NULL)
+	if (!sc->all_unreclaimable && scan_global_lru(sc))
 		ret = 1;
 out:
 	/*
[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [0/10] introduction
Hi, this is the per-zone/reclaim support patch set for the memory controller (cgroup).

Major changes from the previous one are:
-- tested with 2.6.24-rc3-mm1 + ia64/NUMA
-- applied comments.

I did a small test on a real NUMA machine. My machine was ia64/8CPU/2Node NUMA. I tried to compile the kernel under an 800M bytes limit with 32 parallel make. (make -j 32)

- 2.6.24-rc3-mm1 (+ scsi fix) shows soft lock-up. Before the soft lock-up, %sys was almost 100% several times.
- 2.6.24-rc3-mm1 (+ scsi fix) + this set completed successfully.

It seems %iowait dominates the total performance. (The current memory controller has no background reclaim.) It seems this set gives us some progress.

(*) I'd like to merge YAMAMOTO-san's background page reclaim for memory controller before discussing performance numbers.

Andrew, could you pick these up to -mm ?

Patch series brief description:
[1/10] ... add scan_global_lru() macro (clean up)
[2/10] ... nid/zid helper function for cgroup
[3/10] ... introduce per-zone object for memory controller and add active/inactive counter.
[4/10] ... calculate mapper_ratio per cgroup (for memory reclaim)
[5/10] ... calculate active/inactive imbalance per cgroup (based on [3])
[6/10] ... remember reclaim priority in memory controller
[7/10] ... calculate the number of pages to be reclaimed per cgroup
[8/10] ... modifies vmscan.c to isolate global-lru-reclaim and memory-cgroup-reclaim in obvious manner. (this patch uses functions defined in [4 - 7])
[9/10] ... implement per-zone-lru for cgroup (based on [3])
[10/10] ... implement per-zone lru lock for cgroup (based on [3][9])

Any comments are welcome.

Thanks,
-Kame
[PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead
Cc: Randy Dunlap <[EMAIL PROTECTED]>
Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---
this is against the latest cryptodev tree.

 crypto/tcrypt.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 1e12b86..ae762c2 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -87,12 +87,11 @@ static char *check[] = {
 	"camellia", "seed", "salsa20", NULL
 };
 
-static void hexdump(unsigned char *buf, unsigned int len)
+static inline void hexdump(unsigned char *buf, unsigned int len)
 {
-	while (len--)
-		printk("%02x", *buf++);
-
-	printk("\n");
+	print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
+			16, 1,
+			buf, len, 0);
 }
 
 static void tcrypt_complete(struct crypto_async_request *req, int err)
-- 
1.5.3.5
Re: __rcu_process_callbacks() in Linux 2.6
On Mon, Nov 26, 2007 at 02:48:08PM -0800, James Huang wrote: > > > -Original Message- > > From: James Huang [mailto:[EMAIL PROTECTED] > > Sent: Monday, November 26, 2007 2:21 PM > > To: James Huang > > Subject: Fw: __rcu_process_callbacks() in Linux 2.6 > > > > - Forwarded Message > > From: Manfred Spraul <[EMAIL PROTECTED]> > > To: James Huang <[EMAIL PROTECTED]> > > Cc: Paul E. McKenney <[EMAIL PROTECTED]>; linux- > > [EMAIL PROTECTED] > > Sent: Monday, November 26, 2007 10:28:37 AM > > Subject: __rcu_process_callbacks() in Linux 2.6 > > > > Hi James, > > > > If I understand the issue correctly, then the race is: > > > > step 1: cpu 1: starts a new rcu batch (i.e. rcp->cur++, smb_mb) > > > > step 2: cpu 2: completes the quiet state > > step 3: cpu 2: reads pointer 0x123 (ptr to a rcu protected struct) > > > > step 4: cpu 3: call_rcu(0x123): rcu protected struct added to > rdp->nxtlist > > step 5: cpu 3: moves a new batch into rdp->curlist, rdp->batch = rcp- > > >cur+1. > > xxx Problem: where is the smp_rmb() that guarantees that > > xxx update to rcp->cur from step 1 is seen by cpu 3? > > step 6: cpu 3: completes quiet state > > step 7: cpu 3: struct 0x123 destroyed > > > > step 8: cpu 2: accesses pointer 0x123, but the struct is already > destroyed > > > > James: Is that the race? > > > [James Huang] > > Yes, this is the race condition that I am concerned about. > > > > > > I agree with Paul, there are smb_rmb's on cpu 3 between Step 1 and > Step 5: > > Either the test_and_set_bit in tasklet_action for rcu_process_callback > > if step 4 happens before the tasklet or somewhere in the irq handler > > path if step 4 happens in an irq handler that interrupted > > rcu_process_callback. > > > > Thus theoretically no additional smb_rmb() should be necessary. > > What is missing is proper documentation. 
>
> [James Huang]
> Is it true that a smp_rmb() before a read operation (say from variable
> X) will guarantee that the read will always retrieve the most "current"
> value of X? I can not find such a guarantee in atomic_ops.txt or
> memory-barriers.txt under Linux's documentation directory. What is
> described in both documents is relative ordering, e.g.
>
>	CPU1			CPU2
>	----			----
>	write X = x1
>	smp_wmb()
>	write Y = y1
>
>				read Y
>				smp_rmb()
>				read X
>
> Then CPU2 will read X with a value of x1 if it reads Y with a value of
> y1.
>
> Please point me to the right section in the document if smp_rmb() does
> provide such a guarantee.

You are correct, smp_rmb() is about ordering rather than about any sort of immediacy. For one thing, it can be quite difficult to say exactly what the most "current" version of X might be at a given point in time from the viewpoint of a given CPU -- the different CPUs might well disagree as to what the "current" version is for awhile (though they are guaranteed to come to agreement).

> Thanks,
> -- James Huang
>
> > I'm analyzing the code right now:
> > Is it really true that typically a cpu only completes data in every other
> > rcu cycle? I.e. that most structures are stored in the rcu callback list
> > until two quiet states happened?

That is correct. This does mean that we should be able to leverage locking primitives and memory barriers executed from the scheduling clock interrupt.

> > I've tried to track the values of rcp->cur and rdp->batch.
> > If next_pending is set, then cpu_quiet() immediately starts
> > the next rcu cycle and a cpu cannot both complete the currently
> > pending rcu callbacks and add new callbacks to the next cycle,
> > thus a cpu only takes part in every other rcu cycle.
> >
> > The oocalc file is at
> > http://www.colorfullife.com/~manfred/rcu.ods
> > http://www.colorfullife.com/~manfred/rcu.pdf
> >
> > Is that analysis correct? Perhaps the whole code should be rewritten?
I believe that the sequencing in the spreadsheet is correct (and thank you very much for going through it!!!), but it seems to be silent on memory-barrier issues. I also believe that Gautham's new CPU-hotplug setup will make it possible to simplify the code quite a bit. And given that the grace-period-detection code is not on any sort of hot code path, it should be possible to use a less-aggressive design, perhaps one using straight locking to guard the shared structures.

Also, we are working in the -rt implementation on a scheme that allows CPUs to stay asleep through a grace period without the heavy overhead that is otherwise required to interact with them. The trick is to maintain a per-CPU counter that is incremented on each entry and exit to low-power state. But I would like to get this right in -rt before trying it in Classic RCU. ;-)
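
The CPU1/CPU2 ordering example discussed in this message can be approximated in user space with C11 atomics. This is only a sketch under assumptions: the names writer() and reader() are invented for illustration, and the C11 release and acquire fences are used as rough stand-ins for smp_wmb() and smp_rmb() (they are somewhat stronger than the kernel primitives). It shows the point made above: the barriers order the accesses relative to each other, but promise nothing about how soon the reader sees Y.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int x;		/* payload, the "X" of the example */
static atomic_int y;	/* flag, the "Y" of the example */

/* CPU1 side: write X, barrier, write Y. */
void writer(int x1)
{
	x = x1;
	atomic_thread_fence(memory_order_release);	/* roughly smp_wmb() */
	atomic_store_explicit(&y, 1, memory_order_relaxed);
}

/*
 * CPU2 side: read Y, barrier, read X.
 * Returns false if the write to Y is not visible yet; nothing here
 * forces the "most current" value of Y to be seen. If Y is seen as
 * set, the fences guarantee that X is seen as x1.
 */
bool reader(int *out)
{
	assert(out != NULL);
	if (atomic_load_explicit(&y, memory_order_relaxed) != 1)
		return false;
	atomic_thread_fence(memory_order_acquire);	/* roughly smp_rmb() */
	*out = x;
	return true;
}
```

In a real test writer() and reader() would run on different threads, with the reader retrying until it returns true.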
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
Linus Torvalds wrote:
> > The 6-word limit is a red herring. There are at least two ways to deal
> > with it (and this doesn't mean wiping the legacy stuff we already have):
> >
> > - Let each architecture pick a calling convention and redefine the
> > architecture-independent bits to take an arbitrary number of arguments.
> > This is a one-time panarchitectural change.
>
> Not applicable on x86-32. The six-word limit is effectively a hardware
> limit there. Once it goes past that limit, one of the words needs to be a
> pointer to extended information that is fundamentally slower to access.
> Happily, only very rare system calls do that (and none of them are of the
> simple variety where we see a few cycles easily).
>
> On other architectures, we could more easily just use more registers. But
> x86-32 is still a big part (bulk) of what matters for most people.

Well, x86-32 and x86-64 are surprisingly similar here, for very different reasons (x86-64 is because there are only seven clobbered registers that aren't destroyed by the syscall instruction itself.)

However, on both of these we could make the user-space side cheaper, by making sure that we don't have to do additional copies in user space. For both these architectures, anything more than 3 parameters (i386) or 6 parameters (x86-64) will be already in memory on the stack, so if we can use that image as-is then we at least save the intra-user-space copy that goes along with it.

x86-64 requires some minor thought, since the obvious way of doing it (using arg register 6 to push in a pointer) would end up with a discontiguous frame. One can do it with something like this, although it's not clear to me it is a win at all (the more obvious sequence using XCHG isn't usable since XCHG locks unconditionally):

	pop	%r10		# Return address
	push	%r9		# Argument 6
	movq	%rsp, %r11
	push	%r10
	movq	%rcx, %r10
	syscall
	cmpq	$-4095, %rax
	jae	...
	pop	%r10
	pop	%r9
	push	%r10
	retq

The number of registers does vary, obviously, with s390 being the smallest number (5).

> Immediately when you do anything but registers, it is much *much* more
> costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow,
> but it's quite noticeable overhead for simple system calls. It gets worse
> if this all is described by some indirect table setup.

True, of course, although we're talking here about different ways to pull arguments out of userspace memory; *definitely* agreed that we don't want to have any additional indirection necessary.

	-hpa
[Patch](Resend) mm/sparse.c: Improve the error handling for sparse_add_one_section()
On Mon, Nov 26, 2007 at 07:19:49PM +0900, Yasunori Goto wrote:
>Hi, Cong-san.
>
>> 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
>>
>> 	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
>>
>>  out:
>> 	pgdat_resize_unlock(pgdat, &flags);
>> -	if (ret <= 0)
>> -		__kfree_section_memmap(memmap, nr_pages);
>> +
>> 	return ret;
>>  }
>>  #endif
>
>Hmm. When sparse_init_one_section() returns error, memmap and
>usemap should be freed.

Hi, Yasunori.

Thanks for your comments. Is the following one fine for you?

Signed-off-by: WANG Cong <[EMAIL PROTECTED]>

---

Index: linux-2.6/mm/sparse.c
===================================================================
--- linux-2.6.orig/mm/sparse.c
+++ linux-2.6/mm/sparse.c
@@ -391,9 +391,17 @@ int sparse_add_one_section(struct zone *
 	 * no locking for this, because it does its own
 	 * plus, it does a kmalloc
 	 */
-	sparse_index_init(section_nr, pgdat->node_id);
+	ret = sparse_index_init(section_nr, pgdat->node_id);
+	if (ret < 0)
+		return ret;
 	memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, nr_pages);
+	if (!memmap)
+		return -ENOMEM;
 	usemap = __kmalloc_section_usemap();
+	if (!usemap) {
+		__kfree_section_memmap(memmap, nr_pages);
+		return -ENOMEM;
+	}
 
 	pgdat_resize_lock(pgdat, &flags);
 
@@ -403,10 +411,6 @@ int sparse_add_one_section(struct zone *
 		goto out;
 	}
 
-	if (!usemap) {
-		ret = -ENOMEM;
-		goto out;
-	}
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
 	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
@@ -414,7 +418,7 @@ int sparse_add_one_section(struct zone *
 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0)
-		__kfree_section_memmap(memmap, nr_pages);
+		kfree(usemap);
 	return ret;
 }
 #endif
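
The pattern the patch moves toward (check every allocation as it is made, and undo the earlier ones before returning an error) can be sketched in user space. Everything below is hypothetical and only loosely modeled on the patch: struct section, alloc_or_fail(), release() and add_one_section() are illustration names, and malloc()/free() stand in for the kernel allocators. The fail_at hook exists only so that a failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the memmap/usemap pair; not kernel types. */
struct section {
	void *memmap;
	void *usemap;
};

/* Test hook: the allocation with this index fails (-1: none fail). */
static int fail_at = -1;
static int alloc_calls;
static int live_allocs;	/* allocations not yet released */

static void *alloc_or_fail(size_t size)
{
	void *p;

	if (alloc_calls++ == fail_at)
		return NULL;
	p = malloc(size);
	if (p)
		live_allocs++;
	return p;
}

static void release(void *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
}

/* Returns 0 on success or -ENOMEM; on failure nothing stays allocated. */
int add_one_section(struct section *s, size_t memmap_size, size_t usemap_size)
{
	void *memmap, *usemap;

	assert(s != NULL);
	memmap = alloc_or_fail(memmap_size);
	if (!memmap)
		return -ENOMEM;

	usemap = alloc_or_fail(usemap_size);
	if (!usemap) {
		release(memmap);	/* undo the earlier step before bailing out */
		return -ENOMEM;
	}

	s->memmap = memmap;
	s->usemap = usemap;
	return 0;
}
```

Setting fail_at to 1 makes the second allocation fail, after which live_allocs should be back at zero.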
Re: [PATCH] Kobjects: drop child->parent ref at unregistration
On Mon, 26 Nov 2007, Andrew Morton wrote:

> On Mon, 19 Nov 2007 10:53:40 -0500 (EST)
> Alan Stern <[EMAIL PROTECTED]> wrote:
>
> > This patch (as1015) reverts changes that were made to the driver core
> > about four years ago. The intent back then was to avoid certain kinds
> > of invalid memory accesses by leaving kernel objects allocated as long
> > as any of their children were still allocated. The original and
> > correct approach was to wait only as long as any children were still
> > _registered_; that's what this patch reinstates.
>
> What happened with this?

As far as I know, it's on Greg's queue.

> > This fixes a problem in the SCSI core made visible by the class_device
> > to regular device conversion: A reference loop (scsi_device holds
> > reference to request_queue, which is the child of a gendisk, which is
> > the child of the scsi_device) prevents the data structures from being
> > released, even though they are deregistered okay.
> >
> > It's possible that this change will cause a few bugs to surface,
> > things that have been hidden for several years. They can be fixed
> > easily enough by having the child device take an explicit reference to
> > the parent whenever needed.
> >
>
> How will such bugs manifest? Ideally via a nice printk and a stack trace
> followed by damage avoidance.

They will manifest in the same way as any other use-after-free bug: an oops message and either death of the current process or a system hang.

Obviously I'm not aware of any such bugs -- if I were, I'd fix them. Greg has expressed concern that some USB serial drivers might have this problem. I'll do what testing I can (not much because I don't have any USB serial devices).

> If it's via a mysterious crash or something similarly obscure then can we
> improve that?

I can't think of anything offhand. Maybe someone else can.

Alan Stern
Re: [PATCH] [ACPI] utilities/: Compliment va_start() with va_end().
Moore, Robert wrote:
> Yes, it's official ANSI C, so I agree with the portability.
>
> I'm probably asking more about the history of the thing.

"the history of the thing"? Sorry, you lost me there. I know there was a pre-ANSI version of va_start() & co., but they seemed quite messy.

When it comes to va_end() and maintainers, they often seem positive to this. I guess the occasional lack of va_end() is usually an oversight.

> -----Original Message-----
> From: Richard Knutsson [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 26, 2007 4:16 PM
> To: Moore, Robert
> Cc: Len Brown; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Subject: Re: [PATCH] [ACPI] utilities/: Compliment va_start() with va_end().
>
> Moore, Robert wrote:
> > This is an interesting one to me.
> >
> > From various documentation:
> >
> > After all arguments have been retrieved, va_end resets the pointer to
> > NULL.
> >
> > va_end
> > Each invocation of va_start must be matched by a corresponding
> > invocation of va_end in the same function. After the call va_end(ap) the
> > variable ap is undefined. Multiple traversals of the list, each
> > bracketed by va_start and va_end are possible. va_end may be a macro or
> > a function.
> >
> > Now, I'm all for defensive programming, but I don't really see the point
> > of va_end when the list will be only traversed once.
>
> First off, I think it is a good idea to follow the documentation, which
> stated: "va_end Each invocation of va_start must be matched by a
> corresponding invocation of va_end in the same function."
>
> Then if it is not really needed, does it take up extra cycles?
> "In practice, with most C compilers, calling va_end does nothing and you
> do not really need to call it. This is always true in the GNU C
> compiler."[1]
>
> Portability:
> "But you might as well call va_end just in case your program is someday
> compiled with a peculiar compiler."[2]
> This argument is not as likely though, but who knows? (Since I guess
> Intel's compiler is included in the 'most C compilers')
>
> > We don't set all local pointers to NULL at function exit, what is the
> > point of doing it here?
>
> I think it is a good thing if the code follows the documentation, both
> for the person who tries to understand the code (to see when the 'args'
> is no longer needed and not getting confused by the absence of va_end(),
> after all, IMHO we should write the code how we want things to work and
> let the compiler do the optimizations (it usually does a better job at it
> than we do)) and to automated searches (that is how I found this one).
>
> > I suppose some implementation could allocate memory at va_start, but in
> > practice, does this happen?
>
> Not sure what you mean.
>
> > Bob
>
> cu
> Richard Knutsson
>
> [1] http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_28.html
> [2] The rest of [1]'s line.
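
The rule quoted in this thread (every va_start() matched by a va_end() in the same function, and multiple traversals each bracketed by their own pair) looks like this in plain ISO C. The two helper functions are hypothetical examples written for illustration and are not taken from the ACPI patch.

```c
#include <assert.h>
#include <stdarg.h>

/* Sum `count` int arguments: one traversal, one va_start/va_end pair. */
int sum_ints(int count, ...)
{
	va_list ap;
	int total = 0;

	assert(count >= 0);
	va_start(ap, count);
	for (int i = 0; i < count; i++)
		total += va_arg(ap, int);
	va_end(ap);	/* matched in the same function as va_start() */
	return total;
}

/* Count arguments larger than the mean: two traversals, two pairs. */
int count_above_mean(int count, ...)
{
	va_list ap;
	long sum = 0;
	int above = 0;

	assert(count >= 0);
	va_start(ap, count);		/* first pass: compute the sum */
	for (int i = 0; i < count; i++)
		sum += va_arg(ap, int);
	va_end(ap);

	va_start(ap, count);		/* second pass: restart from the top */
	for (int i = 0; i < count; i++)
		if ((long)va_arg(ap, int) * count > sum)
			above++;
	va_end(ap);
	return above;
}
```

With GCC the va_end() calls typically compile to nothing, as the glibc manual quoted above says, so following the rule costs no cycles there.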
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
On Mon, 26 Nov 2007, H. Peter Anvin wrote:
>
> I'm presuming you're not talking about some sort of syslets/fibrils/threadlets
> here (executing an interpreted thread of execution in kernel space). That's a
> whole separate ball of wax.

Indeed. I'm hoping that just dies. It's too complex. But the "do this single system call asynchronously" isn't, and has lots of historical implementations, ranging from VMS to the braindead POSIX "aio" setup.

I do think that more complex threadlets could be useful in theory, I just doubt they'd be used in practice..

> > So the choice is basically one of:
> >
> >  - come up with a totally new interface to system calls, and effectively
> >    duplicating the whole system call table.
> >
> >    I'd hate to do this. We already have duplicated system call tables due
> >    to compat stuff, it's painful.
>
> This would be the right thing to do if we were to redesign the system call
> interface from the ground up, which it doesn't exactly sound like we are
> intending.

Yeah. I'm also not sure it's the right thing even if we did redesign from scratch. The current system call interface may look less than regular, but it has some very solid foundation: it's fast.

Passing arguments in registers is by definition a lot faster *and*safer* than passing them any other way. There are no subtle security issues with people playing games with the argument base pointer (ie usually the stack pointer) and trying to fool the kernel into accessing kernel memory etc.

Immediately when you do anything but registers, it is much *much* more costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow, but it's quite noticeable overhead for simple system calls. It gets worse if this all is described by some indirect table setup.

In the system call path, right now, for some system calls, the biggest two overheads are

 - the CPU system call overhead itself.

   We can't do much about this, but the CPU designers do seem to be slowly getting it fixed (ie it's slower than it should need to be, but it's a hell of a lot faster than a P4 used to be ;)

 - the cost of just the single indirect - and unpredictable - call.

(The second cost is actually often totally hidden in the trivial system call benchmarks people run: if the benchmark just does "getppid()" a million times in a tight loop, the indirect call on the system call number seems really quite fast, but outside of benchmarks it is generally totally unpredictable indeed, and a real cost for real-life system call usage).

Everything else in the system call path is generally as fast as we can make it. Doing more indirection and conditionals would be really quite nasty.

Of course, for *most* of system calls, the work the kernel actually does ends up being so big that it doesn't much matter, but I was literally chasing down why a page fault had slowed down by ~70 cycles two weeks ago. And it doesn't take more than a couple of unpredictable jumps to do things like that!

> The 6-word limit is a red herring. There are at least two ways to deal with it
> (and this doesn't mean wiping the legacy stuff we already have):
>
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments. This
> is a one-time panarchitectural change.

Not applicable on x86-32. The six-word limit is effectively a hardware limit there. Once it goes past that limit, one of the words needs to be a pointer to extended information that is fundamentally slower to access. Happily, only very rare system calls do that (and none of them are of the simple variety where we see a few cycles easily).

On other architectures, we could more easily just use more registers. But x86-32 is still a big part (bulk) of what matters for most people.

		Linus
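
For reference, the kind of trivial benchmark mentioned above ("getppid()" in a tight loop) might look like the sketch below. It is an assumption-laden illustration, not a measurement tool from this thread: ns_per_getppid() is a made-up name, and it relies on POSIX clock_gettime() and getppid(). As the message points out, the loop keeps the branch predictor perfectly trained on one system call number, so the figure it reports understates the cost of the indirect call in real workloads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per getppid() call over `iterations` calls. */
double ns_per_getppid(long iterations)
{
	struct timespec start, end;
	double elapsed_ns;

	assert(iterations > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iterations; i++)
		(void)getppid();	/* same call every time: fully predictable */
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
		   + (end.tv_nsec - start.tv_nsec);
	return elapsed_ns / iterations;
}
```

Calling ns_per_getppid(1000000) gives the per-call average for a million iterations; the absolute number depends heavily on the CPU and kernel.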
Re: [PATCH 06/18] x86 vDSO: arch/x86/vdso/vdso32
> But whatever works. I'm currently skipping the patches since they didn't
> seem like 2.6.24 fodder anyway.

The vdso cleanups are pure cleanup, not fixing anything that's actively broken.

Thanks,
Roland
Re: [PATCH 06/18] x86 vDSO: arch/x86/vdso/vdso32
On Tue, 20 Nov 2007, Roland McGrath wrote:
>
> > git format-patch -p
> >
> > does the trick at least here :)
>
> Ok, I can use that in future. I hope it still means that in the eventual
> merged state, GIT will be aware of all the renames.

Git doesn't care. You can do renames by hand, or with "git mv", you can do them as a delete/create pair, you can use "git-apply" with a rename patch, and you can do them by re-typing in all of the file contents from scratch.

Regardless of how the rename is done, git will represent the data the exact same way: the state of the tree before and after.

The rename-patches are a lot denser and a lot more readable for humans (ie you can actually see what *happens*, unlike a traditional stupid unified diff), and I was hoping that eventually somebody in the GNU patch community would see how wonderful the extended patch information is, but when I tried to write a patch to "patch" to do it, I almost dug out my eyes with spoons from looking at the source code, so I haven't actually helped it happen.

So you can ask for patches in traditional format (*most* git command lines will default to that anyway, and only give a copy-patch with -C or -M on the command line), or people could realize that "git-apply" actually works even on non-git source code, and just stop using that abomination that is "patch" with all of its totally wrong and unsafe defaults (*).

But whatever works. I'm currently skipping the patches since they didn't seem like 2.6.24 fodder anyway.

		Linus

(*) Let me count the ways: applying patches partially when it fails half-way through a series. Defaulting to totally randomly guessing the path-name skip depth when not explicitly given a -pX option. Defaulting to "--fuzz=2" which is almost guaranteed to apply a patch even when it makes no sense what-so-ever. Yes, git-apply has stricter rules, but they are stricter for damn good reasons. For people who want the insane unsafe GNU patch defaults, they just have to specifically ask for unsafe modes..
[PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu
Hey all-
I've been working on an issue lately involving multi socket x86_64 systems connected via hypertransport bridges. It appears that some systems disable the hypertransport connections during a kdump operation when all but the crashing processor gets halted in machine_crash_shutdown. This becomes a problem when the ioapic attempts to route interrupts to the only remaining processor. Even though the active processor is targeted for interrupt reception, the fact that the hypertransport connections are inactive results in interrupts not getting delivered. The effective result is that timer interrupts are not delivered to the running cpu, and the system hangs on reboot into the kdump kernel during calibrate_delay. I've found that I've been able to avoid this hang by forcing a transition to the bios defined boot cpu during the crashing kernel shutdown. This patch accomplishes that. Tested by myself and the original reporter with successful results.

Regards,
Neil

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>

 arch/x86/kernel/crash.c |   46 ++++++++++++++++++++++++++++++++++++++--------
 include/linux/kexec.h   |    3 +++
 init/main.c             |    6 ++++++
 kernel/kexec.c          |    8 ++++++++
 4 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8bb482f..0682e60 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -67,13 +67,36 @@ static int crash_nmi_callback(struct notifier_block *self,
 	}
 #endif
 	crash_save_cpu(regs, cpu);
-	disable_local_APIC();
-	atomic_dec(&waiting_for_crash_ipi);
-	/* Assume hlt works */
-	halt();
-	for (;;)
-		cpu_relax();
-
+	if (smp_processor_id() == kexec_boot_cpu) {
+		/*
+		 * This is the boot cpu. We need to:
+		 * 1) Wait for the other processors to halt
+		 * 2) clear our nmi interrupt
+		 * 3) launch the new kernel
+		 */
+		unsigned long msecs = 1000;
+		while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
+			/*
+			 * Use udelay to avoid the warnings here
+			 * I know we shouldn't delay in an irq
+			 * but we're about to reboot the box during
+			 * a crash, a delay doesn't hurt here
+			 */
+			udelay(1000);
+			msecs--;
+		}
+		ack_APIC_irq();
+		disable_local_APIC();
+		disable_IO_APIC();
+		machine_kexec(kexec_crash_image);
+
+	} else {
+		disable_local_APIC();
+		atomic_dec(&waiting_for_crash_ipi);
+		/* Assume hlt works */
+		for(;;)
+			halt();
+	}
 	return 1;
 }
 
@@ -138,7 +161,14 @@ void machine_crash_shutdown(struct pt_regs *regs)
 	nmi_shootdown_cpus();
 	lapic_shutdown();
 #if defined(CONFIG_X86_IO_APIC)
-	disable_IO_APIC();
+	if (crashing_cpu == kexec_boot_cpu)
+		disable_IO_APIC();
 #endif
 	crash_save_cpu(regs, safe_smp_processor_id());
+	if (crashing_cpu != kexec_boot_cpu) {
+		atomic_dec(&waiting_for_crash_ipi);
+		for(;;)
+			halt();
+	}
+
 }
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 2d9c448..b5c12d6 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -187,6 +187,9 @@ extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
 extern size_t vmcoreinfo_size;
 extern size_t vmcoreinfo_max_size;
 
+extern int kexec_boot_cpu;
+extern void kexec_record_boot_cpu();
+
 int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
 
diff --git a/init/main.c b/init/main.c
index 58f5a99..0f11ee0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -58,6 +58,9 @@
 #include
 #include
 #include
+#ifdef CONFIG_KEXEC
+#include
+#endif
 #include
 #include
 
@@ -538,6 +541,9 @@ asmlinkage void __init start_kernel(void)
 	unwind_setup();
 	setup_per_cpu_areas();
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
+#ifdef CONFIG_KEXEC
+	kexec_record_boot_cpu();
+#endif
 
 	/*
 	 * Set up the scheduler prior starting any interrupts (such as the
diff --git a/kernel/kexec.c b/kernel/kexec.c
index aa74a1e..cb6b1f3 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -41,6 +41,14 @@ u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
 size_t vmcoreinfo_size;
 size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
 
+int kexec_boot_cpu = 0;
+
+void __init kexec_record_boot_cpu()
+{
+	kexec_boot_cpu = smp_processor_id();
+	printk(KERN_CRIT "kexec records boot cpu as %d\n",kexec_boot_cpu);
+}
+
 /* Location of the reserved area for the crash