Re: [PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open

2007-11-26 Thread Eric Dumazet

[EMAIL PROTECTED] wrote:

> On Tue, 27 Nov 2007 08:09:19 +0100, Eric Dumazet said:
>
> > Changing NR_OPEN is not considered safe because of vmalloc space potential
> > exhaust.
>
> Verbiage about this point...
>
> > +nr_open
> > +---
> > +
> > +Denotes the maximum number of file-handles a process can
> > +allocate. Default value is 1024*1024 (1048576) which should be
> > +enough for most machines. Actual limit depends on RLIMIT_NOFILE
> > +resource limit.
> > +
>
> should probably be in here - can you add something of the form "Setting this
> too high can cause vmalloc failures, especially on smaller-RAM machines",
> and/or *say* how much RAM the default takes?  Sure, it's 1M entries, but
> my tuning on a 2G-RAM machine will differ if these are byte-sized, or 128-byte
> sized - one is off in a corner, the other is 1/16th of my entire memory.


vmalloc failures can already happen if you start 32 processes on i386 kernels, 
each of them wanting to open file handle number 600.000 (if their 
RLIMIT_NOFILE >= 60)


fcntl(0, F_DUPFD, 60);
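
For readers unfamiliar with the call above, here is a minimal sketch (not part of the patch). fcntl(fd, F_DUPFD, min) duplicates fd onto the lowest free descriptor number that is >= min, so one call with a large minimum makes the kernel grow the process's fd table to cover that slot. The helper names dup_at_or_above() and probe_high_fd(), and the use of /dev/null, are assumptions made for this example only.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate fd onto the lowest free descriptor >= min_fd.
 * Returns the new descriptor, or -1 with errno set (for example
 * EINVAL when min_fd is not below the RLIMIT_NOFILE soft limit).
 */
static int dup_at_or_above(int fd, int min_fd)
{
    assert(min_fd >= 0);
    return fcntl(fd, F_DUPFD, min_fd);
}

/*
 * Open /dev/null, duplicate it at or above min_fd, then close both.
 * Returns the number the duplicate landed on, or -1 on failure.
 */
static int probe_high_fd(int min_fd)
{
    int fd = open("/dev/null", O_RDONLY);
    int high;

    if (fd < 0)
        return -1;
    high = dup_at_or_above(fd, min_fd);
    if (high >= 0)
        close(high);
    close(fd);
    return high;
}
```

Calling probe_high_fd(60) in a process with the usual soft limit of 1024 should return a descriptor number of at least 60; asking for a minimum at or above the soft limit is expected to fail instead of allocating anything.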

We are not going to add warnings about vmalloc to every sysctl out there
that could allow a root user to exhaust vmalloc space. This is a vmalloc issue
on 32-bit kernels, and quite frankly I have never hit this limit.


If you take a look at the vmalloc() implementation, the fact that it uses a
'struct vm_struct *vmlist;' to track all active zones shows that vmalloc() is
not used that much.




> Also, would it be useful to *lower* the value drastically, if you know a priori
> that no process should get up to 1K file handles, much less 1M? Does that
> buy me anything different than setting RLIMIT_NOFILE=1024?

NR_OPEN is the max value that RLIMIT_NOFILE can reach, nothing more.

You can set it to 256*1024*1024 or 4*1024; it won't change the memory needs of
your machine unless you raise RLIMIT_NOFILE and one of your programs leaks file
handles, or really wants to open many of them simultaneously.


Most programs won't open more than 500 files, so their file table is allocated 
via kmalloc()
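
To put rough numbers on the question of how much memory a table takes, the sketch below re-implements in user space the rounding that alloc_fdtable() performs in the patch under discussion (divide by 1024/sizeof(struct file *), round up to a power of two, multiply back, cap at the limit). It is an illustration under stated assumptions, not kernel code: round_up_pow2(), fdtable_slots() and fd_array_bytes() are names invented here, ptr_size stands in for sizeof(struct file *), and only the array of file pointers is counted, not the accompanying bitmaps.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two >= n, for 1 <= n <= 2^31. */
static unsigned int round_up_pow2(unsigned int n)
{
    unsigned int p = 1;

    assert(n >= 1 && n <= (1u << 31));
    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Number of fd slots the table is sized to when descriptor number nr
 * is requested, capped at limit (sysctl_nr_open in the patch).
 * ptr_size is the pointer size in bytes: 4 on i386, 8 on x86_64.
 * Inputs are assumed small enough not to overflow unsigned int.
 */
static unsigned int fdtable_slots(unsigned int nr, unsigned int limit,
                                  size_t ptr_size)
{
    unsigned int per_kb;

    assert(ptr_size >= 1 && ptr_size <= 1024);
    per_kb = (unsigned int)(1024 / ptr_size);
    nr /= per_kb;
    nr = round_up_pow2(nr + 1);
    nr *= per_kb;
    if (nr > limit)
        nr = limit;
    return nr;
}

/* Bytes taken by the array of file pointers alone. */
static size_t fd_array_bytes(unsigned int slots, size_t ptr_size)
{
    return (size_t)slots * ptr_size;
}
```

With 4-byte pointers, a request for descriptor 500 yields 512 slots (2 KiB of pointers), while a request for descriptor 600000 yields 1048576 slots (4 MiB of pointers). Thirty-two such processes would therefore need about 128 MiB for the pointer arrays alone.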


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Valdis . Kletnieks
On Mon, 26 Nov 2007 23:27:03 PST, Andrew Morton said:

> > git-x86.patch
> > git-x86-fixup.patch
> > git-x86-thread_order-borkage.patch
> > git-x86-thread_order-borkage-fix.patch
> > git-x86-identify_cpu-fix.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> > git-x86-inlining-borkage.patch
> > x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> > x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

> You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
> fixed it.

I suspect that trying -rc3-mm1 but refreshing just the 10 patches above
from -mmotm would be far less likely to pull in other heartburn?

> Otherwise, please proceed to work out which diff I need to drop and hope like
> hell that it isn't git-x86..

That's a 41,240 line diff, the rest *total* to about 400 lines.  I don't have
warm-n-fuzzies about my odds here. ;)

I'm a git-idiot, but *do* know how to git-bisect through Linus' tree - what
would I need to do to git-bisect through git-x86.patch? (I do *not* know how
to deal with more than 1 source git tree, so if the magic is just "get a
Linus tree, merge git-x86, then bisect as usual", I'm stuck on "merge
git-x86")..





Re: [RFC] Documentation about unaligned memory access

2007-11-26 Thread Kumar Gala


On Nov 23, 2007, at 5:43 AM, Heikki Orsila wrote:

> On Fri, Nov 23, 2007 at 12:15:53AM +, Daniel Drake wrote:
>
> > Why unaligned access is bad
> > ===
> >
> > Most architectures are unable to perform unaligned memory accesses. Any
> > unaligned access causes a processor exception.
>
> "Some architectures are unable to perform unaligned memory accesses,
> either an exception is generated, or the data
> access is silently invalid. In architectures that allow unaligned
> access, natural aligned accesses are usually faster than non-aligned."
>
> > In summary: if your code causes unaligned memory accesses to happen, your
> > code will not work on some platforms, and will perform *very* badly on
> > others.
>
> *very* -> *slower*
>
> > Natural alignment
> > =
>
> Please move this definition before "Why unaligned access is bad".
>
> Also, it would be nice to have a table of ISAs:
>
> ISA           Need natural    Need alignment
>               alignment       by x
>
> m68k          No              2
> powerpc/ppc   Yes             Word size


On ppc it varies from processor to processor whether misaligned data is
fixed up or causes an exception.  However, it's highly recommended that data be
naturally aligned.  I'm not sure I follow what is meant by the second
column (need alignment by x).
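
As a small illustration of the terms used in this thread (a sketch, not text from the proposed document): an address is naturally aligned for an access of a given size when it is a multiple of that size, and going through memcpy() is one common, portable way to read a value from an address that may not be aligned. The helper names is_naturally_aligned() and load_u32_any() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if p is a multiple of size, i.e. naturally aligned for that size. */
static int is_naturally_aligned(const void *p, size_t size)
{
    assert(size != 0);
    return ((uintptr_t)p % size) == 0;
}

/*
 * Read a 32-bit value from an arbitrary address. Copying byte-wise into
 * a properly aligned local lets the compiler pick a safe access sequence
 * for the target instead of dereferencing a misaligned uint32_t pointer.
 */
static uint32_t load_u32_any(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

The same pattern works for stores by swapping the memcpy() arguments.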


- k


Re: [patch 11/14] Powerpc: Use generic per cpu

2007-11-26 Thread Kumar Gala


On Nov 26, 2007, at 6:14 PM, Christoph Lameter wrote:

> Powerpc has a way to determine the address of the per cpu area of the
> currently executing processor via the paca and the array of per cpu
> offsets is avoided by looking up the per cpu area from the remote
> paca's (copying x86_64).
>
> Cc: Paul Mackerras <[EMAIL PROTECTED]>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> ---
> include/asm-powerpc/percpu.h |   19 ---
> 1 file changed, 19 deletions(-)
>
> Index: linux-2.6/include/asm-powerpc/percpu.h
> ===
> --- linux-2.6.orig/include/asm-powerpc/percpu.h  2007-11-24 10:27:31.088350556 -0800
> +++ linux-2.6/include/asm-powerpc/percpu.h  2007-11-24 10:29:20.752350757 -0800
> @@ -16,25 +16,6 @@
> #define __my_cpu_offset() get_paca()->data_offset
> #define per_cpu_offset(x) (__per_cpu_offset(x))


This concerns me.  paca doesn't exist on all PPC platforms.

- k


Re: Booting latest linux kernel(2.6.20) on MPC8548ECDS

2007-11-26 Thread Kumar Gala


On Nov 26, 2007, at 11:54 PM, rajendra prasad wrote:

> Hi,
>
> I am using MPC8548ECDS board from CDS for my telecom application. I am
> able to build 2.6.10 linux kernel and boot 2.6.10 kernel on
> MPC8548ECDS board.When I take same configuration file and built
> successfully but not able to boot on MPC8548E CDS board.I am  using
> u-boot-1.1.6 as boot loader.I came to know taht latest kernel is
> booted with new procedure.Pls tell me the procedure how to boot
> procedure.


Ask this question on the linuxppc-dev list.  You're more likely to get
an answer.


It's unclear, but are you trying to use the latest kernel on an MPC8548E
CDS?


- k


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Andrew Morton
On Tue, 27 Nov 2007 02:16:26 -0500 [EMAIL PROTECTED] wrote:

> On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> 
> Finally got both time and motivation to at least start a bisect..
> 
> 2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)
> 
> 24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
> prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
> dead. No serial console output, no pair of penguins on the monitor, no
> netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
> *anything* is "hold the power button for 5 seconds".  Whatever it is, it
> happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
> banner), and happens *hard*.
> 
> I've bisected it down this far:
> 
> git-ipwireless_cs.patch GOOD
> git-x86.patch
> git-x86-fixup.patch
> git-x86-thread_order-borkage.patch
> git-x86-thread_order-borkage-fix.patch
> git-x86-identify_cpu-fix.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> git-x86-inlining-borkage.patch
> x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD
> 
> Anybody got any good debugging ideas before I go through and do the final
> 3 or 4 bisects?  I suspect I'll need them once I find the offending patch
> to tell *why* said patch dies on my box - I've seen enough traffic regarding
> -rc3-mm1 dying *later* to know it's probably a subtle issue and not one
> that will be obvious once I finger a specific patch.  For example, it's
> probably not the IO-APIC panic that people are seeing, because their kernels
> live long enough to panic. ;)
> 

You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
fixed it.

Otherwise, please proceed to work out which diff I need to drop and hope like
hell that it isn't git-x86..


Re: [PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open

2007-11-26 Thread Valdis . Kletnieks
On Tue, 27 Nov 2007 08:09:19 +0100, Eric Dumazet said:

> Changing NR_OPEN is not considered safe because of vmalloc space potential 
> exhaust.

Verbiage about this point...


> +nr_open
> +---
> +
> +Denotes the maximum number of file-handles a process can
> +allocate. Default value is 1024*1024 (1048576) which should be
> +enough for most machines. Actual limit depends on RLIMIT_NOFILE
> +resource limit.
> +

should probably be in here - can you add something of the form "Setting this
too high can cause vmalloc failures, especially on smaller-RAM machines",
and/or *say* how much RAM the default takes?  Sure, it's 1M entries, but
my tuning on a 2G-RAM machine will differ if these are byte-sized, or 128-byte
sized - one is off in a corner, the other is 1/16th of my entire memory.

Also, would it be useful to *lower* the value drastically, if you know a priori
that no process should get up to 1K file handles, much less 1M? Does that
buy me anything different than setting RLIMIT_NOFILE=1024?




Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Valdis . Kletnieks
On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/

Finally got both time and motivation to at least start a bisect..

2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)

24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
dead. No serial console output, no pair of penguins on the monitor, no
netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
*anything* is "hold the power button for 5 seconds".  Whatever it is, it
happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
banner), and happens *hard*.

I've bisected it down this far:

git-ipwireless_cs.patch GOOD
git-x86.patch
git-x86-fixup.patch
git-x86-thread_order-borkage.patch
git-x86-thread_order-borkage-fix.patch
git-x86-identify_cpu-fix.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
git-x86-inlining-borkage.patch
x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

Anybody got any good debugging ideas before I go through and do the final
3 or 4 bisects?  I suspect I'll need them once I find the offending patch
to tell *why* said patch dies on my box - I've seen enough traffic regarding
-rc3-mm1 dying *later* to know it's probably a subtle issue and not one
that will be obvious once I finger a specific patch.  For example, it's
probably not the IO-APIC panic that people are seeing, because their kernels
live long enough to panic. ;)





patch driver-core-fix-race-in-__device_release_driver.patch added to gregkh-2.6 tree

2007-11-26 Thread gregkh

This is a note to let you know that I've just added the patch titled

 Subject: Driver core: fix race in __device_release_driver

to my gregkh-2.6 tree.  Its filename is

 driver-core-fix-race-in-__device_release_driver.patch

This tree can be found at 
http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From [EMAIL PROTECTED]  Mon Nov 26 22:49:20 2007
From: Alan Stern <[EMAIL PROTECTED]>
Date: Fri, 16 Nov 2007 11:57:28 -0500 (EST)
Subject: Driver core: fix race in __device_release_driver
To: Greg KH <[EMAIL PROTECTED]>, David Woodhouse <[EMAIL PROTECTED]>
Cc: USB development list <[EMAIL PROTECTED]>,  Kernel development list 

Message-ID: <[EMAIL PROTECTED]>


This patch (as1013) was suggested by David Woodhouse; it fixes a race
in the driver core.  If a device is unregistered at the same time as
its driver is unloaded, the driver's code pages may be unmapped while
the remove method is still running.  The calls to get_driver() and
put_driver() were intended to prevent this, but they don't work if the
driver's module count has already dropped to 0.

Instead, the patch keeps the device on the driver's list until after
the remove method has returned.  This forces the necessary
synchronization to occur.

Signed-off-by: Alan Stern <[EMAIL PROTECTED]>
Signed-off-by: David Woodhouse <[EMAIL PROTECTED]>
Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>

---
 drivers/base/dd.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -289,11 +289,10 @@ static void __device_release_driver(stru
 {
struct device_driver * drv;
 
-   drv = get_driver(dev->driver);
+   drv = dev->driver;
if (drv) {
driver_sysfs_remove(dev);
sysfs_remove_link(&dev->kobj, "driver");
-   klist_remove(&dev->knode_driver);
 
if (dev->bus)
blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
@@ -306,7 +305,7 @@ static void __device_release_driver(stru
drv->remove(dev);
devres_release_all(dev);
dev->driver = NULL;
-   put_driver(drv);
+   klist_remove(&dev->knode_driver);
}
 }
 


Patches currently in gregkh-2.6 which might be from [EMAIL PROTECTED] are

driver/pm-acquire-device-locks-prior-to-suspending.patch
driver/create-sys-...-power-when-config_pm-is-set.patch
driver/driver-core-fix-race-in-__device_release_driver.patch
usb/usb-add-support-for-an-older-firmware-revision-for-the-nikon-d200.patch
usb/usb-fix-priority-mistakes-in-drivers-usb-core-hub.c.patch
usb/usb-fix-signr-comment-in-usbdevice_fs.h.patch
usb/usb-mailing-lists-have-changed.patch
usb/usb-power-management-documenation-update.patch
usb/usb-hcd-avoid-duplicate-local_irq_disable.patch
usb/usb-usb-mon-mon_bin.c-cleanups.patch
usb/usb-keep-track-of-whether-interface-sysfs-files-exist.patch
usb/usb-uevent-environment-key-fix.patch
usb/usb-autosuspend-for-cdc-acm.patch
usb/usb-fix-up-ehci-startup-synchronization.patch
usb/usb-usb-storage-new-lockable-subclass-0x07.patch
usb/usb-don-t-change-hc-power-state-for-a-freeze.patch
usb/usb-dummy_hcd-don-t-register-drivers-on-the-platform-bus.patch
usb/usb-force-handover-port-to-companion-when-hub_port_connect_change-fails.patch
usb/usb-make-ksuspend_usbd-thread-non-freezable.patch
usb/usb-usb-storage-unusual_devs-entry-for-jetflash-ts1gjf2a.patch
usb/usb-storage-always-set-the-allow_restart-flag.patch


[PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open

2007-11-26 Thread Eric Dumazet

As changing NR_OPEN from 1024*1024 to 16*1024*1024 was considered a little
bit dangerous, just let it default to 1024*1024 but add a new sysctl
to let sysadmins change this value.

Thank you

[PATCH] get rid of NR_OPEN and introduce a sysctl_nr_open

NR_OPEN (historically set to 1024*1024) actually forbids processes to open
more than 1024*1024 handles.

Unfortunately some production servers hit the not so 'ridiculously high value'
of 1024*1024 file descriptors per process.

Changing NR_OPEN is not considered safe because of potential vmalloc space
exhaustion.

This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.



Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
Cc: Alan Cox <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>

 Documentation/filesystems/proc.txt |8 
 Documentation/sysctl/fs.txt|   10 ++
 fs/file.c  |8 +---
 include/linux/fs.h |2 +-
 kernel/sys.c   |2 +-
 kernel/sysctl.c|8 
 6 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dec9945..9b390d7 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -989,6 +989,14 @@ nr_inodes
 Denotes the  number  of  inodes the system has allocated. This number will
 grow and shrink dynamically.
 
+nr_open
+---
+
+Denotes the maximum number of file-handles a process can
+allocate. Default value is 1024*1024 (1048576) which should be
+enough for most machines. Actual limit depends on RLIMIT_NOFILE
+resource limit.
+
 nr_free_inodes
 --
 
diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index aa986a3..f992543 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -23,6 +23,7 @@ Currently, these files are in /proc/sys/fs:
 - inode-max
 - inode-nr
 - inode-state
+- nr_open
 - overflowuid
 - overflowgid
 - suid_dumpable
@@ -91,6 +92,15 @@ usage of file handles and you don't need to increase the 
maximum.
 
 ==
 
+nr_open:
+
+This denotes the maximum number of file-handles a process can
+allocate. Default value is 1024*1024 (1048576) which should be
+enough for most machines. Actual limit depends on RLIMIT_NOFILE
+resource limit.
+
+==
+
 inode-max, inode-nr & inode-state:
 
 As with file handles, the kernel allocates the inode structures
diff --git a/fs/file.c b/fs/file.c
index c5575de..5110acb 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -24,6 +24,8 @@ struct fdtable_defer {
struct fdtable *next;
 };
 
+int sysctl_nr_open __read_mostly = 1024*1024;
+
 /*
  * We use this list to defer free fdtables that have vmalloced
  * sets/arrays. By keeping a per-cpu list, we avoid having to embed
@@ -147,8 +149,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
nr /= (1024 / sizeof(struct file *));
nr = roundup_pow_of_two(nr + 1);
nr *= (1024 / sizeof(struct file *));
-   if (nr > NR_OPEN)
-   nr = NR_OPEN;
+   if (nr > sysctl_nr_open)
+   nr = sysctl_nr_open;
 
fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL);
if (!fdt)
@@ -233,7 +235,7 @@ int expand_files(struct files_struct *files, int nr)
if (nr < fdt->max_fds)
return 0;
/* Can we expand? */
-   if (nr >= NR_OPEN)
+   if (nr >= sysctl_nr_open)
return -EMFILE;
 
/* All good, so we try */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..1cda287 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -21,7 +21,7 @@
 
 /* Fixed constants first: */
 #undef NR_OPEN
-#define NR_OPEN (1024*1024)/* Absolute upper limit on fd num */
+extern int sysctl_nr_open;
 #define INR_OPEN 1024  /* Initial setting for nfile rlimits */
 
 #define BLOCK_SIZE_BITS 10
diff --git a/kernel/sys.c b/kernel/sys.c
index d1fe71e..99c6ce1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1472,7 +1472,7 @@ asmlinkage long sys_setrlimit(unsigned int resource, 
struct rlimit __user *rlim)
if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
!capable(CAP_SYS_RESOURCE))
return -EPERM;
-   if (resource == RLIMIT_NOFILE && new_rlim.rlim_max > NR_OPEN)
+   if (resource == RLIMIT_NOFILE && new_rlim.rlim_max > sysctl_nr_open)
return -EPERM;
 
retval = security_task_setrlimit(resource, &new_rlim);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0deed82..de22f7b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1127,6 +1127,14 @@ static struct ctl_table fs_table[] = {
.proc_handler   = &proc_dointvec,
},
{
+   

Re: [PATCH] dmaengine: Driver for the AVR32 DMACA controller

2007-11-26 Thread Andrew Morton

This:

> Subject: Re: [PATCH] dmaengine: Driver for the AVR32 DMACA controller

in no way describes this:

> This patch corrects recently changed (and now invalid) Kconfig
> descriptions for the DMA engine framework:

grr.


[PATCH] [VIDEO]: Complement va_start() with va_end().

2007-11-26 Thread Richard Knutsson
Complement va_start() with va_end().

Signed-off-by: Richard Knutsson <[EMAIL PROTECTED]>
---
Compile-tested on i386 with allyesconfig and allmodconfig.


diff --git a/drivers/media/video/saa5246a.c b/drivers/media/video/saa5246a.c
index ad02329..996b494 100644
--- a/drivers/media/video/saa5246a.c
+++ b/drivers/media/video/saa5246a.c
@@ -187,12 +187,14 @@ static int i2c_senddata(struct saa5246a_device *t, ...)
 {
unsigned char buf[64];
int v;
-   int ct=0;
+   int ct = 0;
va_list argp;
-   va_start(argp,t);
+   va_start(argp, t);
 
-   while((v=va_arg(argp,int))!=-1)
-   buf[ct++]=v;
+   while ((v = va_arg(argp, int)) != -1)
+   buf[ct++] = v;
+
+   va_end(argp);
return i2c_sendbuf(t, buf[0], ct-1, buf+1);
 }
 
diff --git a/drivers/media/video/saa5249.c b/drivers/media/video/saa5249.c
index 94bb59a..da5ca30 100644
--- a/drivers/media/video/saa5249.c
+++ b/drivers/media/video/saa5249.c
@@ -282,12 +282,14 @@ static int i2c_senddata(struct saa5249_device *t, ...)
 {
unsigned char buf[64];
int v;
-   int ct=0;
+   int ct = 0;
va_list argp;
va_start(argp,t);
 
-   while((v=va_arg(argp,int))!=-1)
-   buf[ct++]=v;
+   while ((v = va_arg(argp, int)) != -1)
+   buf[ct++] = v;
+
+   va_end(argp);
return i2c_sendbuf(t, buf[0], ct-1, buf+1);
 }
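
The pattern this patch completes can be shown outside the driver with a minimal user-space sketch. It mirrors the shape of i2c_senddata() (int arguments terminated by -1) but is not the driver code: collect_bytes() is a hypothetical name, and the capacity check is an addition made here for safety that the original functions do not have.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Copy int arguments into buf until the -1 terminator is seen.
 * Returns the number of bytes stored, or -1 if they do not fit in cap.
 */
static int collect_bytes(unsigned char *buf, size_t cap, ...)
{
    va_list argp;
    size_t ct = 0;
    int v;
    int ret;

    assert(buf != NULL);
    va_start(argp, cap);
    while ((v = va_arg(argp, int)) != -1) {
        if (ct == cap)
            break;              /* out of room before the terminator */
        buf[ct++] = (unsigned char)v;
    }
    ret = (v == -1) ? (int)ct : -1;
    va_end(argp);               /* pairs with va_start() on every path */
    return ret;
}
```

Funnelling both the success and the overflow case through a single exit keeps the va_start()/va_end() pairing easy to check by eye.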
 


Re: [2.6 patch] make I/O schedulers non-modular

2007-11-26 Thread Jarek Poplawski
On 25-11-2007 18:22, Jens Axboe wrote:
> On Sun, Nov 25 2007, Adrian Bunk wrote:
...
>> Is there any technical reason why we need 4 different schedulers at all?
> 
> Until we have the perfect scheduler :-)

IMHO this is not enough yet. There is something called "the right
of choice", and, it seems, things are usually far from perfect
where this right is not respected.

Regards,
Jarek P.


RE: [PATCH] [NET]: Fix TX bug VLAN in VLAN

2007-11-26 Thread Joonwoo Park
2007/11/27, Herbert Xu <[EMAIL PROTECTED]>:
> On Tue, Nov 27, 2007 at 02:32:49PM +0900, Joonwoo Park wrote:
> >
> > Thanks Herbert.
> > Well.. I think patch would work propely for AF_PACKET also.
> > (I did not insert BUG() macro in my patch)
> > How do you think?
> 
> Are you sure? I thought you need to check both in the xmit function.
> That is,
> 
>if (veth->h_vlan_proto != htons(ETH_P_8021Q) ||
>VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR) {
> 
> Otherwise you'll miss AF_PACKET packets when REORDER is off.

Thanks Herbert!
I agree with you.

Thanks.
Joonwoo

[NET]: Fix TX bug VLAN in VLAN
Fix misbehavior of vlan_dev_hard_start_xmit() for recursive encapsulations.

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>

---
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 7a36878..4f99bb8 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -462,7 +462,8 @@ int vlan_dev_hard_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
 * OTHER THINGS LIKE FDDI/TokenRing/802.3 SNAPs...
 */
 
-   if (veth->h_vlan_proto != htons(ETH_P_8021Q)) {
+   if (veth->h_vlan_proto != htons(ETH_P_8021Q) ||
+   VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR) {
int orig_headroom = skb_headroom(skb);
unsigned short veth_TCI;
 
---



Re: [PATCH] utsns: Restore proper namespace handling.

2007-11-26 Thread Andrew Morton
On Mon, 26 Nov 2007 09:19:17 -0600 "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:

> Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> > 
> > When CONFIG_UTS_NS was removed it seems that we also deleted
> > the code for handling sysctls in the other then the initial
> > uts namespace.   This patch restores that code.
> > 
> > Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> 
> Thanks, Eric.
> 
> Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
> 
> > ---
> >  kernel/utsname_sysctl.c |2 ++
> >  1 files changed, 2 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/utsname_sysctl.c b/kernel/utsname_sysctl.c
> > index c76c064..71f58c3 100644
> > --- a/kernel/utsname_sysctl.c
> > +++ b/kernel/utsname_sysctl.c
> > @@ -18,6 +18,8 @@
> >  static void *get_uts(ctl_table *table, int write)
> >  {
> > char *which = table->data;
> > +   struct uts_namespace *uts_ns = current->nsproxy->uts_ns;
> > +   which = (which - (char *)&init_uts_ns) + (char *)uts_ns;
> > 
> > if (!write)
> > down_read(&uts_sem);

I already have a (more codingstylely attractive) version of this from
Pavel, for which I shall steal your ack.

--- a/kernel/utsname_sysctl.c~isolate-the-uts-namespaces-domainname-and-hostname-back
+++ a/kernel/utsname_sysctl.c
@@ -18,6 +18,10 @@
 static void *get_uts(ctl_table *table, int write)
 {
char *which = table->data;
+   struct uts_namespace *uts_ns;
+
+   uts_ns = current->nsproxy->uts_ns;
+   which = (which - (char *)&init_uts_ns) + (char *)uts_ns;
 
if (!write)
down_read(&uts_sem);
_


Those pointer tricksies are revolting.  What's going on in there?
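
One possible reading of that arithmetic, offered as a sketch and not as the authors' explanation: the sysctl table's data pointer appears to refer to a field inside the initial namespace's structure, so subtracting the address of that structure gives the field's byte offset, and adding the offset to the current namespace's structure gives the same field in the current namespace. The struct names_demo type and the rebase_field() helper below are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-namespace name data. */
struct names_demo {
    char nodename[65];
    char domainname[65];
};

/*
 * Given a pointer to a field inside *base, return a pointer to the same
 * field inside *other by keeping the byte offset and swapping the base.
 */
static char *rebase_field(char *field, struct names_demo *base,
                          struct names_demo *other)
{
    ptrdiff_t off = field - (char *)base;

    assert(off >= 0 && (size_t)off < sizeof(*base));
    return (char *)other + off;
}
```

Storing the offset in the table (for instance via offsetof()) instead of a pointer into one particular instance would express the same idea without the subtraction; whether that fits the sysctl table layout is not something this sketch tries to settle.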


Re: [PATCH 2.6.24-rc3-mm1] IPC: consolidate sem_exit_ns(), msg_exit_ns and shm_exit_ns()

2007-11-26 Thread Andrew Morton
On Mon, 26 Nov 2007 22:44:38 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Fri, 23 Nov 2007 17:52:50 +0100 Pierre Peiffer <[EMAIL PROTECTED]> wrote:
> 
> > sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
> > ipc_namespace is released to free all ipcs of each type.
> > But in fact, they do the same thing: they loop around all ipcs to free them
> > individually by calling a specific routine.
> > 
> > This patch proposes to consolidate this by introducing a common function,
> > free_ipcs(), that do the job. The specific routine to call on each
> > individual ipcs is passed as parameter. For this, these ipc-specific 'free'
> > routines are reworked to take a generic 'struct ipc_perm' as parameter.
> 
> This conflicts in more-than-trivial ways with Pavel's
> move-the-ipc-namespace-under-ipc_ns-option.patch, which was in
> 2.6.24-rc3-mm1.
> 

err, no, it wasn't that patch.  For some reason your change assumes that
msg_exit_ns() (for example) doesn't have these lines:

kfree(ns->ids[IPC_MSG_IDS]);
ns->ids[IPC_MSG_IDS] = NULL;

in it.


Re: [PATCH 2.6.24-rc3-mm1] IPC: consolidate sem_exit_ns(), msg_exit_ns and shm_exit_ns()

2007-11-26 Thread Andrew Morton
On Fri, 23 Nov 2007 17:52:50 +0100 Pierre Peiffer <[EMAIL PROTECTED]> wrote:

> sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
> ipc_namespace is released to free all ipcs of each type.
> But in fact, they do the same thing: they loop around all ipcs to free them
> individually by calling a specific routine.
> 
> This patch proposes to consolidate this by introducing a common function,
> free_ipcs(), that do the job. The specific routine to call on each individual
> ipcs is passed as parameter. For this, these ipc-specific 'free' routines are
> reworked to take a generic 'struct ipc_perm' as parameter.

This conflicts in more-than-trivial ways with Pavel's
move-the-ipc-namespace-under-ipc_ns-option.patch, which was in
2.6.24-rc3-mm1.



Re: [PATCH] [NET]: Fix TX bug VLAN in VLAN

2007-11-26 Thread Herbert Xu
On Tue, Nov 27, 2007 at 02:32:49PM +0900, Joonwoo Park wrote:
> 
> Thanks Herbert.
> Well.. I think patch would work propely for AF_PACKET also.
> (I did not insert BUG() macro in my patch)
> How do you think?

Are you sure? I thought you need to check both in the xmit function.
That is,

if (veth->h_vlan_proto != htons(ETH_P_8021Q) ||
VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR) {

Otherwise you'll miss AF_PACKET packets when REORDER is off.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


out of office

2007-11-26 Thread Christine Abraham
I am responding to the oil spill in San Francisco Bay and will have very
limited email contact for the foreseeable future. If this is urgent, please call
my cell phone (415-717-6348), and I'll get back to you as soon as possible.


Thanks for your patience, 
Christine Abraham


Christine Abraham
Marine Ecology Division 
PRBO Conservation Science 
3820 Cypress Dr. #11
Petaluma, California
94954
707-781-2555 ext. 334


[GIT PULL] please pull infiniband.git

2007-11-26 Thread Roland Dreier
Linus, please pull from

master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will pull some small fixes for 2.6.24:

Erez Zilber (1):
  IB/iser: Add missing counter increment in iser_data_buf_aligned_len()

Jack Morgenstein (1):
  mlx4_core: Fix state check in mlx4_qp_modify()

Joachim Fenkes (1):
  IB/ehca: Fix static rate regression

Ralph Campbell (4):
  IB/ipath: Fix offset returned to ibv_resize_cq()
  IB/ipath: Fix error path in QP creation
  IB/ipath: Fix offset returned to ibv_modify_srq()
  IB/ipath: Normalize error return codes for posting work requests

 drivers/infiniband/hw/ehca/ehca_qp.c  |4 +-
 drivers/infiniband/hw/ipath/ipath_cq.c|   19 +---
 drivers/infiniband/hw/ipath/ipath_qp.c|   15 ++
 drivers/infiniband/hw/ipath/ipath_srq.c   |   44 +
 drivers/infiniband/hw/ipath/ipath_verbs.c |8 +++--
 drivers/infiniband/ulp/iser/iser_memory.c |6 ++-
 drivers/net/mlx4/qp.c |2 +-
 7 files changed, 61 insertions(+), 37 deletions(-)


diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c 
b/drivers/infiniband/hw/ehca/ehca_qp.c
index 2e3e654..dd12668 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -1203,7 +1203,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
mqpcb->service_level = attr->ah_attr.sl;
update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL, 1);
 
-   if (ehca_calc_ipd(shca, my_qp->init_attr.port_num,
+   if (ehca_calc_ipd(shca, mqpcb->prim_phys_port,
  attr->ah_attr.static_rate,
  &mqpcb->max_static_rate)) {
ret = -EINVAL;
@@ -1302,7 +1302,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
mqpcb->source_path_bits_al = attr->alt_ah_attr.src_path_bits;
mqpcb->service_level_al = attr->alt_ah_attr.sl;
 
-   if (ehca_calc_ipd(shca, my_qp->init_attr.port_num,
+   if (ehca_calc_ipd(shca, mqpcb->alt_phys_port,
  attr->alt_ah_attr.static_rate,
  &mqpcb->max_static_rate_al)) {
ret = -EINVAL;
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c 
b/drivers/infiniband/hw/ipath/ipath_cq.c
index 08d8ae1..d1380c7 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -395,12 +395,9 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct 
ib_udata *udata)
goto bail;
}
 
-   /*
-* Return the address of the WC as the offset to mmap.
-* See ipath_mmap() for details.
-*/
+   /* Check that we can write the offset to mmap. */
if (udata && udata->outlen >= sizeof(__u64)) {
-   __u64 offset = (__u64) wc;
+   __u64 offset = 0;
 
ret = ib_copy_to_udata(udata, &offset, sizeof(offset));
if (ret)
@@ -450,6 +447,18 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct 
ib_udata *udata)
struct ipath_mmap_info *ip = cq->ip;
 
ipath_update_mmap_info(dev, ip, sz, wc);
+
+   /*
+* Return the offset to mmap.
+* See ipath_mmap() for details.
+*/
+   if (udata && udata->outlen >= sizeof(__u64)) {
+   ret = ib_copy_to_udata(udata, &ip->offset,
+  sizeof(ip->offset));
+   if (ret)
+   goto bail;
+   }
+
spin_lock_irq(&dev->pending_lock);
if (list_empty(&ip->pending_mmaps))
list_add(&ip->pending_mmaps, &dev->pending_mmaps);
diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c 
b/drivers/infiniband/hw/ipath/ipath_qp.c
index 6a41fdb..b997ff8 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -835,7 +835,8 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
  init_attr->qp_type);
if (err) {
ret = ERR_PTR(err);
-   goto bail_rwq;
+   vfree(qp->r_rq.wq);
+   goto bail_qp;
}
qp->ip = NULL;
ipath_reset_qp(qp);
@@ -863,7 +864,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
   sizeof(offset));
if (err) {
ret = ERR_PTR(err);
-   goto bail_rwq;
+   goto bail_ip;
}
} else {
   

Re: 2.6.24-rc3-mm1

2007-11-26 Thread Andrew Morton
On Fri, 23 Nov 2007 06:55:41 +0100 Gabriel C <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 23 Nov 2007 02:39:08 +0100 Gabriel C <[EMAIL PROTECTED]> wrote:
> > 
> >> I have some warnings on each SCSI disc:
> >>
> >>
> >> ...
> >>
> >> [   30.724410] scsi 0:0:0:0: Direct-Access SEAGATE  ST318406LW   
> >> 0109 PQ: 0 ANSI: 3
> >> [   30.724419] scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
> >> [   30.724435]  target0:0:0: Beginning Domain Validation
> >> [   30.724446]  target0:0:0: Domain Validation Initial Inquiry Failed <--
> >> [   30.724572]  target0:0:0: Ending Domain Validation
> >> [   30.729747] scsi 0:0:1:0: Direct-Access FUJITSU  MAH3182MP
> >> 0114 PQ: 0 ANSI: 4
> >> [   30.729754] scsi0:A:1:0: Tagged Queuing enabled.  Depth 32
> >> [   30.729771]  target0:0:1: Beginning Domain Validation
> >> [   30.729780]  target0:0:1: Domain Validation Initial Inquiry Failed <--
> >> [   30.729908]  target0:0:1: Ending Domain Validation
> >>
> > 
> > Don't know what would have caused that.  But yes, something is wrong in
> > scsi land.
> 
> Actually I'm lucky the author didn't fix that FIXME in scsi_transport_spi.c 
> and I still can boot ;)
> 
> > 
> >> no idea whether this is related but buffered disk reads are 2.XX MB/sec 
> >> and the box is somewhat laggy.
> >>
> >> hdparm -t on sda and sdb reports :
> >>
> >> /dev/sda:
> >>  Timing buffered disk reads:8 MB in  3.26 seconds =   2.46 MB/sec
> >>
> >> /dev/sdb:
> >>  Timing buffered disk reads:8 MB in  3.56 seconds =   2.25 MB/sec
> >>
> >> My IDE discs are fine.
> >>
> >> Please let me know if you need my config or any other information.
> >>
> > 
> > And you're the second to report very slow scsi throughput in 2.6.24-rc3-mm1.
> > 
> 
> I found the commit which causes these problems; it is in the git-scsi-misc
> patch, and reverting it fixes both problems for me.
> 
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=8655a546c83fc43f0a73416bbd126d02de7ad6c0;hp=5bc717b6bdaaf52edf365eb7d9d8c89fec79df5d
> 

OK, thanks.  I'll assume that James and Hannes have this in hand (or will
have, by mid-week) and I won't do anything here.



Re: [2.6 patch] remove CONFIG_EXPERIMENTAL

2007-11-26 Thread Valdis . Kletnieks
On Mon, 26 Nov 2007 23:34:11 EST, Dave Jones said:
> On Mon, Nov 26, 2007 at 10:44:44PM -0500, [EMAIL PROTECTED] wrote:
>  
>  > I suspect that given the "once it escapes, it's cast in stone" view we take
>  > towards user-visible API/etc, there isn't much *real* room for an
>  > 'EXPERIMENTAL' flag anymore.  Most of the usage should probably be confined to
>  > individual drivers, where all we should need is a 'default n' and suitable
>  > warning verbiage in the Kconfig file warning about the driver eating your
>  > filesystems and small animals for breakfast.
> 
> Potential corruptors are usually flagged with (DANGEROUS) in the text,

Right, and given that, an additional EXPERIMENTAL flag seems superfluous.

> (One may argue that they shouldn't have escaped -mm)
> 
>  >  We certainly shouldn't have
>  > one big flag for *all* in-progress drivers - I don't need to accidentally
>  > enable a busticated ethernet driver because I want a USB widget.
> 
> So no ethernet driver at all is better than a broken but mostly working one?
> Again if it isn't mostly working, it shouldn't have escaped -mm

No.

The point is that using the *same* flag to control whether I can select
a mostly-working USB widget and a mostly-working Ethernet driver is Just Wrong.

Those of us who live in the US may have seen the insurance commercial where
Joe Sixpack is asking "Honey, what does this switch do?" "I don't know" flip,
flip, flip with no obvious impact.  Meanwhile, 3 houses down, somebody's car
is being beat up by a garage door opener going open/close/open/close...

I enable EXPERIMENTAL to enable my USB widget. When the next release comes out,
I then go and do something like a 'make [foo]config'.  What indication do I
get that now-selectable device drivers are 'depends on EXPERIMENTAL' and *not*
safe for selection? (Yes, in menuconfig, you can ask it to show the 'depends
on' list, *if* you suspect that it might be an issue.  But why would I suspect
that?)

In no case should we be creating a situation where users are thinking "Damn,
every driver may or may not be bodgy, I have to *check* if it's experimental
before I enable it, just because there was one that I *asked* for".
Particularly fun if you're migrating to new hardware and you don't *know* yet
which drivers you need, and you're getting prompted for possibly-dodgy ALSA
modules because you asked for a USB module.

(And yes, trying to wade through all the ALSA/Intel HDA/AC97/Sigmatel *was*
painful enough when I moved from a Dell Latitude C840 to a D820 - fortunately
enough, I didn't have to deal with EXPERIMENTAL ALSA drivers adding to the
mix.. )

EXPERIMENTAL in a mainline kernel as a *single* switch for a *lot* of totally
unrelated code is even more broken than EMBEDDED (which at least had a common
rationale). And over in the -mm kernel where it *should* be, it's superfluous
at best, because a -mm kernel might as well just add -DCONFIG_EXPERIMENTAL=y to
CFLAGS and save you the effort. ;)








Re: [PATCH 38/54] efivars: remove new_var and del_var files from sysfs

2007-11-26 Thread Greg KH
On Fri, Nov 16, 2007 at 09:01:16AM -0600, Matt Domsch wrote:
> On Fri, Nov 02, 2007 at 04:59:16PM -0700, Greg Kroah-Hartman wrote:
> > WTF?  Passing binary structures into a sysfs file, expecting it to be in
> > the correct format/endianness?  That's just wrong on so many levels.
> > 
> > So, these files are deleted.  If you want to add them back, please do so
> > in configfs, or in debugfs.  Or use text strings, which is what sysfs is
> > only for.
> 
> 
> I have tested gregkh's patches tree, which includes this patch, the
> patch to put these back as binary blob interfaces, as well as other
> cleanups, on an Itanium2 system.  The efibootmgr userspace application
> continues to work as it did before this patch series, which I claim is
> success.  For the patches that touch drivers/firmware/efivars.c I can
> say:
> 
> Tested-by: Matt Domsch <[EMAIL PROTECTED]>

Great, thanks for doing this, I appreciate it.

greg k-h


Re: WARNING: at kernel/resource.c:189 __release_resource

2007-11-26 Thread Andrew Morton
On Thu, 22 Nov 2007 22:41:16 +0100 Jiri Slaby <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Step aside. What's the purpose of having two similar patches for one issue,
> it then warns about the same thing twice:
> make-sure-nobodys-leaking-resources.patch
> releasing-resources-with-children.patch

Oh well.  It's better than having none.  Matthew, could you have a think
about something for mainline please?

> Ok, I hit the bug, suspend of 00:06 device complains about it:
> WARNING: at .../kernel/resource.c:185 __release_resource()
> 
> Call Trace:
>  [] release_resource+0xb5/0xf0
>  [] pnp_release_resources+0x70/0x130
>  [] pnp_stop_dev+0x45/0x90
>  [] pnp_bus_suspend+0x92/0xb0
>  [] suspend_device+0x113/0x180
>  [] device_suspend+0x200/0x320
>  [] suspend_devices_and_enter+0xa5/0x170
>  [] enter_state+0x209/0x270
>  [] state_store+0xaf/0xf0
>  [] kobj_attr_store+0x17/0x20
>  [] sysfs_write_file+0xce/0x140
>  [] vfs_write+0xc7/0x170
>  [] sys_write+0x50/0x90
>  [] system_call+0x7e/0x83
> 
> # LANG=en ll /sys/devices/pnp0/00:06/
> total 0
> lrwxrwxrwx 1 root root0 Nov 22 22:35 driver -> 
> ../../../bus/pnp/drivers/serial
> -r--r--r-- 1 root root 4096 Nov 22 22:35 id
> -r--r--r-- 1 root root 4096 Nov 22 22:35 options
> drwxr-xr-x 2 root root0 Nov 22 22:35 power
> -rw-r--r-- 1 root root 4096 Nov 22 22:35 resources
> lrwxrwxrwx 1 root root0 Nov 22 22:35 subsystem -> ../../../bus/pnp
> drwxr-xr-x 3 root root0 Nov 22 22:35 tty
> -rw-r--r-- 1 root root 4096 Nov 22 22:35 uevent
> 

I suppose that's a genuine leak, presumably in 8250_pnp.


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-26 Thread Tom Tucker

On Tue, 2007-11-27 at 15:49 +1100, Rusty Russell wrote:
> On Monday 26 November 2007 17:15:44 Roland Dreier wrote:
> >  > Except C doesn't have namespaces and this mechanism doesn't create them.
> >  >  So this is just complete and utter makework; as I said before, no one's
> >  > going to confuse all those udp_* functions if they're not in the udp
> >  > namespace.
> >
> > I don't understand why you're so opposed to organizing the kernel's
> > exported symbols in a more self-documenting way.
> 
> No, I was the one who moved exports near their declarations.  That's 
> organised.  I just don't see how this new "organization" will help: oh good, 
> I won't accidentally use the udp functions any more?!?
> 
> > It seems pretty   
> > clear to me that having a mechanism that requires modules to make
> > explicit which (semi-)internal APIs they use makes reviewing easier
> 
> Perhaps you've got lots of patches where people are using internal APIs they 
> shouldn't?
> 

Maybe the issue is "who can tell" since what is external and what is
internal is not explicitly defined?

> > , makes it 
> > easier to communicate "please don't use that API" to module authors,
> 
> Well, introduce an EXPORT_SYMBOL_INTERNAL().  It's a lot less code.  But you'd
> still need to show that people are having trouble knowing what APIs to use.

> > and takes at least a small step towards bringing the kernel's exported
> > API under control.
> 
> There is no "exported API" to bring under control.  

Hmm...apparently, there are those that are struggling...

> There are symbols we 
> expose for the kernel's own use which can be used by external modules at 
> their own risk.  
> 
> > What's the real downside? 
> 
> No.  That's the wrong question.  What's the real upside?

Explicitly documenting what comprises the kernel API (external,
supported) and what comprises the kernel implementation (internal, not
supported).

> 
> Let's not put code in the core because "it doesn't seem to hurt".
> 

agreed.

> I'm sure you think there's a real problem, but I'm still waiting for someone 
> to *show* it to me.  Then we can look at solutions.

I think the benefits should include:

- forcing developers to identify their exports as part of the
implementation or as part of the kernel API

- making it easier for reviewers to identify when developers are adding
to the kernel API and thereby focusing the appropriate level of review
to the new function

- making it obvious to developers when they are binding their
implementation to a particular kernel release



> Rusty.
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[git pull] Input updates for 2.6.24-rc3

2007-11-26 Thread Dmitry Torokhov
Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus
or
master.kernel.org:/pub/scm/linux/kernel/git/dtor/input.git for-linus

to receive updates for the input subsystem.

Changelog:
-

Aristeu Rozanski (2):
  Input: add definitions for frame forward and frame back keys
  Input: adds the context menu key (HUT GenDesc 0x84)

Dmitry Torokhov (3):
  sony-laptop: fit input devices into sysfs tree
  sonypi: fit input devices into sysfs tree
  Sonypi: use synchronize_irq instead of sycnronize_sched

Herbert Valerio Riedel (1):
  Input: gpio-keys - request and configure GPIOs

Jiri Kosina (1):
  Input: i8042 - add i8042.noloop quirk for MS Virtual Machine

Mike Frysinger (1):
  Input: bf54x-keys - keypad does not exist on BF544 parts

Diffstat:

 drivers/char/sonypi.c |8 --
 drivers/input/keyboard/Kconfig|2 +-
 drivers/input/keyboard/gpio_keys.c|   38 
 drivers/input/serio/i8042-x86ia64io.h |8 +++
 drivers/misc/sony-laptop.c|   10 +---
 include/linux/input.h |5 
 6 files changed, 53 insertions(+), 18 deletions(-)


-- 
Dmitry


Re: [PATCH 1/1] [INPUT/KEYPAD] Blackfin BF54x keypad driver: keypad does not exist on BF544 parts

2007-11-26 Thread Dmitry Torokhov
On Friday 23 November 2007, Bryan Wu wrote:
> From: Mike Frysinger <[EMAIL PROTECTED]>
> 
> Signed-off-by: Mike Frysinger <[EMAIL PROTECTED]>
> Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>

Applied, thank you Mike & Bryan.

-- 
Dmitry


Re: [PATCH] mmc: Add missing sg_init_table() call

2007-11-26 Thread Pierre Ossman
On Mon, 26 Nov 2007 21:29:55 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> 
> Pierre, I can queue this up but if you merge it into your tree I shall drop
> it and shall lose track of it.  So it's then all down to you to remember to
> get the fix into 2.6.24.
> 
> (Except this particular bug looks like a post-2.6.23 regression, so I can cc
> the Rafael which never forgets, so it will then get tracked all the way into
> Linus's tree)
> 

Jens said he applied it, so I figured the issue was handled. Jens, what 
happened to it?

Rgds
-- 
 -- Pierre Ossman

  Linux kernel, MMC maintainerhttp://www.kernel.org
  PulseAudio, core developer  http://pulseaudio.org
  rdesktop, core developer  http://www.rdesktop.org


Booting latest linux kernel(2.6.20) on MPC8548ECDS

2007-11-26 Thread rajendra prasad
Hi,

I am using an MPC8548ECDS board from CDS for my telecom application. I am
able to build the 2.6.10 Linux kernel and boot it on the MPC8548ECDS
board. When I use the same configuration file with the latest kernel, it
builds successfully but I am not able to boot it on the board. I am using
u-boot-1.1.6 as the boot loader. I have learned that the latest kernel is
booted with a new procedure. Please tell me what that boot procedure is.

Regards,
RAJ


Re: [PATCH 1/1] mm: add dirty_highmem option

2007-11-26 Thread Andrew Morton
On Tue, 27 Nov 2007 16:24:24 +1100 "Bron Gondwana" <[EMAIL PROTECTED]> wrote:

> On Mon, 26 Nov 2007 20:54:28 -0800, "Andrew Morton" <[EMAIL PROTECTED]> said:
> > On Thu, 22 Nov 2007 14:42:04 +1100 Bron Gondwana <[EMAIL PROTECTED]>
> > wrote:
> > 
> > >  /*
> > > + * free highmem will not be subtracted from the total free memory
> > > + * for calculating free ratios if vm_dirty_highmem is true
> > > + */
> > > +int vm_dirty_highmem;
> > 
> > One would expect that setting dirty_highmem to true would cause highmem
> > to
> > be accounted in dirty-memory calculations.  However with this change
> > reality is in fact the inverse of that.
> > 
> > So how about this?
> 
> Actually, I'm confused now.  Maybe I chose a bad name to begin with.
> Does it mean "I am allowed to dirty high memory" or "my high memory
> will be dirty if this is on"?

But we're always allowed to dirty highmem - there'd be no point in having
it otherwise.  Hence the term dirty_highmem is confusing.

umm, really you want
/proc/sys/vm/dont-account-highmem-in-dirty-memory-calculations, only
shorter.

Do you agree?

If so, then it's still not a very pleasing interface - setting something to
"true" to disable a particular piece of kernel behaviour implies a single
negation which we don't really need.

It would be simpler to have
/proc/sys/vm/do-account-highmem-in-dirty-memory-calculations,
defaulting to "true" - this has no negations.

So... how about /proc/sys/vm/, umm.



OK, I give up.  Please see if you can think of something less confusing
which involves no negations?

Thanks.


Re: freeze vs freezer

2007-11-26 Thread Matthew Garrett
On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:
> On Monday, 26 of November 2007, David Chinner wrote:
> > So how do you handle threads that are blocked on I/O or a lock during
> > the system freeze process, then?
> 
> We wait until they can continue.

So if I have a process blocked on an unavailable NFS mount, I can't 
suspend?

-- 
Matthew Garrett | [EMAIL PROTECTED]


Re: Dynticks Causing High Context Switch Rate in ksoftirqd

2007-11-26 Thread Arjan van de Ven
On Mon, 26 Nov 2007 22:36:17 -0600
Robert Hancock <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > Question: Why is ksoftirqd eating about 5 to 10 percent of my CPU
> > on an idle system? The problem occurs if I config the kernel with
> > tickless support (i.e. CONFIG_TICK_ONESHOT=y).  (Thanks to
> > "oprofile" for putting me onto this.)
> > 
> > I have noted this same problem on kernel versions: 2.6.23.1,
> > 2.6.23.8 and 2.6.23.9
> > 
> > **
> > *** Output from "vmstat -n 1 10" -- Note very high context switch
> > rate *** *** This is on a idle
> > machine! ***
> > **
> > 
> > procs ---memory-- ---swap-- -io --system--
> > cpu
> >  r  b   swpd   free   buff  cache   si   sobibo   incs
> > us sy id wa
> >  0  0  0 1925556   4768 11610400   124 26
> > 7538  1  2 96  1
> >  0  0  0 1925556   4768 11610400 0 02
> > 147329  0  1 99  0
> 
> What did oprofile show? It should be able to narrow down what 
> function(s) are responsible for the CPU usage..
> 

or better, what does powertop version 1.9 show?
that tends to show tickless wakeup artifacts quite nicely


-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


Re: Error returns not handled correctly by sysfs.c:subsys_attr_store()

2007-11-26 Thread Tejun Heo
Greg KH wrote:
> On Mon, Nov 26, 2007 at 08:31:16PM -0800, Andrew Morton wrote:
>> On Wed, 21 Nov 2007 15:16:59 -0700 Andrew Patterson <[EMAIL PROTECTED]> 
>> wrote:
>>
>>> The buf in fs/sysfs.c:subsys_attr_store() does not seem to be updated
>>> correctly when a negative value (indicating that an error
>>> condition has occurred) is returned.  If a negative value is returned,
>>> the next subsequent call to subsys_attr_store will have the contents of
>>> buf appended to the previous call.
>> subsys_attr_store() gets deleted by
>> http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-01-driver/kset-kill-subsys-attr.patch
>>
>> So maybe we will soon accidentally fix whatever-this-is?  Or maybe we will
>> faithfully maintain it.
> 
> Yes, subsys attributes go away, but this is showing a bug in the sysfs
> core with attributes, not in the "middle" layers of attributes.
> 
> I bounced the original bug report to Tejun, who has been changing the
> logic around this area to see if he sees anything that might be
> different now.
> 
> Tejun?

(groaning buried under ATA bugs) Will take a look soon.

Thanks.

-- 
tejun


Re: Error returns not handled correctly by sysfs.c:subsys_attr_store()

2007-11-26 Thread Greg KH
On Mon, Nov 26, 2007 at 08:31:16PM -0800, Andrew Morton wrote:
> On Wed, 21 Nov 2007 15:16:59 -0700 Andrew Patterson <[EMAIL PROTECTED]> wrote:
> 
> > The buf in fs/sysfs.c:subsys_attr_store() does not seem to be updated
> > correctly when a negative value (indicating that an error
> > condition has occurred) is returned.  If a negative value is returned,
> > the next subsequent call to subsys_attr_store will have the contents of
> > buf appended to the previous call.
> 
> subsys_attr_store() gets deleted by
> http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-01-driver/kset-kill-subsys-attr.patch
> 
> So maybe we will soon accidentally fix whatever-this-is?  Or maybe we will
> faithfully maintain it.

Yes, subsys attributes go away, but this is showing a bug in the sysfs
core with attributes, not in the "middle" layers of attributes.

I bounced the original bug report to Tejun, who has been changing the
logic around this area to see if he sees anything that might be
different now.

Tejun?

thanks,

greg k-h


Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Greg KH
On Mon, Nov 26, 2007 at 11:50:10PM -0500, Konrad Rzeszutek wrote:
> > >
> > > sysfs files have ONE VALUE PER FILE, not a whole bunch of different
> > > things in a single file.  Please fix this.
> >
> > The subparameters _are_ actually part of a single value, that value being
> > associated with the initiator instance.
> >
> > Konrad is trying to implement a "work-alike" for what open firmware does.
> > open-iscsi already has the ability to extract the same format
> > bits from real OFW.
> >
> > See open-iscsi.git/utils/fwparam_ppc.
> 
> 
> Greg,
> 
> In light of what Doug says (which is all true), should I go ahead with a new 
> version of this module which would export one value per file? The problem 
> that will be encountered is that a ethernetX sysfs directory would have (for 
> example):
> 
> /sys/firmware/ibft/ethernet0/pci-bdf
> 5:1:0
> /sys/firmware/ibft/ethernet0/mac
> 00:11:25:9d:8b:00
> /sys/firmware/ibft/ethernet0/vlan
> 0
> /sys/firmware/ibft/ethernet0/gateway
> 192.168.79.254
> /sys/firmware/ibft/ethernet0/origin
> 0
> /sys/firmware/ibft/ethernet0/subnet-mask
> 22
> /sys/firmware/ibft/ethernet0/ip-addr
> 192.168.77.41
> /sys/firmware/ibft/ethernet0/flags
> 7

Yes, that is the proper way to do this kind of thing in sysfs.

> And the flag would contain the value "7" which would mean the user would have 
> to parse what each bit means? (the v0.3 of the module does not export this 
> flag but uses it to figure out which is the boot iSCSI target).

Sure, as long as it means something to userspace, and is a single value,
and is documented, that's fine.
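
To make the "single value that userspace decodes" idea concrete, here is a
minimal userspace-side sketch. It assumes the flags file contains one integer
followed by an optional newline. The bit names are hypothetical, since the
thread does not say what each bit of the value means, and the sketch_* helpers
are not part of any existing tool.

```c
#include <stdlib.h>

/* Hypothetical bit names; the thread does not define the meaning of each bit. */
#define SKETCH_FLAG_BIT0 (1u << 0)
#define SKETCH_FLAG_BIT1 (1u << 1)
#define SKETCH_FLAG_BIT2 (1u << 2)

/*
 * Parse the text read from a one-value-per-file attribute, e.g. "7\n".
 * Returns 0 on success and stores the value in *out, -1 on malformed input.
 */
static int sketch_parse_flags(const char *text, unsigned int *out)
{
    char *end;
    unsigned long value = strtoul(text, &end, 0);

    if (end == text)
        return -1;                      /* no digits at all */
    if (*end != '\0' && *end != '\n')
        return -1;                      /* trailing junk after the number */
    *out = (unsigned int)value;
    return 0;
}

/* Test one bit of the decoded value. */
static int sketch_flag_is_set(unsigned int flags, unsigned int bit)
{
    return (flags & bit) != 0;
}
```

A value of 7 then simply means that the three lowest bits are set; what each
one stands for would have to come from the documentation Greg asks for.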

thanks,

greg k-h


Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Greg KH
On Mon, Nov 26, 2007 at 11:23:31PM -0500, Konrad Rzeszutek wrote:
> On Monday 26 November 2007 22:31:38 Greg KH wrote:
> > > +#if defined(CONFIG_ISCSI_IBFT) || defined(CONFIG_ISCSI_IBFT_MODULE)
> ..snip..
> > > +static ssize_t find_ibft(void)
> > > +{
> ..snip..
> > > +}
> >
> > What is a function (not even an inline one) doing in a .h file?
> 
> I was not sure where to put it. This function (find_ibft) is used by the 
> setup_[32|64].c and the iscsi_ibft.c code. Randy suggested I put in .c file, 
> but I am not sure exactly where? Should I make a new file in called 
> libs/iscsi_ibft_helper.c ?

Put it in your .c file and make it a global function to be called by
someone else if they need it.

thanks,

greg k-h


Re: [PATCH] [NET]: Fix TX bug VLAN in VLAN

2007-11-26 Thread Joonwoo Park
2007/11/26, Herbert Xu <[EMAIL PROTECTED]>:
> On Fri, Nov 23, 2007 at 12:12:52PM +, Joonwoo Park wrote:
> > This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=8766
> >
> > Is it possible?
> > BUG((veth->h_vlan_proto != htons(ETH_P_8021Q)) && 
> > !(VLAN_DEV_INFO(dev)->flags & VLAN_FLAG_REORDER_HDR))
> > I'm afraid, queued packet before vconfig set_flag would do that.
>
> Yes, AF_PACKET would do that.  So you should check both.
>

Thanks Herbert.
Well, I think the patch would work properly for AF_PACKET too.
(I did not insert BUG() macro in my patch)
What do you think?

Thanks
Joonwoo


Re: [PATCH] mmc: Add missing sg_init_table() call

2007-11-26 Thread Andrew Morton
On Thu, 22 Nov 2007 20:32:51 +0100 Haavard Skinnemoen <[EMAIL PROTECTED]> wrote:

> mmc_init_queue only initializes the scatterlists with sg_init_table()
> when using a bounce buffer. This leads to a BUG() when CONFIG_DEBUG_SG
> is set.
> 

I assume that 2.6.23 is not afflicted in this way?

> ---
>  drivers/mmc/card/queue.c |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
> index 1b9c9b6..30cd13b 100644
> --- a/drivers/mmc/card/queue.c
> +++ b/drivers/mmc/card/queue.c
> @@ -180,12 +180,13 @@ int mmc_init_queue(struct mmc_queue *mq, struct 
> mmc_card *card, spinlock_t *lock
>   blk_queue_max_hw_segments(mq->queue, host->max_hw_segs);
>   blk_queue_max_segment_size(mq->queue, host->max_seg_size);
>  
> - mq->sg = kzalloc(sizeof(struct scatterlist) *
> + mq->sg = kmalloc(sizeof(struct scatterlist) *
>   host->max_phys_segs, GFP_KERNEL);
>   if (!mq->sg) {
>   ret = -ENOMEM;
>   goto cleanup_queue;
>   }
> + sg_init_table(mq->sg, host->max_phys_segs);
>   }
>  
>   init_MUTEX(&mq->thread_sem);

Pierre, I can queue this up but if you merge it into your tree I shall drop
it and shall lose track of it.  So it's then all down to you to remember to
get the fix into 2.6.24.

(Except this particular bug looks like a post-2.6.23 regression, so I can cc
the Rafael which never forgets, so it will then get tracked all the way into
Linus's tree)



Re: [PATCH 1/1] mm: add dirty_highmem option

2007-11-26 Thread Bron Gondwana
On Mon, 26 Nov 2007 20:54:28 -0800, "Andrew Morton" <[EMAIL PROTECTED]> said:
> On Thu, 22 Nov 2007 14:42:04 +1100 Bron Gondwana <[EMAIL PROTECTED]>
> wrote:
> 
> >  /*
> > + * free highmem will not be subtracted from the total free memory
> > + * for calculating free ratios if vm_dirty_highmem is true
> > + */
> > +int vm_dirty_highmem;
> 
> One would expect that setting dirty_highmem to true would cause highmem
> to
> be accounted in dirty-memory calculations.  However with this change
> reality is in fact the inverse of that.
> 
> So how about this?

Actually, I'm confused now.  Maybe I chose a bad name to begin with.
Does it mean "I am allowed to dirty high memory" or "my high memory
will be dirty if this is on"?

Hmm... I'm even having trouble articulating what's odd about it.

I guess my internal model was: "if this flag is set then you are
allowed to make high memory dirty without needing to flush it
immediately", which is why I made it that way around.


No - you're wrong.  My patch _did_ include high memory in the dirty
memory calculations when dirty_highmem was true.

>   x = global_page_state(NR_FREE_PAGES)
>   + global_page_state(NR_INACTIVE)
>   + global_page_state(NR_ACTIVE);

This is the total memory, _including_ high memory.

>   x -= highmem_dirtyable_memory(x);

This removes the high memory from the total count.


I think I got it right.  If dirty_highmem is set to true, then
don't subtract highmem from the total memory count before
calculating the percentages.  That's what I meant, and that's
what the toggle did.  Removed the subtraction.
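
To make that reading concrete, here is a minimal userspace sketch of the calculation as described above. It is not the kernel code: the function name dirtyable_pages and the page counts are made up for illustration, and the highmem amount is passed in directly instead of being computed by highmem_dirtyable_memory(). Only the direction of the test follows the original patch.

```c
#include <assert.h>

/*
 * Userspace sketch of the dirtyable-memory calculation (not kernel code).
 * total_pages:   free + inactive + active pages, highmem included
 * highmem_pages: the part of total_pages that lives in highmem
 * dirty_highmem: nonzero means highmem counts towards dirtyable memory
 */
static unsigned long dirtyable_pages(unsigned long total_pages,
				     unsigned long highmem_pages,
				     int dirty_highmem)
{
	unsigned long x = total_pages;

	assert(highmem_pages <= total_pages);
	if (!dirty_highmem)
		x -= highmem_pages;	/* count lowmem only */
	return x + 1;			/* ensure that we never return 0 */
}
```

With 1000 pages in total, 600 of them in highmem, this gives 1001 when the flag is set and 401 when it is clear.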

Bron.


>  Documentation/filesystems/proc.txt |4 ++--
>  mm/page-writeback.c|8 
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff -puN
> Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix
> Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix
> +++ a/Documentation/filesystems/proc.txt
> @@ -1265,8 +1265,8 @@ Contains, as a boolean, a switch to allo
>  part of the "available" memory against which the dirty ratios will be
>  applied.
>  
> -Setting this to 1 can be useful on 32 bit machines where you want to
> make
> -random changes within an MMAPed file that is larger than your available
> +Setting this to 0 (false) can be useful on 32 bit machines where you
> wish to
> +make random changes within an MMAPed file that is larger than your
> available
>  lowmem, however it is potentially dangerous and has serious
>  bounce-buffer
>  issues.
>  
> diff -puN mm/page-writeback.c~mm-add-dirty_highmem-option-fix
> mm/page-writeback.c
> --- a/mm/page-writeback.c~mm-add-dirty_highmem-option-fix
> +++ a/mm/page-writeback.c
> @@ -69,10 +69,10 @@ static inline long sync_writeback_pages(
>  int dirty_background_ratio = 5;
>  
>  /*
> - * free highmem will not be subtracted from the total free memory
> - * for calculating free ratios if vm_dirty_highmem is true
> + * free highmem will be subtracted from the total free memory for
> calculating
> + * free ratios if vm_dirty_highmem is true
>   */
> -int vm_dirty_highmem;
> +int vm_dirty_highmem = 1;
>  
>  /*
>   * The generator of dirty data starts writeback at this percentage
> @@ -293,7 +293,7 @@ static unsigned long determine_dirtyable
>   x = global_page_state(NR_FREE_PAGES)
>   + global_page_state(NR_INACTIVE)
>   + global_page_state(NR_ACTIVE);
> - if (!vm_dirty_highmem)
> + if (vm_dirty_highmem)
>   x -= highmem_dirtyable_memory(x);
>   return x + 1;   /* Ensure that we never return 0 */
>  }
> _
> 
> 
> 
> 
> (I dropped the already-merged part of your patch)
> 
> (I fixed a build error in kernel/sysctl.c: "one" was defined twice when
> suitable config options were set).
> 
> (It's an unpleasing patch, btw.  But it's an unpleasant problem and at
> least
> this way people can tell us "hey, I did  and it started to work")
-- 
  Bron Gondwana
  [EMAIL PROTECTED]



Re: Linux 2.6.23.9

2007-11-26 Thread Valdis . Kletnieks
On Tue, 27 Nov 2007 02:39:08 +0100, Patrick McHardy said:
> Tomasz K wrote:
> > On Mon, 26 Nov 2007, Greg Kroah-Hartman wrote:
> > [..]
> >
> > There is still no officially released iptables tarball around with 
> > support for rules for the new xt_{connlimit,time,u32} modules.
> > Does anyone know where the patches for managing connlimit, time, u32 rules 
> > that will be included in the next release are?
> 
> A well chosen thread to ask this question. xt_time is not even
> included in 2.6.23.
> 
> http://netfilter.org/news.html#2007-10-15

And I don't see any mention of connlimit, time, or u32 in the Changelog for 
that...

I admit I haven't peeked inside the actual tarball to see if they added stuff
and didn't Changelog it...




Re: [PATCH][SHMEM] Factor out sbi->free_inodes manipulations

2007-11-26 Thread Andrew Morton
On Fri, 23 Nov 2007 13:41:55 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote:

> Looks good, but we can save slightly more there (depending on config),
> and I found your inc/dec names a little confusing, since the count is
> going the other way: how do you feel about this version?  (I'd like it
> better if those helpers could take a struct inode *, but they cannot.)
> Hugh
> 
> 
> From: Pavel Emelyanov <[EMAIL PROTECTED]>
> 
> The shmem_sb_info structure has a number of free_inodes. This
> value is altered in appropriate places under spinlock and with
> the sbi->max_inodes != 0 check.
> 
> Consolidate these manipulations into two helpers.
> 
> This is minus 42 bytes of shmem.o and minus 4 :) lines of code.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
> ---
> 
>  mm/shmem.c |   72 ---
>  1 file changed, 34 insertions(+), 38 deletions(-)
> 
> --- 2.6.24-rc3/mm/shmem.c 2007-11-07 04:21:45.0 +
> +++ linux/mm/shmem.c  2007-11-23 12:43:28.0 +
> @@ -207,6 +207,31 @@ static void shmem_free_blocks(struct ino
>   }
>  }
>  
> +static int shmem_reserve_inode(struct super_block *sb)
> +{
> + struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
> + if (sbinfo->max_inodes) {
> + spin_lock(&sbinfo->stat_lock);
> + if (!sbinfo->free_inodes) {
> + spin_unlock(&sbinfo->stat_lock);
> + return -ENOMEM;
> + }
> + sbinfo->free_inodes--;
> + spin_unlock(&sbinfo->stat_lock);
> + }
> + return 0;
> +}

It is peculiar to (wrongly) return -ENOMEM

> + if (shmem_reserve_inode(inode->i_sb))
> + return -ENOSPC;

and to then correct it in the caller..


Something boringly conventional such as the below, perhaps?

--- a/mm/shmem.c~shmem-factor-out-sbi-free_inodes-manipulations-fix
+++ a/mm/shmem.c
@@ -212,7 +212,7 @@ static int shmem_reserve_inode(struct su
spin_lock(&sbinfo->stat_lock);
if (!sbinfo->free_inodes) {
spin_unlock(&sbinfo->stat_lock);
-   return -ENOMEM;
+   return -ENOSPC;
}
sbinfo->free_inodes--;
spin_unlock(&sbinfo->stat_lock);
@@ -1679,14 +1679,16 @@ static int shmem_create(struct inode *di
 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct 
dentry *dentry)
 {
struct inode *inode = old_dentry->d_inode;
+   int ret;
 
/*
 * No ordinary (disk based) filesystem counts links as inodes;
 * but each new link needs a new dentry, pinning lowmem, and
 * tmpfs dentries cannot be pruned until they are unlinked.
 */
-   if (shmem_reserve_inode(inode->i_sb))
-   return -ENOSPC;
+   ret = shmem_reserve_inode(inode->i_sb);
+   if (ret)
+   goto out;
 
dir->i_size += BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
@@ -1694,7 +1696,8 @@ static int shmem_link(struct dentry *old
atomic_inc(&inode->i_count);	/* New dentry reference */
dget(dentry);   /* Extra pinning count for the created dentry */
d_instantiate(dentry, inode);
-   return 0;
+out:
+   return ret;
 }
 
 static int shmem_unlink(struct inode *dir, struct dentry *dentry)
_
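
The convention in the patch above (the helper returns the right negative errno and the caller passes it through untouched) can be sketched in userspace as follows. This is only an illustration: struct sb_info, reserve_inode() and make_link() are hypothetical stand-ins for the shmem structures and functions, and the stat_lock locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode accounting in shmem_sb_info. */
struct sb_info {
	unsigned long max_inodes;	/* 0 means "no limit" */
	unsigned long free_inodes;
};

/* Returns 0 on success or a negative errno (locking omitted here). */
static int reserve_inode(struct sb_info *sbi)
{
	assert(sbi != NULL);
	if (sbi->max_inodes) {
		if (!sbi->free_inodes)
			return -ENOSPC;	/* out of inodes, not out of memory */
		sbi->free_inodes--;
	}
	return 0;
}

/* The caller passes the helper's error code through unchanged. */
static int make_link(struct sb_info *sbi)
{
	int ret = reserve_inode(sbi);

	if (ret)
		goto out;
	/* ... the real work of creating the link would go here ... */
out:
	return ret;
}
```

The point of the shape is that the error code is chosen in exactly one place, so the caller cannot silently disagree with the helper.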



Re: [patch 04/14] ia64: Remove the __SMALL_ADDR_AREA attribute for per cpu access

2007-11-26 Thread David Mosberger-Tang
On 11/26/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> The model(small) attribute is not supported by gcc 4.X. The tests
> will always be negative today.

What was the rationale for removing this attribute?

  --david


[Patch 5/5] sched: Improve fairness of cpu bandwidth allocation for task groups

2007-11-26 Thread Srivatsa Vaddagiri

The current load balancing scheme isn't good for group fairness.

For example, on an 8-cpu system, I created 3 groups as follows:

a = 8 tasks (cpu.shares = 1024) 
b = 4 tasks (cpu.shares = 1024) 
c = 3 tasks (cpu.shares = 1024) 

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:


Col1 | Col2    | Col3  | Col4
-----|---------|-------|-------------------------------------------------------
a    | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8%  64.5%
b    | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
c    |  86.326 | 18.0% | 47.5%  47.9%  48.5%


Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that 
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in 
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.
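
For reference, the Col3 and deviation figures above can be reproduced with a small userspace sketch. The function names are made up for illustration, and the fair share assumes all groups have equal weight, as in this test.

```c
#include <assert.h>

/* Col3: percentage of the total cpu bandwidth that a group received. */
static double group_bandwidth_pct(double exec_secs, int nr_cpus,
				  double window_secs)
{
	assert(nr_cpus > 0 && window_secs > 0.0);
	return 100.0 * exec_secs / (nr_cpus * window_secs);
}

/* Deviation from the fair share when nr_groups groups have equal weight. */
static double fairness_deviation(double bandwidth_pct, int nr_groups)
{
	assert(nr_groups > 0);
	return bandwidth_pct - 100.0 / nr_groups;
}
```

For group a this is 100 * 277.676 / (8 * 60), about 57.8%, which is roughly 24.5 points above the 33.33% share each of the three groups should get.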

After the patch below is applied, here are the results:


Col1 | Col2    | Col3  | Col4
-----|---------|-------|-------------------------------------------------------
a    | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8%  35.3%
b    | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
c    | 160.653 | 33.5% | 85.8%  90.6%  91.4%


Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most other runs have yielded a deviation within
+-2% at most, which is good.

Why do we see bad (group) fairness with current scheduler?
==========================================================

Currently cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

	CPU0		CPU1
	---------------------
	A (10)		B(5)
			C(5)
	---------------------

Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

A = 50%
B = 25%
C = 25%

which is not the desired result.
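
The arithmetic behind this example can be sketched in userspace as below. This is an illustration only, not the scheduler code: cpu_weight_by_tasks() is a made-up name and every task is assumed to have weight NICE_0_LOAD.

```c
#include <assert.h>

#define NICE_0_LOAD 1024UL

/* Current scheme: a cpu's weight is the sum of the weights of its tasks. */
static unsigned long cpu_weight_by_tasks(const unsigned int *tasks_per_group,
					 int nr_groups)
{
	unsigned long weight = 0;
	int i;

	assert(nr_groups >= 0);
	for (i = 0; i < nr_groups; i++)
		weight += tasks_per_group[i] * NICE_0_LOAD;
	return weight;
}
```

Passing {10} for CPU0 and {5, 5} for CPU1 gives 10240 in both cases, so the two cpus look balanced even though group A owns a whole cpu while B and C share one.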

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
  defined (see below)
- API Change 
- Two tunables introduced in sysfs (under SCHED_DEBUG) to 
  control the frequency at which the load balance monitor
  thread runs. 

The basic change made in this patch is how cpu weight (rq->load.weight) is 
calculated. It is now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
	W(K,i)  = Weight of group K on cpu i
	T(K,i)  = Task load present in group K's cfs_rq on cpu i
	T(K)    = Total task load of group K across various cpus
	S(K)    = Shares allocated to group K
	NRCPUS  = Number of online cpus in the scheduler domain to
	          which group K is assigned.

Then,
	W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
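
As a rough userspace sketch of this formula (not the code in the patch): group_cpu_weight() is a hypothetical name, integer division is used as a simplification, and returning 0 for a group without runnable tasks is an assumption of this sketch.

```c
#include <assert.h>

/* W(K,i) = S(K) * NRCPUS * T(K,i) / T(K), in integer arithmetic. */
static unsigned long group_cpu_weight(unsigned long shares,
				      unsigned int nr_cpus,
				      unsigned long load_on_cpu,
				      unsigned long total_load)
{
	assert(load_on_cpu <= total_load);
	if (!total_load)
		return 0;	/* assumption: no runnable tasks, no weight */
	return shares * nr_cpus * load_on_cpu / total_load;
}
```

For the 2-cpu example earlier, group A (all 10 tasks on CPU0) gets 1024 * 2 * 10240 / 10240 = 2048 on CPU0, while B and C each get 2048 on CPU1. CPU0 then weighs 2048 and CPU1 weighs 4096, so the two cpus no longer look balanced.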

A load balance monitor thread is created at bootup, which periodically
runs and adjusts group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which
it runs.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 include/linux/sched.h |4 
 kernel/sched.c|  259 --
 kernel/sched_fair.c   |   88 ++--
 kernel/sysctl.c   |   18 +++
 4 files changed, 330 insertions(+), 39 deletions(-)

Index: current/include/linux/sched.h
===
--- 

Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Doug Maxey

On Mon, 26 Nov 2007 19:31:38 PST, Greg KH wrote:
> On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote:
> > +/*
> > + *  Routines for reading of the iBFT data in a human readable fashion.
> > + */
> > +ssize_t ibft_attr_show_initiator(struct ibft_kobject *entry,
> > +struct ibft_attribute *attr,
> > +char *buf)
> > +{
> > +   struct ibft_initiator *initiator = attr->initiator;
> > +   void *ibft_loc = entry->data->hdr;
> > +   char *str = buf;
> > +
> > +   if (!initiator)
> > +   return 0;
> > +
> > +   str += sprintf_ipaddr(str, "isns", initiator->isns_server);
> > +   str += sprintf_ipaddr(str, "slp", initiator->slp_server);
> > +   str += sprintf_ipaddr(str, "primary_radius_server",
> > +   initiator->pri_radius_server);
> > +   str += sprintf_ipaddr(str, "secondary_radius_server",
> > +   initiator->sec_radius_server);
> > +   str += sprintf_string(str, "itname", initiator->initiator_name_len,
> > +   (char *)ibft_loc + initiator->initiator_name_off);
> > +   str--;
> > +
> > +   return str-buf;
> > +}
> 
> sysfs files have ONE VALUE PER FILE, not a whole bunch of different
> things in a single file.  Please fix this.

The subparameters _are_ actually part of a single value, that value being 
associated with the initiator instance.

Konrad is trying to implement a "work-alike" for what open firmware does.
open-iscsi already has the ability to extract the same format 
bits from real OFW.

See open-iscsi.git/utils/fwparam_ppc.

> 
> 
> > +
> > +ssize_t ibft_attr_show_nic(struct ibft_kobject *entry,
> > +  struct ibft_attribute *attr,
> > +  char *buf)
> > +{
> > +   struct ibft_nic *nic = attr->nic;
> > +   void *ibft_loc = entry->data->hdr;
> > +   char *str = buf;
> > +
> > +   if (!nic)
> > +   return 0;
> > +   /*
> > +* Assume dhcp if any non-zero portions of its address are set.
> > +*/
> > +   if (memcmp(nic->dhcp, nulls, sizeof(nic->dhcp))) {
> > +   str += sprintf_ipaddr(str, "dhcp", nic->dhcp);
> > +   } else {
> > +   str += sprintf_ipaddr(str, "ciaddr", nic->ip_addr);
> > +   str += sprintf_ipaddr(str, "giaddr", nic->gateway);
> > +   str += sprintf_ipaddr(str, "dnsaddr1", nic->primary_dns);
> > +   str += sprintf_ipaddr(str, "dnsaddr2", nic->secondary_dns);
> > +   }
> > +   if (nic->hostname_len)
> > +   str += sprintf_string(str, "hostname", nic->hostname_len,
> > +   (char *)ibft_loc + nic->hostname_off);
> > +   /* Cut off the comma. */
> > +   str--;
> > +
> > +   return str-buf;
> > +}
> 
> Same here.
> 
> > +ssize_t ibft_attr_show_target(struct ibft_kobject *entry,
> > + struct ibft_attribute *attr,
> > + char *buf)
> > +{
> > +   struct ibft_tgt *tgt = attr->tgt;
> > +   void *ibft_loc = entry->data->hdr;
> > +   char *str = buf;
> > +   int i;
> > +
> > +   if (!tgt)
> > +   return 0;
> > +
> > +   str += sprintf_ipaddr(str, "siaddr", tgt->ip_addr);
> > +   str += sprintf(str, "iport=%d,", tgt->port);
> > +   str += sprintf(str, "ilun=");
> > +   for (i = 0; i < 8; i++)
> > +   str += sprintf(str, "%x", (u8)tgt->lun[i]);
> > +   str += sprintf(str, ",");
> > +
> > +   if (tgt->tgt_name_len)
> > +   str += sprintf_string(str, "iname", tgt->tgt_name_len,
> > +   (void *)ibft_loc + tgt->tgt_name_off);
> > +
> > +   if (tgt->chap_name_len)
> > +   str += sprintf_string(str, "chapid", tgt->chap_name_len,
> > +   (char *)ibft_loc + tgt->chap_name_off);
> > +   if (tgt->chap_secret_len)
> > +   str += sprintf_string(str, "chappw", tgt->chap_secret_len,
> > +   (char *)ibft_loc + tgt->chap_secret_off);
> > +   if (tgt->rev_chap_name_len)
> > +   str += sprintf_string(str, "ichapid", tgt->rev_chap_name_len,
> > +   (char *)ibft_loc + tgt->rev_chap_name_off);
> > +   if (tgt->rev_chap_secret_len)
> > +   str += sprintf_string(str, "ichappw", tgt->rev_chap_secret_len,
> > +   (char *)ibft_loc + tgt->rev_chap_secret_off);
> > +
> > +   /* Cut off the comma. */
> > +   str--;
> > +
> > +   return str-buf;
> > +}
> 
> Same here, are we writing a novella or something to userspace?  :)

Yep.  Just like real OFW.

> 
> > +ssize_t ibft_attr_show_disk(struct ibft_kobject *dev,
> > +   struct ibft_attribute *ibft_attr,
> > +   char *buf)
> > +{
> > +   char *str = buf;
> > +
> > +   str += sprintf(str, "//[EMAIL PROTECTED],%d:iscsi,", dev->data->index);
> > +   str += ibft_attr_show_initiator(dev, ibft_attr, str);
> > +   str += sprintf(str, ",");
> > +   str += ibft_attr_show_target(dev, ibft_attr, str);
> > +   str += sprintf(str, ",");
> > +   str += ibft_attr_show_nic(dev, ibft_attr, str);
> > +
> > +   return str-buf;
> > +}
> 
> And here, do I need to go 

[Patch 4/5] sched: introduce a mutex and corresponding API to serialize access to doms_cur[] array

2007-11-26 Thread Srivatsa Vaddagiri
doms_cur[] array represents various scheduling domains which are mutually
exclusive. Currently cpusets code can modify this array (by calling
partition_sched_domains()) as a result of user modifying sched_load_balance 
flag for various cpusets.

This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read the
doms_cur[] array without worrying about concurrent modifications to the array.

The fair group scheduler code (introduced in the next patch of this series)
makes use of this mutex to walk through the doms_cur[] array while rebalancing
shares of task groups across cpus.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   19 +++
 1 files changed, 19 insertions(+)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -186,6 +186,9 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
  */
 static DEFINE_MUTEX(task_group_mutex);
 
+/* doms_cur_mutex serializes access to doms_cur[] array */
+static DEFINE_MUTEX(doms_cur_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -236,11 +239,23 @@ static inline void unlock_task_group_lis
mutex_unlock(&task_group_mutex);
 }
 
+static inline void lock_doms_cur(void)
+{
+   mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+   mutex_unlock(&doms_cur_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
 static inline void lock_task_group_list(void) { }
 static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6547,6 +6562,8 @@ void partition_sched_domains(int ndoms_n
 {
int i, j;
 
+   lock_doms_cur();
+
/* always unregister in case we don't destroy any domains */
unregister_sched_domain_sysctl();
 
@@ -6587,6 +6604,8 @@ match2:
ndoms_cur = ndoms_new;
 
register_sched_domain_sysctl();
+
+   unlock_doms_cur();
 }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)


-- 
Regards,
vatsa


Re: [PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead

2007-11-26 Thread rae l
On Nov 27, 2007 10:58 AM, Richard Knutsson <[EMAIL PROTECTED]> wrote:
...
> > + print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
> > + 16, 1,
> > + buf, len, 0);
> >
> Not important, but why use '0' instead of 'false'?
After reading http://lkml.org/lkml/2006/7/27/281, I agree with you.
This is a refreshed patch against the latest cryptodev tree.

Cc: Randy Dunlap <[EMAIL PROTECTED]>
Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---
 crypto/tcrypt.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 1e12b86..ae762c2 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -87,12 +87,11 @@ static char *check[] = {
"camellia", "seed", "salsa20", NULL
 };

-static void hexdump(unsigned char *buf, unsigned int len)
+static inline void hexdump(unsigned char *buf, unsigned int len)
 {
-   while (len--)
-   printk("%02x", *buf++);
-
-   printk("\n");
+   print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
+   16, 1,
+   buf, len, false);
 }

 static void tcrypt_complete(struct crypto_async_request *req, int err)

-- 
Denis Cheng


Re: time accounting problem (powerpc only?)

2007-11-26 Thread Tony Breeds
On Mon, Nov 26, 2007 at 05:23:13PM +0100, Johannes Berg wrote:
> Contrary to what I claimed later in the thread, my 64-bit powerpc box
> (quad-core G5) doesn't suffer from this problem.
> 
> Does anybody have any idea? I don't even know how to debug it further.

I'll see if I can grab an appropriate machine tomorrow and have a look at
it.  I think it's just an accounting bug, which is probably my fault :)

Yours Tony

  linux.conf.au        http://linux.conf.au/ || http://lca2008.linux.org.au/
  Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!



[Patch 3/5 v2] sched: change how cpu load is calculated

2007-11-26 Thread Srivatsa Vaddagiri
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.
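
The accounting idea can be sketched in userspace roughly as follows. This is a simplified illustration, not the patch itself: struct group_on_cpu, struct cpu_rq and the two functions are hypothetical, there is only one grouping level, and a plain per-group task counter stands in for the sched_entity walk done in the real code.

```c
#include <assert.h>

/* Hypothetical per-cpu view of one task group. */
struct group_on_cpu {
	unsigned long weight;		/* weight the group exerts on this cpu */
	unsigned int nr_running;	/* group's runnable tasks on this cpu */
};

/* Hypothetical runqueue holding only the load figure. */
struct cpu_rq {
	unsigned long load;		/* sum of the weights of active groups */
};

/* Add the group's weight only when its first task arrives on the cpu. */
static void enqueue_task(struct cpu_rq *rq, struct group_on_cpu *grp)
{
	if (grp->nr_running++ == 0)
		rq->load += grp->weight;
}

/* Remove the group's weight only when its last task leaves the cpu. */
static void dequeue_task(struct cpu_rq *rq, struct group_on_cpu *grp)
{
	assert(grp->nr_running > 0);
	if (--grp->nr_running == 0)
		rq->load -= grp->weight;
}
```

A second task of the same group therefore leaves the cpu load unchanged, which is the difference from summing task weights.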

This version of patch (v2 of Patch 3/5) has a minor impact on code size
(but should have no runtime/functional impact) for !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/5
(because of lesser #ifdefs).

I prefer v2 of Patch 3/5.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   27 +++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,6 +870,16 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -880,26 +890,14 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
-   update_load_sub(&rq->load, p->se.load.weight);
-}
-
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
-   inc_load(rq, p);
 }
 
 static void dec_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running--;
-   dec_load(rq, p);
 }
 
 static void set_load_weight(struct task_struct *p)
@@ -4071,10 +4069,8 @@ void set_user_nice(struct task_struct *p
goto out_unlock;
}
on_rq = p->se.on_rq;
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, p, 0);
-   dec_load(rq, p);
-   }
 
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -4084,7 +4080,6 @@ void set_user_nice(struct task_struct *p
 
if (on_rq) {
enqueue_task(rq, p, 0);
-   inc_load(rq, p);
/*
 * If the task increased its priority or is running and
 * lowered its priority, then reschedule its CPU:
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /*
+* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq 
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /*
+* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (decload)
+   dec_cpu_load(rq, topse->load.weight);
 }
 
 /*
Index: current/kernel/sched_rt.c

Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-26 Thread Rusty Russell
On Monday 26 November 2007 17:15:44 Roland Dreier wrote:
>  > Except C doesn't have namespaces and this mechanism doesn't create them.
>  >  So this is just complete and utter makework; as I said before, no one's
>  > going to confuse all those udp_* functions if they're not in the udp
>  > namespace.
>
> I don't understand why you're so opposed to organizing the kernel's
> exported symbols in a more self-documenting way.

No, I was the one who moved exports near their declarations.  That's 
organised.  I just don't see how this new "organization" will help: oh good, 
I won't accidentally use the udp functions any more?!?

> It seems pretty   
> clear to me that having a mechanism that requires modules to make
> explicit which (semi-)internal APIs makes reviewing easier

Perhaps you've got lots of patches where people are using internal APIs they 
shouldn't?

> , makes it 
> easier to communicate "please don't use that API" to module authors,

Well, introduce an EXPORT_SYMBOL_INTERNAL().  It's a lot less code.  But you'd 
still need to show that people are having trouble knowing what APIs to use.

> and takes at least a small step towards bringing the kernel's exported
> API under control.

There is no "exported API" to bring under control.  There are symbols we 
expose for the kernel's own use which can be used by external modules at 
their own risk.  

> What's the real downside? 

No.  That's the wrong question.  What's the real upside?

Let's not put code in the core because "it doesn't seem to hurt".

I'm sure you think there's a real problem, but I'm still waiting for someone 
to *show* it to me.  Then we can look at solutions.

Rusty.


[Patch 3/5 v1] sched: change how cpu load is calculated

2007-11-26 Thread Srivatsa Vaddagiri
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch (v1 of Patch 3/5) has zero impact for
!CONFIG_FAIR_GROUP_SCHED case.

Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>


---
 kernel/sched.c  |   38 ++
 kernel/sched_fair.c |   31 +++
 kernel/sched_rt.c   |2 ++
 3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,15 +870,25 @@ iter_move_one_task(struct rq *this_rq, i
   struct rq_iterator *iterator);
 #endif
 
-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
 
-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+   update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else  /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }
 
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
@@ -890,6 +900,18 @@ static inline void dec_load(struct rq *r
update_load_sub(&rq->load, p->se.load.weight);
 }
 
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#include "sched_fair.c"
+#include "sched_rt.c"
+#ifdef CONFIG_SCHED_DEBUG
+# include "sched_debug.c"
+#endif
+
+#define sched_class_highest (&rt_sched_class)
+
 static void inc_nr_running(struct task_struct *p, struct rq *rq)
 {
rq->nr_running++;
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int incload = 1;
 
for_each_sched_entity(se) {
-   if (se->on_rq)
+   topse = se;
+   if (se->on_rq) {
+   incload = 0;
break;
+   }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+   /*
+* Increment cpu load if we just enqueued the first task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (incload)
+   inc_cpu_load(rq, topse->load.weight);
 }
 
 /*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq 
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 {
struct cfs_rq *cfs_rq;
-   struct sched_entity *se = &p->se;
+   struct sched_entity *se = &p->se, *topse = NULL;
+   int decload = 1;
 
for_each_sched_entity(se) {
+   topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
-   if (cfs_rq->load.weight)
+   if (cfs_rq->load.weight) {
+   if (parent_entity(se))
+   decload = 0;
break;
+   }
sleep = 1;
}
+   /*
+* Decrement cpu load if we just dequeued the last task of a group on
+* 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+* at the highest grouping level.
+*/
+   if (decload)
+   dec_cpu_load(rq, topse->load.weight);
 }
 
 /*
Index: current/kernel/sched_rt.c
===
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@ static void enqueue_task_rt(struct rq *r
 
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+   inc_cpu_load(rq, p->se.load.weight);
 }
 
 /*
@@ -45,6 +46,7 @@ static void dequeue_task_rt(struct rq *r
list_del(&p->run_list);
if 
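
A minimal user-space sketch of the accounting rule in the enqueue_task_fair()
hunk above: cpu load is bumped only when the walk reaches the top of the
hierarchy without finding an entity that is already queued. The struct and
function names here are toy stand-ins, not the kernel's, and the NULL guard
on topse is an addition for the sketch.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct entity {
    struct entity *parent;   /* NULL at the highest grouping level */
    int on_rq;               /* already queued on this cpu? */
    unsigned long weight;    /* stands in for se->load.weight */
};

struct runqueue {
    unsigned long cpu_load;  /* stands in for what inc_cpu_load() updates */
};

/* Walk from the task's entity upwards, as for_each_sched_entity() does. */
static void model_enqueue(struct runqueue *rq, struct entity *se)
{
    struct entity *topse = NULL;
    int incload = 1;

    for (; se; se = se->parent) {
        topse = se;
        if (se->on_rq) {     /* the group already has a queued task */
            incload = 0;
            break;
        }
        se->on_rq = 1;       /* stands in for enqueue_entity() */
    }
    /* Only the first task of a group adds the group's weight. */
    if (incload && topse)
        rq->cpu_load += topse->weight;
}
```

With two tasks in one group, the first enqueue adds the group's weight and the
second leaves cpu_load unchanged.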

[Patch 2/5] sched: minor fixes for group scheduler

2007-11-26 Thread Srivatsa Vaddagiri
Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
  changing shares of a task group. Use the same mutex when printing cfs_rq
  stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
  walking task group list)


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c  |   34 ++
 kernel/sched_fair.c |4 +++-
 2 files changed, 29 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
-   /* spinlock to serialize modification to shares */
-   spinlock_t lock;
struct rcu_head rcu;
 };
 
@@ -182,6 +180,12 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
 
+/*
+ * task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
 /* Default task group.
  * Every task in system belong to this group at bootup.
  */
@@ -222,9 +226,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
 }
 
+static inline void lock_task_group_list(void)
+{
+   mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+   mutex_unlock(&task_group_mutex);
+}
+
 #else
 
 static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -6747,7 +6763,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
-   spin_lock_init(&init_task_group.lock);
 #endif
 
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6987,14 +7002,15 @@ struct task_group *sched_create_group(vo
se->parent = NULL;
}
 
+   tg->shares = NICE_0_LOAD;
+
+   lock_task_group_list();
for_each_possible_cpu(i) {
rq = cpu_rq(i);
cfs_rq = tg->cfs_rq[i];
list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}
-
-   tg->shares = NICE_0_LOAD;
-   spin_lock_init(&tg->lock);
+   unlock_task_group_list();
 
return tg;
 
@@ -7040,10 +7056,12 @@ void sched_destroy_group(struct task_gro
struct cfs_rq *cfs_rq = NULL;
int i;
 
+   lock_task_group_list();
for_each_possible_cpu(i) {
cfs_rq = tg->cfs_rq[i];
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}
+   unlock_task_group_list();
 
BUG_ON(!cfs_rq);
 
@@ -7117,7 +7135,7 @@ int sched_group_set_shares(struct task_g
 {
int i;
 
-   spin_lock(&tg->lock);
+   lock_task_group_list();
if (tg->shares == shares)
goto done;
 
@@ -7126,7 +7144,7 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);
 
 done:
-   spin_unlock(&tg->lock);
+   unlock_task_group_list();
return 0;
 }
 
Index: current/kernel/sched_fair.c
===
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
-   list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+   list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline int
@@ -1126,7 +1126,9 @@ static void print_cfs_stats(struct seq_f
 #ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
 #endif
+   lock_task_group_list();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+   unlock_task_group_list();
 }
 #endif
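
A small sketch of the locking shape that sched_group_set_shares() takes after
this patch, assuming the assignment to tg->shares sits between the two hunks
shown. A depth counter stands in for task_group_mutex so the pairing can be
checked in user space; the struct and function names are hypothetical.

```c
/* Toy model: a depth counter stands in for task_group_mutex. */
static int lock_depth;

static void lock_task_group_list(void)   { lock_depth++; }
static void unlock_task_group_list(void) { lock_depth--; }

struct task_group_model {
    unsigned long shares;
};

/*
 * Every exit path, including the early "nothing to do" one, goes through
 * the single unlock at the 'done' label.
 */
static int set_shares(struct task_group_model *tg, unsigned long shares)
{
    lock_task_group_list();
    if (tg->shares == shares)
        goto done;

    tg->shares = shares;   /* the kernel also updates each per-cpu entity */

done:
    unlock_task_group_list();
    return 0;
}
```

Routing both paths through one label keeps the lock and unlock calls paired
even when the function returns early.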

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Konrad Rzeszutek
> >
> > sysfs files have ONE VALUE PER FILE, not a whole bunch of different
> > things in a single file.  Please fix this.
>
> The subparameters _are_ actually part of a single value, that value being
> associated with the initiator instance.
>
> Konrad is trying to implement a "work-alike" for what open firmware does.
> open-iscsi already has the ability to extract the same format
> bits from real OFW.
>
> See open-iscsi.git/utils/fwparam_ppc.


Greg,

In light of what Doug says (which is all true), should I go ahead with a new
version of this module which would export one value per file? The problem
is that an ethernetX sysfs directory would then have (for example):

/sys/firmware/ibft/ethernet0/pci-bdf
5:1:0
/sys/firmware/ibft/ethernet0/mac
00:11:25:9d:8b:00
/sys/firmware/ibft/ethernet0/vlan
0
/sys/firmware/ibft/ethernet0/gateway
192.168.79.254
/sys/firmware/ibft/ethernet0/origin
0
/sys/firmware/ibft/ethernet0/subnet-mask
22
/sys/firmware/ibft/ethernet0/ip-addr
192.168.77.41
/sys/firmware/ibft/ethernet0/flags
7

And the flags file would contain the value "7", which would mean the user
has to parse what each bit means? (v0.3 of the module does not export this
flag, but uses it to figure out which is the boot iSCSI target.)
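
To illustrate the alternative, here is a hedged sketch of splitting such a
flags value into individually named bits, so that each could back its own
sysfs file. The bit names and positions are hypothetical, chosen only for
illustration; the thread does not say what each bit means.

```c
#include <stdio.h>

/* Hypothetical bit assignments, for illustration only. */
#define NIC_FLAG_VALID      (1u << 0)
#define NIC_FLAG_BOOT_SEL   (1u << 1)
#define NIC_FLAG_LINK_LOCAL (1u << 2)

/*
 * Write one "name=0|1" line per bit. Returns the number of characters
 * written (excluding the terminating NUL), as snprintf() does.
 */
static int decode_nic_flags(unsigned int flags, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "valid=%d\nboot-selected=%d\nlink-local=%d\n",
                    !!(flags & NIC_FLAG_VALID),
                    !!(flags & NIC_FLAG_BOOT_SEL),
                    !!(flags & NIC_FLAG_LINK_LOCAL));
}
```

With flags = 7 all three lines report 1; with flags = 5 the middle one
reports 0.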



[Patch 1/5] sched: code cleanup

2007-11-26 Thread Srivatsa Vaddagiri
Minor cleanups:

- Fix coding style
- remove obsolete comment


Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
 kernel/sched.c |   21 +++--
 1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
 };
 
 #ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD    2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  2*NICE_0_LOAD
 #else
-# define INIT_TASK_GRP_LOAD    NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD  NICE_0_LOAD
 #endif
 
-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;
 
 /* return group to which a task belongs */
 static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i
 
 #define sched_class_highest (&rt_sched_class)
 
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
 static inline void inc_load(struct rq *rq, const struct task_struct *p)
 {
update_load_add(&rq->load, p->se.load.weight);


-- 
Regards,
vatsa


Re: [PATCH 1/1] mm: add dirty_highmem option

2007-11-26 Thread Andrew Morton
On Thu, 22 Nov 2007 14:42:04 +1100 Bron Gondwana <[EMAIL PROTECTED]> wrote:

>  /*
> + * free highmem will not be subtracted from the total free memory
> + * for calculating free ratios if vm_dirty_highmem is true
> + */
> +int vm_dirty_highmem;

One would expect that setting dirty_highmem to true would cause highmem to
be accounted in dirty-memory calculations.  However with this change
reality is in fact the inverse of that.

So how about this?

 Documentation/filesystems/proc.txt |    4 ++--
 mm/page-writeback.c                |    8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff -puN Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt~mm-add-dirty_highmem-option-fix
+++ a/Documentation/filesystems/proc.txt
@@ -1265,8 +1265,8 @@ Contains, as a boolean, a switch to allo
 part of the "available" memory against which the dirty ratios will be
 applied.
 
-Setting this to 1 can be useful on 32 bit machines where you want to make
-random changes within an MMAPed file that is larger than your available
+Setting this to 0 (false) can be useful on 32 bit machines where you wish to
+make random changes within an MMAPed file that is larger than your available
 lowmem, however it is potentially dangerous and has serious bounce-buffer
 issues.
 
diff -puN mm/page-writeback.c~mm-add-dirty_highmem-option-fix mm/page-writeback.c
--- a/mm/page-writeback.c~mm-add-dirty_highmem-option-fix
+++ a/mm/page-writeback.c
@@ -69,10 +69,10 @@ static inline long sync_writeback_pages(
 int dirty_background_ratio = 5;
 
 /*
- * free highmem will not be subtracted from the total free memory
- * for calculating free ratios if vm_dirty_highmem is true
+ * free highmem will be subtracted from the total free memory for calculating
+ * free ratios if vm_dirty_highmem is true
  */
-int vm_dirty_highmem;
+int vm_dirty_highmem = 1;
 
 /*
  * The generator of dirty data starts writeback at this percentage
@@ -293,7 +293,7 @@ static unsigned long determine_dirtyable
x = global_page_state(NR_FREE_PAGES)
+ global_page_state(NR_INACTIVE)
+ global_page_state(NR_ACTIVE);
-   if (!vm_dirty_highmem)
+   if (vm_dirty_highmem)
x -= highmem_dirtyable_memory(x);
return x + 1;   /* Ensure that we never return 0 */
 }
_




(I dropped the already-merged part of your patch)

(I fixed a build error in kernel/sysctl.c: "one" was defined twice when
suitable config options were set).

(It's an unpleasing patch, btw.  But it's an unpleasant problem and at least
this way people can tell us "hey, I did  and it started to work")
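
A user-space sketch of determine_dirtyable_memory() as it reads after the
hunk above, with the page counts passed in rather than read from global
state. The body of highmem_dirtyable_memory() is not shown in this thread, so
capping the subtraction at the running total is an assumption of the sketch.

```c
/* Helper for the sketch: the smaller of two page counts. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/*
 * Toy model: 'highmem' is the number of dirtyable highmem pages
 * (an assumed input; the kernel computes it from zone state).
 */
static unsigned long dirtyable_pages(unsigned long free_pages,
                                     unsigned long inactive,
                                     unsigned long active,
                                     unsigned long highmem,
                                     int vm_dirty_highmem)
{
    unsigned long x = free_pages + inactive + active;

    if (vm_dirty_highmem)
        x -= min_ul(x, highmem);  /* never subtract more than the total */
    return x + 1;                 /* ensure that we never return 0 */
}
```

With 600 pages in total and 400 of them in highmem, the model returns 201 when
the flag is set and 601 when it is clear.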


[Patch 0/5] sched: group scheduler related patches (V4)

2007-11-26 Thread Srivatsa Vaddagiri
On Mon, Nov 26, 2007 at 09:28:36PM +0100, Ingo Molnar wrote:
> the first SCHED_RR priority is 1, not 0 - so this call will always fail.

Thanks for spotting this bug and rest of your review comments.

Here's V4 of the patchset, aimed at improving fairness of cpu bandwidth
allocation for task groups.

Changes since V3 (http://marc.info/?l=linux-kernel&m=119605252303359):

- Fix bug in setting SCHED_RR priority for load_balance_monitor thread
- Fix coding style related issues
- Separate "introduction of lock_doms_cur() API" into a separate patch

I have also tested this patchset against your latest git tree as of
this morning.

Please apply if there are no major concerns.


-- 
Regards,
vatsa


[PATCH] [libata] Set proper ATA UDMA mode for bf548 according to system clock.

2007-11-26 Thread sonic zhang
UDMA Mode - Frequency compatibility

UDMA5 - 100 MB/s   - SCLK  = 133 MHz
UDMA4 - 66 MB/s    - SCLK >=  80 MHz
UDMA3 - 44.4 MB/s  - SCLK >=  50 MHz
UDMA2 - 33 MB/s    - SCLK >=  40 MHz


Signed-off-by: Sonic Zhang <[EMAIL PROTECTED]>
---
 drivers/ata/pata_bf54x.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/ata/pata_bf54x.c b/drivers/ata/pata_bf54x.c
index 81db405..088a41f 100644
--- a/drivers/ata/pata_bf54x.c
+++ b/drivers/ata/pata_bf54x.c
@@ -1489,6 +1489,8 @@ static int __devinit bfin_atapi_probe(st
int board_idx = 0;
struct resource *res;
struct ata_host *host;
+   unsigned int fsclk = get_sclk();
+   int udma_mode = 5;
const struct ata_port_info *ppi[] =
{ &bfin_port_info[board_idx], NULL };
 
@@ -1507,6 +1509,11 @@ static int __devinit bfin_atapi_probe(st
if (res == NULL)
return -EINVAL;
 
+   while (bfin_port_info[board_idx].udma_mask>0 && udma_fsclk[udma_mode] > fsclk) {
+   udma_mode--;
+   bfin_port_info[board_idx].udma_mask >>= 1;
+   }
+
/*
 * Now that that's out of the way, wire up the port..
 */
-- 
1.4.3.4
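
A stand-alone sketch of the mode-selection loop in the patch above. The
thresholds for UDMA2 to UDMA5 come from the table in the commit message; the
entries for modes 0 and 1 are placeholders (assumed 0), since the real
udma_fsclk[] array is not shown here. The extra mode >= 0 guard is an addition
for the sketch and is not part of the patch.

```c
/* Minimum SCLK in Hz per UDMA mode; entries 0 and 1 are assumed. */
static const unsigned int udma_fsclk_model[6] = {
    0, 0, 40000000, 50000000, 80000000, 133000000
};

struct udma_choice {
    int mode;            /* highest usable mode, or -1 if none */
    unsigned int mask;   /* one bit per usable mode, bit 0 = UDMA0 */
};

/* 'mask' is expected to start with one bit per mode 0..5 (0x3f). */
static struct udma_choice pick_udma(unsigned int fsclk, unsigned int mask)
{
    struct udma_choice c = { 5, mask };

    /* Drop the top mode until the clock is fast enough or none are left. */
    while (c.mode >= 0 && c.mask > 0 && udma_fsclk_model[c.mode] > fsclk) {
        c.mode--;
        c.mask >>= 1;
    }
    return c;
}
```

At 100 MHz the model settles on UDMA4 with mask 0x1f; at 45 MHz it settles on
UDMA2 with mask 0x07.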




Re: Dynticks Causing High Context Switch Rate in ksoftirqd

2007-11-26 Thread Robert Hancock

[EMAIL PROTECTED] wrote:

Question: Why is ksoftirqd eating about 5 to 10 percent of my CPU on an idle
system? The problem occurs if I config the kernel with tickless
support (i.e. CONFIG_TICK_ONESHOT=y).  (Thanks to "oprofile" for putting me
onto this.)

I have noted this same problem on kernel versions: 2.6.23.1, 2.6.23.8 and
2.6.23.9

**
*** Output from "vmstat -n 1 10" -- Note very high context switch rate ***
*** This is on a idle machine! ***
**

procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   incs us sy id wa
 0  0  0 1925556   4768 11610400   124 26  7538  1  2 96  1
 0  0  0 1925556   4768 11610400 0 02 147329  0  1 99  0


What did oprofile show? It should be able to narrow down what 
function(s) are responsible for the CPU usage..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: [2.6 patch] remove CONFIG_EXPERIMENTAL

2007-11-26 Thread Dave Jones
On Mon, Nov 26, 2007 at 10:44:44PM -0500, [EMAIL PROTECTED] wrote:
 
 > I suspect that given the "once it escapes, it's cast in stone" view we take
 > towards user-visible API/etc, there isn't much *real* room for an
 > 'EXPERIMENTAL' flag anymore.  Most of the usage should probably be confined to
 > individual drivers, where all we should need is a 'default n' and suitable
 > warning verbiage in the Kconfig file warning about the driver eating your
 > filesystems and small animals for breakfast.

Potential corruptors are usually flagged with (DANGEROUS) in the text.
(One may argue that they shouldn't have escaped -mm.)

 >  We certainly shouldn't have
 > one big flag for *all* in-progress drivers - I don't need to accidentally
 > enable a busticated ethernet driver because I want a USB widget.

So no ethernet driver at all is better than a broken but mostly working one?
Again, if it isn't mostly working, it shouldn't have escaped -mm.

Dave

-- 
http://www.codemonkey.org.uk


Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Konrad Rzeszutek

.. snip..
> > +#else
> > +static void __init reserve_ibft_region(void) { };
>
> No ending ; above.

Fixed.
>
..snip..
> > +static void __init reserve_ibft_region(void) { };
>
> Ditto.

Fixed.

.. snip..
> > +#include 
> > +
>
> No blank line here, please.

Why that creeps back into the code I am not sure. You mentioned this in your
first review, I fixed it in my tree, and now it is back!? Either way,
it is fixed.

>
> > +#include 

..snip..

> > +   printk(KERN_INFO  \
>
> Looks like this should use KERN_ERROR or KERN_WARNING?

Yes! Thanks for catching that.
>
> > +   "error, in IBFT structure (%s) expected %d but" \
> > +   return str-buf;
>
>   preferred form:
>   return str - buf;

Fixed.
>
..snip..
> > +
> > +   return str-buf;
>
>   Ditto.
Fixed.

>
..snip..
> > +   return str-buf;
>
>   Ditto.
Fixed.

..snip..
> > +   return str-buf;
>
>   Ditto.
Fixed.
..snip..
> > +   int len = 6;
>
> Could you just use ETH_ALEN instead of  and 6?
> and #include 

Yes. That makes much more sense.

>
> Or add a define for IBFT_ALEN (of 6) and use that?

Either one works. The first suggestion is much better.


..snip..
>
> > +
> > +   /* Based on the header index value find the data tuple,
> > +   if possibly. */
>
>  if possible. */
>
> or better:
>   /*
>* Based on the header index value, find the data tuple
>* if possible.
>*/

Yes, much more understandable - and now that I read it I realized this was
not a proper assumption. One of the data structures (struct ibft_tgt) has a
'nic_assoc' value which makes an N-to-1 mapping to the NIC data structure,
so I will re-work this. Thanks for catching a bug that early in the cycle!

..snip..
> > +  struct carry it for convience. */
>
>   convenience.
Fixed.

..snip..
> > + * Scan the IBFT table structure for the NIC and Target fields. When
> > + * found add them on the passed in list.
>
>   passed-in list.

Fixed.

>
> > + */
> > +static int ibft_scan_device(struct ibft_table_header *header,
> > +   struct list_head *list)
> > +{
> > +
> > +   /* We can have multiple NICs and multiple targets. The index in
> > +  their header defines their 1-to-1 correlation.

Not true. I will have to re-work this code to do a 1-to-N correlation.
> > +   */
> > +   for (ptr = >nic0_off; ptr <= end; ptr += sizeof(u16)) {
>
> In many searches,  would be the first address beyond the end of the
> table, so the loop-terminating condition test would be:
>
>   ptr < end;

Yes. That is correct. It did actually check the next offset, which fortunately 
had nothing in it.
>
> It looks like that should be the case here also

To check the offset to make sure it is within the full IBFT data structure? 
Yes, that is a good check - will implement.
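
The two review points above (an exclusive end bound, and checking that each
offset lies inside the table) can be sketched as follows. This is a toy
helper, not the driver's code: the function name, the uint16_t pointer type
and the "0 means unused" convention are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the non-zero offsets in [first, end) that point inside a table of
 * table_len bytes. 'end' is one past the last slot, hence '<' and not '<='.
 */
static int count_valid_offsets(const uint16_t *first, const uint16_t *end,
                               size_t table_len)
{
    int n = 0;
    const uint16_t *ptr;

    for (ptr = first; ptr < end; ptr++) {
        if (*ptr == 0)           /* assumed: 0 means "slot unused" */
            continue;
        if (*ptr >= table_len)   /* offset would fall outside the table */
            continue;
        n++;
    }
    return n;
}
```

With an exclusive bound, an empty range (first == end) is handled naturally
and the slot after the last one is never read.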

>
..snip..
> > +   if (rc) break;
>
>   break;
>   on a separate line.
>
> Did you check this patch with scripts/checkpatch.pl ?

Yes. I ran it with check-patch-0.99.pl, which I downloaded from somewhere on
Dave Jones's web page. I hadn't realized that its home is now
scripts/checkpatch.pl - I will make sure to use that newer version.

>
..snip..
> > +   printk(KERN_INFO "iBFT detected at 0x%lx.\n",
> > +  (unsigned long)ibft_phys);
>
> Use %p to print pointer values.

This is actually not a pointer yet. It is a true physical address which I 
thought might be useful for troubleshooting purposes.

>
..snip.
> > +   if (!rc)
> > +   return rc;
>
> Can't this always just be
>   return 0;
> ?

Yes, I was thinking that perhaps a nicer way was to do
"goto end;" where the end label is just "return rc;". But this
definitely trumps it.

>
> > +
..snip..
> > +
> > +struct ibft_tgt {
> > +   struct ibft_hdr hdr;
> > +   char ip_addr[16];
> > +   u16 port;
> > +   char lun[8];
> > +   u8 chap_type;
> > +   u8 nic_assoc;
> > +   u16 tgt_name_len;
> > +   u16 tgt_name_off;
> > +   u16 chap_name_len;
> > +   u16 chap_name_off;
> > +   u16 chap_secret_len;
> > +   u16 chap_secret_off;
> > +   u16 rev_chap_name_len;
> > +   u16 rev_chap_name_off;
> > +   u16 rev_chap_secret_len;
> > +   u16 rev_chap_secret_off;
> > +} __attribute__((__packed__));
> > +
> > +#if defined(CONFIG_ISCSI_IBFT) || defined(CONFIG_ISCSI_IBFT_MODULE)
>
> Why is this #if line here instead of nearer the top of this header file?

My thought was that if other kernel users want to include this header
file, they do not have to be exposed to its semi-internal data structures.
If that is not a concern then I think I can remove the
conditional altogether.

>
> > +#define IBFT_SIGN "iBFT"
> > +#define IBFT_SIGN_LEN 4
> > +#define IBFT_START 0x80000 /* 512kB */
> > +#define IBFT_END 0x100000 /* 1MB */
> > +#define VGA_MEM 0xA0000 /* VGA buffer */
> > +#define VGA_SIZE 0x20000 /* 132kB */
>
> I'd say 

Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Konrad Rzeszutek
On Monday 26 November 2007 22:31:38 Greg KH wrote:
> On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote:
> > +/*
> > + *  Routines for reading of the iBFT data in a human readable fashion.
> > + */
> > +ssize_t ibft_attr_show_initiator(struct ibft_kobject *entry,
> > +struct ibft_attribute *attr,
> > +char *buf)
> > +{
.. snip..
> > +
> > +   str += sprintf_ipaddr(str, "isns", initiator->isns_server);
> > +   str += sprintf_ipaddr(str, "slp", initiator->slp_server);
.. snip ..
>
> sysfs files have ONE VALUE PER FILE, not a whole bunch of different
> things in a single file.  Please fix this.

No problem. I will post that shortly.

>
> > +
> > +ssize_t ibft_attr_show_nic(struct ibft_kobject *entry,
> > +  struct ibft_attribute *attr,
> > +  char *buf)
.. snip.. 
> > +   str += sprintf_ipaddr(str, "giaddr", nic->gateway);
> > +   str += sprintf_ipaddr(str, "dnsaddr1", nic->primary_dns);
>
> Same here.

Yup. 
>
> > +ssize_t ibft_attr_show_target(struct ibft_kobject *entry,
> > + struct ibft_attribute *attr,
> > + char *buf)
> > +{
.. snip..
> > +}
>
> Same here, are we writing a novella or something to userspace?  :)

Hehe.. I will make it simpler :-)

>
> > +ssize_t ibft_attr_show_disk(struct ibft_kobject *dev,
> > +   struct ibft_attribute *ibft_attr,
> > +   char *buf)
> > +{
.. snip ..
> > +}
>
> And here, do I need to go on?

I will have a new version posted quite shortly.

>
> > +ssize_t ibft_attr_show_mac(struct ibft_kobject *entry,
> > +  struct ibft_attribute *attr,
> > +  char *buf)
> > +{
..snip..
> > +
> > +   memcpy(buf, attr->nic->mac, len);
> > +
> > +   return len;
> > +}
>
> Is mac a user readable string?  Then perhaps a simple sprintf would work
> instead, as I doubt you are including a \n here...

It was meant to be a binary value. But that doesn't fit in a sysfs directory,
so let me make it use sprintf here.
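
A minimal sketch of what the sprintf-based version could look like, written
as a user-space helper. The function name and the MAC_LEN constant are
hypothetical; the return value follows the sysfs show() convention of
returning the number of bytes placed in the buffer.

```c
#include <stdio.h>

#define MAC_LEN 6  /* same value as the kernel's ETH_ALEN */

/*
 * Format a 6-byte MAC as "xx:xx:xx:xx:xx:xx\n" and return the number of
 * characters written (excluding the terminating NUL).
 */
static int format_mac(char *buf, size_t len, const unsigned char mac[MAC_LEN])
{
    return snprintf(buf, len, "%02x:%02x:%02x:%02x:%02x:%02x\n",
                    mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
}
```

The trailing newline is part of the returned length, so the file reads as one
text line rather than six raw bytes.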

>
> > +/*
> > + * The main routine which allows the user to read the IBFT data.
> > + */
> > +static ssize_t ibft_show_attribute(struct kobject *kobj,
> > +  struct attribute *attr,
> > +  char *buf)
> > +{
..snip..
> > +
> > +static struct sysfs_ops ibft_attr_ops = {
> > +   .show = ibft_show_attribute,
> > +};
>
> I think this whole mess can go away in the new rework Kay and I have
> done, please document this whole thing and I'll see what I can do.

Absolutely.

>
> > +struct ibft_control {
> > +struct ibft_hdr hdr;
> > +u16 extensions;
> > +u16 initiator_off;
> > +u16 nic0_off;
> > +u16 tgt0_off;
> > +u16 nic1_off;
> > +u16 tgt1_off;
> > +} __attribute__((__packed__));
>
> Did we lose tabs for some reason?  I'm guessing your editor is not
> showing them properly, nor did you use scripts/checkpatch.pl :(

I did use checkpatch.pl v0.99 downloaded somewhere from the web. I hadn't
realized it was now residing in scripts/checkpatch.pl - and from now on I will 
use that.

>
> > +#if defined(CONFIG_ISCSI_IBFT) || defined(CONFIG_ISCSI_IBFT_MODULE)
..snip..
> > +static ssize_t find_ibft(void)
> > +{
..snip..
> > +}
>
> What is a function (not even an inline one) doing in a .h file?

I was not sure where to put it. This function (find_ibft) is used by
setup_[32|64].c and the iscsi_ibft.c code. Randy suggested I put it in a .c
file, but I am not sure exactly where. Should I make a new file called
libs/iscsi_ibft_helper.c ?

>
..snip..
> > +struct ibft_kobject {
> > +   struct ibft_data *data;
> > +   char name[IBFT_ISCSI_KOBJECT_MAX_LEN];
>
> Why have this,
>
> > +   u8 type;
> > +   struct kobject kobj;
>
> When the kobject itself has an unlimited size name associated with it?

Absolutely no reason at all. It is an evolutionary vestige of the code that
is not needed anymore.

>
..snip..
> > +   char name[IBFT_ISCSI_ATTR_MAX_LEN];
>
> Same here, an attribute already has a pointer to a name, no need to have
> another one in the same structure.

Thanks. Will remove it.
>
> > +   struct list_head node;
> > +};
>
> thanks,

Thank you for taking your time to review the code. I will have the new
version out shortly.
>
> greg k-h




Re: Error returns not handled correctly by sysfs.c:subsys_attr_store()

2007-11-26 Thread Andrew Morton
On Wed, 21 Nov 2007 15:16:59 -0700 Andrew Patterson <[EMAIL PROTECTED]> wrote:

> The buf in fs/sysfs.c:subsys_attr_store() does not seem to be updated
> correctly when a negative value (indicating that an error condition has
> occurred) is returned.  If a negative value is returned, the next
> call to subsys_attr_store will have the contents of buf appended to the
> previous call.

subsys_attr_store() gets deleted by
http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-01-driver/kset-kill-subsys-attr.patch

So maybe we will soon accidentally fix whatever-this-is?  Or maybe we will
faithfully maintain it.


Re: [patch 05/14] percpu: Use a Kconfig variable to configure arch specific percpu setup

2007-11-26 Thread Rusty Russell
On Tuesday 27 November 2007 11:14:12 Christoph Lameter wrote:
> The use of the __GENERIC_PERCPU is a bit problematic since arches
> may want to run their own percpu setup while using the generic
> percpu definitions. Replace it through a kconfig variable.

Thanks for this Christoph!

These patches are great: the early experiments are obviously over, and so this 
consolidation is overdue.

Have you considered moving x86-64's setup_per_cpu_areas into generic code?  
It's a bit messier because some archs might not have set up NUMA stuff yet, 
but it's logically generic...

Thanks!
Rusty.


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-26 Thread Rusty Russell
On Monday 26 November 2007 16:58:08 Roland Dreier wrote:
>  > > I agree that we shouldn't make things too hard for out-of-tree
>  > > modules, but I disagree with your first statement: there clearly is a
>  > > large class of symbols that are used by multiple modules but which are
>  > > not generically useful -- they are only useful by a certain small
>  > > class of modules.
>  >
>  > If it is so clear, you should be able to easily provide examples?
>
> Sure -- Andi's example of symbols required only by TCP congestion
> modules;

Exactly.  Why exactly should someone not write a new TCP congestion module?

> the SCSI internals that Christoph wants to mark

He didn't justify those though, either.

> ; the symbols exported by my mlx4_core driver (which I admit are
> currently only used by the mlx4_ib driver, but which will also be used
> by at least the ethernet NIC driver for the same hardware).

Right.  So presumably there will only ever be two drivers using this core 
code, so no new users will ever be written?  Now we've found one use case, is 
it worth the complexity of namespaces?  Is it worth the halfway point of 
export-to-module?

What problem will it solve?

> I thought this was already covered repeatedly in the thread and indeed
> in Andi's code so there was no need to repeat it...

No, we've seen the solution and various people applying it.  I'm still trying 
to discover the problem it's solving.

Hope that helps,
Rusty.


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-26 Thread John Richard Moser



Andi Kleen wrote:

On Tuesday 20 November 2007 04:50, Christoph Lameter wrote:

On Tue, 20 Nov 2007, Andi Kleen wrote:



You could in theory move the modules, but then you would need to implement
a full PIC dynamic linker for them  first and also increase runtime overhead
for them because they would need to use a GOT/PLT.


On x86-64?  The GOT/PLT should stay in cache due to temporal locality. 
The x86-64 instruction set itself handles GOT-relative addressing rather 
well; what's a 1% loss on x86 is like 0.01% on x86-64, so I'm thinking 
100 times better?


I think I got this by compiling the nbyte benchmark with `-fpic -pie` versus
fixed position, on 32-bit (which made about a 1% difference) and on 64-bit
(which made a 0.01% difference).  It was a long time ago.


Still, yeah I know.  Complexity.

(You have the ability to textrel these things too, and just rewrite 
non-PIC, depending on how you feel about that)

--
Bring back the Firefox plushy!
http://digg.com/linux_unix/Is_the_Firefox_plush_gone_for_good
https://bugzilla.mozilla.org/show_bug.cgi?id=322367


Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-11-26 Thread Eric W. Biederman
Neil Horman <[EMAIL PROTECTED]> writes:

> Hey all-
>   I've been working on an issue lately involving multi-socket x86_64
> systems connected via hypertransport bridges.  It appears that some systems
> disable the hypertransport connections during a kdump operation when all but
> the crashing processor get halted in machine_crash_shutdown.  This becomes a
> problem when the ioapic attempts to route interrupts to the only remaining
> processor.  Even though the active processor is targeted for interrupt
> reception, the fact that the hypertransport connections are inactive results
> in interrupts not getting delivered.  The effective result is that timer
> interrupts are not delivered to the running cpu, and the system hangs on
> reboot into the kdump kernel during calibrate_delay.  I've found that I've
> been able to avoid this hang by forcing a transition to the bios-defined boot
> cpu during the crashing kernel shutdown.  This patch accomplishes that.
> Tested by myself and the original reporter with successful results.

If you can get to calibrate_delay, hypertransport is still routing traffic.
Your diagnosis of the problem is wrong.  Most likely it is just an ioapic
programming error in restoring the system to PIC mode.

I agree that there is a problem.

The reliable fix is to totally skip the PIC interrupt mode and go directly
to apic mode.

To make the kexec on panic code path reliable we need to remove code,
not add it.

Frankly I think switching cpus is one of the least reliable things that
we can do in general.

Eric


Re: unionfs: several more problems

2007-11-26 Thread Erez Zadok
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Mon, 26 Nov 2007, Erez Zadok wrote:
[...]
> > The small patch below fixed the problem.  Let me know what you think.
> 
> I've one issue with it: please move that wait_on_page_writeback before
> the clear_page_dirty_for_io instead of after it, then resubmit your 14/16.
[...]

Done, tested, and working.  Here's the revised patch (pushed to unionfs.git
on korg).

Thanks,
Erez.

--

Unionfs: prevent multiple writers to lower_page

Without this patch, the LTP fs test "rwtest04" triggers a
BUG_ON(PageWriteback(page)) in fs/buffer.c:1706.

CC: Hugh Dickins <[EMAIL PROTECTED]>

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c
index 623a913..74f2e53 100644
--- a/fs/unionfs/mmap.c
+++ b/fs/unionfs/mmap.c
@@ -72,6 +72,7 @@ static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	BUG_ON(!lower_mapping->a_ops->writepage);
+	wait_on_page_writeback(lower_page); /* prevent multiple writers */
 	clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
 	err = lower_mapping->a_ops->writepage(lower_page, wbc);
 	if (err < 0)


Re: [PATCH] file capabilities: don't prevent signaling setuid root programs.

2007-11-26 Thread Andrew Morgan

Serge,

I still feel a bit uneasy about this. Looking ahead, with filesystem
capabilities, one can simulate this same situation with a setuid
'non-root' program as follows:

[EMAIL PROTECTED] ~]$ cat > test.c
main()
{
printf("sleeping (%u)\n", getpid());
sleep(100);
printf("woke up\n");
}
[EMAIL PROTECTED] ~]$ cc -o test test.c
[EMAIL PROTECTED] ~]$ chmod u+s ./test
[EMAIL PROTECTED] ~]$ ls -ltr test
-rwsrwxr-x  1 morgan morgan 7090 Nov 26 20:01 test
[EMAIL PROTECTED] ~]$ setcap cap_net_raw+ep ~/test
[EMAIL PROTECTED] ~]$ getcap ~/test
/home/morgan/test = cap_net_raw+ep
[EMAIL PROTECTED] ~]$ su luser
Password:
[EMAIL PROTECTED] morgan]$ ./test
sleeping (5935)


[EMAIL PROTECTED] morgan]$ kill 5935
bash: kill: (5935) - Operation not permitted

Because of the euid=0 test, the piece of code you are adding will behave
differently in this situation. Is the root-behavior deserving of less
protection than this one? To my eye they seem equivalent.

Is there a compelling reason to include the euid==0 check?
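
To make the asymmetry concrete, here is a minimal userspace sketch of the two possible checks. It is not kernel code: struct task_ids and both function names are hypothetical stand-ins for the uid fields the patch compares, and the numeric uids used in the note after it are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the uid fields of a task. */
struct task_ids {
	unsigned int uid;	/* real uid */
	unsigned int euid;	/* effective uid */
};

/* Same shape as the patch: p->euid == 0 && p->uid == current->uid */
bool exception_with_euid_test(const struct task_ids *sender,
			      const struct task_ids *target)
{
	assert(sender != NULL && target != NULL);
	return target->euid == 0 && target->uid == sender->uid;
}

/* Variant without the euid test: only the real uid has to match. */
bool exception_without_euid_test(const struct task_ids *sender,
				 const struct task_ids *target)
{
	assert(sender != NULL && target != NULL);
	return target->uid == sender->uid;
}
```

With a sender of uid 1001, a setuid-root target (uid 1001, euid 0) is covered by both variants, while a setuid non-root target (uid 1001, euid 1000) is covered only by the variant without the euid test. That second case is the one from the transcript above.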

Thanks

Andrew

Serge E. Hallyn wrote:
> This patch is needed to preserve legacy behavior when
> CONFIG_SECURITY_FILE_CAPABILITIES=y.  Without this patch, xinit can't
> kill X, so manually starting X in runlevel 3 then exiting your window
> manager will not cause X to exit. 
> 
> thanks,
> -serge
> 
> From 81a6d780ad570f9a326fc27912ec0e373f5fa14f Mon Sep 17 00:00:00 2001
> From: Serge E. Hallyn <[EMAIL PROTECTED]>
> Date: Tue, 20 Nov 2007 08:47:35 +
> Subject: [PATCH] file capabilities: don't prevent signaling setuid root programs.
> 
> An unprivileged process must be able to kill a setuid root
> program started by the same user.  This is legacy behavior
> needed for instance for xinit to kill X when the window manager
> exits.
> 
> When an unprivileged user runs a setuid root program in !SECURE_NOROOT
> mode, fP, fI, and fE are set full on, so pP' and pE' are full on.
> Then cap_task_kill() prevents the user from signaling the setuid root
> task.  This is a change in behavior compared to when
> !CONFIG_SECURITY_FILE_CAPABILITIES.
> 
> This patch introduces a special check into cap_task_kill() just
> to check whether a non-root user is signaling a setuid root
> program started by the same user.  If so, then signal is allowed.
> 
> Changelog:
>   Nov 26: move test up above CAP_KILL test as per Andrew
>   Morgan's suggestion.
> 
> Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
> ---
>  security/commoncap.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 302e8d0..5bc1895 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -526,6 +526,15 @@ int cap_task_kill(struct task_struct *p, struct siginfo *info,
>  	if (info != SEND_SIG_NOINFO && (is_si_special(info) || SI_FROMKERNEL(info)))
>  		return 0;
>  
> +	/*
> +	 * Running a setuid root program raises your capabilities.
> +	 * Killing your own setuid root processes was previously
> +	 * allowed.
> +	 * We must preserve legacy signal behavior in this case.
> +	 */
> +	if (p->euid == 0 && p->uid == current->uid)
> +		return 0;
> +
>  	/* sigcont is permitted within same session */
>  	if (sig == SIGCONT && (task_session_nr(current) == task_session_nr(p)))
>  		return 0;


Re: profile code added to netif_receive_skb function

2007-11-26 Thread Valdis . Kletnieks
On Sun, 25 Nov 2007 21:46:26 PST, kernel coder said:
> hi,
> 
> I have added some code to the netif_receive_skb function. As the Linux
> kernel is multithreaded, there is no guarantee that my code is executed
> completely without being disturbed by another process. The timer interrupt
> handler is an example of code which might interrupt the execution of my
> code.

The trick is to write your code so it doesn't *matter* if other code runs.

For example - the timer interrupt almost certainly doesn't look at or modify
any of *your* code's variables.  So for 98% of the kernel's code, you don't
even have to *care* whether it runs (as long as you aren't doing something
real-time, or something with similar response-time or throughput constraints).

And if you are worried about that other 2%, where related code, for example the
IRQ handler for a network interface, may have to look at and/or modify some of
your variables, that's when you should be using appropriate locking - there are
mutexes, semaphores, the whole RCU family, and more - none of which I'll
attempt to explain, because I'm not all that good at that stuff.

Basic rule of thumb - if you have something that will break if two things
access it at the same time, put a lock around it, so they take turns.
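
As a rough illustration of that rule of thumb, here is a userspace sketch using POSIX threads rather than kernel primitives. The names locked_counter, worker_arg and run_locked_counter are made up for this example. Code that runs in softirq context, as netif_receive_skb does, could not use a sleeping lock like this one and would need a spinlock instead; the take-turns idea is the same.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Shared state: value is only touched while holding lock. */
struct locked_counter {
	pthread_mutex_t lock;
	long value;
};

struct worker_arg {
	struct locked_counter *counter;
	long iterations;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;
	long i;

	for (i = 0; i < arg->iterations; i++) {
		pthread_mutex_lock(&arg->counter->lock);
		arg->counter->value++;		/* the part that must not race */
		pthread_mutex_unlock(&arg->counter->lock);
	}
	return NULL;
}

/* Runs nthreads workers; returns the final count, or -1 on failure. */
long run_locked_counter(int nthreads, long iterations)
{
	enum { MAX_THREADS = 16 };
	pthread_t tids[MAX_THREADS];
	struct locked_counter counter;
	struct worker_arg arg;
	int started = 0;
	int i;

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	if (pthread_mutex_init(&counter.lock, NULL) != 0)
		return -1;
	counter.value = 0;
	arg.counter = &counter;
	arg.iterations = iterations;

	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[i], NULL, worker, &arg) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	pthread_mutex_destroy(&counter.lock);

	return started == nthreads ? counter.value : -1;
}
```

Calling run_locked_counter(4, 10000) should always give 40000; without the lock around the increment, the total can come up short because two threads may read the same old value.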




Re: Small System Paging Problem - OOM-killer goes nuts

2007-11-26 Thread Josh Goldsmith

> When you untar, which filesystem do you untar to?

I've untarred it to Ext3, Ext2, and Reiser filesystems.  I've been fighting
with this for a while.

I did manage to get it to happen again doing a recursive chmod after
untarring the kernel (I stopped the untar a few times to let the system
catch up).

Interesting output below.

-J

top - 17:58:03 up  3:08,  1 user,  load average: 3.54, 4.09, 4.08
Tasks:  53 total,   2 running,  51 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.1%us, 11.4%sy,  0.6%ni,  0.0%id, 81.4%wa,  2.7%hi,  1.8%si,  0.0%st
Mem:     30352k total,    28252k used,     2100k free,    19448k buffers
Swap:   465876k total,    15736k used,   450140k free,     1072k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1357 root      30  15  1568  168   88 R  8.1  0.6   0:07.87 chmod
  168 root      10  -5     0    0    0 S  3.1  0.0   6:39.25 usb-storage
 1353 root      15   0  2408  540  400 R  2.2  1.8   0:14.29 top
  989 root      15   0  3600  292  192 S  1.2  1.0   0:37.81 sshd
    2 root      34  19     0    0    0 S  0.6  0.0   2:14.65 ksoftirqd/0
   56 root      15   0     0    0    0 S  0.3  0.0   0:23.85 pdflush
   58 root      10  -5     0    0    0 S  0.3  0.0   0:54.70 kswapd0
  950 root      15   0  3128  108   64 S  0.3  0.4   0:13.88 ntpd
    1 root      16   0  1440    0    0 S  0.0  0.0   0:10.40 init
    3 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 events/0
    4 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 khelper
    5 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kthread
   38 root      10  -5     0    0    0 S  0.0  0.0   0:00.04 kblockd/0
   41 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 khubd
   57 root      15   0     0    0    0 D  0.0  0.0   0:20.29 pdflush


And the first of the oom-killer syslog messages:

ntpd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Mem-info:
DMA per-cpu:
CPU0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
sshd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Active:2816 inactive:2778 dirty:0 writeback:0 unstable:0
free:179 slab:858 mapped:1 pagetables:93 bounce:0 




Re: [2.6 patch] remove CONFIG_EXPERIMENTAL

2007-11-26 Thread Valdis . Kletnieks
On Mon, 26 Nov 2007 12:27:07 GMT, Pavel Machek said:

> I don't think this is good idea. But perhaps 'experimental' should be
> removed from stuff that is really stable these days, like SATA?

I suspect that given the "once it escapes, it's cast in stone" view we take
towards user-visible API/etc, there isn't much *real* room for an
'EXPERIMENTAL' flag anymore.  Most of the usage should probably be confined to
individual drivers, where all we should need is a 'default n' and suitable
warning verbiage in the Kconfig file warning about the driver eating your
filesystems and small animals for breakfast.  We certainly shouldn't have
one big flag for *all* in-progress drivers - I don't need to accidentally
enable a busticated ethernet driver because I want a USB widget.  And if
you're worried about people accidentally enabling it, then *each driver*
should have a 'Do you really mean it?' flag with *opposite* sense (so
that 'make allyesconfig' doesn't turn it on by accident).

Anything bigger than that, we probably want to redefine 'experimental'
as "it doesn't escape from -mm to mainline till it's ready".




Re: [PATCH] capabilities: introduce per-process capability bounding set (v10)

2007-11-26 Thread Andrew Morgan

This looks good to me.

[As you anticipated, there is a potential merge issue with Casey's
recent addition of MAC capabilities - which will make CAP_MAC_ADMIN the
highest allocated capability: ie.,

#define CAP_LAST_CAP CAP_MAC_ADMIN

].

Signed-off-by: Andrew G. Morgan <[EMAIL PROTECTED]>

Cheers

Andrew

Serge E. Hallyn wrote:
> From 22da6ccb1a24d1b6fa481d990a26197c6bfdfa77 Mon Sep 17 00:00:00 2001
> From: Serge E. Hallyn <[EMAIL PROTECTED]>
> Date: Mon, 19 Nov 2007 13:54:05 -0500
> Subject: [PATCH 1/1] capabilities: introduce per-process capability bounding set (v10)
> 
> The capability bounding set is a set beyond which capabilities
> cannot grow.  Currently cap_bset is per-system.  It can be
> manipulated through sysctl, but only init can add capabilities.
> Root can remove capabilities.  By default it includes all caps
> except CAP_SETPCAP.
> 
> This patch makes the bounding set per-process when file
> capabilities are enabled.  It is inherited at fork from parent.
> No one can add elements; CAP_SETPCAP is required to remove them.
> 
> One example use of this is to start a safer container.  For
> instance, until device namespaces or per-container device
> whitelists are introduced, it is best to take CAP_MKNOD away
> from a container.
> 
> The bounding set will not affect pP and pE immediately.  It will
> only affect pP' and pE' after subsequent exec()s.  It also does
> not affect pI, and exec() does not constrain pI'.  So to really
> start a shell with no way of regaining CAP_MKNOD, you would do
> 
>   prctl(PR_CAPBSET_DROP, CAP_MKNOD);
>   cap_t cap = cap_get_proc();
>   cap_value_t caparray[1];
>   caparray[0] = CAP_MKNOD;
>   cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
>   cap_set_proc(cap);
>   cap_free(cap);
> 
> The following test program will get and set the bounding
> set (but not pI).  For instance
> 
>   ./bset get
>   (lists capabilities in bset)
>   ./bset drop cap_net_raw
>   (starts shell with new bset)
>   (use capset, setuid binary, or binary with
>   file capabilities to try to increase caps)
> 
> 
> cap_bound.c
> 
>  #include <sys/prctl.h>
>  #include <stdio.h>
>  #include <string.h>
>  #include <unistd.h>
> 
>  #ifndef PR_CAPBSET_READ
>  #define PR_CAPBSET_READ 23
>  #endif
> 
>  #ifndef PR_CAPBSET_DROP
>  #define PR_CAPBSET_DROP 24
>  #endif
> 
> int usage(char *me)
> {
>   printf("Usage: %s get\n", me);
>   printf("   %s drop \n", me);
>   return 1;
> }
> 
>  #define numcaps 32
> char *captable[numcaps] = {
>   "cap_chown",
>   "cap_dac_override",
>   "cap_dac_read_search",
>   "cap_fowner",
>   "cap_fsetid",
>   "cap_kill",
>   "cap_setgid",
>   "cap_setuid",
>   "cap_setpcap",
>   "cap_linux_immutable",
>   "cap_net_bind_service",
>   "cap_net_broadcast",
>   "cap_net_admin",
>   "cap_net_raw",
>   "cap_ipc_lock",
>   "cap_ipc_owner",
>   "cap_sys_module",
>   "cap_sys_rawio",
>   "cap_sys_chroot",
>   "cap_sys_ptrace",
>   "cap_sys_pacct",
>   "cap_sys_admin",
>   "cap_sys_boot",
>   "cap_sys_nice",
>   "cap_sys_resource",
>   "cap_sys_time",
>   "cap_sys_tty_config",
>   "cap_mknod",
>   "cap_lease",
>   "cap_audit_write",
>   "cap_audit_control",
>   "cap_setfcap"
> };
> 
> int getbcap(void)
> {
>   int comma=0;
>   unsigned long i;
>   int ret;
> 
>   printf("i know of %d capabilities\n", numcaps);
>   printf("capability bounding set:");
>   for (i=0; i<numcaps; i++) {
>   ret = prctl(PR_CAPBSET_READ, i);
>   if (ret < 0)
>   perror("prctl");
>   else if (ret==1)
>   printf("%s%s", (comma++) ? ", " : " ", captable[i]);
>   }
>   printf("\n");
>   return 0;
> }
> 
> int capdrop(char *str)
> {
>   unsigned long i;
> 
>   int found=0;
>   for (i=0; i<numcaps; i++) {
>   if (strcmp(captable[i], str) == 0) {
>   found=1;
>   break;
>   }
>   }
>   if (!found)
>   return 1;
>   if (prctl(PR_CAPBSET_DROP, i)) {
>   perror("prctl");
>   return 1;
>   }
>   return 0;
> }
> 
> int main(int argc, char *argv[])
> {
>   if (argc<2)
>   return usage(argv[0]);
>   if (strcmp(argv[1], "get")==0)
>   return getbcap();
>   if (strcmp(argv[1], "drop")!=0 || argc<3)
>   return usage(argv[0]);
>   if (capdrop(argv[2])) {
>   printf("unknown capability\n");
>   return 1;
>   }
>   return execl("/bin/bash", "/bin/bash", NULL);
> }
> 
> 
> 

Re: [PATCH] fix plip 1

2007-11-26 Thread Linus Torvalds


On Thu, 22 Nov 2007, Mikulas Patocka wrote:
> 
> netif_rx is meant to be called from interrupts because it doesn't wake up 
> ksoftirqd. For calling from outside interrupts, netif_rx_ni exists.

Argh. Can you _please_ use more useful subject lines than "fix plip 1/2"?

Those subject lines are what becomes the single-line description of the 
problem, used by visualizers like gitk and gitweb. So "fix plip 1" is a 
singularly bad such line!

Which is why it should be something like

Subject: [PATCH 1/2] plip: use netif_rx_ni() for packet receive

or similar.. (My scripts will then get rid of the stuff in brackets, so 
all that is useful for giving information that is interesting while in 
*email*, but not when actually applied as a patch)

Linus
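
For illustration only, the bracket stripping described above could look roughly like this in C. This is a guess at the behaviour, not the actual scripts, and strip_bracket_prefix is a made-up name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Returns a pointer into subject just past any leading "[...]" groups
 * and the whitespace around them.  The input string is not modified.
 */
const char *strip_bracket_prefix(const char *subject)
{
	const char *close;

	assert(subject != NULL);
	for (;;) {
		while (isspace((unsigned char)*subject))
			subject++;
		if (*subject != '[')
			return subject;
		close = strchr(subject, ']');
		if (close == NULL)
			return subject;	/* unbalanced bracket: leave it alone */
		subject = close + 1;
	}
}
```

Given "[PATCH 1/2] plip: use netif_rx_ni() for packet receive", this returns the part starting at "plip:", which is what ends up as the one-line summary. A subject like "fix plip 1" comes back unchanged, and is just as unhelpful afterwards.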


Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Greg KH
On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote:
> +/*
> + *  Routines for reading of the iBFT data in a human readable fashion.
> + */
> +ssize_t ibft_attr_show_initiator(struct ibft_kobject *entry,
> +  struct ibft_attribute *attr,
> +  char *buf)
> +{
> + struct ibft_initiator *initiator = attr->initiator;
> + void *ibft_loc = entry->data->hdr;
> + char *str = buf;
> +
> + if (!initiator)
> + return 0;
> +
> + str += sprintf_ipaddr(str, "isns", initiator->isns_server);
> + str += sprintf_ipaddr(str, "slp", initiator->slp_server);
> + str += sprintf_ipaddr(str, "primary_radius_server",
> + initiator->pri_radius_server);
> + str += sprintf_ipaddr(str, "secondary_radius_server",
> + initiator->sec_radius_server);
> + str += sprintf_string(str, "itname", initiator->initiator_name_len,
> + (char *)ibft_loc + initiator->initiator_name_off);
> + str--;
> +
> + return str-buf;
> +}

sysfs files have ONE VALUE PER FILE, not a whole bunch of different
things in a single file.  Please fix this.


> +
> +ssize_t ibft_attr_show_nic(struct ibft_kobject *entry,
> +struct ibft_attribute *attr,
> +char *buf)
> +{
> + struct ibft_nic *nic = attr->nic;
> + void *ibft_loc = entry->data->hdr;
> + char *str = buf;
> +
> + if (!nic)
> + return 0;
> + /*
> +  * Assume dhcp if any non-zero portions of its address are set.
> +  */
> + if (memcmp(nic->dhcp, nulls, sizeof(nic->dhcp))) {
> + str += sprintf_ipaddr(str, "dhcp", nic->dhcp);
> + } else {
> + str += sprintf_ipaddr(str, "ciaddr", nic->ip_addr);
> + str += sprintf_ipaddr(str, "giaddr", nic->gateway);
> + str += sprintf_ipaddr(str, "dnsaddr1", nic->primary_dns);
> + str += sprintf_ipaddr(str, "dnsaddr2", nic->secondary_dns);
> + }
> + if (nic->hostname_len)
> + str += sprintf_string(str, "hostname", nic->hostname_len,
> + (char *)ibft_loc + nic->hostname_off);
> + /* Cut off the comma. */
> + str--;
> +
> + return str-buf;
> +}

Same here.

> +ssize_t ibft_attr_show_target(struct ibft_kobject *entry,
> +   struct ibft_attribute *attr,
> +   char *buf)
> +{
> + struct ibft_tgt *tgt = attr->tgt;
> + void *ibft_loc = entry->data->hdr;
> + char *str = buf;
> + int i;
> +
> + if (!tgt)
> + return 0;
> +
> + str += sprintf_ipaddr(str, "siaddr", tgt->ip_addr);
> + str += sprintf(str, "iport=%d,", tgt->port);
> + str += sprintf(str, "ilun=");
> + for (i = 0; i < 8; i++)
> + str += sprintf(str, "%x", (u8)tgt->lun[i]);
> + str += sprintf(str, ",");
> +
> + if (tgt->tgt_name_len)
> + str += sprintf_string(str, "iname", tgt->tgt_name_len,
> + (void *)ibft_loc + tgt->tgt_name_off);
> +
> + if (tgt->chap_name_len)
> + str += sprintf_string(str, "chapid", tgt->chap_name_len,
> + (char *)ibft_loc + tgt->chap_name_off);
> + if (tgt->chap_secret_len)
> + str += sprintf_string(str, "chappw", tgt->chap_secret_len,
> + (char *)ibft_loc + tgt->chap_secret_off);
> + if (tgt->rev_chap_name_len)
> + str += sprintf_string(str, "ichapid", tgt->rev_chap_name_len,
> + (char *)ibft_loc + tgt->rev_chap_name_off);
> + if (tgt->rev_chap_secret_len)
> + str += sprintf_string(str, "ichappw", tgt->rev_chap_secret_len,
> + (char *)ibft_loc + tgt->rev_chap_secret_off);
> +
> + /* Cut off the comma. */
> + str--;
> +
> + return str-buf;
> +}

Same here, are we writing a novella or something to userspace?  :)

> +ssize_t ibft_attr_show_disk(struct ibft_kobject *dev,
> + struct ibft_attribute *ibft_attr,
> + char *buf)
> +{
> + char *str = buf;
> +
> + str += sprintf(str, "//[EMAIL PROTECTED],%d:iscsi,", dev->data->index);
> + str += ibft_attr_show_initiator(dev, ibft_attr, str);
> + str += sprintf(str, ",");
> + str += ibft_attr_show_target(dev, ibft_attr, str);
> + str += sprintf(str, ",");
> + str += ibft_attr_show_nic(dev, ibft_attr, str);
> +
> + return str-buf;
> +}

And here, do I need to go on?

> +ssize_t ibft_attr_show_mac(struct ibft_kobject *entry,
> +struct ibft_attribute *attr,
> +char *buf)
> +{
> + struct ibft_nic *nic = attr->nic;
> + int len = 6;
> +
> + if (!nic)
> + return 0;
> +
> + memcpy(buf, attr->nic->mac, len);
> +
> + return len;
> +}

Is mac a user readable string?  Then perhaps a simple sprintf would work
instead, as I doubt you are including a \n here...
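
As a rough sketch of the "simple sprintf" idea, written as plain userspace C rather than as a sysfs show routine: format_mac is a hypothetical helper name, and the colon-separated lowercase hex layout is an assumption about what a readable MAC attribute would contain.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Formats a 6-byte MAC address as "aa:bb:cc:dd:ee:ff\n".
 * Returns the number of characters written (not counting the
 * terminating NUL), or -1 if the buffer is too small.
 */
int format_mac(char *buf, size_t size, const unsigned char mac[6])
{
	int n;

	assert(buf != NULL && mac != NULL);
	n = snprintf(buf, size, "%02x:%02x:%02x:%02x:%02x:%02x\n",
		     mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

The result is one value, readable as text, terminated by a newline, with the returned length covering exactly what was written.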

> +/*

Re: [PATCH] Add iSCSI IBFT Support (v0.3)

2007-11-26 Thread Greg KH
On Mon, Nov 26, 2007 at 06:56:42PM -0400, Konrad Rzeszutek wrote:
> 
> This patch adds /sysfs/firmware/ibft/[chosen|aliases|[EMAIL PROTECTED],X|[EMAIL PROTECTED],X]
> directories along with text properties which export the iSCSI Boot
> Firmware Table (iBFT) structure. The layout of the directories mirrors
> how PowerPC OpenBoot exports this data.
> 
> What is iSCSI Boot Firmware Table? It is a mechanism for the iSCSI
> tools to extract from the machine NICs the iSCSI connection information
> so that they can automagically mount the iSCSI share/target. Currently
> the iSCSI information is hard-coded in the initrd.
> 
> For full details of the IBFT structure please take a look at:
> ftp://ftp.software.ibm.com/systems/support/system_x_pdf/ibm_iscsi_boot_firmware_table_v1.02.pdf

As you are adding sysfs files in /sys/firmware, please add documentation
to Documentation/ABI as to what these files are, what they do, what is
in them, and what they are to be used for.

> + rc = firmware_register(_subsys);
> + if (rc)
> + return rc;

This function, as well as the whole decl_subsys() stuff is gone in my
tree and in -mm.  /sys/firmware is now just a simple kobject that you
are free to chain off of.  If you describe just what these sysfs
subdirectories and files are for and how they are going to be used, I'd
be glad to rework this patch to use the new interfaces.

thanks,

greg k-h


Re: [PATCH] -mm (2.6.24-rc3-mm1) v2 Smack using capabilities 32 and 33

2007-11-26 Thread Andrew Morgan

Signed-off-by: Andrew G. Morgan <[EMAIL PROTECTED]>

Cheers

Andrew

Casey Schaufler wrote:
> From: Casey Schaufler <[EMAIL PROTECTED]>
> 
> This patch takes advantage of the increase in capability bits
> to allocate capabilities for Mandatory Access Control. Whereas
> Smack was overloading a previously allocated capability it is
> now using a pair, one for overriding access control checks and
> the other for changes to the MAC configuration.
> 
> The two capabilities allocated should be obvious in their intent.
> The comments in capability.h are intended to make it clear that
> there is no intention that implementations of MAC LSM modules
> be any more constrained by the presence of these capabilities
> than an implementation of DAC LSM modules are by the analogous
> DAC capabilities.
> 
> 
> Signed-off-by: Casey Schaufler <[EMAIL PROTECTED]>
> 
> ---
> 
> The companion patch for libcap-2.02 is provided as an attachment.
> The attachment is not a kernel patch, although it would be easy to
> mistake it for one.
> 
> Introduces CAP_FS_MASK_B1 and uses it as appropriate. I think that
> I found all the places it needs to be used, but don't hesitate to
> let me know if I missed something.
> 
> Thank you.
> 
>  include/linux/capability.h |   24 ++--
>  security/smack/smack.h |8 
>  security/smack/smack_lsm.c |8 
>  security/smack/smackfs.c   |   12 ++--
>  4 files changed, 32 insertions(+), 20 deletions(-)
> 
> diff -uprN -X linux-2.6.24-rc3-mm1-base/Documentation/dontdiff linux-2.6.24-rc3-mm1-base/include/linux/capability.h linux-2.6.24-rc3-mm1-smack/include/linux/capability.h
> --- linux-2.6.24-rc3-mm1-base/include/linux/capability.h  2007-11-22 01:51:36.0 -0800
> +++ linux-2.6.24-rc3-mm1-smack/include/linux/capability.h 2007-11-25 21:38:34.0 -0800
> @@ -314,6 +314,23 @@ typedef struct kernel_cap_struct {
>  
>  #define CAP_SETFCAP   31
>  
> +/* Override MAC access.
> +   The base kernel enforces no MAC policy.
> +   An LSM may enforce a MAC policy, and if it does and it chooses
> +   to implement capability based overrides of that policy, this is
> +   the capability it should use to do so. */
> +
> +#define CAP_MAC_OVERRIDE 32
> +
> +/* Allow MAC configuration or state changes.
> +   The base kernel requires no MAC configuration.
> +   An LSM may enforce a MAC policy, and if it does and it chooses
> +   to implement capability based checks on modifications to that
> +   policy or the data required to maintain it, this is the
> +   capability it should use to do so. */
> +
> +#define CAP_MAC_ADMIN    33
> +
>  /*
>   * Bit location of each capability (used by user-space library and kernel)
>   */
> @@ -336,6 +353,8 @@ typedef struct kernel_cap_struct {
>   | CAP_TO_MASK(CAP_FOWNER)   \
>   | CAP_TO_MASK(CAP_FSETID))
>  
> +# define CAP_FS_MASK_B1 (CAP_TO_MASK(CAP_MAC_OVERRIDE))
> +
>  #if _LINUX_CAPABILITY_U32S != 2
>  # error Fix up hand-coded capability macro initializers
>  #else /* HAND-CODED capability initializers */
> @@ -343,8 +362,9 @@ typedef struct kernel_cap_struct {
>  # define CAP_EMPTY_SET    {{ 0, 0 }}
>  # define CAP_FULL_SET     {{ ~0, ~0 }}
>  # define CAP_INIT_EFF_SET {{ ~CAP_TO_MASK(CAP_SETPCAP), ~0 }}
> -# define CAP_FS_SET       {{ CAP_FS_MASK_B0, 0 }}
> -# define CAP_NFSD_SET     {{ CAP_FS_MASK_B0|CAP_TO_MASK(CAP_SYS_RESOURCE), 0 }}
> +# define CAP_FS_SET   {{ CAP_FS_MASK_B0, CAP_FS_MASK_B1 } }
> +# define CAP_NFSD_SET {{ CAP_FS_MASK_B0|CAP_TO_MASK(CAP_SYS_RESOURCE), \
> +  CAP_FS_MASK_B1 } }
>  
>  #endif /* _LINUX_CAPABILITY_U32S != 2 */
>  
> diff -uprN -X linux-2.6.24-rc3-mm1-base/Documentation/dontdiff linux-2.6.24-rc3-mm1-base/security/smack/smackfs.c linux-2.6.24-rc3-mm1-smack/security/smack/smackfs.c
> --- linux-2.6.24-rc3-mm1-base/security/smack/smackfs.c    2007-11-22 01:51:43.0 -0800
> +++ linux-2.6.24-rc3-mm1-smack/security/smack/smackfs.c   2007-11-24 11:29:29.0 -0800
> @@ -241,7 +241,7 @@ static ssize_t smk_write_load(struct fil
>* No partial writes.
>* Enough data must be present.
>*/
> - if (!capable(CAP_MAC_OVERRIDE))
> + if (!capable(CAP_MAC_ADMIN))
>   return -EPERM;
>   if (*ppos != 0)
>   return -EINVAL;
> @@ -474,7 +474,7 @@ static ssize_t smk_write_cipso(struct fi
>* No partial writes.
>* Enough data must be present.
>*/
> - if (!capable(CAP_MAC_OVERRIDE))
> + if (!capable(CAP_MAC_ADMIN))
>   return -EPERM;
>   if (*ppos != 0)
>   return -EINVAL;
> @@ -601,7 +601,7 @@ static ssize_t smk_write_doi(struct file
>   char temp[80];
>   int i;
>  
> - if (!capable(CAP_MAC_OVERRIDE))
> + if (!capable(CAP_MAC_ADMIN))
>   return -EPERM;
>  
>  

Re: [PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead

2007-11-26 Thread Richard Knutsson

Denis Cheng wrote:

Cc: Randy Dunlap <[EMAIL PROTECTED]>
Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---
this is against the latest cryptodev tree.

 crypto/tcrypt.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 1e12b86..ae762c2 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -87,12 +87,11 @@ static char *check[] = {
"camellia", "seed", "salsa20", NULL
 };
 
-static void hexdump(unsigned char *buf, unsigned int len)

+static inline void hexdump(unsigned char *buf, unsigned int len)
 {
-   while (len--)
-   printk("%02x", *buf++);
-
-   printk("\n");
+   print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
+   16, 1,
+   buf, len, 0);
  

Not important, but why use '0' instead of 'false'?

 }
 
 static void tcrypt_complete(struct crypto_async_request *req, int err)
  

cu
Richard Knutsson



[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [9/10] per zone lru for cgroup

2007-11-26 Thread KAMEZAWA Hiroyuki
This patch implements per-zone lru for memory cgroup.
This patch makes use of mem_cgroup_per_zone struct for per zone lru.

LRU can be accessed by

   mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
   &mz->active_list
   &mz->inactive_list

   or
   mz = page_cgroup_zoneinfo(page_cgroup);
   &mz->active_list
   &mz->inactive_list

Changelog v1->v2
  - merged to mem_cgroup_per_zone struct.
  - handle page migration.

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>


 mm/memcontrol.c |   63 ++--
 1 file changed, 39 insertions(+), 24 deletions(-)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-27 11:24:04.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c        2007-11-27 11:24:16.0 +0900
@@ -89,6 +89,8 @@
 };
 
 struct mem_cgroup_per_zone {
+   struct list_head        active_list;
+   struct list_head        inactive_list;
unsigned long count[NR_MEM_CGROUP_ZSTAT];
 };
 /* Macro for accessing counter */
@@ -122,10 +124,7 @@
/*
 * Per cgroup active and inactive list, similar to the
 * per zone LRU lists.
-* TODO: Consider making these lists per zone
 */
-   struct list_head active_list;
-   struct list_head inactive_list;
struct mem_cgroup_lru_info info;
/*
 * spin_lock to protect the per cgroup LRU
@@ -367,10 +366,10 @@
 
if (!to) {
MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
-   list_add(>lru, >mem_cgroup->inactive_list);
+   list_add(>lru, >inactive_list);
} else {
MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
-   list_add(>lru, >mem_cgroup->active_list);
+   list_add(>lru, >active_list);
}
mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
 }
@@ -388,11 +387,11 @@
if (active) {
MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
-   list_move(>lru, >mem_cgroup->active_list);
+   list_move(>lru, >active_list);
} else {
MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-   list_move(>lru, >mem_cgroup->inactive_list);
+   list_move(>lru, >inactive_list);
}
 }
 
@@ -518,11 +517,16 @@
LIST_HEAD(pc_list);
struct list_head *src;
struct page_cgroup *pc, *tmp;
+   int nid = z->zone_pgdat->node_id;
+   int zid = zone_idx(z);
+   struct mem_cgroup_per_zone *mz;
 
+   mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
if (active)
-   src = _cont->active_list;
+   src = >active_list;
else
-   src = _cont->inactive_list;
+   src = >inactive_list;
+
 
spin_lock(_cont->lru_lock);
scan = 0;
@@ -544,13 +548,6 @@
continue;
}
 
-   /*
-* Reclaim, per zone
-* TODO: make the active/inactive lists per zone
-*/
-   if (page_zone(page) != z)
-   continue;
-
scan++;
list_move(>lru, _list);
 
@@ -832,6 +829,8 @@
int count;
unsigned long flags;
 
+   if (list_empty(list))
+   return;
 retry:
count = FORCE_UNCHARGE_BATCH;
spin_lock_irqsave(>lru_lock, flags);
@@ -867,20 +866,27 @@
 int mem_cgroup_force_empty(struct mem_cgroup *mem)
 {
int ret = -EBUSY;
+   int node, zid;
css_get(>css);
/*
 * page reclaim code (kswapd etc..) will move pages between
 `   * active_list <-> inactive_list while we don't take a lock.
 * So, we have to do loop here until all lists are empty.
 */
-   while (!(list_empty(>active_list) &&
-list_empty(>inactive_list))) {
+   while (mem->res.usage > 0) {
if (atomic_read(>css.cgroup->count) > 0)
goto out;
-   /* drop all page_cgroup in active_list */
-   mem_cgroup_force_empty_list(mem, >active_list);
-   /* drop all page_cgroup in inactive_list */
-   mem_cgroup_force_empty_list(mem, >inactive_list);
+   for_each_node_state(node, N_POSSIBLE)
+   for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+   struct mem_cgroup_per_zone *mz;
+   mz = mem_cgroup_zoneinfo(mem, node, zid);
+   /* drop all page_cgroup in active_list */
+   mem_cgroup_force_empty_list(mem,
+   >active_list);
+   /* drop all 

[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [10/10] per-zone-lock for cgroup

2007-11-26 Thread KAMEZAWA Hiroyuki
Now, lru is per-zone.

Then, lru_lock can be (should be) per-zone, too.
This patch implements a per-zone lru lock.

lru_lock is placed into the mem_cgroup_per_zone struct.

The lock can be accessed by
   mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
   &mz->lru_lock

   or
   mz = page_cgroup_zoneinfo(page_cgroup);
   &mz->lru_lock


Signed-off-by: KAMEZAWA hiroyuki <[EMAIL PROTECTED]>

 mm/memcontrol.c |   71 ++--
 1 file changed, 44 insertions(+), 27 deletions(-)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-27 11:24:16.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c    2007-11-27 11:24:22.0 +0900
@@ -89,6 +89,10 @@
 };
 
 struct mem_cgroup_per_zone {
+   /*
+* spin_lock to protect the per cgroup LRU
+*/
+   spinlock_t  lru_lock;
struct list_headactive_list;
struct list_headinactive_list;
unsigned long count[NR_MEM_CGROUP_ZSTAT];
@@ -126,10 +130,7 @@
 * per zone LRU lists.
 */
struct mem_cgroup_lru_info info;
-   /*
-* spin_lock to protect the per cgroup LRU
-*/
-   spinlock_t lru_lock;
+
unsigned long control_type; /* control RSS or RSS+Pagecache */
int prev_priority;  /* for recording reclaim priority */
/*
@@ -410,15 +411,16 @@
  */
 void mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
-   struct mem_cgroup *mem;
+   struct mem_cgroup_per_zone *mz;
+   unsigned long flags;
+
if (!pc)
return;
 
-   mem = pc->mem_cgroup;
-
-   spin_lock(&mem->lru_lock);
+   mz = page_cgroup_zoneinfo(pc);
+   spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_move_lists(pc, active);
-   spin_unlock(&mem->lru_lock);
+   spin_unlock_irqrestore(&mz->lru_lock, flags);
 }
 
 /*
@@ -528,7 +530,7 @@
src = &mz->inactive_list;
 
 
-   spin_lock(&mem_cont->lru_lock);
+   spin_lock(&mz->lru_lock);
scan = 0;
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
@@ -558,7 +560,7 @@
}
 
list_splice(&pc_list, src);
-   spin_unlock(&mem_cont->lru_lock);
+   spin_unlock(&mz->lru_lock);
 
*scanned = scan;
return nr_taken;
@@ -577,6 +579,7 @@
struct page_cgroup *pc;
unsigned long flags;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+   struct mem_cgroup_per_zone *mz;
 
/*
 * Should page_cgroup's go to their own slab?
@@ -688,10 +691,11 @@
goto retry;
}
 
-   spin_lock_irqsave(&mem->lru_lock, flags);
+   mz = page_cgroup_zoneinfo(pc);
+   spin_lock_irqsave(&mz->lru_lock, flags);
/* Update statistics vector */
__mem_cgroup_add_list(pc);
-   spin_unlock_irqrestore(&mem->lru_lock, flags);
+   spin_unlock_irqrestore(&mz->lru_lock, flags);
 
 done:
return 0;
@@ -733,6 +737,7 @@
 void mem_cgroup_uncharge(struct page_cgroup *pc)
 {
struct mem_cgroup *mem;
+   struct mem_cgroup_per_zone *mz;
struct page *page;
unsigned long flags;
 
@@ -745,6 +750,7 @@
 
if (atomic_dec_and_test(&pc->ref_cnt)) {
page = pc->page;
+   mz = page_cgroup_zoneinfo(pc);
/*
 * get page->cgroup and clear it under lock.
 * force_empty can drop page->cgroup without checking refcnt.
@@ -753,9 +759,9 @@
mem = pc->mem_cgroup;
css_put(&mem->css);
res_counter_uncharge(&mem->res, PAGE_SIZE);
-   spin_lock_irqsave(&mem->lru_lock, flags);
+   spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_remove_list(pc);
-   spin_unlock_irqrestore(&mem->lru_lock, flags);
+   spin_unlock_irqrestore(&mz->lru_lock, flags);
kfree(pc);
}
}
@@ -794,24 +800,29 @@
struct page_cgroup *pc;
struct mem_cgroup *mem;
unsigned long flags;
+   struct mem_cgroup_per_zone *mz;
 retry:
pc = page_get_page_cgroup(page);
if (!pc)
return;
mem = pc->mem_cgroup;
+   mz = page_cgroup_zoneinfo(pc);
if (clear_page_cgroup(page, pc) != pc)
goto retry;
-
-   spin_lock_irqsave(&mem->lru_lock, flags);
+   spin_lock_irqsave(&mz->lru_lock, flags);
 
__mem_cgroup_remove_list(pc);
+   spin_unlock_irqrestore(&mz->lru_lock, flags);
+
pc->page = newpage;
lock_page_cgroup(newpage);
page_assign_page_cgroup(newpage, pc);
unlock_page_cgroup(newpage);
-   __mem_cgroup_add_list(pc);
 
-   spin_unlock_irqrestore(&mem->lru_lock, flags);
+   mz = page_cgroup_zoneinfo(pc);
+   

[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [8/10] modifies vmscan.c for isolate global/cgroup lru activity

2007-11-26 Thread KAMEZAWA Hiroyuki
When the memory controller is in use, there are 2 levels of memory reclaim.
 1. zone memory reclaim, triggered by a system/zone memory shortage.
 2. memory cgroup memory reclaim, triggered when a cgroup hits its limit.

These two can be distinguished by sc->mem_cgroup parameter.
(scan_global_lru() macro)
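
As a rough sketch of how that distinction can gate a side effect on the
zone: the macro body below is an assumption (the excerpt only names the
macro), the structs are cut-down stand-ins for the kernel's, and
note_scanned() is a hypothetical helper modelled on the pages_scanned hunk
in this patch.

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup { int id; };		/* opaque stand-in */

struct scan_control {
	struct mem_cgroup *mem_cgroup;	/* NULL for system/zone reclaim */
};

struct zone { unsigned long pages_scanned; };

/* Assumed shape of the macro: global reclaim carries no cgroup. */
#define scan_global_lru(sc)	((sc)->mem_cgroup == NULL)

/* Only global reclaim is allowed to bump the zone-wide scan counter. */
static void note_scanned(struct zone *zone, const struct scan_control *sc,
			 unsigned long nr_scan)
{
	assert(zone != NULL && sc != NULL);
	if (scan_global_lru(sc))
		zone->pages_scanned += nr_scan;
}
```

A cgroup-limited scan therefore leaves zone->pages_scanned untouched, so
hitting a cgroup limit does not make the zone itself look harder to
reclaim.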

This patch tries to keep the memory cgroup reclaim routine from affecting
system/zone memory reclaim. It inserts if (scan_global_lru()) checks and
hooks into the memory cgroup reclaim support functions.

This patch helps isolate system lru activity from group lru activity
and shows which additional functions are necessary.

 * mem_cgroup_calc_mapped_ratio() ... calculate mapped ratio for cgroup.
 * mem_cgroup_reclaim_imbalance() ... calculate active/inactive balance in
   cgroup.
 * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages to
   be scanned in this priority in mem_cgroup.
 * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive pages
   to be scanned in this priority in mem_cgroup.
 * mem_cgroup_all_unreclaimable() ... check whether all of the cgroup's pages
   are unreclaimable.
 * mem_cgroup_get_reclaim_priority() ...
 * mem_cgroup_note_reclaim_priority() ... record reclaim priority (temporary)
 * mem_cgroup_remember_reclaim_priority() ... record reclaim priority as
   zone->prev_priority. This value is used for calc reclaim_mapped.

Changelog V1->V2:
 - merged calc_reclaim_mapped patch in previous version.
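
For readers who want to see the arithmetic in the merged
calc_reclaim_mapped() step, here is a small stand-alone sketch. It only
reproduces the distress and swap_tendency terms visible in the hunk below;
the imbalance term is left out, and the ">= 100" threshold in
should_reclaim_mapped() is an assumption because that part of the hunk is
cut off in this message. swap_tendency() and should_reclaim_mapped() are
hypothetical helper names.

```c
#include <assert.h>

/*
 * distress: 0 means no trouble reclaiming, 100 means great trouble.
 * It grows as the (previous or current) reclaim priority drops towards 0.
 */
static long swap_tendency(long mapped_ratio, int prev_priority, int priority,
			  long swappiness)
{
	int p = prev_priority < priority ? prev_priority : priority;
	long distress;

	assert(p >= 0 && p < 31);	/* keep the shift well defined */
	distress = 100 >> p;
	return mapped_ratio / 2 + distress + swappiness;
}

/* Assumed decision rule: start reclaiming mapped pages at 100 or more. */
static int should_reclaim_mapped(long tendency)
{
	return tendency >= 100;
}
```

With mapped_ratio = 40 and swappiness = 60, priority 12 gives
40/2 + 0 + 60 = 80 (mapped pages are left alone), while priority 2 gives
40/2 + 25 + 60 = 105 (mapped pages become candidates).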

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 mm/vmscan.c |  326 
 1 file changed, 197 insertions(+), 129 deletions(-)

Index: linux-2.6.24-rc3-mm1/mm/vmscan.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/vmscan.c   2007-11-26 16:38:46.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/vmscan.c2007-11-26 16:42:38.0 +0900
@@ -863,7 +863,8 @@
__mod_zone_page_state(zone, NR_ACTIVE, -nr_active);
__mod_zone_page_state(zone, NR_INACTIVE,
-(nr_taken - nr_active));
-   zone->pages_scanned += nr_scan;
+   if (scan_global_lru(sc))
+   zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);
 
nr_scanned += nr_scan;
@@ -950,6 +951,113 @@
 }
 
 /*
+ * Determine we should try to reclaim mapped pages.
+ * This is called only when sc->mem_cgroup is NULL.
+ */
+static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone,
+   int priority)
+{
+   long mapped_ratio;
+   long distress;
+   long swap_tendency;
+   long imbalance;
+   int reclaim_mapped;
+   int prev_priority;
+
+   if (scan_global_lru(sc) && zone_is_near_oom(zone))
+   return 1;
+   /*
+* `distress' is a measure of how much trouble we're having
+* reclaiming pages.  0 -> no problems.  100 -> great trouble.
+*/
+   if (scan_global_lru(sc))
+   prev_priority = zone->prev_priority;
+   else
+   prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup);
+
+   distress = 100 >> min(prev_priority, priority);
+
+   /*
+* The point of this algorithm is to decide when to start
+* reclaiming mapped memory instead of just pagecache.  Work out
+* how much memory
+* is mapped.
+*/
+   if (scan_global_lru(sc))
+   mapped_ratio = ((global_page_state(NR_FILE_MAPPED) +
+   global_page_state(NR_ANON_PAGES)) * 100) /
+   vm_total_pages;
+   else
+   mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup);
+
+   /*
+* Now decide how much we really want to unmap some pages.  The
+* mapped ratio is downgraded - just because there's a lot of
+* mapped memory doesn't necessarily mean that page reclaim
+* isn't succeeding.
+*
+* The distress ratio is important - we don't want to start
+* going oom.
+*
+* A 100% value of vm_swappiness overrides this algorithm
+* altogether.
+*/
+   swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
+
+   /*
+* If there's huge imbalance between active and inactive
+* (think active 100 times larger than inactive) we should
+* become more permissive, or the system will take too much
+* cpu before it start swapping during memory pressure.
+* Distress is about avoiding early-oom, this is about
+* making swappiness graceful despite setting it to low
+* values.
+*
+ 

Dynticks Causing High Context Switch Rate in ksoftirqd

2007-11-26 Thread bdupree
Question: Why is ksoftirqd eating about 5 to 10 percent of my CPU on an idle
system? The problem occurs if I configure the kernel with tickless
support (i.e. CONFIG_TICK_ONESHOT=y).  (Thanks to "oprofile" for putting me
onto this.)

I have noted this same problem on kernel versions 2.6.23.1, 2.6.23.8 and
2.6.23.9.

**
*** Output from "vmstat -n 1 10" -- Note very high context switch rate ***
*** This is on an idle machine! ***
**

procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  0 1925556   4768 11610400   124 26  7538  1  2 96  1
 0  0  0 1925556   4768 11610400 0 02 147329  0  1 99  0
 0  0  0 1925548   4768 11610400 0 00 154515  0  1 99  0
 0  0  0 1925548   4768 11610400 0 01 153898  0  2 98  0
 0  0  0 1925548   4780 11610400 0163 155216  0  1 99  0
 0  0  0 1925548   4780 11610400 0 01 161718  0  1 99  0
 0  0  0 1925548   4780 11610400 0 00 147587  0  2 98  0
 0  0  0 1925548   4780 11610400 0 01 153524  0  2 98  0
 0  0  0 1925448   4780 11610400 0 00 153434  0  1 99  0
 0  0  0 1925448   4792 11609200 0164 153527  0  2 98  0


*** System Stats ***


 Distro: Slackware 10.2
 Mobo:   MSI MasterX FA6R E7210
 CPUs:   Dual 2.4 GHz P4 Xeons 400 MHz FSB - Hyperthreading enabled
 Mem:    2 GB ECC DDR PC 266


**
*** PCI Config ***
**

00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
00:03.0 PCI bridge: Intel Corporation 82875P/E7210 Processor to PCI to CSA Bridge (rev 02)
00:06.0 System peripheral: Intel Corporation 82875P/E7210 Processor to I/O Memory Interface (rev 02)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
02:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
03:09.0 Mass storage controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
03:0a.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller
03:0c.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4, v3] Physical PCI slot objects

2007-11-26 Thread Gary Hade
On Mon, Nov 26, 2007 at 03:22:53PM -0700, Alex Chiang wrote:
> Hi Gary, Kenji-san, et al.,
> 
> * Gary Hade <[EMAIL PROTECTED]>:
> > 
> > Alex, What I was trying to suggest is a boot-time kernel
> > option, not a kernel configuration option.  The basic idea is
> > to give the user (with a single binary kernel) the ability to
> > include your ACPI-PCI slot driver feature changes only when
> > they are really needed.  In addition to reducing the number of
> > system/PCI hotplug driver combinations where your changes would
> > need to be validated, I believe it would also help alleviate other
> > worries (e.g. Andi Kleen's memory consumption concern).  I
> > believe this goal could also be achieved with the kernel config
> > option by making the pci_slot module runtime loadable with the
> > PCI hotplug drivers only visiting your new code when the
> > pci_slot driver is loaded, although I think this would be more
> > difficult to implement.
> 
> I have modified my patch series so that the final patch that
> introduces my ACPI-PCI slot driver is a full-fledged module, that
> has a tristate Kconfig option.
> 
> It can be modprobe'd/rmmod'ed in any combination, and in any
> order with other PCI hotplug modules. There is no ordering
> dependency, even at module unload time, so you can safely rmmod
> pci_slot, and safely continue using features provided by the PCI
> hotplug drivers (acpiphp, pciehp, etc.). The opposite works too.

Nice!  I like the loadable module approach much better than my
boot-time kernel option suggestion.

> 
> The one limitation is that two separate hotplug drivers cannot
> both claim the same device (2nd module loaded will get -EBUSY
> errors), but I do not believe that is a regression from current
> behavior.

I cannot confirm this since the systems I am using only support
a single hotplug driver (acpiphp).

> 
> I have only tested with acpiphp and pciehp, as that's the only
> hardware I have, but I believe my code will play nicely with the
> other PCI hp drivers as well.

I have only tested your changes with acpiphp.

> 
> The patch series is fully bisectable, and the correct behavior
> occurs no matter which patch you happen to have applied.

Based on my testing (see below) this appears to be true.

> 
> I'll be sending v5 of patches 3 and 4 shortly (patches 1 and 2
> did not change). It is still based on 2.6.24-rc2, because I was
> too scared to do another git rebase while using stgit. :-/

I have been using 2.6.24-rc3 source for my testing.

> 
> > Also, I notice that even with your current CONFIG_ACPI_PCI_SLOT
> > implementation your numerous PCI hotplug driver changes (except
> > for only two places in pci_hotplug_core.c where there is 
> > `#ifndef CONFIG_ACPI_PCI_SLOT` and `#ifdef CONFIG_ACPI_PCI_SLOT`)
> > are _always_ exposed.  So, even with CONFIG_ACPI_PCI_SLOT disabled
> > there is IMO a need for testing of the affected PCI hotplug drivers
> > on more than a small number of isolated systems.
> 
> You are, of course, correct.
> 
> In my opinion, though, I would say most of the changes to the PCI
> hotplug drivers themselves are pretty straightforward, as in
> removing the different ways of getting the PCI address.
> 
> The scary part of the changes (aside from the ACPI-PCI slot
> driver) revolve around the new struct pci_slot, which is
> relatively self-contained, and only expose themselves via the
> pci_create_slot/pci_destroy_slot interfaces which only the PCI
> hotplug core cares about.

I think this sounds like a reasonable argument for not
doing what I was trying to suggest.

> 
> Regardless, your point stands. How do you suggest I get more
> testing time?

I am only able to test with acpiphp.  In addition to the
testing on the x3850 described below I would also like to
do some testing on an x3950 which has a mix of hotplug and
non-hotplug slots.  If this testing which I hope to complete
this week goes well, I will be satisfied.

I will let others speak for the other hotplug drivers
and platforms.

> Is this patchset appropriate for the -mm tree yet?

I would defer to our illustrious maintainers on this one. :)

> Or do you think it still needs more work?

I am now much more comfortable with your changes with respect
to acpiphp on the systems I worry about but others may have
concerns with respect to the other hotplug drivers, or even
acpiphp, on other systems.

> 
> > The good news is that I was able to test your v3 changes
> > (w/2.6.24-rc3 source) on our x3850 today with 'acpiphp' and,
> > except for the above mentioned inability to run-time
> > include/exclude them, they seemed to work fine.  The previous
> > boot-time ACPI error messages are gone and I was able to
> > successfully hot-remove and hot-add both PCI-X and PCIe
> > adapters.
> 
> Thanks for testing. Please let me know how v5 works for you too.

I just tried your v5 (1/4 v3, 2/4 v3, 3/4 v5, 4/4 v5) applied 
to 2.6.24-rc3 source with acpiphp on the x3850 and found 
nothing to complain about.  About time, eh? :)


Re: Fw: Re: [PATCH 1/3] signal(i386): alternative signal stack wraparound occurs

2007-11-26 Thread Roland McGrath
cf http://lkml.org/lkml/2007/10/3/41

To summarize: on Linux, SA_ONSTACK decides whether you are already on the
signal stack based on the value of the SP at the time of a signal.  If
you are not already inside the range, you are not "on the signal stack"
and so the new signal handler frame starts over at the base of the signal
stack.
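
A minimal user-space sketch of the starting point of that rule: a handler
installed with SA_ONSTACK, entered from the normal stack, should find its
frame inside the alternate stack. The 64 KiB size, the use of SIGUSR1 and
the helper name handler_runs_on_alt_stack() are choices made for this
illustration only; the address of a local variable is used as an
approximation of the SP.

```c
#define _DEFAULT_SOURCE		/* expose sigaltstack() under strict -std modes */
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_BYTES (64 * 1024)	/* illustrative size */

static char *alt_base;				/* lowest address of the alternate stack */
static volatile sig_atomic_t ran_on_alt;	/* written by the handler */

static void handler(int sig)
{
	char marker;	/* a local's address approximates the current SP */
	uintptr_t sp = (uintptr_t)&marker;

	(void)sig;
	ran_on_alt = sp >= (uintptr_t)alt_base &&
		     sp < (uintptr_t)alt_base + ALT_STACK_BYTES;
}

/* Returns 1 if the handler ran on the alternate stack, 0 if not, -1 on error. */
static int handler_runs_on_alt_stack(void)
{
	stack_t ss, old_ss;
	struct sigaction sa, old_sa;
	int result = -1;

	alt_base = malloc(ALT_STACK_BYTES);
	if (alt_base == NULL)
		return -1;

	ss.ss_sp = alt_base;
	ss.ss_size = ALT_STACK_BYTES;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, &old_ss) != 0)
		goto out_free;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_ONSTACK;
	if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
		goto out_stack;

	ran_on_alt = 0;
	raise(SIGUSR1);		/* handled before raise() returns */
	result = ran_on_alt;

	sigaction(SIGUSR1, &old_sa, NULL);
out_stack:
	sigaltstack(&old_ss, NULL);	/* restore the previous setting */
out_free:
	free(alt_base);
	return result;
}
```

The sketch does not exercise the overflow case discussed below; it only
shows where a fresh handler frame is placed when the SP is outside the
alternate stack at delivery time.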

sigaltstack (and sigstack before it) was invented in BSD.  There, the
SA_ONSTACK behavior has always been different.  It uses a kernel state
flag to decide, rather than the SP value.  When you first take an
SA_ONSTACK signal and switch to the alternate signal stack, it sets the
SS_ONSTACK flag in the thread's sigaltstack state in the kernel.
Thereafter you are "on the signal stack" and don't switch SP before
pushing a handler frame no matter what the SP value is.  Only when you
sigreturn from the original handler context do you clear the SS_ONSTACK
flag so that a new handler frame will start over at the base of the
alternate signal stack.

The undesirable effect of the Linux behavior is that an overflow of the
alternate signal stack can not only go undetected, but lead to a ring
buffer effect of clobbering the original handler frame at the base of the
signal stack for each successive signal that comes just after the
overflow.  This is what Shi Weihua's test case demonstrates.  Normally
this does not come up because of the signal mask, but the test case uses
SA_NODEFER for its SIGSEGV handler.

The other subtle part of the existing Linux semantics is that a simple
longjmp out of a signal handler serves to take you off the signal stack
in a safe and reliable fashion without having used sigreturn (nor having
just returned from the handler normally, which means the same).  After
the longjmp (or even informal stack switching not via any proper libc or
kernel interface), the alternate signal stack stands ready to be used
again.

A paranoid program would allocate a PROT_NONE red zone around its
alternate signal stack.  Then a small overflow would trigger a SIGSEGV in
handler setup, and be fatal (core dump) whether or not SIGSEGV is
blocked.  As with thread stack red zones, that cannot catch all overflows
(or underflows).  e.g., a local array as large as page size allocated in
a function called from a handler, but not actually touched before more
calls push more stack, could cause an overflow that silently pushes into
some unrelated allocated pages.
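
As a sketch of what such a paranoid program might do (not something taken
from the test case under discussion): map the alternate stack with one
inaccessible page on each side, so that a small overflow or underflow
faults instead of silently touching neighbouring memory.
install_guarded_altstack() is a hypothetical helper name, and the layout of
one guard page per side is an assumption.

```c
#define _DEFAULT_SOURCE		/* MAP_ANONYMOUS and sigaltstack() on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map [guard page][usable stack][guard page], leave both guard pages
 * PROT_NONE, and register the middle part as the alternate signal stack.
 * Returns the base of the usable part, or NULL on failure.
 */
static void *install_guarded_altstack(size_t stack_bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t usable, total;
	char *region;
	stack_t ss;

	assert(stack_bytes > 0);
	if (page <= 0)
		return NULL;

	/* Round the usable size up to a whole number of pages. */
	usable = (stack_bytes + (size_t)page - 1) / (size_t)page * (size_t)page;
	total = usable + 2 * (size_t)page;

	region = mmap(NULL, total, PROT_NONE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return NULL;

	/* Open up only the middle; the first and last page stay PROT_NONE. */
	if (mprotect(region + page, usable, PROT_READ | PROT_WRITE) != 0)
		goto fail;

	ss.ss_sp = region + page;
	ss.ss_size = usable;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) != 0)
		goto fail;

	return region + page;
fail:
	munmap(region, total);
	return NULL;
}
```

As noted above, this catches only overflows that actually touch a guard
page; a large untouched local array can still step over it.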

The BSD behavior does not do anything in particular about overflow.  But
it does at least avoid the wraparound or "ring buffer effect", so you'll
just get a straightforward all-out overflow down your address space past
the low end of the alternate signal stack.  I don't know what the BSD
behavior is for longjmp out of an SA_ONSTACK handler.

The POSIX wording relating to sigaltstack is pretty minimal.  I don't
think it speaks to this issue one way or another.  (The program that
overflows its stack is clearly in undefined behavior territory of one
sort or another anyhow.)

Given the longjmp issue and the potential for highly subtle complications
in existing programs relying on this in arcane ways deep in their code, I
am very dubious about changing the behavior to the BSD style persistent
flag.  I think Shi Weihua's patches have a similar effect by tracking the
SP used in the last handler setup.

I think it would be sensible for the signal handler setup code to detect
when it would itself be causing a stack overflow.  Maybe something like
the following patch (untested).  This issue exists in the same way on all
machines, so ideally they would all do a similar check.  

When it's the handler function itself or its callees that cause the
overflow, rather than the signal handler frame setup alone crossing the
boundary, this still won't help.  But I don't see any way to distinguish
that from the valid longjmp case.
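
To make the intended check concrete, here is a user-space model of it. It
is a sketch only: struct altstack and frame_would_overflow() are
hypothetical names, and on_sig_stack() here is a plain range test written
for this example rather than the kernel's helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the per-thread alternate stack description (base and size). */
struct altstack {
	uintptr_t base;
	size_t size;
};

/* Is sp inside [base, base + size)? */
static int on_sig_stack(const struct altstack *s, uintptr_t sp)
{
	return sp >= s->base && sp - s->base < s->size;
}

/*
 * The stack grows down, so a new frame occupies [sp - frame_size, sp).
 * Report an overflow only when we start on the alternate stack and the
 * frame's low end would fall outside it.
 */
static int frame_would_overflow(const struct altstack *s, uintptr_t sp,
				size_t frame_size)
{
	assert(frame_size <= sp);	/* no wrap-around in this model */
	return on_sig_stack(s, sp) && !on_sig_stack(s, sp - frame_size);
}
```

For an alternate stack at 0x10000 of size 0x2000, a 0x200-byte frame fits
at sp = 0x11000, overflows at sp = 0x10100, and is not flagged at all when
sp = 0x20000 is on the normal stack.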


Thanks,
Roland

---

diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index d58d455..000 100644  
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -295,6 +295,13 @@ get_sigframe(struct k_sigaction *ka, str
/* Default to using normal stack */
esp = regs->esp;
 
+   /*
+* If we are on the alternate signal stack and would overflow it, don't.
+* Return an always-bogus address instead so we will die with SIGSEGV.
+*/
+   if (on_sig_stack(esp) && !likely(on_sig_stack(esp - frame_size)))
+   return (void __user *) -1L;
+
/* This is the X/Open sanctioned signal stack switching.  */
if (ka->sa.sa_flags & SA_ONSTACK) {
if (sas_ss_flags(esp) == 0)


[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [7/10] calculate the number of pages to be scanned per cgroup

2007-11-26 Thread KAMEZAWA Hiroyuki
Define functions for calculating the number of pages to scan on each zone/LRU.

Changelog V1->V2.
 - fixed types of variable.
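
A quick worked example of the calculation: the scan target is the per-zone
page count shifted right by the reclaim priority, so each drop in priority
doubles the number of pages looked at. DEF_PRIORITY is 12 in
include/linux/mmzone.h; calc_scan_target() is a hypothetical stand-alone
helper, not a function from this patch.

```c
#include <assert.h>

#define DEF_PRIORITY 12	/* starting priority, as in include/linux/mmzone.h */

/* Pages to scan for one zone/LRU at the given priority (0 = scan all). */
static long calc_scan_target(unsigned long nr_pages, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return (long)(nr_pages >> priority);
}
```

For a list of 1048576 pages this gives 256 pages at priority 12 and the
whole list at priority 0; a list of 1000 pages yields 0 at priority 12,
which means small cgroups are not scanned until the priority drops.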

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 include/linux/memcontrol.h |   15 +++
 mm/memcontrol.c|   33 +
 2 files changed, 48 insertions(+)

Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h
===
--- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h    2007-11-27 11:22:14.0 +0900
+++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-27 11:22:51.0 +0900
@@ -73,6 +73,10 @@
 extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
int priority);
 
+extern long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
+   struct zone *zone, int priority);
+extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
+   struct zone *zone, int priority);
 
 #else /* CONFIG_CGROUP_MEM_CONT */
 static inline void mm_init_cgroup(struct mm_struct *mm,
@@ -173,6 +177,17 @@
return 0;
 }
 
+static inline long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
+   struct zone *zone, int priority)
+{
+   return 0;
+}
+
+static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
+   struct zone *zone, int priority)
+{
+   return 0;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-27 11:22:14.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c    2007-11-27 11:24:04.0 +0900
@@ -472,6 +472,39 @@
mem->prev_priority = priority;
 }
 
+/*
+ * Calculate # of pages to be scanned in this priority/zone.
+ * See also vmscan.c
+ *
+ * priority starts from "DEF_PRIORITY" and decremented in each loop.
+ * (see include/linux/mmzone.h)
+ */
+
+long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
+  struct zone *zone, int priority)
+{
+   long nr_active;
+   int nid = zone->zone_pgdat->node_id;
+   int zid = zone_idx(zone);
+   struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
+
+   nr_active = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE);
+   return (nr_active >> priority);
+}
+
+long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
+   struct zone *zone, int priority)
+{
+   long nr_inactive;
+   int nid = zone->zone_pgdat->node_id;
+   int zid = zone_idx(zone);
+   struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
+
+   nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
+
+   return (nr_inactive >> priority);
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,



[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [6/10] remember reclaim priority in memory cgroup

2007-11-26 Thread KAMEZAWA Hiroyuki
Functions to remember reclaim priority per cgroup (as zone->prev_priority)
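
The difference between the "note" and "record" helpers in this patch is
easy to miss, so here is a tiny stand-alone model. The struct is reduced to
the one field involved; the function bodies follow the patch text below.

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup {
	int prev_priority;	/* reduced stand-in for the real struct */
};

/* "note": only ever lowers the remembered priority (more pressure seen). */
static void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
					     int priority)
{
	assert(mem != NULL);
	if (priority < mem->prev_priority)
		mem->prev_priority = priority;
}

/* "record": unconditionally overwrites it, e.g. once a reclaim pass ends. */
static void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
						int priority)
{
	assert(mem != NULL);
	mem->prev_priority = priority;
}
```

Starting from 12, noting 10 and then 11 leaves 10 in place; recording 12
afterwards resets it.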

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 include/linux/memcontrol.h |   23 +++
 mm/memcontrol.c|   20 
 2 files changed, 43 insertions(+)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-27 11:19:51.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c    2007-11-27 11:22:14.0 +0900
@@ -132,6 +132,7 @@
 */
spinlock_t lru_lock;
unsigned long control_type; /* control RSS or RSS+Pagecache */
+   int prev_priority;  /* for recording reclaim priority */
/*
 * statistics.
 */
@@ -452,6 +453,25 @@
return (long) (active / (inactive + 1));
 }
 
+/*
+ * prev_priority control...this will be used in memory reclaim path.
+ */
+int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
+{
+   return mem->prev_priority;
+}
+
+void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority)
+{
+   if (priority < mem->prev_priority)
+   mem->prev_priority = priority;
+}
+
+void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
+{
+   mem->prev_priority = priority;
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h
===
--- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h    2007-11-27 11:19:00.0 +0900
+++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-27 11:22:14.0 +0900
@@ -67,6 +67,11 @@
 extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
 extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
 
+extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
+extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
+   int priority);
+extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
+   int priority);
 
 
 #else /* CONFIG_CGROUP_MEM_CONT */
@@ -150,6 +155,24 @@
return 0;
 }
 
+static inline int
+mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
+{
+   return 0;
+}
+
+static inline void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
+   int priority)
+{
+   return;
+}
+
+static inline void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
+   int priority)
+{
+   return;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */



[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [5/10] calculate active/inactive imbalance per cgroup

2007-11-26 Thread KAMEZAWA Hiroyuki
Calculate active/inactive imbalance per memory cgroup.

Changelog V1 -> V2:
 - removed "total" (just count inactive and active)
 - fixed comment
 - fixed return type to be "long".
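
The imbalance value is a plain integer ratio; the "+ 1" in the divisor
guards against an empty inactive list. A stand-alone sketch, where
reclaim_imbalance() is a hypothetical helper taking the two page counts
directly instead of reading them from a mem_cgroup:

```c
#include <assert.h>

/* Ratio of active to inactive pages, rounded down, never dividing by 0. */
static long reclaim_imbalance(unsigned long active, unsigned long inactive)
{
	assert(inactive + 1 != 0);	/* guard against wrap-around */
	return (long)(active / (inactive + 1));
}
```

For example, 1000 active pages against 9 inactive pages gives 100, and 99
against 99 gives 0 because of the integer division.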

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 include/linux/memcontrol.h |8 
 mm/memcontrol.c|   14 ++
 2 files changed, 22 insertions(+)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-27 10:44:19.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c    2007-11-27 11:19:51.0 +0900
@@ -437,6 +437,20 @@
rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
return (int)((rss * 100L) / total);
 }
+/*
+ * This function is called from vmscan.c. In the page reclaiming loop, the
+ * balance between the active and inactive lists is calculated. For memory
+ * controller page reclaiming, we should use mem_cgroup's imbalance rather
+ * than the zone's global lru imbalance.
+ */
+long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem)
+{
+   unsigned long active, inactive;
+   /* active and inactive are the number of pages. 'long' is ok.*/
+   active = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_ACTIVE);
+   inactive = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_INACTIVE);
+   return (long) (active / (inactive + 1));
+}
 
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h
===
--- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h    2007-11-27 10:44:19.0 +0900
+++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-27 11:19:00.0 +0900
@@ -65,6 +65,8 @@
  * For memory reclaim.
  */
 extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
+extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
+
 
 
 #else /* CONFIG_CGROUP_MEM_CONT */
@@ -142,6 +144,12 @@
 {
return 0;
 }
+
+static inline long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem)
+{
+   return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */



[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [4/10] calculate mapped_ratio per cgroup

2007-11-26 Thread KAMEZAWA Hiroyuki
Define function for calculating mapped_ratio in memory cgroup.

Changelog V1->V2
 - Fixed possible divide-by-zero bug.
 - Use "long" to avoid 64-bit division on 32-bit systems,
   and do the necessary type casts.
 - Added comments.
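
A worked example of the mapped_ratio arithmetic, including the "+ 1" that
fixes the divide-by-zero case mentioned in the changelog. PAGE_SHIFT = 12
(4 KiB pages) is an assumption for this sketch, and calc_mapped_ratio() is
a hypothetical helper taking the usage and RSS values directly.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumes 4 KiB pages */

/*
 * usage_bytes: charged memory in bytes; rss_pages: mapped pages.
 * Returns the percentage of charged pages that are mapped.
 */
static int calc_mapped_ratio(unsigned long long usage_bytes, long rss_pages)
{
	long total = (long)(usage_bytes >> PAGE_SHIFT) + 1L;	/* never 0 */

	assert(rss_pages >= 0);
	return (int)((rss_pages * 100L) / total);
}
```

With 400 pages charged and 200 of them mapped, total is 401 and the result
is 20000 / 401 = 49; an empty cgroup yields 0 instead of a division by
zero.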

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

 include/linux/memcontrol.h |   11 ++-
 mm/memcontrol.c|   17 +
 2 files changed, 27 insertions(+), 1 deletion(-)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-26 16:39:02.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c    2007-11-26 16:41:34.0 +0900
@@ -421,6 +421,23 @@
spin_unlock(&mem->lru_lock);
 }
 
+/*
+ * Calculate mapped_ratio under memory controller. This will be used in
+ * vmscan.c for determining whether we have to reclaim mapped pages.
+ */
+int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
+{
+   long total, rss;
+
+   /*
+* usage is recorded in bytes. But, here, we assume the number of
+* physical pages can be represented by "long" on any arch.
+*/
+   total = (long) (mem->res.usage >> PAGE_SHIFT) + 1L;
+   rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
+   return (int)((rss * 100L) / total);
+}
+
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
Index: linux-2.6.24-rc3-mm1/include/linux/memcontrol.h
===
--- linux-2.6.24-rc3-mm1.orig/include/linux/memcontrol.h    2007-11-26 15:31:19.0 +0900
+++ linux-2.6.24-rc3-mm1/include/linux/memcontrol.h 2007-11-26 16:39:05.0 +0900
@@ -61,6 +61,12 @@
 extern void mem_cgroup_end_migration(struct page *page);
 extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
 
+/*
+ * For memory reclaim.
+ */
+extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
+
+
 #else /* CONFIG_CGROUP_MEM_CONT */
 static inline void mm_init_cgroup(struct mm_struct *mm,
struct task_struct *p)
@@ -132,7 +138,10 @@
 {
 }
 
-
+static inline int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
+{
+   return 0;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */



[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [3/10] per-zone active inactive counter

2007-11-26 Thread KAMEZAWA Hiroyuki
Counting active/inactive per-zone in memory controller.

This patch adds per-zone status in memory cgroup.
These values are often read (as per-zone value) by page reclaiming.

In the current design, per-zone stat is just an unsigned long value and
not an atomic value because they are modified only under lru_lock.
(So, atomic_ops is not necessary.)

This patch adds ACTIVE and INACTIVE per-zone status values.

For handling per-zone status, this patch adds
  struct mem_cgroup_per_zone {
...
  }
and some helper functions. This will be useful to add per-zone objects
in mem_cgroup.

This patch sets the memory controller's early_init to 0 so that 
kmalloc() can be called during initialization.
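The per-node/per-zone layout described here can be modelled in userspace roughly as follows. This is only a sketch: the type and function names, and the small MAX_NODES/MAX_ZONES constants, are stand-ins chosen for illustration, and the locking (lru_lock in the patch) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 4   /* assumption: stand-in for MAX_NUMNODES */
#define MAX_ZONES 3   /* assumption: stand-in for MAX_NR_ZONES */

enum zstat_index { ZSTAT_ACTIVE, ZSTAT_INACTIVE, NR_ZSTAT };

struct per_zone { unsigned long count[NR_ZSTAT]; };
struct per_node { struct per_zone zoneinfo[MAX_ZONES]; };
struct lru_info { struct per_node *nodeinfo[MAX_NODES]; };

/* Return the per-zone slot, or NULL if the node was never allocated. */
static struct per_zone *zoneinfo(struct lru_info *info, int nid, int zid)
{
    if (!info->nodeinfo[nid])
        return NULL;
    return &info->nodeinfo[nid]->zoneinfo[zid];
}

/* Sum one counter over every allocated node and zone. */
static unsigned long sum_zonestat(struct lru_info *info, enum zstat_index idx)
{
    unsigned long total = 0;

    for (int nid = 0; nid < MAX_NODES; nid++)
        for (int zid = 0; zid < MAX_ZONES; zid++) {
            struct per_zone *mz = zoneinfo(info, nid, zid);

            if (mz)     /* skip nodes without per-node data */
                total += mz->count[idx];
        }
    return total;
}
```

The per-node blocks are allocated separately, which is why an allocator has to be usable at initialization time. The NULL check in the summing loop is a choice made in this sketch, not something taken from the patch.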

Changelog V2 -> V3
  - fixed comments.

Changelog V1 -> V2
  - added mem_cgroup_per_zone struct.
  This will help following patches to implement per-zone objects and
  pack them into a struct.
  - added __mem_cgroup_add_list() and __mem_cgroup_remove_list()
  - fixed page migration handling.
  - renamed zstat to info (per-zone-info)
    This will be the place for per-zone information (lru, lock, ...)
  - use page_cgroup_nid()/zid() funcs.

Acked-by: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>


 mm/memcontrol.c |  164 +---
 1 file changed, 157 insertions(+), 7 deletions(-)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-26 16:39:00.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c 2007-11-26 16:39:02.0 +0900
@@ -78,6 +78,31 @@
 }
 
 /*
+ * per-zone information in memory controller.
+ */
+
+enum mem_cgroup_zstat_index {
+   MEM_CGROUP_ZSTAT_ACTIVE,
+   MEM_CGROUP_ZSTAT_INACTIVE,
+
+   NR_MEM_CGROUP_ZSTAT,
+};
+
+struct mem_cgroup_per_zone {
+   unsigned long count[NR_MEM_CGROUP_ZSTAT];
+};
+/* Macro for accessing counter */
+#define MEM_CGROUP_ZSTAT(mz, idx)  ((mz)->count[(idx)])
+
+struct mem_cgroup_per_node {
+   struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_lru_info {
+   struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
+};
+
+/*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
  * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -101,6 +126,7 @@
 */
struct list_head active_list;
struct list_head inactive_list;
+   struct mem_cgroup_lru_info info;
/*
 * spin_lock to protect the per cgroup LRU
 */
@@ -158,6 +184,7 @@
MEM_CGROUP_CHARGE_TYPE_MAPPED,
 };
 
+
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
@@ -173,7 +200,39 @@
MEM_CGROUP_STAT_CACHE, val);
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
+}
 
+static inline struct mem_cgroup_per_zone *
+mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
+{
+   if (!mem->info.nodeinfo[nid])
+   return NULL;
+   return &mem->info.nodeinfo[nid]->zoneinfo[zid];
+}
+
+static inline struct mem_cgroup_per_zone *
+page_cgroup_zoneinfo(struct page_cgroup *pc)
+{
+   struct mem_cgroup *mem = pc->mem_cgroup;
+   int nid = page_cgroup_nid(pc);
+   int zid = page_cgroup_zid(pc);
+
+   return mem_cgroup_zoneinfo(mem, nid, zid);
+}
+
+static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
+   enum mem_cgroup_zstat_index idx)
+{
+   int nid, zid;
+   struct mem_cgroup_per_zone *mz;
+   u64 total = 0;
+
+   for_each_online_node(nid)
+   for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+   mz = mem_cgroup_zoneinfo(mem, nid, zid);
+   total += MEM_CGROUP_ZSTAT(mz, idx);
+   }
+   return total;
 }
 
 static struct mem_cgroup init_mem_cgroup;
@@ -286,12 +345,51 @@
return ret;
 }
 
+static void __mem_cgroup_remove_list(struct page_cgroup *pc)
+{
+   int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+   struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+
+   if (from)
+   MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
+   else
+   MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+
+   mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
+   list_del_init(&pc->lru);
+}
+
+static void __mem_cgroup_add_list(struct page_cgroup *pc)
+{
+   int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+   struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+
+   if (!to) {
+   MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
+   list_add(&pc->lru, &pc->mem_cgroup->inactive_list);
+   } else {
+   MEM_CGROUP_ZSTAT(mz, 

[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [2/10] nid/zid helper function for cgroup

2007-11-26 Thread KAMEZAWA Hiroyuki
Add helper functions to get the node_id and zone_id of a page_cgroup.
They will be used in the per-zone-xxx patches and others.

Changelog:
 - returns zone_type instead of int.

Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>


 mm/memcontrol.c |   10 ++
 1 file changed, 10 insertions(+)

Index: linux-2.6.24-rc3-mm1/mm/memcontrol.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/memcontrol.c   2007-11-26 15:31:19.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/memcontrol.c 2007-11-26 16:39:00.0 +0900
@@ -135,6 +135,16 @@
 #define PAGE_CGROUP_FLAG_CACHE (0x1)   /* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)  /* page is active in this cgroup */
 
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+   return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+   return page_zonenum(pc->page);
+}
+
 enum {
MEM_CGROUP_TYPE_UNSPEC = 0,
MEM_CGROUP_TYPE_MAPPED,



[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [1/10] add scan_global_lru macro

2007-11-26 Thread KAMEZAWA Hiroyuki
Add macro scan_global_lru().

This is used to detect whether a scan_control scans the global lru or a
mem_cgroup lru. It is compiled to a static value (1) when the
memory controller is not configured. This may make the meaning obvious.
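A small standalone sketch of how such a macro behaves is given below. The reduced struct scan_control and the locally defined CONFIG_CGROUP_MEM_CONT are assumptions for illustration only; in the kernel the option comes from the build configuration.

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup;                 /* opaque in this sketch */

struct scan_control {
    struct mem_cgroup *mem_cgroup; /* NULL means global reclaim */
};

#define CONFIG_CGROUP_MEM_CONT 1   /* assumption: controller enabled */

#ifdef CONFIG_CGROUP_MEM_CONT
#define scan_global_lru(sc) (!(sc)->mem_cgroup)
#else
#define scan_global_lru(sc) (1)
#endif
```

When the option is not defined, the macro never touches sc->mem_cgroup, so the compiler can drop the cgroup-only branches as dead code.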

Acked-by: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>


 mm/vmscan.c |   17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

Index: linux-2.6.24-rc3-mm1/mm/vmscan.c
===
--- linux-2.6.24-rc3-mm1.orig/mm/vmscan.c   2007-11-26 15:31:19.0 +0900
+++ linux-2.6.24-rc3-mm1/mm/vmscan.c 2007-11-26 16:38:46.0 +0900
@@ -127,6 +127,12 @@
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+#ifdef CONFIG_CGROUP_MEM_CONT
+#define scan_global_lru(sc)(!(sc)->mem_cgroup)
+#else
+#define scan_global_lru(sc)(1)
+#endif
+
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -1290,11 +1296,12 @@
 * Don't shrink slabs when reclaiming memory from
 * over limit cgroups
 */
-   if (sc->mem_cgroup == NULL)
+   if (scan_global_lru(sc)) {
shrink_slab(sc->nr_scanned, gfp_mask, lru_pages);
-   if (reclaim_state) {
-   nr_reclaimed += reclaim_state->reclaimed_slab;
-   reclaim_state->reclaimed_slab = 0;
+   if (reclaim_state) {
+   nr_reclaimed += reclaim_state->reclaimed_slab;
+   reclaim_state->reclaimed_slab = 0;
+   }
}
total_scanned += sc->nr_scanned;
if (nr_reclaimed >= sc->swap_cluster_max) {
@@ -1321,7 +1328,7 @@
congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_caches still had more to do? don't OOM, then */
-   if (!sc->all_unreclaimable && sc->mem_cgroup == NULL)
+   if (!sc->all_unreclaimable && scan_global_lru(sc))
ret = 1;
 out:
/*



[PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [0/10] introduction

2007-11-26 Thread KAMEZAWA Hiroyuki
Hi, this is per-zone/reclaim support patch set for memory controller (cgroup).

Major changes from the previous one are
 -- tested with 2.6.24-rc3-mm1 + ia64/NUMA
 -- applied comments.

I did a small test on a real NUMA machine.
My machine was ia64/8CPU/2Node NUMA. I tried to compile the kernel under an
800M bytes limit with 32 parallel makes (make -j 32).

 - 2.6.24-rc3-mm1 (+ scsi fix) shows a soft lock-up.
   Before the soft lock-up, %sys was almost 100% several times.

 - 2.6.24-rc3-mm1 (+ scsi fix) + this set completed successfully.
   It seems %iowait dominates the total performance.
   (the current memory controller has no background reclaim)

It seems this set gives us some progress.

(*) I'd like to merge YAMAMOTO-san's background page reclaim for the memory
controller before discussing performance numbers.

Andrew, could you pick these up to -mm ?

Patch series brief description:

[1/10] ... add scan_global_lru() macro  (clean up)
[2/10] ... nid/zid helper function for cgroup
[3/10] ... introduce per-zone object for memory controller and add
   active/inactive counter.
[4/10] ... calculate mapped_ratio per cgroup (for memory reclaim)
[5/10] ... calculate active/inactive imbalance per cgroup (based on [3])
[6/10] ... remember reclaim priority in memory controller
[7/10] ... calculate the number of pages to be reclaimed per cgroup

[8/10] ... modifies vmscan.c to isolate global-lru-reclaim and
   memory-cgroup-reclaim in obvious manner.
   (this patch uses functions defined in [4 - 7])
[9/10] ... implement per-zone-lru for cgroup (based on [3])
[10/10] ... implement per-zone lru lock for cgroup (based on [3][9])

Any comments are welcome.

Thanks,
-Kame
 





[PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead

2007-11-26 Thread Denis Cheng
Cc: Randy Dunlap <[EMAIL PROTECTED]>
Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---
this is against the latest cryptodev tree.

 crypto/tcrypt.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 1e12b86..ae762c2 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -87,12 +87,11 @@ static char *check[] = {
"camellia", "seed", "salsa20", NULL
 };
 
-static void hexdump(unsigned char *buf, unsigned int len)
+static inline void hexdump(unsigned char *buf, unsigned int len)
 {
-   while (len--)
-   printk("%02x", *buf++);
-
-   printk("\n");
+   print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
+   16, 1,
+   buf, len, 0);
 }
 
 static void tcrypt_complete(struct crypto_async_request *req, int err)
-- 
1.5.3.5
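Note that the patch changes the output layout: the old helper printed all bytes on one line with no separators, while print_hex_dump() with DUMP_PREFIX_OFFSET, a row size of 16 and a group size of 1 prints offset-prefixed rows of space-separated bytes. The userspace sketch below approximates one such row; format_hex_row() is a hypothetical name and the exact kernel formatting is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format up to 16 bytes of buf as one row: an 8-digit hex offset,
 * a colon, then each byte as two hex digits preceded by a space.
 * Returns the number of characters written (excluding the NUL).
 */
static int format_hex_row(char *out, size_t outsz, size_t offset,
                          const unsigned char *buf, size_t len)
{
    int n = snprintf(out, outsz, "%08zx:", offset);

    if (len > 16)
        len = 16;
    for (size_t i = 0; i < len && n > 0 && (size_t)n < outsz; i++)
        n += snprintf(out + n, outsz - n, " %02x", buf[i]);
    return n;
}
```

For the four bytes de ad be ef at offset 0, the row reads "00000000: de ad be ef".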



Re: __rcu_process_callbacks() in Linux 2.6

2007-11-26 Thread Paul E. McKenney
On Mon, Nov 26, 2007 at 02:48:08PM -0800, James Huang wrote:
> 
> > -Original Message-
> > From: James Huang [mailto:[EMAIL PROTECTED]
> > Sent: Monday, November 26, 2007 2:21 PM
> > To: James Huang
> > Subject: Fw: __rcu_process_callbacks() in Linux 2.6
> > 
> > - Forwarded Message 
> > From: Manfred Spraul <[EMAIL PROTECTED]>
> > To: James Huang <[EMAIL PROTECTED]>
> > Cc: Paul E. McKenney <[EMAIL PROTECTED]>; linux-
> > [EMAIL PROTECTED]
> > Sent: Monday, November 26, 2007 10:28:37 AM
> > Subject: __rcu_process_callbacks() in Linux 2.6
> > 
> > Hi James,
> > 
> > If I understand the issue correctly, then the race is:
> > 
> > step 1: cpu 1: starts a new rcu batch (i.e. rcp->cur++, smp_mb)
> > 
> > step 2: cpu 2: completes the quiet state
> > step 3: cpu 2: reads pointer 0x123 (ptr to a rcu protected struct)
> > 
> > step 4: cpu 3: call_rcu(0x123): rcu protected struct added to rdp->nxtlist
> > step 5: cpu 3: moves a new batch into rdp->curlist, rdp->batch = rcp->cur+1.
> > xxx Problem: where is the smp_rmb() that guarantees that
> > xxx  update to rcp->cur from step 1 is seen by cpu 3?
> > step 6: cpu 3: completes quiet state
> > step 7: cpu 3: struct 0x123 destroyed
> > 
> > step 8: cpu 2: accesses pointer 0x123, but the struct is already destroyed
> > 
> > James: Is that the race?
> 
> 
> [James Huang] 
> 
> Yes, this is the race condition that I am concerned about.
> 
> 
> > 
> > I agree with Paul, there are smp_rmb's on cpu 3 between Step 1 and Step 5:
> > Either the test_and_set_bit in tasklet_action for rcu_process_callback
> > if step 4 happens before the tasklet or somewhere in the irq handler
> > path if step 4 happens in an irq handler that interrupted
> > rcu_process_callback.
> > 
> > Thus theoretically no additional smp_rmb() should be necessary.
> > What is missing is proper documentation.
> > 
> 
> 
> [James Huang] 
> 
> Is it true that a smp_rmb() before a read operation (say from variable
> X) will guarantee that the read will always retrieve the most "current"
> value of X?   I can not find such a guarantee in atomic_ops.txt or
> memory-barriers.txt under Linux's documentation directory.  What is
> described in both documents is relative ordering, e.g.
> 
>     CPU1                CPU2
>     ----                ----
>     write X = x1
>     smp_wmb()
>     write Y = y1
> 
>                         read Y
>                         smp_rmb()
>                         read X
> 
> Then CPU2 will read X with a value of x1 if it reads Y with a value of
> y1.
> 
> Please point me to the right section in the document if smp_rmb() does
> provide such a guarantee.

You are correct, smp_rmb() is about ordering rather than about any sort
of immediacy.  For one thing, it can be quite difficult to say exactly what
the most "current" version of X might be at a given point in time from
the viewpoint of a given CPU -- the different CPUs might well disagree as
to what the "current" version is for a while (though they are guaranteed
to come to agreement).
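The ordering-versus-immediacy distinction can be illustrated in userspace with C11 atomics. This is only an approximation: release and acquire fences stand in for smp_wmb() and smp_rmb() and are somewhat stronger than the kernel primitives, and the function names are made up for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int X, Y;

/* CPU1: write X, write barrier, write Y. */
static void *writer(void *unused)
{
    (void)unused;
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* plays the role of smp_wmb() */
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    return NULL;
}

/*
 * CPU2: read Y, read barrier, read X.
 * Returns 0 only if Y was seen as 1 while X was still 0, which the
 * barrier pairing forbids. Seeing Y == 0 is always allowed: the
 * barriers say nothing about how soon the writes become visible.
 */
static int reader_saw_consistent_order(void)
{
    int y = atomic_load_explicit(&Y, memory_order_relaxed);

    atomic_thread_fence(memory_order_acquire);  /* plays the role of smp_rmb() */
    int x = atomic_load_explicit(&X, memory_order_relaxed);

    return !(y == 1 && x == 0);
}

/* Run one writer thread against a reader on the calling thread. */
static int run_once(void)
{
    pthread_t t;
    int ok;

    atomic_store(&X, 0);
    atomic_store(&Y, 0);
    if (pthread_create(&t, NULL, writer, NULL) != 0)
        return -1;
    ok = reader_saw_consistent_order();
    pthread_join(t, NULL);
    return ok;
}
```

The reader may legitimately observe Y == 0 on any run, however long ago the writer ran on the wall clock; what it can never observe is the new Y together with the old X.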

> Thanks,
> -- James Huang
> 
> > I'm analyzing the code right now:
> > Is it really true that typically a cpu only completes data in every other
> > rcu cycle? I.e. that most structures are stored in the rcu callback list
> > until two quiet states happened?

That is correct.  This does mean that we should be able to leverage
locking primitives and memory barriers executed from the scheduling
clock interrupt.

> > I've tried to track the values of rcp->cur and rdp->batch.
> > If next_pending is set, then cpu_quiet() immediately starts
> > the next rcu cycle and a cpu cannot both complete the currently
> > pending rcu callbacks and add new callbacks to the next cycle,
> > thus a cpu only takes part in every other rcu cycle.
> > 
> > The oocalc file is at
> > http://www.colorfullife.com/~manfred/rcu.ods
> > http://www.colorfullife.com/~manfred/rcu.pdf
> > 
> > Is that analysis correct? Perhaps the whole code should be rewritten?

I believe that the sequencing in the spreadsheet is correct (and thank
you very much for going through it!!!), but it seems to be silent on
memory-barrier issues.

I also believe that Gautham's new CPU-hotplug setup will make
it possible to simplify the code quite a bit.  And given that the
grace-period-detection code is not on any sort of hot code path, it should
be possible to use a less-aggressive design, perhaps one using straight
locking to guard the shared structures.  Also, we are working in the
-rt implementation on a scheme that allows CPUs to stay asleep through
a grace period without the heavy overhead that is otherwise required to
interact with them.  The trick is to maintain a per-CPU counter that is
incremented on each entry and exit to low-power state.  But I would like
to get this right in -rt before trying it in Classic RCU.  ;-)
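The per-CPU counter idea might look roughly like the following single-threaded model. It is a guess at the shape of the scheme, not the -rt code: the names are hypothetical, the parity convention (even means running, odd means low-power) is an assumption, and a real implementation would need atomics and memory barriers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; not the -rt implementation. */
struct cpu_state {
    unsigned long lp_counter;  /* even: running, odd: in low-power state */
};

static void enter_low_power(struct cpu_state *c) { c->lp_counter++; }
static void exit_low_power(struct cpu_state *c)  { c->lp_counter++; }

/* Taken once at the start of a grace period. */
static unsigned long snapshot(const struct cpu_state *c)
{
    return c->lp_counter;
}

/*
 * The CPU needs no wakeup if it was asleep at snapshot time (odd
 * value) or has passed through low-power state since (value changed).
 */
static bool cpu_is_quiescent(const struct cpu_state *c, unsigned long snap)
{
    return (snap & 1UL) || c->lp_counter != snap;
}
```

The grace-period code only compares the counter against a snapshot, so a sleeping CPU never has to be interrupted just to report that it is idle.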

   

Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-26 Thread H. Peter Anvin

Linus Torvalds wrote:
> > The 6-word limit is a red herring.  There are at least two ways to deal with it
> > (and this doesn't mean wiping the legacy stuff we already have):
> > 
> > - Let each architecture pick a calling convention and redefine the
> > architecture-independent bits to take an arbitrary number of arguments.  This
> > is a one-time panarchitectural change.
> 
> Not applicable on x86-32.
> 
> The six-word limit is effectively a hardware limit there. Once it goes 
> past that limit, one of the words needs to be a pointer to extended 
> information that is fundamentally slower to access. Happily, only very 
> rare system calls do that (and none of them are of the simple variety 
> where we see a few cycles easily).
> 
> On other architectures, we could more easily just use more registers. But 
> x86-32 is still a big part (bulk) of what matters for most people.




Well, x86-32 and x86-64 are surprisingly similar here, for very 
different reasons (x86-64 is because there are only seven clobbered 
registers that aren't destroyed by the syscall instruction itself.)


However, on both of these we could make the user-space side cheaper, by 
making sure that we don't have to do additional copies in user space. 
For both these architectures, anything more than 3 parameters (i386) or 
6 parameters (x86-64) will be already in memory on the stack, so if we 
can use that image as-is then we at least save the intra-user-space copy 
that goes along with it.


x86-64 requires some minor thought, since the obvious way of doing it 
(using arg register 6 to push in a pointer) would end up with a 
discontiguous frame.  One can do it with something like this, although 
it's not clear to me it is a win at all (the more obvious sequence using 
XCHG isn't usable since XCHG locks unconditionally):


        pop     %r10            # Return address
        push    %r9             # Argument 6
        movq    %rsp, %r11
        push    %r10
        movq    %rcx, %r10
        syscall
        cmpq    $-4095, %rax
        jae     ...
        pop     %r10
        pop     %r9
        push    %r10
        retq

The number of registers does vary, obviously, with s390 having the smallest 
number (5).


> Immediately when you do anything but registers, it is much *much* more 
> costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow, 
> but it's quite noticeable overhead for simple system calls. It gets worse 
> if this all is described by some indirect table setup.


True, of course, although we're talking here about different ways to 
pull arguments out of userspace memory; *definitely* agreed that we 
don't want to have any additional indirection necessary.


-hpa


[Patch](Resend) mm/sparse.c: Improve the error handling for sparse_add_one_section()

2007-11-26 Thread WANG Cong
On Mon, Nov 26, 2007 at 07:19:49PM +0900, Yasunori Goto wrote:
>Hi, Cong-san.
>
>>  ms->section_mem_map |= SECTION_MARKED_PRESENT;
>>  
>>  ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
>>  
>>  out:
>>  pgdat_resize_unlock(pgdat, &flags);
>> -if (ret <= 0)
>> -__kfree_section_memmap(memmap, nr_pages);
>> +
>>  return ret;
>>  }
>>  #endif
>
>Hmm. When sparse_init_one_section() returns error, memmap and 
>usemap should be freed.

Hi, Yasunori.

Thanks for your comments. Is the following version fine with you?

Signed-off-by: WANG Cong <[EMAIL PROTECTED]>

---

Index: linux-2.6/mm/sparse.c
===
--- linux-2.6.orig/mm/sparse.c
+++ linux-2.6/mm/sparse.c
@@ -391,9 +391,17 @@ int sparse_add_one_section(struct zone *
 * no locking for this, because it does its own
 * plus, it does a kmalloc
 */
-   sparse_index_init(section_nr, pgdat->node_id);
+   ret = sparse_index_init(section_nr, pgdat->node_id);
+   if (ret < 0)
+   return ret;
memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, nr_pages);
+   if (!memmap)
+   return -ENOMEM;
usemap = __kmalloc_section_usemap();
+   if (!usemap) {
+   __kfree_section_memmap(memmap, nr_pages);
+   return -ENOMEM;
+   }
 
pgdat_resize_lock(pgdat, &flags);
 
@@ -403,10 +411,6 @@ int sparse_add_one_section(struct zone *
goto out;
}
 
-   if (!usemap) {
-   ret = -ENOMEM;
-   goto out;
-   }
ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
@@ -414,7 +418,7 @@ int sparse_add_one_section(struct zone *
 out:
pgdat_resize_unlock(pgdat, &flags);
if (ret <= 0)
-   __kfree_section_memmap(memmap, nr_pages);
+   kfree(usemap);
return ret;
 }
 #endif
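The early-return part of this patch follows a common pattern: check each allocation as it is made, and free whatever was already allocated before bailing out. A userspace sketch of that pattern is shown below; add_section(), try_alloc(), release() and the failure-injection counters are all hypothetical names used only to make the behaviour checkable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Test hook (hypothetical): make the Nth allocation fail; 0 disables. */
static int fail_at;
static int alloc_calls;
static int live_allocs;

static void *try_alloc(size_t size)
{
    void *p;

    if (++alloc_calls == fail_at)
        return NULL;
    p = malloc(size);
    if (p)
        live_allocs++;
    return p;
}

static void release(void *p)
{
    if (p) {
        free(p);
        live_allocs--;
    }
}

/*
 * Allocate two buffers; if the second allocation fails, free the first
 * before returning so nothing leaks. Returns 0 or -ENOMEM.
 */
static int add_section(void **memmap_out, void **usemap_out)
{
    void *memmap, *usemap;

    memmap = try_alloc(4096);
    if (!memmap)
        return -ENOMEM;

    usemap = try_alloc(64);
    if (!usemap) {
        release(memmap);
        return -ENOMEM;
    }

    *memmap_out = memmap;
    *usemap_out = usemap;
    return 0;
}
```

Doing both allocations before taking any lock keeps the failure paths simple, since nothing needs to be unlocked on the way out.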



Re: [PATCH] Kobjects: drop child->parent ref at unregistration

2007-11-26 Thread Alan Stern
On Mon, 26 Nov 2007, Andrew Morton wrote:

> On Mon, 19 Nov 2007 10:53:40 -0500 (EST)
> Alan Stern <[EMAIL PROTECTED]> wrote:
> 
> > This patch (as1015) reverts changes that were made to the driver core
> > about four years ago.  The intent back then was to avoid certain kinds
> > of invalid memory accesses by leaving kernel objects allocated as long
> > as any of their children were still allocated.  The original and
> > correct approach was to wait only as long as any children were still
> > _registered_; that's what this patch reinstates.
> 
> What happened with this?

As far as I know, it's on Greg's queue.

> > This fixes a problem in the SCSI core made visible by the class_device
> > to regular device conversion: A reference loop (scsi_device holds
> > reference to request_queue, which is the child of a gendisk, which is
> > the child of the scsi_device) prevents the data structures from being
> > released, even though they are deregistered okay.
> > 
> > It's possible that this change will cause a few bugs to surface,
> > things that have been hidden for several years.  They can be fixed
> > easily enough by having the child device take an explicit reference to
> > the parent whenever needed.
> > 
> 
> How will such bugs manifest?  Ideally via a nice printk and a stack trace
> followed by damage avoidance.

They will manifest in the same way as any other use-after-free bug: an 
oops message and either death of the current process or a system hang.

Obviously I'm not aware of any such bugs -- if I were, I'd fix them.  
Greg has expressed concern that some USB serial drivers might have this 
problem.  I'll do what testing I can (not much because I don't have any 
USB serial devices).
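The fix suggested in the patch description, a child taking an explicit reference to its parent, can be sketched in userspace as follows. This is a miniature reference-counting model with made-up names, not the kobject API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature object model, not the kobject API. */
struct obj {
    int refcount;
    struct obj *parent;   /* holds a counted reference while set */
    int *freed_flag;      /* set to 1 on release, for the demo */
};

static struct obj *obj_get(struct obj *o)
{
    if (o)
        o->refcount++;
    return o;
}

static void obj_put(struct obj *o)
{
    while (o && --o->refcount == 0) {
        struct obj *parent = o->parent;

        if (o->freed_flag)
            *o->freed_flag = 1;
        free(o);
        o = parent;       /* drop the reference the child held */
    }
}

/* Create an object with one reference owned by the caller. */
static struct obj *obj_create(struct obj *parent, int *freed_flag)
{
    struct obj *o = calloc(1, sizeof(*o));

    if (!o)
        return NULL;
    o->refcount = 1;
    o->parent = obj_get(parent);   /* explicit reference on the parent */
    o->freed_flag = freed_flag;
    return o;
}
```

Because the child holds its own reference, the parent outlives it even after the creator has dropped the parent; without that reference, any later access through child->parent would be a use-after-free.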

> If it's via a mysterious crash or something similarly obscure then can we
> improve that?

I can't think of anything offhand.  Maybe someone else can.

Alan Stern



Re: [PATCH] [ACPI] utilities/: Compliment va_start() with va_end().

2007-11-26 Thread Richard Knutsson

Moore, Robert wrote:
> Yes, it's official ANSI C, so I agree with the portability. I'm probably
> asking more about the history of the thing.

"the history of the thing"? Sorry, you lost me there. I know there was a
pre-ANSI version of va_start() & co., but it seemed quite messy. When it
comes to va_end() and maintainers, they often seem positive to this. I guess
the occasional lack of va_end() is usually an oversight.
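For reference, the pairing under discussion looks like this in a small standalone program. The function names are made up for illustration; the second function shows two traversals of the same argument list, each bracketed by its own va_start()/va_end() pair.

```c
#include <assert.h>
#include <stdarg.h>

/* Sum `count` int arguments; va_start() is paired with va_end(). */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);          /* ap must not be used after this point */

    return total;
}

/* Two traversals of the same list, each bracketed by its own pair. */
static int sum_and_max(int *max_out, int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);

    va_start(ap, count);
    *max_out = count > 0 ? va_arg(ap, int) : 0;
    for (int i = 1; i < count; i++) {
        int v = va_arg(ap, int);

        if (v > *max_out)
            *max_out = v;
    }
    va_end(ap);

    return total;
}
```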
  

-Original Message-
From: Richard Knutsson [mailto:[EMAIL PROTECTED]
Sent: Monday, November 26, 2007 4:16 PM
To: Moore, Robert
Cc: Len Brown; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
Subject: Re: [PATCH] [ACPI] utilities/: Compliment va_start() with va_end().

Moore, Robert wrote:
> This is an interesting one to me.
> 
> From various documentation:
> 
> After all arguments have been retrieved, va_end resets the pointer to NULL.
> 
> va_end
> Each invocation of va_start must be matched by a corresponding
> invocation of va_end in the same function. After the call va_end(ap) the
> variable ap is undefined. Multiple transversals of the list, each
> bracketed by va_start and va_end are possible. va_end may be a macro or
> a function.
> 
> Now, I'm all for defensive programming, but I don't really see the point
> of va_end when the list will be only traversed once.

First off, I think it is a good idea to follow the documentation, which
stated:
"va_end
Each invocation of va_start must be matched by a corresponding
invocation of va_end in the same function."

Then if it is not really needed, does it take up extra cycles?
"In practice, with most C compilers, calling |va_end| does nothing
and you do not really need to call it.  This is always true in the GNU C
compiler."[1]

Portability:
"But you might as well call |va_end| just in case your
program is someday compiled with a peculiar compiler."[2]
This argument is not as likely though, but who knows? (Since I guess Intel's
compiler is included in the 'most C compilers')

> We don't set all local pointers to NULL at function exit, what is the
> point of doing it here?

I think it is a good thing if the code follows the documentation, both
for the person who tries to understand the code (to see when the 'args' is
no longer needed and not get confused by the absence of va_end(); after all,
IMHO we should write the code how we want things to work and let the
compiler do the optimizations (it usually does a better job at it than we
do)) and for automated searches (that is how I found this one).

> I suppose some implementation could allocate memory at va_start, but in
> practice, does this happen?

Not sure what you mean.

> Bob

cu
Richard Knutsson

[1] http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_28.html
[2] The rest of [1]'s line.


  




Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-26 Thread Linus Torvalds


On Mon, 26 Nov 2007, H. Peter Anvin wrote:
> 
> I'm presuming you're not talking about some sort of syslets/fibrils/threadlets
> here (executing an interpreted thread of execution in kernel space).  That's a
> whole separate ball of wax.

Indeed. 

I'm hoping that just dies. It's too complex. But the "do this single 
system call asynchronously" isn't, and has lots of historical 
implementations, ranging from VMS to the braindead POSIX "aio" setup.

I do think that more complex threadlets could be useful in theory, I just 
doubt they'd be used in practice..

> > So the choice is basically one of:
> > 
> >  - come up with a totally new interface to system calls, and effectively
> > duplicating the whole system call table.
> > 
> >I'd hate to do this. We already have duplicated system call tables due
> > to compat stuff, it's painful.
> 
> This would be the right thing to do if we were to redesign the system call
> interface from the ground up, which it doesn't exactly sound like we are
> intending.

Yeah. I'm also not sure it's the right thing even if we did redesign from 
scratch.

The current system call interface may look less than regular, but it has 
some very solid foundation: it's fast. Passing arguments in registers is 
by definition a lot faster *and*safer* than passing them any other way. 
There are no subtle security issues with people playing games with the 
argument base pointer (ie usually the stack pointer) and trying to fool 
the kernel into accessing kernel memory etc.

Immediately when you do anything but registers, it is much *much* more 
costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow, 
but it's quite noticeable overhead for simple system calls. It gets worse 
if this all is described by some indirect table setup.

In the system call path, right now, for some system calls, the biggest two 
overheads are

 - the CPU system call overhead itself. We can't do much about this, but 
   the CPU designers do seem to be slowly getting it fixed (ie it's slower 
   than it should need to be, but it's a hell of a lot faster than a P4 
   used to be ;)

 - the cost of just the single indirect - and unpredictable - call.

(The second cost is actually often totally hidden in the trivial system 
call benchmarks people run: if the benchmark just does "getppid()" a 
million times in a tight loop, the indirect call on the system call number 
seems really quite fast, but outside of benchmarks it is generally totally 
unpredictable indeed, and a real cost for real-life system call usage).
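The "single indirect call" is the jump through the system call table. A toy model in plain C is sketched below; the handlers and the two-entry table are invented for illustration and have nothing to do with the real table layout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long a0, long a1, long a2);

/* Hypothetical handlers standing in for real system calls. */
static long sys_add(long a, long b, long c)    { (void)c; return a + b; }
static long sys_negate(long a, long b, long c) { (void)b; (void)c; return -a; }

static const syscall_fn syscall_table[] = {
    sys_add,      /* number 0 */
    sys_negate,   /* number 1 */
};

#define NR_SYSCALLS (sizeof(syscall_table) / sizeof(syscall_table[0]))

/*
 * Bounds-check the number, then make the single indirect call.
 * The target depends on run-time data, so the branch predictor can
 * only guess it from history.
 */
static long dispatch(unsigned long nr, long a0, long a1, long a2)
{
    if (nr >= NR_SYSCALLS)
        return -ENOSYS;
    return syscall_table[nr](a0, a1, a2);
}
```

A benchmark that calls the same number in a tight loop trains the predictor perfectly, which is why the cost only shows up with a realistic mix of calls.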

Everything else in the system call path is generally as fast as we can 
make it. Doing more indirection and conditionals would be really quite 
nasty.

Of course, for *most* of system calls, the work the kernel actually does 
ends up being so big that it doesn't much matter, but I was literally 
chasing down why a page fault had slowed down by ~70 cycles two weeks ago. 
And it doesn't take more than a couple of unpredictable jumps to do things 
like that!

> The 6-word limit is a red herring.  There are at least two ways to deal with it
> (and this doesn't mean wiping the legacy stuff we already have):
> 
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments.  This
> is a one-time panarchitectural change.

Not applicable on x86-32.

The six-word limit is effectively a hardware limit there. Once it goes 
past that limit, one of the words needs to be a pointer to extended 
information that is fundamentally slower to access. Happily, only very 
rare system calls do that (and none of them are of the simple variety 
where we see a few cycles easily).

On other architectures, we could more easily just use more registers. But 
x86-32 is still a big part (bulk) of what matters for most people.
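The "pointer to extended information" case can be modelled in userspace like this. Everything here is hypothetical: sys_sum7() is not a real system call, and fetch_extra() merely stands in for copy_from_user(), which in a real kernel must validate the pointer and can fault.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical block for arguments that do not fit in registers. */
struct extra_args {
    long arg6;
    long arg7;
};

/*
 * Stand-in for copy_from_user(): a real kernel must validate the user
 * pointer and may fault; this sketch only checks for NULL.
 * Returns 0 on success, -1 on failure.
 */
static int fetch_extra(struct extra_args *dst, const void *user_ptr)
{
    if (!user_ptr)
        return -1;
    memcpy(dst, user_ptr, sizeof(*dst));
    return 0;
}

/* Six "register" words; the sixth one points at the overflow block. */
static long sys_sum7(long a0, long a1, long a2, long a3, long a4,
                     const void *extra_ptr)
{
    struct extra_args extra;

    if (fetch_extra(&extra, extra_ptr) != 0)
        return -1;
    return a0 + a1 + a2 + a3 + a4 + extra.arg6 + extra.arg7;
}
```

The extra copy and its failure path are the overhead being described: the first five values arrive for free, while the last two cost a checked memory access before the call can do any work.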

Linus


Re: [PATCH 06/18] x86 vDSO: arch/x86/vdso/vdso32

2007-11-26 Thread Roland McGrath
> But whatever works. I'm currently skipping the patches since they didn't 
> seem like 2.6.24 fodder anyway.

The vdso cleanups are pure cleanup, not fixing anything that's actively broken.


Thanks,
Roland


Re: [PATCH 06/18] x86 vDSO: arch/x86/vdso/vdso32

2007-11-26 Thread Linus Torvalds


On Tue, 20 Nov 2007, Roland McGrath wrote:
>
> > git format-patch -p
> > 
> > does the trick at least here :)
> 
> Ok, I can use that in future.  I hope it still means that in the eventual
> merged state, GIT will be aware of all the renames.

Git doesn't care. You can do renames by hand, or with "git mv", you can do 
them as a delete/create pair, you can use "git-apply" with a rename patch, 
and you can do them by re-typing in all of the file contents from scratch.

Regardless of how the rename is done, git will represent the data the 
exact same way: the state of the tree before and after. 

The rename-patches are a lot denser and a lot more readable for humans (ie 
you can actually see what *happens*, unlike a traditional stupid unified 
diff), and I was hoping that eventually somebody in the GNU patch 
community would see how wonderful the extended patch information is, but 
when I tried to write a patch to "patch" to do it, I almost dug out my 
eyes with spoons from looking at the source code, so I haven't actually 
helped it happen.

So you can ask for patches in traditional format (*most* git command lines 
will default to that anyway, and only give a copy-patch with -C or -M on 
the command line), or people could realize that "git-apply" actually works 
even on non-git source code, and just stop using that abomination that is 
"patch" with all of its totally wrong and unsafe defaults (*).

But whatever works. I'm currently skipping the patches since they didn't 
seem like 2.6.24 fodder anyway.

Linus

(*) Let me count the ways: applying patches partially when it fails 
half-way through a series. Defaulting to totally randomly guessing the 
path-name skip depth when not explicitly given a -pX option. Defaulting to 
"--fuzz=2" which is almost guaranteed to apply a patch even when it makes 
no sense what-so-ever. Yes, git-apply has stricter rules, but they are 
stricter for damn good reasons. For people who want the insane unsafe GNU 
patch defaults, they just have to specifically ask for unsafe modes.
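
The fuzz complaint can be made concrete with a small model. The sketch below is not GNU patch's code. It assumes a hunk is simply a list of old-side lines and that a fuzz of N means the first and last N of those lines are not checked; `hunk_matches` and its parameters are hypothetical names, and refusing to match when nothing is left to compare is a choice made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified model of fuzzy hunk matching.
 * Compare the hunk's old-side lines against the file starting at pos,
 * skipping the first and last `fuzz` lines of the hunk.
 */
bool hunk_matches(const char *const *file, size_t file_len, size_t pos,
                  const char *const *hunk, size_t hunk_len, size_t fuzz)
{
    /* The hunk must fit inside the file at this position. */
    if (pos > file_len || hunk_len > file_len - pos)
        return false;

    /* Assumption: refuse to match if fuzz leaves nothing to compare. */
    if (hunk_len <= 2 * fuzz)
        return false;

    for (size_t i = fuzz; i < hunk_len - fuzz; i++) {
        if (strcmp(file[pos + i], hunk[i]) != 0)
            return false;
    }
    return true;
}
```

In this model a five-line hunk checked with fuzz 2 only has to agree with the file on its middle line, which is the kind of match that "makes no sense" but still applies.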


[PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-11-26 Thread Neil Horman
Hey all-
I've been working on an issue lately involving multi-socket x86_64
systems connected via hypertransport bridges.  It appears that some systems
disable the hypertransport connections during a kdump operation when all but the
crashing processor are halted in machine_crash_shutdown.  This becomes a
problem when the ioapic attempts to route interrupts to the only remaining
processor.  Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive results in
interrupts not getting delivered.  The net effect is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay.  I've found that I can avoid
this hang by forcing a transition to the bios-defined boot cpu during the
crashing kernel's shutdown.  This patch accomplishes that.  Tested by myself and
the original reporter with successful results.
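
For illustration, the bounded wait that the boot cpu performs in the patch (poll a counter of cpus that have not halted yet, give up after a fixed budget) can be modelled in userspace C. This is a sketch under assumptions, not the kernel code: `cpus_not_halted`, `wait_for_halt` and the two tick helpers are hypothetical names, and each tick stands in for one udelay(1000) step.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the count of cpus that have not halted yet. */
atomic_int cpus_not_halted;

/* Stand-in for one delay step during which one other cpu halts. */
void tick_one_cpu_halts(void)
{
    if (atomic_load(&cpus_not_halted) > 0)
        atomic_fetch_sub(&cpus_not_halted, 1);
}

/* Stand-in for one delay step during which nothing halts. */
void tick_nothing_happens(void)
{
}

/*
 * Poll until the counter reaches zero or the tick budget runs out.
 * Returns true if every cpu halted in time, false on timeout.
 */
bool wait_for_halt(unsigned long ticks, void (*tick)(void))
{
    while (atomic_load(&cpus_not_halted) > 0 && ticks) {
        tick();
        ticks--;
    }
    return atomic_load(&cpus_not_halted) == 0;
}
```

The patch as posted continues to the new kernel whether or not the wait timed out; returning the outcome here is only a convenience for the example.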

Regards,
Neil

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>


 arch/x86/kernel/crash.c |   46 ++
 include/linux/kexec.h   |    3 +++
 init/main.c             |    6 ++
 kernel/kexec.c          |    8
 4 files changed, 55 insertions(+), 8 deletions(-)


diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8bb482f..0682e60 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -67,13 +67,36 @@ static int crash_nmi_callback(struct notifier_block *self,
 	}
 #endif
 	crash_save_cpu(regs, cpu);
-	disable_local_APIC();
-	atomic_dec(&waiting_for_crash_ipi);
-	/* Assume hlt works */
-	halt();
-	for (;;)
-		cpu_relax();
-
+	if (smp_processor_id() == kexec_boot_cpu) {
+		/*
+		 * This is the boot cpu.  We need to:
+		 * 1) Wait for the other processors to halt
+		 * 2) clear our nmi interrupt
+		 * 3) launch the new kernel
+		 */
+		unsigned long msecs = 1000;
+		while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
+			/*
+			 * Use udelay to avoid the warnings here
+			 * I know we shouldn't delay in an irq
+			 * but we're about to reboot the box during
+			 * a crash, a delay doesn't hurt here
+			 */
+			udelay(1000);
+			msecs--;
+		}
+		ack_APIC_irq();
+		disable_local_APIC();
+		disable_IO_APIC();
+		machine_kexec(kexec_crash_image);
+
+	} else {
+		disable_local_APIC();
+		atomic_dec(&waiting_for_crash_ipi);
+		/* Assume hlt works */
+		for(;;)
+			halt();
+	}
 	return 1;
 }
 
@@ -138,7 +161,14 @@ void machine_crash_shutdown(struct pt_regs *regs)
 	nmi_shootdown_cpus();
 	lapic_shutdown();
 #if defined(CONFIG_X86_IO_APIC)
-	disable_IO_APIC();
+	if (crashing_cpu == kexec_boot_cpu)
+		disable_IO_APIC();
 #endif
 	crash_save_cpu(regs, safe_smp_processor_id());
+	if (crashing_cpu != kexec_boot_cpu) {
+		atomic_dec(&waiting_for_crash_ipi);
+		for(;;)
+			halt();
+	}
+
 }
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 2d9c448..b5c12d6 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -187,6 +187,9 @@ extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
 extern size_t vmcoreinfo_size;
 extern size_t vmcoreinfo_max_size;
 
+extern int kexec_boot_cpu;
+extern void kexec_record_boot_cpu();
+
 int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
 
diff --git a/init/main.c b/init/main.c
index 58f5a99..0f11ee0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -58,6 +58,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_KEXEC
+#include 
+#endif
 
 #include 
 #include 
@@ -538,6 +541,9 @@ asmlinkage void __init start_kernel(void)
 	unwind_setup();
 	setup_per_cpu_areas();
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
+#ifdef CONFIG_KEXEC
+	kexec_record_boot_cpu();
+#endif
 
 	/*
 	 * Set up the scheduler prior starting any interrupts (such as the
diff --git a/kernel/kexec.c b/kernel/kexec.c
index aa74a1e..cb6b1f3 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -41,6 +41,14 @@ u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
 size_t vmcoreinfo_size;
 size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
 
+int kexec_boot_cpu = 0;
+
+void __init kexec_record_boot_cpu()
+{
+	kexec_boot_cpu = smp_processor_id();
+	printk(KERN_CRIT "kexec records boot cpu as %d\n",kexec_boot_cpu);
+}
+
 /* Location of the reserved area for the crash 
