Re: acpi ->video_device_list corruption

2007-12-12 Thread William Lee Irwin III
On Wed, Dec 12, 2007 at 12:48:09PM +0100, Mikael Pettersson wrote:
> IMO the memset(ptr, 0, sizeof(*ptr)) idiom is both safer
> and avoids having to write an uninteresting type name.

How about this, then?

The ->cap fields of struct acpi_video_device and struct acpi_video_bus
are 1B each, not 4B. The oversized memset() calls corrupted the subsequent
list_head fields. This resulted in silent corruption without
CONFIG_DEBUG_LIST and BUGs with it. This patch uses sizeof() to pass
the proper bounds to the memset() calls and thereby correct the bugs.

The patch was seen to resolve the issue on the affected system.
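
The overrun can be illustrated in userspace. This is a minimal sketch,
not the driver code: struct demo_cap, struct demo_device and both
functions are hypothetical stand-ins, and entry[] is a plain byte array
so that the layout has no padding between the two members.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; not the real drivers/acpi/video.c definitions. */
struct demo_cap {
	unsigned char bits;	/* models a 1-byte set of capability flags */
};

struct demo_device {
	struct demo_cap cap;	/* 1 byte */
	unsigned char entry[8];	/* models the bytes of the following list_head */
};

/*
 * Zero n bytes starting at the cap member, then count how many bytes of
 * the neighbouring entry[] were zeroed as a side effect.
 */
size_t demo_clobbered(size_t n)
{
	struct demo_device d;
	size_t i, hit = 0;

	/* Keep the write inside d so the demonstration itself is well defined. */
	assert(offsetof(struct demo_device, cap) + n <= sizeof(d));

	memset(&d, 0xff, sizeof(d));
	/* Address the struct as raw bytes, starting where cap lives. */
	memset((unsigned char *)&d + offsetof(struct demo_device, cap), 0, n);

	for (i = 0; i < sizeof(d.entry); i++)
		if (d.entry[i] == 0)
			hit++;
	return hit;
}

/* The size taken from the member itself, as the patch does. */
size_t demo_cap_size(void)
{
	struct demo_device d;

	return sizeof(d.cap);
}
```

With this layout, demo_clobbered(4) reports three bytes of entry[]
zeroed, while demo_clobbered(demo_cap_size()) reports none. In a real
structure the damage depends on the padding between the members.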

vs. 2.6.24-rc5

Signed-off-by: William Irwin <[EMAIL PROTECTED]>

diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
index 44a0d9b..bd77e81 100644
--- a/drivers/acpi/video.c
+++ b/drivers/acpi/video.c
@@ -577,7 +577,7 @@ static void acpi_video_device_find_cap(struct acpi_video_device *device)
 	struct acpi_video_device_brightness *br = NULL;
 
 
-	memset(&device->cap, 0, 4);
+	memset(&device->cap, 0, sizeof(device->cap));
 
 	if (ACPI_SUCCESS(acpi_get_handle(device->dev->handle, "_ADR", &h_dummy1))) {
 		device->cap._ADR = 1;
@@ -697,7 +697,7 @@ static void acpi_video_bus_find_cap(struct acpi_video_bus *video)
 {
 	acpi_handle h_dummy1;
 
-	memset(&video->cap, 0, 4);
+	memset(&video->cap, 0, sizeof(video->cap));
 	if (ACPI_SUCCESS(acpi_get_handle(video->device->handle, "_DOS", &h_dummy1))) {
 		video->cap._DOS = 1;
 	}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


acpi ->video_device_list corruption

2007-12-12 Thread William Lee Irwin III
The ->cap fields of struct acpi_video_device and struct acpi_video_bus
are 1B each, not 4B. The oversized memset() calls corrupted the subsequent
list_head fields. This resulted in silent corruption without
CONFIG_DEBUG_LIST and BUGs with it. This patch uses sizeof() to pass
the proper bounds to the memset() calls and thereby correct the bugs.

Included as a MIME attachment is a compressed dmesg from an affected
system. The patch was seen to resolve the issue on the affected system.

vs. 2.6.24-rc5

Signed-off-by: William Irwin <[EMAIL PROTECTED]>


-- wli

diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
index 44a0d9b..7895d57 100644
--- a/drivers/acpi/video.c
+++ b/drivers/acpi/video.c
@@ -577,7 +577,7 @@ static void acpi_video_device_find_cap(struct acpi_video_device *device)
 	struct acpi_video_device_brightness *br = NULL;
 
 
-	memset(&device->cap, 0, 4);
+	memset(&device->cap, 0, sizeof(struct acpi_video_device_cap));
 
 	if (ACPI_SUCCESS(acpi_get_handle(device->dev->handle, "_ADR", &h_dummy1))) {
 		device->cap._ADR = 1;
@@ -697,7 +697,7 @@ static void acpi_video_bus_find_cap(struct acpi_video_bus *video)
 {
 	acpi_handle h_dummy1;
 
-	memset(&video->cap, 0, 4);
+	memset(&video->cap, 0, sizeof(struct acpi_video_bus_cap));
 	if (ACPI_SUCCESS(acpi_get_handle(video->device->handle, "_DOS", &h_dummy1))) {
 		video->cap._DOS = 1;
 	}


dmesg.acpibug.gz
Description: dmesg.acpibug.gz


Re: [RFC] [PATCH] hugetlbfs :shmget with SHM_HUGETLB only works as root

2007-11-29 Thread William Lee Irwin III
On Fri, Nov 30, 2007 at 12:02:32AM +0530, Ciju Rajan K wrote:
>   I tested your patch. But that is not solving the problem.
>   If the code change to user_shm_lock() is not a good solution, could 
> you please suggest a method so that the normal user is able to allocate 
> the huge pages, if his gid is added to /proc/sys/vm/hugetlb_shm_group

The patch I posted resolves a race unrelated to your issue. Raising your
locked memory limits should not be difficult: /etc/limits.conf or similar
should set it up for you. If you're truly having lots of trouble getting
userspace to set the default limits properly, you can also change the
default rlimit in the kernel and compile it with the default limits
elevated to what you want your unprivileged process to start with. I'd
look in include/asm-generic/resource.h.
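
For reference, the locked memory limit in question is RLIMIT_MEMLOCK,
which a process can inspect and adjust itself. The sketch below uses
only the standard getrlimit()/setrlimit() calls; the function name is
made up for illustration. An unprivileged process may raise its soft
limit only as far as the hard limit, so the hard limit still has to be
set by the administrator (for instance through the limits configuration
mentioned above).

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_MEMLOCK to the current hard limit.
 * Returns 0 on success and -1 on error (errno is set by the libc call).
 */
int raise_memlock_soft_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	/* The soft limit never exceeds the hard limit. */
	assert(rl.rlim_cur <= rl.rlim_max);

	rl.rlim_cur = rl.rlim_max;
	return setrlimit(RLIMIT_MEMLOCK, &rl);
}
```

If the hard limit is RLIM_INFINITY, the soft limit becomes unlimited as
well after this call.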


-- wli


Re: Why preallocate pmd in x86 32-bit PAE?

2007-11-15 Thread William Lee Irwin III
Linus Torvalds wrote:
>> IIRC, the present bit is ignored in the magic 4-entry PGD.  All entries 
>> have to be present.

On Thu, Nov 15, 2007 at 02:42:46PM -0800, H. Peter Anvin wrote:
> This is true, although you could point a PGD to an all-zero page if you 
> really wanted to.  You have to re-load CR3 after modifying the top-level 
> entries.

There may be bigger fish to fry in terms of per-process overhead, if
you're trying to cut that down. The trouble with trying to address
some of those is that there is mutual antagonism between compactness
and expansibility in the process address space layout, so you'll end
up instantiating a lot more than you want barring some sort of provision
for a compact address space layout. Pagetable sharing is a far more
powerful resource scalability method, though it also needs cooperation
in user address space layout to reap its gains.

There are other overheads, of course, though they're more typically
per-something besides processes.


-- wli


Re: [RFC] [PATCH] hugetlbfs :shmget with SHM_HUGETLB only works as root

2007-11-14 Thread William Lee Irwin III
On Wed, Nov 14, 2007 at 09:31:41AM -0600, aglitke wrote:
> ... if the user's locked limit (ulimit -l) is set to unlimited, allowed
> (above) is set to 1.  In that case, the second part of that if() is
> bypassed, and the function grants permission.  Therefore, the easy
> solution is to make sure your user's lock_limit is RLIM_INFINITY.

This function deserves a minor cleanup and a bit more commenting.

Reading user->locked_shm within shmlock_user_lock would be nice, too.

Maybe something like this (untested, uncompiled) would do.


-- wli


diff --git a/mm/mlock.c b/mm/mlock.c
index 7b26560..5f51792 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -234,6 +234,12 @@ asmlinkage long sys_munlockall(void)
 /*
  * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
  * shm segments) get accounted against the user_struct instead.
+ * First, user_shm_lock() checks that the user has permission to lock
+ * enough memory; then if so, the locked shm is accounted to the user's
+ * system-wide state. shmlock_user_lock protects the per-user field
+ * tracking how much locked_shm is in use within the struct user_struct.
+ * shmlock_user_lock is taken early to guard the read-only check that
+ * user->locked_shm is in-bounds against updates to user->locked_shm.
  */
 static DEFINE_SPINLOCK(shmlock_user_lock);
 
@@ -242,19 +248,22 @@ int user_shm_lock(size_t size, struct user_struct *user)
 	unsigned long lock_limit, locked;
 	int allowed = 0;
 
+	spin_lock(&shmlock_user_lock);
 	locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
 	if (lock_limit == RLIM_INFINITY)
 		allowed = 1;
-	lock_limit >>= PAGE_SHIFT;
-	spin_lock(&shmlock_user_lock);
-	if (!allowed &&
-	    locked + user->locked_shm > lock_limit && !capable(CAP_IPC_LOCK))
-		goto out;
-	get_uid(user);
-	user->locked_shm += locked;
-	allowed = 1;
-out:
+	else {
+		lock_limit >>= PAGE_SHIFT;
+		if (locked + user->locked_shm <= lock_limit)
+			allowed = 1;
+		else if (capable(CAP_IPC_LOCK))
+			allowed = 1;
+	}
+	if (allowed) {
+		get_uid(user);
+		user->locked_shm += locked;
+	}
 	spin_unlock(&shmlock_user_lock);
 	return allowed;
 }
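
The decision logic of the reworked user_shm_lock() can be modeled in
userspace to check the cases by hand. This is only a sketch under
stated assumptions: the MODEL_* constants, struct model_user and
model_user_shm_lock() are hypothetical names, a 4KB page is assumed,
the rlimit and the capability are passed in as plain arguments, and the
spinlock and get_uid() reference counting are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT	12UL			/* assumed 4KB pages */
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)
#define MODEL_RLIM_INFINITY	ULONG_MAX		/* stand-in for RLIM_INFINITY */

/* Stand-in for the locked_shm field of struct user_struct. */
struct model_user {
	unsigned long locked_shm;	/* pages of locked shm already accounted */
};

/*
 * Returns 1 and accounts the pages to the user if locking is allowed,
 * otherwise returns 0 and leaves the accounting untouched.
 */
int model_user_shm_lock(unsigned long size, struct model_user *user,
			unsigned long lock_limit, int has_ipc_lock)
{
	/* Round the request up to whole pages. */
	unsigned long locked = (size + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
	int allowed = 0;

	assert(user != NULL);

	if (lock_limit == MODEL_RLIM_INFINITY)
		allowed = 1;
	else {
		lock_limit >>= MODEL_PAGE_SHIFT;	/* bytes to pages */
		if (locked + user->locked_shm <= lock_limit)
			allowed = 1;
		else if (has_ipc_lock)
			allowed = 1;
	}
	if (allowed)
		user->locked_shm += locked;
	return allowed;
}
```

With a 64KB limit, a 64KB request from a user with nothing locked is
granted; one more byte is then refused unless the caller has the
capability or the limit is unlimited.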


Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)

2007-07-25 Thread William Lee Irwin III
On Wed, Jul 25, 2007 at 04:39:04PM +0200, Andrea Arcangeli wrote:
> For the kernel stack btw, when alloc_pages(order=1) fails vmalloc
> should be used and 4k stacks can be dropped. Nobody does dma from the
> stack anymore these days IIRC (it doesn't work in all archs anyway).

I have recent code for that circulating, albeit intended for debugging
purposes. There's nothing particularly debug-oriented about it, though,
apart from the fact that a guard page is automatically set up by vmalloc()
and that the use of vmalloc() is unconditional.

As for the rest, I'm sure there could be a lively conversation, but
no consensus, so I'll let it go.


-- wli


Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)

2007-07-24 Thread William Lee Irwin III
On Wed, Jul 18, 2007 at 06:32:22AM -0700, William Lee Irwin III wrote:
>> Actually I'd worked on what was called MPSS (Multiple Page Size Support)
>> before I ever started on pgcl. Some large portion of the pgcl proposal
>> as I presented it internally was to reduce the order of large page
>> allocations and provide a promotion and demotion mechanism enabling
>> different processes to have different sized translations for the same
>> large page, and hence no out-of-context pagetable/TLB updates during
>> promotion and demotion, essentially by making the TLB translation to
>> page relation M:N. ISTR describing this in a KS presentation for which
>> IIRC you were present. But that's neither here nor there.

On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> Well the whole difference between you back then and SGI now, is that
> your stuff wasn't being pushed to be merged very hard (it was proposed
> but IIRC more as research topic, like the large PAGE_SIZE also fallen
> into that same research area). See now the emails from SGI fs folks
> about variable order page size, they want it merged badly instead.

Neither were research topics, but I'm tired of correcting the history
of my failures. I've got enough ongoing failures as things stand.


On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> My whole point is that the single moment the variable order page size
> isn't pure research anymore like MPSS, the CONFIG_PAGE_SHIFT isn't
> research anymore either, like the tail packing in pagecache with
> kmalloc also isn't research anymore.

There was never any research involved in the page clustering per se.
It was supposed to be a generally advantageous thing that Linus had
at least once explicitly approved of that just so happened to relieve
mem_map[] pressure on 64GB i386, the side effect intended to attract
corporate patronage.
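
The mem_map[] pressure is a matter of arithmetic: one page descriptor
per page frame. The sketch below is illustrative only. The function
name is made up, and the 32-byte descriptor size used in the note that
follows is an assumption, since the real size of struct page depends on
kernel version and configuration.

```c
#include <assert.h>

/*
 * Bytes of mem_map[] needed to describe ram_bytes of memory when each
 * page of page_bytes is tracked by one descriptor of desc_bytes.
 */
unsigned long long mem_map_bytes(unsigned long long ram_bytes,
				 unsigned long long page_bytes,
				 unsigned long long desc_bytes)
{
	assert(page_bytes != 0);
	return ram_bytes / page_bytes * desc_bytes;
}
```

Under that assumption, 64GB of RAM in 4KB pages needs 512MB of
descriptors, which is a large share of the lowmem available on i386. A
64KB software page size would cut that to 32MB.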

That last fact was not only demonstrable, it was used in the first
ever public demonstration of a 64GB i386 machine running Linux, which
I personally carried out.

Beyond active hindrance and a lack of cooperation, a "competing
solution" with distro backing appeared that removed the last vestige
of corporate patronage from the project. It ended up bitrotting
faster than I could singlehandedly do all the maintenance, testing,
and coding work on it while also trying to get anything else done.

MPSS was not as well-developed at the time the hugetlb "solution"
killed it, but is not terribly dissimilar in how it came into
being, developed, and then died, apart from less active hindrance.

The one and only aspect in which any research was involved was a
proposal, never accepted or pursued, to investigate how larger
base page sizes implemented via page clustering mitigated external
fragmentation for the purposes of MPSS and also how certain
techniques borrowed from page clustering could reduce the frequency
of and performance penalties associated with demotion in MPSS. The
proposal has never been publicly circulated, though some of its content
was described in the KS presentation as "future directions" or similar.


On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> About the fs deciding the size of the pagecache granularity I totally
> dislike that design, there's no reason why the fs should control that,
[...]

This is all valid commentary, though I don't have any particular
response to it.

In any event, I've never been involved in a research project, though
I would've liked to have been. The emphasis in all cases was enabling
specific functionality in production, using techniques whose viability
had furthermore already been demonstrated elsewhere, by others.

In both instances, insurmountable nontechnical obstacles were present,
which remain in place and effectively limit the scale and scope of any
sort of project I can personally lead with any sort of likelihood of
mainline acceptance.

Where I am limited, you are not. Good luck to you.


-- wli


Re: [PATCH for review] [7/48] i386: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-07-19 Thread William Lee Irwin III
From: William Lee Irwin III <[EMAIL PROTECTED]>
>> PAE is useful for more than supporting more than 4GB RAM.  It supports
>> expanded swapspace and NX executable protections.  Some users may want NX
>> or expanded swapspace support without the overhead or instability of
>> highmem.  For these reasons, the following patch divorces CONFIG_X86_PAE
>> from CONFIG_HIGHMEM64G.

On Thu, Jul 19, 2007 at 03:52:29PM +0100, Christoph Hellwig wrote:
> What overhead of instability of highmem?  Sorry folks but this is utter
> bollocks.  Back in the Caldera days we did a lot of measurement on highmem
> overhead, and CONFIG_HIGHMEM has no measurable overhead at all on a system
> that doesn't use it.  CONFIG_HIGHMEM64G on the other hand has
> a quite visible overhead on small systems, but that's entirely due to the
> bigger page table entries that you need for NX.

The missing context here is CONFIG_VMSPLIT on laptops.

Laptop users, who frequently use CONFIG_VMSPLIT options to avoid
highmem, wanted to turn on NX. Prior to the patch, those options were
barred for all highmem configurations. In response to those requests,
I produced the patch.

The overhead and instability derived from tiny zones as opposed to
kmap()/kunmap(), or at least such was the case historically.


-- wli


Re: [PATCH] Check for compound pages in set_page_dirty()

2007-07-19 Thread William Lee Irwin III
On Thu, Jul 19, 2007 at 06:35:17PM +0100, Hugh Dickins wrote:
> I started from your patch.  But it now seems to me a bugfix to remove
> those PageCompound tests, because they're preventing a hugetlb page
> from being marked dirty, when Ken needs it to be marked dirty so
> /proc/sys/vm/drop_caches doesn't drop the data read in by DIO.
> (His original patch went into -stable: would the patch fixing
> this all up need to go into -stable?)

This needs to be done some other way. The dirty bit should not be
significant for pseudofs's with no backing store. The consequences
of making it so are becoming apparent in the IO path, and it
caused performance regressions elsewhere as well. ramfs, for instance,
doesn't require anything of this sort to cope with drop_caches.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Check for compound pages in set_page_dirty()

2007-07-19 Thread William Lee Irwin III
On Thu, Jul 19, 2007 at 06:35:17PM +0100, Hugh Dickins wrote:
 I started from your patch.  But it now seems to me a bugfix to remove
 those PageCompound tests, because they're preventing a hugetlb page
 from being marked dirty, when Ken needs it to be marked dirty so
 /proc/sys/vm/drop_caches doesn't drop the data read in by DIO.
 (His original patch went into -stable: would the patch fixing
 this all up need to go into -stable?)

This needs to be done some other way. The dirty bit should not be
significant for pseudofs's with no backing store. The consequences
of making it so are becoming apparent in the IO path, and it
caused performance regressions elsewhere as well. ramfs, for instance,
doesn't require anything of this sort to cope with drop_caches.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH for review] [7/48] i386: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-07-19 Thread William Lee Irwin III
From: William Lee Irwin III [EMAIL PROTECTED]
 PAE is useful for more than supporting more than 4GB RAM.  It supports
 expanded swapspace and NX executable protections.  Some users may want NX
 or expanded swapspace support without the overhead or instability of
 highmem.  For these reasons, the following patch divorces CONFIG_X86_PAE
 from CONFIG_HIGHMEM64G.

On Thu, Jul 19, 2007 at 03:52:29PM +0100, Christoph Hellwig wrote:
 What overhead of instability of highmem?  Sorry folks but this is utter
 bollocks.  Back in the Caldera days we did a lot of measurement on highmem
 overhead, and CONFIG_HIGHMEM has no measurable overhead at all on a system
 that doesn't use it.  CONFIG_HIGHMEM64G on the other hand has
 a quite visible overhead on small systems, but that's entirely due to the
 bigger page table entries that you need for NX.

The missing context here is CONFIG_VMSPLIT on laptops.

Laptop users, who frequently use CONFIG_VMSPLIT options to avoid
highmem, wanted to turn on NX. Prior to the patch, those options were
barred for all highmem configurations. In response to those requests,
I produced the patch.

The overhead and instability derived from tiny zones as opposed to
kmap()/kunmap(), or at least such was the case historically.


-- wli


Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)

2007-07-18 Thread William Lee Irwin III
On Tue, Jul 17, 2007 at 10:47:37AM -0700, William Lee Irwin III wrote:
>> You may rest assured that it's technically feasible. It's been done.
>> The larger obstacles to all this are nontechnical.

On Tue, Jul 17, 2007 at 09:33:08PM +0200, Andrea Arcangeli wrote:
> Back then there was no variable order page size proposal, no slub,
> generally nothing of that kind.
> I think these days it's worth getting it working again and solving the
> technical obstacles one more time. Then we should plug into it a
> pagecache logic to handle small files. That means if the soft page
> size is 64k, we should kmalloc 32k of pagecache if the file is < 64k
> but >= 32k, or kmalloc 16k if the file is < 32k but >= 16k, etc...

Actually I'd worked on what was called MPSS (Multiple Page Size Support)
before I ever started on pgcl. Some large portion of the pgcl proposal
as I presented it internally was to reduce the order of large page
allocations and provide a promotion and demotion mechanism enabling
different processes to have different sized translations for the same
large page, and hence no out-of-context pagetable/TLB updates during
promotion and demotion, essentially by making the TLB translation to
page relation M:N. ISTR describing this in a KS presentation for which
IIRC you were present. But that's neither here nor there.


On Tue, Jul 17, 2007 at 09:33:08PM +0200, Andrea Arcangeli wrote:
> Down to 32bytes if we memcpy the 32bytes away to a 64k page, and we
> disable the logic the moment somebody attempts to mmap the "kmalloced"
> pagecache (which I think it's a lot simpler than trying to mmap a
> kmalloced 4k naturally aligned object into userland). I wouldn't call
> it tail packing, it's more a fine-granular pagecache with the already
> available kmalloc granularities. That will maximize pagecache
> utilization with read syscall for hg/git compared to current 2.6.22
> plus memory will be allocated faster in 64k chunks etc... Ideally it
> should be possible to disable the finer-granular-kmalloc-pagecache on
> the big irons with lots of memory and only working with big files.

In any event, that is a sound strategy for mitigating internal
fragmentation of pagecache, though internal fragmentation of anonymous
memory has more severe consequences and is less easily mitigated.


-- wli


Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)

2007-07-17 Thread William Lee Irwin III
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> BTW, in a parallel thread (the thread where it was suggested that I
> post this), Rik rightfully mentioned Bill once also tried to get this
> working and basically asked for the differences. I don't know exactly
> what Bill did, I only remember well the major reason he did it. Below
> I add some more comments on Bill's work, taken from my answer to Rik:

I got it working. It merely bitrotted faster than I could maintain it.


On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> Right, I almost forgot he also tried enlarging the PAGE_SIZE at some
> point, back then it was for the 32bit systems with 64G of ram, to
> reduce the mem_map array, something my patch achieves too btw.

It was done for the occasion of the first publicly-announced boot of
Linux on a 64GB x86-32 machine.


On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> I thought his approach was of the old type, not backwards compatible,
> the one we also thought for amd64, and I seem to remember he was
> trying to solve the backwards compatibility issue without much
> success.

It was not of the old type. It followed Hugh's strategy, which made
it fully backward-compatible. The only deficits in terms of success
were performance, maintenance, and attracting any sort of audience.
The only tester besides myself was literally Zwane.


On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> But really I'm unsure how Bill could achieve anything backwards
> compatible back then without anon-vma... anon-vma is the enabler. I
> remember he worked on enlarging the PAGE_SIZE back then, but I don't
> recall him exposing HARD_PAGE_SIZE to the common code either (actually
> I've never seen his code so I can't be sure of this). Even if he had pte
> chains back then, reaching the pte wasn't enough and I doubt he could
> unwalk the pagetable tree from pte up to pmd up to pgd/mm, up to vma
> to read the vm_pgoff that btw was meaningless back then for the anon
> vmas ;).

It was exposed to the common code as MMUPAGE_SIZE. Significant pte
vectoring code in the core was involved, as well as partial page
distribution policies, mmap()/mprotect() et al handling splitting
across physical page boundaries, and the like. When done wrong,
applications such as /sbin/init didn't run. It was all there, though
Hugh's earlier implementation was far superior.

pte_chains didn't make things anywhere near as awkward as highpte.
pte_chains didn't really care so much how large an area a struct
page tracked. highpte OTOH needed more effort, though I don't recall
specifically why anymore.

My long-dead code should be at:

ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/pgcl/

dmesg's from 64GB x86-32 machines are also in that directory, dating
from March 2003.


On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> Things are very complex, but I think it's possible by doing proper
> math on vm_pgoff, vm_start/vm_end and address; with just those 4 things
> we should have enough info to know which parts of each page to map in
> which pte, and that's all we need to solve it. A second mprotect
> of 4k over the same 8k page will get two vmas queued in the same
> anon-vma. So we check both vmas, and by looking at the vm_pgoff(hardpage
> units)+(((address-vm_start)&~PAGE_MASK)>>HARD_PAGE_SHIFT) we should be
> able to tell if the ptes behind the vma need to be updated and if the
> second vma can be merged back.
> The idea to make it work is to synchronously map all the ptes for all
> indexes covered by each page as long as they're in the range
> vm_start>>HARD_PAGE_SHIFT to vm_end >> HARD_PAGE_SHIFT. We should
> treat a page fault like a multiple page fault. Then when you mprotect
> or mremap you already know which ptes are mapped and that you need to
> unmap/update by looking at the start/end hard-page-indexes, and you also
> have to always check all vmas that could possibly map that page, if
> the page crosses the vm_start/vm_end boundary.

Hugh had this all worked out in 2001. I explored some alternatives in
the design space, but they didn't perform as well as the original.
It's best to refer to his original patch for reference as it's far
cleaner, though in principle one should be able to find machines where
the late 2.5.x patches I did will run. It was never exposed to a very
broad variety of systems, so I can't vouch for much beyond NUMA-Q and
ThinkPad and whatever Zwane booted it on.
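
The index arithmetic in the quoted paragraph can be made concrete with a
small userspace toy. This is only a sketch under stated assumptions, not
the pgcl code nor Andrea's exact expression: the sizes (4k MMU pages, 8k
software pages), struct toy_vma, and the helper names hard_index(),
hard_slot() and fault_span() are all invented for illustration, and
vm_pgoff is assumed to be counted in hard-page units.

```c
/* Assumed sizes for illustration only: 4k MMU pages, 8k software pages. */
#define HARD_PAGE_SHIFT 12UL
#define SOFT_PAGE_SHIFT 13UL
#define HARD_PER_SOFT   (1UL << (SOFT_PAGE_SHIFT - HARD_PAGE_SHIFT))

/* Toy stand-in for a vma; vm_pgoff is counted in hard (MMU) pages here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* Index, in hard-page units, of the hard page backing addr. */
static unsigned long hard_index(const struct toy_vma *vma, unsigned long addr)
{
	return vma->vm_pgoff + ((addr - vma->vm_start) >> HARD_PAGE_SHIFT);
}

/* Which hard-page-sized piece of its software page that index falls in. */
static unsigned long hard_slot(const struct toy_vma *vma, unsigned long addr)
{
	return hard_index(vma, addr) & (HARD_PER_SOFT - 1);
}

/*
 * Hard indexes a fault at addr may populate: the whole software page,
 * clipped to the vma. The result is the half-open range [*first, *last).
 */
static void fault_span(const struct toy_vma *vma, unsigned long addr,
		       unsigned long *first, unsigned long *last)
{
	unsigned long idx = hard_index(vma, addr);
	unsigned long lo = idx & ~(HARD_PER_SOFT - 1);
	unsigned long hi = lo + HARD_PER_SOFT;
	unsigned long vma_lo = vma->vm_pgoff;
	unsigned long vma_hi = hard_index(vma, vma->vm_end);

	*first = lo < vma_lo ? vma_lo : lo;
	*last = hi > vma_hi ? vma_hi : hi;
}
```

With a vma starting at 0x1000 with vm_pgoff 1, the first hard page lands
in slot 1 of its software page, so the slot follows the offset and not
the alignment of the virtual address. That is the case where a software
page straddles a vm_start/vm_end boundary and more than one vma has to
be consulted.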


On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> Easy definitely not, but feasible I hope yes because I couldn't think
> of a case where we can't figure out which part of the page to map in
> which pte. I wish I had it implemented before posting because then I
> would be 100% sure it was feasible ;).
> Now if somebody here can think of a case where we can't know where to
> map which part of the page in which pte, then *that* would be very
> interesting and it could save some wasted development effort. Unless
> this

Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

2007-07-17 Thread William Lee Irwin III
At some point in the past, I wrote:
>> If at some point one of the pro-4k stacks crowd can prove that all
>> code paths are safe, or introduce another viable alternative (such as
>> Matt's idea for extending the stack dynamically), then removing the 8k
>> stacks option makes sense.

On Mon, Jul 16, 2007 at 11:54:38PM +0100, Alan Cox wrote:
> Any x86-32 path unsafe with 4K stacks is almost certainly unsafe with 8K
> stacks because the 8K stacks do not have separate IRQ stack paths, so you
> have the same space but split. It might be less predictable on 8K stacks
> but it isn't absent.

At hch's suggestion I rewrote the separate IRQ stack configurability
patch into one making IRQ stacks mandatory and unconfigurable, and
hence enabled with 8K stacks.


-- wli


Re: state of stack patches

2007-07-06 Thread William Lee Irwin III
On Thu, Jul 05, 2007 at 01:34:25PM -0700, Jeremy Fitzhardinge wrote:
> What's the state of your stack patches?  I'm still using the ones you 
> posted some time ago, and they seem like useful things to have in the 
> kernel.  Is there anything preventing you from pushing them upstream?

Just one thing: 2.6.22. I can, of course, do updating for -mm, -ak, et al.


-- wli


Re: [RFC] fsblock

2007-06-23 Thread William Lee Irwin III
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...

Long overdue. Thank you.


-- wli


Re: JIT emulator needs

2007-06-23 Thread William Lee Irwin III
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
>>> c. open() flag to unlink a file before returning the fd

On Jun 19, 2007, at 11:08:24, William Lee Irwin III wrote:
>> You probably want a tmpfile(3) -like affair which never has a  
>> pathname to begin with. It could be useful for security purposes  
>> more generally.

On Fri, Jun 22, 2007 at 11:52:12PM -0400, Kyle Moffett wrote:
> maybe this: open("/some/dir", O_TMPFILE);
> and this? open("/some/dir", O_TMPFILE|O_DIRECTORY);
> The former would return a filehandle to a new anonymous file  
> somewhere on whatever filesystem backs the specified path.  The  
> latter would do the same, except create an anonymous directory where  
> you could use "openat()" or something.  Presumably "lsof" and "/proc"  
> should show either type of handle as referring to either "/some/ 
> filesystem/" or "/some/filesystem/ (anonymous temp file)" or something.

This is plausible (and I did indeed consider the file variant),
though it may require more infrastructure than for tmpfs only.

It may be worth clarifying that I have no concrete plans to work on
the JIT emulator issues myself. I'm only disseminating ideas I think
will pass review. I expect others to take up the issue(s) perhaps with
some inspiration from what I described. I may review some, but I have
a large review backlog as things now stand.


-- wli


Re: JIT emulator needs

2007-06-20 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> I presumed an ELF note or extended filesystem attributes were already
>> in place for this sort of affair. It may be that the model implemented
>> is so restrictive that users are forbidden to create new executables,
>> in which case using a different model is certainly in order. Otherwise
>> the ELF note or attributes need to be implemented.

On Wed, Jun 20, 2007 at 09:37:31AM -0700, H. Peter Anvin wrote:
> Another thing to keep in mind, since we're talking about security
> policies in the first place, is that anything like this *MUST* be
> "opt-in" on the part of the security policy, because what we're talking
> about is circumventing an explicit security policy just based on a
> user-provided binary saying, in effect, "don't worry, I know what I'm
> doing."
> Changing the meaning of an established explicit security policy is not
> acceptable.

This is what I had in mind with the commentary on the intentions of the
policy. Thank you for correcting my hamhanded attempt to describe it.


-- wli


Re: JIT emulator needs

2007-06-20 Thread William Lee Irwin III
On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> If the policy forbidding self-modifying code lacks a method of
>> exempting programs such as JIT interpreters (which I doubt) then
>> it's a problem. I'm with Alan on this one.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> It does and it doesn't. There is not a reasonable way for a
> user to mark an app as needing full self-modifying ability.
> It's not like the executable stack, which can be set via the
> ELF note markings on the executable. (ELF note markings are
> ideal because they can not be used via a ret-to-libc attack)
> With admin privs, one can change SE Linux settings. Mark the
> executable, disable the protection system-wide, generate a
> completely new SE Linux policy, or just turn SE Linux off.
> Normally we don't expect/require admin privs to install an
> executable in one's own ~/bin directory. This is broken.
> It ought to be easier to get a JIT working well without
> enabling arbitrary mprotect. This would allow a JIT to
> partially benefit from the recent security enhancements.
> (think of all the buggy browser-based JIT things!)

I presumed an ELF note or extended filesystem attributes were already
in place for this sort of affair. It may be that the model implemented
is so restrictive that users are forbidden to create new executables,
in which case using a different model is certainly in order. Otherwise
the ELF note or attributes need to be implemented.


On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> This sort of logic might be appropriate for a sort of parametrized
>> and specialized vma allocator setting the policy in /proc/ along
>> with various sorts of limits. There are limits to such and at some
>> point things will have to manually manage their own process address
>> spaces in a platform-specific fashion. If kernel assistance here is
>> rejected they may have to do so in all cases.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> I prefer ELF notes (for start-up allocations) and prctl,
> plus a mmap flag for per-allocation behavior.

Beware that the kernel (upstream of me) will likely refuse to support
exotic mmap() placement policies. At that point userspace will have
to implement them itself with a front-end to mmap().

Userspace can actually live without kernel placement support for
everything but the executable itself, which is already implemented via
ELF loading standards. This is not to downplay the tremendous amounts
of pain involved for moving the stack, getting ld.so to land in the
right place, and so on. Actually I'm less sure about .interp placement.
In any event, exotic virtualspace allocation policies are largely yet
another "simple matter of programming" implementable entirely in
userspace.


On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> This is a bad idea. The standard semantics are needed for programs
>> relying upon them.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> I didn't mean that the default default :-) setting would change.
> I meant that people could change the behavior from a boot script.
> Things that break are really foul and nasty anyway, probably with
> serious problems that ought to get fixed.

It's actually not a good idea to make it the default even via sysctl.
People won't realize something will break until it does, and what will
break is likely to be a database responsible for data integrity. The
IPC_RMID creation flag should suffice.


On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> You probably want a tmpfile(3) -like affair which never has a pathname
>> to begin with. It could be useful for security purposes more generally.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> Yes, exactly. I think there are some possible optimizations
> available too, particularly with the cifs filesystem.

I doubt this will be controversial, but it's not clear to me that there
is any convenient way to obtain an anonymous inode on anything but tmpfs,
in which case it's not really anonymous, but not visible to userspace on
account of the default kern_mount(). Essentially it's possible to hoist
the tmpfile name generation in-kernel to where it's in a disconnected
namespace not visible to any userspace whatsoever, and kernel threads
can cooperatively ensure safety via access discipline. Alternatively,
one could kern_mount() a fresh tmpfs filesystem for some concurrency
domain, e.g. per-uid, per-process, or per-thread.


On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> This sounds vaguely like another syscall, like mdup(). This is
>> particularly meaningful in the context of anonymous memory, for
>> which there is no method of replicating mappings with

Re: JIT emulator needs

2007-06-20 Thread William Lee Irwin III
On 6/19/07, William Lee Irwin III [EMAIL PROTECTED] wrote:
 If the policy forbidding self-modifying code lacks a method of
 exempting programs such as JIT interpreters (which I doubt) then
 it's a problem. I'm with Alan on this one.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
 It does and it doesn't. There is not a reasonable way for a
 user to mark an app as needing full self-modifying ability.
 It's not like the executable stack, which can be set via the
 ELF note markings on the executable. (ELF note markings are
 ideal because they can not be used via a ret-to-libc attack)
 With admin privs, one can change SE Linux settings. Mark the
 executable, disable the protection system-wide, generate a
 completely new SE Linux policy, or just turn SE Linux off.
 Normally we don't expect/require admin privs to install an
 executable in one's own ~/bin directory. This is broken.
 It ought to be easier to get a JIT working well without
 enabling arbitrary mprotect. This would allow a JIT to
 partially benefit from the recent security enhancements.
 (think of all the buggy browser-based JIT things!)

I presumed an ELF note or extended filesystem attributes were already
in place for this sort of affair. It may be that the model implemented
is so restrictive that users are forbidden to create new executables,
in which case using a different model is certainly in order. Otherwise
the ELF note or attributes need to be implemented.


On 6/19/07, William Lee Irwin III [EMAIL PROTECTED] wrote:
>> This sort of logic might be appropriate for a sort of parametrized
>> and specialized vma allocator setting the policy in /proc/ along
>> with various sorts of limits. There are limits to such and at some
>> point things will have to manually manage their own process address
>> spaces in a platform-specific fashion. If kernel assistance here is
>> rejected they may have to do so in all cases.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> I prefer ELF notes (for start-up allocations) and prctl,
> plus a mmap flag for per-allocation behavior.

Beware that the kernel (upstream of me) will likely refuse to support
exotic mmap() placement policies. At that point userspace will have
to implement them itself with a front-end to mmap().

Userspace can actually live without kernel placement support for
everything but the executable itself, which is already implemented via
ELF loading standards. This is not to downplay the tremendous amounts
of pain involved for moving the stack, getting ld.so to land in the
right place, and so on. Actually I'm less sure about .interp placement.
In any event, exotic virtualspace allocation policies are largely yet
another simple matter of programming implementable entirely in
userspace.


On 6/19/07, William Lee Irwin III [EMAIL PROTECTED] wrote:
>> This is a bad idea. The standard semantics are needed for programs
>> relying upon them.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> I didn't mean that the default default :-) setting would change.
> I meant that people could change the behavior from a boot script.
> Things that break are really foul and nasty anyway, probably with
> serious problems that ought to get fixed.

It's actually not a good idea to make it the default even via sysctl.
People won't realize something will break until it does, and what will
break is likely to be a database responsible for data integrity. The
IPC_RMID creation flag should suffice.


On 6/19/07, William Lee Irwin III [EMAIL PROTECTED] wrote:
>> You probably want a tmpfile(3)-like affair which never has a pathname
>> to begin with. It could be useful for security purposes more generally.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> Yes, exactly. I think there are some possible optimizations
> available too, particularly with the cifs filesystem.

I doubt this will be controversial, but it's not clear to me that there
is any convenient way to obtain an anonymous inode on anything but tmpfs,
in which case it's not really anonymous, but not visible to userspace on
account of the default kern_mount(). Essentially it's possible to hoist
the tmpfile name generation in-kernel to where it's in a disconnected
namespace not visible to any userspace whatsoever, and kernel threads
can cooperatively ensure safety via access discipline. Alternatively,
one could kern_mount() a fresh tmpfs filesystem for some concurrency
domain, e.g. per-uid, per-process, or per-thread.


On 6/19/07, William Lee Irwin III [EMAIL PROTECTED] wrote:
>> This sounds vaguely like another syscall, like mdup(). This is
>> particularly meaningful in the context of anonymous memory, for
>> which there is no method of replicating mappings within a single
>> process address space.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:
> Yes, mdup() and probably mdup2(). It could be mremap flags or not.
> JIT emulators generally need a second mapping so that they can
> have both read/write and execute

Re: JIT emulator needs

2007-06-20 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> I presumed an ELF note or extended filesystem attributes were already
>> in place for this sort of affair. It may be that the model implemented
>> is so restrictive that users are forbidden to create new executables,
>> in which case using a different model is certainly in order. Otherwise
>> the ELF note or attributes need to be implemented.

On Wed, Jun 20, 2007 at 09:37:31AM -0700, H. Peter Anvin wrote:
> Another thing to keep in mind, since we're talking about security
> policies in the first place, is that anything like this *MUST* be
> opt-in on the part of the security policy, because what we're talking
> about is circumventing an explicit security policy just based on a
> user-provided binary saying, in effect, "don't worry, I know what I'm
> doing."
> Changing the meaning of an established explicit security policy is not
> acceptable.

This is what I had in mind with the commentary on the intentions of the
policy. Thank you for correcting my hamhanded attempt to describe it.


-- wli


Re: JIT emulator needs

2007-06-19 Thread William Lee Irwin III
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Right now, Linux isn't all that friendly to JIT emulators.
> Here are the problems and suggestions to improve the situation.
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem. You can mmap a file twice,
> or you can abuse SysV shared memory. The mmap method requires
> that you know of a filesystem mounted rw,exec where you can
> write a very large temporary file. This arbitrary filesystem,
> rather than swap space, will be the backing store. The SysV
> shared memory method requires an undocumented flag and is
> subject to some annoying size limits. Both methods create
> objects that will fail to be deleted if the program dies
> before marking the objects for deletion.

If the policy forbidding self-modifying code lacks a method of
exempting programs such as JIT interpreters (which I doubt) then
it's a problem. I'm with Alan on this one.
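
For readers unfamiliar with the "mmap a file twice" workaround described
in the quoted text, a minimal sketch follows. The helper name map_twice()
and the /tmp template are assumptions made for illustration; a real JIT
would need a directory on a filesystem mounted rw,exec and would request
PROT_READ | PROT_EXEC for the second view.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same temporary file twice, so that one view can be written
 * through and the other carries a different protection. The template
 * path is an assumption. Returns 0 on success, -1 on failure.
 */
static int map_twice(size_t len, int prot_a, int prot_b, void **a, void **b)
{
	char path[] = "/tmp/jit-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	/* If the process dies before this unlink(), the file is left behind. */
	unlink(path);
	if (ftruncate(fd, (off_t)len) < 0)
		goto fail;

	*a = mmap(NULL, len, prot_a, MAP_SHARED, fd, 0);
	if (*a == MAP_FAILED)
		goto fail;
	*b = mmap(NULL, len, prot_b, MAP_SHARED, fd, 0);
	if (*b == MAP_FAILED) {
		munmap(*a, len);
		goto fail;
	}
	close(fd);	/* the mappings keep the file alive */
	return 0;
fail:
	close(fd);
	return -1;
}
```

Both views are MAP_SHARED, so a store through the first view is visible
through the second. The backing store is the filesystem rather than swap,
and the window between mkstemp() and unlink() is where a crash leaves an
object behind, both as described in the quoted text.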


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Processors often have annoying limits on the immediate values
> in instructions. An x86 or x86_64 JIT can go a bit faster if
> all allocations are kept to the low 2 GB of address space.
> There are also reasons for a 32bit-to-x86_64 JIT to choose
> a nearly arbitrary 2 GB region that lies above 4 GB.
> Other archs have other limits, such as 32 MB or 256 MB.

This sort of logic might be appropriate for a sort of parametrized
and specialized vma allocator setting the policy in /proc/ along
with various sorts of limits. There are limits to such and at some
point things will have to manually manage their own process address
spaces in a platform-specific fashion. If kernel assistance here is
rejected they may have to do so in all cases.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Sometimes it is very helpful to have the read/write mapping
> be a fixed offset from the read/exec mapping. A power of 2
> can be especially desirable.

As far as the kernel is concerned they're unrelated, so this will
likely need MAP_FIXED barring a staggering array of fresh system
calls to act on tuples of memory ranges in lockstep.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Emulators often need a cheap way to change page permissions.
> One VMA per page is no good. Besides taking up space and making
> many things generally slower, having one VMA per page causes
> a huge performance loss for snapshot roll-back operations.
> Just tearing down all those VMAs takes a good while.

remap_file_pages_prot() is reputedly waiting in the wings somewhere
for this.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> Additions to better support JIT emulators:
> a. sysctl to set IPC_RMID by default

This is a bad idea. The standard semantics are needed for programs
relying upon them.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> b. shmget() flag to set IPC_RMID by default

This is relatively innocuous.
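
For context, the usual way to approximate this today is to mark the
segment for removal immediately after attaching it. The sketch below
assumes a hypothetical helper name, shm_alloc_autoremove(), and uses only
the standard shmget()/shmat()/shmctl() calls.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Create a private SysV segment, attach it, and mark it for removal at
 * once, so the kernel destroys it when the last attachment goes away.
 * The gap between shmget() and shmctl(IPC_RMID) is the window in which
 * a crash leaks the segment; a creation-time flag would close it.
 * Returns the attached address, or NULL on failure.
 */
static void *shm_alloc_autoremove(size_t len)
{
	int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
	void *p;

	if (id < 0)
		return NULL;
	p = shmat(id, NULL, 0);
	/* Mark for deletion whether or not the attach succeeded. */
	shmctl(id, IPC_RMID, NULL);
	return p == (void *)-1 ? NULL : p;
}
```

The caller releases the memory with shmdt(); the segment then disappears
because it has already been marked for removal.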


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> c. open() flag to unlink a file before returning the fd

You probably want a tmpfile(3)-like affair which never has a pathname
to begin with. It could be useful for security purposes more generally.
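
As a point of comparison, the existing tmpfile(3) already yields a file
with no remaining directory entry, although a traditional implementation
creates a named file and unlinks it straight away, so a pathname does
exist briefly. The sketch below, with the hypothetical helper name
tmpfile_link_count(), just reports the link count of such a file.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Return the link count of a file created by tmpfile(3), or -1 on error.
 * A count of 0 means no directory entry refers to the file, so it
 * disappears when the stream is closed or the process dies.
 */
static long tmpfile_link_count(void)
{
	FILE *f = tmpfile();
	struct stat st;
	long n = -1;

	if (!f)
		return -1;
	if (fstat(fileno(f), &st) == 0)
		n = (long)st.st_nlink;
	fclose(f);
	return n;
}
```

On typical local filesystems the count is expected to be 0. The
descriptor from fileno() can be passed to ftruncate() and mmap() in the
same way as any other file descriptor.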


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> d. mremap() flag to always keep the old mapping

This sounds vaguely like another syscall, like mdup(). This is
particularly meaningful in the context of anonymous memory, for
which there is no method of replicating mappings within a single
process address space.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> e. mremap() flag to get a read/write mapping of a read/exec one
> f. mremap() flag to get a read/exec mapping of a read/write one

Presumably to be used in conjunction with keeping the old mapping.
A composite mdup()/mremap() and mprotect(), presumably saving a TLB
flush or other sorts of overhead, may make some sort of sense here.
Odds are it'll get rejected as the sequence of syscalls is a rather
precise equivalent, though it would optimize things (as would other
composite syscalls, e.g. ones combining fork() and execve() etc.).


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> g. mremap() flag to make the 5th arg (new addr) be the upper limit
> h. 6-bit wide mremap() "flag" to set the upper limit above given base

Essentially more placement support for mremap()/mdup(). It's not clear
to me those particular semantics are the ideal ones. A target range
for placement should do, if not manual address space management.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> i. support the prot argument to remap_file_pages

This is probably going to happen anyway.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:
> j. a documented way (madvise?) to punch same-VMA zero-page holes

This is MADV_REMOVE, though most filesystems don't

Re: [patch 00/14] Page cache cleanup in anticipation of Large Blocksize support

2007-06-17 Thread William Lee Irwin III
On Sun, 17 Jun 2007, Matt Mackall wrote:
>> Is it? Last I looked it had reverted to handing out reverse-contiguous
>> pages.

On Sun, Jun 17, 2007 at 07:08:41PM -0700, Christoph Lameter wrote:
> I thought that was fixed? Bill Irwin was working on it.
> But the contiguous pages usually only work shortly after boot. After 
> awhile memory gets sufficiently scrambled that the coalescing in the I/O 
> layer becomes ineffective.

It fell off the bottom of my priority queue, sorry.


-- wli


Re: [2/2] 2.6.22-rc4: known regressions v3

2007-06-14 Thread William Lee Irwin III
On Thu, Jun 14, 2007 at 03:57:25PM +0100, Mark Fortescue wrote:
> Benh's ptep_set_access_flags() patch needs to be applied in order to get
> anywhere with sun4c for all kernels >= linux-2.6.15. If not applied, you
> will be lucky to get sash running as your init and even that will have
> very limited capabilities before it locks up the processor (power up
> reset required).
> It has been applied to both the kernels I used for testing so this
> problem is independent of the ptep_set_access_flags patch but that
> does not mean that it is not a related issue.
> I will try to get some testing done over the weekend to narrow down
> when the random illegal instructions first occur.
> If I start with 2.6.21 then if that is OK, then I should be able to narrow
> the issue down without too much trouble. If it is between 2.6.20 and
> 2.6.21 then it will be a right pig as there are a large number of commits
> that don't compile for sun4c between these two. What I am hoping is that
> it occurs in the 2.6.22-rc2 as per the x86_64.

Sounds like I'll be digging through my hardware stockpiles this weekend
to find a functional sun4c box.


-- wli


Re: [2/2] 2.6.22-rc4: known regressions v3

2007-06-14 Thread William Lee Irwin III
On Thu, Jun 14, 2007 at 11:30:25AM +0100, Mark Fortescue wrote:
> They appear as soon as simpleinit starts up. Sometimes I get to a login
> prompt before seeing any. Other times, commands in the simpleinit rc
> script fail.
> They do appear to be random. If a command fails, you re-run the command
> and it is OK. Commands seen to fail are basic (depmod, rm, cat ..).
> The tests I did used the same binaries with both the OK and problem kernels
> so it is not a change to the application code, it is definitely a kernel
> issue.

This sounds like it may be addressed by benh's ptep_set_access_flags()
fixes. Those fixes are still in -mm, hopefully to hit mainline by 2.6.22.


-- wli


Re: [2/2] 2.6.22-rc4: known regressions v3

2007-06-13 Thread William Lee Irwin III
On Wed, Jun 13, 2007 at 11:25:20PM +0100, Mark Fortescue wrote:
> The random seg faults on x86_64 are interesting as I have been getting
> random illegal instruction faults on sparc (sun4c) with 2.6.22-rc3. I have 
> not yet tried to track it down. All I know at present is that it is not a 
> problem on 2.6.20.9.

Very interesting. Any hints as to how to test or how long to wait
before the illegal instructions happen?


-- wli


Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

2007-06-12 Thread William Lee Irwin III
On Tue, Jun 12, 2007 at 12:20:52AM -0600, Eric W. Biederman wrote:
> Does this perhaps need to be:
>> diff --git a/ipc/shm.c b/ipc/shm.c
>> index 4fefbad..8d2672d 100644
>> --- a/ipc/shm.c
>> +++ b/ipc/shm.c
>> @@ -254,8 +254,10 @@ struct mempolicy *shm_get_policy(struct vm_area_struct
>> *vma, unsigned long addr)
>> 
>> +pol = NULL;
>>  
>>  if (sfd->vm_ops->get_policy)
>>  pol = sfd->vm_ops->get_policy(vma, addr);
>> -else
>> +else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
>>  pol = vma->vm_policy;
>>  return pol;

Those paths are above the level where shm_get_policy() is called.
It may be that shm_get_policy() doesn't need to recapitulate them 
if it's only ever called through such codepaths. It's not clear to
me whether that's intended as an invariant or is coincidental and
not guaranteed for future callsites.


-- wli


Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

2007-06-11 Thread William Lee Irwin III
On Mon, Jun 11, 2007 at 09:30:20PM -0700, Andrew Morton wrote:
> Can we just double-check the refcounting please?

The refcounting for mpol's doesn't look good in general. I'm more
curious as to what releases the refcounts. alloc_page_vma(), for
instance, does get_vma_policy() which eventually takes a reference,
without ever releasing the reference it acquires. get_vma_policy()
itself uses a similar idiom to that used in aglitke's patch. I think
mpol refcounting needs to be addressed elsewhere besides this patch.


-- wli


Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

2007-06-11 Thread William Lee Irwin III
On Mon, Jun 11, 2007 at 04:34:54PM -0500, Adam Litke wrote:
> Here's another breakage as a result of shared memory stacked files :(
> The NUMA policy for a VMA is determined by checking the following (in
> the order given):
> 1) vma->vm_ops->get_policy() (if defined)
> 2) vma->vm_policy (if defined)
> 3) task->mempolicy (if defined)
> 4) Fall back to default_policy
> By switching to stacked files for shared memory, get_policy() is now
> always set to shm_get_policy which is a wrapper function.  This
> causes us to stop at step 1, which yields NULL for hugetlb instead of
> task->mempolicy which was the previous (and correct) result.
> This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
> wrapped vm_ops.  Andi and Christoph, does this look right to you?
> Signed-off-by: Adam Litke <[EMAIL PROTECTED]>

Thanks for fielding this. The fix is certainly clear enough.

Acked-by: William Irwin <[EMAIL PROTECTED]>
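
To make the lookup order in the quoted description concrete, here is a
simplified userspace model. It is a sketch only: the structures are
stand-ins rather than the kernel's real layouts, the names
lookup_policy(), wrapper_before() and wrapper_after() are hypothetical,
and wrapper_after() follows the description of the fix rather than
reproducing the actual patch.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real layouts. */
struct mempolicy { int mode; };
struct vma;
typedef struct mempolicy *(*get_policy_fn)(struct vma *, unsigned long);

struct vma {
	get_policy_fn get_policy;         /* models vma->vm_ops->get_policy */
	get_policy_fn wrapped_get_policy; /* models the wrapped file's vm_ops */
	struct mempolicy *vm_policy;      /* models vma->vm_policy */
};

static struct mempolicy default_policy;
static struct mempolicy *task_policy;     /* models task->mempolicy */

/* Steps 1-4 of the lookup order quoted above. */
static struct mempolicy *lookup_policy(struct vma *vma, unsigned long addr)
{
	struct mempolicy *pol = task_policy;          /* step 3 */

	if (vma->get_policy)
		pol = vma->get_policy(vma, addr);     /* step 1: search stops here */
	else if (vma->vm_policy)
		pol = vma->vm_policy;                 /* step 2 */
	return pol ? pol : &default_policy;           /* step 4 */
}

/* Wrapper as described before the fix: NULL when nothing is wrapped. */
static struct mempolicy *wrapper_before(struct vma *vma, unsigned long addr)
{
	return vma->wrapped_get_policy ?
		vma->wrapped_get_policy(vma, addr) : NULL;
}

/* Wrapper with the vm_policy and task fallbacks restored. */
static struct mempolicy *wrapper_after(struct vma *vma, unsigned long addr)
{
	if (vma->wrapped_get_policy)
		return vma->wrapped_get_policy(vma, addr);
	if (vma->vm_policy)
		return vma->vm_policy;
	return task_policy;
}
```

With wrapper_before() installed and nothing wrapped, lookup_policy()
falls through to default_policy even when a task policy is set; with
wrapper_after() it yields the task policy, which matches the previous
behaviour described in the quoted text.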


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-11 Thread William Lee Irwin III
On Thu, Jun 07, 2007 at 07:35:51PM -0700, William Lee Irwin III wrote:
>> +  PAE is required for NX support, and furthermore enables
>> +  larger swapspace support for non-overcommit purposes. It
>> +  has the cost of more pagetable lookup overhead, and also
>> +  consumes more pagetable space per process.

On Tue, Jun 12, 2007 at 01:52:35AM +0200, Adrian Bunk wrote:
> It's not specific to this help text, but I am starting to become a bit
> picky about these issues:
> If you understand this help text after reading it, you don't need a help 
> text for this option...  ;-)
> What is "NX support"?
> What are "non-overcommit purposes"?
> What is "pagetable lookup overhead"?
> And if in doubt, should I say Y or N?
> "System administrator who knows which hardware components he put into 
> the computer and which filesystems his data is on" might be a good 
> description for the average kconfig user, and these are the people who 
> should understand this help text.

I would like to have some place to explain issues such as those, but
there are as of yet no designated places for tutorial-level information.

If such a place were provided, I would provide storybook commentary to
explain all those. The same actually holds for kernel function docbook.
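
As a small example of the tutorial-level material in question, the
"pagetable lookup overhead" can be shown by how a 32-bit x86 virtual
address with 4 KB pages is split into table indices with and without PAE.
The struct and function names below are hypothetical, chosen only for
illustration.

```c
#include <stdint.h>

/* Index fields of a 32-bit x86 virtual address with 4 KB pages. */
struct va_fields {
	unsigned top;     /* PAE: 2-bit PDPT index; non-PAE: unused (0) */
	unsigned dir;     /* page directory index */
	unsigned table;   /* page table index */
	unsigned offset;  /* byte offset within the page */
};

/* Non-PAE: two levels, 10 + 10 + 12 bits, 4-byte entries. */
static struct va_fields split_nonpae(uint32_t va)
{
	struct va_fields f = { 0, va >> 22, (va >> 12) & 0x3ff, va & 0xfff };
	return f;
}

/* PAE: three levels, 2 + 9 + 9 + 12 bits, 8-byte entries. */
static struct va_fields split_pae(uint32_t va)
{
	struct va_fields f = {
		va >> 30, (va >> 21) & 0x1ff, (va >> 12) & 0x1ff, va & 0xfff
	};
	return f;
}
```

The extra level is the additional lookup overhead, and the 8-byte entries
are why PAE pagetables take more space per process. The NX bit is bit 63
of those 8-byte entries, which is why NX support requires PAE on 32-bit
x86.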


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-11 Thread William Lee Irwin III
On Thu, Jun 07, 2007 at 07:35:51PM -0700, William Lee Irwin III wrote:
 +  PAE is required for NX support, and furthermore enables
 +  larger swapspace support for non-overcommit purposes. It
 +  has the cost of more pagetable lookup overhead, and also
 +  consumes more pagetable space per process.

On Tue, Jun 12, 2007 at 01:52:35AM +0200, Adrian Bunk wrote:
 It's not specific to this help text, but I start becoming a bit picky 
 about this issues:
 If you understand this help text after reading it, you don't need a help 
 text for this option...  ;-)
 What is NX support?
 What are non-overcommit purposes?
 What is pagetable lookup overhead?
 And if in doubt, should I say Y or N?
 System administrator who knows which hardware components he put into 
 the computer and which filesystems his data is on might be a good 
 description for the average kconfig user, and these are the people who 
 should understand this help text.

I would like to have some place to explain issues such as those, but
there are as of yet no designated places for tutorial-level information.

If such a place were provided, I would provide storybook commentary to
explain all those. Similarly actually holds for kernel function docbook.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

2007-06-11 Thread William Lee Irwin III
On Mon, Jun 11, 2007 at 04:34:54PM -0500, Adam Litke wrote:
 Here's another breakage as a result of shared memory stacked files :(
 The NUMA policy for a VMA is determined by checking the following (in
 the order given):
 1) vma-vm_ops-get_policy() (if defined)
 2) vma-vm_policy (if defined)
 3) task-mempolicy (if defined)
 4) Fall back to default_policy
 By switching to stacked files for shared memory, get_policy() is now
 always set to shm_get_policy which is a wrapper function.  This
 causes us to stop at step 1, which yields NULL for hugetlb instead of
 task-mempolicy which was the previous (and correct) result.
 This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for the
 wrapped vm_ops.  Andi and Christoph, does this look right to you?
 Signed-off-by: Adam Litke [EMAIL PROTECTED]

Thanks for fielding this. The fix is certainly clear enough.

Acked-by: William Irwin [EMAIL PROTECTED]


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [shm][hugetlb] Fix get_policy for stacked shared memory files

2007-06-11 Thread William Lee Irwin III
On Mon, Jun 11, 2007 at 09:30:20PM -0700, Andrew Morton wrote:
 Can we just double-check the refcounting please?

The refcounting for mpol's doesn't look good in general. I'm more
curious as to what releases the refcounts. alloc_page_vma(), for
instance, does get_vma_policy() which eventually takes a reference,
without ever releasing the reference it acquires. get_vma_policy()
itself uses a similar idiom to that used in aglitke's patch. I think
mpol refcounting needs to be addressed elsewhere besides this patch.


-- wli


Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-10 Thread William Lee Irwin III
On Sun, Jun 10, 2007 at 04:26:07PM +1000, Paul Mackerras wrote:
> If you don't think we should be bound by POSIX, then you are perfectly
> free to go off and write your own research kernel with whatever
> interface you want, and no programs available to run on it. :)

This isn't fair to research kernels. Breaking applications is not an
active area of research.


-- wli


Re: 2.6.21 numa policy and huge pages not working

2007-06-09 Thread William Lee Irwin III
On Sat, Jun 09, 2007 at 09:10:51PM -0700, dean gaudet wrote:
> ok i've narrowed it some... maybe.
> in commit 8ef8286689c6b5bc76212437b85bdd2ba749ee44 things work fine, numa 
> policy is respected...
> the very next commit bc56bba8f31bd99f350a5ebfd43d50f411b620c7 breaks shm 
> badly causing the test program to oops the kernel.
> commit 516dffdcd8827a40532798602830dfcfc672294c fixes that breakage but 
> numa policy is no longer respected.
> i've added the authors of those two commits to the recipient list and 
> reattached the test program.  hopefully someone can shed light on the 
> problem.

Thanks, this helps a lot.


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-08 Thread William Lee Irwin III
On Fri, Jun 08, 2007 at 10:07:52AM +0200, Mikael Pettersson wrote:
> Is this really needed? I can see why VMSPLIT_{2,3}G_OPT would
> depend on !HIGHMEM, but why would they depend on !X86_PAE?

The only reason they depend on !HIGHMEM is because handling for
1GB-unaligned splits is unimplemented for PAE, which formerly only
occurred in conjunction with HIGHMEM64G. That said, they were oriented
toward avoiding highmem on laptops, hence the broader !HIGHMEM
constraint. The entire point of the patch is to add an option to use
PAE without highmem for the purposes of NX and secondarily expanded
swapspace, at which point CONFIG_VMSPLIT_[23]G_OPT need some other way
besides !HIGHMEM to exclude PAE, such as specifying !X86_PAE directly.


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> Beg your pardon? Are you reading the patch description correctly?

On Thu, Jun 07, 2007 at 08:44:09PM -0700, H. Peter Anvin wrote:
> I mean, with your patch CONFIG_HIGHMEM4G versus CONFIG_HIGHMEM64G really
> don't make sense as separate selections anymore.

I thought about sweeping those up, but defaulted to minimal diffsize.
I can sweep them up given more votes in favor of doing so.


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> !CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
>> by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.


On Thu, Jun 07, 2007 at 08:38:22PM -0700, H. Peter Anvin wrote:
> But what's the point?
> If you're going to divorce these, at least do it in a way that makes
> sense, specifically the two independent variables are PAE and HIGHMEM.
> PAE and !HIGHMEM does make (some amount of) sense, due to no kmap overhead.

Beg your pardon? Are you reading the patch description correctly?


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III
On Thu, 7 Jun 2007 19:35:51 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> PAE is useful for more than supporting more than 4GB RAM. It supports
>> expanded swapspace and NX executable protections. Some users may want
>> NX or expanded swapspace support without the overhead or instability
>> of highmem. For these reasons, the following patch divorces
>> CONFIG_X86_PAE from CONFIG_HIGHMEM64G.

On Thu, Jun 07, 2007 at 07:41:56PM -0700, Andrew Morton wrote:
> Do (CONFIG_X86_PAE && !CONFIG_HIGHMEM64G) and (!CONFIG_X86_PAE && CONFIG_HIGHMEM64G)
> kernels actually work?  I wouldn't be surprised if there are places where we used
> the incorrect one.

!CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.
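
As a rough illustration of which combinations the Kconfig changes in the
patch permit, here is a small C sketch. The struct and the function
config_allowed() are hypothetical and exist only for this example. The
rules are read off the posted Kconfig hunks; the assumption that HIGHMEM4G
and HIGHMEM64G are mutually exclusive (members of one choice) is mine and
is marked as such in the code.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the i386 options touched by the patch. */
struct i386_config {
	bool highmem4g;
	bool highmem64g;
	bool x86_pae;
	bool vmsplit_opt;   /* VMSPLIT_3G_OPT or VMSPLIT_2G_OPT */
};

/* Returns true if the combination satisfies the constraints in the patch. */
static bool config_allowed(const struct i386_config *c)
{
	if (c->highmem4g && c->highmem64g)  /* assumed: options of one choice */
		return false;
	if (c->highmem64g && !c->x86_pae)   /* HIGHMEM64G selects X86_PAE */
		return false;
	if (c->x86_pae && c->highmem4g)     /* X86_PAE depends on !HIGHMEM4G */
		return false;
	if (c->vmsplit_opt && c->x86_pae)   /* VMSPLIT_*_OPT depends on !X86_PAE */
		return false;
	return true;
}
```

Under these rules PAE without HIGHMEM64G is accepted, and HIGHMEM64G
without PAE is rejected, matching the two cases discussed above.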


-- wli


divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III
PAE is useful for more than supporting more than 4GB RAM. It supports
expanded swapspace and NX executable protections. Some users may want
NX or expanded swapspace support without the overhead or instability
of highmem. For these reasons, the following patch divorces
CONFIG_X86_PAE from CONFIG_HIGHMEM64G.

vs. 2.6.22-rc4-mm2

Cc: Mark Lord <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: William Irwin <[EMAIL PROTECTED]>


Index: mm-2.6.22-rc4-2/arch/i386/Kconfig
===
--- mm-2.6.22-rc4-2.orig/arch/i386/Kconfig  2007-06-07 00:05:53.609599701 -0700
+++ mm-2.6.22-rc4-2/arch/i386/Kconfig   2007-06-07 17:02:24.333262965 -0700
@@ -544,6 +544,7 @@
 config HIGHMEM64G
bool "64GB"
depends on X86_CMPXCHG64
+   select X86_PAE
help
  Select this if you have a 32-bit processor and more than 4
  gigabytes of physical RAM.
@@ -573,12 +574,12 @@
config VMSPLIT_3G
bool "3G/1G user/kernel split"
config VMSPLIT_3G_OPT
-   depends on !HIGHMEM
+   depends on !X86_PAE
bool "3G/1G user/kernel split (for full 1G low memory)"
config VMSPLIT_2G
bool "2G/2G user/kernel split"
config VMSPLIT_2G_OPT
-   depends on !HIGHMEM
+   depends on !X86_PAE
bool "2G/2G user/kernel split (for full 2G low memory)"
config VMSPLIT_1G
bool "1G/3G user/kernel split"
@@ -598,10 +599,15 @@
default y
 
 config X86_PAE
-   bool
-   depends on HIGHMEM64G
-   default y
+   bool "PAE (Physical Address Extension) Support"
+   default n
+   depends on !HIGHMEM4G
select RESOURCES_64BIT
+   help
+ PAE is required for NX support, and furthermore enables
+ larger swapspace support for non-overcommit purposes. It
+ has the cost of more pagetable lookup overhead, and also
+ consumes more pagetable space per process.
 
 # Common NUMA Features
 config NUMA
Index: mm-2.6.22-rc4-2/arch/i386/kernel/setup.c
===
--- mm-2.6.22-rc4-2.orig/arch/i386/kernel/setup.c   2007-06-06 23:52:18.839168580 -0700
+++ mm-2.6.22-rc4-2/arch/i386/kernel/setup.c   2007-06-07 17:02:24.349263876 -0700
@@ -273,18 +273,18 @@
printk(KERN_WARNING "Warning only %ldMB will be used.\n",
MAXMEM>>20);
if (max_pfn > MAX_NONPAE_PFN)
-   printk(KERN_WARNING "Use a PAE enabled kernel.\n");
+   printk(KERN_WARNING "Use a HIGHMEM64G enabled kernel.\n");
else
printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n");
max_pfn = MAXMEM_PFN;
 #else /* !CONFIG_HIGHMEM */
-#ifndef CONFIG_X86_PAE
+#ifndef CONFIG_HIGHMEM64G
if (max_pfn > MAX_NONPAE_PFN) {
max_pfn = MAX_NONPAE_PFN;
printk(KERN_WARNING "Warning only 4GB will be used.\n");
-   printk(KERN_WARNING "Use a PAE enabled kernel.\n");
+   printk(KERN_WARNING "Use a HIGHMEM64G enabled kernel.\n");
}
-#endif /* !CONFIG_X86_PAE */
+#endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
} else {
if (highmem_pages == -1)


Re: why does the macro "ZERO_PAGE" take an argument?

2007-06-07 Thread William Lee Irwin III
Robert P. J. Day wrote:
>> although it's not clear where in the source tree are the invocations
>> that would actually make a difference to a MIPS system, which is why
>> i've CC'ed ralf on this.  i'm sure he can clear this up. :-)

On Thu, Jun 07, 2007 at 10:32:29AM -0700, H. Peter Anvin wrote:
> x86 could also benefit from coloured zeropages.  In fact, I thought it
> already had them (K8 wants as many as 8.)

How would one demonstrate the beneficial effect of such?


-- wli


Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Thu, Jun 07, 2007 at 12:19:22AM -0700, Andrew Morton wrote:
> hm, OK, this seems to work:
[...]
> -#ifdef CONFIG_HIGHMEM
> +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
>   return movable_zone == ZONE_HIGHMEM;
>  #else
>   return 0;
> _
> (the first ifdef is just there to trip things at compile time rather than
> link time)

I guess it's not the arch's fault after all. I probably would've
conditionally out-of-lined the thing so as never to expose movable_zone
but this will do just fine.
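
A self-contained illustration of the guard in the quoted hunk, under the
assumption that movable_zone is only defined when
CONFIG_ARCH_POPULATES_NODE_MAP is set, which would explain a link error on
an arch with HIGHMEM but without that option. The helper name and the enum
are stand-ins for this sketch, not copied from the tree.

```c
/* Toggle these to mimic a configuration; this mimics HIGHMEM only. */
#define CONFIG_HIGHMEM 1
/* #define CONFIG_ARCH_POPULATES_NODE_MAP 1 */

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
/* Assumed: the variable only exists on arches that populate the node map. */
static int movable_zone = ZONE_HIGHMEM;
#endif

/* Hypothetical helper mirroring the shape of the patched inline. */
static inline int movable_zone_is_highmem(void)
{
#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
	return movable_zone == ZONE_HIGHMEM;
#else
	return 0;   /* never references movable_zone, so nothing to link against */
#endif
}
```

With only CONFIG_HIGHMEM defined, the function compiles to a constant and
the missing symbol is never referenced.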


-- wli


Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
>> config, please?

On Thu, Jun 07, 2007 at 12:04:07AM -0700, William Lee Irwin III wrote:
> It's the sparc32 defconfig. Included below for completeness.

The error output looks like the following.


-- wli

$ quilt top 
create-the-zone_movable-zone-fix.patch  
$ (yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" CC="gcc-sparc-4.1" \
    quiet=1 -j16 defconfig) >& /dev/null; \
  yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" CC="gcc-sparc-4.1" \
    quiet=1 -j16 image modules
 scripts/kconfig/conf -s arch/sparc/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config symbol 
'PMAC_APM_EMU' refers to undefined symbol 'APM_EMULATION'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config symbol 
'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config symbol 
'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
sound/soc/sh/Kconfig:6:warning: 'select' used by config symbol 
'SND_SOC_PCM_SH7760' refers to undefined symbol 'SH_DMABRG'
  CHK include/linux/version.h
  UPD include/linux/version.h
  CHK include/linux/utsrelease.h
  UPD include/linux/utsrelease.h
  SYMLINK include/asm -> include/asm-sparc
:752:2: warning: #warning syscall setresuid not implemented
:756:2: warning: #warning syscall getresuid not implemented
:776:2: warning: #warning syscall setresgid not implemented
:780:2: warning: #warning syscall getresgid not implemented
  CHK include/linux/compile.h
  UPD include/linux/compile.h
ipc/msg.c: In function 'sys_msgctl':
ipc/msg.c:390: warning: 'setbuf.qbytes' may be used uninitialized in this 
function
ipc/msg.c:390: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.mode' may be used uninitialized in this function
ipc/sem.c: In function 'sys_semctl':
ipc/sem.c:861: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.mode' may be used uninitialized in this function
mm/vmalloc.c: In function 'unmap_kernel_range':
mm/vmalloc.c:75: warning: unused variable 'start'
drivers/char/rtc.c:118: warning: 'hpet_rtc_interrupt' defined but not used
kernel/time/ntp.c: In function 'do_adjtimex':
kernel/time/ntp.c:309: warning: comparison of distinct pointer types lacks a 
cast
kernel/time/ntp.c:312: warning: comparison of distinct pointer types lacks a 
cast
drivers/pci/search.c: In function 'pci_find_slot':
drivers/pci/search.c:99: warning: 'pci_find_device' is deprecated (declared at 
include/linux/pci.h:478)
drivers/pci/search.c: At top level:
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at 
drivers/pci/search.c:241)
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at 
drivers/pci/search.c:241)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at 
drivers/pci/search.c:96)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at 
drivers/pci/search.c:96)
drivers/pci/syscall.c: In function 'sys_pciconfig_read':
drivers/pci/syscall.c:22: warning: 'dev' may be used uninitialized in this 
function
fs/partitions/check.c: In function 'add_partition':
fs/partitions/check.c:392: warning: ignoring return value of 'kobject_add', 
declared with attribute warn_unused_result
fs/partitions/check.c:395: warning: ignoring return value of 
'sysfs_create_link', declared with attribute warn_unused_result
fs/partitions/check.c:402: warning: ignoring return value of 
'sysfs_create_file', declared with attribute warn_unused_result
  CHK include/linux/compile.h
  UPD include/linux/compile.h
WARNING: arch/sparc/kernel/head.o(.text+0x9040): Section mismatch: reference to 
.init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: arch/sparc/kernel/head.o(.text+0x9280): Section mismatch: reference to 
.init.text:execute_in_high_mem (after 'go_to_highmem')
WARNING: arch/sparc/kernel/head.o(.text+0x9284): Section mismatch: reference to 
.init.text:execute_in_high_mem (after 'go_to_highmem')
  Building modules, stage 2.
WARNING: vmlinux(.text+0x9040): Section mismatch: reference to 
.init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: vmlinux(.text+0x9280): Section mismatch: reference to 
.init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x9284): Section mismatch: reference to 
.init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x1dfb

Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Wed, 6 Jun 2007 23:55:44 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> The fully-applied tree fails with a link error having to do with
>> movable_zone. I'm not entirely sure what arches are supposed to do
>> about that.

On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
> config, please?

It's the sparc32 defconfig. Included below for completeness.


-- wli

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc4-mm2
# Thu Jun  7 00:01:24 2007
#
CONFIG_MMU=y
CONFIG_HIGHMEM=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# General machine setup
#
# CONFIG_SMP is not set
CONFIG_SPARC=y
CONFIG_SPARC32=y
CONFIG_SBUS=y
CONFIG_SBUSCHAR=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_SUN_AUXIO=y
CONFIG_SUN_IO=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_EMULATED_CMPXCHG=y
CONFIG_SUN_PM=y
# CONFIG_SUN4 is not set
CONFIG_PCI=y
# CONFIG_ARCH_SUPPORTS_MSI is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_SUN_OPENPROMFS=m
# CONFIG_SPARC_LED is not set
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=m
CONFIG_SUNOS_EMUL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_NET_KEY=m
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_XFRM_TUNNEL=y
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_D

Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Wed, 6 Jun 2007 23:42:31 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> create-the-zone_movable-zone.patch breaks the build on sparc32.

On Wed, Jun 06, 2007 at 11:51:31PM -0700, Andrew Morton wrote:
> Nope, there are no instances of GFP_HIGH_MOVABLE in the tree once all
> patches are applied.  You hit a bad bisection point: between
> create-the-zone_movable-zone.patch and
> create-the-zone_movable-zone-fix.patch.

The fully-applied tree fails with a link error having to do with
movable_zone. I'm not entirely sure what arches are supposed to do
about that.


-- wli


Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Wed, Jun 06, 2007 at 10:03:13PM -0700, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc4/2.6.22-rc4-mm2/
> - Basically a bugfixed version of 2.6.22-rc4-mm1.  None of the subsystem
>   trees were repulled, several bad patches were dropped, a few were fixed.

create-the-zone_movable-zone.patch breaks the build on sparc32.


-- wli

$ good=0; bad=`quilt series -v | wc -l`; time while [[ $(( $bad - $good )) -gt 1 ]]; do
    cur=`quilt series -v |egrep -c '(=|\+)'`
    chkpt=$(( ($good + $bad)/2 )); delta=$(( $chkpt - $cur ))
    if [[ $delta -lt 0 ]]; then (quilt pop $(( 0 - $delta )) ) >& /dev/null
    elif [[ $delta -gt 0 ]]; then (quilt push $delta) >& /dev/null
    else true; fi
    cur=$chkpt
    (yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" \
        CC="gcc-sparc-4.1" quiet=1 -j16 defconfig) >& /dev/null
    echo "last known good = $good, first known bad = $bad, trying $chkpt"
    yes "" | make ARCH=sparc CROSS_COMPILE="sparc-linux-" \
        CC="gcc-sparc-4.1" quiet=1 -j16 image modules; s=$?
    if [[ $s -ne 0 ]]; then echo "$chkpt bad"; bad=$chkpt
    else echo "$chkpt good"; good=$chkpt; fi
  done
...
last known good = 641, first known bad = 645, trying 643
scripts/kconfig/conf -s arch/sparc/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config symbol 
'PMAC_APM_EMU' refers to undefined symbol 'APM_EMULATION'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config symbol 
'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config symbol 
'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
sound/soc/sh/Kconfig:6:warning: 'select' used by config symbol 
'SND_SOC_PCM_SH7760' refers to undefined symbol 'SH_DMABRG'
  CHK include/linux/version.h
  CHK include/linux/utsrelease.h
:752:2: warning: #warning syscall setresuid not implemented
:756:2: warning: #warning syscall getresuid not implemented
:776:2: warning: #warning syscall setresgid not implemented
:780:2: warning: #warning syscall getresgid not implemented
  CHK include/linux/compile.h
mm/page_alloc.c: In function 'nr_free_pagecache_pages':
mm/page_alloc.c:1706: error: 'GFP_HIGH_MOVABLE' undeclared (first use in this 
function)
mm/page_alloc.c:1706: error: (Each undeclared identifier is reported only once
mm/page_alloc.c:1706: error: for each function it appears in.)
make[1]: *** [mm/page_alloc.o] Error 1
make[1]: *** Waiting for unfinished jobs
make: *** [mm] Error 2
make: *** Waiting for unfinished jobs
drivers/char/rtc.c:118: warning: 'hpet_rtc_interrupt' defined but not used
make: *** wait: No child processes.  Stop.


Re: 2.6.22-rc4-mm1

2007-06-07 Thread William Lee Irwin III
On Wed, Jun 06, 2007 at 06:09:24PM -0700, Andrew Morton wrote:
> ooh, yes, lockdep_init() really does want to be called before anything
> else.
> So do we take it that this code hasn't been tested with lockdep?  Please
> don't forget that step - lockdep finds some pretty nasty bugs sometimes.
> This?

I found this patch when I woke and it got things booting with the full
-mm stack. Now to fix the sparc32 build and see if it boots.


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> !CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
>> by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.

On Thu, Jun 07, 2007 at 08:38:22PM -0700, H. Peter Anvin wrote:
> But what's the point?
> If you're going to divorce these, at least do it in a way that makes
> sense, specifically the two independent variables are PAE and HIGHMEM.
> PAE and !HIGHMEM does make (some amount of) sense, due to no kmap overhead.

Beg your pardon? Are you reading the patch description correctly?


-- wli


Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> Beg your pardon? Are you reading the patch description correctly?

On Thu, Jun 07, 2007 at 08:44:09PM -0700, H. Peter Anvin wrote:
> I mean, with your patch CONFIG_HIGHMEM4G versus CONFIG_HIGHMEM64G really
> don't make sense as separate selections anymore.

I thought about sweeping those up, but defaulted to minimal diffsize.
I can sweep them up given more votes in favor of doing so.


-- wli


Re: 2.6.22-rc4-mm1

2007-06-07 Thread William Lee Irwin III
On Wed, Jun 06, 2007 at 06:09:24PM -0700, Andrew Morton wrote:
> ooh, yes, lockdep_init() really does want to be called before anything
> else.
> So do we take it that this code hasn't been tested with lockdep?  Please
> don't forget that step - lockdep finds some pretty nasty bugs sometimes.
> This?

I found this patch when I woke and it got things booting with the full
-mm stack. Now to fix the sparc32 build and see if it boots.


-- wli


Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Wed, Jun 06, 2007 at 10:03:13PM -0700, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc4/2.6.22-rc4-mm2/
> - Basically a bugfixed version of 2.6.22-rc4-mm1.  None of the subsystem
>   trees were repulled, several bad patches were dropped, a few were fixed.

create-the-zone_movable-zone.patch breaks the build on sparc32.


-- wli

$ good=0; bad=`quilt series -v | wc -l`
$ time while [[ $(( $bad - $good )) -gt 1 ]]; do
      cur=`quilt series -v | egrep -c '(=|\+)'`
      chkpt=$(( ($good + $bad)/2 ))
      delta=$(( $chkpt - $cur ))
      if [[ $delta -lt 0 ]]; then
          (quilt pop $(( 0 - $delta )) ) > /dev/null
      elif [[ $delta -gt 0 ]]; then
          (quilt push $delta) > /dev/null
      else
          true
      fi
      cur=$chkpt
      (yes "" | make ARCH=sparc CROSS_COMPILE=sparc-linux- CC=gcc-sparc-4.1 \
          quiet=1 -j16 defconfig) > /dev/null
      echo "last known good = $good, first known bad = $bad, trying $chkpt"
      yes "" | make ARCH=sparc CROSS_COMPILE=sparc-linux- CC=gcc-sparc-4.1 \
          quiet=1 -j16 image modules
      s=$?
      if [[ $s -ne 0 ]]; then
          echo "$chkpt bad"; bad=$chkpt
      else
          echo "$chkpt good"; good=$chkpt
      fi
  done
...
last known good = 641, first known bad = 645, trying 643
scripts/kconfig/conf -s arch/sparc/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config symbol 
'PMAC_APM_EMU' refers to undefined symbol 'APM_EMULATION'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config symbol 
'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config symbol 
'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
sound/soc/sh/Kconfig:6:warning: 'select' used by config symbol 
'SND_SOC_PCM_SH7760' refers to undefined symbol 'SH_DMABRG'
  CHK include/linux/version.h
  CHK include/linux/utsrelease.h
stdin:752:2: warning: #warning syscall setresuid not implemented
stdin:756:2: warning: #warning syscall getresuid not implemented
stdin:776:2: warning: #warning syscall setresgid not implemented
stdin:780:2: warning: #warning syscall getresgid not implemented
  CHK include/linux/compile.h
mm/page_alloc.c: In function 'nr_free_pagecache_pages':
mm/page_alloc.c:1706: error: 'GFP_HIGH_MOVABLE' undeclared (first use in this 
function)
mm/page_alloc.c:1706: error: (Each undeclared identifier is reported only once
mm/page_alloc.c:1706: error: for each function it appears in.)
make[1]: *** [mm/page_alloc.o] Error 1
make[1]: *** Waiting for unfinished jobs
make: *** [mm] Error 2
make: *** Waiting for unfinished jobs
drivers/char/rtc.c:118: warning: 'hpet_rtc_interrupt' defined but not used
make: *** wait: No child processes.  Stop.
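
The loop in the transcript above is a plain binary search over positions in
the quilt series. A minimal sketch of the same search in C follows, assuming
a caller-supplied predicate; bisect_series() and builds_before() are
hypothetical names for illustration, with the predicate standing in for
"push/pop to this position and run make".

```c
#include <stddef.h>

/*
 * Binary search over positions in a patch series.  Position `good` is
 * known to build and position `bad` is known to fail; each step tests
 * the midpoint and narrows the window, as the shell loop does.
 * Returns the first position known to be bad.
 */
static size_t bisect_series(size_t good, size_t bad,
			    int (*builds_ok)(size_t pos, void *ctx), void *ctx)
{
	while (bad - good > 1) {
		size_t chkpt = good + (bad - good) / 2;

		if (builds_ok(chkpt, ctx))
			good = chkpt;
		else
			bad = chkpt;
	}
	return bad;
}

/* Example predicate: every position before *ctx builds, the rest fail. */
static int builds_before(size_t pos, void *ctx)
{
	return pos < *(const size_t *)ctx;
}
```

The search only gives a meaningful answer if the series goes from building
to failing exactly once; a patch that breaks the build and a later patch that
repairs it violate that assumption.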


Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Wed, 6 Jun 2007 23:42:31 -0700 William Lee Irwin III <[EMAIL PROTECTED]> 
wrote:
>> create-the-zone_movable-zone.patch breaks the build on sparc32.

On Wed, Jun 06, 2007 at 11:51:31PM -0700, Andrew Morton wrote:
> Nope, there are no instances of GFP_HIGH_MOVABLE in the tree once all
> patches are applied.  You hit a bad bisection point: between
> create-the-zone_movable-zone.patch and
> create-the-zone_movable-zone-fix.patch.

The fully-applied tree fails with a link error having to do with
movable_zone. I'm not entirely sure what arches are supposed to do
about that.


-- wli


Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Wed, 6 Jun 2007 23:55:44 -0700 William Lee Irwin III <[EMAIL PROTECTED]> 
wrote:
>> The fully-applied tree fails with a link error having to do with
>> movable_zone. I'm not entirely sure what arches are supposed to do
>> about that.

On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
> config, please?

It's the sparc32 defconfig. Included below for completeness.


-- wli

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc4-mm2
# Thu Jun  7 00:01:24 2007
#
CONFIG_MMU=y
CONFIG_HIGHMEM=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# General machine setup
#
# CONFIG_SMP is not set
CONFIG_SPARC=y
CONFIG_SPARC32=y
CONFIG_SBUS=y
CONFIG_SBUSCHAR=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_SUN_AUXIO=y
CONFIG_SUN_IO=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_EMULATED_CMPXCHG=y
CONFIG_SUN_PM=y
# CONFIG_SUN4 is not set
CONFIG_PCI=y
# CONFIG_ARCH_SUPPORTS_MSI is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_SUN_OPENPROMFS=m
# CONFIG_SPARC_LED is not set
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=m
CONFIG_SUNOS_EMUL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_NET_KEY=m
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_XFRM_TUNNEL=y
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
CONFIG_SCTP_DBG_OBJCNT=y
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set

Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Thu, Jun 07, 2007 at 12:01:25AM -0700, Andrew Morton wrote:
>> config, please?

On Thu, Jun 07, 2007 at 12:04:07AM -0700, William Lee Irwin III wrote:
> It's the sparc32 defconfig. Included below for completeness.

The error output looks like the following.


-- wli

$ quilt top
create-the-zone_movable-zone-fix.patch
$ (yes "" | make ARCH=sparc CROSS_COMPILE=sparc-linux- CC=gcc-sparc-4.1 \
      quiet=1 -j16 defconfig) > /dev/null; \
  yes "" | make ARCH=sparc CROSS_COMPILE=sparc-linux- CC=gcc-sparc-4.1 \
      quiet=1 -j16 image modules
scripts/kconfig/conf -s arch/sparc/Kconfig
drivers/macintosh/Kconfig:116:warning: 'select' used by config symbol 
'PMAC_APM_EMU' refers to undefined symbol 'APM_EMULATION'
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config symbol 
'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config symbol 
'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
sound/soc/sh/Kconfig:6:warning: 'select' used by config symbol 
'SND_SOC_PCM_SH7760' refers to undefined symbol 'SH_DMABRG'
  CHK include/linux/version.h
  UPD include/linux/version.h
  CHK include/linux/utsrelease.h
  UPD include/linux/utsrelease.h
  SYMLINK include/asm -> include/asm-sparc
stdin:752:2: warning: #warning syscall setresuid not implemented
stdin:756:2: warning: #warning syscall getresuid not implemented
stdin:776:2: warning: #warning syscall setresgid not implemented
stdin:780:2: warning: #warning syscall getresgid not implemented
  CHK include/linux/compile.h
  UPD include/linux/compile.h
ipc/msg.c: In function 'sys_msgctl':
ipc/msg.c:390: warning: 'setbuf.qbytes' may be used uninitialized in this 
function
ipc/msg.c:390: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/msg.c:390: warning: 'setbuf.mode' may be used uninitialized in this function
ipc/sem.c: In function 'sys_semctl':
ipc/sem.c:861: warning: 'setbuf.uid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.gid' may be used uninitialized in this function
ipc/sem.c:861: warning: 'setbuf.mode' may be used uninitialized in this function
mm/vmalloc.c: In function 'unmap_kernel_range':
mm/vmalloc.c:75: warning: unused variable 'start'
drivers/char/rtc.c:118: warning: 'hpet_rtc_interrupt' defined but not used
kernel/time/ntp.c: In function 'do_adjtimex':
kernel/time/ntp.c:309: warning: comparison of distinct pointer types lacks a 
cast
kernel/time/ntp.c:312: warning: comparison of distinct pointer types lacks a 
cast
drivers/pci/search.c: In function 'pci_find_slot':
drivers/pci/search.c:99: warning: 'pci_find_device' is deprecated (declared at 
include/linux/pci.h:478)
drivers/pci/search.c: At top level:
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at 
drivers/pci/search.c:241)
drivers/pci/search.c:434: warning: 'pci_find_device' is deprecated (declared at 
drivers/pci/search.c:241)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at 
drivers/pci/search.c:96)
drivers/pci/search.c:435: warning: 'pci_find_slot' is deprecated (declared at 
drivers/pci/search.c:96)
drivers/pci/syscall.c: In function 'sys_pciconfig_read':
drivers/pci/syscall.c:22: warning: 'dev' may be used uninitialized in this 
function
fs/partitions/check.c: In function 'add_partition':
fs/partitions/check.c:392: warning: ignoring return value of 'kobject_add', 
declared with attribute warn_unused_result
fs/partitions/check.c:395: warning: ignoring return value of 
'sysfs_create_link', declared with attribute warn_unused_result
fs/partitions/check.c:402: warning: ignoring return value of 
'sysfs_create_file', declared with attribute warn_unused_result
  CHK include/linux/compile.h
  UPD include/linux/compile.h
WARNING: arch/sparc/kernel/head.o(.text+0x9040): Section mismatch: reference to 
.init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: arch/sparc/kernel/head.o(.text+0x9280): Section mismatch: reference to 
.init.text:execute_in_high_mem (after 'go_to_highmem')
WARNING: arch/sparc/kernel/head.o(.text+0x9284): Section mismatch: reference to 
.init.text:execute_in_high_mem (after 'go_to_highmem')
  Building modules, stage 2.
WARNING: vmlinux(.text+0x9040): Section mismatch: reference to 
.init.text:no_sun4u_here (between 'current_pc' and 'already_mapped')
WARNING: vmlinux(.text+0x9280): Section mismatch: reference to 
.init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x9284): Section mismatch: reference to 
.init.text:execute_in_high_mem (between 'go_to_highmem' and 'init_thread_union')
WARNING: vmlinux(.text+0x1dfb38): Section mismatch: reference to 
.init.text:kernel_init (between 'rest_init

Re: 2.6.22-rc4-mm2

2007-06-07 Thread William Lee Irwin III
On Thu, Jun 07, 2007 at 12:19:22AM -0700, Andrew Morton wrote:
> hm, OK, this seems to work:
[...]
> -#ifdef CONFIG_HIGHMEM
> +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
>  	return movable_zone == ZONE_HIGHMEM;
>  #else
>  	return 0;
> _
> (the first ifdef is just there to trip things at compile time rather than
> link time)

I guess it's not the arch's fault after all. I probably would've
conditionally out-of-lined the thing so as never to expose movable_zone
but this will do just fine.
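
To spell out why the quoted hunk cures the link error: movable_zone is only
referenced when both options are set, so a configuration that never defines
it never emits the reference. A self-contained sketch, assuming a stand-in
enum and a hypothetical function name (the real declarations are in the
kernel headers and are not reproduced here):

```c
/* Illustrative stand-ins; the real definitions live in the kernel headers. */
enum zone_type_sketch { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/* Declared but deliberately never defined in this file. */
extern int movable_zone;

static int zone_movable_is_highmem_sketch(void)
{
#if defined(CONFIG_HIGHMEM) && defined(CONFIG_ARCH_POPULATES_NODE_MAP)
	return movable_zone == ZONE_HIGHMEM;
#else
	/* No reference to movable_zone is emitted, so there is nothing to link. */
	return 0;
#endif
}
```

Built without either macro defined, this links even though movable_zone has
no definition anywhere. Out-of-lining the helper into the file that defines
the variable would hide movable_zone from callers altogether.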


-- wli


Re: why does the macro ZERO_PAGE take an argument?

2007-06-07 Thread William Lee Irwin III
Robert P. J. Day wrote:
>> although it's not clear where in the source tree are the invocations
>> that would actually make a difference to a MIPS system, which is why
>> i've CC'ed ralf on this.  i'm sure he can clear this up. :-)

On Thu, Jun 07, 2007 at 10:32:29AM -0700, H. Peter Anvin wrote:
> x86 could also benefit from coloured zeropages.  In fact, I thought it
> already had them (K8 wants as many as 8.)

How would one demonstrate the beneficial effect of such?
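
For context on the subject line: with coloured zero pages, the address
argument is what selects which of several zeroed pages gets mapped. A minimal
sketch of that selection, assuming 4KB pages and 8 colours (the K8 figure
quoted above); the macro and function names are made up for illustration and
are not the kernel's.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumed 4KB pages */
#define SKETCH_NR_COLOURS	8	/* assumed; must be a power of two */

/*
 * Pick which zero page backs a given virtual address: the low bits of
 * the virtual page number select the colour, so addresses one page
 * apart use different zero pages.
 */
static unsigned int zero_page_colour(uintptr_t vaddr)
{
	return (unsigned int)((vaddr >> SKETCH_PAGE_SHIFT) &
			      (SKETCH_NR_COLOURS - 1));
}
```

With a single zero page the argument is simply ignored, which is presumably
why it looks redundant on architectures that do not colour.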


-- wli


Re: 2.6.22-rc4-mm1

2007-06-06 Thread William Lee Irwin III
On Wed, 6 Jun 2007 09:30:53 -0700 William Lee Irwin III <[EMAIL PROTECTED]> 
wrote:
>> Something brings down i386/qemu before even earlyprintk can handle.
>> Bisection has narrowed it down to patch 1140 after everything got
>> renumbered by peterz' fix for mm-variable-length-argument-support.patch,
>> namely containersv10-make-cpusets-a-client-of-containers.patch

On Wed, Jun 06, 2007 at 11:13:15AM -0700, Andrew Morton wrote:
> erk.  A step-by-step how-to-make-this-happen might help if poss, please.

(1) build for i386 with my .config
(2) attempt to boot in qemu's i386 system simulator

I'm not seeing the sort of nondeterminism Andy Whitcroft is. It breaks
every time when I try this.


-- wli


Re: 2.6.22-rc4-mm1

2007-06-06 Thread William Lee Irwin III
On Wed, Jun 06, 2007 at 05:26:49PM +0100, Mel Gorman wrote:
> I do not believe this is Nick's problem. I encountered the same issue and
> the bisect ended up here;
> # BISECT HERE
> mm-variable-length-argument-support.patch
> mm-variable-length-argument-support-fix.patch
> # BISECT BAD
> Reverting those two patches boots ok on my standalone x86 laptop.
> Patch authors cc'd. I have not read the patches yet to see what might
> be the problem.

I found this a while ago and peterz already has a tentative fix for it at
http://programming.kicks-ass.net/kernel-patches/max_arg_pages/move_anon_vma.patch
I'm sure he himself will chime in with more/better code when he returns.


-- wli


Re: 2.6.22-rc4-mm1

2007-06-06 Thread William Lee Irwin III
On Wed, Jun 06, 2007 at 02:07:37AM -0700, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc4/2.6.22-rc4-mm1/
> - Somebody broke it on my powerpc G5, but I didn't have time to do yet
>   another bisection yet.
> - There's a lengthy patch series here from Nick which attempts to address
>   the longstanding pagefault-vs-buffered-write deadlock.
>   A great shower of filesystems were broken and have been disabled with
>   CONFIG_BROKEN.  This includes reiser4.
> - Complex patches which eliminate the kernel's fixed size limit on the
>   command-line length.  These break nommu builds.

Someone remind me what the pagefault vs. buffered write deadlock is.

Something brings down i386/qemu before even earlyprintk can handle.

Bisection has narrowed it down to patch 1140 after everything got
renumbered by peterz' fix for mm-variable-length-argument-support.patch,
namely containersv10-make-cpusets-a-client-of-containers.patch


-- wli
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc4-mm1
# Wed Jun  6 09:08:11 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SWAP_PREFETCH=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=15
CONFIG_CONTAINERS=y
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_CONTAINER_CPUACCT=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROC_SMAPS=y
CONFIG_PROC_CLEAR_REFS=y
CONFIG_PROC_PAGEMAP=y
CONFIG_PROC_KPAGEMAP=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_LBD=y
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_SMP=y
# CONFIG_X86_PC is not set
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
CONFIG_X86_GENERICARCH=y
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
CONFIG_X86_CYCLONE_TIMER=y
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
CONFIG_M686=y
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
CONFIG_X86_GENERIC=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_XADD=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y

CONFIG_X86_GOOD_APIC=y


Re: libata & no PCI: dma_[un]map_single undefined

2007-06-04 Thread William Lee Irwin III
From: Alan Cox <[EMAIL PROTECTED]>
Date: Mon, 4 Jun 2007 14:30:05 +0100
>> There are PCMCIA controllers and PCI/PCMCIA/Cardbus adapters for the
>> Sparc platform I thought ?

On Mon, Jun 04, 2007 at 02:22:43PM -0700, David Miller wrote:
> The 32-bit sparc port has some but those PCMCIA controllers aren't
> going to be supported in the foreseeable future, you have to abstract
> out all the inb/outb etc. operations to go through the pcmcia
> controller driver for one thing.
> Secondarily, sparc32 lacks an active maintainer and it's
> been like this for several years, the only things getting
> worked on therefore are basic functionality and the most
> important bug fixes.

I don't foresee my ever dealing with those PCMCIA controllers. If by
some miracle I manage to get any work done on basic functionality I'll
consider that having won.


-- wli


Re: SLUB: Return ZERO_SIZE_PTR for kmalloc(0)

2007-06-04 Thread William Lee Irwin III
On Mon, Jun 04, 2007 at 10:50:41AM -0700, Linus Torvalds wrote:
> The exception is if you use the memory allocator as a "ID allocator", but 
> quite frankly, if you use a size of zero, it's your own damn problem. 
> Insane code is not an argument for insane behaviour.
> If people can't be bothered to create a "random ID generator" themselves, 
> they had damn well better use "kmalloc(1)" rather than "kmalloc(0)" to get 
> a unique cookie. Asking the allocator to do something idiotic because some 
> idiot thinks a memory allocator is a cookie allocator is just crazy.

It's not such a great idea in general. Maybe it's a dumb device to cut
down on lines of code for merging or some such.


On Mon, Jun 04, 2007 at 10:50:41AM -0700, Linus Torvalds wrote:
> I can understand that things like user-level libraries have to take crazy 
> people into account, but the kernel internal libraries definitely do not.
> (Right now we warn once for zero-sized allocations anyway, and all the 
> cases we've found so far are either bugs that would have been found with 
> ZERO_ALLOC_PTR or would have been perfectly fine with it, so I don't think 
> anybody really _is_ that insane in the kernel)

There are always drivers for that, but I doubt any were sufficiently
creative to pick up on this. At least I've not seen any.
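
For illustration only, here is a minimal userspace sketch of the idea under
discussion: a zero-sized request gets a shared non-NULL sentinel rather than
a distinct object, so it cannot serve as a unique cookie. The names
sketch_alloc(), sketch_free() and SKETCH_ZERO_SIZE_PTR are hypothetical, the
sentinel value is an assumption, and malloc() merely stands in for the slab
allocator; this is not the SLUB code.

```c
#include <stdlib.h>

/*
 * Illustrative sentinel (an assumption, not taken from the patch):
 * non-NULL, but never the address of a real object, so any dereference
 * of it should fault rather than silently succeed.
 */
#define SKETCH_ZERO_SIZE_PTR ((void *)16)

/* Hypothetical allocator wrapper: size 0 yields the shared sentinel. */
static void *sketch_alloc(size_t size)
{
	if (size == 0)
		return SKETCH_ZERO_SIZE_PTR;
	return malloc(size);
}

/* Freeing NULL or the sentinel is a no-op; anything else goes to free(). */
static void sketch_free(void *p)
{
	if (p == NULL || p == SKETCH_ZERO_SIZE_PTR)
		return;
	free(p);
}
```

Two sketch_alloc(0) calls return the same pointer, which is why a caller
wanting a unique cookie would have to ask for at least one byte.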


-- wli


Re: SLUB: Return ZERO_SIZE_PTR for kmalloc(0)

2007-06-04 Thread William Lee Irwin III
On Fri, 1 Jun 2007 21:45:15 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:
>> That would have to occur with objects that are repeatedly allocated and 
>> then linked together etc. Linking typically requires a listhead so it's 
>> typically difficult to do zero length objects.

On Fri, Jun 01, 2007 at 09:54:27PM -0700, Andrew Morton wrote:
> Well I can't immediately think of a scenario in which it's likely to occur,
> but we're in the position of trying to prove a negative.
> Poke Bill Irwin - he'll think of something ;)

I've yet to see anyone get quite that creative, but I've not gone fishing
for instances of this. I can think of plenty of places where one could do
something like this in practice, but don't care to give anyone any ideas.


-- wli


Re: [patch 9/9] Scheduler profiling - Use conditional calls

2007-05-31 Thread William Lee Irwin III
On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
>>> +   if (prof_on)
>>> +   BUG_ON(cond_call_arm("profile_on"));

* William Lee Irwin III ([EMAIL PROTECTED]) wrote:
>> What's the point of this BUG_ON()? The condition is a priori impossible.

On Thu, May 31, 2007 at 05:12:58PM -0400, Mathieu Desnoyers wrote:
> Not impossible: hash_add_cond_call() can return -ENOMEM if kmalloc lacks
> memory.

Shouldn't it just propagate the errors like anything else instead of
going BUG(), then? One can easily live without profiling if the profile
buffers should fail to be allocated e.g. due to memory fragmentation.

These things all have to handle errors for hotplugging anyway AIUI.
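
As a rough userspace sketch of what propagating the error could look like
(not the actual patch): fake_cond_call_arm() is a hypothetical stand-in for
cond_call_arm(), and the simulate_enomem flag exists only to exercise the
failure path.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for cond_call_arm(): returns 0 on success or a
 * negative errno. The flag only simulates an allocation failure.
 */
static int fake_cond_call_arm(int simulate_enomem)
{
	return simulate_enomem ? -ENOMEM : 0;
}

/*
 * Hand the failure back to the caller instead of going BUG(): the system
 * keeps running, just without profiling.
 */
static int profile_enable(int prof_on, int simulate_enomem)
{
	int err;

	if (!prof_on)
		return 0;

	err = fake_cond_call_arm(simulate_enomem);
	if (err) {
		fprintf(stderr, "profiling disabled: error %d\n", err);
		return err;
	}
	return 0;
}
```

The caller then decides whether a missing profiler is fatal; for boot-time
setup it usually is not.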


-- wli


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-31 Thread William Lee Irwin III
On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
>> Its ->wait_runtime will drop less significantly, which lets it be
>> inserted in rb-tree much to the left of those 1000 tasks (and which
>> indirectly lets it gain back its fair share during subsequent
>> schedule cycles).
>> Hmm ..is that the theory?

On Thu, May 31, 2007 at 02:26:00PM +0530, Srivatsa Vaddagiri wrote:
> My only concern is the time needed to converge to this fair
> distribution, especially in face of fluctuating workloads. For ex: a
> container who does a fork bomb can have a very adverse impact on
> other container's fair share under this scheme compared to other
> schemes which dedicate separate rb-trees for different containers
> (and which also support two level hierarchical scheduling inside the
> core scheduler).
> I am inclined to have the core scheduler support at least two levels
> of hierarchy (to better isolate each container) and resort to the
> flattening trick for higher levels.

Yes, the larger number of schedulable entities and hence slower
convergence to groupwise weightings is a disadvantage of the flattening.
A hybrid scheme seems reasonable enough. Ideally one would chop the
hierarchy in pieces so that n levels of hierarchy become k levels of n/k
weight-flattened hierarchies for this sort of attack to be most effective
(at least assuming similar branching factors at all levels of hierarchy
and sufficient depth to the hierarchy to make it meaningful) but this is
awkward to do. Peeling off the outermost container or whichever level is
deemed most important in terms of accuracy of aggregate enforcement as
a hierarchical scheduler is a practical compromise.

Hybrid schemes will still incur the difficulties of hierarchical
scheduling, but they're by no means insurmountable. Sadly, only
complete flattening yields the simplifications that make task group
weighting enforcement orthogonal to load balancing and the like. The
scheme I described for global nice number behavior is also not readily
adaptable to hybrid schemes.


-- wli


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-31 Thread William Lee Irwin III
On Wed, May 30, 2007 at 11:36:47PM -0700, William Lee Irwin III wrote:
>> Temporarily, yes. All this only works when averaged out.

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So essentially when we calculate delta_mine component for each of those
> 1000 tasks, we will find that it has executed for 1 tick (4 ms say) but 
> its fair share was very very low.
>   fair_share = delta_exec * p->load_weight / total_weight
> If p->load_weight has been calculated after factoring in hierarchy (as
> you outlined in a previous mail), then p->load_weight of those 1000 tasks
> will be far less compared to the p->load_weight of one task belonging to
> other user, correct? Just to make sure I get all this correct:

You've got it all correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
>   User U1 has tasks T0 - T999
>   User U2 has task T1000
> assuming each task's weight is 1 and each user's weight is 1 then:
>   WT0 = (WU1 / (WU1 + WU2)) * (WT0 / (WT0 + WT1 + ... + WT999))
>   = (1 / (1 + 1)) * (1 / 1000)
>   = 1/2000
>   = 0.0005
>   WT1 ..WT999 will be same as WT0
> whereas, weight of T1000 will be:
>   WT1000  = (WU2 / (WU1 + WU2)) * (WT1000 / WT1000)
>   = (1 / (1 + 1)) * (1/1)
>   = 0.5
> ?

Yes, these calculations are correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So when T0 (or T1 ..T999) executes for 1 tick (4ms), their fair share would
> be:
>   T0's fair_share (delta_mine)
>   = 4 ms * 0.0005 / (0.0005 * 1000 + 0.5)
>   = 4 ms * 0.0005 / 1
>   = 0.002 ms (2000 ns)
> This would cause T0's ->wait_runtime to go negative sharply, causing it to be
> inserted back in rb-tree well ahead in future. One change I can foresee
> in CFS is with regard to limit_wait_runtime() ..We will have to change
> its default limit, at least when group fairness thingy is enabled.
> Compared to this when T1000 executes for 1 tick, its fair share would be
> calculated as:
>   T1000's fair_share (delta_mine)
>   = 4 ms * 0.5 / (0.0005 * 1000 + 0.5)
>   = 4 ms * 0.5 / 1
>   = 2 ms (2000000 ns)
> Its ->wait_runtime will drop less significantly, which lets it be
> inserted in rb-tree much to the left of those 1000 tasks (and which indirectly
> lets it gain back its fair share during subsequent schedule cycles).

This analysis is again entirely correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..is that the theory?
> Ingo, do you have any comments on this approach?
> /me is tempted to try this all out.

Yes, this is the theory behind using task weights to flatten the task
group hierarchies. My prior post assumed all this and described a method
to make nice numbers behave as expected in the global context atop it.
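
The arithmetic quoted above can be reproduced with a small sketch. The
helper names flat_weight() and fair_share_ns() are hypothetical and floating
point is used only for readability; this is not how CFS stores or computes
load weights.

```c
/*
 * Flattened weight of one task in a two-level user/task hierarchy:
 * the user's share among all users times the task's share within
 * its user.
 */
static double flat_weight(double user_weight, double total_user_weight,
			  double task_weight, double total_task_weight_in_user)
{
	return (user_weight / total_user_weight) *
	       (task_weight / total_task_weight_in_user);
}

/*
 * Fair share of an interval of delta_exec_ns for a task with flattened
 * weight w, given the sum of flattened weights of all runnable tasks.
 */
static double fair_share_ns(double delta_exec_ns, double w,
			    double total_flat_weight)
{
	return delta_exec_ns * w / total_flat_weight;
}
```

With two users of weight 1, flat_weight(1, 2, 1, 1000) is 0.0005 for each of
the 1000 tasks and flat_weight(1, 2, 1, 1) is 0.5 for the lone task; the
flattened weights sum to 1, so over a 4 ms tick the shares come to roughly
2000 ns and 2 ms respectively.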


-- wli


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-31 Thread William Lee Irwin III
On Wed, May 30, 2007 at 09:09:26PM -0700, William Lee Irwin III wrote:
>> It's not all that tricky. 

On Thu, May 31, 2007 at 11:18:28AM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..the fact that each task runs for a minimum of 1 tick seems to
> complicate the matters to me (when doing group fairness given a single
> level hierarchy). A user with 1000 (or more) tasks can be unduly
> advantaged compared to another user with just 1 (or fewer) task
> because of this?

Temporarily, yes. All this only works when averaged out.  The basic
idea is that you want a constant upper bound on the difference between
the CPU time a task receives and the CPU time it was intended to get.
This discretization is one of the larger sources of the "error" in the
CPU time granted. The constant upper bound usually only applies to the
largest difference for any task. When absolute values of differences
are summed across tasks the aggregate will be O(tasks) because there's
something almost like a constant per-task lower bound a la Heisenberg.
It would have to get more exact the more tasks there are on the system
for that to work, and something of the opposite actually holds.

It might be appropriate for the scheduler to dynamically adjust a
periodic timer's period or to set up one-shot timers at involuntary
preemption times in order to achieve more precise fairness in this
sort of situation. In the case of few preemption points such one-shot
code or low periodicity code would also save on taking interrupts that
would otherwise manifest as overhead.

In short, a user with many tasks can reap a temporary advantage
relative to users with fewer tasks because of this, but over time,
longer-running tasks will receive the CPU time intended to within
some constant upper bound, provided other things aren't broken.
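
A toy simulation may make the bound concrete. It assumes equal-weight tasks,
a one-tick minimum run, and a scheduler that always picks the task that has
run least; simulate_ticks() and struct lag_stats are hypothetical names, and
none of this is CFS code.

```c
#include <stdlib.h>

struct lag_stats {
	double max_abs; /* largest |ideal - received| over all tasks, in ticks */
	double sum_abs; /* sum of |ideal - received| over all tasks, in ticks */
};

/*
 * Run nticks ticks over ntasks equal-weight tasks and report how far each
 * task's received CPU time is from its ideal share of nticks / ntasks.
 * Returns 0 on success, -1 on bad input or allocation failure.
 */
static int simulate_ticks(size_t ntasks, size_t nticks, struct lag_stats *out)
{
	size_t *received;
	size_t t, i;

	if (ntasks == 0 || out == NULL)
		return -1;
	received = calloc(ntasks, sizeof(*received));
	if (!received)
		return -1;

	for (t = 0; t < nticks; t++) {
		/* With equal weights, the largest lag is the least-run task. */
		size_t pick = 0;

		for (i = 1; i < ntasks; i++)
			if (received[i] < received[pick])
				pick = i;
		received[pick]++; /* nothing runs for less than one tick */
	}

	out->max_abs = 0.0;
	out->sum_abs = 0.0;
	for (i = 0; i < ntasks; i++) {
		double ideal = (double)nticks / (double)ntasks;
		double lag = ideal - (double)received[i];

		if (lag < 0)
			lag = -lag;
		if (lag > out->max_abs)
			out->max_abs = lag;
		out->sum_abs += lag;
	}
	free(received);
	return 0;
}
```

For 1000 tasks after 500 ticks, every task is off by half a tick, so the
per-task error stays under one tick while the summed error is 500 ticks,
growing with the number of tasks.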


-- wli

